Music sound synthesis using machine learning: Towards a perceptually relevant control space

Fanny ROCHE PhD Defense 29 September 2020

Supervisor: Laurent GIRIN
Co-supervisors: Thomas HUEBER, Maëva GARNIER, Samuel LIMIER

Fanny ROCHE - PhD Defense - 29 September 2020

Context and Objectives


Music sound synthesis: four families of approaches

Abstract algorithms → e.g. FM, waveshaping, ...
+ rich sounds − complex parameters

Processed recordings → e.g. sampling, wavetable, granular, ...
+ computationally efficient − very memory-consuming

Spectral modeling → e.g. additive, subtractive, ...
+ close to human sound perception − numerous & very specific parameters

Physical modeling → e.g. solving wave equations, modal synthesis, ...
+ physically meaningful controls − very specific sounds


Thesis project: develop new machine learning methods to tackle these issues and obtain:

◦ perceptually meaningful control parameters
◦ independent control parameters
◦ accurate sound modeling

Challenges / Research questions

1. Define verbal descriptors adapted to synthetic sounds
2. Find a method suited to extracting a high-level representation space & generating high-quality sounds
3. Obtain perceptually meaningful control parameters for the synthesis

Content

1 Perceptual characterization of synthetic timbre (State-of-the-art; Free verbalization perceptual test; Perceptual scaling test; Conclusion)

2 Unsupervised representation learning (Methodology; Comparative study; Conclusion)

3 Towards weak supervision using timbre perception (Methodology; Experiments; Perceptual evaluation; Conclusion)

4 Conclusion and perspectives

Perceptual characterization of synthetic timbre


State-of-the-art

Ambiguous definition of timbre → multidimensional perceptual attribute

Timbre perception approaches:
◦ Multidimensional scaling (MDS) studies [Grey 1977; Iverson and Krumhansl 1993; McAdams et al. 1995; Lakatos 2000]
◦ Qualitative description of timbre:
  Free verbalization [Faure 2000; Traube 2004; Garnier et al. 2007; Cance and Dubois 2015]
  Free categorization [Guyot 1996; Gaillard 2000; Bensa et al. 2004; Ehrette 2004]
  Semantic differential (SD) method [Faure 2000; Ehrette 2004; Zacharakis et al. 2014]

⇒ No consensus, BUT agreement on its spectro-temporal shape: temporal dynamics vs. spectral content [Schaeffer 1996; Castellengo 2015]

Context-dependent perceptual dimensions → type of sounds, listeners, language, ...

Free verbalization perceptual test

Objective: collect verbal descriptors that are frequently and transversally used to describe synthesizer sounds in French

Participants: 101 responses

Stimuli: creation of the ARTURIA dataset → 1,233 samples from ARTURIA's software synthesizers; 50 carefully chosen stimuli


Protocol:


Results analysis:
◦ Pre-processing
◦ 784 different terms collected
◦ Semantic clustering based on a co-occurrence matrix
◦ Frequency & transversality analysis


Observations:
◦ Terms commonly used for usual musical instruments (e.g. brillant, chaud, métallique, ...; see for example [Faure 2000; Traube 2004; Cheminée et al. 2005; Garnier et al. 2007; Lavoie 2013]) as well as new terms (e.g. distordu, explosif, rétro-futuriste, robotique, saccadé, spatial, ...)
◦ The 5 most frequent and transversal perceptual categories all relate to the spectral content of the sound


Verbal descriptor selection criteria: prototypicality for the scales; coverage of all timbre dimensions

⇒ These descriptors will serve as semantic labels of the scales for the second perceptual test

Perceptual scaling test

Objectives:
◦ Evaluate the consensus on the 8 highlighted perceptual dimensions
◦ Annotate a subset of samples along these dimensions

Participants: 71 responses

Stimuli:
◦ ARTURIA dataset; selection of 80 carefully chosen samples
◦ Stimuli assignment: training vs. main phase samples


Protocol:


Results analysis:

Intra-subject consensus → evaluate the consistency of each participant; remove "unreliable" listeners

Figure: Intra-subject correlation matrix


Inter-subject consensus → distinguish groups of participants with a shared conception of the dimensions

Figure: Resulting dendrogram of the inter-subject correlation coefficients for the Agressif scale
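This inter-subject analysis (correlation coefficients between participants, then hierarchical clustering into a dendrogram per scale) can be sketched as follows; the function name, the average-linkage choice, and the 0.5 distance threshold are illustrative assumptions, not the exact settings of the thesis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def consensus_clusters(ratings, threshold=0.5):
    # ratings: (n_participants, n_sounds) ratings on one perceptual scale
    corr = np.corrcoef(ratings)                   # inter-subject correlations
    dist = np.clip(1.0 - corr, 0.0, None)         # correlation distance
    iu = np.triu_indices_from(dist, k=1)          # condensed form for scipy
    Z = linkage(dist[iu], method="average")       # dendrogram structure
    return fcluster(Z, t=threshold, criterion="distance")

# toy example: two groups of participants with opposite conceptions of the scale
p = np.array([1.0, 2.0, 3.0, 4.0])
labels = consensus_clusters(np.vstack([p, p + 0.1, -p, -p + 0.1]))
```

Cutting the dendrogram at a distance threshold yields the groups of participants with a shared conception, from which the largest cluster is retained.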


Observations:
◦ No significant differences between groups of participants
◦ Discrepancy between scales (métallique, chaud, soufflé, percussif, qui vibre, qui résonne, qui évolue, agressif): qui vibre → least consensual; agressif → very consensual
◦ Degree of consensus consistent across both analyses → intra- and inter-subject correlations


Final label vectors computation:
◦ Selection of the cluster with the largest number of participants
◦ Scale-wise annotation of every sound using the mean ratings

Conclusion

◦ Identification of the most frequent/transversal terms characterizing synthesizer sounds ⇒ 8 verbal descriptors: métallique (metallic), chaud (warm), soufflé (breathy), percussif (percussive), qui vibre (vibrating), qui résonne (resonating), qui évolue (evolving), agressif (aggressive)
◦ Shared conception of the terms, despite discrepancies between scales
◦ Dual aspect of 2 dimensions ⇒ their use in a commercial synthesizer remains open to question
◦ Annotation of a subset of the ARTURIA dataset with perceptual scores

Unsupervised representation learning


Objective: investigate a well-suited deep learning algorithm to extract, from a dataset of sounds, a high-level representation space with interesting interpolation and extrapolation properties

Questions:
→ Can such a space be extracted automatically from a low-level representation of the signals?
→ Is it suitable for synthesis control, and perceptually relevant?

Related work:
◦ Autoencoder-based models:
  Autoencoders [Sarroff and Casey 2014; Colonel et al. 2017]
  WaveNet autoencoders [Engel et al. 2017]
  Variational autoencoders → speech [Blaauw and Bonada 2016; Hsu et al. 2017; Akuzawa et al. 2018]; music sounds [Esling et al. 2018]
◦ Generative Adversarial Network (GAN)-based models:
  WaveGAN & SpecGAN [Donahue et al. 2019]
  GANSynth [Engel et al. 2019]

Methodology

Analysis-transformation-synthesis methodology

Autoencoder-based models (AE):
◦ AE and deep AE (DAE) [Hinton and Salakhutdinov 2006; Bengio et al. 2007]
◦ Recurrent AE (LSTM-AE) [Hochreiter and Schmidhuber 1997]
◦ Variational AE (VAE) [Kingma and Welling 2014; Rezende et al. 2014]

Baseline: principal component analysis (PCA) [Bishop 2006]


Autoencoder framework (AE, DAE and LSTM-AE)

◦ Encoding: z = f_enc(W_enc x + b_enc)
◦ Decoding: x̂ = f_dec(W_dec z + b_dec)
◦ Training by minimizing the reconstruction error MSE(x̂, x)
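A minimal runnable sketch of this encode/decode/MSE loop (illustrative single-layer version in NumPy with a tanh activation; the dimensions 513 and 32 are arbitrary assumptions, and actual training would update the weights by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_autoencoder(input_dim, latent_dim):
    # randomly initialized single-layer encoder/decoder pair
    W_enc = rng.standard_normal((latent_dim, input_dim)) * 0.1
    b_enc = np.zeros(latent_dim)
    W_dec = rng.standard_normal((input_dim, latent_dim)) * 0.1
    b_dec = np.zeros(input_dim)
    encode = lambda x: np.tanh(W_enc @ x + b_enc)   # z = f_enc(W_enc x + b_enc)
    decode = lambda z: W_dec @ z + b_dec            # x_hat = f_dec(W_dec z + b_dec)
    return encode, decode

encode, decode = make_autoencoder(input_dim=513, latent_dim=32)
x = rng.standard_normal(513)       # e.g. one magnitude-spectrum frame
z = encode(x)                      # latent code
x_hat = decode(z)                  # reconstruction
mse = np.mean((x_hat - x) ** 2)    # reconstruction error to minimize
```

The latent code z is the low-dimensional representation whose dimensions are candidates for synthesis control.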


VAE probabilistic framework

◦ Parametric model: pθ(x, z) = pθ(x|z) pθ(z)
◦ Prior on the latent space: pθ(z) = N(z; 0, I_L)
◦ Probabilistic decoder: pθ(x|z) = N(x; µθ(z), σθ²(z))
◦ Probabilistic encoder: qφ(z|x) = N(z; µ̃φ(x), σ̃φ²(x))
◦ Maximizing the log-likelihood: log pθ(x) = DKL(qφ(z|x) ‖ pθ(z|x)) + L(φ, θ, β, x), with DKL(·) ≥ 0
◦ Equivalently, maximizing the variational lower bound (VLB):
  L(φ, θ, β, x) = E_qφ(z|x)[log pθ(x|z)] − β DKL(qφ(z|x) ‖ pθ(z))
  where the first term measures reconstruction accuracy and the second acts as a regularization
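Both VLB terms have simple closed forms for this Gaussian parametrization; here is a sketch of the (negative) loss, assuming for simplicity a decoder with fixed unit variance, which is a simplification of the model above:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # closed-form DKL( N(mu, diag(exp(log_var))) || N(0, I) )
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def neg_vlb(x, x_hat, mu, log_var, beta=1.0):
    # negative VLB up to an additive constant:
    # -E[log p(x|z)] reduces to 0.5 * ||x - x_hat||^2 for a unit-variance decoder
    recon = 0.5 * np.sum((x - x_hat) ** 2)
    return recon + beta * kl_to_standard_normal(mu, log_var)
```

Increasing β strengthens the pull of the approximate posterior towards the prior, trading reconstruction accuracy for a smoother, more regular latent space.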

Fanny ROCHE - PhD Defense - 29 September 2020 25 / 48 Time-frequency data representation Magnitude spectrogram Phase spectrogram reconstruction → Griffin & Lim algorithm [Griffin and Lim 1984] → Linear phase unwrapping [Magron 2016] Tested models: PCA (baseline) AE and DAE (several architectures) VAE (different β values) LSTM-AE Metrics: Root mean squared error (RMSE) PEMO-Q scores [Huber and Kollmeier 2006]

Unsupervised representation learning Comparative study

Datasets:

NSynth dataset (subset of 10, 000 samples with fs = 16 kHz) [Engel et al. 2017] ARTURIA dataset (1, 233 samples with fs = 44.1 kHz)

Fanny ROCHE - PhD Defense - 29 September 2020 26 / 48 Tested models: PCA (baseline) AE and DAE (several architectures) VAE (different β values) LSTM-AE Metrics: Root mean squared error (RMSE) PEMO-Q scores [Huber and Kollmeier 2006]

Unsupervised representation learning Comparative study

Datasets:

NSynth dataset (subset of 10, 000 samples with fs = 16 kHz) [Engel et al. 2017] ARTURIA dataset (1, 233 samples with fs = 44.1 kHz) Time-frequency data representation Magnitude spectrogram Phase spectrogram reconstruction → Griffin & Lim algorithm [Griffin and Lim 1984] → Linear phase unwrapping [Magron 2016]

Fanny ROCHE - PhD Defense - 29 September 2020 26 / 48 Metrics: Root mean squared error (RMSE) PEMO-Q scores [Huber and Kollmeier 2006]

Unsupervised representation learning Comparative study

Datasets:

NSynth dataset (subset of 10, 000 samples with fs = 16 kHz) [Engel et al. 2017] ARTURIA dataset (1, 233 samples with fs = 44.1 kHz) Time-frequency data representation Magnitude spectrogram Phase spectrogram reconstruction → Griffin & Lim algorithm [Griffin and Lim 1984] → Linear phase unwrapping [Magron 2016] Tested models: PCA (baseline) AE and DAE (several architectures) VAE (different β values) LSTM-AE

Fanny ROCHE - PhD Defense - 29 September 2020 26 / 48 Unsupervised representation learning Comparative study

Datasets:
◦ NSynth dataset (subset of 10,000 samples, fs = 16 kHz) [Engel et al. 2017]
◦ ARTURIA dataset (1,233 samples, fs = 44.1 kHz)

Time-frequency data representation:
◦ Magnitude spectrogram
◦ Phase spectrogram reconstruction → Griffin & Lim algorithm [Griffin and Lim 1984]; linear phase unwrapping [Magron 2016]

Tested models: PCA (baseline); AE and DAE (several architectures); VAE (different β values); LSTM-AE

Metrics: root mean squared error (RMSE); PEMO-Q scores [Huber and Kollmeier 2006]
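The Griffin & Lim phase reconstruction mentioned above iterates between imposing the target magnitude and enforcing STFT consistency; a small sketch using scipy (window length, iteration count, and function name are arbitrary choices for the example, not the thesis's settings):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=30, nperseg=256, seed=0):
    # iteratively estimate a phase consistent with the target magnitude `mag`
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))   # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)       # back to time domain
        _, _, Z = stft(x, nperseg=nperseg)               # re-analyze
        if Z.shape[1] < mag.shape[1]:                    # guard frame-count drift
            Z = np.pad(Z, ((0, 0), (0, mag.shape[1] - Z.shape[1])))
        phase = np.exp(1j * np.angle(Z[:, :mag.shape[1]]))
    _, x = istft(mag * phase, nperseg=nperseg)
    return x
```

This is why the models only need to reconstruct the magnitude spectrogram: a workable phase can be recovered afterwards.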


Reconstruction accuracy evaluation on NSynth

Figure: (a) PCA, AE, DAE and LSTM-AE; (b) VAE for different β values (PCA recalled)

Audio examples: original; PCA (enc = 100); PCA (enc = 32); VAE (enc = 32)


Cross-correlation of the latent dimensions (latent dimension of 16)
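This cross-correlation analysis, which checks how decorrelated the latent dimensions are across the dataset, amounts to a correlation matrix over the latent codes; a sketch (function name is illustrative):

```python
import numpy as np

def latent_cross_correlation(Z):
    # Z: (n_samples, latent_dim) matrix of latent codes
    # off-diagonal values near 0 suggest weakly correlated latent dimensions,
    # i.e. dimensions that can be manipulated more independently
    return np.corrcoef(Z, rowvar=False)
```

For a latent dimension of 16, this yields a 16 × 16 matrix whose off-diagonal structure is what the figure visualizes.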


Sound morphing

Audio examples: source, target, and five VAE interpolations between them
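Sound morphing with an autoencoder is typically done by interpolating linearly between the two sounds' latent codes and decoding each intermediate point; a sketch (the decoding step itself is model-specific and omitted, and the function name is an assumption):

```python
import numpy as np

def morph_codes(z_source, z_target, n_steps=5):
    # linear interpolation in latent space; decoding each code yields the morph
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - a) * z_source + a * z_target for a in alphas]
```

Smooth, plausible intermediate sounds along this path are one sign that the latent space has the interpolation properties sought for synthesis control.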

Conclusion

+ Good reconstruction audio quality on both datasets
+ AE-based models are well adapted to the task
+ Latent vectors extracted by VAEs ⇒ good candidates for control
− Latent space poorly related to perceptually relevant dimensions

Towards weak supervision using timbre perception


Objective: give a perceptual meaning to the dimensions extracted by the neural model

Questions:
→ Can perceptual supervision be added so as to "force" the meaning of the latent dimensions?
→ Can the model's behavior be changed using very little annotated data?


Related work: timbre-based perceptual regularization [Esling et al. 2018]
◦ Timbre space relying on several MDS studies [Grey 1977; Krumhansl 1989; Iverson and Krumhansl 1993; McAdams et al. 1995; Lakatos 2000] and a fully labeled dataset of orchestral instruments
◦ Perceptual regularization added to the VAE → additional term in the VLB to optimize

Here: weakly supervised learning ⇒ 2-step semi-supervised learning procedure [Hinton and Salakhutdinov 2007]

Methodology

Proposed 2-step learning procedure (the 2 steps are repeated N times):

1. Unsupervised pre-training → optimizing the classic VLB on XU ∪ XL:
   L(φ, θ, β, x) = E_qφ(z|x)[log pθ(x|z)] − β DKL(qφ(z|x) ‖ pθ(z))
   (reconstruction accuracy; regularization)

2. Supervised fine-tuning → optimizing the VLB with an extra regularization on XL:
   L(φ, θ, β, x) = E_qφ(z|x)[log pθ(x|z)] − β DKL(qφ(z|x) ‖ pθ(z)) + α R(z, P)
   (reconstruction accuracy; regularization; perceptual regularization)
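The alternation above can be sketched as a training loop; `train_step`, `vlb_loss`, and `perceptual_reg` are placeholder names standing in for the model update and the two objectives, not functions from the thesis:

```python
def two_step_training(train_step, X_U, X_L, P, vlb_loss, perceptual_reg,
                      n_repeats=2, alpha=1.0, beta=1.0):
    # hypothetical sketch of the proposed 2-step weakly supervised procedure
    for _ in range(n_repeats):
        # step 1: unsupervised pre-training on X_U and X_L (classic VLB)
        for x in list(X_U) + list(X_L):
            train_step(vlb_loss(x, beta))
        # step 2: supervised fine-tuning on X_L with perceptual regularization
        for x, p in zip(X_L, P):
            train_step(vlb_loss(x, beta) + alpha * perceptual_reg(x, p))
```

The key design choice is that the small labeled set XL only drives the extra regularization term, while the full unlabeled set keeps the reconstruction quality.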

Datasets:

◦ Unlabeled dataset XU → the 1,233 samples of the ARTURIA dataset
◦ Labeled dataset XL → the 80 samples annotated during the perceptual scaling test


Perceptual regularization metric: R(z, P) = MSE(z1:8, ps), the mean squared error between the first 8 latent dimensions and the perceptual scores ps of the sound
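A direct transcription of this metric (assuming the 8 perceptual scores ps are stored as a NumPy vector aligned with the first 8 latent dimensions; the function name is illustrative):

```python
import numpy as np

def perceptual_reg(z, p_s):
    # R(z, P) = MSE(z_1:8, p_s): tie the first 8 latent dimensions
    # to the sound's 8 perceptual scores (one per verbal descriptor)
    return np.mean((z[:8] - p_s) ** 2)
```

Minimizing this term pushes each of the first 8 latent dimensions to track one perceptual scale, which is what makes them controllable in perceptual terms.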


Experiments

Reconstruction accuracy evaluation

Figure: (a) different values of α; (b) different numbers of 2-step procedure iterations


Latent space organization as revealed by t-SNE [van der Maaten and Hinton 2008]

Perceptual evaluation using A/B testing

Objective: assess the effectiveness of the perceptual regularization using a simple listening test

Participants: 30 responses

Stimuli:
◦ Time-related scales removed: percussif, qui résonne, qui évolue
◦ 60 pairs of generated stimuli → 2 offset values
◦ 12 source samples per scale: 3 very representative, 3 unrepresentative, 6 from the test set
◦ Audio examples: Métallique, Agressif


Protocol:


Logistic random-effects regression analysis [Li et al. 2011]

Two-fold purpose of the analysis:
◦ Analyze participants' perception of the VAE variations
◦ Investigate the influence of the dataset (train or test)

Observations:
◦ The perceptually regularized model is able to generalize
◦ Discrepancy between scales

Fanny ROCHE - PhD Defense - 29 September 2020 40 / 48 Towards weak supervision using timbre perception Conclusion

Fair-to-good audio quality achievable for well-chosen α values ⇒ the extra regularization slightly degrades quality, but this can be tackled by enlarging both the unlabeled set XU and the labeled set XL

Influence of 2-step procedure repetition not clearly evidenced

Preliminary validation of our methodology ⇒ ability to generalize although very few labeled data ⇒ captured very well acoustic properties of Agressif & Qui vibre ⇒ difficulty to model Chaud & Soufflé ⇒ results to be confirmed for Métallique

Fanny ROCHE - PhD Defense - 29 September 2020 41 / 48 Conclusion and perspectives Content

1 Perceptual characterization of synthetic timbre State-of-the-art Free verbalization perceptual test Perceptual scaling test Conclusion

2 Unsupervised representation learning Methodology Comparative study Conclusion

3 Towards weak supervision using timbre perception Methodology Experiments Perceptual evaluation Conclusion

4 Conclusion and perspectives

Fanny ROCHE - PhD Defense - 29 September 2020 42 / 48 Conclusion and perspectives Main contributions

Listed 784 terms used to describe synthesizer sounds in French
Identified 8 perceptual verbal descriptors: ◦ métallique ◦ qui vibre ◦ chaud ◦ qui résonne ◦ soufflé ◦ qui évolue ◦ percussif ◦ agressif
Evaluated the degree of consensus of these 8 dimensions
Annotated a subset of 80 synthesizer sounds
Performed an extensive comparison of AE-based models on 2 different datasets (NSynth & ARTURIA) ⇒ validation of the use of these models, and in particular VAEs

Proposed a new methodology to perceptually regularize VAEs ⇒ validation of the 2-step learning method both objectively and perceptually
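The shape of a perceptually regularized VAE objective can be sketched as follows: the standard ELBO terms plus an extra term, weighted by α, that ties chosen latent dimensions to perceptual ratings. The shapes, the number of scales, and the choice of MSE as the distance are illustrative assumptions, not the thesis' exact setup.

```python
# Minimal numpy sketch of a perceptually-regularized VAE objective.
import numpy as np

def perceptual_vae_loss(x, x_hat, mu, logvar, ratings, perc_dims, alpha=0.1):
    """x, x_hat: (batch, features); mu, logvar: (batch, latent);
    ratings: (batch, n_scales) perceptual scores; perc_dims: latent dims to regularize."""
    recon = np.mean((x - x_hat) ** 2)                             # reconstruction term
    kl = -0.5 * np.mean(1 + logvar - mu ** 2 - np.exp(logvar))    # KL divergence term
    perc = np.mean((mu[:, perc_dims] - ratings) ** 2)             # perceptual regularizer
    return recon + kl + alpha * perc

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 32)); x_hat = x + 0.1 * rng.normal(size=(8, 32))
mu = rng.normal(size=(8, 16)); logvar = rng.normal(scale=0.1, size=(8, 16))
ratings = rng.uniform(size=(8, 5))       # 5 perceptual scales (e.g. métallique, ...)
loss = perceptual_vae_loss(x, x_hat, mu, logvar, ratings, perc_dims=list(range(5)))
print(loss)
```

In a 2-step scheme, the unsupervised terms can be trained on the large unlabeled set while the α-weighted term is applied only on the few perceptually annotated sounds.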

Fanny ROCHE - PhD Defense - 29 September 2020 43 / 48 Conclusion and perspectives Perspectives

Deeper analysis of the perceptual dimensions → correlation/redundancy analysis → analyze semantic relationships between the terms → investigate underlying acoustic correlates

Improve VAE framework → consider different distance metrics for perceptual regularization → explore recurrent/dynamical VAE [Girin et al. 2020]

Improve global framework → enlarge ARTURIA dataset → investigate more robust real-time phase reconstruction algorithms → train VAE on complex spectrogram or temporal data [Nugraha et al. 2019] → explore neural vocoders → explore conditional GANs (C-GANs)
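The classical baseline that "more robust real-time phase reconstruction algorithms" would improve on is Griffin-Lim [Griffin and Lim 1984], which can be sketched as follows; the STFT parameters, iteration count, and test signal are illustrative assumptions.

```python
# Sketch of Griffin-Lim phase reconstruction [Griffin and Lim 1984]:
# iteratively enforce a target magnitude spectrogram while keeping the phase
# of the STFT of the current time-domain estimate.
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=30, nperseg=256, seed=0):
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))   # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)       # back to the time domain
        _, _, Z = stft(x, nperseg=nperseg)               # re-analyze the estimate
        Z = Z[:, :mag.shape[1]]                          # align frame count
        phase = np.exp(1j * np.angle(Z))                 # keep phase, drop magnitude
    _, x = istft(mag * phase, nperseg=nperseg)
    return x

# Target magnitude taken from a real signal, so a consistent phase exists
t = np.linspace(0, 1, 8000, endpoint=False)
sig = np.sin(2 * np.pi * 440 * t)
_, _, Z = stft(sig, nperseg=256)
rec = griffin_lim(np.abs(Z))
print(rec.shape)
```

Because it iterates over full analysis/synthesis passes, plain Griffin-Lim is offline; real-time variants and learned alternatives (e.g. neural vocoders, as listed above) trade this iteration for latency-friendly inference.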

Fanny ROCHE - PhD Defense - 29 September 2020 44 / 48 Thank you for your attention!

List of publications:
→ F. Roche, T. Hueber, S. Limier, and L. Girin (2019). "Autoencoders for music sound modeling: A comparison of linear, shallow, deep, recurrent and variational models". In: Proceedings of the Sound and Music Computing Conference (SMC). Málaga, Spain.
→ L. Girin, T. Hueber, F. Roche, and S. Leglaive (2019). "Notes on the use of variational autoencoders for speech and audio spectrogram modeling". In: Proceedings of the International Conference on Digital Audio Effects (DAFx). Birmingham, UK.
Journal submission:
→ F. Roche, T. Hueber, M. Garnier, S. Limier, and L. Girin. Article currently in a blind review process for publication in an international journal.

Fanny ROCHE - PhD Defense - 29 September 2020 45 / 48 References

[Akuzawa et al. 2018] "Expressive speech synthesis via modeling expressions with variational autoencoder". In: Proceedings of the Conference of the International Speech Communication Association (Interspeech).
[Bengio et al. 2007] "Greedy layer-wise training of deep networks". In: Advances in Neural Information Processing Systems (NIPS).
[Bensa et al. 2004] "Perceptive and cognitive evaluation of a piano synthesis model". In: Proceedings of the International Symposium on Computer Music Modeling and Retrieval (CMMR).
[Bishop 2006] Pattern recognition and machine learning. Springer.
[Blaauw and Bonada 2016] "Modeling and transforming speech using variational autoencoders". In: Proceedings of the Conference of the International Speech Communication Association (Interspeech).
[Cance and Dubois 2015] "Dire notre expérience du sonore : nomination et référenciation". In: Langue Française.
[Castellengo 2015] Ecoute musicale et acoustique : avec 420 sons et leurs sonagrammes décryptés. Eyrolles.
[Cheminée et al. 2005] "Analyses des verbalisations libres sur le son du piano versus analyses acoustiques". In: Proceedings of the Conference on Interdisciplinary Musicology (CIM05).
[Colonel et al. 2017] "Improving neural net auto encoders for music synthesis". In: Audio Engineering Society Convention.
[Donahue et al. 2019] "Adversarial audio synthesis". In: Proceedings of the International Conference on Learning Representations (ICLR).
[Ehrette 2004] "Les voix des services vocaux, de la perception à la modélisation". PhD thesis. Paris 11.
[Engel et al. 2017] "Neural audio synthesis of musical notes with WaveNet autoencoders". In: Proceedings of the International Conference on Machine Learning (ICML).
[Engel et al. 2019] "GANSynth: Adversarial neural audio synthesis". In: Proceedings of the International Conference on Learning Representations (ICLR).
[Esling et al. 2018] "Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces". In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
[Faure 2000] "Des sons aux mots, comment parle-t-on du timbre musical ?" PhD thesis. Ecole des Hautes Etudes en Sciences Sociales (EHESS).
[Gaillard 2000] "Etude de la perception des transitoires d'attaque des sons de steeldrum : particularités acoustiques, transformation par synthèse et catégorisation". PhD thesis. Toulouse 2.

Fanny ROCHE - PhD Defense - 29 September 2020 46 / 48 References

[Garnier et al. 2007] "Characterisation of voice quality in Western lyrical singing: From teachers' judgements to acoustic descriptions". In: Journal of Interdisciplinary Music Studies 1.2.
[Girin et al. 2020] "Dynamical variational autoencoders: A comprehensive review". arXiv preprint arXiv:2008.12595.
[Grey 1977] "Multidimensional perceptual scaling of musical timbres". In: The Journal of the Acoustical Society of America 61.5.
[Griffin and Lim 1984] "Signal estimation from modified short-time Fourier transform". In: IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2.
[Guyot 1996] "Etude de la perception sonore en termes de reconnaissance et d'appréciation qualitative : une approche par la catégorisation". PhD thesis. Le Mans.
[Hinton and Salakhutdinov 2006] "Reducing the dimensionality of data with neural networks". In: Science 313.5786.
[Hinton and Salakhutdinov 2007] "Using deep belief nets to learn covariance kernels for Gaussian processes". In: Advances in Neural Information Processing Systems (NIPS).
[Hochreiter and Schmidhuber 1997] "Long short-term memory". In: Neural Computation 9.8.
[Hsu et al. 2017] "Learning latent representations for speech generation and transformation". In: Proceedings of the Conference of the International Speech Communication Association (Interspeech).
[Huber and Kollmeier 2006] "PEMO-Q: A new method for objective audio quality assessment using a model of auditory perception". In: IEEE Transactions on Audio, Speech, and Language Processing (TASLP) 14.6.
[Iverson and Krumhansl 1993] "Isolating the dynamic attributes of musical timbre". In: The Journal of the Acoustical Society of America 94.5.
[Kingma and Welling 2014] "Auto-encoding variational Bayes". In: Proceedings of the International Conference on Learning Representations (ICLR).
[Krumhansl 1989] "Why is musical timbre so hard to understand?". In: Structure and Perception of Electroacoustic Sound and Music.
[Lakatos 2000] "A common perceptual space for harmonic and percussive timbres". In: Perception & Psychophysics 62.7.
[Lavoie 2013] "Conceptualisation et communication des nuances de timbre à la guitare classique". PhD thesis. Université de Montréal.
[Li et al. 2011] "Logistic random effects regression models: a comparison of statistical packages for binary and ordinal outcomes". In: BMC Medical Research Methodology.

Fanny ROCHE - PhD Defense - 29 September 2020 47 / 48 References

[Magron 2016] "Reconstruction de phase par modèles de signaux : application à la séparation de sources audio". PhD thesis. Telecom ParisTech.
[McAdams et al. 1995] "Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes". In: Psychological Research 58.3.
[Nugraha et al. 2019] "A deep generative model of speech complex spectrograms". In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[Rezende et al. 2014] "Stochastic backpropagation and approximate inference in deep generative models". In: Proceedings of the International Conference on Machine Learning (ICML).
[Sarroff and Casey 2014] "Musical audio synthesis using autoencoding neural nets". In: Proceedings of the Joint International Computer Music Conference (ICMC) and Sound & Music Computing Conference (SMC).
[Schaeffer 1966] Traité des objets musicaux. Le Seuil.
[Traube 2004] "An interdisciplinary study of the timbre of the classical guitar". PhD thesis. McGill University.
[van der Maaten and Hinton 2008] "Visualizing data using t-SNE". In: Journal of Machine Learning Research 9.
[Zacharakis et al. 2014] "An interlanguage study of musical timbre semantic dimensions and their acoustic correlates". In: Music Perception: An Interdisciplinary Journal 31.4.

Fanny ROCHE - PhD Defense - 29 September 2020 48 / 48