Making music through real-time voice timbre analysis: machine learning and timbral control

Dan Stowell

PhD thesis
School of Electronic Engineering and Computer Science
Queen Mary University of London
2010

Abstract

People can achieve rich musical expression through vocal sound - see for example human beatboxing, which achieves a wide timbral variety through a range of extended techniques. Yet the vocal modality is under-exploited as a controller for music systems. If we can analyse a vocal performance suitably in real time, then this information could be used to create voice-based interfaces with the potential for intuitive and fulfilling levels of expressive control. Conversely, many modern techniques for music synthesis do not imply any particular interface. Should a given parameter be controlled via a MIDI keyboard, a slider/fader, or a rotary dial? Automatic vocal analysis could provide a fruitful basis for expressive interfaces to such electronic musical instruments.

The principal questions in applying vocal-based control are how to extract musically meaningful information from the voice signal in real time, and how to convert that information suitably into control data. In this thesis we address these questions, with a focus on timbral control, and in particular we develop approaches that can be used with a wide variety of musical instruments by applying machine learning techniques to automatically derive the mappings between expressive audio input and control output. The vocal audio signal is construed to include a broad range of expression, in particular encompassing the extended techniques used in human beatboxing.

The central contribution of this work is the application of supervised and unsupervised machine learning techniques to automatically map vocal timbre to synthesiser timbre and controls. Component contributions include a delayed decision-making strategy for low-latency sound classification, a regression-tree method to learn associations between regions of two unlabelled datasets, a fast estimator of multidimensional differential entropy, and a qualitative method for evaluating musical interfaces based on discourse analysis.
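As a concrete illustration of the mapping idea summarised above - per-frame timbral features extracted from vocal audio, with a learned model converting them into synthesiser control data - the following is a minimal sketch in Python. It is not the implementation developed in the thesis (which is built around SuperCollider); the feature set, the `timbre_features` helper and the use of scikit-learn's `RandomForestRegressor` on placeholder training data are illustrative assumptions chosen for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def timbre_features(frame, sr=44100):
    """Per-frame timbral features: RMS, zero-crossing rate,
    spectral centroid and spectral spread (illustrative choices)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power = spectrum ** 2
    total = power.sum() + 1e-12
    centroid = (freqs * power).sum() / total
    spread = np.sqrt(((freqs - centroid) ** 2 * power).sum() / total)
    rms = np.sqrt(np.mean(frame ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return np.array([rms, zcr, centroid, spread])

# Offline training: paired examples of vocal-frame features (X) and the
# synthesiser parameter settings (y) they should drive. Both are random
# placeholders here, standing in for a real training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))     # 500 frames x 4 features
y_train = rng.uniform(size=(500, 3))    # 3 synth parameters in [0, 1]
mapper = RandomForestRegressor(n_estimators=50).fit(X_train, y_train)

# Real-time use: for each incoming audio frame, extract features and
# predict the control values to send to the synthesiser (e.g. over OSC).
incoming_frame = rng.normal(scale=0.1, size=1024)   # placeholder audio frame
controls = mapper.predict(timbre_features(incoming_frame)[None, :])[0]
print("synth controls:", controls)
```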
Acknowledgements

I'd like to thank everyone in the C4DM at QMUL, who've been brilliant folks to work with, with such a wide range of backgrounds and interests that have made it an amazing place to develop my ideas. In particular my supervisor Mark Plumbley, whose wide-ranging knowledge and enthusiasm has been invaluable in every aspect of my PhD work. Also I must acknowledge the SuperCollider community, since it's not just the main tool I use but also the software & community that enabled me to dabble with these ideas before even contemplating a PhD. Thanks to Nick Collins and Julian Rohrhuber, who got me going with it during their summer school, plus all the other friends I've made at symposiums/workshops as well as online. And James McCartney for creating SuperCollider in the first place, and his forward-thinking approach to language design which made it a pleasure to work with.

Philippa has to get major credit for at least some of this thesis, not only for much proofreading and discussion, but also moral & practical support throughout the whole three or four years. It wouldn't have been possible without you.

Shout outs to: toplap, humanbeatbox.com, F-Step massive, Ladyfest, dorkbot london, goto10, openlab london. Also a shout out to Matt Sharples for some inspiration in the 8-bit department which found its way into all this, and to my family for all their support.

Most of the tools I use are open-source, meaning that they're maintained by miscellaneous hordes of (often unpaid) enthusiasts around the world. So I'd also like to take the time to acknowledge the thousands who contributed to the excellence of the open-source tools that allowed me to do my PhD research so smoothly. Besides SuperCollider, these were crucial: vim, Python+numpy, gnuplot, BibDesk, LaTeX, TeXshop, jackd, Inkscape and Sonic Visualiser.

Licence

This work is copyright © 2010 Dan Stowell, and is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported Licence. To view a copy of this licence, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Contents

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Thesis structure
1.4 Contributions
1.5 Associated publications

2 Background
2.1 The human vocal system
2.2 The beatboxing vocal style
2.2.1 Extended vocal technique
2.2.2 Close-mic technique
2.2.3 Summary
2.3 Research context
2.3.1 Speech analysis
2.3.2 Singing voice analysis
2.3.3 Musical timbre
2.3.4 Machine learning and musical applications
2.4 Strategy

3 Representing timbre in acoustic features
3.1 Features investigated
3.2 Perceptual relevance
3.2.1 Method
3.2.2 Results
3.3 Robustness
3.3.1 Robustness to variability of synth signals
3.3.2 Robustness to degradations of voice signals
3.4 Independence
3.4.1 Method
3.4.2 Results
3.5 Discussion and conclusions

4 Event-based paradigm
4.1 Classification experiment
4.1.1 Human beatbox dataset: beatboxset1
4.1.2 Method
4.1.3 Results
4.2 Perceptual experiment
4.2.1 Method
4.2.2 Results
4.3 Conclusions

5 Continuous paradigm: timbre remapping
5.1 Timbre remapping
5.1.1 Related work
5.1.2 Pitch-timbre dependence issues
5.1.3 Nearest-neighbour search with PCA and warping
5.2 The cross-associative multivariate regression tree (XAMRT)
5.2.1 Auto-associative MRT
5.2.2 Cross-associative MRT
5.3 Experiments
5.3.1 Concatenative synthesis
5.3.2 Vowel formant analysis
5.4 Conclusions

6 User evaluation
6.1 Evaluating expressive musical systems
6.1.1 Previous work in musical system evaluation
6.2 Applying discourse analysis
6.2.1 Method
6.3 Evaluation of timbre remapping
6.3.1 Reconstruction of the described world
6.3.2 Examining context
6.4 Discussion
6.5 Conclusions

7 Conclusions and further work
7.1 Summary of contributions
7.2 Comparing the paradigms: event classification vs. timbre remapping
7.3 Further work
7.3.1 Thoughts on vocal musical control as an interface
7.3.2 Closing remarks

A Entropy estimation by k-d partitioning
A.1 Entropy estimation
A.1.1 Partitioning methods
A.1.2 Support issues
A.2 Adaptive partitioning method
A.3 Complexity
A.4 Experiments
A.5 Conclusion

B Five synthesisers
B.1 simple
B.2 moogy1
B.3 grainamen1
B.4 ay1
B.5 gendy1

C Classifier-free feature selection for independence
C.1 Information-theoretic feature selection
C.2 Data preparation
C.3 Results
C.4 Discussion

D Timbre remapping and Self-Organising Maps

E Discourse analysis data excerpts

List of Figures

2.1 Functional model of the vocal tract, after Clark and Yallop [1995, Figure 2.2.1].

2.2 Laryngograph analysis of two seconds of "vocal scratching" performed by the author. The image shows (from top to bottom): spectrogram; waveform; laryngograph signal (which measures the impedance change when the larynx opens/closes - the signal goes up when the vocal folds close, and down when they open); and fundamental frequency estimated from the laryngograph signal. The recording was made by Xinghui Hu at the UCL Ear Institute on 11th March 2008.

3.1 Variability (normalised standard deviation) of timbre features, measured on a random sample of synth settings and 120 samples of timbre features from each setting. The box-plots indicate the median and quartiles of the distributions, with whiskers extending to the 5- and 95-percentiles.

4.1 An approach to real-time beatbox-driven audio, using onset detection and classification.

4.2 Numbering the "delay" of audio frames relative to the temporal location of an annotated onset.

4.3 Separability measured by average KL divergence, as a function of the delay after onset. At each frame the class separability is summarised using the feature values measured only in that frame. The grey lines indicate the individual divergence statistics for each of the 24 features, while the dark lines indicate the median and the 25- and 75-percentiles of these values.

4.4 Classification accuracy using a Naive Bayes classifier.

4.5 Waveform and spectrogram of a kick followed by a snare, from the beatboxset1 data. The duration of the excerpt is around 0.3 seconds, and the spectrogram frequencies shown are 0-6500 Hz.

4.6 The user interface for one trial within the MUSHRA listening test.

4.7 Results from the listening test, showing the mean and 95% confidence intervals (calculated in the logistic transformation domain) with whiskers extending to the 25- and 75-percentiles. The plots show results for the three drum sets separately. The durations given on the horizontal axis indicate the delay, corresponding to 1/2/3/4 audio frames in the classification experiment.

5.1 Overview of timbre remapping. Timbral input is mapped to synthesiser parameters by a real-time mapping between two timbre spaces, in a fashion which accounts for differences in the distribution of source and target timbre.

5.2 Pitch tracking serves a dual role in the timbre remapping process.
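As an aside on the kind of analysis summarised in the caption of Figure 4.3 above, the sketch below shows one way to compute a per-feature class-separability score for each frame delay after an onset, using a symmetrised KL divergence between Gaussian fits to the two classes. The Gaussian assumption, the symmetrisation and the toy data are assumptions made purely for this illustration, not the exact computation used in Chapter 4.

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL divergence between univariate Gaussians, N(mu0, var0) || N(mu1, var1)."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def separability_per_delay(features, labels):
    """Symmetrised KL divergence between two classes, per feature and frame delay.

    features: array of shape (n_events, n_delays, n_features), one row of
              per-frame feature values for each annotated onset.
    labels:   array of shape (n_events,) with values 0 or 1 (e.g. kick vs snare).
    Returns an array of shape (n_delays, n_features).
    """
    a, b = features[labels == 0], features[labels == 1]
    mu_a, var_a = a.mean(axis=0), a.var(axis=0) + 1e-9
    mu_b, var_b = b.mean(axis=0), b.var(axis=0) + 1e-9
    return gaussian_kl(mu_a, var_a, mu_b, var_b) + gaussian_kl(mu_b, var_b, mu_a, var_a)

# Toy data: 200 events, 4 frame delays after onset, 24 features per frame.
# Class 1 is offset increasingly with delay, so separability grows with delay.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)
offsets = labels[:, None, None] * np.linspace(0.0, 1.0, 4)[None, :, None]
features = rng.normal(size=(200, 4, 24)) + offsets

sep = separability_per_delay(features, labels)
for delay, row in enumerate(sep):
    print(f"delay {delay}: median separability {np.median(row):.3f}")
```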