Perceptual Synthesis Engine: An Audio-Driven Timbre Generator
Tristan Jehan
Diplôme d'Ingénieur en Informatique et Télécommunications
IFSIC - Université de Rennes 1 - France (1997)

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the Massachusetts Institute of Technology, September 2001.

©2001 Massachusetts Institute of Technology. All rights reserved.

Author: Program in Media Arts and Sciences, September 2001
Certified by: Tod Machover, Professor of Music and Media, Thesis Supervisor
Accepted by: Dr. Andrew B. Lippman, Chair, Departmental Committee on Graduate Students, Program in Media Arts and Sciences

Abstract

A real-time synthesis engine which models and predicts the timbre of acoustic instruments based on perceptual features extracted from an audio stream is presented. The thesis describes the modeling sequence, including the analysis of natural sounds, the inference step that finds the mapping between control and output parameters, the timbre prediction step, and the sound synthesis. The system enables applications such as cross-synthesis, pitch shifting or compression of acoustic instruments, and timbre morphing between instrument families. It is fully implemented in the Max/MSP environment.
The Perceptual Synthesis Engine was developed for the Hyperviolin as a novel, generic, and perceptually meaningful synthesis technique for non-discretely pitched instruments.

Advisor: Tod Machover
Title: Professor of Music and Media

Thesis Committee

Thesis Supervisor: Tod Machover, Professor of Music and Media, MIT Program in Media Arts and Sciences
Thesis Reader: Joe Paradiso, Principal Research Scientist, MIT Media Laboratory
Thesis Reader: Miller Puckette, Professor of Music, University of California, San Diego
Thesis Reader: Barry Vercoe, Professor of Media Arts and Sciences, MIT Program in Media Arts and Sciences

To my Cati...

Preface

As a concert violinist with the luxury of owning a Stradivarius violin made in 1732, I have always been skeptical of attempts to "electrify" a string instrument. I have tried various electric violins over the years, but none have compelled me to bring them to the concert hall. The traditional methods of extracting sound from a violin and "enhancing" it electronically usually result in an unappealing and artificial sound. Recently, though, I have been intrigued by the work being done at the Media Lab by Tristan Jehan. I have had the privilege of working with him in the development of a new instrument dubbed the "hyperviolin." This new instrument uses raw data extracted from the audio of the violin, which is then fed into the computer. Using Tristan's "sound models," this raw data provided by me and the hyperviolin can be turned into such sounds as the human voice or the panpipes. When I first heard the sound of a singing voice coming from Tristan's computer, I thought it was simply a recording. But when I found out that it was not anyone singing at all, but merely a "print" of someone's voice applied to random data (pitch, loudness, etc.), I got excited by the possibilities.
When these sound models are used in conjunction with the hyperviolin, I am able to sound like a soprano or a trumpet (or something in between!) all while playing the violin in a normal fashion. The fact that this is all processed on the fly, with little delay between bow-stroke and sound, is testament to the efficiency of Tristan's software. Tristan Jehan's work is certainly groundbreaking and is sure to inspire the minds of many musicians. In the coming months I plan to apply these new techniques to music both new and old. The possibilities are endless.

Joshua Bell

Acknowledgements

I would like to gratefully thank

my advisor Tod Machover for providing me with a space in his group, for supporting this research, and for pushing me along these two years. His ambition and optimism were always refreshing to me.

the other members of my committee, Joe Paradiso, Miller Puckette, and Barry Vercoe, for spending time with this work, and for their valuable insights.

Bernd Schoner for providing his CWM code and for helping me with it. He definitely knows what it means to write a paper, and I am glad he was there for the two that we have written together. Bernd is my friend.

my love Cati Vaucelle for her great support, her conceptual insight, and simply for being there. She has changed my life since I started this project, and it would certainly not have ended up the same without her. My deepest love goes to her, and I dedicate this thesis to her.

Joshua Bell for playing his Stradivarius violin beautifully for the purpose of data collection, for his musical ideas, for spending his precious time with us, and for being positive even when things were not running as expected.

Youngmoo Kim, Hila Plittman, and Tara Rosenberger for lending their voices for the purpose of data collection. Their voice models are very precious material to this work.

Nyssim Lefford and Michael Broxton for help with the recordings and sound editing.
Cyril Drame, whose research and clever ideas originally inspired this work, and for his friendship.

Ricardo Garcia for his valuable insight, refreshing excitement, and for his friendship.

Mary Farbood for her help correcting my English and for her support. Mary is my friend.

Laird Nolan and Hannes Högni Vilhjálmsson for useful assistance regarding the English language.

the members of the Hyperinstruments group who helped in one way or another, and for providing me with a nice work environment.

the Media Lab's Things That Think consortium, and Sega Corporation for making this work possible.

my friends and family for their love and support.

Thank you all.

Contents

Introduction 12

1 Background and Concept 14
1.1 What is Timbre? 16
1.2 Synthesis techniques 17
1.2.1 Physical modeling 18
1.2.2 Sampling 18
1.2.3 Abstract modeling 18
1.2.4 Spectral modeling 19
1.3 Hyperinstruments 19
1.4 A Transparent Controller 22
1.5 Previous Work 25

2 Perceptual Synthesis Engine 29
2.1 Timbre Analysis and Modeling 29
2.2 Timbre Prediction and Synthesis 33
2.3 Noise Analysis/Synthesis 35
2.4 Cluster-Weighted Modeling 38
2.4.1 Model Architecture 38
2.4.2 Model Estimation 41
2.5 Max/MSP Implementation 43

3 Applications 47
3.1 Timbre synthesis 47
3.2 Cross-synthesis 50
3.3 Morphing 51
3.4 Pitch shifting 53
3.5 Compression 54
3.6 Toy Symphony and the Bach Chaconne 55
3.6.1 Classical piece 55
3.6.2 Original piece 57
3.7 Discussion 58

Conclusions and Future Work 60

Appendix A 62

Bibliography

List of Figures

1.1 Our controller: a five-string Jensen electric violin 21
1.2 A traditional digital synthesis system 23
1.3 Our synthesis system 23
2.1 Spectrum of a female singing voice 32
2.2 Typical perceptual-feature curves for a female singing voice 33
2.3 Timbre analysis and modeling using CWM 34
2.4 Typical noise spectrum of the violin 36
2.5 Typical noise spectrum of the singing voice and clarinet 36
2.6 CWM: one-dimensional function approximation 39
2.7 Selected data and cluster allocation 41
2.8 Full model data and cluster allocation 42
3.1 Violin-control input driving a violin model 49
3.2 Three prediction results with a female singing voice input 50
3.3 OpenSound Control server and client 55
3.4 OpenSound Control with the 5-string violin 56
A.1 analyzer~ help file 65
A.2 Perceptual Synthesis Engine Max patch 66
A.3 Simple Morphing Max patch 67

Introduction

From the beginning, with the organ, through the piano and finally to the synthesizer, the evolution of the technology of musical instruments has both reflected and driven the transformation of music. Where it once was only an expression in sound - something heard - in our century music has also become information, data - something to be processed. Digital audio as it is implemented at present is not at all structured, i.e., controllable, scalable, and compact [Casey, 1998]. In the context of musical instruments,