A Parametric Sound Object Model for Sound Texture Synthesis

Daniel Mohlmann¨ A Parametric Sound Object Model for Sound Texture Synthesis Dissertation zur Erlangung des Grades eines Doktors der Ingenieurwissenschaften | Dr.-Ing. | Vorgelegt im Fachbereich 3 (Mathematik und Informatik) der Universitat¨ Bremen im Juni 2011 Gutachter: Prof. Dr. Otthein Herzog UniversitätBremen Prof. Dr. JörnLoviscach Fachhochschule Bielefeld Abstract This thesis deals with the analysis and synthesis of sound textures based on parametric sound objects. An overview is provided about the acoustic and perceptual principles of textural acoustic scenes, and technical challenges for analysis and synthesis are con- sidered. Four essential processing steps for sound texture analysis are identified, and existing sound texture systems are reviewed, using the four-step model as a guideline. A theoretical framework for analysis and synthesis is proposed. A parametric sound object synthesis (PSOS) model is introduced, which is able to describe individual recorded sounds through a fixed set of parameters. The model, which applies to harmonic and noisy sounds, is an extension of spectral modeling and uses spline curves to approximate spectral envelopes, as well as the evolution of parameters over time. In contrast to standard spectral modeling techniques, this repre- sentation uses the concept of objects instead of concatenated frames, and it provides a direct mapping between sounds of different length. Methods for automatic and manual conversion are shown. An evaluation is presented in which the ability of the model to encode a wide range of different sounds has been examined. Although there are aspects of sounds that the model cannot accurately capture, such as polyphony and certain types of fast modula- tion, the results indicate that high quality synthesis can be achieved for many different acoustic phenomena, including instruments and animal vocalizations. In contrast to many other forms of sound encoding, the parametric model facilitates various techniques of machine learning and intelligent processing, including sound clustering and principal component analysis. Strengths and weaknesses of the proposed method are reviewed, and possibilities for future development are discussed. 3 Acknowledgements First and foremost, I wish to thank my advisor, Prof. Dr. Otthein Herzog, for his support, for his advice and encouragement. Our discussions about music, acoustics, methodology and machine learning have always been inspiring for me, and his con- structive criticism has helped to sharpen the focus of this work. I also wish to express my gratitude to the Artificial Intelligence Research Group as a whole, for being helpful and generous in many different ways and giving me the opportunity to travel. In particular, I want to acknowledge the help of Dr. Thomas Wagner, for encouraging me to visit conferences and for his advice regarding a number of machine learning and AI issues. Furthermore, I want to thank my second supervisor, Prof. Dr. JörnLoviscach, for his support and for his help with some rather complicated technical problems. I was lucky to profit from his extensive knowledge in audio processing and his inventiveness. I owe my deepest gratitude to the Klaus Tschira Foundation for its generous support, and to the Graduate School Advances in Digital Media for providing the kind of creative environment and the helpful assistance that is invaluable for a work like this. Finally, I want to thank my parents for their support, and I want to thank Silke Gennies for being so infinitely encouraging and patient during all of this. 5 Contents 1 Introduction and Motivation 13 1.1 Goals and Research Questions . 13 1.2 Definitions . 14 1.3 Applications for Adaptive Sound Textures . 19 1.4 Overview of this Thesis . 23 2 Foundations of Sound Production and Perception 25 2.1 Sound Sources and Acoustics . 26 2.2 Human Hearing . 36 2.3 Temporal Patterns and Structure . 49 3 Sound Synthesis and Manipulation 55 3.1 Model Requirements for Different Applications . 55 3.2 An Overview of Sound Coding Concepts . 59 3.3 Additive Synthesis . 67 3.4 Spectral Modeling . 71 3.5 Spectral Envelope Coding . 74 3.6 Sound Source Separation . 79 4 Sound Texture Analysis and Synthesis Algorithms 85 4.1 Requirements for Sound Texture Synthesis . 85 4.2 Texture Algorithms in Computer Graphics . 89 4.3 Manually Constructed Sound Textures . 93 4.4 Automatic Synthesis-by-Analysis Methods . 95 4.5 Comparison of the Methods . 99 4.6 Chapter Conclusions . 104 5 Proposed Sound Texture Description Model and Workflow 107 5.1 Technical Aspects of the Sound Texture Problem . 107 5.2 A Hierarchical Description Model for Sound Textures . 113 6 Implementation of the Sound Element Analysis and Synthesis 115 6.1 Overview of the Parametric Sound Model . 115 6.2 B-Spline Coding . 116 6.3 Encoding of the Fundamental Frequency Trajectory . 119 6.4 The Spectral Envelope . 121 6.5 Re-Synthesis . 127 6.6 Choice of the Model Resolution . 130 7 8 Contents 6.7 Possible Extensions and Additional Parameters . 133 6.8 Limitations of the Sound Element Model . 136 7 Evaluation 139 7.1 Inter-Subjective Evaluation Methods . 139 7.2 Evaluation Methods of Audio Coding Quality . 142 7.3 Sound Element Test Method . 144 7.4 Survey Results . 149 7.5 Compression Performance . 158 8 Conclusion 161 8.1 Contributions of this Work . 161 8.2 Observations About Sound Modeling . 163 8.3 Future Work . 165 8.4 Final Thoughts . 167 References 169 A Sound Texture Definitions 181 B Spectrograms of Sounds in the Evaluation 185 C PCA-Based Dimensionality Reduction for Sound Elements 193 List of Figures 1.1 Creature design in Spore ........................... 21 2.1 Different types of wave propagation . 27 2.2 Frequency, amplitude and phase parameters of a sinusoid . 28 2.3 Spectrogram of a plucked guitar string . 29 2.4 Harmonics of a guitar string . 29 2.5 The first six vibrational modes of a pitched drum . 31 2.6 Cross-section of the human vocal tract . 35 2.7 Cross-section of the human ear . 37 3.1 Screenshot of the GeneSynth software . 65 3.2 Illustration of the sinusoid tracking procedure . 69 3.3 Illustration of parabolic interpolation . 71 3.4 Interpolation behavior of the LPC model . 77 3.5 Interpolation behavior of the MFCC model . 78 3.6 Interpolation behavior of the spline model . 80 5.1 UML component diagram of the sound texture model . 113 5.2 Example structure of a sound texture with three simultaneous tracks . 114 6.1 Illustration of the effect of the spline order on basis functions . 117 6.2 Knot placement in a B-spline with fixed start and end points . 118 6.3 Comparison of the B-spline model and the Beziérspline model . 119 6.4 Screenshot of the implemented tracking and conversion software . 121 6.5 Correspondence between the envelope coefficient matrix and the spectral envelope model . 122 6.6 Computation of the partial peak amplitudes . 123 6.7 Illustration of the conversion from observed peaks to virtual peaks . 125 6.8 Different methods for improved transient modeling . 132 6.9 Illustration of the correspondence problem . 138 7.1 Screenshot from the online evaluation (participant data) . 148 7.2 Screenshot from the online evaluation (test sample presentation) . 149 7.3 Statistical properties of the group of participants . 150 7.4 Evaluation results for directly converted sounds . 153 7.5 Evaluation results for sounds created from two originals . 156 B.1 Spectrograms of the original and synthetic \bee" sound . 185 B.2 Spectrograms of the original and synthetic \hawk" sound . 185 9 10 List of Figures B.3 Spectrograms of the original and synthetic \cow" sound . 186 B.4 Spectrograms of the original and synthetic \dog bark" sound . 186 B.5 Spectrograms of the original and synthetic \pony" sound . 186 B.6 Spectrograms of the original and synthetic \waterdrop" sound . 187 B.7 Spectrograms of the original and synthetic \cuckoo" sound . 187 B.8 Spectrograms of the original and synthetic \gunshot" sound . 187 B.9 Spectrograms of the original and synthetic \creak" sound . 188 B.10 Spectrograms of the original and synthetic \warbler" sound . 188 B.11 Spectrograms of the original and synthetic “flute" sound . 188 B.12 Spectrograms of the original and synthetic \guitar" sound . 189 B.13 Spectrograms of the original and synthetic \large splash" sound . 189 B.14 Spectrograms of the original and synthetic \rain" sound . 190 B.15 Spectrograms of the original and synthetic \songthrush" sound . 190 B.16 Spectrograms of the original and synthetic \singing" sound . 190 B.17 Spectrograms of the original and synthetic \piano" sound . 191 B.18 Spectrograms of the original and synthetic “traffic” sound . 191 B.19 Spectrograms of the original and synthetic \thunder" sound . 191 B.20 Spectrograms of the original and synthetic \gravel" sound . 192 B.21 Spectrograms of the original and synthetic \rooster" sound . 192 C.1 Illustration of the principal component analysis . 194 List of Tables 2.1 Frequency bands on the Bark scale . 41 4.1 Comparison of sound texture algorithms and software . 101 7.1 List of test samples in the evaluation for directly parametrized sounds . 146 7.2 List of test samples in the evaluation for morphed parametrized sounds 147 7.3 Influence of the equipment quality in the evaluation . 158 A.1 Definitions of the term \sound texture" in existing publications . 183 11 12 List of Tables Chapter 1 Introduction and Motivation In audio processing, music and speech have always been at the center of attention (Athineos & Ellis, 2003). Their relevance for communication and entertainment is immediately obvious. However, in recent years a new discipline has developed that directs its attention to the domain of so-called sound textures, the chaotic multitude of atmospheric weather sounds, traffic noises, singing birds and babbling crowds that make up a great deal of everyone's daily life.

A Parametric Sound Object Model for Sound Texture Synthesis

Art to Commerce: the Trajectory of Popular Music Criticism

Brazilian/American Trio São Paulo Underground Expands Psycho-Tropicalia Into New Dimensions on Cantos Invisíveis, a Global Tapestry That Transcends Place & Time

Separation of Vocal and Non-Vocal Components from Audio Clip Using Correlated Repeated Mask (CRM)

Masterarbeit

One Direction Infection: Media Representations of Boy Bands and Their Fans

Psychological Measurement for Sound Description and Evaluation 11

A Unit Selection Text-To-Speech-And-Singing Synthesis Framework from Neutral Speech: Proof of Concept 39 II.1 Introduction

Design of a Protocol for the Measurement of Physiological and Emotional Responses to Sound Stimuli

Comparison of Frequency-Warped Filter Banks in Relation to Robust Features for Speaker Identification

Fundamentals of Psychoacoustics Psychophysical Experimentation

Abstract 1. Introduction the Process of Normalising Vowel Formant Data To

Perceptual Atomic Noise