Two-dimensional gesture mapping for interactive real-time electronic music instruments

Thomas Grill [email protected]

October 2008

Eidesstattliche Erklärung:

Ich habe die vorliegende Arbeit selbst verfasst. Außer den angeführten wurden keine weiteren Hilfsmittel oder Quellen verwendet. Stellen, welche wörtlich oder sinngemäß aus anderen Arbeiten übernommen wurden, sind als solche gekennzeichnet.

Thomas Grill

Abstract

Electronic music instruments have become widespread complements to the variety of traditional acoustic instruments, opening up multifaceted possibilities for the creation of sound and structure in the domains of both composition and live performance. Yet, the prevalent dependence on complex technologies and computer-based hardware tends to hide the interesting artistic potentials behind abstract interface and control concepts that are hardly compatible with musical thinking and acting. This thesis presents fundamental thoughts and technical developments in order to make electronic instruments as intuitive and expressive as their acoustic ancestors and explores a novel approach to the transformation of a player’s musical gestures into actual sound.

We can build this transformation on two principles: First, the idea of energy propagation and transformation through the instrument from gestural actions to sonic developments, and second, the foundation of the transformation process on a representation which is based on perceptually and musically meaningful structure elements. As gestural and sonic developments are complex by their nature, the presented concept constricts the focus of musical action to user-compilable repertories of gesture and sound representation. This allows us to break down the high-dimensional complexity to two-dimensional map-like collections of structure elements.

By basing the entire transformation process from gestures to sound on such a two-dimensional projection, a more immediate control of transformation strategies and more intuitive visual feedback is possible, since the two-dimensional structures can be easily laid out and controlled on screen. This thesis researches both the theoretical background and the technological prerequisites. As a proof of concept, an actual implementation of a new software-based real-time instrument using such two-dimensional gesture mapping is presented.

Zusammenfassung

Elektronische Musikinstrumente bilden eine mittlerweile weit verbreitete Ergänzung zur Vielfalt an traditionellen akustischen Instrumenten. Sie eröffnen vielfältige Möglichkeiten zur Gestaltung von Klang und Struktur innerhalb der Bereiche von sowohl Komposition als auch Live-Performance. Die vorherrschende Abhängigkeit von komplexen Technologien und computerbasierter Hardware führt jedoch zu einer Abschottung der interessanten künstlerischen Potentiale hinter abstrakten Interfaces und Steuerungskonzepten, die kaum mit künstlerischem Denken und Agieren vereinbar sind. Diese Arbeit stellt grundlegende Überlegungen und technische Entwicklungen vor, um elektronische Instrumente genau so intuitiv und expressiv wie ihre akustischen Vorfahren zu machen, und untersucht einen neuartigen Zugang zur Transformation von musikalischen Gesten in erzeugten Klang.

Eine solche Transformation kann auf zwei Prinzipien basieren: Erstens, der Idee der Energiefortpflanzung und -transformation durch ein Instrument – von gestischen Abläufen zu klanglichen Entwicklungen – und zweitens, der Verankerung des Transformationsprozesses in einer Repräsentation bestehend aus perzeptiv und musikalisch sinnvollen Strukturelementen. Weil gestische und klangliche Entwicklungen von Natur aus komplex sind, schränkt das vorgestellte Konzept den Fokus musikalischer Aktion auf durch den Benutzer zusammenstellbare Repertoires ein. Dies erlaubt, die hochdimensionale Komplexität auf zweidimensionale landkartenartige Bestände von Strukturelementen herunterzubrechen.

Die Fundierung des gesamten Transformationsprozesses von Gesten zu Klang auf einer solchen zweidimensionalen Projektion ermöglicht unmittelbarere Kontrolle von Transformationsstrategien und intuitiveres visuelles Feedback, da zweidimensionale Strukturen einfach auf dem Bildschirm zu veranschaulichen sind. Diese Arbeit untersucht sowohl die theoretischen Grundlagen wie auch technologische Voraussetzungen. Als ‘proof-of-concept’ wird die Implementierung eines neuen Software-basierten Echtzeit-Instruments unter Verwendung eines solchen zweidimensionalen Gesten-Mappings präsentiert.

Acknowledgements

My heartfelt thanks go out –

for everything related to perceiving music – to Wolfgang Musil, Günther Rabl, Don van Vliet, et al.

for everything related to performing music – to Angélica Castelló, Katharina Klement, Low Frequency Orchestra, et al.

for everything related to researching music – to Volkmar Klien, Martin Gasser, et al.

for everything, really – to Marion Pötz

Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Electronic instruments
  1.2 Personal motivation for this project
  1.3 Anticipated instrument architecture
2 Background
  2.1 Representation of sound and music
    2.1.1 Time-domain representation
    2.1.2 Time-frequency representation
    2.1.3 Sinusoidals and residuum
    2.1.4 Acoustic quanta
    2.1.5 Sound objects and spectromorphologies
    2.1.6 Representation based on perceptive similarity
  2.2 Sound synthesis
    2.2.1 Sample-based synthesis
    2.2.2 Spectral modeling synthesis
    2.2.3 Physical modeling
  2.3 Playing an (electronic) music instrument
    2.3.1 Man-machine interaction
    2.3.2 Sensoric input
    2.3.3 Gesture mapping
    2.3.4 Instrument feedback
    2.3.5 Software development systems
3 Instrument design
  3.1 Basic considerations
  3.2 Sensoric input
  3.3 Input analysis
  3.4 Gesture repertory
  3.5 Sound repertory
  3.6 Gesture transformation
  3.7 Sound synthesis
    3.7.1 Granular synthesis
    3.7.2 Spectral modeling synthesis
  3.8 Instrument feedback
4 Implementation and assessment
  4.1 Repertory preprocessing
  4.2 Real-time operation
  4.3 Hardware considerations
  4.4 Early practical experiences
5 Conclusions and future perspectives
A patches
List of figures
Bibliography

Chapter 1

Introduction

Which is more musical, a truck passing by a factory or a truck passing by a music school? (John Cage [Cag73])

1.1 Electronic instruments

Traditional acoustic instruments have been with us for a long time. We have become accustomed to the variety of different sounds and to the physical materials they are made of, just as we have become familiar with the (occasionally somewhat peculiar) human gestures performed in order to make these instruments play what we call music. And, at the same time, we are embedded in a universe of non-musical sound – a gigantic sea of noises effortlessly absorbing any conception of what might be musical sound within a wealth of possibilities in structure and timbre. It is owing to electronic instruments that music is not bound to the acoustics of vibrating sound-boxes any longer – whatever one’s imagination provides, it can be cast into sound. There are no physical fundamentals or material properties to be considered, there are no bodily restraints to be respected, there is no canonical school of techniques to be followed.

One of the things I see as a possibility with these [electronic] instruments is [...] that we could take some of the focus off of physical virtuosity and the kind of athleticism of learning to play an instrument, and put as much focus as we could on the mental and emotional activities of music, whether it’s being a better listener or imagining things and making them happen or interpreting things to your liking. (Tod Machover [Mac99])

Yet, how can we keep the complexity and broad spectrum of all these possibilities manageable? How can we make our powerful new instruments instantly playable? Present-day makers of music instruments can choose from two traditions to follow, “a millennial one, as old and rich as the history of music making, and a half-century old one, that of computer music” [Jor01]. It is an important aspect of this thesis to consider experiences from both branches: to make use of our physical intimacy and technical familiarity with traditional acoustic instruments and to look into all those new methods of digital sound synthesis in order to allow for previously unheard sounds and structures.

1.2 Personal motivation for this project

By the year 2001 I was gradually becoming busier playing as a computer musician. Up to this point I had followed the habit of modifying my software instruments for every single performance, in the belief that the instrument design should in some way reflect the concept of the musical piece to perform, and vice versa. Gradually I discovered that the increased complexity of the respective instruments and their perpetual modification overtaxed my ability to handle them properly while performing. It became necessary to spend more time on practice rather than on developing new instruments. As a way out I decided to design a more general and long-lasting instrument that should serve all my basic needs without major modifications for each individual project. After some weeks of intense development I was able to perform on my new instrument (figure 1.1): a simple 4-track sample player with access to a sound library, live-recording and mix-in possibility, looping and pitch-bending by resampling, as well as a mixing matrix for multi-channel setups. On the hardware side it uses only the laptop screen, some keyboard shortcuts, mouse selection and – somewhat optionally – a 16-way MIDI controller with knobs or faders. Within seven years this instrument has been used in more than 100 performances, with only one minor modification for more complex surround-sound situations.

Figure 1.1: ‘PLAY’ 4-track sample-based performance instrument, based on Max/MSP

One of the major design considerations was that the instrument itself should exert as little influence on sound or musical structure as possible, thus providing no automatisms. All sound output should rather be defined by the selected sounds from the library and the ad-hoc playing techniques used to manipulate these sounds. With these very limited possibilities, ample practice, and by using only a restricted set of sound samples specific for each ensemble or artistic concept I wanted to establish a unique and recognizable personal style of playing. The relatively narrow scope of the instrument at hand and the consequential necessity of exploring and exploiting all of its possibilities made me familiar with all soft and rough edges of the system. Naturally, over time it became apparent in which ways the instrument could be used efficiently and in which situations it was rather inadequate. I discovered and musically applied interesting bugs and side effects but also reached the limits in terms of possible interaction speed when trying to keep up with fellow (acoustic) musicians. Additionally, some severe interface-related shortcomings surfaced:

• Waveform displays do not always adequately represent the audio con- tent

• Input devices are not ergonomic, resulting in cramps while performing intensely (especially with mouse movement or MIDI faders)

• Keyboard shortcuts are easily mixed up

• Mouse selection is important for loops, but while dragging, the mouse easily slips off the mouse-pad or over the table edge

• The MIDI controllers used have only 7 bits of resolution, resulting in rather coarse movements

• Keyboard shortcuts interrupt mouse-dragging, hence these two ges- tures can’t be used concurrently

• It is almost impossible to actively play on one sample track and pre- pare another one at the same time (by loading or positioning the play cursor)

Although some of these issues could be resolved by modifying the input hardware or establishing workarounds in the underlying Max/MSP system, a more general problem emerged over time: it became obvious that the listed flaws as well as the limited sound manipulation possibilities had an actual impact on the resulting music – partly because some things cannot be done, partly because some things are done in distinct ways (e.g. MIDI controller movement). I also felt that the rather technical interface including sliders, toggles, number boxes and waveform display is not helpful in representing what is actually going on in musical action and thinking – instead, dealing with the abstract interface distracts a player’s concentration and consumes energy because of the necessary translation efforts.

Traveling in China in 2004 I was lucky to meet the media artist couple Jia Haiqing and Fu Yu aka 8GG1 in Beijing. When they showed me some of their musical work I was surprised that it was not separable from visual arts, as everything had been designed as a cohesive whole. Even if the visual interfaces (see figure 1.2, developed in Shockwave2 and also used as video projections during the performance) did not use any explicitly musical clues like notes and structures, they were so much more musical than what I had seen and used before, most probably because of the concentration on content and form rather than on technical production necessities. Later I discovered that I was not well informed and that a number of such synesthetic environments had already existed for quite some time (see section 2.3.4). On the other hand, what I really wanted was an instrument (that is: a counterpart to an acoustic instrument) rather than an audiovisual environment with a broader focus. The envisioned instrument concept should unite the intuitiveness, haptic qualities and portability of a physical instrument with the flexibility and potential abstractness (also via multichannel spatialization) of a virtual music instrument, and with a visual interface that supports musical thinking.

1 8GG http://www.8gg.com, last visited 14.10.2008
2 Macromedia Shockwave, now Adobe Director http://www.adobe.com/products/director, last visited 12.10.2008

Figure 1.2: 8GG - ‘dance taiji’ synesthetic performance interface.

1.3 Anticipated instrument architecture

There are a large number of varied ‘new interfaces for musical expression’ under current research, as discussed in the following chapter 2. However, the various existing concepts are either instruments with rather fixed relationships of gestures and sound (and often also fixed interface device and specific stylistic characteristics), or they are very general-purpose toolkits without any explicit commitment to interaction or synthesis strategies. Thirdly, there are a few elaborate music systems with rather complex and costly technical prerequisites. Obviously, there is no flexible instrument concept which is adaptable in terms of interface and sound synthesis but which formulates a definite interaction strategy. A few characteristics seem to be indispensable for such an instrument when following the thoughts of the previous section:

• The instrument should feel like an acoustic instrument in terms of how sounds are shaped by human interaction. There should be an immediate correspondence of the sound output to the input.

• It should be intuitive in a way that the appearance is self-explanatory and there is nothing to adjust before playing can commence.

• It should be highly expressive so that even with increasing practice there’s always room to evolve.

• The instrument should make it possible to devise sounds and structures that are absolutely novel (independent of recorded or otherwise existing sounds) yet repeatable, depending on practice.

• It must be possible to develop a unique and recognizable style.

• It must be possible to produce many different kinds of sounds.

• The system should be self-contained and lightweight; possible outboard equipment (besides the actual portable computer) should not be an integral part of the instrument, but rather be optional.

The points are very general and it is clear that there may be many inherently different instrument designs able to fulfill them. There is one additional point which is more distinguishing:

• There should be a way to rely on pre-existing (e.g. recorded), perceptively and/or musically meaningful structure elements as a kind of underlying construction material3.

3 This can be seen as a nostalgic fallback to the beloved sample library of the old instrument.

Hence, the central research questions of this thesis are the following:

How can we suitably base the synthesis of electronic music on both a musician’s gestures and on a repertory of musical elements? How can we make this process intuitively comprehensible by the player?

The propositions regarding an envisioned instrument design can be formulated as such:

We can follow a two-step approach by first associating gestures with musical structure elements, and then using these to synthesize sounds. By finding a flat, map-like form of organization for the repertory of musical elements we can base the whole process of transformation from a musician’s gestures to synthesized sound on a two-dimensional representation. Thus, the process can be directly visualized on screen, which will help a player to follow and control the transformation and to more intuitively understand the connection of gestural actions and sound output.

Figure 1.3 outlines the basic architecture of the desired instrument. The instrument should be reconfigurable in a way that the basic manner of interaction does not change even if the actual input device (the sensoric hardware) is exchanged or if a different synthesis method (or sound library) is used. Instead, such reconfigurations should be considered as presets of the instrument system. Neither the synthesis repertories nor the mapping process between input and output should be hardcoded, but rather adjustable or incrementally trainable. Preferably, the system should be able to adapt to changes of the vocabularies (e.g. new gestures or additions to a sound library) quasi-automatically and smoothly change its behavior, so that the player can always easily adapt to it in the course of practicing. This effectively represents a feedback loop with both the player and the instrument continually adjusting to respective changes of techniques or repertoire. Clearly, practicing such a kind of instrument is vital for the success of mastering it. Detailed control is not built in through fine-grained dials or multi-level menus – it can only be achieved by means of sensible interaction with the whole instrument system.
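To make the anticipated signal flow tangible, here is a minimal Python sketch that mirrors the stages of figure 1.3 as plain functions. All names (extract_features, map_to_repertory, resynthesize) and the nearest-neighbor mapping are hypothetical illustrations, not part of the thesis implementation.

```python
import numpy as np

def extract_features(gesture: np.ndarray) -> np.ndarray:
    """Reduce raw sensor data (touch, movement, sound) to a few
    perceptually meaningful features, e.g. energy and rate of change."""
    energy = float(np.mean(gesture ** 2))
    flux = float(np.mean(np.abs(np.diff(gesture))))
    return np.array([energy, flux])

def map_to_repertory(features: np.ndarray, repertory: np.ndarray) -> int:
    """Select the structure element whose stored feature vector is
    closest to the incoming gesture features (nearest neighbor)."""
    distances = np.linalg.norm(repertory - features, axis=1)
    return int(np.argmin(distances))

def resynthesize(element: int, duration: float = 0.5, sr: int = 44100) -> np.ndarray:
    """Placeholder synthesis: each repertory element is rendered as a
    simple sine tone; a real instrument would use granular or spectral
    modeling synthesis instead."""
    t = np.arange(int(duration * sr)) / sr
    freq = 220.0 * 2 ** (element / 12)
    return np.sin(2 * np.pi * freq * t)

# toy usage: a random "gesture" selects one of 8 repertory elements
rng = np.random.default_rng(0)
repertory_features = rng.uniform(0, 1, size=(8, 2))
gesture = rng.uniform(-1, 1, size=256)
sound = resynthesize(map_to_repertory(extract_features(gesture), repertory_features))
```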

Figure 1.3: Anticipated instrument architecture (diagram: musician’s gesture (touch, movement, sound) → extraction of meaningful musical “features” → repertory of musical structures, toolset for transformation → resynthesis → sound)

Chapter 2

Background

As Jordà states in his articles on “Digital instruments and players”, “New instruments possibilities are endless. Anything can be done and many experiments are being carried out” [Jor04a]. The whole spectrum of electronic instruments is broad and includes many technological (e.g. electronics, sound synthesis and processing, computer programming) and human-related (psychology, ergonomics, etc.) fields.

As “musical devices can take varied forms, including interactive installations, digital musical instruments, and augmented instruments” [BFMW05], many attempts have been made to describe and classify all of those different approaches and their components. Birnbaum et al. propose a visual representation of the design space of varied musical performance and interaction systems (figure 2.1) with the dimensions of required expertise of a user to interact as intended with the system, the level of musical control a user exerts over the resulting musical output, feedback modalities as an indicator of the degree of feedback provided to a user, the degrees of freedom of input controls available, inter-actors representing the number of people involved in the musical interaction, distribution in space as a measure of the physical area of the interaction space, and finally the role of sound using Pressing’s [Pre97] categories of artistic/expressive, environmental or informational.

Figure 2.1: 7-axis dimension space of electronic instruments [BFMW05]

Various instruments and musical environments are listed and described in the works of e.g. Ruschkowski [Rus98], Roads [Roa96], Jordà [Jor05a] or Piringer [Pir01].

In this thesis no attempt is made to cover the whole range in detail, although all of the mentioned disciplines usually contribute to a more complex electronic music system. I’d rather want to focus on the specific aspects of gestural control, sound generation and feedback as they are found in traditional acoustic instruments. The explicit point of interest lies in the extrapolation of those concepts for the decoupled and abstracted domains of haptic control, sound generation and sensoric feedback as present with the computer-based music instruments, and their connection to the universe of sounds and structures that is possible with methods of digital sound synthesis.


2.1 Representation of sound and music

When we are talking about computer- or software-based music instruments, we are at the same time talking about a digital representation of sound that these instruments cope with. Figure 2.2 shows the technical steps that are needed to transform audible sound in the form of air pressure variations to such a digital representation inside the computer and back to the analog world, as described in more detail in [Roa96]: After transforming the air pressure to a voltage signal of proper strength, it must be treated by a lowpass filter before being converted to binary numbers by an analog-to-digital converter (ADC) driven by a sample clock. The rate of this sample clock (usually 44100 or 48000 Hz, i.e. ticks per second) determines how many digital measurement values of the analog signal are generated each second. Since frequency components of the incoming analog audio can be digitally represented correctly only up to half of the sample frequency (an effect of the sampling theorem as formulated by Harry Nyquist in 1928), the analog signal has to be lowpass-filtered before the conversion to the digital domain in order to eliminate components above this so-called Nyquist frequency. For high-fidelity audio applications the sampling frequency is chosen in a way so that the frequency range of human perception (about 20 Hz – 20000 Hz) lies below the Nyquist frequency and can therefore be fully represented digitally. Once the stream of digital numbers is stored in computer memory we can use software to analyze the audio for certain perception-related features or transform it with feasible algorithms, as is discussed in the following.

The path from the digital representation back to the analog world lets the stream of numbers taken from computer memory pass a digital-to-analog converter (DAC) which constructs a voltage signal at the given rate of the sample clock. Due to the reconstruction process this signal is jagged, which means that it contains a lot of unwanted frequency components above the Nyquist frequency. To eliminate this hiss the analog signal is smoothed by another lowpass (or reconstruction) filter before it is amplified and transduced by a loudspeaker to become audible.

Figure 2.2: From analog sound input to a digital representation and back to analog output [Roa96]

Various strategies of representing audio and its content as it is perceived by human listeners are presented in the following. What humans are actually perceiving, how the hearing system works and how computer-based methods can mimic these processes is a subject of the field of (computational) auditory scene analysis, covered by the definitive books of Bregman [Bre94] and Wang and Brown, eds. [WB06].
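The sampling-theorem argument can be checked numerically: a sinusoid above the Nyquist frequency becomes indistinguishable, once sampled, from a mirrored lower-frequency alias. This is a small numpy sketch with arbitrarily chosen example frequencies, not material from the thesis.

```python
import numpy as np

sr = 44100                     # sample clock rate (Hz)
nyquist = sr / 2               # highest correctly representable frequency
n = np.arange(1024)            # sample indices

f_high = 30000.0               # above Nyquist: will alias
f_alias = sr - f_high          # expected mirror frequency at 14100 Hz

x_high = np.sin(2 * np.pi * f_high * n / sr)
x_alias = np.sin(2 * np.pi * f_alias * n / sr)

# the sampled 30 kHz tone equals the mirrored 14.1 kHz tone (up to sign),
# which is why an analog lowpass filter is needed before the ADC
print(np.allclose(x_high, -x_alias, atol=1e-9))   # True
```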

2.1.1 Time-domain representation

Sound that we can perceive is equivalent to periodic changes of air pressure caused by some vibrating body. The most straightforward and often-used method of depicting sound waveforms is to draw these changes of air pressure as a function of time, as shown in figure 2.3. We can find this waveform-style visualization in all brands of sound editing programs, but also in interactive performance systems (e.g. in figure 1.1).

Figure 2.3: Time-domain representation of a waveform [Roa96]

With some practice it is in many cases possible to deduce some of the more important sonic characteristics from this form of representation: the volume, rhythmic features and also spectral content. On the other hand, the visual representation does not fully correspond to human perception. Even a strong sine tone can easily be visually buried in noise, or musical features at lower volume cannot be sufficiently resolved in the graphical display.

2.1.2 Time-frequency representation

Since “the auditory system functions as a spectrum analyzer, detecting the frequencies present in the incoming sound at every moment in time” [Ser89], it makes sense to represent musical structures on a two-dimensional plane spanned by the time and frequency axes.

A familiar example for this is traditional musical notation (e.g. standard Western sheet music). In figure 2.4 we can see three parallel systems of time-frequency notation with a logarithmic frequency scale (i.e. equally tempered tuning) and non-linearly progressing time. Especially for contemporary music, for which composers try to enrich the sonic experience by micro-tonal deviations from the chromatic scale or non-standard instrumental techniques, this standard form of notation can be difficult to handle and is often complemented by additional graphical symbols or extended forms of graphical notation diverging from the staves representation. Canonical and also more eccentric forms of these extensions have been compiled by Karkoschka [Kar66].

Figure 2.4: Anton Webern: Trio op. 20, 1. Satz, Takte 10/11 (from [Kar66])

For note-based electronic music this standard Western notation was adapted in the 1980s as a part of the MIDI (Musical Instrument Digital Interface) standard [Rot96], a control protocol for musical devices. It is still a widely-used standard, but hardly adequate to represent more complex or non-note-based musical structures [Moo88].

Mathematically, the transformation from the time domain to the frequency domain is done by a stage of analysis. The most widespread form of transformation uses a sinusoidal waveform as the basis – it is called Fourier analysis after its discoverer, the French mathematician and physicist Jean Baptiste Joseph Fourier (1768-1830). Fourier’s theory says that any complex periodic signal can be represented as a sum of simple sinusoidal signals, each with individual frequency, phase and amplitude. The term periodic implies that the signal is either explicitly repetitive at some time interval or that a repetition is implicitly imposed on the signal by a time window of the analysis. For digitally sampled signals only the discrete version of the Fourier analysis, the discrete Fourier transform (DFT)

X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2\pi k n / N}, \qquad k = 0, \ldots, N-1   (2.1)

is of relevance, and here only variations of the so-called fast Fourier transform (FFT) [CT65], an efficient algorithm for calculating the DFT, are actually used in present-day computer applications. The fact that physical signals are always real (instead of complex) can lead to further optimizations [Ber68].

In music systems the Fourier transformation is mostly used in the form of the short-time Fourier transform (STFT) [All77, AR77] where it has a broad range of applications, especially for sonograms, the phase vocoder or spectral modeling synthesis. The STFT imposes an analysis window upon the input signal which breaks the signal into a series of segments that are shaped in amplitude by the chosen window function. The window function is typically symmetric and bell-shaped. By taking each of these segments and subjecting it to a DFT (see figure 2.5), one obtains a series of spectra that constitutes a time-frequency representation of the input signal.

Figure 2.5: Overview of the short-time Fourier transform (from [Roa96])
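As a concrete illustration of equation (2.1) and the windowing step, the following sketch computes a magnitude STFT with plain numpy. Window type, window size and hop size are arbitrary example choices, not prescriptions from the text.

```python
import numpy as np

def stft(x: np.ndarray, n_fft: int = 1024, hop: int = 256) -> np.ndarray:
    """Short-time Fourier transform: window the signal, take the DFT of
    each segment, return an array of complex spectra (frames x bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.empty((n_frames, n_fft // 2 + 1), dtype=complex)
    for i in range(n_frames):
        segment = x[i * hop : i * hop + n_fft] * window
        frames[i] = np.fft.rfft(segment)     # DFT of the real-valued segment
    return frames

# test signal: one second of a 440 Hz sine sampled at 44.1 kHz
sr = 44100
t = np.arange(sr) / sr
spec = stft(np.sin(2 * np.pi * 440 * t))
magnitudes = np.abs(spec)                    # sonogram values (linear scale)
peak_bin = magnitudes[0].argmax()
print(peak_bin * sr / 1024)                  # ~430.7 Hz, the bin closest to 440 Hz
```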

The visualization of this representation using the time-frequency array of gray-scale- or color-coded amplitude values is called a sonogram – it is a convenient method to display digital audio material. In figure 2.6 the waveform and the sonogram representations are displayed side-by-side.

Figure 2.6: Typical sonogram (upper part) and waveform (lower part) representation of human voice.

While amplitude-related features like impulses are easier to track in the waveform view, tonal components are much easier to identify in the sonogram view. The sinusoidal harmonics of a specific tone (in this case of vowels) form a regular comb-like pattern of spectral peaks with a spacing that is proportional to the fundamental frequency. An important aspect of the DFT is the limited frequency resolution which is dependent on the number of samples N within the period (respectively, time window) under analysis. With a given sampling frequency fs the frequencies are spaced linearly at

f_k = k \frac{f_s}{N}, \qquad k = 0, \ldots, N/2   (2.2)

This linear spacing of the analysis frequencies does not correspond to human perception and the musical octave-oriented conventions, which are rather on a logarithmic scale. For an analysis window size of N = 1024 samples and a sampling rate of fs = 44100 Hz this gives a frequency resolution of ∆f ≈ 43 Hz. In order to e.g. separate tones in a lower octave that are a semitone apart, much higher resolutions are necessary, basically resulting in larger analysis time windows. On the other hand, the increased resolution would not be necessary for higher octaves. This problem is tackled by the so-called constant-Q transform where the ratio Q of center frequency to bandwidth of the analysis filters remains constant across the whole spectrum, resulting in a logarithmic frequency spacing (see figure 2.7). Constant-Q transforms can be implemented in a computationally efficient way based on the FFT [Bro91].

Figure 2.7: Analysis filter bands for (a) constant-Q and (b) DFT transformation (from [Roa96])
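The contrast between the linear bin spacing of equation (2.2) and constant-Q spacing can be made explicit in a few lines; the resolution of 24 bins per octave and the lower limit of 55 Hz are illustrative assumptions, not values from the text.

```python
import numpy as np

sr, n = 44100, 1024

# (2.2) linear DFT bin frequencies: equal spacing of sr/n everywhere
dft_freqs = np.arange(n // 2 + 1) * sr / n
print(dft_freqs[1] - dft_freqs[0])            # 43.066... Hz between all bins

# constant-Q center frequencies: geometric spacing, e.g. 24 bins per octave
bins_per_octave = 24
f_min, f_max = 55.0, sr / 2
n_bins = int(np.floor(bins_per_octave * np.log2(f_max / f_min)))
cq_freqs = f_min * 2 ** (np.arange(n_bins) / bins_per_octave)

# the ratio of center frequency to bandwidth (Q) stays constant, so the
# resolution is fine in low octaves and coarse in high ones
q = 1 / (2 ** (1 / bins_per_octave) - 1)
print(q, cq_freqs[:3], cq_freqs[-3:])
```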

2.1.3 Sinusoidals and residuum

Spectral modeling synthesis (SMS) as developed by Xavier Serra and described in every detail in [Ser89] presents a concept of how to decompose, transform and resynthesize perceptually relevant auditory spectral and temporal information in the form of time-varying sinusoidal and stochastic (noise) components. As shown in figure 2.9 the analysis part is based on the FFT (resp. STFT) combined with a peak detection mechanism to identify pronounced peaks (i.e. sinusoidal components) within the frequency spectrum. These are then tracked throughout consecutive analysis time frames of the STFT in order to construct continuing sinusoidal guides (e.g. fundamental and harmonics of an instrumental tone) along with onsets and terminations (see figure 2.8). By subtracting these tonal components from the spectrum (by means of a resynthesis step and subsequent STFT) the residual part of the signal is calculated, which consists of all noise-like components.

Figure 2.8: Illustration of the peak continuation process – spectral peaks from subsequent time frames are combined to form guides. [Ser89]

An envelope approximation step can follow to simplify the representation of this residual spectrum. Figure 2.10 shows spectra of the consecutive steps of the analysis process. From the resulting representation many other interesting high-level attributes can be calculated (see figure 2.11) which come very close to musical parameters in a traditional sense.
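The peak detection and continuation steps of the SMS analysis can be sketched as a local-maximum search plus a nearest-frequency matching rule. This is a strong simplification of the guide mechanism of [Ser89], intended only as an illustration; threshold and deviation limits are invented values.

```python
import numpy as np

def spectral_peaks(mag: np.ndarray, sr: int, n_fft: int, threshold: float = 0.01):
    """Return (frequency, magnitude) pairs of local maxima above a
    threshold in one (normalized) magnitude spectrum frame."""
    peaks = []
    for k in range(1, len(mag) - 1):
        if mag[k] > threshold and mag[k] > mag[k - 1] and mag[k] > mag[k + 1]:
            peaks.append((k * sr / n_fft, mag[k]))
    return peaks

def continue_guides(prev_peaks, new_peaks, max_dev=50.0):
    """Naive peak continuation: connect each previous peak to the closest
    new peak within max_dev Hz, otherwise terminate the guide (None)."""
    tracks = []
    for f_prev, _ in prev_peaks:
        near = [p for p in new_peaks if abs(p[0] - f_prev) < max_dev]
        tracks.append(min(near, key=lambda p: abs(p[0] - f_prev)) if near else None)
    return tracks

# toy usage: one windowed frame of a 440 Hz sine; prints the main-lobe peak
# (plus a few window sidelobes above the threshold)
sr, n_fft = 44100, 1024
t = np.arange(n_fft) / sr
frame = np.abs(np.fft.rfft(np.hanning(n_fft) * np.sin(2 * np.pi * 440 * t)))
print(spectral_peaks(frame / frame.max(), sr, n_fft))
```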

2.1.4 Acoustic quanta

As we have seen in section 2.1.2, within the concept of Fourier analysis frequency and time resolution are two complementary dimensions. In order to increase frequency resolution one has to lower temporal resolution (by making the analysis window larger). The physicist Dennis Gabor first published a number of articles [Gab46, Gab47, Gab52] with investigations on this principle and its implications, combined with practical acoustic experiments, motivated by his insight “though mathematically this [Fourier] theorem is beyond reproach, even experts could not at times conceal an uneasy feeling when it came to the physical interpretation of results obtained by the Fourier method” [Gab46], or as Risset states “Indeed it is paradox to try to express a sound with finite duration in terms of sounds with infinite duration, such as sine waves” [Ris91].

By drawing analogies to quantum mechanics, Gabor reformulated the Fourier theorem in a way so that a signal can be expressed in terms of both frequency and time, so that an arbitrarily complex sound can be represented as the superposition of elementary grains, also called logons, defined by time, frequency, phase and amplitude (figure 2.12). This is also known as the “particle theory of sound” – see Opie [Opi03] for the larger picture. In recent times a broader theory has been developed under the terms of wavelet analysis [KMGY97].

Figure 2.9: Block diagram of the analysis into deterministic and residual components [SS90]

Figure 2.10: Magnitude spectrum from a piano tone with identified peaks (top), residual spectrum (middle), residual approximation (bottom) (montage from [SS90])


Figure 2.11: Extraction of musical parameters from the analysis representation [Ser97]

Figure 2.12: Gaussian grain with sinusoidal frequency and phase as well as duration and volume defined by the envelope [Roa02]
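A single Gabor grain as depicted in figure 2.12, a sinusoid shaped by a Gaussian envelope, can be generated directly. Frequency, duration and the width of the Gaussian below are arbitrary example values, not taken from the thesis.

```python
import numpy as np

def gabor_grain(freq=440.0, dur=0.05, sr=44100, phase=0.0, amp=1.0):
    """One elementary grain (logon): a sine of given frequency and phase,
    shaped by a Gaussian envelope that defines duration and volume."""
    n = int(dur * sr)
    t = np.arange(n) / sr
    envelope = np.exp(-0.5 * ((t - dur / 2) / (dur / 8)) ** 2)
    return amp * envelope * np.sin(2 * np.pi * freq * t + phase)

grain = gabor_grain()
print(len(grain), grain.max())   # 2205 samples, peak amplitude close to 1.0
```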

Based on these principles, in [Xen92] Xenakis exposes the following basic hypothesis: “All sound is an integration of grains, of elementary sonic particles, of sonic quanta. [...] A complex sound may be imagined as a multi-colored firework in which each point of light appears and instantaneously disappears against a black sky. [...] If we consider the duration ∆t of the grain as quite small but invariable, we can ignore it in what follows and consider frequency and intensity only.” Thus, based on the Fletcher-Munson diagram of equal-loudness contours [FM33] and on results from information theory by Moles [Mol68], Xenakis constructs a two-dimensional array of elementary audible quanta which he calls a screen (figure 2.13). On such a screen the population by individual grains can be defined by a density function D(F,G).

Figure 2.13: The frequency-loudness area of human auditory perception is divided into a matrix of auditory grains (l.h.s) and subjected to a coordinate transformation, yielding a rectangular screen of sonic quanta (r.h.s) [Xen92]

In a further step, in his theory of “Markovian stochastic music”, Xenakis adds the dimension of time by introducing the book of screens (figure 2.14) for the description of evolving sound. Vector fields and probability distributions are used to model the positions and movements of the multitude of grains in a time-frequency-loudness space.
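Xenakis’ screen can be read as a grid over frequency and loudness with a grain density per cell. The following sketch populates one screen from such a density function D(F, G); the grid size, the band mapping and the density values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# a screen: frequency (F) x loudness (G) grid with a grain density per cell
F_cells, G_cells = 16, 8
D = rng.uniform(0, 4, size=(F_cells, G_cells))     # density function D(F, G)

grains = []
for f_cell in range(F_cells):
    for g_cell in range(G_cells):
        count = rng.poisson(D[f_cell, g_cell])      # number of grains in this cell
        for _ in range(count):
            freq = 55.0 * 2 ** (f_cell / 2)         # half-octave bands (example)
            loudness = g_cell / (G_cells - 1)       # normalized 0..1
            grains.append((freq, loudness))

print(len(grains), "grains on this screen")
# a 'book of screens' would simply be a sequence of such grain sets over time
```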

Figure 2.14: A book of screens equals the life of a complex sound (from [Xen92])

2.1.5 Sound objects and spectromorphologies

There is a lot of ground to cover from the almost ‘haptic’ elementary unit of traditional music theory – the classical note played by an instrument – to the somewhat cursory, stochastic stream-based description of grain fields by Xenakis. In order to extend the concept of the note, Schaeffer and others coined the notion of the ‘objet sonore’ or sound object [Sch66, Chi83] which is discernable as a coherent whole and temporally closed, but which may encompass complex internal developments in time, frequency and timbre, as depicted in the schematic representation in figure 2.15. We can see the relation to both the concept of the note, in that the object has a kind of closed shape, and the book of screens, in that the sound evolves throughout the continuous dimensions of time and spectrum.

Figure 2.15: A complex sound-object moving in the continuum (l.h.s). Frequency/timbre cross-section of sound at start, mid-point and end (r.h.s) [Wis96].

According to Schaeffer the sound object is defined by what can be heard as such in the process of reduced hearing; therefore the perception of the listener is the ultimate measure. The multitude of possible sound objects in nature and sonic arts caused Schaeffer to compile a collection of sounds and classify them in a morphology [SR67].

Departing from this system Denis Smalley constructed an extensive framework to describe sound, sources of sound and listening experience: “Spectromorphology is an approach to sound materials and musical structures which concentrates on the spectrum of available pitches and their shaping in time” [Sma86]. Smalley’s system is based on metaphors to develop a common terminology for electro-acoustic music. A central aspect is the insight that “in spectro-morphological music there is no consistent low-level unit, and therefore no simple equivalent of the structure of tonal music with its neat hierarchical stratification from the note upwards, through motif, phrase, period, until we encompass the whole work”. Instead, the terms gesture and texture represent two fundamental structuring strategies that are defined by how energy is injected, carried and transformed. How these structural components are in relationship with each other (structural relationships), which roles they play in temporal arrangements (structural functions, see figure 2.16), forms of gesture surrogacy and spatio-morphological considerations are additional points of focus.

Figure 2.16: Structural functions in the theory of spectromorphology [Sma86]

2.1.6 Representation based on perceptive similarity

Apart from classifications that are based on the perception of a human listener it is also possible to use computer-based sound and structure analysis as a basis of classification and visualization. “Islands of Music” by Pampalk et al. is a project to organize sample-based sound snippets on a kind of geographic map by use of similarity measures among those sounds. “The main challenge is to calculate an estimation for the perceived similarity of two pieces of music. [...] The raw audio signals are preprocessed in order to obtain a time-invariant representation of the perceived characteristics following psychoacoustic models [...] To cluster and organize the pieces on a 2-dimensional map display we use the Self-Organizing Map, a prominent unsupervised neural network. This results in a map where similar pieces of music are grouped together” [PRM02].

Figure 2.17: The visualization of a music collection of 359 pieces of music. The numbers label regions where distinct musical characteristics have grouped together. [PRM02]

The important point is that the organization on the two-dimensional map (see figure 2.17) is accomplished solely by defining an appropriate distance measure between two sounds. This measure can be based on a variety of audio features like characteristics of rhythm (as in [PRM02]) or timbre (as in [PHH04]) as long as the principles of human perception (frequency characteristics, critical bands, masking etc. [FZ06]) are considered. The actual arrangement by training the neural network is then performed quasi-automatically by iterating through an algorithm [Koh88, KSH01].

One limitation is that these concepts only work for temporally pre-segmented sound snippets. The ability to identify “sound objects” within complex musical structures is a task that is still largely reserved to human listeners, although this is subject of ongoing research (for a survey of relevant techniques see e.g. [Bro06, WB06]).
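The map organization of [PRM02] relies on a self-organizing map trained on audio feature vectors. A miniature SOM fits in a few lines of numpy; the random “feature” vectors, the map size and the training schedule below are placeholders, not the actual Islands of Music setup.

```python
import numpy as np

def train_som(features: np.ndarray, width=14, height=10, epochs=50, seed=0):
    """Train a tiny self-organizing map: each map unit holds a prototype
    feature vector; similar input vectors end up on neighboring units."""
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    units = rng.uniform(features.min(), features.max(), size=(width, height, dim))
    coords = np.dstack(np.meshgrid(np.arange(width), np.arange(height), indexing="ij"))
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                    # shrinking learning rate
        radius = max(1.0, width / 2 * (1 - epoch / epochs))
        for x in features:
            dist = np.linalg.norm(units - x, axis=2)
            best = np.unravel_index(np.argmin(dist), dist.shape)  # best matching unit
            grid_dist = np.linalg.norm(coords - np.array(best), axis=2)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            units += lr * influence[..., None] * (x - units)
    return units

# toy usage: 359 random 'feature' vectors of dimension 20 (placeholder data)
features = np.random.default_rng(2).normal(size=(359, 20))
som = train_som(features)
print(som.shape)    # (14, 10, 20): a 14 x 10 map of prototype vectors
```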


2.2 Sound synthesis

Following Piccialli [PPR91] and Bilbao [Bil08] there are two distinct approaches on which digital sound synthesis can be based: nonparametric or abstract methods, which generally ignore traditional (resp. natural) sound production mechanisms, and parametric methods, which are predicated on certain (e.g. mechanical) models of sound production. Since for the latter a-priori information in the form of an underlying model (e.g. of the vocal tract, a resonant string or an instrument body) is already given, the actual number of sound-defining coefficients is usually much lower than for nonparametric models. On the other hand, it is precisely the absence of a given nature-mimicking model that can be interesting for compositional work, allowing for completely hypothetical sounds and structures.

In any case, for complex sound there is always a trade-off between the complexity of the synthesis model and the complexity of the parameter set used to drive it. With a fixed model like a traditional music instrument only a few notes can be sufficient to render the sound. With a generic synthesis algorithm much more information is needed for the definition of a specific sound: “[...] there is a continuing problem of finding the proper control data to drive the synthesis engine. One of the challenges of synthesis is how to imagine and convey to the machine the parameters of the sounds we want to produce” [Roa96].

From the broad spectrum of synthesis methods (covered in textbooks like [Roa96, DJ97, Bou00, Coo02, Bil08]) I will only briefly present a selection of those that are generic in the sense that instrument-like, natural and all kinds of abstract sounds should be possible with high quality.

2.2.1 Sample-based synthesis

The term sampling in digital music refers to the act of taking measurements (i.e. samples) of an analog signal at regular intervals – this is what an analog-to-digital (AD) converter does, as discussed in section 2.1. The manipulation of recorded sounds goes back to tape recording, developed in the 1930s [Cha97], which allowed cutting and splicing the tape and therefore editing, rearranging and repeating sequences of recorded sounds. The advent of digital electronics has facilitated this handling of concrete sounds so far that today they are no longer bound to a sequential physical carrier but rather kept in various flavors of magnetic or phase-change-based random-access memory storage. The techniques of tape manipulation like looping, pitch-shifting and mixing have been emulated in the digital domain and been made available within convenient multi-track audio editing software environments. An overview of the historic development from the early tape-centered experiments over commercial keyboard-based analog and digital samplers to the current software-based virtual instrument plug-ins can be found in [Roa96].

Despite the omnipresence of sampling both as a technique of digital audio and as a stylistic term (see [Rod03]), today there is still scope for development when it comes to experimental usage. Granular synthesis as developed as a computer-based method by Curtis Roads [Roa02] combines the conceptually important (and sampling-inherent) notion of source-bonding [Sma97] with the flexibility of stochastic or algorithmic collage or spraying techniques. Granular synthesis is based on the Gabor grain (from section 2.1.4) or variations of it as the elementary sound carrier: “A single grain serves as a building block for sound objects. By combining thousands of grains over time, we can create animated sonic atmospheres. The grain is an apt representation of musical sound because it captures two perceptual dimensions: time-domain information (starting time, duration, envelope shape) and frequency-domain information (the pitch of the waveform within the grain and the spectrum of the grain)” [Roa02]. Clearly, as stated by Roads, “granular synthesis requires a massive amount of control data”, since there are several parameters that can be specified on a grain-by-grain basis like the duration, frequency (by pitch-shifting), spatial location and some more. This is why generator mechanisms are used to calculate the various parameters according to predefined rules and/or stochastic distributions, as shown in figure 2.18.

The traditional strategies of tape-based editing can also be formulated by use of algorithms and applied within granular synthesis, even in real-time on streams of grains. Figure 2.19 shows some variations on temporal restructuring (time granulation). For the application of granular synthesis within the framework of a live performance instrument and detailed theoretical and historical coverage of granular techniques see Opie [Opi03].

Figure 2.18: Granular synthesis clouds in time-frequency representation: (a) Cumulus, (b) Glissandi, (c) Stratus. Each small circle stands for an elementary grain. [Roa96]

Figure 2.19: Three approaches to time granulation from one or more grain sources. [Roa96]
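
To give a concrete impression of the amount and kind of control data involved, the following Python sketch generates a simple grain cloud by windowing short excerpts of a source sample and summing them at stochastically chosen onsets, durations and transpositions. It is only a minimal illustration of the principle described above; the function and parameter names are invented for this example and do not refer to any cited implementation.

    import numpy as np

    def grain_cloud(source, sr=44100, length=2.0, density=200,
                    dur_range=(0.02, 0.08), pitch_range=(0.5, 2.0), seed=0):
        """Sum randomly parameterized grains cut from 'source' (a mono float array)."""
        rng = np.random.default_rng(seed)
        out = np.zeros(int(length * sr))
        for _ in range(int(density * length)):
            dur = rng.uniform(*dur_range)              # grain duration in seconds
            rate = rng.uniform(*pitch_range)           # playback rate = transposition
            onset = rng.uniform(0.0, length - dur)     # position of the grain in the output
            src_pos = rng.uniform(0.0, len(source) / sr - dur * rate)  # read position
            n = int(dur * sr)
            # pitch-shift by resampling a short excerpt (linear interpolation)
            idx = src_pos * sr + np.arange(n) * rate
            i0 = np.clip(idx.astype(int), 0, len(source) - 2)
            frac = idx - i0
            grain = (1 - frac) * source[i0] + frac * source[i0 + 1]
            grain *= np.hanning(n)                     # Gabor-like grain envelope
            start = int(onset * sr)
            out[start:start + n] += grain
        return out / max(1.0, np.abs(out).max())

Even this toy version already exposes per-grain duration, transposition, source position and onset time; a real granular instrument adds envelope shape, spatialization and more, which is exactly why higher-level generator mechanisms and stochastic distributions are needed.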

2.2.2 Spectral modeling synthesis

The analysis part of spectral modeling synthesis (SMS) has already been discussed in section 2.1.3. In fact the analysis and synthesis of deterministic (sinusoidal) and residual (stochastic) components within this framework are tightly interconnected, as the calculation of the residual part requires an intermediate synthesis of the sinusoidal components. In any case, the synthesis part alone (as shown in figure 2.20) can be used either to resynthesize previously analyzed audio components or to synthesize artificially (e.g. algorithmically) constructed descriptions of tonal components and residual envelopes. Although not strictly part of the actual audio synthesis process, a number of transformations can be performed on the frequency and magnitude coefficients of both the sinusoidal and residual components before transforming back to the time domain. Due to the easily manageable representation of audio by these components, very powerful and yet straightforward transformations are possible, like pitch-shifting, dynamic processing, articulation modeling etc. (see e.g. [ABLS01, SB98]), although “the perceptual link between the deterministic and stochastic parts is delicate; editing the two parts separately may lead to a loss of perceived fusion between them” [Roa96].

Figure 2.20: Block diagram of the synthesis part of SMS [SS90]

The additive synthesis of the sinusoidal components can be done by using an oscillator bank, which basically consists of a number of parallel samplers that synthesize amplitude- and frequency-varying sinusoids from a single period of a pre-sampled sine wave. The residual synthesis is performed by an overlap-add stage where spectral frames are first individually transformed to the time domain by an inverse Fourier transformation (IFFT), overlapped to match the correct temporal position of each frame and summed up (possibly after multiplication with a normalizing reconstruction window), as shown in figure 2.21.

Figure 2.21: Overlap-add resynthesis. The gray areas indicate overlapping spectrum frames. [Roa96]
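
As an illustration of the oscillator-bank part only, the following Python fragment resynthesizes a set of sinusoidal tracks by interpolating per-frame amplitudes and frequencies between analysis frames; the residual part would then be generated by the IFFT/overlap-add stage of figure 2.21 and added to this output. This is a simplified sketch under assumed frame-wise data, not the SMS reference implementation.

    import numpy as np

    def synth_partials(freqs, amps, hop, sr=44100):
        """Additive resynthesis of sinusoidal tracks.
        freqs, amps: arrays of shape (n_frames, n_partials), one value per analysis frame;
        hop: hop size between frames in samples."""
        n_frames, n_partials = freqs.shape
        out = np.zeros((n_frames - 1) * hop)
        phase = np.zeros(n_partials)
        ramp = (np.arange(hop) / hop)[:, None]        # interpolation ramp within one hop
        for k in range(n_frames - 1):
            # linearly interpolate amplitude and frequency from frame k to frame k+1
            f = freqs[k] + (freqs[k + 1] - freqs[k]) * ramp
            a = amps[k] + (amps[k + 1] - amps[k]) * ramp
            # integrate the instantaneous frequency to obtain continuous phases
            ph = phase + np.cumsum(2 * np.pi * f / sr, axis=0)
            out[k * hop:(k + 1) * hop] = np.sum(a * np.sin(ph), axis=1)
            phase = ph[-1] % (2 * np.pi)              # carry the phase into the next frame
        return out

    # e.g. an approximately two-second harmonic tone gliding upwards from 220 Hz:
    # partials = np.arange(1, 9)
    # freqs = np.linspace(220, 330, 200)[:, None] * partials
    # amps = np.full((200, len(partials)), 0.1) / partials
    # tone = synth_partials(freqs, amps, hop=441)

In a real SMS synthesizer the sine function would typically be replaced by table lookup from a pre-sampled sine period, as described above.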

2.2.3 Physical modeling

As the name already suggests, physical modeling uses physical descriptions of music instruments as the basis for sound-synthesizing algorithms. Citing Bilbao [Bil08]: “for most musical instruments, this will be a coupled set of partial differential equations, describing, e.g., the displacement of a string, or membrane, bar, plate, or the motion of the air in a tube, etc. The idea, then, is to solve the set of equations, invariably through a numerical approximation, to yield an output waveform, subject to some input excitation (such as glottal vibration, bow or blowing pressure, etc.). [...] The control parameters, for a physical modeling sound synthesis algorithm are typically few in number, and physically and intuitively meaningful, such as material properties, instrument geometry, and input forces and pressures”. Since simulating the physical behavior of an often rather complex body involves hundreds or thousands of arithmetic operations at audio sample rate, physical modeling is computationally costly. Its strengths – compared to the abstract synthesis methods discussed above – lie in the fact that the instrument geometry, the form or intensity of excitation and many more parameters can be easily adapted, very much like an instrument builder or a player of an instrument would do. It is beyond the scope of this thesis to cover this broad field – a review of the various techniques, most notably lumped mass-spring networks, digital waveguides and modal synthesis, is given in the books and articles of Florens [FC91], Smith [III96], De Poli [PR98] and Bilbao [Bil08].
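
To convey the flavor of such models with a minimum of code, the following Python sketch implements a plucked string in the spirit of the digital waveguide (Karplus-Strong) family: a delay line whose length determines the pitch, with a simple averaging low-pass in the feedback path standing in for frequency-dependent losses. The parameter names are illustrative assumptions, and the model is far simpler than the techniques referenced above.

    import numpy as np

    def plucked_string(freq=220.0, sr=44100, seconds=2.0, damping=0.996):
        """Minimal waveguide-style string: a noise burst circulating in a lossy delay line."""
        n = int(sr / freq)                        # delay-line length sets the fundamental
        line = np.random.uniform(-1.0, 1.0, n)    # the "pluck": fill the line with noise
        out = np.empty(int(sr * seconds))
        idx = 0
        for i in range(len(out)):
            cur = line[idx]
            nxt = line[(idx + 1) % n]
            out[i] = cur
            # two-point average acts as a low-pass loss filter inside the feedback loop
            line[idx] = damping * 0.5 * (cur + nxt)
            idx = (idx + 1) % n
        return out

Already in this toy model the controls are physically meaningful: the delay length corresponds to string length (pitch), the damping factor to material losses, and the initial buffer content to the excitation.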

2.3 Playing an (electronic) music instrument

The comment of a user “I don’t feel like I’m playing a digital instrument so much as operating it” [MM07] makes us want to ask which qualities we expect from a decent music instrument (as opposed to a toy) in general, and what the implications for makers of digital instruments would be. Jordà calls on researchers and developers to design and build instruments that can appeal to professional musicians as well as to complete novices; efficient instruments that, like many traditional ones, can offer a low entry fee with no ceiling on virtuosity; systems in which basic principles of operation are easy to deduce, while, at the same time, sophisticated expressions are possible and mastery is progressively attainable [Jor03].

Thus, a music instrument system should be self-explanatory to some extent while at the same time revealing all of its possibilities only gradually, as familiarity with the instrument’s concept increases. Therefore learning (or practice) plays an important role in the relation between player and instrument. Jordà [Jor04a] elaborates on the concept of the learning curve in order to describe this relationship, how it starts and evolves. Figure 2.22 compares the approximate learning curves of four different acoustic instruments.

Figure 2.22: Approximate learning curve for the kazoo, kalimba, piano and violin, within a period of 10 years. [Jor04a]

We can see that the kazoo and the kalimba both have very steep learning curves in the beginning which then saturate quickly within only the first one or two years. There is not much more to be learned about those instruments. On the other hand there are the piano and the violin. Obviously the learning curve of the piano is steeper (that is, more efficient) than that of the violin, but players of both instruments do not run out of new challenges and rewards for many years. Based on findings of other authors from the fields of engineering and human-computer interaction, Jordà estimates the efficiency of a music instrument according to the ratio

InstrumentEfficiency = (MusicalOutputComplexity × PerformerFreedom) / ControlInputComplexity    (2.3)

with

PerformerFreedom = PerformerFreedomOfMovement × PerformerFreedomOfChoice    (2.4)

The meaning of the various variables is intuitively clear. The MusicalOutputComplexity describes the richness of sounds and structures on microscopic and macroscopic scales, while the ControlInputComplexity depends on the degrees of freedom of control and their correlation, as well as on how these controls influence the musical behavior (i.e. the mapping of parameters). The InstrumentEfficiency is also positively influenced by the PerformerFreedom, which includes all the even and odd ways to handle an instrument, in terms of possibilities and expressiveness (PerformerFreedomOfMovement), as well as whether the instrumentalist can choose which kind (or style) of music she or he wants to produce (PerformerFreedomOfChoice):

“A good instrument should not be able to produce only good music! A good instrument should also be able to produce ‘terribly bad’ music, either at the player’s will or at the player’s misuse.” (Jordà [Jor04a])

Other important aspects mentioned by Jordà include variability, reproducibility, linearity and confidence. Without going into detail, variability (also flexibility or diversity) and reproducibility are recognized as canonical quality features of traditional music instruments which seem to be key for virtuosity. For new digital instruments the importance of reproducibility is less obvious as, following [Jor04b] citing Chadabe, “the essence of intelligent instruments is precisely their ability to surprise and provoke the performer”. Practically all acoustic instruments also incorporate a certain amount of non-linearity or randomness (e.g. noise) which inhibits absolute predictability. Clearly, such chaotic behavior

...should not inhibit the performer from being able to predict the outputs related with small control changes, which seems necessary for the development of a finely tuned skill and an expressive control. [...] Non-linearity should not mean uncontrol. In a good performance, the performer needs to know and trust the instrument, and be able to push it to the extremes, to bring it back and forth of these non-linearity zones. (Jordà [Jor04b])

It can be argued that it is exactly this approaching, or even exceeding, of the limits of control that is responsible for the tension and power in many a performance or style of music.

2.3.1 Man-machine interaction

When talking about electronic music one need not necessarily associate the music instrument with an actual physical instrument body, as is the case with traditional acoustic instruments. With the virtualization of digital sound processing by use of software algorithms rather than electronic hardware, this disembodiment has led to a situation where one can practically independently choose the parts of the instrument machine that perform the sound processing (analysis, transformation and synthesis) and those which provide the physical interfaces to the human player; all of these can be quite heterogeneous and linked together by means of wired or wireless communication. On the other hand it can still be advantageous to rely on the assortment of ready-made consumer hardware (especially laptop computers) because of lower cost and immediate availability. What seems essential is the principle of “Designing interaction, not interfaces” [BL04] in the first place. The three interaction paradigms discussed by Beaudouin-Lafon [BL04] are:

computer-as-tool paradigm: extends human capabilities through a (very sophisticated) tool. Direct manipulation (interaction through metaphoric objects with specific actions attached to them [Shn83]) and WIMP (i.e. ‘Window Icon Menu Pointer’, the classical computer desktop) interfaces fall into this category.

computer-as-partner paradigm: embodies anthropomorphic means of communication in the computer, such as natural language or human gestures. Agent-based interaction and speech-based interfaces fall into this category.

computer-as-medium paradigm: uses the computer as a medium by which humans communicate with each other.

Obviously, the application of the computer as a musical instrument is covered predominantly by the computer-as-partner paradigm, which is in turn connected to research on artificial intelligence. When relying on the actual computer body as an interface (as advocated in [FWC07]) the computer-as-tool paradigm is of equal importance and “requires understanding of both computer algorithms and human perception-and-action in order to harness the power of the user+computer ensemble” [BL04]. In this context the fundamental question seems to be “how best to map from performer intention to sonic output” [HK00]. When there is user interaction with (in contrast to algorithmic control of) the audio synthesis, this “problem often reduces to how best to map from some set of continuous or discrete control signals to a set of synthesis parameters that provide control over the signal processing algorithms” [HC07]. In the case of simple audio synthesis algorithms it might be possible to establish a direct connection from an interface element (such as a knob or a fader) to a synthesis parameter (such as frequency or volume), as shown in figure 2.23.


Figure 2.23: A performer can control a few synthesis parameters directly, continuously making adjustments to make perception match intention. [HC07]


However, for more complex synthesis systems like the ones discussed in section 2.2 there are in most cases too many parameters to explicitly control in real-time. In this more realistic situation (figure 2.24) an intervening mapping layer is used to translate the human control efforts into a meaningful parameter set for the employed synthesis method. This mapping aspect is examined in more detail in section 2.3.3.

Figure 2.24: Alternatively, a performer can control many parameters using only a few control signals by means of an intervening mapping layer. [HC07]

Roads states that “in the early years of computer music, the user interface was an afterthought – the last part of a system to be designed. Today the importance of good user interface design is well recognized” [Roa96]. He describes the user interface, particularly the musician’s interface, as a “combination of hardware and software that determines how information is presented to and gathered from the user” [Roa96]. Following Roads, information can be conveyed to the musician through several senses:

Sight: via displays on graphics terminals, display panels, meters, and indicator lights, as well as switch and fader settings

Touch: via settings of knobs, switches, potentiometers, and the ‘play’ and ‘feel’ of the feedback from responsive input devices

Hearing: via sounds and silences generated by the system

It is pretty clear that the given listing of touch-sensitive input devices reflects the selection of devices available in 1996. Today, much more interesting possibilities of gesture control are available, as discussed in section 2.3.2. In addition to the schemes of figures 2.23 and 2.24, Roads adds augmentation by the visual sense, both as feedback of the instrument’s state and as a control instance for the input devices. Piringer [Pir01] presents a classification of existing electronic music instruments described by five parameters:

Immersion: Describes to what extent a player is connected to the instrument – by her or his gestures as controlling instances and by the feedback given by the music system.

Genericness: The level of flexibility of a musical instrument – can it (like some traditional acoustic instruments or a simple oscillator) only produce one sound character, or can it (like an analog synthesizer or a CD player) produce a wide range of musical output?

Intelligence/Mapping: Does the employed input-to-output mapping use a direct connection or is there some more complex translation applied?

Feedback: Which individual senses (auditory, visual, tactile) or combinations are incorporated into feedback from the musical system to the player?

Expressivity: Describes to what extent an instrument can be controlled, which is linked to the breadth of musical expression and the richness of detail.

Of the multitude of interactive music systems analyzed by Piringer in 2001, only a vanishingly small percentage has high expressivity or involves feedback comprising all three senses according to the listed criteria.

2.3.2 Sensoric input

At the time of this writing there exists an overwhelming variety of sensoric input devices (see e.g. [Pen85, Roa96, Pir01, Jor05a]) and new ones are presented at short intervals, e.g. at every NIME2 conference. At the same time there is a lot of informal do-it-yourself activity going on, employing hacked electronics or cheap sensor boards like the Arduino3. See [VSM01, BA05, Col96] for a survey of various lab developments.

2New Interfaces for Musical Expression http://www.nime.org, last visited 11.10.2008
3Arduino http://www.arduino.cc, last visited 18.10.2008

In the referenced literature there are already many attempts at classifying the various approaches to gesture capturing, which can be represented as multidimensional spaces [Pir01, BFMW05] such as the one shown in figure 2.1. All kinds of physical parameters and technologies are used for detecting and quantifying human gestures, with the common limitation that, as Levin states, “any digitized representation of human movement is necessarily lossy, as it reduces both the fidelity and dimensionality of the represented gesture” [Lev00]. While the mentioned fidelity is related to expressivity and should simply be as high as possible (as it is always easily requantizable or scalable in terms of range or curvilinearity), the choice of proper dimensions to pick up is an important design decision which is related to the general architecture of the instrument and to the level of immersion aimed at. In addition to the classification criteria already mentioned we can introduce another one, namely the separability of output parameters. There is a fundamental difference between the output of a classical MIDI control box with knobs and switches (figure 2.25) and the video camera used in Rokeby’s famous ‘Very Nervous System’ [Rok86] (figure 2.26): while output data from the former can be directly used to control synthesis parameters (like volume or frequency), this can’t be done in the case of the latter, where a video camera captures a dancer’s movements on a pixel-by-pixel basis. There, subsequent processing and intelligent gesture mapping (see section 2.3.3) is indispensable for meaningful results.

Figure 2.25: Typical MIDI controller ‘Doepfer Pocket Control’ [doe08]

While these examples are at the extremes of the scale, there are many other sensor devices covering the whole range.

Figure 2.26: Source image and motion as seen by the ‘Very Nervous System’ [Rok86]

Many computer peripherals or gaming devices like joysticks, mice, gamepads, graphics tablets4 or multi-axis controllers5 have some dimensions of control that are grouped together (like the x-y movement of a mouse), while others (like the mouse wheel) can be used independently of the others. For interfaces like the Overholt MATRIX [Ove01] (figure 2.27) independence of dimensions is useful only to a limited extent. While the rods of its 3D surface can be moved independently from each other, it is clear that the more interesting applications (and the strength of the whole sensor system) lie in gesture patterns of groups of rods that are moved jointly. The same is the case for data glove interfaces like Waisvisz’s Hands [DEA05] or Sonami’s Lady’s glove6 (figure 2.28), and also for Nakra’s conductor’s jacket [Nak01]. These sensor systems (containing a number of individual sensors) are built to capture extensive gestures rather than individual adjustments of a single control parameter. Multi-touch interfaces like the FingerWorks iGesture pad7, the various Tactex multi-touch interfaces8 (e.g. used for the control of tabla synthesis in [JS07]), the Apple iPhone or iPod touch9 or Han’s projector- and camera-based screens [Han05, DH06, WAFW07] fall into this category. Tangible acoustic interfaces [CP05] represent yet another very interesting category of input devices.

4e.g. WACOM graphics tablets http://www.wacom.com, last visited 15.10.2008
5e.g. 3DConnexion SpaceNavigator http://www.3dconnexion.com, last visited 15.10.2008
6Laetitia Sonami, lady’s glove http://www.sonami.net/lady_glove2.htm, last visited 15.10.2008
7Fingerworks http://www.fingerworks.com, last visited 15.10.2008
8Tactex http://www.tactex.com, last visited 15.10.2008
9Apple http://www.apple.com, last visited 15.10.2008


Figure 2.27: User interacting with the MATRIX [Ove01]

Figure 2.28: Laetitia Sonami’s lady’s glove.
By using contact microphones and advanced mathematical methods for tracking human touch on the surfaces of almost any kind of object, many new possibilities open up. There are also interfaces that consist of spatially separated units which can be sensitive to interaction with each other, like Weinberg and Gan’s pressure-sensitive SQUeeZABLEs [Gan98, Wei99] (figure 2.29) or Schiettecatte’s intercommunicating audiocubes [SV08].

Figure 2.29: The foam ball cluster as the SQUeeZABLE [Wei99]

Since for these more complex sensoric systems gesture mapping and controlled sensoric feedback become an indispensable part of the whole, it is in most cases impossible to separate the sensoric hardware from the processing software, as both are parts of a larger holistic concept. This can also be clearly seen in the example of the reacTable [JGAK07] (see figure 2.36 below) and the following statement of its main developer:

If we believe that new musical instruments can be partially responsible for shaping some of the music of the future, a parallel design between controllers and generators is needed with both design processes treated as a whole. Possibilities are so vast, exciting and possibly too idiosyncratic, that it would be foolish to imprison ourselves with controllers-generators and mapping toolkits. (Jordà [Jor05b])

2.3.3 Gesture mapping

The term mapping stands for the process of translating sensoric data from a player’s gestures into parameter values as understood by the actual synthesis mechanism employed within an electronic instrument. Ideally, by this process, the intention (resp. the gestural energy) of the player is translated into a closely corresponding musical experience.

“The mapping ‘layer’ has never needed to be addressed directly before, as it has been inherently present in acoustic instruments courtesy of natural physical phenomena. Now that we have the ability to design instruments with separable controllers and sound sources, we need to explicitly design the connection between the two. This is turning out to be a non-trivial task.” [HWP02]

In the meantime a huge body of literature has accumulated dealing with all conceivable aspects of mapping – a detailed survey of the existing literature on mapping is presented e.g. in [CW00, Wan01]. The overall importance of mapping as an instrument-defining component has been researched and underlined by Hunt et al. [HWP02]: “[...] the psychological and emotional response elicited from the performer is determined to a great degree by the mapping”. On the whole, two different approaches are under discussion:

• Implementation of a mapping strategy within a single layer, basically translating N sensoric parameters to M synthesis parameters [BPMB90]. “In this case, a change of either the gestural controller or synthesis algorithm would mean the definition of a different mapping” [Wan02]

• Implementation of mapping in the form of two (or more) independent layers: “A mapping of control variables to intermediate parameters and a mapping of intermediate parameters to synthesis variables” [WSR98], as seen in figure 2.30 and sketched in the code example below.
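
A toy Python sketch of the layered approach might look as follows; the abstract parameters (‘brightness’, ‘density’, ‘loudness’) and the synthesis parameters of the hypothetical granular engine are invented for illustration and do not refer to any particular system. The point is that either side of the intermediate layer can be exchanged without touching the other.

    # Layer 1: raw controller data -> abstract (intermediate) parameters
    def controller_to_abstract(x, y, pressure):
        """Map controller values (all assumed in the range 0..1) to perceptual parameters."""
        return {
            "brightness": 0.3 + 0.7 * y,      # vertical position -> spectral brightness
            "density": x ** 2,                # horizontal position -> event density (non-linear)
            "loudness": pressure,
        }

    # Layer 2: abstract parameters -> concrete synthesis parameters
    def abstract_to_synthesis(p):
        """Expand the intermediate parameters for a hypothetical granular engine."""
        return {
            "grain_rate_hz": 5.0 + 195.0 * p["density"],
            "grain_dur_s": 0.1 - 0.08 * p["density"],
            "filter_cutoff_hz": 200.0 * 2.0 ** (6.0 * p["brightness"]),
            "amplitude": p["loudness"] ** 1.5,
        }

    # synth_params = abstract_to_synthesis(controller_to_abstract(x=0.5, y=0.8, pressure=0.6))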

As discussed in the previous section, mapping strategies are often strongly connected to the architecture of an instrument and fine-tuned to fulfill specific needs within that setting. Various types of fixed mappings are discussed in [WWS02] with a set of ‘useful metaphors’ that can be used for musical interfaces, such as scrubbing, drag and drop, catch and throw or dipping.

Figure 2.30: ESCHER system overview [WSR98]

In order to make things more easily adjustable to changing instrument setups or gesture repertoires, several approaches to adaptive mappings have been researched and implemented, e.g. using a neural network to interface between a data-glove and a speech synthesizer [FH93], or more generic gesture recognition [CCH04], or in order to relate timbre to natural language descriptions [JG06]. The quasi-automated generation of mappings, achieved by adaptively optimizing synthesis parameters towards a set of perception-oriented audio features, is presented by Hoffman et al. [HC07], while Mandelis applies evolutionary techniques in [Man02]. A geometric approach to mapping by use of Gaussian mixture models (GMM) has been researched by Momeni et al. [MW03]; figure 2.31 shows the interface and schematics of this software implementation.
The employed representation can be used to access and transform high-dimensional parameter spaces in an intuitive manner. Bencina [Ben05] presents another geometric approach, using natural neighbor interpolation based on Voronoi tessellation10 to smoothly interpolate between various mapping schemes, as shown in figure 2.32. Additional articles on geometric considerations are by Van Nort et al. [NWD04, NW07], exploring the interrelation between mathematical representations and the organization of sonic spaces.

10Wolfram Research, Voronoi diagram http://mathworld.wolfram.com/VoronoiDiagram.html, last visited 11.10.2008
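
The basic mechanism of such geometric mapping schemes can be conveyed in a few lines of Python: stored parameter presets are anchored at points in a two-dimensional plane, and a cursor position produces a normalized weighted mix, here with Gaussian kernels in the spirit of the approach of Momeni et al. [MW03]. This is only a sketch; the kernel width and the example data are arbitrary assumptions, and Bencina's natural neighbor scheme would replace the weighting function by one derived from the Voronoi tessellation.

    import numpy as np

    def interpolate_presets(cursor, positions, presets, width=0.2):
        """Blend stored parameter vectors according to the cursor's distance
        to their anchor points in the 2D plane.
        cursor:    (x, y) position,
        positions: array of shape (n_presets, 2),
        presets:   array of shape (n_presets, n_params)."""
        d2 = np.sum((positions - np.asarray(cursor)) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * width ** 2))      # one Gaussian kernel per preset
        w /= w.sum()                              # normalize to a convex combination
        return w @ presets                        # interpolated parameter vector

    # three hypothetical presets of a four-parameter synthesizer on the unit square:
    # positions = np.array([[0.1, 0.1], [0.9, 0.2], [0.5, 0.9]])
    # presets = np.array([[0.2, 440.0, 0.1, 0.0],
    #                     [0.8, 880.0, 0.5, 0.3],
    #                     [0.5, 220.0, 0.9, 1.0]])
    # params = interpolate_presets((0.4, 0.5), positions, presets)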

Figure 2.31: The interface of the ‘ali.jToop’ interpolator. [Mom05]

Figure 2.32: The Metasurface and AudioMulch. [Ben05]
As also emphasized by Jordà [Jor05a], a too deterministic or linear mapping may not be satisfying in many performance contexts. Chadabe states that “In the functioning of a slightly indeterministic instrument, a relatively small amount of unpredictable information can simulate a performer’s talented assistants, automatically supplying creative details while the macro-music remains completely under the performer’s control. Depending upon the amount of unpredictable detail and the way it is triggered, such an instrument may become a powerful performance enhancement for a professional” [Cha02]. For this reason Menzies [Men02] and others have advocated dynamic control processing instead of linear, fully predictable mapping strategies.
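
One very reduced way to picture such dynamic, slightly indeterministic control processing is sketched below in Python: each mapped parameter follows the performer's target value through a smoothing filter and is perturbed by a small low-pass-filtered random walk, so that details deviate while the macro-level stays under control. This is purely illustrative and not a method proposed by the cited authors.

    import random

    class LivelyParameter:
        """Smoothed control parameter with a small amount of band-limited randomness."""
        def __init__(self, value=0.0, smooth=0.05, wobble=0.02):
            self.value = value        # current output value
            self.noise = 0.0          # slowly drifting perturbation state
            self.smooth = smooth      # 0..1, how quickly the target is approached
            self.wobble = wobble      # depth of the unpredictable detail

        def step(self, target):
            # low-pass-filtered random walk instead of raw sample-by-sample noise
            self.noise += 0.1 * (random.uniform(-1.0, 1.0) - self.noise)
            # one-pole smoothing towards the performer's target, plus the perturbation
            self.value += self.smooth * (target + self.wobble * self.noise - self.value)
            return self.value

    # cutoff = LivelyParameter(value=0.5)
    # for control in control_stream:            # hypothetical per-block control input
    #     synthesis_cutoff = cutoff.step(control)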

2.3.4 Instrument feedback

When we grab a music instrument and try (or even know how) to make music with it, we expect it to make some sounds, more or less controlled, depending on our practice. Hearing the sounds that we produce, we can alter our techniques and experiment to achieve a more pleasant or interesting musical experience. We are thus immersed in the music system (see section 2.3.1), or in other words, part of a sensoric feedback loop with the instrument (see figure 2.33).

Figure 2.33: Musician as a feedback controller. [O’M01]

It is obvious that this feedback channel is not only auditory in most cases, “but also visual, tactile or kinesthesic. Auditory feedback concerns the learning of musical quality; visual, tactile and kinesthesic feedback concerns the integration of relationships between gesture and produced sound, which help to understand the behavior of the instrument as well as rules and limits induced by gestures. [...] Aural, haptic and visual feedback may reinforce each other in the case of an instrumental improviser” [Jor05a]. For electronic instruments this kinesthetic experience is not as naturally given as with acoustic instruments, which “resonate (i.e. they are ‘conscious’ of the sound they produce)” [Jor01], but rather has to be considered and designed purposefully. It is no coincidence that many of the early electronic instruments built on familiar interface concepts like the piano keyboard or drum heads, and that still today designs from acoustic instruments are used and remodeled for electronic sound synthesis (e.g. [KEDC02, Ove05]) – the sensoric feedback channels are known and proven in these cases, while new feedback concepts require specific practice. A way to reintroduce common feedback patterns is to imitate instrument vibration, as explored by Marshall et al. [MW06] and implemented in a generic music controller by Bowen [Bow05]. Another approach is to use metaphoric objects as familiar haptic interfaces, like in the case of the already mentioned SQUeeZABLES [Wei99], the Scrubber [EO05], the PebbleBox and the CrumbleBag (both [OE04], see figure 2.34). The latter two interface systems consist of physical objects that are recorded while being handled by the player – the conditioned audio is in turn used to control a synthesis process. Haptic feedback can also be modeled by using explicit force-feedback. Verplank [Ver05] lists several haptic events that can be experienced: a wall, bumps and valleys, pluck, friction, ring and spin. Simple force-feedback systems are readily built into various types of consumer joysticks and gamepads, or can be self-engineered with modest means as in [VGM02]. More complex applications involve the use of multi-axis ball- or pen-shaped controllers like the PHANTOM haptic devices11. Considering visual feedback, Levin [Lev00] describes three main strategies for sound/image relationships on the computer:

Score displays, which are – derived from the traditional music notation in figure 2.4 – two-dimensional timeline diagrams relating the dimension of time, along one axis, to some other dimension of sound, such as pitch or amplitude, on the other.

11SensAble FreeForm modeling systems http://www.sensable.com/products-freeform-systems.htm, last visited 15.10.2008

Figure 2.34: The CrumbleBag with cereal, coral and styrofoam fillings. [OE04]


Control-panel displays, providing a technical interface for controlling musical parameters, often imitating the look of vintage analog synthesizers (figure 2.35).

Interactive widget displays, built on the metaphor of a group of virtual objects (widgets) which can be manipulated, stretched, collided, etc. by a performer in order to shape or compose music.

Generally, it may seem that visual feedback is [...] not essential for playing traditional instruments, and the list of first rank visually impaired musicians and instrumentalists (e.g. Ray Charles, Roland Kirk, Tete Montoliu, Joaquín Rodrigo, Stevie Wonder, etc.) testifies so. However, no matter what these channels are, feedback redundancy is indeed an important component in music performing. There seems to be no reason why digital instruments should not use at their advantage anything that could broaden the communication channel with its player. [Jor05a]

In the case of Jordà’s own ambitious patchable music interface project – the reacTable (figure 2.36) – visual feedback is an essential component in order to achieve “a more intuitive interface for music performance” [Jor05a].

Figure 2.35: The control panel of Propellerhead’s ReBirth synthesizer. to achieve “a more intuitive interface for music performance” [Jor05a]. A comparable approach is undertaken in Iwai’s Tenori-On12 [NI06], a hand- held device featuring an interactive touch-sensitive display and some ad- ditional interface components allowing the user to create evolving musical soundscapes. Even more pronounced synesthetic experience was explored by Levin in his variety of painterly interfaces for audiovisual performance [Lev00], like Yellotail, Loom, Warbo, Aurora, or Floo (figure 2.38). These systems per- mit the simultaneous performance of both abstract animation and synthetic sound, with the two media components being equally important and com- plementary. In his work Levin relates directly to visual music systems of the pre-computational era, like Oskar Fischinger’s Lumigraph, a color organ instrument that allowed one to play light, to abstract film, or to optical soundtrack techniques employed to use entirely visual means to synthesize sound. This strong emphasis on audio-visual synesthesia is also represented by screen-based musical interfaces such as the IXI software instruments

12 Yamaha corporation http://www.global.yamaha.com/tenori-on, last visited 11.10.2008

Figure 2.36: The reacTable playing surface with visual feedback. [Jor05a]

Figure 2.37: The tenori-on playing surface with visual feedback. [Jor05a]

Figure 2.38: A screenshot of Levin’s Floo in use. [Lev00]

[Mag05, Mag06b, Mag06a, MM07] (see figure 2.39) where so-called actors are used as parts of generating machines and graphical representations of temporal processes.

2.3.5 Software development systems

The development of software-based music instruments can be accomplished with a large variety of programming languages, either general-purpose or specifically tuned for audio usage. The latter, so-called audio programming languages, have the advantage that interfaces to audio or MIDI hardware and to various services of the respective operating system (e.g. networking, timers etc.) are often readily integrated, so that development can focus on the actual sound or music shaping processes. While groundbreaking music systems like MUSIC-N [Mat69] or csound [Bou00] have been designed to process instrument and score definition files to produce audio files, other systems like Kyma [SJ88] or Max/FTS [Puc91] used specialized hardware in order to perform digital signal processing in real-time. Max was the first system that featured a graphical editor for building dataflow graphs (i.e. graphical representations of program logic), at first

Figure 2.39: The IXI software SpinDrum. Each wheel contains from 1 to 10 pedals. The wheels rotate at various speeds and when a pedal hits the 12 o'clock position it triggers a sample. [Mag06b]

for MIDI messages, later also for DSP processing13. In the last decade the variety of purely software-based real-time systems has broadened considerably, ranging from more consumer-oriented systems like Reaktor14, Audiomulch15 or Plogue Bidule16 to powerful programming environments like Supercollider [McC96, sup08] or ChucK [Wan08]. While the original Max system has been commercialized and extended by signal and video processing as well as scripting capabilities [Zic98, Cyc08], its original inventor Miller Puckette has reformulated and enhanced his ideas into the free software package Pure Data [Puc96, Puc08]. These two systems are similar in many architectural aspects – Max/MSP representing the more streamlined application, Pure Data the more experimental one – and both are supported by very active online communities. In contrast to Max/MSP, Pure Data can be deployed on a variety

13 IRCAM, A brief history of MAX http://freesoftware.ircam.fr/article.php3?id_article=5, last visited 10.10.2008 14 Native Instruments, http://www.native-instruments.com/index.php?id=reaktor5, last visited 11.10.2008 15 Ross Bencina, Audiomulch http://www.audiomulch.com, last visited 11.10.2008 16 Plogue, Bidule http://www.plogue.com/index.php?option=content&task=view&id=21&Itemid=35, last visited 11.10.2008

of operating systems, not only Microsoft Windows or Apple MacOS, but also many flavors of Unix and PDA handheld devices like smartphones. Pure Data allows re-use of functionality by means of abstractions, and it comes with an API that allows the development of plug-ins (so-called external objects, or externals for short) for non-standard functionality or highly optimizable customization. Libraries of such externals readily exist for graphical interfaces and video processing (e.g. the OpenGL17-based GEM graphics library [Dan97, Dan08]), interface objects to general-purpose scripting languages like Python [van97, pyt08] or lua [lua08], and many other purposes. The Pure Data core and many (if not most) of these extensions have been compiled into the Pd-extended distribution18. Due to many improvements19 over the last years Pure Data has become a stable and commonly-used platform for the prototyping of audio-related real-time systems, like interactive instruments.

17OpenGL http://www.opengl.org, last visited 11.10.2008 18Hans Christoph Steiner, Pd-extended http://puredata.info/downloads, last visited 11.10.2008 19Pure Data mailing list http://puredata.info/community/lists, last visited 10.11.2008 Chapter 3

Instrument design

After laying out all the basics in the previous chapter, we can now come back to our anticipated instrument design (figure 1.3) and try to detail the outlined instrument specifications from section 1.3 using the available options from chapter 2.

3.1 Basic considerations

Looking at the sketch of the anticipated instrument architecture (figure 1.3) it becomes clear that for meaningful translation of a musician’s gestures to sound two individual translation steps are needed, equivalent to a two-layer mapping using an intermediate representation as discussed in section 2.3.3 – a first step from the gesture to a repertory of musical structures, where there should also be a possibility for transformation of musical properties, and another one from this modified gesture representation to actual synthesis parameters to be passed on to the synthesis method of choice. This second step is the more straightforward one – depending on the actual synthesis method there usually exists a more or less intuitive way to turn specific musical structures into actual sound, as this is what audio synthesis is supposed to accomplish. This is the case for at least the methods we discussed in section 2.2 (especially the sample-based ones) but might be more difficult for FM synthesis and others – we are leaving these latter methods aside since they have mostly been technologically superseded. The first step of associating gestures and music structures is a bit more difficult. We can try to establish a connection by looking at the basics of


Smalley’s spectromorphology (section 2.1.5): According to [Sma97], “a human agent produces spectromorphologies via the motion of gesture, using the sense of touch or an implement to apply energy to a sounding body. A gesture is therefore an energy-motion trajectory which excites the sounding body, creating spectromorphological life”, therefore building a chain of activity which connects cause to source. We can interpret this reasoning in the way that characteristics of (at least) the flow of energy need to be preserved in order to keep up a causal connection between the gesture and the sound output. What exactly the notion of energy within a gesture or a sound is remains certainly debatable – however, there is reason to believe that the transformation of gestural actions (or features) into sonic actions is sensible as long as this energy trace remains intact. Consequently, a large part of our attention will be devoted to the analysis and the transformation of gestural activities. As we want to get visual feedback on our activities, and would like to have our gesture and sound repertories laid out in a flat (two-dimensional) and therefore easily visualizable manner, another important question is how gestural trajectories can be represented in two dimensions. In the following sections the instrument architecture (as shown in figure 3.1) will be worked out in detail. Within the focus of this thesis we will confine the exploration to interaction, sound production and basic feedback properties. We will leave out all aspects of actual acoustic deployment in performance spaces (like spatialization) and therefore be content with a monophonic instrument with only one source of synthesized sound.

3.2 Sensoric input

The first question is how the player’s gestures and movements should be captured. The prerequisite that outboard equipment should not be strictly necessary rules out many of the technologically more complex input solutions from section 2.3.2. The desire for high fidelity and expressiveness is not easily satisfied with many of the available input devices (especially MIDI-based ones), but it poses no problem for audio-based input such as with Tangible acoustic interfaces. For these, basically any surface equipped with contact microphones can be used to capture human touch or interaction

[Figure 3.1 diagram: acoustic/sensoric gestures → feature extraction → real-time trajectory within a two-dimensional map representation of the gesture repertory → translation/transformation from gesture to synthesis repertory → real-time trajectory within a two-dimensional map representation of the synthesis repertory → sound synthesis → sound output]

Figure 3.1: Flow of gestural patterns through the envisioned instrument architecture.

with scratching devices, bows etc. An important criterion is that the interaction does not produce too much noise itself, which would interfere with the actually intended instrument sound. In the project ‘The Sound of Touch’ [MR07, MRA08] by David Merrill et al. a very elegant means of audio input is used. It is not the surface that is recorded, but rather a wand (actually a painter’s knife, see figure 3.2) that can be “brushed, tapped, scratched, or otherwise manipulated against physical objects. [...] The wand’s flat shape allows it to be used as a probe to touch, strike, or scrape objects [of] the world, or it can be held against another object which is itself manipulated to pick up textures” [MR07].

Figure 3.2: The Painting Knife tool is scraped against shag carpet. [MRA08]

Another quite obvious way for the player to interact with the musical system (in the sense of [FWC07]) is to use the laptop body in combination with the built-in microphone itself as an interaction device. The richness of sounds that can thus be produced depends very much on the extent to which one allows the laptop casing to be stressed with scratching, knocking and rubbing. Of course there are many more conceivable possibilities for sound input, like variations on the CrumbleBag or PebbleBox interfaces already shown in section 2.3.2, or capturing interaction with bowed structures as with Hans Reichel’s Daxophone1.

1 Hans Reichel, Daxophone http://www.daxo.de, last visited 14.10.2008

3.3 Input analysis

Whatever technique is used to capture human gestures as audio input, the resulting data is highly complex and by no means easily separable into a set of orthogonal parameters, as the output of many of the common human interface devices shown in section 2.3.2 would be. In order to get hold of useful audio content, that is, perceptually meaningful structures, methods of audio analysis need to be used to decompose the complex signal into audio features and generate a more comprehensible representation. As discussed in section 2.1.6, general high-level structure recognition within audio content according to human perception is still a very active and changing field of research, whereas various methods of timbre recognition are already very much established. For the limited scope of this thesis and the research focus on the general feasibility of low-dimensional representation it seems to be an acceptable compromise to largely leave out transformation possibilities for temporal aspects of musical structures. Microscopic aspects like roughness should be recognized, though. Having in mind an interactive real-time instrument as a counterpart to a traditional acoustic instrument, it seems legitimate to leave macroscopic temporal structures to the explicit control of the musician by use of appropriate input techniques, e.g. with selected textures or movement techniques using a painting knife device as shown in the previous section. Pampalk et al. [PHH04] present a method for timbre analysis which takes psychoacoustic considerations into account. Following this article, we first model the frequency characteristics of the outer and middle ear as proposed by Terhardt [Ter79] and as shown in figure 3.3:

A_{\mathrm{dB}}(f_{\mathrm{kHz}}) = -3.64\, f^{-0.8} + 6.5 \exp\!\left(-0.6\,(f - 3.3)^2\right) - 10^{-3} f^4 \qquad (3.1)

This filter model attenuates very high and low frequencies whereas frequencies between 3 and 4 kHz are emphasized. According to Zwicker and Fastl [FZ06] the hearing range (about 20 Hz to 20 kHz) is divided into critical bands which are equidistant on the bark scale. The conversion from frequency (in kHz) to bark is given by

Z_{\mathrm{bark}}(f_{\mathrm{kHz}}) = 13 \arctan(0.76\, f) + 3.5 \arctan\!\left((f / 7.5)^2\right) \qquad (3.2)

Figure 3.3: Terhardt’s outer and middle-ear model. The dotted lines mark the center frequencies of the 24 critical bands. [PHH04]

From this, the 24 center frequencies of the critical bands as shown in figure 3.3 can be calculated by numerically finding the roots of Z_{\mathrm{bark}}(f) = 0.5, 1.5, 2.5, etc. The psycho-acoustic effect that the perception within one frequency band can be masked by neighboring frequencies is calculated according to Schroeder et al. [SAH79] (for medium input levels, as used by Pampalk et al.):

B_{\mathrm{dB}}(\Delta z_{\mathrm{bark}}) = 15.81 + 7.5\,(\Delta z + 0.474) - 17.5\left(1 + (\Delta z + 0.474)^2\right)^{1/2} \qquad (3.3)

where \Delta z denotes the frequency difference on the bark scale. The resulting curves are asymmetric due to the fact that lower neighboring frequencies have a stronger masking effect than higher ones. Finally, the loudness in sone can be calculated according to Bladon and Lindblom [BL81]:

S_{\mathrm{sone}}(l_{\mathrm{dB\text{-}SPL}}) = \begin{cases} 2^{(l-40)/10}, & \text{for } l \geq 40\,\mathrm{dB} \\ (l/40)^{2.642}, & \text{otherwise} \end{cases} \qquad (3.4)
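To make equations (3.1)–(3.4) concrete, the following Python/NumPy fragment sketches the four psychoacoustic curves and derives the 24 critical-band center frequencies by simple bisection. The function names and the bisection approach are choices made for this illustration and are not taken from [PHH04].

import numpy as np

def ear_weight_db(f_khz):
    """Terhardt outer/middle-ear weighting, equation (3.1); f in kHz."""
    return (-3.64 * f_khz**-0.8
            + 6.5 * np.exp(-0.6 * (f_khz - 3.3)**2)
            - 1e-3 * f_khz**4)

def hz_to_bark(f_khz):
    """Frequency (kHz) to critical-band rate in bark, equation (3.2)."""
    return 13.0 * np.arctan(0.76 * f_khz) + 3.5 * np.arctan((f_khz / 7.5)**2)

def band_centers():
    """Center frequencies (kHz) of the 24 critical bands, found as the roots
    of Z_bark(f) = 0.5, 1.5, ..., 23.5 by bisection (Z_bark is monotonic)."""
    centers = []
    for z in np.arange(0.5, 24.0, 1.0):
        lo, hi = 0.01, 20.0                  # search range in kHz
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if hz_to_bark(mid) < z:
                lo = mid
            else:
                hi = mid
        centers.append(0.5 * (lo + hi))
    return np.array(centers)

def spread_db(dz):
    """Schroeder spreading function, equation (3.3); dz in bark."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474)**2)

def to_sone(l_db):
    """Loudness in sone from level in dB-SPL, equation (3.4)."""
    l_db = np.asarray(l_db, dtype=float)
    return np.where(l_db >= 40.0,
                    2.0 ** ((l_db - 40.0) / 10.0),
                    (l_db / 40.0) ** 2.642)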

In order to use this loudness measure we have to define a maximum level on the dB-SPL scale which would correspond to the maximum level of normalized audio at our loudspeaker output. Since we are talking about an instrument for live performance we can assume the maximum level to be at about 90 dB-SPL. From the calculations above we get a vector of 24 values on the bark frequency scale in sone units that describes the timbre of a sound at a certain time. Typically the calculation is done on time windows which are defined by the STFT analysis window size (see section 2.1.2), with the resulting bins of the magnitude spectrum summed up into the respective critical bands. Alternatively (for improved time resolution and/or computing efficiency) this can also be done using a variation on the constant-Q transform with the center frequencies and filter bandwidths corresponding to the critical bands. In order to reduce the feature vector size (and therefore the amount of overall data) it is convenient to estimate the spectral shape by calculating the cosine transform of this 24-vector. This is known as the mel-frequency cepstral coefficients (MFCCs)2. By taking only the first few coefficients (typically 5-10, but leaving out the very first as it represents a static component equivalent to the overall loudness) of the cosine transform one gets a measure of the coarse shape of the spectrum while omitting the finer details. An introduction to the application of MFCCs for music modeling is given by Logan [Log00], while a couple of differing implementations are compared in an article by Zheng et al. [ZZS01]. As MFCC values are not very robust in the presence of additive noise, some researchers propose modifications to the basic MFCC algorithm to account for this (see [TW05, Cho06]) – e.g. by raising the log-mel-amplitudes to a suitable power (around 2 or 3) before taking the DCT, which reduces the influence of low-energy components. The omitted 0th MFCC component, which is equivalent to the overall loudness of an analysis frame, can be used to normalize the timbre representation. Thus, volume and spectral shape are treated separately – this volume measure can later be used to scale the resulting sound output.
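A per-frame timbre description along these lines could be assembled as sketched below, reusing the curve functions from the previous listing. The way masking is applied here (taking, for each band, the maximum over the spread levels of all bands) and all names and parameters are assumptions made for the sake of illustration, not the instrument's actual analysis code.

import numpy as np
from scipy.fftpack import dct       # type-II discrete cosine transform

MAX_SPL = 90.0                       # assumed full-scale playback level in dB-SPL

def frame_features(mag_spectrum, freqs_khz, n_coeffs=8):
    """Timbre description of one analysis frame: bark-band loudness in sone,
    compressed to a few cepstrum-like coefficients by a cosine transform."""
    eps = 1e-10
    level_db = 20.0 * np.log10(np.maximum(mag_spectrum, eps)) + MAX_SPL
    level_db += ear_weight_db(np.maximum(freqs_khz, 1e-3))   # outer/middle ear

    # sum bin powers into the 24 critical bands
    band_idx = np.clip(np.floor(hz_to_bark(freqs_khz)).astype(int), 0, 23)
    band_power = np.zeros(24)
    np.add.at(band_power, band_idx, 10.0 ** (level_db / 10.0))
    band_db = 10.0 * np.log10(np.maximum(band_power, eps))

    # simple masking model: spread each band over its neighbours, keep the maximum
    z = np.arange(24)
    spread = band_db[None, :] + spread_db(z[:, None] - z[None, :])
    band_db = spread.max(axis=1)

    sone = to_sone(band_db)                      # specific loudness per band
    coeffs = dct(sone, type=2, norm='ortho')     # decorrelate the spectral shape
    loudness = coeffs[0]                         # 0th coefficient ~ overall level
    return loudness, coeffs[1:1 + n_coeffs]      # coarse spectral shape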

3.4 Gesture repertory

Our gesture repertory should be a representation of all possible (or feasi- ble) gestures that one can produce with the given interface device – in our case (but not limited to) the extracted feature sets of audio input from the painter’s knife. We would like to have all possible feature sets spread out over a two-dimensional map (figure 3.1) with the additional condition that

2The mel scale is another psycho-acoustically motivated frequency scale which is closely related to the bark scale (1 bark = 100 mel). CHAPTER 3. INSTRUMENT DESIGN 61 feature sets that are similar (according to some appropriate measure) to other sets should be close to each other, while sets which are very different from each other, should be farther apart on the map. Since we are dealing with continuous multi-dimensional data it is more than likely that for a specific feature of the gesture input there is no exactly corresponding repertory element: In this case a nearest-neighbor should be found, if feasible even with some interpolation among a few such nearest- neighbor candidates. Thus, we envision that smooth variations in the audio input (resp. its extracted features) should produce smooth temporal trajectories on the map whereas harsh contrasts in the audio input should result in abrupt jumps to other regions of the map. This attribute is known as topology preservation. There are many methods for dimensionality reduction, that is, algorithms to project the multi-dimensional feature sets down to a lower dimensionality – in our case onto a two-dimensional map. Fodor [Fod02] gives a broad survey on various techniques. For the sake of generality, we do not assume that our data is a linear superposition of independent components, hence we go for the non-linear methods. If we impose the additional condition of topology preservation we have the choice of two approaches:

• Kohonen’s self-organizing maps as used by Pampalk et al. [PHH04]

• Density networks which are based on Gaussian mixture models and the expectation maximization algorithm.

Because of the positive experiences of Pampalk et al. with self-organizing maps (SOMs) and because of the accessibility of this method due to various existing implementations we take on this approach for our representation of the gesture repertory. Kohonen’s self-organizing maps (see [Koh88, KSH01]) represent a type of artificial neural networks providing a mapping of high-dimensional input data to a discretized low-dimensional form, in our case a two-dimensional lattice. In order to build such a mapping, a process of so-called unsupervised learning must be carried out. This means that an iterative algorithm is run which tries to organize the data itself, without any examples of the desired output form.

Following Fodor [Fod02] we must first define two distance measures: d_L for the distance on our two-dimensional map, which is Euclidean, and d_D for the distance in the multi-dimensional space of the audio feature set – Pampalk et al. [PHH04] state that Euclidean distance is also appropriate for the timbre vectors. On the basis of the 2D-distance measure we define a neighborhood function h_{ij} on our lattice of nodes which describes the influence of one node i on another node j. A Gaussian function is a common choice, as in

h_{ij} = \exp\!\left(-\frac{d_L^2(i, j)}{2\sigma^2}\right) \qquad (3.5)

where \sigma defines the range of the neighborhood. For any given audio feature set t_n we can now retrieve the position i^* of the node on the lattice \{\mu_i\}_{i=1}^{m} which is closest to the feature set according to our distance measure d_D:

i^* = \arg\min_{j \in \text{lattice}} d_D(\mu_j, t_n) \qquad (3.6)

The corresponding node \mu_{i^*} is commonly called the best matching unit (BMU). When building the SOM we can perform a training step to ‘inform’ the network about an audio feature t_n by moving the neighborhood of i^* a distance \rho_i = \alpha\, h_{i^* i} in the direction of this feature vector:

\mu_i^{\mathrm{new}} = \mu_i^{\mathrm{old}} + \rho_i\,(t_n - \mu_i^{\mathrm{old}}) = (1 - \rho_i)\,\mu_i^{\mathrm{old}} + \rho_i\, t_n \qquad (3.7)

Training is performed by starting with randomly seeded nodes, iteratively presenting feature set vectors to the network and repeating steps (3.6) and (3.7) while gradually decreasing the range of the neighborhood σ and the learning rate α over time. Thus, certain regions develop within the 2D lattice which represent respective similarities within the training data. As soon as the SOM has been trained, for any input set the BMU can be looked up again by use of equation (3.6). Note that this retrieved BMU vector does not usually exactly reflect a feature set that has been previously presented to the network; depending on the size of the 2D lattice, the variance within the training sets and the training sequence, it is a more or less fitting interpolation within similar sets of training data.
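The training procedure can be condensed into a few lines of NumPy, as in the following sketch. The exponential decay schedules for σ and α, the lattice initialization and all names are assumptions of this illustration and not the SOM routine used later in the implementation (chapter 4).

import numpy as np

def train_som(data, side, n_iter=20000, sigma0=None, alpha0=0.5, seed=0):
    """Minimal SOM training loop (equations (3.5)-(3.7)): `data` has shape
    (n_samples, n_features), `side` is the edge length of the square lattice."""
    rng = np.random.default_rng(seed)
    nodes = rng.random((side, side, data.shape[1]))        # randomly seeded lattice
    yy, xx = np.mgrid[0:side, 0:side]                      # lattice coordinates
    sigma0 = sigma0 or side / 2.0

    for t in range(n_iter):
        frac = t / float(n_iter)
        sigma = sigma0 * (0.01 / sigma0) ** frac           # shrink the neighborhood
        alpha = alpha0 * (0.01 / alpha0) ** frac           # decay the learning rate

        x = data[rng.integers(len(data))]                  # present one feature vector
        d = np.linalg.norm(nodes - x, axis=2)              # distances d_D to all nodes
        by, bx = np.unravel_index(np.argmin(d), d.shape)   # best matching unit i*

        h = np.exp(-((yy - by)**2 + (xx - bx)**2) / (2.0 * sigma**2))   # eq. (3.5)
        nodes += (alpha * h)[..., None] * (x - nodes)      # eq. (3.7)
    return nodes

def best_matching_unit(nodes, x):
    """Lattice coordinates (row, column) of the BMU for feature vector x, eq. (3.6)."""
    d = np.linalg.norm(nodes - x, axis=2)
    return np.unravel_index(np.argmin(d), d.shape)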

3.5 Sound repertory

As depicted in figure 3.1 our instrument system consists of two two-dimensional maps: one for the gesture repertory and one for the sound repertory. Then, what is this sound repertory composed of? In order to produce sound we need some synthesis method (see section 2.2) which uses algorithms to cal- culate digital sound from a stream of synthesis parameters. As discussed, these are usually a larger number of coefficients passed to the synthesis en- gine at regular, usually short time intervals in order to define the actual sound output in real-time. That means, in order to realize sound output from a real-time trajectory as in figure 3.1 we need to pick up synthesis pa- rameters from the 2D-map along our trajectory as illustrated in figure 3.4.

Figure 3.4: Sounds (that is synthesis parameters) are evenly scattered on our map of the sound repertory. The gesture trajectory (from figure 3.1) picks up these parameters in real-time – they are in turn passed to the synthesis engine and transformed to digital audio.

The first question that arises when looking at this sketch is: How can we meaningfully and evenly group our synthesis parameters so that the whole idea of 2D transformation makes sense and the actual sound output is not a mess? The latter demand implies that neighboring sounds of the map must have some similarity, so that a continuous gesture trajectory produces a somewhat continuous (or smooth sound), while a jumpy trajectory need not. There is a simple and already established solution in case that for the CHAPTER 3. INSTRUMENT DESIGN 64 actual synthesis method used, the synthesis parameters reproducibly corre- spond to a specific and static sound. If so, we can take the sounds (produced from synthesis parameters) that we would like to be part of the repertory, ex- tract meaningful audio features and train a SOM, just like in the case of the gesture map in section 3.4. When the map has been fully trained, we browse through all our sounds a second time, again calculate the audio features, and find a BMU position in the SOM for each of our sounds – with this position in the 2D-map we associate the synthesis parameters used to produce the respective sound. This way (through the specific properties of the SOM) we obtain a situation as in figure 3.4: the available 2D-space becomes evenly populated with synthesis parameters which are grouped together according to similarities in the corresponding sounds. This solution works for granular synthesis where the synthesis parame- ters are mainly a position within a sound file as well as amplitude and pitch, and for spectral modeling synthesis where the synthesis parameters consist of coefficients defining a static timbre made from sinusoidal and residual components, as discussed in sections and . For physical modeling synthesis where parameters can e.g. correspond to the action of plucking a string which then gradually dies (which is cer- tainly non-static sound), other similarity measures would have to be found in order to use SOMs for mapping the parameters into 2D space. It would for instance be conceivable to introduce a distance measure which groups together all transient sounds and all static sounds and also classifies other characteristics in suitable ways. The second issue with the desired functionality in figure 3.4 is the amount of data points which will be placed on such a map. If we want to represent sounds on the map that make up a total duration of one minute, we would have to populate the 2D-space with about 60000 points (for an analysis resolution of 1 millisecond). While playing we will have to look up nearest neighbors along our trajectory in real-time, again with a desired time reso- lution of about 1 millisecond. This means that a very efficient way of finding nearest neighboring points has to be employed – an exhaustive search would otherwise let explode computing time. A well-established way (above all in fields of computer graphics and ge- ography) to organize data efficiently in 2D-space is using so-called quad tree data structures, thus named first by Finkel and Bentley [FB74]. See figure CHAPTER 3. INSTRUMENT DESIGN 65

Figure 3.5: Region-based quad tree: Data points are inserted subsequently – as soon as a region contains more than a predefined number of data points it is divided up into four sub-regions. [Sam84]

Figure 3.5 shows how data is stored within such a data structure. As long as the data points are evenly spread (as they are in our case), searching is an operation of order O(log n), which shrinks computing time by several orders of magnitude.
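As an illustration of this idea, the following Python class sketches a region-based quad tree together with a best-first nearest-neighbor lookup driven by a priority queue. Bucket capacity, payload handling and the search strategy are assumptions of this sketch, not the custom C++ external described in chapter 4.

import heapq, math

class QuadTree:
    """Region-based quad tree (cf. [FB74], [Sam84]): a node covers a square
    region and is split into four sub-quads once it holds more than `capacity`
    points."""
    def __init__(self, x, y, size, capacity=8):
        self.x, self.y, self.size = x, y, size      # lower-left corner, edge length
        self.capacity = capacity
        self.points = []                            # (px, py, payload) while a leaf
        self.children = None                        # four sub-quads after splitting

    def insert(self, px, py, payload):
        if self.children is not None:
            return self._child_for(px, py).insert(px, py, payload)
        self.points.append((px, py, payload))
        if len(self.points) > self.capacity and self.size > 1e-6:
            self._split()

    def _split(self):
        h = self.size / 2.0
        self.children = [QuadTree(self.x + ix * h, self.y + iy * h, h, self.capacity)
                         for iy in (0, 1) for ix in (0, 1)]
        for px, py, payload in self.points:
            self._child_for(px, py).insert(px, py, payload)
        self.points = []

    def _child_for(self, px, py):
        h = self.size / 2.0
        ix = 1 if px >= self.x + h else 0
        iy = 1 if py >= self.y + h else 0
        return self.children[iy * 2 + ix]

    def _region_distance(self, px, py):
        """Smallest possible distance from (px, py) to this node's region."""
        dx = max(self.x - px, 0.0, px - (self.x + self.size))
        dy = max(self.y - py, 0.0, py - (self.y + self.size))
        return math.hypot(dx, dy)

    def nearest(self, px, py):
        """Best-first nearest-neighbor search: expand entries in order of their
        minimum possible distance to the query point."""
        heap = [(0.0, 0, self)]
        count = 1
        while heap:
            dist, _, item = heapq.heappop(heap)
            if not isinstance(item, QuadTree):
                return dist, item                   # nearest (px, py, payload)
            if item.children is None:
                for qx, qy, payload in item.points:
                    heapq.heappush(heap, (math.hypot(qx - px, qy - py),
                                          count, (qx, qy, payload)))
                    count += 1
            else:
                for child in item.children:
                    heapq.heappush(heap, (child._region_distance(px, py),
                                          count, child))
                    count += 1
        return None                                 # empty tree

The best-first search is correct because the first actual point popped from the queue cannot be farther away than the lower bounds of all remaining entries.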

3.6 Gesture transformation

When determining a best matching unit within our representation of the gesture repertory for a given audio input feature set we basically obtain coordinates into the 2D lattice of the SOM. As we are doing this continuously in real-time, we are tracking the audio equivalent of the player’s gestures according to our intention sketched in figure 3.1. In order to keep the user interface clean and tidy the transformation strategy in 2D space between the maps of gesture and sound repertory should be simple but effective. A general 2D coordinate transformation can be carried out by translation, scaling and rotation as shown in figure 3.6.

Figure 3.6: Two-dimensional transformation, by subsequent translation, scaling and rotation

In sonic reality the situation is not as simple. There are usually many zones of attraction within a SOM of audio features, corresponding to very different timbral characteristics. With only a single 2D transformation all these zones would be transformed uniformly – for a sensible sound transformation it is clear that we need to be able to specify multiple coexistent transformations so that the overall gesture-to-sound transformation is more purposeful. This can be done by layering multiple transformations as shown in figure 3.7. By defining these zones as two-dimensional Gaussian probability distributions it is not necessary to cover the whole map with such transformation specifications. Instead, we can use the weight factors resulting from the affiliation with the individual Gaussians either as mixing factors for the subsequent sound synthesis step or simply take the largest contribution and assign it to the corresponding transformation zone.


Figure 3.7: Three coexistent zones of transformation, translating from the gesture to the sound repertory. The BMU in the gesture repertory is close to two of the regions (a bit closer to A than to B) and is therefore transformed twice, weighted by factors according to the affiliation with the two Gaussians. The third zone C is far away, therefore the probability and the related weighting factor are both very low.
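One compact way to realize such layered zone transformations is sketched below: each zone combines a translation, scaling and rotation (figure 3.6) with a Gaussian weight, and a gesture-map point is turned into a list of weighted sound-map points. The normalization of the weights and the cut-off threshold are assumptions of this illustration.

import numpy as np

class TransformZone:
    """One Gaussian-weighted transformation zone: a 2D translation, scaling and
    rotation whose influence falls off with distance from the zone's center."""
    def __init__(self, center, sigma, translate=(0.0, 0.0), scale=1.0, angle=0.0):
        self.center = np.asarray(center, float)
        self.sigma = float(sigma)
        self.translate = np.asarray(translate, float)
        c, s = np.cos(angle), np.sin(angle)
        self.matrix = scale * np.array([[c, -s], [s, c]])   # scaling and rotation

    def weight(self, p):
        """Un-normalized Gaussian affiliation of point p with this zone."""
        d2 = np.sum((np.asarray(p, float) - self.center) ** 2)
        return np.exp(-d2 / (2.0 * self.sigma ** 2))

    def apply(self, p):
        """Transform a gesture-map position into a sound-map position."""
        return (self.matrix @ (np.asarray(p, float) - self.center)
                + self.center + self.translate)

def transform(p, zones, threshold=1e-3):
    """Return a list of (target position, mixing weight) pairs for one
    gesture-trajectory point, as in figure 3.7."""
    weights = np.array([z.weight(p) for z in zones])
    total = weights.sum()
    if total < threshold:
        return []                           # far away from every zone
    out = []
    for z, w in zip(zones, weights / total):
        if w > threshold:
            out.append((z.apply(p), w))
    return out

Taking only the largest contribution instead of mixing, as mentioned above, corresponds to keeping just the pair with the highest weight.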

3.7 Sound synthesis

At regular short intervals (typically about 1 ms) we now get position information for gesture trajectories within our map of the sound repertory. Depending on the strategy we choose for handling the weighting factors of the coexisting transformation zones, we can have one or more points in 2D-space at each instant of time, where we can look up the respective synthesis parameters as discussed in section 3.5. As also said above, these synthesis parameters depend on the actual synthesis method we are using, hence we need to discuss possible options for the instrument presented in this thesis. Within this scope we will only take into account sample-based methods (as introduced in section 2.2) which rely on an existing collection of sounds.

3.7.1 Granular synthesis

Roads [Roa96, Roa02] distinguishes between synchronous granular synthesis, where the period between individual grains is more or less fixed, and asynchronous granular synthesis, where grains are scattered irregularly into clouds as if distributed with a spray jet. As in the case of our instrument a fixed rate (on the order of 1 ms, determined by the hop size of the analysis window) is used for the feature extraction from the audio input and in turn for the transformation of the individual points on the gesture trajectory, we end up with a fixed rate for synthesis control data, corresponding to synchronous granular synthesis. In principle there is just one single synthesis parameter to be given at each cycle, namely a position within the collection of sounds – the sound excerpt at one such position is used to build one sonic grain – and over time a synchronous stream of grains adds up to a continuous sound as shown in figure 3.8. The grain envelope and the remaining grain parameters need not necessarily


Figure 3.8: Synthesis parameters are picked up along the gesture trajectory. This positional data is used to look up the sound collection, grains are formed and these in turn added up to the final sound output.

be modulated with the gesture movement, but can rather be fixed or set as global instrument presets. As said, we can have more than one point at each time because of the gesture transformation process discussed above. Each of these points is looked up in the sound collection in parallel and the results are mixed together using the respective weight factors calculated in the transformation. It is debatable whether synthesis should be carried out using normalized sound fragments (in which case the resulting volume is dictated only by the intensity of the gestural action) or whether the loudness and musical dynamics of the underlying material are an important quality that should be retained. In the actual instrument implementation either choice is selectable.
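The synchronous grain assembly of figure 3.8 can be sketched as a simple overlap-add loop; grain length, the Hann window and the handling of multiple weighted positions per frame are illustrative assumptions rather than the settings of the actual instrument.

import numpy as np

def granular_stream(source, positions, weights, hop=64, grain=256):
    """Synchronous granular synthesis sketch: every `hop` samples, one grain of
    length `grain` is read for each (position, weight) pair delivered by the
    gesture transformation, windowed and overlap-added into the output.
    `source` is the repertory audio as a float array; `positions` and `weights`
    are per-frame lists."""
    n_frames = len(positions)
    out = np.zeros(n_frames * hop + grain)
    window = np.hanning(grain)
    for frame, (pos_list, w_list) in enumerate(zip(positions, weights)):
        start = frame * hop
        for pos, w in zip(pos_list, w_list):
            pos = int(min(max(pos, 0), len(source) - grain))
            out[start:start + grain] += w * window * source[pos:pos + grain]
    return out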

3.7.2 Spectral modeling synthesis

In the case of granular synthesis there is no way to process the inner details of the resulting sound, as the sonic quanta ripped out of the original sound are just assembled but not touched up in any way. For example, if the original sound consists of clear tones and is reassembled in time in the course of the synthesis, the original phase information is destroyed, and the resulting sound will unavoidably sound blurry. This can be improved by using the decomposed representation of sinusoidal guides and residual noise within the systematics of spectral modeling synthesis (SMS). As discussed in sections 2.1.3 and 3.5 this is a frame-based representation of the original sound which can also be resynthesized using the described process. Since these synthesis parameters correspond to the original sound (with some artefacts due to the limitations of the SMS) we can train a SOM with our desired sound repertory and then use the quad tree as above to store the coefficients for the sinusoidals and the residuum of a sound frame at the position of its BMU. Thus, we obtain a sound repertory of all the coefficients necessary to resynthesize the original sounds. In contrast to granular synthesis this decomposed representation also allows processing of the sound while it is reassembled frame-by-frame along the gesture trajectory. We can apply various algorithms, e.g. calculate the correct phase information for perpetual sinusoidals (and therefore avoid the roughness experienced with granular synthesis), use a low-pass filter algorithm so that the residuum changes smoothly over time, or reconstruct the pattern of spectral peaks by reassembling the spectral guides (basically just as illustrated in figure 2.8).
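The phase-continuous reassembly of sinusoidal guides amounts to an oscillator bank that keeps a running phase per partial across frames, roughly as in the following fragment. The residual part and any amplitude interpolation between frames are omitted, and the data layout is an assumption of this sketch.

import numpy as np

def resynthesize_partials(frames, hop=64, sr=44100.0):
    """Additive resynthesis of sinusoidal guides with continuous running phase:
    `frames` is a sequence of lists of (frequency_hz, amplitude) pairs, one list
    per hop, with partials matched by index between frames."""
    n_partials = max(len(f) for f in frames)
    phase = np.zeros(n_partials)
    out = np.zeros(len(frames) * hop)
    t = np.arange(hop)
    for i, partials in enumerate(frames):
        block = np.zeros(hop)
        for k, (freq, amp) in enumerate(partials):
            step = 2.0 * np.pi * freq / sr
            block += amp * np.sin(phase[k] + step * t)
            phase[k] = (phase[k] + step * hop) % (2.0 * np.pi)   # carry phase over
        out[i * hop:(i + 1) * hop] = block
    return out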

3.8 Instrument feedback

The proposed instrument can feature haptic feedback if the input device (like e.g. the painter’s knife) incorporates it; but since sound generation is detached from the actual input device, it is important to give the player a clue what is actually happening with his/her input and what he/she can ex- pect as an output (e.g. by following developments) without being too reliant on the audio – simply to “broaden the communication channel between the instrument and its player” [Jor03]. Consequently it is only straightforward to visualize the gesture trajec- tories within the repertories of input and sound in real-time. A feasible representation of the 2D-space in which these trajectories are contained can be achieved by color-coding the audio features as stored in the SOM maps so that similarly sounding regions can be visually identified. Additional visual superposition of the transformation zones helps to intuitively understand the process of gesture transformation. Chapter 4

Implementation and assessment

One of the most important prerequisites of the instrument system under examination is the fact that it must be playable in real-time with utmost immediacy. Within the domain of music technology the latter translates to the buzzword low latency which means that the time passing between gesture input and sound output is short enough so that the delay does not noticeably influence musical interaction or exact synchronization with other musicians. Magnusson [Mag06a] states that “In few areas of computing is time and latency as important as in music. Latency above 20 milliseconds is noticeable to the musician and can be frustrating when playing a digital instrument. Effective controllers and fast algorithms are very important.” Real-time behavior implies that all of the instrument parts exposed in the previous chapter need to work virtually in parallel, in a manner that no component blocks the others. The overall system (as sketched in figure 4.1) consists of audio input (e.g. by contact microphones) capturing a musician’s gestures and subsequent analysis of relevant audio features. These feature sets are continuously looked up in the map of the gesture repertory, yielding two-dimensional coordinates. The resulting trajectory is subjected to a transformation in two-dimensional space. In turn, the transformed trajectory is used to retrieve synthesis parameters within the map of the sound repertory. A synthesis algorithm (either granular or spectral modeling synthesis) is employed to transform the parameters into actual sound that is output to the analog world.
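Written as schematic pseudo-code, one audio block travels through this chain roughly as follows; the objects are stand-ins for the components developed in chapter 3 (feature analysis, gesture SOM, transformation zones, quad-tree sound map and synthesis engine) and reuse the conventions of the earlier sketches, so this is an outline rather than actual instrument code.

def process_block(block, analyzer, gesture_som, zones, sound_map, synth):
    """One audio block through the chain of figure 4.1 (schematic sketch)."""
    loudness, features = analyzer(block)                 # timbre features of the input
    gy, gx = best_matching_unit(gesture_som, features)   # position in the gesture map
    targets = transform((gx, gy), zones)                 # weighted sound-map positions
    voices = []
    for pos, weight in targets:
        dist, (sx, sy, params) = sound_map.nearest(*pos) # synthesis parameters
        voices.append((params, weight))
    return loudness * synth(voices)                      # scale by the volume trace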


[Figure 4.1 diagram: gestures making sound (recorded) → audio feature extraction → find matching feature set in 2D map of gesture repertory → gesture transformation based on 2D coordinates → look up synthesis parameters in 2D map of sound repertory → sound synthesis → volume adjustment → sound]

Figure 4.1: Technical building blocks

There are parts of the instrument’s logic as laid out in the previous chapter that cannot be implemented to work in real-time on available computing platforms. One of the drawbacks of self-organizing feature maps is exactly the fact that building them takes many training iterations through the data repertory and therefore a lot of computing power, or rather execution time. Looking up features in the SOMs, on the other hand, can (and – in our case – must) be performed in real-time, although depending on the size of a map this can also be a demanding operation. The system logic must consequently be split up into two parts: a preprocessing step to build the two-dimensional maps out of repertory collections, and the real-time part of the instrument which processes the gesture flow from the input device to the sound output.

4.1 Repertory preprocessing

Both two-dimensional repertory maps, the one for input gestures as well as the one for sound output, must be prepared prior to actual usage within the instrument. The term repertory already implies that both maps represent selected gesture or sound instances that should be taken into account in the instrument’s context. It would in theory well be possible to represent all possible input gestures or sounds within the maps, but in this case the SOM lattices would be huge, with the disadvantage of increased lookup times and memory consumption as well as an overloaded visual presentation. The restriction to only relevant gesture and sound items within the maps makes the whole system more controllable and expressive because of an increased resolution of detail. In practice, the repertories originate from (monophonic) audio files of any common audio file format – one for all gesture variations that the system should be able to react to and one for sound samples produced from all relevant synthesis parameter sets. Due to the employed methods for input feature analysis as well as sound synthesis, both repertories are only used for timbre characteristics, while disregarding any temporal contexts. The preprocessing step is performed by script programs written in the Python programming language [pyt08] using the NumPy1 extension package for scientific computing, a module for audio feature extraction and a

1 NumPy http://numpy.scipy.org, last visited 14.10.2008

custom-made binary extension module for the quad-tree data structure as discussed in section 3.5. The Python programming language and the employed extensions are fully cross-platform and free of charge.

[Figure 4.2 diagram: for each repertory, an audio file (recorded gestures resp. synthesized sound) is cut into frames (~1 ms), MFCCs are computed and a SOM is trained, yielding a data file and graphics; for the sound repertory an additional MFCC lookup stores the synthesis parameters in a quad-tree data file]

Figure 4.2: Preprocessing of gesture and sound repertories

Figure 4.2 shows the preprocessing logic for the two repertories. When running the Python script, the audio files are subjected to the process of audio feature extraction (see section 3.3), resulting in MFCC vectors every 64 audio samples which is roughly equivalent to a frame duration of 1.5 ms at a sample rate of 44.1 kHz. These MFCC vectors are used to train a SOM. The implementation of the SOM algorithm as a Python module is based on the work of Marilen Corciovei2. A standardized training procedure has been established based on experiments for steering the neighborhood distance and

2 Marilen Corciovei, som.py http://www.len.ro/work/ai/som.py/view, last visited 06.10.2008

learning rate parameters discussed in section 3.4. By default, the size of the quadratic map is chosen so that each node on the grid could in theory represent one data frame, hence the side length equals the square root of the number of frames – this can be manually adjusted, though. The completed SOM, consisting of a two-dimensional grid of MFCC-sized vectors, is dumped to a file to be loaded into the real-time instrument. At the same time a color-coded representation of the SOM is generated which can be used as a visualization of the repertory. For the sound repertory, the SOM is used to look up the two-dimensional BMU coordinates of all MFCC vectors. These are stored in a quad-tree along with the synthesis parameters that correspond to the respective MFCC frames. The quad-tree data is dumped to a file to be used in the real-time part of the instrument. As requested in section 1.3 the system should be incrementally trainable in order to upgrade an existing setup with new gestures or new sounds. This is possible by taking existing SOM data and adjusting the learning rate and neighborhood parameters in such a way that only minor, local modifications to the existing database are carried out.
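Such an incremental upgrade could look like the following fragment, which re-runs a few training iterations on the existing node grid with a deliberately small, fixed neighborhood and learning rate so that only local adjustments occur. This is again only a sketch under the conventions of the earlier SOM listing.

import numpy as np

def refine_som(nodes, new_data, n_iter=2000, sigma=2.0, alpha=0.05):
    """Nudge an already trained SOM with additional feature vectors: the small,
    fixed neighborhood range and learning rate keep modifications local."""
    rng = np.random.default_rng()
    side = nodes.shape[0]
    yy, xx = np.mgrid[0:side, 0:side]
    for _ in range(n_iter):
        x = new_data[rng.integers(len(new_data))]
        d = np.linalg.norm(nodes - x, axis=2)
        by, bx = np.unravel_index(np.argmin(d), d.shape)
        h = np.exp(-((yy - by)**2 + (xx - bx)**2) / (2.0 * sigma**2))
        nodes += (alpha * h)[..., None] * (x - nodes)
    return nodes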

4.2 Real-time operation

The real-time parts of the instrument are based on the Pure Data media development system. As discussed in section 2.3.5 Pure Data provides a data-flow oriented programming environment with audio inputs and outputs as well as add-ons for OpenGL-based graphics and many other purposes. Recent versions of Pure Data3 feature a callback-based audio back-end which can achieve a minimal analog-digital-analog latency of about 5 ms, high- quality audio hardware provided. Like most other real-time audio systems Pure Data calculates DSP using sample blocks – by default 64 samples long – for performance reasons. This granularity of about 1.5 ms is irrelevant for our application. Figure 4.3 shows the entire real-time system. Since graphics processing in Pure Data – respectively the GEM subsystem – can disturb the real-time behavior, it is advantageous to let the audio and the interface parts run on two separate Pure Data processes and connect them via network sockets.

3 Starting with version 0.41, for MacOS and Linux
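The coupling of the two processes can be illustrated with a small Python fragment that sends FUDI-formatted messages (space-separated atoms terminated by a semicolon) of the kind Pure Data's netreceive object accepts; the port number and the message layout are assumptions and not the actual protocol used in the implementation.

import socket

def send_coordinates(sock, x, y):
    """Send one trajectory point to the GUI process as a FUDI-style message."""
    sock.sendall("point {:.4f} {:.4f};\n".format(x, y).encode("ascii"))

# Example: connect to a [netreceive 9999] object in the GUI patch (assumed port).
gui = socket.create_connection(("localhost", 9999))
send_coordinates(gui, 0.25, 0.75)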

[Figure 4.3 diagram – audio process: audio in (block size = 64 samples) → volume trace and MFCCs → MFCC lookup in gesture SOM → Gaussian transformation zones → lookup of synthesis parameters in quad-tree → sinusoidals + residuum composition → SMS synthesis → audio out; graphical interface: color-coded pictures of the gesture and sound SOMs, visualization and adjustment of the 2D transformation]

Figure 4.3: Program flow for the real-time parts of the system: audio processing (employing spectral modeling synthesis) on the l.h.s. and the graphical user interface on the r.h.s.

Pure Data and GEM are both fully cross-platform and covered by open-source licenses; however, due to said limitations of the audio subsystem the Windows operating system is currently not optimally supported. The audio part consists of three separate components: input processing, repertories and transformation, and sound synthesis. The gesture input as well as the sound output components can be exchanged in order to use different input devices or different audio synthesis methods. The logic of the core component is not touched by modifications of the input or output components. The input stage performing a timbre analysis using MFCCs uses the exact same algorithms as the preprocessing stage, although this time implemented partly as C++-coded binary Pure Data externals, partly using the usual patcher-based programming paradigm. See appendix A for excerpts of some of the main patchers. It is decisive that the MFCC vector data is fully compatible because of the subsequent lookup in the SOM map (loaded with the map data file from the preprocessing stage) representing the gesture repertory. The BMU retrieval from the SOM is implemented as another optimized binary external. Apart from the MFCC generation this is the most performance-hungry part of the system. The result of the map lookup is a stream of two-dimensional coordinates every 64 samples. These points are sent over to the graphical interface side and visualized with the color-coded SOM representation of the gesture repertory. Figure 4.4 shows an example of the visualization of the gesture trajectories. These coordinates are transformed using Gaussian-shaped zones as described in section 3.6. The individual zones can be edited (and also modified while playing) using the graphical interface and a SpaceNavigator 3D mouse4 – currently the definition of 10 individual zones is supported. The existence of multiple transformation zones can result in multiple transformed positions per input position which are each accompanied by a weight factor according to the proximity to the respective zones. Each of these new positions is used to look up synthesis parameters in a quad-tree data structure which has been loaded with the data file from the preprocessing step. The quad-tree algorithms are implemented using another custom binary Pure Data external. The transformed positions are

4 3Dconnexion SpaceNavigator 3D mouse http://www.3dconnexion.com/3dmouse/spacenavigator.php, last visited 14.10.2008

Figure 4.4: Visualization of a gesture trajectory on the graphical interface. The color-coded SOM representation is overlaid by a trace of gesture positions and one Gaussian transformation zone.

also sent to the graphical interface and displayed on the color-coded SOM representation of the sound repertory, fully analogous to figure 4.4. The last component is the synthesis engine, which is exchangeable – the instrument’s implementation has been equipped with both granular and spectral modeling synthesis. The realization of granular synthesis is very straightforward: the corresponding synthesis parameters are simply positions within the original audio file of the sound repertory. The grain duration and overlap are fixed by the synchronous architecture of the system – they are equivalent to the audio block size, or rather the MFCC analysis rate of the system. If there is more than one synthesis parameter per time frame (due to the 2D transformation process), the individual sonic grains are weighted in volume and superposed. As discussed in section 3.7, spectral modeling synthesis holds more potential for blending the multiplicity of sonic grains resulting from the transformation of the gesture trajectory. Two binary Pure Data externals have been implemented to combine sinusoidal and residual components separately into temporally developing spectral guides and noise shapes. The subsequent actual synthesis step is implemented partly as patcher logic and partly using a few binary externals. As mentioned in section 3.7 we may or may not use normalized sound synthesis, which means that the resulting sound either has no or some inherent dynamics. In any case, in order to fulfill the requirement of gesture energy conservation (see section 3.1) it is necessary to modulate the volume of the audio output by a volume trace of the input gesture (as shown in figure 4.1).
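The volume trace used for this modulation can be as simple as a smoothed RMS measurement of the input block, e.g. a one-pole envelope follower as sketched below; the attack and release constants are illustrative assumptions, not the values of the actual patcher logic.

import numpy as np

def volume_trace(block, state, attack=0.01, release=0.1, sr=44100.0, hop=64):
    """One-pole envelope follower on the input gesture signal; the returned
    value can be used to scale the synthesized output block (figure 4.1)."""
    level = float(np.sqrt(np.mean(block ** 2)))       # RMS of the input block
    tau = attack if level > state else release        # fast rise, slow decay
    coeff = np.exp(-hop / (sr * tau))
    return coeff * state + (1.0 - coeff) * level

The returned value would be computed once per 64-sample block and multiplied onto the corresponding synthesized output block.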

4.3 Hardware considerations

The real-time system as shown in figure 4.3 has been fully implemented and tested on a common modern laptop computer with an Intel Core2Duo dual-core CPU5 and a dedicated graphics subsystem capable of OpenGL hardware acceleration6. It is advantageous to use a dual-core CPU since audio and graphics processing run as two separate processes which can then

5 Apple Macbook Pro, Intel Core2Duo 2x2.16 GHz with 4 MB on-die cache, 2 GB RAM 6 ATI Radeon X1600 with 128 MB VRAM

be nicely balanced by the operating system. On this system, the CPU consumption is about 50% for the audio part and about 25% for the graphics (of 200% total CPU) for a representative gesture SOM of 325x325 nodes and three transformation zones. It is possible to lower the temporal resolution of the input feature analysis (i.e. use analysis frames longer than 1.5 ms) in order to reduce CPU consumption. The visual feedback might be unnecessary with sufficient practice on a specific setup – consequently it can also be switched off. As said above, input/output latency is decisive for the feel of an electronic instrument – the latency of the software part is in the order of 6 ms for impulse sounds due to the FFT window sizes of analysis and synthesis. The additional latency of the audio system depends on the audio hardware used – for professional Firewire- or PCI-based audio solutions this is also in the range of about 3-10 ms depending on the hardware buffer size. With the audio interface at hand7 the total input/output latency of the instrument system was found to be about 10 ms, which is more than sufficiently low for sensible musical interaction. Figure 4.5 shows the instrument ready for performance. All the parts of the system apart from the computer and the audio interface are rather noncritical. Input devices like a painter’s knife equipped with a contact microphone are easy and cheap to build, along with a variety of surfaces and textures – for a meaningful feature extraction the frequency response of the device should be broad, with a low noise floor. This can be checked by listening to the audio signal of the input device. The SOM learning procedure will adapt the system to the specific characteristics of the device in use. Naturally many other audio-based input devices are conceivable, e.g. for utmost simplicity the plain internal microphone of a laptop computer. The existence of a multi-axis controller like a 3D mouse is not absolutely necessary – however, it makes adapting the transformation zones much easier and it can also be used to rock the system by shifting or twisting the transformation parameters while performing, comparable to a guitar tremolo. It makes sense to use a standard guitar volume pedal to have a convenient possibility to mute the sound output when e.g. putting down the painter’s knife. 7 RME Fireface 400, http://rme-audio.de/products_fireface_400.php, last visited 15.10.2008

Figure 4.5: The entire performance system with painter’s knife input device, screen, 3D mouse and audio interface (from right to left). Computer and sound reinforcement and an optional volume pedal are not visible. CHAPTER 4. IMPLEMENTATION AND ASSESSMENT 82

4.4 Early practical experiences

Although the instrument sports a very bare appearance, its possibilities are ample. The flexibility with respect to the input device and playing techniques, to the gesture and sound repertories as well as to transformation strategies requires a great deal of experimentation and practice in order to come up with a useful hardware setup and software presets. Concerning the painter’s knife interface, the blade should be very thin and light so that it picks up the texture but doesn’t resonate too much by itself, tainting the audio signal. The contact microphone has to be firmly glued to the blade so that there is no air gap and no possibility of loose rattling. The audio cable should be thin and flexible to reduce the amount of sound traveling over the cable to the contact microphone. The variety of surfaces and textures for interaction should allow both noisy and tonal excitation of the blade. The blade can also be bowed while its tip is fixed, comparable to a singing saw. In any case, practice with a specific instrumental technique is important for repeatability and for corresponding transformation strategies from gestures to sounds.

Conclusions and future perspectives

An electronic instrument system has been designed and implemented based on the propositions and basic thoughts as outlined in section 1.3, with further refinements and constrictions as discussed in chapter 3. Although the system is heavily technology-based, the actual performance interface is simple and intuitive, not distracting a player from musical thinking and providing visual feedback for gestural actions. Like almost any traditional acoustic instrument, this new system demands practice – finding the right interface device, exploring playing techniques and selecting a sound repertory. This will take some time, therefore the eventual success or failure of the instrument and its suitability for actual musical performance cannot be veritably judged at this early stage. While feature extraction and sound synthesis work quite straightforwardly, the multi-zone transformation strategy needs some familiarization and careful setup by the player. This part of the system can easily be modified at a later time when actual playing experience identifies more serious inconveniences. Another point that needs further improvement is the preprocessing of the gesture and sound repertories, which should be integrated into the running system in some way. Ideally, both repertories should be upgradeable in real-time by picking up momentary gestures or live sounds, e.g. from other musicians. One of the constraints of the current implementation is the absence of

temporal transformation possibilities of musical structure elements. It would be an interesting and deep research topic to find out how and to what extent temporal transformations are possible at all for real-time gestural control. As a closer goal for improvement, gesture trajectories could be split up into temporal segments controlled by energy-flow considerations, and in turn whole movements could be associated with musical developments. Generally, intensified research on the identification of higher-level gestural and musical structure elements will be necessary to make available more adequate means of representation of musical acting and thinking within computer-based frameworks. This will keep us busy for quite a while. Appendix A

Pure Data patches

Due to the large number of text- and patcher-based programs produced, only the high-level functionalities of the real-time part of the instrument are presented here.

Figure A.1: Main Pure Data patch, including analysis, transformation, synthesis as well as communication with the visualization process.


Figure A.2: Patcher abstraction for input analysis. The volume trace and the MFCC vectors are generated here.

Figure A.3: Transformation abstraction, comprising 10 transformation zones.

Figure A.4: Transformation zone abstraction, using some math to translate, scale and rotate 2D coordinates.

Figure A.5: SMS synthesis abstraction, including parameter retrieval, sinusoidal and residuum combination, guide generation and actual FFT-based sound synthesis.

Figure A.6: Network socket communication patch using senders and receivers to talk to the second Pure Data process.

Figure A.7: Main patch of the Pure Data visualization process, displaying a GEM window and two panels therein.

Figure A.8: Panel abstraction showing SOM graphics, trajectory points and 10 transformation zones. Mouse interaction is processed and sent to the Pure Data audio process. List of Figures

1.1 ‘PLAY’ 4-track sample-based performance instrument, based on Max/MSP ...... 6 1.2 8GG - ‘dance taiji’ synesthetic performance interface. . . . .8 1.3 Anticipated instrument architecture ...... 11

2.1 7-axis dimension space of electronic instruments [BFMW05] . 13 2.2 From analog sound input to a digital representation and back to analog output [Roa96] ...... 15 2.3 Time-domain representation of a waveform [Roa96] ...... 16 2.4 Anton Webern: Trio op. 20, 1. Satz, Takte 10/11 (from [Kar66]) ...... 17 2.5 Overview of the short-time Fourier transform (from [Roa96]) 18 2.6 Sonogram of human voice – Rendered using the software pack- age ‘Sonagram Visible Speech’ ...... 19 2.7 Analysis filter bands for (a) constant-Q and (b) DFT trans- formation (from [Roa96]) ...... 20 2.8 Illustration of the peak continuation process – spectral peaks from subsequent time frames are combined to form guides. [Ser89] ...... 21 2.9 Block diagram of the analysis into deterministic and residual components [SS90] ...... 22 2.10 Magnitude spectrum from a piano tone with identified peaks (top), residual spectrum (middle), residual approximation (bot- tom) (montage from [SS90]) ...... 22 2.11 Extraction of musical parameters from the analysis represen- tation [Ser97] ...... 23

93 LIST OF FIGURES 94

2.12 Gaussian grain with sinusoidal frequency and phase as well as duration and volume defined by the envelope [Roa02] . . . 23 2.13 The frequency-loudness area of human auditory perception is divided into a matrix of auditory grains (l.h.s) and subjected to a coordinate transformation, yielding a rectangular screen of sonic quanta (r.h.s) [Xen92] ...... 24 2.14 A book of screens equals the life of a complex sound (from [Xen92]) ...... 25 2.15 A complex sound-object moving in the continuum (l.h.s). Fre- quency/timbre cross-section of sound at start, mid-point and end (r.h.s) [Wis96]...... 25 2.16 Structural functions in the theory of spectromorphology [Sma86] 26 2.17 The visualization of a music collection of 359 pieces of music. The numbers label regions where distinct musical character- istics have grouped together. [PRM02] ...... 27 2.18 Granular synthesis clouds in time-frequency representation: (a) Cumulus, (b) Glissandi, (c) Stratus. Each small circle stands for an elementary grain. [Roa96] ...... 30 2.19 Three approaches to time granulation from one or more grain sources. [Roa96] ...... 30 2.20 Block diagram of the synthesis part of SMS [SS90] ...... 31 2.21 Overlap-add resynthesis. The gray areas indicate overlapping spectrum frames. [Roa96] ...... 31 2.22 Approximate learning curve for the kazoo, kalimba, piano and violin, within a period of 10 years. [Jor04a] ...... 33 2.23 A performer can control a few synthesis parameters directly, continuously making adjustments to make perception match intention. [HC07] ...... 36 2.24 Alternatively, a performer can control many parameters using only a few control signals by means of an intervening mapping layer. [HC07] ...... 37 2.25 Typical MIDI controller ‘Doepfer Pocket Control’ [doe08] . . 39 2.26 Source image and motion as seen by the ‘Very Nervous Sys- tem’ [Rok86] ...... 40 2.27 User interacting with the MATRIX [Ove01] ...... 41 LIST OF FIGURES 95

2.28 Laetitia Sonami’s lady’s glove – http://tinyurl.com/3ds4wc, last visited 16.10.2008
2.29 The foam ball cluster as the SQUeeZABLE [Wei99]
2.30 ESCHER system overview [WSR98]
2.31 The ‘ali.jToop’ interpolator [Mom05]
2.32 The Metasurface and AudioMulch [Ben05]
2.33 Musician as a feedback controller. [O’M01]
2.34 The CrumbleBag with cereal, coral and styrofoam fillings. [OE04]
2.35 The control panel of Propellerhead’s ReBirth synthesizer – http://tinyurl.com/6y277u, last visited 16.10.2008
2.36 The reacTable playing surface with visual feedback. [Jor05a]
2.37 The tenori-on playing surface with visual feedback. [Jor05a]
2.38 A screenshot of Levin’s Floo in use. [Lev00]
2.39 The IXI software SpinDrum. Each wheel contains from 1 to 10 pedals. The wheels rotate at various speeds and when a pedal hits the 12 o’clock position it triggers a sample. [Mag06b]

3.1 Flow of gestural patterns through the envisioned instrument architecture.
3.2 The Painting Knife tool is scraped against shag carpet. [MRA08]
3.3 Terhardt’s outer and middle-ear model. The dotted lines mark the center frequencies of the 24 critical bands. [PHH04]
3.4 Sounds (that is, synthesis parameters) are evenly scattered on our map of the sound repertory. The gesture trajectory (from figure 3.1) picks up these parameters in real-time – they are in turn passed to the synthesis engine and transformed to digital audio.
3.5 Region-based quad tree: data points are inserted subsequently – as soon as a region contains more than a predefined number of data points it is divided up into four sub-regions [Sam84]
3.6 Two-dimensional transformation, by subsequent translation, scaling and rotation

3.7 Three coexistent zones of transformation, translating from the gesture to the sound repertory. The BMU in the gesture repertory is close to two of the regions (a bit closer to A than to B) and is therefore transformed twice, weighted by factors according to the affiliation with the two Gaussians. The third zone C is far away, therefore the probability and the related weighting factor are both very low.
3.8 Synthesis parameters are picked up along the gesture trajectory. This positional data is used to look up the sound collection, grains are formed and these in turn added up to the final sound output.

4.1 Technical building blocks
4.2 Preprocessing of gesture and sound repertories
4.3 Program flow for the real-time parts of the system: audio processing (employing spectral modeling synthesis) on the l.h.s. and the graphical user interface on the r.h.s.
4.4 Visualization of a gesture trajectory on the graphical interface. The color-coded SOM representation is overlaid by a trace of gesture positions and one Gaussian transformation zone.
4.5 The entire performance system with painter’s knife input device, screen, 3D mouse and audio interface (from right to left). Computer and sound reinforcement and an optional volume pedal are not visible.

A.1 Main Pure Data patch, including analysis, transformation, synthesis as well as communication with the visualization process.
A.2 Patcher abstraction for input analysis. The volume trace and the MFCC vectors are generated here.
A.3 Transformation abstraction, comprising 10 transformation zones
A.4 Transformation zone abstraction, using some math to translate, scale and rotate 2D coordinates.
A.5 SMS synthesis abstraction, including parameter retrieval, sinusoidal and residuum combination, guide generation and actual FFT-based sound synthesis.

A.6 Network socket communication patch using senders and receivers to talk to the second Pure Data process.
A.7 Main patch of the Pure Data visualization process, displaying a GEM window and two panels therein.
A.8 Panel abstraction showing SOM graphics, trajectory points and 10 transformation zones. Mouse interaction is processed and sent to the Pure Data audio process.

Bibliography

[ABLS01] Xavier Amatriain, Jordi Bonada, Alex Loscos, and Xavier Serra. Spectral modeling for higher-level sound transformation. In Proceedings of the MOSART Workshop on Current Research Directions in Computer Music, 2001.

[All77] Jont B. Allen. Short-term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process., ASSP-25:235–238, 1977.

[AR77] Jont B. Allen and Lawrence R. Rabiner. A unified approach to short-time Fourier analysis and synthesis. Proc. IEEE, 65:1558–1564, 1977.

[BA05] John Bowers and Phil Archer. Not hyper, not meta, not cyber but infra-instruments. In NIME ’05: Proceedings of the 2005 conference on New interfaces for musical expression, pages 5–10, Singapore, Singapore, 2005. National University of Singapore.

[Ben05] Ross Bencina. The metasurface: applying natural neighbour interpolation to two-to-many mapping. In NIME ’05: Proceedings of the 2005 conference on New interfaces for musical expression, pages 101–104, Singapore, Singapore, 2005. National University of Singapore.

[Ber68] Glenn D. Bergland. Numerical analysis: A fast Fourier transform algorithm for real-valued series. Commun. ACM, 11(10):703–710, 1968.

[BFMW05] D. Birnbaum, R. Fiebrink, J. Malloch, and M. M. Wanderley. Towards a dimension space for musical devices. In NIME ’05: Proceedings of the 2005 conference on New interfaces for musical expression, pages 192–195, Singapore, Singapore, 2005. National University of Singapore.

[Bil08] Stefan Bilbao. Numerical Sound Synthesis. John Wiley and Sons, 2008.

[BL81] R.A.W. Bladon and B. Lindblom. Modeling the judgment of vowel quality differences. Journal of the Acoustical Society of America, 69(5):1414–1422, 1981.

[BL04] Michel Beaudouin-Lafon. Designing interaction, not interfaces. In AVI ’04: Proceedings of the working conference on Advanced visual interfaces, pages 15–22, New York, NY, USA, 2004. ACM.

[Bou00] R. Boulanger, editor. The Csound Book: Perspectives in Software Synthesis, Sound Design, Signal Processing, and Programming. The MIT Press, March 2000.

[Bow05] Adam Bowen. Soundstone: a 3-d wireless music controller. In NIME ’05: Proceedings of the 2005 conference on New interfaces for musical expression, pages 268–269, Singapore, Singapore, 2005. National University of Singapore.

[BPMB90] I. Bowler, A. Purvis, P. Manning, and N. Bailey. On mapping n articulation onto m synthesiser-control parameters. In Proc. of the 1990 Int. Computer Music Conf., pages 181–4, San Francisco, CA, 1990. ICMA.

[Bre94] A. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, Cambridge, MA, 1994.

[Bro91] J. C. Brown. Calculation of a constant Q spectral transform. Journal of the Acoustical Society of America, 89(1):425–434, 1991.

[Bro06] P. Brossier. Automatic Annotation of Musical Audio for Interactive Applications. PhD thesis, Queen Mary University of London, UK, August 2006.

[Cag73] John Cage. Silence: Lectures and Writings. Wesleyan University Press, 1973.

[CCH04] Arshia Cont, Thierry Coduys, and Cyrille Henry. Real-time gesture mapping in Pd environment using neural networks. In NIME ’04: Proceedings of the 2004 conference on New interfaces for musical expression, pages 39–42, Singapore, Singapore, 2004. National University of Singapore.

[Cha97] Joel Chadabe. Electric Sound. The Past and Promise of Electronic Music. Prentice-Hall, New Jersey, 1997.

[Cha02] Joel Chadabe. The limitations of mapping as a structural descriptive in electronic instruments. In NIME ’02: Proceedings of the 2002 conference on New interfaces for musical expression, pages 1–5, Singapore, Singapore, 2002. National University of Singapore.

[Chi83] M. Chion. Guide des objets sonores. Buchet/Chastel, Paris, 1983.

[Cho06] Eric H. C. Choi. On compensating the mel-frequency cepstral coefficients for noisy speech recognition. In ACSC ’06: Proceedings of the 29th Australasian Computer Science Conference, pages 49–54, Darlinghurst, Australia, Australia, 2006. Australian Computer Society, Inc.

[Col96] Nicolas Collins. Handmade Electronic Music: The Art of Hardware Hacking. Routledge, London, 1996.

[Coo02] Perry R. Cook. Real Sound Synthesis for Interactive Applications. A. K. Peters, Ltd., Natick, MA, USA, 2002.

[CP05] Alain Crevoisier and Pietro Polotti. Tangible acoustic interfaces and their applications for the design of new musical instruments. In NIME ’05: Proceedings of the 2005 conference on New interfaces for musical expression, pages 97–100, Singapore, Singapore, 2005. National University of Singapore.

[CT65] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19:297–301, April 1965.

[CW00] C. Cadoz and M. M. Wanderley, editors. Trends in Gestural Control of Music. IRCAM, Paris, 2000.

[Cyc08] Cycling ’74. Max/MSP. http://www.cycling74.com, 2008.

[Dan97] Mark Danks. Real-time image and video processing in GEM. In Proceedings of the International Computer Music Conference, pages 220–223, San Francisco, 1997. ICMA.

[Dan08] Mark Danks. GEM. http://gem.iem.at, 2008.

[DEA05] Elizabeth Dykstra-Erickson and Jonathan Arnowitz. Michel Waisvisz: the man and the hands. interactions, 12(5):63–67, 2005.

[DH06] Philip L. Davidson and Jefferson Y. Han. Synthesis and control on large scale multi-touch sensing displays. In NIME ’06: Proceedings of the 2006 conference on New interfaces for musical expression, pages 216–219, Paris, France, France, 2006. IRCAM — Centre Pompidou.

[DJ97] Charles Dodge and Thomas A. Jerse. Computer Music: Synthesis, Composition and Performance. Macmillan Library Reference, 1997.

[doe08] Doepfer Musikelektronik GmbH. http://www.doepfer.de, 2008.

[EO05] Georg Essl and Sile O’Modhrain. Scrubber: an interface for friction-induced sounds. In NIME ’05: Proceedings of the 2005 conference on New interfaces for musical expression, pages 70–75, Singapore, Singapore, 2005. National University of Singapore.

[FB74] R. A. Finkel and J. L. Bentley. Quad trees: a data structure for retrieval on composite keys. Acta Informatica, 4(1):1–9, March 1974.

[FC91] Jean-Loup Florens and Claude Cadoz. The physical model: modeling and simulating the instrumental universe. pages 227–268, 1991.

[FH93] S. Fels and G. Hinton. Glove-talk: A neural network interface between a data-glove and a speech synthesizer. In IEEE Transactions on Neural Networks, volume 4, 1993.

[FM33] H. Fletcher and W. A. Munson. Loudness, its definition, measurement and calculation. J. Acoust. Soc. Am., 5:82–108, 1933.

[Fod02] Imola K. Fodor. A survey of dimension reduction techniques. Technical report, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, 2002.

[FWC07] Rebecca Fiebrink, Ge Wang, and Perry R. Cook. Don’t forget the laptop: using native input capabilities for expressive musical control. In NIME ’07: Proceedings of the 7th international conference on New interfaces for musical expression, pages 164–167, New York, NY, USA, 2007. ACM.

[FZ06] Hugo Fastl and Eberhard Zwicker. Psychoacoustics: Facts and Models. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[Gab46] Dennis Gabor. Theory of communication. J. IEEE, III(93):429–457, 1946.

[Gab47] Dennis Gabor. Acoustical quanta and the theory of hearing. Nature, 159(4044):591–594, 1947.

[Gab52] Dennis Gabor. Lectures on communication theory. Technical Report 238, Research Laboratory of Electronics, MIT, 1952.

[Gan98] S. Gan. Squeezables: Tactile and expressive interfaces for children of all ages. Master’s thesis, MIT Media Lab, 1998.

[Han05] Jefferson Y. Han. Low-cost multi-touch sensing through frustrated total internal reflection. In UIST ’05: Proceedings of the 18th annual ACM symposium on User interface software and technology, pages 115–118, New York, NY, USA, 2005. ACM.

[HC07] Matt Hoffman and Perry R. Cook. Real-time feature-based synthesis for live musical performance. In NIME ’07: Proceedings of the 7th international conference on New interfaces for musical expression, pages 309–312, New York, NY, USA, 2007. ACM.

[HK00] A. Hunt and R. Kirk. Mapping strategies for musical performance. In M. Wanderley and M. Battier, editors, Trends in Gestural Control of Music, pages 231–258. Ircam, Centre Pompidou, Paris, 2000.

[HWP02] Andy Hunt, Marcelo M. Wanderley, and Matthew Paradis. The importance of parameter mapping in electronic instrument design. In NIME ’02: Proceedings of the 2002 conference on New interfaces for musical expression, pages 1–6, Singapore, Singapore, 2002. National University of Singapore.

[III96] Julius O. Smith III. Physical modeling synthesis update. Computer Music Journal, 20(2):44–56, 1996.

[JG06] Colin G. Johnson and Alex Gounaropoulos. Timbre interfaces using adjectives and adverbs. In NIME ’06: Proceedings of the 2006 conference on New interfaces for musical expression, pages 101–102, Paris, France, France, 2006. IRCAM — Centre Pompidou.

[JGAK07] Sergi Jordà, Günter Geiger, Marcos Alonso, and Martin Kaltenbrunner. The reacTable: exploring the synergy between live music performance and tabletop tangible interfaces. In TEI ’07: Proceedings of the 1st international conference on Tangible and embedded interaction, pages 139–146, New York, NY, USA, 2007. ACM.

[Jor01] Sergi Jordà. New musical interfaces and new music-making paradigms. In Proceedings of the 2001 conference on New interfaces for musical expression, pages 1–5. NIME, 2001.

[Jor03] Sergi Jordà. Interactive music systems for everyone: exploring visual feedback as a way for creating more intuitive, efficient and learnable instruments. In Proceedings of the Stockholm Music Acoustics Conference, Stockholm, Sweden, August 2003. SMAC.

[Jor04a] Sergi Jordà. Digital instruments and players part I – efficiency and apprenticeship. In Proceedings of the 2004 conference on New interfaces for musical expression, Hamamatsu, Japan, 2004.

[Jor04b] Sergi Jordà. Digital instruments and players part II – diversity, freedom and control. In Proceedings of the International Computer Music Conference 2004, Miami, USA, 2004. ICMA.

[Jor05a] Sergi Jordà. Digital Lutherie. Crafting musical computers for new musics’ performance and improvisation. PhD thesis, UPF, Barcelona, Spain, 2005.

[Jor05b] Sergi Jordà. Multi-user instruments: models, examples and promises. In Proceedings of the 2005 International Conference on New Interfaces for Musical Expression, Vancouver, Canada, May 2005.

[JS07] Randy Jones and Andrew Schloss. Controlling a physical model with a 2d force matrix. In NIME ’07: Proceedings of the 7th international conference on New interfaces for musical expression, pages 27–30, New York, NY, USA, 2007. ACM.

[Kar66] Erhard Karkoschka. Das Schriftbild der neuen Musik. Hermann Moeck Verlag, Celle, 1966.

[KEDC02] Ajay Kapur, Georg Essl, Philip Davidson, and Perry R. Cook. The electronic tabla controller. In NIME ’02: Proceedings of the 2002 conference on New interfaces for musical expression, pages 1–5, Singapore, Singapore, 2002. National University of Singapore.

[KMGY97] R. Kronland-Martinet, Ph. Guillemain, and S. Ystad. Modelling of natural sounds by time–frequency and wavelet representations. Org. Sound, 2(3):179–191, 1997.

[Koh88] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. pages 509–521, 1988.

[KSH01] T. Kohonen, M. R. Schroeder, and T. S. Huang, editors. Self-Organizing Maps. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2001.

[Lev00] Golan Levin. Painterly interfaces for audiovisual performance. Master’s thesis, MIT Media Arts and Sciences, September 2000.

[Log00] B. Logan. Mel frequency cepstral coefficients for music modeling. In Proceedings of the 1st International Conference on Music Information Retrieval, Plymouth, Massachusetts, 2000.

[lua08] The programming language Lua. http://www.lua.org, 2008.

[Mac99] Tod Machover. Technology and the future of new music, October 1999. http://www.newmusicbox.org/article.nmbx?id=316, visited 5.9.2008.

[Mag05] Thor Magnusson. ixi software: the interface as instrument. In NIME ’05: Proceedings of the 2005 conference on New interfaces for musical expression, pages 212–215, Singapore, Singapore, 2005. National University of Singapore.

[Mag06a] Thor Magnusson. Affordances and constraints in screen-based musical instruments. In NordiCHI ’06: Proceedings of the 4th Nordic conference on Human-computer interaction, pages 441–444, New York, NY, USA, 2006. ACM.

[Mag06b] Thor Magnusson. Screen-based musical interfaces as semiotic machines. In NIME ’06: Proceedings of the 2006 conference on New interfaces for musical expression, pages 162–167, Paris, France, France, 2006. IRCAM — Centre Pompidou.

[Man02] James Mandelis. Adaptive hyperinstruments: applying evolutionary techniques to sound synthesis and performance. In NIME ’02: Proceedings of the 2002 conference on New interfaces for musical expression, pages 1–2, Singapore, Singapore, 2002. National University of Singapore.

[Mat69] Max Mathews. The Technology of Computer Music. MIT Press, 1969.

[McC96] James McCartney. SuperCollider: A new real-time sound synthesis language. In The International Computer Music Conference, pages 257–258, San Francisco, 1996. International Computer Music Association.

[Men02] D. Menzies. Composing instrument control dynamics. Organised Sound, 7(3):255–266, 2002.

[MM07] Thor Magnusson and Enrike Hurtado Mendieta. The acoustic, the digital and the body: a survey on musical instruments. In NIME ’07: Proceedings of the 7th international conference on New interfaces for musical expression, pages 94–99, New York, NY, USA, 2007. ACM.

[Mol68] Abraham Moles. Information Theory and Esthetic Perception. University of Illinois Press, Urbana, 1968.

[Mom05] Ali Momeni. Composing Instruments: Inventing and Performing with Generative Computer-based Instruments. Dissertation, University of California, Berkeley, 2005.

[Moo88] F. Richard Moore. The dysfunctions of MIDI. Comput. Music J., 12(1):19–28, 1988.

[MR07] D. Merrill and H. Raffle. The sound of touch. In Extended Abstracts of CHI 2007, San Jose, CA, 2007.

[MRA08] D. Merrill, H. Raffle, and R. Aimi. The sound of touch: Physical manipulation of digital sound. In Proceedings of the SIGCHI conference on Human factors in computing systems (CHI’08), Florence, Italy, 2008.

[MW03] Ali Momeni and David Wessel. Characterizing and controlling musical material intuitively with geometric models. In NIME ’03: Proceedings of the 2003 conference on New interfaces for musical expression, pages 54–62, Singapore, Singapore, 2003. National University of Singapore.

[MW06] Mark T. Marshall and Marcelo M. Wanderley. Vibrotactile feedback in digital musical instruments. In NIME ’06: Proceedings of the 2006 conference on New interfaces for musical expression, pages 226–229, Paris, France, France, 2006. IRCAM — Centre Pompidou.

[Nak01] Teresa Marrin Nakra. Synthesizing expressive music through the language of conducting. Journal of New Music Research, 31(1):11–26, June 2001.

[NI06] Yu Nishibori and Toshio Iwai. Tenori-on. In NIME ’06: Proceedings of the 2006 conference on New interfaces for musical expression, pages 172–175, Paris, France, France, 2006. IRCAM — Centre Pompidou.

[NW07] Doug Van Nort and Marcelo Wanderley. Control strategies for navigation of complex sonic spaces. In NIME ’07: Proceedings of the 7th international conference on New interfaces for musical expression, pages 379–382, New York, NY, USA, 2007. ACM.

[NWD04] Doug Van Nort, Marcelo M. Wanderley, and Philippe Depalle. On the choice of mappings based on geometric properties. In NIME ’04: Proceedings of the 2004 conference on New interfaces for musical expression, pages 87–91, Singapore, Singapore, 2004. National University of Singapore.

[OE04] Sile O’Modhrain and Georg Essl. Pebblebox and crumblebag: tactile interfaces for granular synthesis. In NIME ’04: Proceedings of the 2004 conference on New interfaces for musical expression, pages 74–79, Singapore, Singapore, 2004. National University of Singapore.

[O’M01] Maura Sile O’Modhrain. Playing by feel: incorporating haptic feedback into computer-based musical instruments. PhD thesis, Stanford, CA, USA, 2001. Adviser: Chris Chafe.

[Opi03] Tim Opie. Creation of a real-time granular synthesis instrument for live performance. Master’s thesis, QUT, Brisbane, Australia, 2003.

[Ove01] Dan Overholt. The matrix: a novel controller for musical expression. In NIME ’01: Proceedings of the 2001 conference on New interfaces for musical expression, pages 1–4, Singapore, Singapore, 2001. National University of Singapore.

[Ove05] Dan Overholt. The overtone violin. In NIME ’05: Proceedings of the 2005 conference on New interfaces for musical expression, pages 34–37, Singapore, Singapore, 2005. National University of Singapore.

[Pen85] Bruce W. Pennycook. Computer-music interfaces: a survey. ACM Comput. Surv., 17(2):267–289, 1985.

[PHH04] E. Pampalk, P. Hlavac, and P. Herrera. Hierarchical organization and visualization of drum sample libraries. In Proc. of the 7th Int. Conference on Digital Audio Effects (DAFx’04). DAFx, October 2004.

[Pir01] Jörg Piringer. Elektronische Musik und Interaktivität: Prinzipien, Konzepte, Anwendungen. Diplomarbeit, Technische Universität, Wien, October 2001.

[PPR91] Giovanni De Poli, Aldo Piccialli, and Curtis Roads, editors. Representations of musical signals. MIT Press, Cambridge, MA, USA, 1991.

[PR98] Giovanni De Poli and Davide Rocchesso. Physically based sound modelling. Org. Sound, 3(1):61–76, 1998.

[Pre97] Jeff Pressing. Some perspectives on performed sound and music in virtual environments. Presence, 6(4):482–503, 1997.

[PRM02] Elias Pampalk, Andreas Rauber, and Dieter Merkl. Using smoothed data histograms for cluster visualization in self-organizing maps. In ICANN ’02: Proceedings of the International Conference on Artificial Neural Networks, pages 871–876, London, UK, 2002. Springer-Verlag.

[Puc91] Miller Puckette. Combining event and signal processing in the Max graphical programming environment. Computer Music Journal, 1991.

[Puc96] Miller Puckette. Pure Data. In Proceedings of the International Computer Music Conference, pages 269–272, 1996.

[Puc08] Miller Puckette. Pure Data. http://puredata.info, 2008.

[pyt08] Python programming language. http://python.org, 2008.

[Ris91] Jean-Claude Risset. Timbre analysis by synthesis: Representations, imitations, and variants for musical composition. pages 7–43, 1991.

[Roa96] Curtis Roads. The computer music tutorial. MIT Press, Cambridge, 1996.

[Roa02] Curtis Roads. Microsound. MIT Press, Cambridge, 2002.

[Rod03] Tara Rodgers. On the process and aesthetics of sampling in electronic music production. Org. Sound, 8(3):313–320, 2003.

[Rok86] David Rokeby. Very nervous system, 1986. http://homepage.mac.com/davidrokeby/vns.html, last visited 01.10.2008.

[Rot96] Joseph Rothstein. MIDI: A Comprehensive Introduction. A-R Editions, Inc., Madison, WI, USA, 1996.

[Rus98] Andre Ruschkowski. Elektronische Klänge und musikalische Entdeckungen. Reclam, Ditzingen, 1998.

[SAH79] M. R. Schroeder, B. S. Atal, and J. L. Hall. Optimizing digital speech coders by exploiting masking properties of the human ear. Acoustical Society of America Journal, 66:1647–1652, 1979.

[Sam84] Hanan Samet. The quadtree and related hierarchical data structures. ACM Comput. Surv., 16(2):187–260, 1984.

[SB98] X. Serra and J. Bonada. Sound transformations based on the SMS high level attributes. Barcelona, 1998.

[Sch66] Pierre Schaeffer. Traité des objets musicaux. Editions du Seuil, Paris, 1966.

[Ser89] X. Serra. A System for Sound Analysis/Transformation/Synthesis based on a Deterministic plus Stochastic Decomposition. PhD thesis, Stanford University, 1989.

[Ser97] Xavier Serra. Musical Sound Modeling with Sinusoids plus Noise, pages 91–122. Swets & Zeitlinger, 1997.

[Shn83] Ben Shneiderman. Direct manipulation: a step beyond programming languages. IEEE Computer, 16(8):57–69, August 1983.

[SJ88] C. A. Scaletti and R. E. Johnson. An interactive environment for object-oriented music composition and sound synthesis. In OOPSLA ’88: Conference proceedings on Object-oriented programming systems, languages and applications, pages 222–233, New York, NY, USA, 1988. ACM.

[Sma86] Denis Smalley. The Language of Electroacoustic Music, chapter Spectro-Morphology and Structuring Processes, pages 61–93. Macmillan Press, 1986.

[Sma97] Denis Smalley. Spectromorphology: explaining sound-shapes. Organised Sound, 2(2):107–26, 1997.

[SR67] P. Schaeffer and G. Reibel. Solfège de l’objet sonore. Ina-GRM, Paris, 1967.

[SS90] X. Serra and J. Smith. Spectral modeling synthesis: A sound analysis/synthesis based on a deterministic plus stochastic decomposition. Computer Music Journal, 14:12–24, 1990.

[sup08] SuperCollider. http://supercollider.sourceforge.net, 2008.

[SV08] Bert Schiettecatte and Jean Vanderdonckt. AudioCubes: a distributed cube tangible interface based on interaction range for sound design. In TEI ’08: Proceedings of the 2nd international conference on Tangible and embedded interaction, pages 3–10, New York, NY, USA, 2008. ACM.

[Ter79] E. Terhardt. Calculating virtual pitch. Hearing Research, 1:155–182, 1979.

[TW05] V. Tyagi and C. Wellekens. On desensitizing the mel-cepstrum to spurious spectral components for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’05), volume 1, pages 529–532, 2005.

[van97] Guido van Rossum. Python. 2(2), Spring 1997.

[Ver05] William Verplank. Haptic music exercises. In NIME ’05: Proceedings of the 2005 conference on New interfaces for musical expression, pages 256–257, Singapore, Singapore, 2005. National University of Singapore.

[VGM02] Bill Verplank, Michael Gurevich, and Max Mathews. The plank: designing a simple haptic controller. In NIME ’02: Proceedings of the 2002 conference on New interfaces for musical expression, pages 1–4, Singapore, Singapore, 2002. National University of Singapore.

[VSM01] Bill Verplank, Craig Sapp, and Max Mathews. A course on controllers. In NIME ’01: Proceedings of the 2001 conference on New interfaces for musical expression, pages 1–4, Singapore, Singapore, 2001. National University of Singapore.

[WAFW07] David Wessel, Rimas Avizienis, Adrian Freed, and Matthew Wright. A force sensitive multi-touch array supporting multiple 2-d musical control structures. In NIME ’07: Proceedings of the 7th international conference on New interfaces for musical expression, pages 41–45, New York, NY, USA, 2007. ACM.

[Wan01] M. M. Wanderley. Performer-instrument interaction: Applications to gestural control of sound synthesis. PhD thesis, Université Pierre et Marie Curie, Paris VI, 2001.

[Wan02] M. M. Wanderley. Gestural control of music. In Proceedings of the International Workshop on Human Supervision and Control in Engineering and Music, pages 101–130, Kassel, Germany, 2002.

[Wan08] Ge Wang. The ChucK Audio Programming Language: A Strongly-timed and On-the-fly Environ/mentality. PhD thesis, Princeton University, 2008.

[WB06] D. L. Wang and G. J. Brown. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press/Wiley-Interscience, 2006.

[Wei99] G. Weinberg. Expressive digital musical instruments for children. Master’s thesis, MIT Media Lab, 1999.

[Wis96] Trevor Wishart. On Sonic Art. Harwood, Amsterdam, 1985/1996.

[WSR98] M. Wanderley, N. Schnell, and J. B. Rovan. ESCHER – modeling and performing composed instruments in real-time. In Proc. of the 1998 IEEE Int. Conf. on Systems, Man and Cybernetics (SMC’98), 1998.

[WWS02] David Wessel, Matthew Wright, and John Schott. Intimate musical control of computers with a variety of controllers and gesture mapping metaphors. In NIME ’02: Proceedings of the 2002 conference on New interfaces for musical expression, pages 1–3, Singapore, Singapore, 2002. National University of Singapore.

[Xen92] Iannis Xenakis. Formalized music: Thought and mathematics in composition. Pendragon, New York, 1971/1992.

[Zic98] David Zicarelli. An extensible real-time signal processing environment for Max. In Proceedings of the International Computer Music Conference, 1998.

[ZZS01] Fang Zheng, Guoliang Zhang, and Zhanjiang Song. Comparison of different implementations of MFCC. J. Computer Science and Technology, 16(6):582–589, September 2001.