Strategies for Managing Timbre and Interaction in Automatic Improvisation Systems

William Hsu

Abstract
Earlier interactive improvisation systems have mostly worked with note-level musical events such as pitch, loudness and duration. Timbre is an integral component of the musical language of many improvisers; some recent systems use timbral information in a variety of ways to enhance interactivity. This article describes the timbre-aware ARHS improvisation system, designed in collaboration with saxophonist John Butcher, in the context of recent improvisation systems that incorporate timbral information. Common practices in audio feature extraction, performance state characterization and management, response synthesis and control of improvising agents are summarized and compared.

William Hsu (musician, educator), San Francisco State University, 1600 Holloway Avenue, San Francisco, CA 94132, U.S.A. Audio documentation related to this article is available on the author's website.

Timbre is an important structural element in free improvisation. When working with a saxophonist or other instrumentalist whose practice incorporates extended techniques and timbre variations, it is important to go beyond note-level MIDI-like events when constructing useful descriptions of a performance. For example, a long tone may be held on a saxophone, with fairly stable pitch and loudness, while the intensity of multiphonics is increased through embouchure control. An experienced human improviser would perceive and respond to this variation.

In this paper, I will focus on systems designed to improvise with human musicians who work with information beyond note-level events. I am mostly interested in what Rowe calls player paradigm systems [1], that is, systems intended to act as an artificial instrumentalist in performance, sometimes dubbed machine improvisation systems. Blackwell and Young [2] also discuss the similar notion of a Live Algorithm, essentially a machine improviser that is largely autonomous.

Earlier machine improvisers typically used a pitch-to-MIDI converter to extract a stream of MIDI note events, which was then analyzed; the system generated MIDI output in response. The best known of these systems is probably George Lewis's Voyager [3]. With the increase in affordable computing power, the availability of timbral analysis tools such as Jehan's analyzer~ object [4] and IRCAM's Zsa descriptors [5], and related developments in the Music Information Retrieval community, researchers have built systems that work with a variety of real-time audio features.

Since 2002, I have worked on a series of machine improvisation systems, primarily in collaboration with saxophonist John Butcher [6,7]. Butcher is well known for his innovative work with saxophone timbre [8]; all my systems incorporate aspects of timbral analysis and management. Our primary design goals were to build systems that perform in a free improvisation context and are responsive to timbral and other variations in the saxophone sound. The latest is the Adaptive Real-time Hierarchical Self-monitoring (ARHS) system, which I will focus on in this paper. In addition to Butcher, my systems have performed with saxophonists Evan Parker and James Fei and percussionist Sean Meehan.

Butcher is mostly interested in achieving useful musical results in performance, rather than exploring aspects of machine learning or other technologies. To this end, I have tried to identify simple behavioral mechanics that one might consider desirable in a human improviser and emulate them with some straightforward strategies. One novel aspect of my work is the integral and dynamic part that timbre plays in sensing, analysis, material generation and interaction decision-making.

Collins [9] observed that a machine improviser is often "tied . . . to specific and contrasting musical styles and situations." For example, a system that tracks timbral changes over continuous gestures may be a poor match for an improviser playing a conventional drumkit. We are attracted to the idea of a general machine improvisation system that can perform with any human instrumentalist. This generality is perhaps mostly appropriate as an ideal for the decision logic of the Processing stage, detailed below. Specific performance situations inform our design decisions, especially in the information we choose to capture and how we shape the audio output.

Related Systems

In this survey of machine improvisation systems that work with timbral information, I will identify the goals and general operational environments for each; discussion of technical details will be postponed to subsequent sections. This is not intended as an exhaustive survey; however, the projects overviewed do represent a spectrum of the technologies used in such systems.

An earlier system is Ciufo's Eighth Nerve [10], which combines sensors and audio analysis; processing and synthesis are controlled by timbral parameters. His later work, Beginner's Mind [11], accommodates a variety of input sound sources and incorporates timbral analysis over entire gestures.

Collins [12] describes an improvisation simulation for a human guitarist and four artificial performers. His emphasis is on extracting event onset, pitch and other note-level information from the input audio.

Yee-King's system [13] uses genetic algorithms to guide improvisations with a saxophonist and a drummer. The improvising agents use either additive or FM synthesis.

Van Nort's system [14] works with timbral and textural information, and performs recognition of sonic gestures in performance. It has been used primarily in a trio with accordion and saxophone.

©2010 ISAST. LEONARDO MUSIC JOURNAL, Vol. 20, pp. 33–39, 2010.

Fig. 1. Block Diagram of the ARHS System. Timbral and gestural information is extracted from the real-time audio stream (Sensing stage), analyzed by the Processing stage, and made available to a virtual improvising ensemble (Synthesis stage).

Young's NNMusic [15] works with pitch and timbral features such as brightness. It has participated in a variety of score-based and improvisatory pieces with a range of instruments.

Casal's Frank [16] combines Casey's Soundspotter MPEG7 feature recognition software [17] with genetic algorithms; Frank has performed with Casal on piano and Chapman Stick.

Organizational Framework

Rowe's Interactive Music Systems [18] characterizes the processing chain of MIDI-oriented interactive music systems as typically consisting of an input Sensing stage followed by a computational Processing stage and an output Response stage. Blackwell and Young [19] discuss a similar organization.

For the purposes of this paper, I will assign operations to subsystems in the following manner. Starting at the input to a machine improvisation system, we capture real-time audio data and subdivide the data into windows. Per-window measurements and features are computed. These operations will be discussed under the Sensing stage.

At this point, the Processing stage takes data collected by the Sensing stage and organizes it into high-level descriptors of the performance state. There is usually post-processing of extracted features, determination of musical structures and patterns, and the development and preparation of behavioral guidelines for improvising agents.

Finally, the outcomes of the Processing stage are fed to subsystems that directly generate audio output; this is the Synthesis stage. Boundaries between the stages tend to blur in real systems; however, this is a convenient framework for discussing information flow.

ARHS System Overview

The Adaptive Real-time Hierarchical Self-monitoring (ARHS) system (Fig. 1) is the latest of my machine improvisation systems. It extracts in real time perceptually significant features of a saxophonist's timbral and gestural variations and uses this information to coordinate the performance of an ensemble of virtual improvising agents.

Timbral and gestural information is extracted from the real-time audio stream (Sensing stage) and made available to a virtual improvising ensemble (Synthesis stage). The Interaction Management Component (IMC) computes high-level descriptions of the performance (Processing stage) and coordinates the high-level behavior of the ensemble.
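To make this division of labor concrete, the following is a minimal, hypothetical Python sketch of the Sensing / Processing (IMC) / Synthesis flow of Fig. 1. The class and function names and the placeholder bodies are my own illustration; they are not drawn from the actual ARHS implementation, which runs in a real-time audio environment.

```python
# Hypothetical skeleton of the three-stage flow (Sensing -> Processing -> Synthesis).
from dataclasses import dataclass

@dataclass
class PerformanceMode:
    loudness: str  # "loud" or "soft"
    tempo: str     # "fast" or "slow"
    timbre: str    # "rough", "smooth", or "n/a" for fast segments

class SensingStage:
    def analyze(self, window):
        # Placeholder: real code would run FFTs, pitch/onset tracking and
        # roughness estimation on the audio window.
        return PerformanceMode("soft", "slow", "smooth")

class InteractionManager:  # Processing stage / IMC
    def update(self, mode):
        # Placeholder: accumulate performance-state statistics and derive
        # behavioral guidelines for the agents.
        return {"response_mode": mode, "trigger": False}

class ImprovisingAgent:  # one member of the Synthesis-stage ensemble
    def act(self, guidelines):
        # Placeholder: start, stop or adjust an output gesture.
        print("agent guideline:", guidelines["response_mode"])

def process_window(sensing, imc, agents, window):
    mode = sensing.analyze(window)   # Sensing
    guidelines = imc.update(mode)    # Processing
    for agent in agents:             # Synthesis
        agent.act(guidelines)
```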


Sensing

The Sensing stage extracts low-level features from the input audio. These features usually include MIDI-like note events, FFT bins and various timbre measurements.

The Sensing Stage in the ARHS System

My earlier work [20] took a relatively compositional approach in emphasizing the detection and management of specific timbral and gestural events. In contrast, the ARHS system works with general event types and performance descriptors. I was interested in constructing compact, perceptually significant characterizations of the performance state from a small number of audio features. The extensive literature on mood detection and tracking from the Music Information Retrieval community has mostly been applied to traditional classical music or pop music (see, for example, Yang [21] and Lu et al. [22]); it suggests that compact performance state descriptors might be usable.

In Lu et al. [23], three primary sets of features were found to be important for mood classification: intensity (essentially loudness/energy), rhythm (tempo, regularity, etc.) and timbral features (brightness, bandwidth, spectral flux, etc.).

For efficient real-time analysis, I chose a simplified trio of features for ARHS: loudness (loud/soft), tempo (fast/slow) and, for timbre, a single measure that correlates well with musical tension/release or consonance/dissonance. Auditory roughness has been found to constitute a significant factor in dissonance judgments [24]. Roughness quantifies interference between partials in a complex tone. My experiences with saxophone tones, with many partials and rich inter-relationships, also indicate that roughness is a useful overall measure that tracks the variation of perceptually significant characteristics. As per Vassilakis [25], roughness is estimated by extracting partials from the audio; for each pair of partials, a contribution to the roughness measure is computed, and the roughness contributions of all pairs are summed. The ARHS system utilizes FFTs and a version of Jehan's analyzer~ [26], customized to compute auditory roughness.
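As an illustration of the pairwise approach to roughness estimation described above, here is a short Python sketch that extracts prominent partials from one analysis window and sums pairwise roughness contributions. It is not the customized analyzer~ code; the peak picking is deliberately crude, and the kernel constants follow a commonly cited Sethares/Vassilakis-style parameterization and should be treated as assumptions.

```python
# Sketch: sum pairwise roughness contributions over spectral peaks.
import numpy as np

def spectral_peaks(frame, sr, n_peaks=12):
    """Return (freqs, amps) of the strongest FFT bins in one analysis window."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    idx = np.argsort(spectrum)[-n_peaks:]          # crude peak picking
    return freqs[idx], spectrum[idx]

def pair_roughness(f1, a1, f2, a2):
    """Roughness contribution of one pair of partials (assumed kernel)."""
    fmin = min(f1, f2)
    s = 0.24 / (0.0207 * fmin + 18.96)             # critical-band scaling
    d = abs(f2 - f1)
    return a1 * a2 * (np.exp(-3.5 * s * d) - np.exp(-5.75 * s * d))

def roughness(frame, sr):
    """Total roughness of one window: sum over all pairs of extracted partials."""
    freqs, amps = spectral_peaks(frame, sr)
    total = 0.0
    for i in range(len(freqs)):
        for j in range(i + 1, len(freqs)):
            total += pair_roughness(freqs[i], amps[i], freqs[j], amps[j])
    return total
```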
Figure 2 shows the main operations of the Sensing stage. The audio input is divided into windows. To form an estimate of the tempo, I segment the audio input into note events, based on pitch estimates and amplitude envelope shapes. Note onsets are determined by changes in the amplitude envelope and changes in pitch rated with a high confidence score by the pitch tracker. A roughness estimate is also computed and averaged over a sliding 1-sec window. A performance mode descriptor, comprising loudness (loud/soft), tempo (fast/slow) and timbre (rough/smooth), is derived.

The computation of roughness requires large analysis windows (>100 millisec); a fast run of notes is often classified as rough because of note transitions and instability within the window. Hence, audio segments classified as fast are not assigned a roughness descriptor. Each 1-sec segment is classified into one of seven performance modes: silence, slow/soft/smooth, slow/soft/rough, slow/loud/smooth, slow/loud/rough, fast/soft, fast/loud. From here on, the ARHS system works primarily with this compact performance state in its material selection decisions. While it is unusual for this much data reduction to be performed so early, this decision seems reasonable in light of the MIR work cited earlier.
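The seven-way classification can be summarized in a few lines of Python. This sketch assumes per-window measurements (mean level, onsets per second, mean roughness) are already available; the threshold values are illustrative assumptions, not the values used in ARHS.

```python
# Sketch: classify one 1-second window into one of seven performance modes.
def classify_mode(mean_level, onsets_per_sec, mean_roughness,
                  silence_level=0.01, loud_level=0.2,
                  fast_onsets=4.0, rough_threshold=0.5):
    if mean_level < silence_level:
        return "silence"
    loudness = "loud" if mean_level >= loud_level else "soft"
    if onsets_per_sec >= fast_onsets:
        # Roughness is unreliable over fast runs (large analysis windows),
        # so fast segments carry no rough/smooth descriptor.
        return "fast/" + loudness
    timbre = "rough" if mean_roughness >= rough_threshold else "smooth"
    return "slow/" + loudness + "/" + timbre
```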
Other Approaches

Most other machine improvisation systems take fairly similar approaches to sensing. Beginner's Mind [27] also uses measurements from Jehan's analyzer~. Van Nort's system [28] works with spectral centroid, spectral deviation, pitch strength and a textural measure (based on Linear Predictive Coding analysis). Yee-King's system [29] extracts pitch and amplitude from input audio, in addition to FFT-based spectral snapshots. NN Music [30] comprises a pitch-oriented analysis function and an audio analysis function that measures loudness, brightness, etc. Frank [31] uses vectors of log frequency cepstral coefficients generated by Soundspotter [32].

The Processing Stage

The Processing stage is where researchers have expended the most effort, employing a variety of approaches to emulate human performance behavior. As mentioned earlier, ARHS's Sensing stage passes a compact performance descriptor to the Processing stage; a straightforward mechanism is then used for material selection and self-evaluation. In most other systems, the decision-making processes of the Processing stage work directly with a profusion of low-level information.

I have found it useful to frame a number of issues encountered here in the form of two questions. Given a high-level description of the performance state, (1) how does the system determine the timing of its performance actions? (This addresses how responsive a system is to the human's performance.) (2) How does the system select musical materials to use in its performance actions? (This addresses how the system chooses and evaluates combinations of musical materials, and whether these decisions can adapt to changing performance conditions.)

In attempts to achieve behavioral characteristics analogous to human performance (Young [33] identifies some useful attributes, such as adaptability and intimacy), several systems have incorporated neural networks and genetic algorithms in their decision-making processes [34–37]. However, in the open environment of free improvisation, it is not easy to define what constitutes desirable outcomes to converge to, or fitness functions for evaluation. Yee-King and Casal [38,39] use timbral matching as a goal, but that seems rather limiting. I will expand on related issues in subsequent sections.

The Processing Stage in the ARHS System

At the beginning of the Processing stage, it is customary for machine improvisation systems to construct high-level descriptions of the human improviser's performance. A common approach is to partition the extracted feature stream into sonic gestures.

My earlier systems work extensively with feature curves organized by gesture [40]. The ARHS system is also capable of this, but takes a somewhat different approach. The ARHS Sensing stage computes a performance mode for each 1-sec window of audio. Its Processing stage uses the 1-sec descriptors and additional information to influence the timing of performance actions and make decisions on material choice.

Short-Term Responsiveness. In an improvisation that we characterize as "responsive" or "tight," the timing of a particular performance action seems tied to observed events in a very small time window. (By performance actions, I mean not just simple call/response activity, but also the small timbral and gestural adjustments improvisers make when engaging in dialog.)

In the ARHS system, I define sets of potential trigger events (PTEs) based on short-term transition patterns in performance modes; they represent "ear-catching" events that encourage a change in the current behavior of an improvising agent. ARHS PTEs include gesture onset and end, sudden changes in performance mode, a long period of a high-intensity performance mode and the sudden occurrence of selected timbral features, for example, a sharp attack caused by a slap-tongue. An agent may respond to a PTE by starting or stopping a phrase, changing some timbral characteristic, etc. If an agent responds to a PTE, the probability that it responds to the next PTE is reduced; similarly, if it ignores the current PTE, it is more likely to respond to the next one. For simplicity, the current ARHS implementation does not distinguish between different PTE types. I feel that this attention to relatively fine-grain timing issues, that is, decisions of when to play (or make an adjustment), is crucial for responsive interactive performance.
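This self-balancing response behavior can be sketched as follows. The initial probability and the scaling factors are illustrative assumptions; ARHS's actual values and bookkeeping may differ.

```python
# Sketch: an agent's probabilistic response to potential trigger events (PTEs).
import random

class TriggerResponder:
    def __init__(self, p_respond=0.5, decrease=0.5, increase=1.5):
        self.p_respond = p_respond   # probability of reacting to the next PTE
        self.decrease = decrease
        self.increase = increase

    def on_pte(self):
        """Decide whether to react to the current PTE, then rebalance."""
        responded = random.random() < self.p_respond
        if responded:
            # Having just responded, become less likely to respond again soon.
            self.p_respond = max(0.05, self.p_respond * self.decrease)
        else:
            # Having ignored this PTE, become more likely to respond next time.
            self.p_respond = min(0.95, self.p_respond * self.increase)
        return responded
```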


Long-Term Adaptivity. Over an extended improvisation, I wanted to emulate interaction behavior that I enjoy in human improvisations, where a mode of dialog may be sustained, then shift significantly as each improviser takes on different roles. An agent should have some reasonable strategy for choosing a performance mode in response to a human performance mode. There should be a mechanism for evaluating a combination of human/agent performance modes and changing the mapping if necessary.

One approach that I tried to avoid is to implement a moment-to-moment mapping of the human's performance state to the system's performance state. Improvisations I enjoy are based not so much on direct mirroring (such as the overly obvious echo gestures that may arise in a delay line-based system), but more on negotiation and dialog. Another problematic situation was reported by Casal [41]; a complex system may also "optimally" and consistently converge to the same output for a specific input gesture. In Young's terminology [42], there should be sufficient Opacity between performance state descriptors and proposed performance actions.

The ARHS system bases material selection on a fuzzy 8-sec summary of the human's performance, comprising frequencies of occurrence of each performance characteristic (fast/slow, loud/soft, rough/smooth). Each frequency is quantized roughly into frequent/moderate/seldom; hence there are 3 x 3 x 3 = 27 different performance summaries. These are mapped to performance modes for guiding agent actions. We will see later that improvising agents synthesize their performance actions, with variations based on performance mode guidelines, instead of processing the input audio stream; this further prevents any locked-in mapping of input performance gesture to output performance action.

Mechanisms are needed for evaluating combinations of musical materials in performance. The ARHS system attempts to "guess" the evaluation of the human improviser from transitions in the human's performance mode statistics. Suppose the human improviser is observed to play somewhat consistently in mode H1 over an extended period, while an agent is playing in mode A1. One might reasonably assume that the human considers such a combination desirable. If the human considers the mode combination to be a musically unacceptable clash of activity, one might reasonably expect the human to change her/his performance mode to adjust to the undesirable situation. Hence, the agent is "encouraged" to continue behavior in mode A1 if it continues to observe the human playing in mode H1. However, if the agent has been playing in mode A1 and observes the human switching to mode H2, the combination of observed mode H1 and agent mode A1 is discouraged; that of observed mode H2 and agent mode A1 is encouraged.

The Adaptive Performance Map (APM) in Fig. 3 is a table that maps each set of observed (human) performance mode statistics to a response (agent) mode. It has 27 entries, corresponding to the 27 performance summaries. Each entry contains one or more recommended modes for an agent. A score is kept for each mapping and is updated to indicate a preference for that mapping. Response modes with high scores are used to guide the agent's performance; modes with low scores are eventually dropped. Hence, an agent's response to a specific human performance mode may change over the course of a performance, depending on the changing preferences of the human improviser.
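A minimal sketch of the summary quantization and the Adaptive Performance Map might look like the following. The quantization boundaries, score increments and drop threshold are illustrative assumptions; only the overall shape (a 27-entry table of scored human-summary/agent-mode combinations) reflects the description above.

```python
# Sketch: quantized 8-second summaries keyed into an Adaptive Performance Map.
from collections import defaultdict

def quantize(fraction):
    """Map a frequency of occurrence (0..1) to seldom/moderate/frequent."""
    if fraction < 0.2:
        return "seldom"
    if fraction < 0.6:
        return "moderate"
    return "frequent"

def summary_key(frac_fast, frac_loud, frac_rough):
    """One of the 3 x 3 x 3 = 27 possible performance summaries."""
    return (quantize(frac_fast), quantize(frac_loud), quantize(frac_rough))

class AdaptivePerformanceMap:
    def __init__(self):
        # Each summary maps to candidate agent response modes with scores.
        self.table = defaultdict(lambda: defaultdict(float))

    def recommend(self, key):
        """Return the highest-scoring response mode for this summary, if any."""
        candidates = self.table[key]
        return max(candidates, key=candidates.get) if candidates else None

    def reinforce(self, key, agent_mode, amount=1.0):
        """Encourage a human-summary / agent-mode combination."""
        self.table[key][agent_mode] += amount

    def discourage(self, key, agent_mode, amount=1.0, drop_below=-3.0):
        """Discourage a combination; drop mappings whose score falls too low."""
        self.table[key][agent_mode] -= amount
        if self.table[key][agent_mode] < drop_below:
            del self.table[key][agent_mode]
```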
Summary. It is convenient to divide the functionality of the ARHS Processing stage into two parts: short-term timing decisions that determine when to play, and long-term material selection decisions that determine what to play. Tying agent actions to Potential Trigger Events enhances temporal responsiveness. The Adaptive Performance Map enables novel combinations of musical materials in performance without an extended training session. While other machine improvisation systems may approach these problems from rather different perspectives, the division of processing into timing and material selection decisions is a useful one and will be applied for comparison with other systems.

Other Approaches

Several systems organize data from the Sensing stage into gesture-based collections. Beginner's Mind [43] segments the input audio into phrases based on loudness and attack heuristics. The mean and standard deviation of pitch and timbral measurements over a phrase are stored and used to identify the type of instrument or sound and differentiate performance styles. A fully autonomous mode was in development, but few details were given [44].

In Hsu [45], parameter curves for loudness, brightness, roughness, etc. are captured on the fly and stored at the phrase level; this retains perceptually significant input gestural shapes. These curves are used to guide the synthesis of audio output from various improvising agents, for example, a waveguide-based instrument. Performance timing is based on detection of specific "hot button" events, such as the presence of a timbral feature.

Van Nort et al. [46] also organize feature vectors into gestures. During a training phase, representative gesture types are used to build a gesture dictionary. In performance, sonic gestures from human improvisers are matched with dictionary entries and influence the Genetic Algorithm (GA)-based evolution of a population of gestures. Different gesture types are mapped to different modes of granulation or delay-based processing. There is little discussion of more general mapping strategies or of how performance timing decisions are made.

In Yee-King [47], a Live Performance system extracts pitch and amplitude from the input audio into a control data store. Pitch and amplitude sequences are used as control data for synthesizing output; it is not clear how detailed performance timing decisions are made. A Sound Design system attempts to match the timbre of the input audio using a GA-based approach. FFT bin analysis data from the Sensing stage represents the target sound; the GA attempts to generate synthesis parameters to match the target.

In Frank [48], MPEG7 feature vectors are used to influence the GA-based co-evolution of populations of individuals. A fitness function, based on timbral similarity to the input sound, determines the winning individuals; the winners are presented as live sound. It is not clear how event timing is decided.

NN Music [49] accumulates statistics of loudness, brightness, etc. to compile a performance state. A library of performance states is used to train a feedforward neural network A. Real-time incoming performance states are matched with the learned states in network A. A second neural network (B) generates synthesis control parameters in response to specific performance states generated by network A. Evolving stochastic processes, not documented in detail, determine the timing of sound event and rhythm generation.

In Bown and Lexer [50], Continuous-Time Recurrent Neural Networks (CTRNNs) are used to control a spectral filter, which adjusts the decay time of piano tones. Subsequent improvisations with CTRNNs involving saxophone and percussion are described briefly in Bown [51].

In his free improvisation simulation, Collins [52] identified a large parameter set for describing the "temperament" of improvising agents that influence each other in a network. These parameters include "shyness" (amplitude of responses), "sloppiness" (degree of deviation from referenced rhythmic material) and "keenness" (probability that an agent responds to a trigger). Collins has also explored the use of reinforcement learning in a MIDI-based system [53]; an agent in some current performance state chooses its next action and evaluates consequent performance states. Behavioral types that indicate mutual influence are reinforced. It should be possible to incorporate timbral features into performance states and performance actions and apply similar learning mechanisms.


The Synthesis Stage

Design decisions in the Synthesis stage tend to target specific musical styles and situations; hence, they are relatively ad hoc compared with those of the Sensing and Processing stages. Collins, for example, has commented on attention to timbral choice [54], which can be important for achieving good musical results.

The Synthesis Stage in the ARHS System

In the ARHS system, the Synthesis stage comprises one or more improvising agents. Each agent has access to output from an Adaptive Performance Map (APM), with high-level guidelines on material selection (fast/slow, loud/soft, rough/smooth). As the human improviser begins to play, potential trigger events (PTEs) are generated. Eventually, an agent activates, consults its APM for a performance mode and synthesizes a gesture based on mode guidelines. As the performance develops, additional PTEs encourage adjustments to the agent's ongoing gesture, and the APM's score for the current combination of human/agent performance modes is updated. Human/agent performance modes with low scores may be discarded and mappings altered. Eventually, the human's performance may change significantly. When the agent begins a new gesture, it will consult its APM to select a response performance mode with a good score and synthesize the new gesture.

Hence, agents can learn workable material combinations during performance and adjust them through an extended improvisation. It may seem restrictive to set an implicit goal of maintaining stable combinations of observed and response modes. An agent can also decide to change its response mode out of "boredom" with an overly extended period of the same mode combination, or choose to specifically defeat the goal of maintaining consistent mode combinations; these are options for future exploration.

As in my earlier systems, each ARHS agent "plays" a virtual instrument by generating parameter curves that control loudness and timbral features such as brightness and roughness (usually via a tremolo envelope), and using these curves to synthesize output sound. Pitch tends to be less important in free improvisation; consequently, pitch is handled in a straightforward way, by extracting the (occasional) stable pitch from the input audio and applying simple transformations. This general approach works well with a variety of synthesis algorithms, from simple filtered noise sources, to waveguide-based models, to metallic-sounding comb filters.
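To illustrate the parameter-curve approach, here is a hypothetical Python sketch in which an agent derives loudness, brightness and tremolo-depth curves from a mode guideline and renders them with a trivial noise-based instrument. The curve shapes, modulation rate and synthesis method are my own placeholders, not the ARHS virtual instruments.

```python
# Sketch: an agent renders a gesture from mode-driven parameter curves.
import numpy as np

def gesture_curves(mode, duration=2.0, rate=100):
    """Control curves (loudness, brightness, tremolo depth) for one gesture."""
    n = int(duration * rate)
    t = np.linspace(0.0, 1.0, n)
    loudness = (0.8 if "loud" in mode else 0.3) * np.sin(np.pi * t)  # swell shape
    brightness = 0.5 + 0.4 * t                                       # opens up over time
    tremolo = np.full(n, 0.6 if "rough" in mode else 0.05)           # "roughness"
    return loudness, brightness, tremolo

def render(loudness, brightness, tremolo, sr=44100, rate=100):
    """Trivial realization: tremolo-modulated noise; brighter = less smoothing."""
    hop = sr // rate
    kernel = np.ones(32) / 32.0
    out = []
    for amp, bright, trem in zip(loudness, brightness, tremolo):
        noise = np.random.uniform(-1.0, 1.0, hop)
        dark = np.convolve(noise, kernel, mode="same")      # heavily smoothed copy
        sig = bright * noise + (1.0 - bright) * dark        # crude brightness control
        t = np.arange(hop) / sr
        am = 1.0 - trem * 0.5 * (1.0 + np.sin(2 * np.pi * 6.0 * t))  # ~6 Hz tremolo
        out.append(amp * am * sig)
    return np.concatenate(out)

# Example: a loud, rough gesture rendered to a sample buffer.
samples = render(*gesture_curves("slow/loud/rough"))
```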

Fig. 2. Sensing stage of the ARHS system. In the Sensing stage, the audio input is divided into windows.

Other Approaches

Many of the systems overviewed here are relatively vague on the specifics of the Synthesis stage. Most involve adaptive processing of the input or synthesis of response materials.

Of the processing-based systems, Beginner's Mind [55] runs the input audio through a reconfigurable signal-processing network including ring modulation, FFT-based analysis/resynthesis, delay lines and granulators. Van Nort's system [56] uses collections of delay lines and granulators to process the input audio; processing parameters are influenced by gestural length and gestural type.

Of the synthesis-based systems, NN Music [57] generates MIDI data to control a Disklavier, or control data for the playback and transformation of samples. In Yee-King's system [58], pitch/amplitude sequences generated in the Processing stage control an additive or FM synthesizer. Frank [59] controls playback of sequences of MPEG7 frames, which can be sourced from previously recorded sound or from live performance. Collins's free improvisation simulation for guitar and agents [60] works primarily with onsets, pitch and other note-level events. Agents play Karplus-Strong and comb filter synthesis modules to blend with the human musician's guitar sound.

Evaluation

The ARHS system is an attempt to encode and emulate some aspects of the musical knowledge and performance practices of my collaborators and myself. It has performed in concerts with John Butcher. As with the other machine improvisation systems overviewed in this paper, the sonic outcomes are the results of layers of adaptive algorithms and random processes, all in real-time interplay with the human musician. The success of such experiments is difficult to assess objectively.

Collins, among others, has observed that the evaluation of interactive music systems (IMSs) "has often been inadequately covered in existing reports . . ." [61]. He proposed three suggestions for evaluating IMSs: (1) technical criteria related to tracking success or cognitive modeling; (2) the reaction of an audience; and (3) the sense of interaction for the musicians who participate.

It is perhaps not so useful to compare disparate systems such as Voyager and Frank, grounded as they are in specific musical practices and situations. Rigorous evaluation procedures are useful, however, for comparing similar systems, to aid in the selection of specific components and the fine-tuning of configurations and parameters as part of the design process.

A detailed discussion of the salient issues is beyond the scope of this paper. Hsu and Sosnick [62] is an attempt to identify major issues and develop rigorous procedures for testing and evaluation from both the performer's and the listener's points of view; the concepts and procedures developed were applied to compare the ARHS system and its predecessor, the London system.

Ultimately, we are interested in whether a proposed combination of strategies and implementations leads to interesting results in a particular musical context. Rigorous human-computer interaction-based evaluation methodologies are tricky to apply to the dynamic user/audience environment of a machine improvisation. In lieu of vague reflections and generalizations, the reader is invited to sample the extensive audio documentation on the author's website.

Fig. 3. Processing stage of the ARHS system. The Adaptive Performance Map is a data structure that maps each set of observed (human) performance mode statistics to a response (agent) mode.


Downloaded from http://www.mitpressjournals.org/doi/pdf/10.1162/LMJ_a_00010 by guest on 29 September 2021 References and Notes Lecture Notes in Computer Science Series (Springer- 45. Hsu [6]. Verlag, 2008). 1. R. Rowe, Interactive Music Systems (Cambridge, MA: 46. Van Nort [14]. MIT Press 1993). 16. D. Casal, “Time after Time: Short-circuiting the Emotional Distance Between Algorithm and Human 47. Yee-King [13]. 2. T. Blackwell, and M. Young, “Self-Organised Improvisation,” Proceedings of the International Com- 48. Casal [16]. Music,” Organised Sound 9(2), 123–136 (2004). puter Music Conference, Belfast, U.K. (2008). 49. Young [15]. 3. G. Lewis, “Too Many Notes: Computers, Complex- 17. M. Casey, “Soundspotting: A New Kind of Pro- ity and Culture in Voyager,” Leonardo Music Journal cess?” in R. Dean, ed., The Oxford Handbook of Com- 50. O. Bown and S. Lexer, “Continuous-Time Recur- 10 (2000). puter Music (New York: Oxford Univ. Press, 2009). rent Neural Networks for Generative and Interac- tive Musical Performance,” EvoMusArt Workshop, 4. T. Jehan and B. Schoner, “An Audio-Driven Per- 18. Rowe [1]. ceptually Meaningful Timbre ,” Proceed- EuroGP, Budapest (2006). ings of the International Computer Music Conference 19. Blackwell and Young [2]. (2001). 51. O. Bown, . 5. M. Malt and E. Jourdan, “Real-Time Uses of Low Level Sound Descriptors as Event Detection Func- 21. Y. Yang, “Music Emotion Classification: A Fuzzy 52. Collins [12]. Approach,” in Proceedings of ACM Multimedia, Santa tions Using the Max/MSP Zsa. Descriptors Library,” 53. N. Collins, “Reinforcement Learning for Live Mu- Barbara, CA (2006) pp. 81–84. in Proceedings of the 12th Brazilian Symposium on Com- sical Agents,” in Proceedings of International Computer puter Music (September 2009). 22. L. Lu et al., “Automatic Mood Detection and Music Conference (2008). 6. W. Hsu, “Managing Gesture and Timbre for Analy- Tracking of Music Audio Signals,” in IEEE Transac- 54. Collins [12] p. 180. sis and Instrument Control in an Interactive Environ- tions on Audio, Speech, and Language Processing 14, No. ment,” Proceedings of the International Conference on New 1 (2006) pp. 5–18. 55. Ciufo [11]. Interfaces for Musical Expression, Paris, France (2006). 23. Lu et al [22]. 56. Van Nort et al. [14]. 7. W. Hsu, “Two Approaches for Interaction Manage- ment in Timbre-Aware Improvisation Systems,” Pro- 24. Vassilakis, P., “Auditory roughness as a Means of 57. Young [15]. ceedings of the International Computer Music Conference, Musical Expression,” in Selected Reports in Ethnomusi- Belfast, U.K. (2008). cology, 12: 119-144. 58. Yee-King [13].

8. P. Clark, “Breaking the Sound Barrier,” The Wire 25. Vassilakis [24]. 59. Casal [16]. (March 2008). See also . 26. Jehan and Schoner [4]. 60. Collins [12]. 9. N. Collins and J. d’Escrivan, eds., The Cambridge 61. Collins [9] p. 183. Companion to Electronic Music (Cambridge, U.K.: Cam- 27. Ciufo [11]. bridge Univ. Press, 2007) p. 182. 28. Van Nort et al. [14]. 62. W. Hsu and M. Sosnick, “Evaluating Interactive Music Systems: An HCI Approach,” Proceedings of the 10. T. Ciufo, “Design Concepts and Control Strate- 29. Yee-King [13]. International Conference on New Interfaces for Musical gies for Interactive Improvisational Music Systems,” Expression, Pittsburgh, U.S.A. (2009). Proceedings of MAXIS International Festival/Symposium 30. Young [15]. of Sound and Experimental Music, Leeds, U.K. (2003). 31. Casal [16]. 11. T. Ciufo, “Beginner’s Mind: An Environment for Sonic Improvisation,” Proceedings of the International 32. Casey [17]. Manuscript Received 1 January 2010. Computer Music Conference (2005). 33. Young [15]. William Hsu works with electronics and real- 12. N. Collins, “Towards Autonomous Agents for Live Computer Music: Realtime Machine Listening and 34. Yee-King [13]. time animation in performance. He is inter- Interactive Music Systems,” Ph.D. Thesis, Univ. of ested in building real-time systems that evoke 35. Van Nort et al. [14]. Cambridge, Cambridge, U.K., 2006. some of the live, tactile qualities of musicians 36. Young [15]. 13. M. Yee-King, “An Automated Music Improviser playing acoustic instruments and the evolv- Using a Genetic Algorithm Driven Synthesis Engine,” 37. Casal [16]. ing, unstable forms and movement observed EvoMusart workshop, evo* 2007, published in Ap- in natural processes. He has performed in the plications of Evolutionary Computing, 4448 of Lecture 38. Yee-King [13]. United States, Asia and Europe; recent collabo- Notes in Computer Science (LNCS 4448) (Springer, April 2007) pp. 567–577. 39. Casal [16]. rators include Peter van Bergen, John Butcher, James Fei, Matt Heckert and Lynn Hershman. 40. Hsu [6]. 14. D. Van Nort et al., “A System for Musical Impro- Since 1992, he has been with the Department visation Combining Sonic Gesture Recognition and 41. Casal [16]. Genetic Algorithms,” Proceedings of Sound and Music of Computer Science at San Francisco State Computing Conference, Porto, Portugal (2009). 42. Young [15]. University, where he teaches and does research in computer music, performance evaluation 15. M. Young, “NN Music: Improvising with a ‘Liv- 43. Ciufo [11]. ing’ Computer,” in R. Kronland-Martinet et al., eds., and high-performance computing. Computer Music Modelling and Retrieval: Sense of Sounds, 44. Ciufo [11].

