
Basics of Real-Time Interactive Audio Processing

Real-time Interaction with Recorded Sounds

DAAD Edgar Varèse Guest-Professorship 2007
TU Berlin – Audiokommunikation
Norbert Schnell
Berlin, April 19 & 20, 2007

INTRODUCTION

intro 1: about myself

my generation

Kraftwerk = Oldies

Ligeti in Hamburg

DX-7

Atari ST

Cubase & Notator

Max (ISPW)

my central europe

Hamburg

Graz / Wien

Paris

intro 2: IRCAM

Création
• 7 studios, Espro hall
• La Saison, Le Festival (June)
• IRCAM On Tour

Pédagogie
• Cursus (new from 2007): 1st year 15 students, 2nd year 7 students
• ECMCT
• Master ATIAM: 1 year, 25 students
• June Training Program
• Software Courses

Recherche
• 8 teams: Instrumental Acoustics, Room Acoustics, Perception and Sound Design, Analysis-Synthesis, Music Representation, Analysis of Musical Practices, Online Services, Real-Time Musical Interaction
• ~90 researchers
• many projects: EU projects, FR projects, industry contracts

Le Forum IRCAM
• software user group, > 500 members (1/3 + 2/3)
• 2 three-day meetings / year
• from packages to sections

Médiation Recherche/Création

Médiathèque

chapter 0: my IRCAM real-time history

platforms / IRCAM software / friends' software

the hardware age
• 1981: 4X (G. Di Giugno)
• 1986: Max 4X (M. Puckette)
• 1987: Mac
• 1988: Max (M. Puckette)
• 1989: NeXT
• 1990: Max (Opcode, D. Zicarelli)
• 1991: ISPW (E. Lindemann et al.), Max ISPW (M. Puckette)

the middle ages
• 1993: SGI MIPS (R10000)
• 1994: Max/FTS (J. Francis, P. Foley, M. Puckette)
• 1996: PureData (M. Puckette)
• 1997: Max/MSP (Cycling '74)
• 1998: jMax (incl. FTS) (F. Dechelle et al.)
• 199x: PC (Linux)

the new age
• 2001: Mac (OS X)
• 2003: Jitter
• 2004: FTM & Co (N. Schnell et al.)

repertoire along the way
• Répons (Boulez, 1981)
• Jupiter (Manoury, 1987)
• Pluton (Manoury, 1988)
• ...explosante-fixe... (Boulez, 1991–1995)
• En Echo (Manoury, 1993)
• Anthème II (Boulez, 1997)
• K... (Manoury, 2001)
• Lolita (Fineberg, 2006)

FTM – what is it?

French Touch Maxtensions
• Max/MSP is still absolutely perfect, but ...
• modular integration platform for everything we develop: analysis/synthesis, recognition, interaction design, ...

Funner Than Messages
• working with complex data types
➡ powerful operators
➡ structured memory (short and long term)

Faster Than Music (rather than Sound)
• forget about latency, organise time
• integration of real-time with off-line processing

FTM & Co

• FTMlib, shared library
• FTM, basic externals
• Gabor, analysis/(re-)synthesis
• MnM, recognition and mapping
• Suivi, score following

FTM releases

FTM freebies and forumers
• FTMlib and FTM basic externals ... released under GPL
• Gabor and MnM ... free (as in free beer)
• Suivi ... classical IRCAM Forum distribution
• download from http://www.ircam.fr/ftm
• documentation under permanent construction
• old wiki: http://freesoftware.ircam.fr/wiki/index.php/FTM
• new wiki (coming soon): http://ftm.ircam.fr/
• PureData port in 2007!!!

Varèse SS 2007

FTM & Co
• Gabor processing scheme
• OSC extension for massive multi-channel applications

Other systems
• Faust (Orlarey et al.)
• ActionScript 3
• + any you prefer (SuperCollider?, PureData?)

chapter 1: my spatial audio

spatial audio at IRCAM

• IRCAM Spat (J.-M. Jot)
• La Timée
• IRCAM Room Acoustics team (O. Warusfel et al.): WFS, binaural rendering
• LISTEN – European IST project (G. Eckel et al.)

(paper excerpts shown on the slides: Proceedings of ICAD 04, Tenth International Conference on Auditory Display, Sydney, July 6–9, 2004 – the LISTEN "Macke Labor" installation: the "Voyage to Tunisia" soundscape, the Elisabeth Macke letters presented with amplitude-based interaction only, and an evaluation of the ca. 700 visits by sonification and visualisation of the recorded tracking data)

S0UNDB1TS

s0undb1ts (Robin Minard & Norbert Schnell)

Varèse SS 2007

interaction with massive multi-channel audio
• applications for the TU WFS system
• applications for 16:9
• S0UNDB1TS ?

chapter 2.1: my analysis/re-synthesis family

granular synthesis
• kind of sampling
• always “works”

• well documented (Truax, Roads, et al.)
• no model
• arbitrary

additive synthesis

• model based on sinusoids
• no noise (→ residual)
• analysis/re-synthesis (sampling the temporal development)
• vast set of low-level parameters (?)
• morphing of harmonic sounds
• see also: resonators, modal synthesis, PSOLA, phase vocoder extensions
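To make the additive model concrete, here is a minimal oscillator-bank resynthesis sketch in Python/NumPy. The partial frequency and amplitude tracks are assumed to come from a prior analysis stage (not shown); function and variable names are illustrative only.

```python
# Minimal additive resynthesis sketch: an oscillator bank driven by per-frame
# partial frequencies and amplitudes, linearly interpolated between frames.
# Analysis (peak picking, partial tracking) is assumed to exist elsewhere.
import numpy as np

def additive_resynth(freqs, amps, sr=44100, hop=512):
    """freqs, amps: (n_frames, n_partials) arrays of Hz / linear amplitude."""
    n_frames, n_partials = freqs.shape
    out = np.zeros((n_frames - 1) * hop)
    phase = np.zeros(n_partials)
    for f in range(n_frames - 1):
        for s in range(hop):
            a = s / hop                                 # interpolation within the frame
            frq = (1 - a) * freqs[f] + a * freqs[f + 1]
            amp = (1 - a) * amps[f] + a * amps[f + 1]
            phase += 2 * np.pi * frq / sr               # integrate instantaneous frequency
            out[f * hop + s] = np.sum(amp * np.sin(phase))
    return out

# usage: a steady 3-partial harmonic tone over 100 frames (hypothetical data)
f0 = 220.0
freqs = np.tile([f0, 2 * f0, 3 * f0], (100, 1))
amps = np.tile([0.5, 0.25, 0.125], (100, 1))
tone = additive_resynth(freqs, amps)
```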

PSOLA
• model based on elementary waveforms
• no noise (→ granular synthesis)
• analysis/re-synthesis (sampling the temporal development)
• no parameters but pitch and waveforms
• independent control of time, pitch and timbre
• speech friendly
• see also: source/filter models, FOF and PAF, granular synthesis, phase vocoder extensions

extended phase vocoder
• model based on the discrete Fourier transform
• phase vocoder standards (Dolson, Laroche)
• + transient detection and preservation
• + shape-invariant processing (waveform preservation)
• + spectral envelope preservation or transformation
• + stereo preservation
• + separation of sinusoids/noise/transients
• high-quality sound transformations (A. Röbel): time-stretching and pitch transposition, cross-synthesis, remixing of sinusoids, noise, and transients
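For reference, a minimal sketch of the standard phase-vocoder time-stretch (the Dolson/Laroche baseline named above). None of the listed extensions (transient, shape, envelope or stereo preservation) are included, and the parameter values are illustrative assumptions.

```python
# Minimal phase-vocoder time-stretch sketch: STFT, per-bin instantaneous
# frequency estimation, phase accumulation at a modified hop, overlap-add.
import numpy as np

def pvoc_stretch(x, stretch=1.5, n_fft=2048, hop=512):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    spec = np.array([np.fft.rfft(f) for f in frames])

    bin_freqs = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft  # nominal phase advance per sample
    out_hop = int(round(hop * stretch))
    phase = np.angle(spec[0])
    out = np.zeros(out_hop * (len(spec) - 1) + n_fft)

    for t in range(len(spec) - 1):
        mag = np.abs(spec[t])
        # deviation of the measured phase increment from the nominal one
        dphi = np.angle(spec[t + 1]) - np.angle(spec[t]) - bin_freqs * hop
        dphi = (dphi + np.pi) % (2 * np.pi) - np.pi             # wrap to [-pi, pi]
        true_freq = bin_freqs + dphi / hop                      # instantaneous frequency per bin
        frame = np.fft.irfft(mag * np.exp(1j * phase)) * win
        out[t * out_hop:t * out_hop + n_fft] += frame           # overlap-add at the new hop
        phase += true_freq * out_hop                            # accumulate phase at the output hop
    return out
```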

physical modelling
• analysis/synthesis ???
• modal analysis
• finite element systems etc.
• and... inversion of physical models
• physical model: physical parameters → sound
• inversion: sound → physical parameters
• gestures ↔ physical parameters ??
• come back in 3–5 years
• or start your thesis now...

Varèse SS 2007

• real-time signal processing basics based on Gabor tutorials
• relationships between different models
• experimentation and experience

chapter 2.2: real-time interactive content based audio processing

real-time interactive content based audio processing – the concept

synthesis based on analysis
• segmentation
• audio descriptors
• audio representations

content based interaction
• (real-time) rendered representations
• representation = interface
• “touch the sound”

integration of off-line and real-time processing

→ interaction with recorded sounds

X-Micks
“Interactive Content Based Real-Time Audio Processing”

Norbert Schnell Diemo Schwarz Remy Müller [email protected] [email protected] [email protected]

Real Time Applications, IRCAM, Paris, France

“Interactive Content Based Real-Time Audio Processing”, an emerging paradigm:
• real-time rendering of the interaction interface according to the audio content
• robustness and intuitiveness of the representation in terms of interaction
• integration of off-line analysis with real-time analysis and re-synthesis

3 classics in a matrix
• spectrogram
• step sequencer
• filter bank

The X-Micks example: re-mixing two beat-synchronized songs on the fly

X-Micks graphical user interface
• two decks (A and B), 16 beats each, with a chosen source soundfile (sources shown in the screenshot: A2 - Polyester.wav, Madonna - Die.wav)
• per deck: choose source, choose beats (row: cmd, col: cmd-shift), toggle color, on/off, level
• representation = interface: beat-synced spectrogram (12 bands), beat-synced step sequencer, filter bank, fake-player
• peak selection and auto-contrast per deck, peak color (cmd-slider for continuous peak selection)
• A/B crossfade

X-Micks implementation
• inputs per deck: audio stream + beat signal
• analysis: SFFT (Gabor/FTM) → bark-scale reduction to 12 coefficients per analysis frame (log onset energy) → beat integration to 12 coefficients per beat
• the resulting matrix representation is what the user interacts with
• processing: filter construction & convolution from the (edited) matrix → IFFT & overlap-add → audio stream
• Gabor/FTM (http://www.ircam.fr/ftm): optimized data for Max/MSP, overlap-add audio processing operators, audio extractors and SDIF support
• GUI using Max/MSP Pluggo
• a collaboration with Native Instruments

X-Micks functional overview

X-Micks analysis stage

Developed in the framework of the European project SemanticHIFI.

X-Micks video (Schnell et al.)
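The analysis stage can be sketched roughly as follows: STFT frames are reduced to 12 band energies and averaged over each beat, yielding the 12-coefficients-per-beat matrix that the interface displays and edits. The band edges, hop size and log scaling below are simplifying assumptions, not the values actually used in X-Micks.

```python
# Schematic sketch of an X-Micks-like analysis stage: per-frame reduction to
# 12 band energies (a bark-like scale), then averaging over each beat.
import numpy as np

def band_coefficients(frame, band_edges, sr=44100, n_fft=1024):
    """Reduce one audio frame to 12 log band energies."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    coeffs = np.empty(len(band_edges) - 1)
    for b in range(len(band_edges) - 1):
        sel = (freqs >= band_edges[b]) & (freqs < band_edges[b + 1])
        coeffs[b] = np.log(mag[sel].sum() + 1e-12)
    return coeffs

def beat_matrix(audio, beat_samples, sr=44100, frame=1024, hop=256):
    """12 coefficients per beat: average the per-frame coefficients inside each beat."""
    edges = np.geomspace(50.0, sr / 2, 13)        # 12 roughly log-spaced bands (assumption)
    beats = []
    for start, end in zip(beat_samples[:-1], beat_samples[1:]):
        frames = [band_coefficients(audio[i:i + frame], edges, sr, frame)
                  for i in range(start, end - frame, hop)]
        beats.append(np.mean(frames, axis=0))
    return np.array(beats)                        # shape (n_beats, 12): one column per beat
```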

Corpus-based synthesis
• based on audio indexing
• “inversion” of analysis by fast search
• current speech synthesis technique: concatenative synthesis (diphones), typically large databases for each speaker, driven by phonetics and grammar

CataRT application (Schwarz et al.)
• based on FTM & Co
• corpus-based granular synthesis
• currently ~12 descriptors
• 2D navigation interface
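As a toy illustration of the “inversion of analysis by fast search” idea: units are indexed by a few descriptors, and playback retrieves the unit nearest to a target descriptor point (e.g. the cursor of a 2D navigation interface). The descriptors and the brute-force search below are illustrative simplifications, not CataRT’s actual implementation.

```python
# Toy corpus-based selection sketch: index sound units by descriptors and
# retrieve the nearest unit to a target point by search.
import numpy as np

def describe(unit, sr=44100):
    """Two toy descriptors per unit: log energy and spectral centroid."""
    mag = np.abs(np.fft.rfft(unit * np.hanning(len(unit))))
    freqs = np.fft.rfftfreq(len(unit), 1.0 / sr)
    energy = np.log(np.sum(mag ** 2) + 1e-12)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    return np.array([energy, centroid])

def build_corpus(audio, unit_len=2048, hop=1024):
    units = [audio[i:i + unit_len] for i in range(0, len(audio) - unit_len, hop)]
    descr = np.array([describe(u) for u in units])
    descr = (descr - descr.mean(0)) / (descr.std(0) + 1e-12)  # normalise descriptor ranges
    return units, descr

def select_unit(target, units, descr):
    """Nearest-neighbour lookup: the 'inversion' of the analysis by search."""
    i = int(np.argmin(np.sum((descr - target) ** 2, axis=1)))
    return units[i]
```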

chapter 3: real-time audio analysis, descriptors and extractors

feature extraction
• classical ingredient of real-time
• following vs. estimation
• music information retrieval (MIR)
• descriptors and extractors
• which features – where and why?
• perception modelling

→ statistical modelling and recognition
→ observation (vs. estimation/extraction)

Varèse SS 2007

• real-time signal processing basics based on Gabor
• FEAPI?
• experimentation with different descriptors, sources, paradigms

chapter 4.1: recognition based interaction (audio)

between signals and symbols

the good old sound / symbolic separation
• audio vs. MIDI
• live sampling and sound FX vs. generative processes
• a historical issue?
• still present in Max/MSP and many other systems (not in FTM/Gabor!)

walking on the borderline (for a long time now)
• CAC and synthesis control
• “high-level audio descriptors”
• performing synthesis by gestures or voice
• performing and notation
• the vocabulary of contemporary music ... & dance?!

latency vs. synchron(icit)y

hard latency
• ISPW < 20 msec (~ 40 msec)
• Mac OS X & recent Linux kernels/tweaks < 5 msec
• network Europe – America > 100 msec

soft latency
• YIN pitch estimation ~ 2 waveform periods = 40 msec at 50 Hz (2 × 20 ms)
• SuperVP high-quality processing ~ 100 msec (4K-point FFT)
• recognition of a note with glissando and vibrato ?
➡ modelling of temporal processes
➡ anticipation

Score following

Score following in the 20th century

• Dannenberg 1984, “Computer accompaniment”
• Vercoe 1984, “Synthetic performer”, “Listen, perform, learn”
• Puckette, Manoury 1987, “Score following”, “La Partition Virtuelle”
• Baird, Blevins, Zahler 1990, “Artificially Intelligent Computer Performer”
(alongside the pieces: Jupiter 1987, Pluton 1988, ...explosante-fixe... 1991–1995, En Echo 1993, Anthème II 1997)

How did this work?

Jupiter

20th century follower

acoustic instrument → conversion of the acoustic signal to MIDI notes → real-time matching against the symbolic MIDI score

→ score position, tempo

21st century follower

symbolic score → training → probabilistic model (HMM); acoustic instrument → observation & decoding

→ score position, tempo

• Raphael 1999, automatic segmentation and accompaniment
• Loscos, Cano, Bonada 2000, following of the singing voice for Japanese karaoke
• Orio @ IRCAM 2001
• 2005: 2nd generation IRCAM score-follower
• just the other day...

chapter 4.2: recognition based interaction (gesture)

gesture following/recognition (F. Bevilacqua et al.)
• originally developed for dance capture
• recognition of gestures
• following of a recognised gesture
• any kind of input: video motion capture, sensors, audio extractors

gesture follower vs. score follower
• same HMM technique
• no symbolic representation
• statistical model created from examples (recordings)

from the gesture-follower paper (Bevilacqua et al.):

algorithm
• a “gesture” is the multidimensional data stream produced by the capture device, stored as a matrix (rows = time indices, columns = sensor parameters); any multimodal set of sensor curves fits, as long as all curves share the same sampling rate
• following: during the performance, the follower continuously reports the time location (index) in the recorded reference, i.e. a real-time time warping of the performed gesture onto the reference
• comparing/recognizing: the same process runs against several references simultaneously; a likelihood per reference is updated continuously while the gesture unfolds, and recognition simply selects the highest likelihood at a chosen time
• learning from a single example: the recorded example, downsampled typically by a factor of 2, becomes a left-to-right Markov chain with one state per kept sample; for a downsampling factor n the transition probabilities are 1/n; each state i carries a Gaussian observation model whose mean μ_i is the recorded sample value and whose variance is a user-adjusted factor matching the expected variation between performed and recorded gestures (not critical, since recognition is a comparison process)
• decoding: with α_t(i) the forward probability of the partial observation sequence O_1 ... O_t and state i, the reported time index is j(t) = argmax_i [α_t(i)], and likelihood(example) = Σ_i α_t(i)
• implementation: a set of Max/MSP modules in the MnM toolbox of FTM (LGPL); an example ships with the FTM package under MnM/example; capture hardware: a wireless XBee-based sensor module with a 3D accelerometer, also used for augmented string instruments

pedagogical experiments (music school “Atelier des Feuillantines”, Paris)
• the follower’s time index directly drives the playback position of a soundfile; time-stretching uses granular synthesis or a phase vocoder implemented with the Gabor library of FTM
• record mode: the teacher records a reference gesture while listening to the soundfile; play mode: the playback speed follows the student’s gesture, and several soundfiles can be selected or mixed according to the likelihoods
• experiment 1 (conducting): a usual beat-pattern gesture recorded over an excerpt (the Rite of Spring, chosen for its changes of metre); since the whole gesture is tracked rather than only the beats, playback speed reflects the overall movement quality, giving direct sonic feedback on smoothness and fluidity
• experiment 2 (free gesture exploration): students associate freely chosen gestures with soundfiles, including their own voice recordings, e.g. to alter rhythm and prosody
• discussion: smoothness and fluidity, breathing, and the link between musical intention and gesture emerged as the key pedagogical points; a single recording suffices to set up the system, which makes such scenarios easy to adjust
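A minimal sketch of the simplified single-example HMM follower described above (one-dimensional sensor stream, Gaussian observations, left-to-right transitions with probability 1/n, forward decoding). Class and variable names are assumptions for illustration; the real MnM modules operate on FTM matrices inside Max/MSP.

```python
# Single-example HMM gesture follower sketch: states = downsampled reference
# samples, Gaussian observations, forward algorithm reporting a time index
# (time warping) and a per-step likelihood (for recognition).
import numpy as np

class GestureFollower:
    """One follower per recorded reference gesture (1-D stream for simplicity)."""

    def __init__(self, reference, downsampling=2, sigma=0.1):
        self.means = np.asarray(reference, dtype=float)[::downsampling]  # one state per kept sample
        self.n = downsampling
        self.sigma = sigma            # user-set expected variation vs. the reference
        self.alpha = None             # forward probabilities

    def start(self):
        self.alpha = np.zeros(len(self.means))
        self.alpha[0] = 1.0

    def step(self, x):
        """Feed one sensor value; return (time index in the reference, step likelihood)."""
        # left-to-right transitions: stay or advance one state, each with probability 1/n
        stay = self.alpha / self.n
        advance = np.zeros_like(self.alpha)
        advance[1:] = self.alpha[:-1] / self.n
        d = (x - self.means) / self.sigma               # Gaussian observation for every state
        self.alpha = (stay + advance) * np.exp(-0.5 * d * d)
        likelihood = self.alpha.sum()                   # sum_i alpha_t(i) for this step
        if likelihood > 0:
            self.alpha /= likelihood                    # normalise to avoid numerical underflow
        index = int(np.argmax(self.alpha)) * self.n     # time-warped position in the reference
        return index, likelihood

# usage: one follower per reference; recognition picks the highest likelihood
follower = GestureFollower(reference=np.sin(np.linspace(0, 3, 200)))   # hypothetical recorded curve
follower.start()
for x in np.sin(np.linspace(0, 3, 400)):               # a slower performance of the same shape
    index, lik = follower.step(x)
# 'index' would drive e.g. the soundfile playback position, as in the experiments above
```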

In this case, the system ccoommppuuttee aallssoo tthhee likelihood values for each reference to match the 3.2.1 Learning ppeerrffoorrmmeedd ggeessttuure. An example of this process is illustrated in The learning process is illustrated in Figure 4 where the temporal FFiigguurree 33 wwhheerree the performed gesture is compared to two other curve is modeled as left-to-right Markov chain. The learning eexxaammpplleess.. example is first downsampled, typically by a factor 2, and each sample value is associated to a state of the Markov chain. AAss sshhoowwnn iinn Figure 3 the likelihood values are updated Assuming a constant sampling rate, the left-to-right transition ccoonnttiinnuuoouussllyy wwhhile the performed gesture is unfolding. The result probabilities are constant and directly related to the downsampling ooff tthhee rreeccooggnniittiioon can therefore vary from the beginning, middle factor. For example, in the case of downsampling of factor n, the oorr tthhee eenndd ooff tthhe performed gesture. Gesture recognition can be transition probabilities are equal to 1/n, ensuring the Markov aacchhiieevveedd bbyy ssiimmply selecting the highest likelihood, at a chosen chain to model adequately the temporal behavior of the learning ttiimmee.. example. This is my house (2005/2006) Myriam Gourfink the augmented violin

violin + sensors
• bow acceleration (3 axes)
• bow pressure
• bow position
• optical pick-up

pieces by Florence Baschet, Franck Bedrossian, Philippe Manoury ... and others
(Nicolas Rasamimanana)

chapter 4.3: recognition based interaction (next)

multi-modal modelling/recognition ... of musical gestalts

• performance acquisition: audio, sensors, video
• observation modelling: instrument/interface dependent
• temporal modelling: hierarchical temporal structures
• vocabulary
➡ toolbox: Max + FTM & Co (MnM, Gabor, ...)

Varèse SS 2007 – practical matters

Thursday 2–5 pm c.t.
Friday 2–5 pm c.t.

Next sessions (fundamentals): May 10/11, May 24/25, June 14/15

Assignments and projects
• purely written / scientific
• or functional and documented (Max/MSP, Faust, SC, MATLAB, ...)
→ suggestions to follow!