
Bowing the violin

A case study for auditory-motor pattern modelling in the context of music performance

Quim Llimona Torras

Sonologia

Enric Guaus Térmens

2013/2014

ABSTRACT

This project addresses methodological and technological challenges in the development of multi-modal data acquisition and analysis methods for the representation of instrumental playing technique in music performance through auditory-motor patterning models. The case study is violin playing: a multi-modal database of violin performances has been constructed by recording different musicians while playing short exercises on different violins. The exercise set and recording protocol have been designed to sample the space defined by dynamics (from piano to forte) and tone (from sul tasto to sul ponticello), for each type of bow stroke being played on each of the four strings (three different pitches per string) at two different tempi. The data, containing audio, video, and motion capture streams, has been processed and segmented to facilitate upcoming analyses. From the acquired motion data, the positions of the instrument string ends and the bow hair ribbon ends are tracked and processed to obtain a number of bowing descriptors suited for a detailed description and analysis of the bow motion patterns taking place during performance. Likewise, a number of sound perceptual attributes are computed from the audio streams. Besides the methodology and the implementation of a number of data acquisition tools, this project introduces preliminary results from analyzing bowing technique on a multi-modal violin performance database that is unique in its class. A further contribution of this project is the data itself, which will be made available to the scientific community through the repovizz platform.


ABSTRACT (CATALÀ)

Aquest projecte adreça reptes metodològics i tècnics en el desenvolupament de mètodes d'adquisició i anàlisi de dades multi-modals per la representació de tècniques instrumentals en la interpretació musical mitjançant models de patrons auditiu-motors. El cas d'estudi és la interpretació del violí: s'ha construït una base de dades multi-modal amb músics tocant diferents exercicis al violí enregistrant-los amb diferents instruments. El conjunt d'exercicis i el protocol d'enregistrament s'han dissenyat per mostrejar l'espai definit per la dinàmica (de piano a forte) i to (de sul tasto a sul ponticello), per cada tipus d'arcada tocada en cadascuna de les quatre cordes (i tres notes diferents per corda) a dos tempi diferents. Les dades, que contenen fonts d'àudio, vídeo i captura de moviment, s'han processat i segmentat per facilitar anàlisis posteriors. A partir de les dades de moviment adquirides, la posició dels límits de les cordes de l'instrument i de les cerres de l'arc es segueixen i processen per tal d'obtenir un nombre de descriptors de l'arc, que permeten fer una descripció i anàlisi detallada dels patrons de moviment de l'arc durant l'acte interpretatiu. De manera similar, es calculen uns quants atributs perceptuals a partir de les fonts d'àudio. A part de la metodologia i implementació d'un conjunt d'eines per a l'adquisició de dades, aquest projecte introdueix resultats preliminars fruit de l'anàlisi de la tècnica de l'arc en una base de dades multi-modal amb interpretacions de violí única en la seva espècie. Una altra contribució del projecte són les dades en si mateixes, que es posaran a la disponibilitat de la comunitat científica mitjançant la plataforma repovizz.


ABSTRACT (ESPAÑOL)

En este proyecto se tratan retos metodológicos y técnicos en el desarrollo de métodos de adquisición y análisis de datos multi-modales para la representación de técnicas instrumentales en la interpretación musical mediante modelos de patrones auditivo-motores. El caso de estudio es la interpretación del violín: se ha construido una base de datos multi-modal con músicos tocando diferentes ejercicios al violín grabados con distintos instrumentos. El conjunto de ejercicios y el protocolo de grabación han sido diseñados para muestrear el espacio definido por la dinámica (de piano a forte) y tono (de sul tasto a sul ponticello), por cada tipo de arcada tocada en cada una de las cuatro cuerdas (y con tres notas distintas por cuerda) a dos tempi distintos. Los datos, que contienen fuentes de audio, vídeo y captura de movimiento, se han procesado y segmentado para facilitar posteriores análisis. A partir de los datos de movimiento adquiridos, se sigue la posición de los límites de las cuerdas del instrumento y de las cerdas del arco para procesarlas y obtener unos descriptores del arco, que permiten hacer una descripción y análisis detallado de los patrones de movimiento del arco durante la interpretación. De forma similar, se calculan atributos perceptuales a partir de las fuentes de audio. Aparte de la metodología e implementación de un conjunto de herramientas para la adquisición de datos, este proyecto introduce resultados preliminares fruto del análisis de la técnica de arco en una base de datos multi-modal sobre la interpretación del violín única en su especie. Otra contribución del proyecto son los datos en sí mismos, que se publicarán para su uso en la comunidad científica mediante la plataforma repovizz.


Contents

ABSTRACT iii

ABSTRACT (CATALÀ) v

ABSTRACT (ESPAÑOL) vii

List of Figures xi

List of Tables xiii

1 INTRODUCTION 1
  1.1 Context 1
    1.1.1 The author 1
    1.1.2 The MUSMAP project 2
  1.2 Motivation 3
  1.3 Background: the violin 6
    1.3.1 Basics of the bowed string motion 8
    1.3.2 Bowing control parameters 9
    1.3.3 Early studies on bowing control in performance 11
    1.3.4 Acquisition of bowing parameters in violin performance 11
  1.4 Objectives 13
  1.5 Structure of the thesis 14

2 EXPERIMENTAL DESIGN 17
  2.1 Sampling space 18
    2.1.1 Performer 19
    2.1.2 Instrument 19
    2.1.3 Articulation 22
    2.1.4 Sounding point 22
    2.1.5 Dynamics 23
    2.1.6 Pitch (string and string length) 23
    2.1.7 Bow direction 23
  2.2 Sample permutations (score) 24

3 DATA ACQUISITION 27
  3.1 Overview 27
  3.2 Motion Capture 27
    3.2.1 Body markers 28
    3.2.2 Violin markers 29
    3.2.3 Bow markers 31
    3.2.4 Load cell markers 33
  3.3 Audio 34
  3.4 Video 35
  3.5 Load cell 36
  3.6 Synchronization 37
  3.7 Recording protocol 38

4 FEATURE EXTRACTION 39
  4.1 Overview 40
    4.1.1 High-level features 40
    4.1.2 Low-level features 41
  4.2 Computation of low-level descriptors 41
    4.2.1 Vector basis 41
    4.2.2 Bow to string distances 44
    4.2.3 Bow deformation 45
  4.3 Computation of high-level features 45
    4.3.1 Noise reduction 47
  4.4 Force estimation 47
    4.4.1 Load cell calibration 48
    4.4.2 Regression model 49
    4.4.3 Evaluation of the force estimation process 53
  4.5 Audio features 54
    4.5.1 Pitch 55
    4.5.2 Aperiodicity 55
    4.5.3 Energy 55

5 DATABASE 57
  5.1 Score-performance alignment 57
    5.1.1 Zero-crossing finder 58
    5.1.2 Graphical User Interface and program flow 58
  5.2 Annotations 60
    5.2.1 Look-up tables 61
  5.3 The repovizz platform 62

6 PRELIMINARY DATA ANALYSIS 65
  6.1 Introduction 65
    6.1.1 Player selection 67
  6.2 Bowing technique 67
  6.3 Dynamics 70
  6.4 Tone 73
  6.5 Duration 76
  6.6 Player 78

7 CONCLUSION 81
  7.1 Achievements 81
  7.2 Future work 82
  7.3 Acknowledgments 82

Bibliography 85

Appendices 87

A MUSICAL SCORES 91

B QUESTIONNAIRE 109


List of Figures

1.1 From an instrumental gesture perspective, musical score, instrumental gestures, and produced sound represent the three most accessible entities for providing valuable information on the music performance process. 5
1.2 Music versus dance notation 6
1.3 Parts of the violin 6

2.1 Admittance and radiation plots 20
2.2 Excerpt of the score that was given to the musicians 25

3.1 Violins used during the recordings, ordered from left to right. 30
3.2 Bow that was used for the recordings. 32
3.3 Pickup and close-up microphone mounted on the violin. 35
3.4 Load cell mounted on a support with motion tracking markers. 36
3.5 Calibrated weights that were used for characterizing the load cell. 37

4.1 Block diagram of the feature extraction process 39
4.2 Illustration of the violin and bow planes, the sounding points and some of the basis vectors 42
4.3 Illustration of the hair ribbon and string deflection, together with the shortest segment joining the sounding points 42
4.4 Plot of the force measured during the load cell characterization. 48
4.5 Fitted load cell polynomial together with the measured samples 49
4.6 Bow force regression example 50
4.7 Histograms of the load cell recordings 52

5.1 Output of the automatic segmentation software. 59
5.2 Screenshot of the repovizz visualizer displaying a datapack with one of the takes from MUSMAP I 63

6.1 Bow-bridge distance vs bow force for Players 1, 2 and 3 67
6.2 Scatter plots for legato and martele bow strokes 68
6.3 Bow velocity and force temporal profiles for legato and martele bow strokes 69
6.4 Audio energy and aperiodicity temporal profiles for legato and martele bow strokes 69
6.5 Scatter plots for forte, mezzoforte and piano bow strokes 70
6.6 Scatter plots for forte, mezzoforte and piano bow strokes, separating legato and martele notes 71
6.7 Bow velocity and force temporal profiles for forte, mezzoforte and piano, separating legato and martele bow strokes 72
6.8 Audio energy temporal profiles for forte, mezzoforte and piano, separating legato and martele bow strokes 73
6.9 Bow velocity, force and distance to bridge for legato and martele bow strokes, grouped by dynamics. 74
6.10 Scatter plots for sul tasto, ordinary and sul ponticello bow strokes 74
6.11 Scatter plots for sul tasto, ordinary and sul ponticello bow strokes, separating legato and martele notes 75
6.12 Bow force and distance to bridge for legato and martele bow strokes, grouped by dynamics and tone. 76
6.13 Duration of half and quarter notes, for legato and martele strokes, and bow-bridge distance vs bow force and bow velocity vs bow force scatter plots, also for half and quarter notes. 77
6.14 Comparison of bow-bridge distance for various tones and dynamics and bow velocity for various articulation types and dynamics, for each player 79

List of Tables

4.1 Training statistics for the final regression parameters...... 53

6.1 Expected duration of the notes...... 77


Chapter 1

INTRODUCTION

1.1 Context

1.1.1 The author

I am a Sonology student at Escola Superior de Música de Catalunya (ESMUC), one of the leading music academies in Spain, and an Audiovisual Engineering student at Universitat Pompeu Fabra (UPF).

I came to work on this project thanks to the experience in multimodal data processing and analysis that I gained during my stay with the EU project SIEMPRE, where I developed web applications for experimental data storage, browsing and visualization, as well as during my violin professor's PhD work, where I recorded several violin students playing, using multiple microphones, video cameras and motion capture sensors.

This interest in music, science and technology goes back a long way: in high school I was already playing with music software in my free time, and as my senior baccalaureate project I built the Theremidi, a musical robot capable of playing the theremin, an electronic instrument, given a standard MIDI input. I had learnt how to code some years before because I have always liked building things, and I find producing tangible and interactive objects especially interesting; music is perfect for that. I also like paying close attention to detail and properly architecting everything I build, which, combined with my passion for research and mathematical modelling, describes me quite accurately.

Above all the former points, this project is about violins because I have myself been a violinist since I was a kid, and both the instrument and its sound have always been such an integral part of my life that it felt natural to work around it, only at a new level. More than 15 years of experience in playing the violin, both alone and as part of an orchestra, plus my recent incursions into modern genres, have led me over time to wonder about most of the questions that this project intends to answer with a scientifically sound approach, putting into practice everything I have learnt during my training both as a sonologist and as an engineer.

1.1.2 The MUSMAP project

This project is part of a Marie Curie action called MUSMAP, led by Esteban Maestre from the Music Technology Group at Universitat Pompeu Fabra (Department of Information and Communication Technologies). The action involves spending the first part of the project in Montreal, Canada, where I have been doing all the recordings within the Computational Acoustic Modeling Laboratory at McGill University and the Centre for Interdisciplinary Research in Music Media and Technology.

The Music Technology Group

The Music Technology Group (MTG) of the Universitat Pompeu Fabra in Barcelona, part of its Department of Information and Communication Technologies, is specialized in sound and music computing. With more than 50 researchers coming from different and complementary disciplines, the MTG carries out research on topics such as audio signal processing, sound and music description, musical interfaces, sound and music communities, and performance modeling. The MTG combines expertise in engineering disciplines such as Signal Processing, Machine Learning, Semantic Technologies, and Human-Computer Interaction, applying it to sound- and music-related problems.

CAML / CIRMMT

The Computational Acoustic Modeling Laboratory (CAML) at McGill University is devoted to musical acoustics and sound synthesis research. Lab projects are also directed toward the development of software tools to assist with audio processing, music performance, and pedagogy. CAML is part of the Music Technology area of the Schulich School of Music of McGill University and is directed by Gary Scavone. CAML is also part of the Centre for Interdisciplinary Research in Music Media and Technology.

The Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT) is a multi-disciplinary research group centred at the Schulich School of Music of McGill University. It unites researchers and their students from three institutions: McGill University (Faculties of Music, Science, Engineering, Education and Medicine), l'Université de Montréal (Faculté de musique, Faculté des arts & des sciences), and l'Université de Sherbrooke (Faculté de génie). The CIRMMT community also includes administrative and technical staff, research associates, visiting scholars, musicians, and industrial associates. CIRMMT holds a unique position on the international stage, having developed intense research partnerships with other academic and research institutions, as well as diverse industry partners throughout the world.

1.2 Motivation

Human nature and the violin

In our everyday life we engage in all sorts of activities that require physical actions with varying levels of accuracy and coordination, often dependent on what we see or hear. Surprisingly, some of the most complex of these, in terms of demanded precision and cognitive load, involve art. It might even seem that, beyond what is necessary for survival, we push ourselves in a quest for our limits, both in the physiological and the psychological sense.

An example that easily comes to mind when thinking about pushing the limits is sport, which can actually be understood as art in its broadest sense. However, there is an artistic discipline that not only requires as much coordination and precision in timing and positioning as sport, but goes further by requiring a fair understanding of emotions, thus involving deeper cognitive processes: music.

Amongst all musical instruments, the violin is one of the most demanding in terms of precision, independence of hand control, and complexity (that is, non-linearity) of its response, at all levels. It involves tight coupling between motor and auditory patterns in the action and perception of performance, and hence constitutes an excellent case study for a better understanding of sensorimotor integration.

Composers have historically bowed to this, giving the violin a predominant role in all sorts of music ensembles, especially in the form of solo concertos. While it could be argued that similar instruments such as the viola or the cello require similar abilities and therefore deserve similar honours, the violin has been especially appreciated for its high register, particularly during the Bel Canto epoch, when anything metaphorically close to the singing voice had a clear advantage. From the physical point of view, though, the violin requires finer-grained control due to its smaller dimensions; most of the time, slightly bending one of the fingers can already make a big difference.

Performance encoding

Musical works have traditionally been transmitted by means of musical scores: symbolic representations of music that can be written on paper and thus bypass the volatility common to all performing arts. A great part of musicianship lies in knowing how to interpret these symbols and generate from them the gestures that will produce the desired sound, as illustrated in Figure 1.1. In some cases, this relationship is obvious enough; such is the case of keyboard instruments, where there is a lot to be decided on the timing and strength of each key press, but every note on the paper ultimately corresponds to pressing one of the keys.

Violin performance, on the other hand, has a much looser link between symbols and execution. That is why artificially synthesizing the sound of a violin from the input people expect to give to the system, such as a musical score in the form of a MIDI file, is so complex. Before modelling the high-level semantic processing of musical symbols, a way of describing a performance with enough accuracy to repeat it afterwards is required. Dancers are a good reference, because they use scores inspired by the gestures themselves rather than by much higher-level representations of the performance; see Figure 1.2 for a comparison.

Notice the emphasis put on the input constraints that are externally imposed on the system. Much of the confusion here comes from the fact that, given an audio excerpt of a violin performance, the first thought that comes to mind is to detect which notes are being played and for how long, which is the most basic concept musical scores encode. This, however, gives little information about how the musician was actually playing, in terms of movements and gestures.

Figure 1.1: From an instrumental gesture perspective, musical score, instrumental gestures, and produced sound represent the three most accessible entities for providing valuable information on the music performance process. The diagram relates the musical score (intended musical message; note event sequence; discrete nature, low dimensionality), the performer and instrument (instrumental gesture; control parameters; continuous nature, low dimensionality), and the musical sound (perceived musical message; audio perceptual features; continuous nature, high dimensionality).

More refined information can be extracted from audio, such as the attack slope: whether a note starts softly from silence or has a hard transient. This is more difficult to encode in symbols, but musicians understand it plainly enough; they refer to it as part of what they call articulation.

One could even try to extrapolate the kind of gestures the musician was making from the audio alone. While this is a highly indirect measurement, it could work for estimating parameters such as bow velocity, provided that a baseline is available for determining how bow velocity translates into acoustic features; it is strongly correlated with the amplitude of the audio signal.

However, none of the above comes closer to reality than measuring the gestures themselves using a motion capture system. In fact, the last approach described still requires such measurements in order to establish a baseline. Direct measurement of the bow is beneficial for two reasons: first, it yields greater accuracy, because it does not require non-linear or extreme transformations that might amplify measurement errors; and second, it avoids non-injectivity issues, where a given sound could have been generated by two completely different combinations of bow gestures.
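The baseline idea mentioned above can be sketched with a simple least-squares fit. Everything here is illustrative: the paired samples are synthetic stand-ins for a real calibration session, and the assumed linear relation and noise level are not measurements from this project.

```python
import numpy as np

# Synthetic stand-in for a calibration session: bow speeds (as a motion
# capture system would measure them) paired with audio RMS amplitudes.
# The linear relation and the noise level are assumptions for illustration.
rng = np.random.default_rng(seed=1)
bow_speed = rng.uniform(0.1, 1.0, size=200)                # m/s
audio_rms = 0.4 * bow_speed + rng.normal(0.0, 0.02, 200)   # assumed mapping

# The baseline itself: a first-order polynomial mapping RMS back to speed.
slope, intercept = np.polyfit(audio_rms, bow_speed, deg=1)

def estimate_bow_speed(rms):
    """Estimate bow speed (m/s) from audio RMS using the fitted baseline."""
    return slope * rms + intercept
```

With real data, the scatter around this line (and the non-injectivity discussed above) is exactly what limits such an audio-only estimate.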

1https://en.wikipedia.org/wiki/File:Zorn_Cachucha.jpg

Figure 1.2: Music versus dance notation. Source: Wikipedia1

Figure 1.3: Parts of the violin. Source: Wikipedia2

1.3 Background: the violin

The violin is a bowed string instrument. It is usually played by dragging the bow across one of the strings, an action called bowing. It has four strings, usually tuned to 660 Hz (E), 440 Hz (A), 293 Hz (D), and 195 Hz (G). Musicians usually number them from highest to lowest frequency, but we will always address them by note name; while this is not as universal, it avoids any ambiguity.

2https://en.wikipedia.org/wiki/File:Violinconsruction3.JPG
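The two naming conventions can be captured in a trivial mapping (a sketch of our own; the identifiers are not code from this project):

```python
# Nominal open-string frequencies as listed in the text, keyed by note name.
STRING_FREQ_HZ = {"E": 660.0, "A": 440.0, "D": 293.0, "G": 195.0}

# The traditional numbering runs from the highest string (1) to the lowest (4);
# this mapping lets the two conventions be translated when needed.
NUMBER_TO_NAME = {1: "E", 2: "A", 3: "D", 4: "G"}

def string_name(number):
    """Translate a traditional string number (1 = highest pitch) to a note name."""
    return NUMBER_TO_NAME[number]
```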

The strings start at the peg box, where each is held by a peg, a piece that can turn and roll the string to modify its tension and hence tune it. They stop at the nut, where the fingerboard starts, stop again at the bridge, and end at the tailpiece, which usually has fine tuners; Figure 1.3 illustrates all this.

The bridge is one of the most important parts of the violin, because it is the part actually moved by the strings. It transmits their movement down to the top plate; the sound post, located near the bridge and connecting the top and bottom plates, transmits it downwards.

The bow is basically a stick, usually made of wood, although carbon fiber models such as the one used in the recordings are becoming popular. This stick is attached at both ends to a ribbon made of horse tail hair. The musician can pull the ribbon and increase its tension by moving the frog, the piece where the hair ribbon ends, along the stick with the help of a screw. The other end, the bow tip, is fixed.

Bowed-string instruments are often regarded as allowing a musical expressiveness only comparable to the human voice. The space of control parameters, though constrained, includes sufficient freedom for the player to continuously modulate the sound at a high level of detail. It is not only the notes themselves, but how the performer navigates from one note to another in the control parameter space, that carries a large part of the expressiveness in performance (1; 2).

Both hands play an important role in the sound production phenomena behind the violin. From a non-functional perspective (contrary to the classification of instrumental gestures presented previously), instrumental gestures involved in violin performance can be divided into left-hand gestures and right-hand gestures. Basically, the left hand controls the length of the string that is played, and the right hand acts as the exciter, shaping the interaction between the bow hairs and the string. This interaction leads to the characteristic vibration of the bowed string, which is transmitted to the violin's resonating body through the bridge end of the string.

In a first approximation, most of the playing techniques and expressive resources commonly available in classical violin performance are achieved through right-hand instrumental controls not involved in the selection of the string to play. These are known as bowing controls. During performance, the musician continuously modulates a number of parameters that directly influence the bow-string interaction characteristics (bowing parameters), with the aim of affecting the timbre properties of the produced sound.

1.3.1 Basics of the bowed string motion

The first study of string motion under bowing conditions is commonly attributed to Helmholtz, who observed, using a vibration microscope, that the motion of the string could be described by a sharp corner traveling back and forth along the string on a parabola-shaped path (3). The fundamental period of vibration is determined by the time it takes for the corner to make a single round trip, and is directly related to the length of the string. Depending on the combination of bow velocity and bowing point on the string, two bow-string interaction phases alternate during each vibration period. During the sticking phase, the string moves along with the bow at the same velocity. During the slipping phase, the string slips back in the opposite direction. The traveling corner is responsible for timing the transitions between the two phases, as well as for triggering slipping (release) and sticking (capture). As the string follows the motion of the bow during sticking, the amplitude of the string vibrations is mainly determined by the combination of bow velocity and the relative bow-bridge distance.
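The idealized stick-slip cycle described above can be sketched numerically. This is the textbook Helmholtz approximation of the string velocity at the bowing point; the function name and parameter values are illustrative, not tied to the thesis data:

```python
import numpy as np

def helmholtz_velocity(t, f0, v_bow, beta):
    """Idealized string velocity at the bowing point under Helmholtz motion.

    During sticking (a fraction 1 - beta of each period) the string moves
    with the bow at v_bow; during slipping (a fraction beta) it flies back
    at -v_bow * (1 - beta) / beta, so the mean velocity over one period is
    zero. beta is the bow-bridge distance relative to the string length.
    """
    phase = (np.asarray(t) * f0) % 1.0   # position within the vibration period
    v_slip = -v_bow * (1.0 - beta) / beta
    return np.where(phase < beta, v_slip, v_bow)

# Example: open A string (440 Hz), bow velocity 0.5 m/s, bowing at 1/8 of the
# string length from the bridge, sampled over ten periods at 1 MHz.
t = np.arange(0, 10 / 440.0, 1e-6)
v = helmholtz_velocity(t, f0=440.0, v_bow=0.5, beta=0.125)
# Stick velocity is +0.5 m/s; slip velocity is -0.5 * (1 - 0.125) / 0.125 = -3.5 m/s.
```

The asymmetry of the waveform makes visible why bowing closer to the bridge (smaller beta) produces shorter, faster slip phases for the same bow velocity.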

The transversal force exerted by the string on the bridge, which excites the violin body and produces the sound, is proportional to the angle of the string at the bridge. Energy losses, including internal losses in the string and at the string terminations, combined with dispersion due to stiffness (higher frequencies traveling slightly faster than lower frequencies), introduce a smoothing of the traveling corner. The net rounding of the corner is determined by a balance between this smoothing and a resharpening effect at the bow during release and capture. The effect of corner rounding and resharpening has been described by (4). Sharpening takes place mainly during release, when changing from sticking to slipping. If the perfectly sharp corner is replaced by a rounded corner of finite length, the string velocity no longer drops suddenly when the corner arrives at the bow. Instead, a gradual change in velocity takes place. Taking the frictional force between bow and string into account, the string is now prevented from slipping immediately when the rounded corner arrives at the bow. The frictional force increases until the maximum static friction force is reached, and the bow eventually loses its grip on the string. The slipping phase is then initiated, slightly delayed compared to the idealized Helmholtz motion. As a result of the build-up in frictional force, the rounded corner is sharpened as it passes under the bow. The balance between rounding and resharpening of the traveling corner explains the influence of bow force in playing. A higher bow force yields a higher maximum static friction, which in turn leads to a more pronounced sharpening during release. As a result, the energy of the higher partials is boosted, leading to an increase in the brilliance of the sound.

The maintenance of regular Helmholtz motion, characterized by a single slip and a single stick phase per fundamental period, places two requirements on bow force: (1) during the sticking phase, the bow force must be high enough to avoid premature slipping under the influence of variations in friction force, and (2) the bow force must be low enough that the traveling corner can trigger the release of the string when it arrives at the bow. The limits of the playable region have been formalized by Schelleng (5).
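Schelleng's limits are scaling laws, and their shape can be sketched as follows. The constants K_MAX and K_MIN are arbitrary placeholders lumping together string impedance and friction coefficients; they are assumptions chosen only to show the shape of the playable region, not values from this project:

```python
# Illustrative constants only: the real limits depend on string impedance
# and the static/dynamic friction coefficients of the rosined hair.
K_MAX, K_MIN = 2.0, 0.05

def schelleng_limits(v_bow, beta):
    """Schelleng-style bow-force limits for sustaining Helmholtz motion.

    The maximum force scales as v_bow / beta and the minimum force as
    v_bow / beta**2, where beta is the relative bow-bridge distance.
    Regular Helmholtz motion is only sustainable between the two limits.
    """
    f_max = K_MAX * v_bow / beta
    f_min = K_MIN * v_bow / beta ** 2
    return f_min, f_max

# Because f_min grows faster than f_max as the bow approaches the bridge,
# the playable wedge narrows and eventually closes:
f_min_far, f_max_far = schelleng_limits(v_bow=0.2, beta=0.10)    # open region
f_min_near, f_max_near = schelleng_limits(v_bow=0.2, beta=0.02)  # closed
```

Plotted on log-log axes against beta, the two limits are straight lines of slope -1 and -2, which is the familiar wedge of the Schelleng diagram.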

Beyond the basics given here as a brief introduction, much research has been devoted to describing bowed-string motion in general, and to how bow-string interaction characteristics affect the produced sound in different, complex ways. These topics, however, are not addressed here. The reader is referred to the works in (3; 5; 4; 6; 7; 8; 1; 2) for a thorough description of the sound production phenomena taking place in bowed string instruments.

1.3.2 Bowing control parameters

The modulation of bowing control parameters is carefully planned and controlled by the performer in order to reach the intended acoustical features of notes and phrases, while respecting a number of patterns and constraints derived from the complex connection between the physical actions exerted on the violin (mostly those affecting bow-string interaction) and the timbre characteristics of the sound. The string player needs to coordinate a number of bowing parameters continuously, and several of them may be in conflict with each other due to constraints of the following types: physical (bow-string interaction), biomechanical (the player's build and level of performance technique), or musical (the score). Players learn and adapt early to common strategies for basic, frequent playing habits and, as experience is gained, bowing control becomes a natural task that might be perceived as less complex than it actually is.

The control parameters for the sound available to the player (the main bowing parameters) are the following three:

• Bow velocity: The velocity of the bow as imposed by the player's hand at the frog. The local velocity at the contact point with the string is not exactly the same, due to small modulations in the bow hair and bending vibrations of the stick. Bow velocity sets the string amplitude together with the bow-bridge distance.

• Bow-bridge distance: The distance along the string between the contact point with the bow and the bridge. The bow-bridge distance sets the string amplitude in combination with the bow velocity, and it is often modulated as a means of controlling the brightness or tone of the sound.

• Bow force: The force with which the bow hair is pressed against the string at the contact point. The bow force determines the timbre (brightness) of the tone by controlling the high-frequency content of the string spectrum. In tones of normal quality (Helmholtz motion), the bow force needs to stay within a certain allowed range. The upper and lower limits of this range increase with increasing bow velocity and decreasing bow-bridge distance.

In addition to these, three secondary bowing parameters allow the performer to facilitate the control of the three main parameters outlined before. The secondary parameters are:

• Bow position: The distance from the contact point with the string to the frog. The bow position does not influence the string vibrations per se, but has a profound influence on how the player organizes the bowing. The finite length of the bow hair represents one of the most important constraints in playing.

• Bow tilt: The rotation of the bow around its length axis. The bow is often tilted in playing in order to reduce the number of bow hairs in contact with the string. In classical violin playing, the bow is tilted with the stick towards the fingerboard. Changing the tilt angle helps the performer to modulate both the width of the hair ribbon and the pressing force applied on the string.

• Bow inclination: The pivoting angle of the bow relative to the strings. The inclination is mainly used to select the string being played.
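Two of these parameters, bow-bridge distance and bow position, are purely geometric and can be recovered from tracked 3-D points. As a minimal sketch (the marker names and coordinates are hypothetical, and this is the generic closest-points-between-lines construction rather than the exact procedure used later in the thesis), one can intersect the string line with the hair-ribbon line:

```python
import numpy as np

def line_closest_points(p1, d1, p2, d2):
    """Closest points between two 3-D lines p1 + t*d1 and p2 + s*d2.

    Standard least-squares construction; assumes the lines are not parallel.
    """
    r = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ r, d2 @ r
    denom = a * c - b * b
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom
    return p1 + t * d1, p2 + s * d2

# Hypothetical marker positions (metres): the string runs along x with the
# bridge at the origin; the bow hair crosses it 3 cm from the bridge, 1 cm
# above the string line.
bridge = np.array([0.00, 0.00, 0.00])
nut    = np.array([0.33, 0.00, 0.00])
frog   = np.array([0.03, -0.20, 0.01])
tip    = np.array([0.03, 0.45, 0.01])

q_string, q_bow = line_closest_points(bridge, nut - bridge, frog, tip - frog)
bow_bridge_distance = np.linalg.norm(q_string - bridge)  # 0.03 m here
bow_position = np.linalg.norm(q_bow - frog)              # 0.20 m here
```

Bow velocity then follows by differentiating such positions over time, and the residual distance between the two closest points relates to hair-ribbon deflection.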

10 1.3.3 Early studies on bowing control in performance

The study of bowing gestures in string players is not an extensive field of research, its origins being linked to pedagogical interests. Even though extensive literature (mostly in the area of music education and performance training) has been devoted to rather qualitative descriptions of bowing patterns in classical violin playing (9; 10; 11), one finds early works that opened paths for future studies on bowing control based on data acquired from real violin performance.

At the beginning of the 20th century, Hodgson published the first results on visualizations of trajectories of the bow and bowing arm using cyclegraphs (12). Using this technique, which had been developed by the manufacturing industries for time studies of workers, he could record brief bowing patterns by attaching small electric bulbs to the bow and arm and exposing the motions on a still-film plate. The controversial results, showing that bow trajectories were always curved (crooked bowing) and that the bow was seldom drawn parallel to the bridge, caused an animated pedagogical debate. Some years before Hodgson published his results, Trendelenburg had been examining string players' bow motion from a physiological point of view (13). Without access to measurement equipment for recording the motions of the players' arms and hands, he drew sensible conclusions on different aspects of suitable bowing techniques based on his expertise as a physician. Fifty years later, Askenfelt studied basic aspects of bow motion using a bow equipped with custom-made sensors for calibrated measurements of all bowing parameters except the bow angles (14; 15). Apart from establishing typical ranges of the bowing parameters, basic bowing tasks such as détaché, crescendo-diminuendo and sforzando were investigated. A general conclusion was that it is the coordination of the bowing parameters which is the most interesting aspect. The result was not surprising in view of the many constraints which determine the player's decisions on when and how to change the bowing parameters. However, it was a reminder that the control of bowed-string synthesis needs interfaces which can easily control several parameters simultaneously, like a regular bow (2).

1.3.4 Acquisition of bowing parameters in violin performance

Because of the complex and continuous nature of the physical actions involved in the control of bowed-string instruments (often considered among the most articulate and expressive), acquisition and analysis of bowed-string instrumental gestures (mostly bowing control parameters) has been an active and challenging topic of study for several years, leading to diverse successful approaches.

Askenfelt (14; 15) presented methods for measuring bow motion and bow force using diverse custom electronic devices attached to both the violin and the bow. The transversal bow position was measured by means of a thin resistance wire inserted among the bow hairs, while for the bow-bridge distance the strings were electrified, so that the contact position with the resistance wire among the bow hairs could be detected. For the bow pressure, four strain gages (two at the tip and two at the frog) were used. A different approach was taken by (16), who measured bow displacement by means of oscillators driving antennas (electric field sensing). In a first application, carried out for cello, a resistive strip attached to the bow was driven by a mounted antenna behind the bridge, resulting in a wired bow as well. Afterward, in the violin implementation of this methodology, which resulted in the first wireless measurement system for bowing parameters, the antenna worked as the receiver, while two oscillators placed on the bow worked as drivers. There, what is referred to as bow pressure was measured by using a force-sensitive resistor below the forefinger (or between the bow hair and wood at the tip). These approaches, while providing means for measuring the relevant bowing parameters, did not allow tracking performer movements. Furthermore, the custom electronic devices that needed to be attached to the instrument proved somewhat intrusive, and made it difficult to interchange instruments at the performer's demand.

More recent implementations of violin bowing parameter measurement introduced some important improvements, resulting in less intrusive systems than previous ones. (17; 18) measured downward and lateral bow pressure with foil strain gages, while bow position with respect to the bridge was measured in a similar way as previously carried out by (16). The strain gages were permanently mounted around the midpoint of the bow stick, and the force data were collected and sent to a remote computer via a wireless transmitter mounted at the frog, still resulting in considerable intrusiveness to the performer. (19) used a commercial EMF device for tracking some low-level movement parameters and used them for controlling some synthesis features in a performance scenario. The procedure for extracting movement or gestural parameters was not much elaborated, as he just used speeds or positions/rotations of the sensors on the violin or bow without extracting relevant instrumental gesture parameters.

(20) performed wireless measurements of the acceleration of the bow by means of accelerometers attached to it, and used force-sensitive resistors (FSRs) to obtain the strain of the bow hair as a measure of bow pressure. This system had the advantage that it could be easily attached to any bow. Conversely, it needed considerable post-processing in order to obtain motion information, since it was measuring only acceleration. This was carried out afterward by (21), who combined the use of video cameras with the measurements given by the accelerometers in order to reconstruct bow velocity profiles.

Accuracy and robustness in bow pressing force measurement was recently taken to a higher level (see the work by (22; 23) and extensions by (24; 25), the latter two constituting part of the contributions of this dissertation) by using strain gages attached to the frog end of the hair ribbon, thus measuring ribbon deflection.

Also recent is the approach presented by (26), where bowing control parameters are very accurately measured by means of one of the commercially available electromagnetic field-based tracking devices. Afterward, this methodology (based on tracking the positions of the string and hair ribbon ends) was adapted by (27) to a more expensive commercial camera-based motion capture system that required more involved calibration and post-processing. More recently, research in capturing bowing parameters in real time led to the first commercial product, the K-Bow3. It consists of an augmented bow plus additional electronics attached to the violin, and is mostly intended for controlling sound processing algorithms in stage performance.
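Approaches in this family reduce descriptor extraction to line geometry: the string (bridge-to-nut) and the hair ribbon (frog-to-tip) define two 3D lines, and their mutual closest points locate the bowing point. The sketch below is our own illustration of that core computation, not the implementation of any of the cited systems; the function names, the returned descriptor set, and the assumption of one tracked point per string/ribbon end are ours.

```python
import numpy as np

def closest_points_on_lines(p1, d1, p2, d2):
    """Closest points between the lines p1 + s*d1 and p2 + t*d2 in 3D.

    Solves the two normal equations of min |(p1 + s d1) - (p2 + t d2)|^2.
    Assumes the lines are not parallel (denom != 0), which holds whenever
    the bow actually crosses the string.
    """
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    w = p1 - p2
    denom = a * c - b * b
    s = (b * (d2 @ w) - c * (d1 @ w)) / denom
    t = (a * (d2 @ w) - b * (d1 @ w)) / denom
    return p1 + s * d1, p2 + t * d2, s, t

def bowing_descriptors(bridge, nut, frog, tip):
    """Hypothetical per-frame descriptors from four tracked 3D points."""
    string_dir = nut - bridge
    bow_dir = tip - frog
    p_str, p_bow, s, t = closest_points_on_lines(bridge, string_dir,
                                                 frog, bow_dir)
    return {
        # distance from the frog along the hair ribbon (capture units)
        "bow_position": t * np.linalg.norm(bow_dir),
        # distance from the bridge along the string
        "bow_bridge_distance": s * np.linalg.norm(string_dir),
        # minimum string-to-ribbon distance (near zero while bowing)
        "string_crossing_gap": np.linalg.norm(p_str - p_bow),
    }
```

Applied frame by frame to the tracked marker streams, this yields time series from which velocities and higher-level bowing features can then be derived.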

1.4 Objectives

As already mentioned in Section 1.1, this project is part of a longer-term Marie Curie action aiming at systematically comparing violin performances involving different musicians and instruments, which has not been done in previous studies. More specifically, the goal of this action is to analyze violin performances and link the temporal evolution of gestures acquired from violin players to audio features (timbre, etc.) derived from biologically inspired models during a unit of performance, and then use these mathematical models for synthesizing violin sounds and experimenting with feedback control loops.

There is already some literature on how to get violin bowing parameters from the position in space of the bow and the violin; this project borrowed a lot from it, and added a few improvements. Its objectives are:

• To define a methodology and setup for carrying out this kind of research.

• To provide actual software tools and working knowledge useful for the synchronization and formatting of multimodal data, so that constructing and publishing databases such as the one presented here can be done much more efficiently in the future.

• To implement methods for bowing control parameter acquisition through available motion capture systems.

• To implement methods for bowing control parameter extraction from the raw motion capture data.

• To design and record experiments involving bowing control parameter acquisition.

• To process the data from the experiments and build a bow stroke database comprising carefully annotated multimodal data.

• To work on the repovizz platform, an online multimodal data archiving and visualization tool with an emphasis on collaboration, and its surrounding software tools, in order to upload the database there and make it available to the public.

• To perform a preliminary analysis on some aspects of the recordings.

3http://www.keithmcmillen.com/kbow/

1.5 Structure of the thesis

This report starts with a review of the structure of the recordings, with special emphasis on how they contribute to the experiment, what the musicians had to perform, and how they were asked to do so.

An accurate description of the experimental setup for the recordings follows, with specific details on how all data was recorded and how synchronization between the different modalities (motion capture, video, audio and haptic sensors) was achieved.

One of the contributions that will probably have the most impact in the future, apart from the data itself, is the well-documented code for computing bowing descriptors from 3D motion capture data, including the force; the definition of the descriptors and the machine learning methods for approximating the force mentioned earlier are presented in Chapter 4.

The following chapter presents the database that has been constructed from the recordings, from their segmentation and annotation to the online repository where all data will be stored and made available to the public, together with some insight into how the recordings have been organized there.

Next, the recorded data is analyzed, with graphs illustrating our observations, from the statistical distribution of the descriptors themselves to possible correlations with audio features. The previously mentioned hypotheses are revisited and checked against the data.

The report ends with some general conclusions, including a summary of the achievements and a brief discussion of the next steps.


Chapter 2

EXPERIMENTAL DESIGN

Most of this project is about the recordings we made at the CIRMMT facilities at McGill University in Montreal, Canada, which are being put together in a dataset called MUSMAP I. In order to make it possible to model violin performances, we acquired a total of 8 hours of carefully synchronized multimodal data.

The general idea of the experiment was to record several performers, each of them playing on several instruments; this would show how musicians adapt to certain instrument characteristics, and how much they differ in doing so. For every performer-plus-instrument combination, we recorded the same set of thoughtfully designed exercises, aiming to capture data as diverse as possible while keeping control over what the performance parameters are supposed to be (or, at least, giving instructions as complete as possible to the musicians), and getting enough redundancy to help smooth out occasional noise and to have a backup in the event of poorly executed strokes.

With these directives in mind, the experiment sampled different dimensions of performance: the performer; the instrument; the bow-bridge distance or sounding point, as violin players call it; the dynamic; the effective string length, which is related to the pitch; the string being played, which also defines pitch; the bow direction; and the articulation. Given the time constraints we had, we decided to take 4 samples of every possible combination of the parameters. All parameters had between 2 and 4 possible values.
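The size of this sampling space is easy to enumerate. The sketch below uses illustrative labels of our own (the exact naming and the tempo dimension, taken from the half-note/quarter-note parts described later, are our assumptions), just to show how the combinations multiply:

```python
from itertools import product

# Illustrative labels for the sampled dimensions; each has 2-4 values.
dimensions = {
    "performer":      [1, 2, 3],
    "instrument":     [1, 2, 3],
    "articulation":   ["legato", "martele"],
    "string":         ["G", "D", "A", "E"],
    "stop_position":  ["low", "mid", "high"],   # effective string length
    "sounding_point": ["tasto", "ordinario", "ponticello"],
    "dynamic":        ["piano", "mezzoforte", "forte"],
    "tempo":          ["half notes", "quarter notes"],
    "bow_direction":  ["down", "up"],
}

combinations = list(product(*dimensions.values()))
print(len(combinations))       # 7776 distinct parameter combinations
print(4 * len(combinations))   # 31104 strokes at 4 samples per combination
```

Even with short strokes, a space this size is what pushes the total recording time toward the 8 hours mentioned above.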

2.1 Sampling space

The goal of the recording was, therefore, to sample different dimensions of violin performance. What follows is a brief description of each of the sampling dimensions, along with a rationale of why we thought it was important to include it in the experimental design:

Performer As explained next, we sampled different performers in order to obtain a more generic model of how the violin is played and to spot differences between them.

Instrument It was also important to have different instruments to see how violinists adapt their technique to them, and distinguish between what is fixed and what is adaptive.

Articulation Different kinds of bow strokes expose different aspects of the violin, so it is important to sample some of them to have a more complete profile of the violin, both for the performer and for us. For instance, in legato the response to transients is not as obvious because note transitions are smooth, but the overall timbre is better appreciated.

Sounding point Asking the musicians to use different sounding points makes them go to the limits of the Schelleng region, and exposes much better the range of playability of the violin and how musicians adapt to it.

Dynamic Similarly to the sounding point, different dynamics tend to lie on different regions of the Schelleng diagram, so it is important to have control over them.

Pitch (position and string) Sampling different pitches exposes different characteristics of the violin, because they excite different modes of the system. Furthermore, different strings have different physical characteristics and excite the bridge at different points, and different positions (that is, effective string lengths) especially modify aspects related to the sounding point.

Bow direction Virtually all violin performances consist of alternating up and down bows, so sampling both of them makes sense. The temporal evolution of the control parameters that condition bow force is also very different between the two, because of the large asymmetry in the distribution of weight along the bow.

2.1.1 Performer

As already mentioned, the constraints on the project made us choose to record 3 different performers. They were all professional active musicians, although specialized in different styles. While they were all classically trained, Player 1 works mostly on ; Player 2 specializes in traditional Eastern European music, although with broad experience in classical ; and Player 3 specializes in Arabic and jazz music.

Although the exercises we designed were generic to the act of playing the violin itself rather than to specific genres, we expect to find more accentuated differences in their styles than if they were all specialized in the same field. More specifically, we expect performers 1 and 2, who have orchestral experience, to be more consistent in their playing, because playing in such an ensemble requires precise control of most of the parameters. We also expect performers 2 and 3, with experience in non-Western music, to sample a broader range of values, because they are used to techniques that are of little use in so-called classical music.

2.1.2 Instrument

The violins used during the recordings were provided by the Schulich School of Music of McGill University. The main criterion for choosing them was that they had to be as different as possible, not only in terms of brightness of the sound, which is what usually comes to mind first, but also in terms of how loud they sound or how much margin the player has for staying within the slip-stick region of the Schelleng diagram.

In order to get an initial assessment of the differences before analyzing the performances, we asked the musicians what they thought about each of the instruments and which one they preferred, and did some measurements of the violins in a quasi-anechoic chamber to characterize them.

(a) Violin 1    (b) Violin 2    (c) Violin 3

Figure 2.1: Admittance (blue) and radiation (red) plots for the three violins, together with the measured coherences (green shades).

The first measurement made was the admittance at the bridge, which is the transfer function from a known force applied at a point on the bridge to its velocity over time. The force was produced with an impact hammer, in order to excite the spectrum as evenly as possible, and the velocity was measured with a laser vibrometer. The admittance is closely related to the signal we get from the pickup, because it is effectively a measurement of how the bridge moves when the strings apply a force on it.

The second measurement was the radiation of the violin, which is the transfer function from a known force applied at a point on the bridge to the air pressure recorded with a very flat-response microphone around 1 meter away, in line with the body of the instrument. This measurement is closely related to the sound the musicians perceive.
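Both characterizations are standard frequency-response estimates from an input signal (the hammer force) and an output signal (vibrometer velocity or microphone pressure). A minimal numpy sketch of the classic H1 estimator, together with the magnitude-squared coherence shown as green shades in Figure 2.1, could look as follows (the function name and parameters are our own illustration, not the measurement software actually used):

```python
import numpy as np

def h1_estimate(x, y, fs, nperseg=4096):
    """H1 transfer-function estimate from input x to output y.

    Averages windowed FFT segments (50% overlap) to form the auto- and
    cross-spectra, then returns frequencies, H1 = Sxy/Sxx, and the
    magnitude-squared coherence gamma^2 = |Sxy|^2 / (Sxx * Syy).
    """
    win = np.hanning(nperseg)
    step = nperseg // 2
    n_seg = (len(x) - nperseg) // step + 1
    n_bins = nperseg // 2 + 1
    Sxx = np.zeros(n_bins)
    Syy = np.zeros(n_bins)
    Sxy = np.zeros(n_bins, dtype=complex)
    for k in range(n_seg):
        sl = slice(k * step, k * step + nperseg)
        X = np.fft.rfft(win * x[sl])
        Y = np.fft.rfft(win * y[sl])
        Sxx += np.abs(X) ** 2
        Syy += np.abs(Y) ** 2
        Sxy += np.conj(X) * Y
    H1 = Sxy / Sxx
    gamma2 = np.abs(Sxy) ** 2 / (Sxx * Syy)
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, H1, gamma2
```

Bins where the coherence drops well below 1 indicate frequencies at which the measured response cannot be trusted, which is exactly what the coherence shading in the plots conveys.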

Both measurements are very important because they model a good part of how each violin transforms the excitation coming from the strings into sound. It can be very interesting to see how the recorded sound of the different instruments correlates with them, and what changed in the control patterns in order to obtain a sound more consistent across instruments (if that turns out to be the case) than the instrument characterizations predict. We leave this, though, for future studies.

The description of the violins that follows is based on what the musicians said on the questionnaire that was given to them at the end of their recording session, as well as what we observed when doing the setup and looking at the described measurements.

Violin 1

The first violin that was chosen (labelled as Violin 1) is an anonymous hand-made piece, and looks the oldest of the ones we chose. It had been reported in former studies as having a very good sound, actually beating some first-tier violins in blind tests. The musicians described its sound as glassy, which is very appreciated, but they said it cannot handle much pressure, so it is not well suited for playing very loud. Nevertheless, musicians 1 and 3 chose it as their preferred instrument within the set of 3 that was offered to them.

Violin 2

The second violin (labelled as Violin 2) is another hand-made instrument. It has a brighter sound than the first one, even slightly raucous. However, it can handle more pressure, yielding a more open sound. That is the reason why musician 2 chose it as the one for performing the free excerpt.

Violin 3

The third violin (labelled as Violin 3) is a factory-made instrument, branded Suzuki and made in China around 50 years ago. It was described as being clearly the worst of the three, with a raucous sound and a very irregular response. Some of the musicians reported having trouble adapting to it, which makes us suspect that its stable playing region is narrower.

Bow

In order to better isolate the effect of the violin, all musicians always played with the same bow. The bow that was chosen was a carbon fiber model assembled by a Spanish manufacturer, with the parts made in China. All musicians who tried it described it as lacking the feeling of wood, especially in the timbre of the sound it generates, but with a very good response in terms of weight balance and bouncing.

2.1.3 Articulation

The great variety in articulation is what makes the violin a very rich instrument, and a complex one to emulate. Different bow strokes are usually described using Italian or French terms, but in the end violinists acknowledge interpolating a lot between them. We chose to record two distinct articulations that were bowed in the classical sense (i.e. no pizzicato or ); were not too difficult to play, so that the musicians could concentrate on being consistent with the sound they produced; and could be sustained long enough to provide rich data (i.e. no short strokes such as sautillé). All these terms have differences in meaning among players that can be quite substantial, so we chose two that seemed clear enough and then discussed with the musicians what we really meant.

The first articulation that was sampled was legato, which means tied. In legato, the objective is to produce a sound that is perceived as something continuous, so musicians try to hide the note boundaries by doing a quick and smooth bow change. During the note, the sound is usually static, unless specific marks such as dynamic modulation are present.

The second articulation that was sampled was martelé, which means hammered. In martelé, the bow is first struck very hard, with a sharp attack, and then the sound decays slowly as force on the bow is released. It is not to be confused with staccato, which has a less strong attack and where the decay of the note is produced by taking the bow off the string shortly after the note onset.

2.1.4 Sounding point

We decided to sample three different sounding points: the ordinary point where they would play normally, a point closer to the bridge, or sul ponticello, usually sought after when an increase in brightness or loudness is desired, and a point closer to the fingerboard, or sul tasto, which gives a softer sound.

The specific sounding point was decided by the musician, because there are many factors that affect how the violin behaves at a given sounding point, such as the overall intensity in terms of force and velocity, or the effective string length. Giving them these directions is enough to push them toward the limits of the Schelleng stability region, especially in combinations such as forte, sul tasto and a short string length, or piano and sul ponticello.

2.1.5 Dynamics

Dynamics is one of the most subjective dimensions of music performance; more than depending on the performer, it depends a lot on the context. The same dynamic indication will be interpreted completely differently depending on the kind of ensemble playing, the genre, or the mood of the piece.

We asked the musicians to sample 3 different intensity levels: piano (soft), mezzoforte (medium) and forte (loud). This nomenclature, while being the most generic for 3 arbitrary dynamic levels, can be interpreted as not extremely soft (which would be pianissimo), and not extremely loud (which would be fortissimo). However, some musicians play with the full range of dynamics they have, and hence scale the indications accordingly when the extrema are missing.

2.1.6 Pitch (string and string length)

Pitch in instruments such as the violin has an added complexity: it is a combination of both the string being played and the point at which that string is stopped. Musicians choose which combination to use depending on ergonomic constraints and the sound quality they want to achieve. In order to retain control of all possible variables, we decided to sample each string independently. For each string, we recorded 3 different stop positions along the first half of the string (the half closest to the nut).

Another thing to take into account is which finger stops the string; by moving the whole hand, one can stop at the same position with any of the 4 available fingers. We chose the notes and their order so that there would be a single ergonomic way to perform the sequence, switching between two different hand positions; the whole pitch sequence actually contains every sample twice.

2.1.7 Bow direction

There is not much to say about bow direction. We sampled both down and up bows; and, obviously, the easiest thing for a musician is to play alternating the bow direction. They could play two notes in a single bow stroke, but we chose not to ask for that because it makes it much harder to define the boundaries between them. We also asked them to always start with a down bow, because that is the standard practice when the music starts on the downbeat and the notes are organized in binary patterns; it would have felt extremely unnatural to ask otherwise.

2.2 Sample permutations (score)

The order in which samples are taken in an experiment is very important, and given the constraints the musicians face in how quickly they can change between different parameters, especially if the changes do not follow a pattern, we could not just randomly generate scores for them to play.

We decided to sample the dimensions recursively, in a carefully designed order. We had three constraints in mind: it had to be playable without much difficulty in order to minimize errors; it had to avoid being too repetitive, so that the musicians stayed alert to what came next and did not just play by inertia; and it had to reduce the sampling bias by ensuring that the most subjective dimensions changed often.

The one dimension we could not reorder, due to practical constraints, was the player; we sampled each of them independently, actually on different days. Alternating between them would not have had much effect anyway, unless they had heard what the others were playing.

From there, we had freedom to choose. Since we wanted to have as many samples as possible from every combination of the parameters, we recorded the same thing twice; once in the morning and once in the afternoon, with changes in the ordering of dynamics and tone as will be detailed later.

Each permutation (A and B), as we called these groups, was divided into two parts, corresponding to the sampling in articulation (legato and martelé). Legato always came first, because it is simpler and gives the musician some time to get into the mood. Since the articulations are very different, we thought it was not critical to sample them so separately; the musicians played with the same articulation for about one hour at a time.

For each articulation type, the musicians recorded the same thing with each one of the violins, always in the same order (1, 2, 3). The sampling of a violin consisted of playing the same set of exercises on each of the strings, always in the same order, from lowest to highest.

The musicians recorded a single take for each string of the violins, which was around 2.5 minutes long. Within this take there were two distinct parts: the first one with half notes (one note every two beats), and the second one with quarter notes (one note every beat). In each of these parts, the sounding point was changed three times, always starting from the middle, and within each sounding point position the dynamics were changed three times as well, starting always from mezzoforte. The order of the other two sounding points and dynamics was switched every time the duration or the permutation changed, so all combinations were sampled equally.

Finally, for each dynamic, the musicians played half of the pitch sequence that follows. Each note was repeated 4 times in a row by alternating the up and down bow directions, and always starting with bow down:

1. Two semitones above the open string, first finger (first position)
2. Ten semitones above the open string, fourth finger (third position)
3. Five semitones above the open string, first finger (third position)
4. Same as (2)
5. Same as (1)
6. Five semitones above the open string, third finger (first position)
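The per-dynamic stroke list implied by this scheme can be sketched as follows: each dynamic gets half of the sequence, and every note expands into four strokes alternating bow direction, starting with a down bow. The tuple encoding and function name below are our own illustration, not the script actually used to generate the scores:

```python
# (semitones above open string, finger, hand position), as listed above.
PITCH_SEQUENCE = [
    (2, 1, "first"),
    (10, 4, "third"),
    (5, 1, "third"),
    (10, 4, "third"),   # same as (2)
    (2, 1, "first"),    # same as (1)
    (5, 3, "first"),
]

def strokes_for_dynamic(half):
    """Strokes for one dynamic: first (half=0) or second (half=1) half
    of the pitch sequence, 4 repetitions per note, alternating direction
    from a down bow."""
    notes = PITCH_SEQUENCE[:3] if half == 0 else PITCH_SEQUENCE[3:]
    strokes = []
    for semitones, finger, position in notes:
        for rep in range(4):
            direction = "down" if rep % 2 == 0 else "up"
            strokes.append((semitones, finger, position, direction))
    return strokes
```

Since each pitch appears once in each half, expanding both halves yields 4 strokes per (pitch, bow direction) pair, which is where the 4 samples per combination mentioned earlier come from.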

An excerpt of the score that corresponds to the sampling of the A string is provided in Figure 2.2, displaying pitch and intensity alternations. For a full version of the score that was given to the musicians, please refer to Appendix A.

Figure 2.2: Excerpt of the score that was given to the musicians


Chapter 3

DATA ACQUISITION

The experiments we designed required a highly complex setup that was only possible to put together thanks to the collaboration of the Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT) in Montreal, Canada. At their facilities, we made use of the Qualysis Motion Capture System, which can record 3D data from various infrared cameras, a high-definition video camera, a number of microphones and piezoelectric sensors, audio interfaces, a clock generator, and a load cell with all the additional electronics needed to make it work and adapt its output so that we could record it.

3.1 Overview

3.2 Motion Capture

The motion capture system was one of the most important parts of the setup. We used a Qualysis1 system, which can register the 3D position of a number of markers over time (at 300 Hz in our case) with sub-millimeter accuracy. The Qualysis Motion Capture system employs a number of high-speed infrared cameras with built-in infrared light ring emitters for tracking highly reflective spheres as 2D blobs. The provided software, the Qualysis Track Manager, takes the data from all the cameras and reconstructs the 3D scene. Several markers can be grouped into a rigid body, for which Qualysis computes position and orientation over time; this allows defining virtual markers relative to the coordinate system of the body. The whole process requires precise calibration, which will be described in Section 3.7.

1http://qualysis.com

In our setup, we had 12 cameras and tracked 10 markers on the musician, 5 on each violin, and 6 on the bow. The markers on the musicians were left independent; we use them only for visualization purposes, although we do not rule out further analysis with them in future studies. The markers on the violin are grouped into a rigid body, so that we can know the position of seven very specific points relevant to the feature extraction steps, such as where the strings meet the bridge. It is not possible to model the bow as a single rigid body, because it deforms when a lot of pressure is applied on it, and we actually use that information during the feature extraction. Instead, we took a piece-wise approach by constructing a rigid body for the three markers near the tip and another one for the other three near the frog. Each of them has two virtual points associated with it at the hair ribbon boundaries.
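The virtual-point mechanism works as follows: a snapshot fixes the physical markers' (and the virtual points') coordinates in the body's own frame; then, for every frame, the rigid transform from that reference to the observed marker positions is recovered and applied to the virtual point. A minimal numpy illustration of this idea, using the standard Kabsch (SVD-based) fit rather than the Qualysis internals, and with our own function names:

```python
import numpy as np

def rigid_transform(ref, obs):
    """Kabsch fit: rotation R and translation t such that
    obs_i ~= R @ ref_i + t, for (k x 3) arrays of corresponding points."""
    ref_c, obs_c = ref.mean(axis=0), obs.mean(axis=0)
    H = (ref - ref_c).T @ (obs - obs_c)       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = obs_c - R @ ref_c
    return R, t

def virtual_marker(ref_markers, obs_markers, virtual_ref):
    """World position of a virtual point (defined once, in the snapshot
    frame) given the current frame's observed marker positions."""
    R, t = rigid_transform(ref_markers, obs_markers)
    return R @ virtual_ref + t
```

Applied per frame, this yields the string-end and hair-ribbon-end positions used by the feature extraction, even though no physical marker sits at those points.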

3.2.1 Body markers

The body markers are prefixed with playerN, where N is the numeric code assigned to each player.

playerN RSHO  Right shoulder
playerN RELB  Right elbow
playerN RWR  Right wrist
playerN LSHO  Left shoulder
playerN LELB  Left elbow
playerN LWR  Left wrist
playerN TFHD  Top forehead
playerN RFHD  Right forehead
playerN LFHD  Left forehead
playerN C7  Nape

3.2.2 Violin markers

As already mentioned, there are two kinds of markers on the violins: the ones that were physically placed on it during the recording, and the ones that were derived from the rigid body model plus a snapshot of them made at the beginning of the experiment; see Figure 3.1 in Section 3.3 for a close-up view of some of these virtual markers during the snapshot.

The physical markers were put asymmetrically and in different places on the different violins so that the system can distinguish them and apply the right rigid body model; this is depicted in Figure 3.1. The labels were still the same for all the violins and prefixed with violinN, where N is the numeric code of the violin (see Section 2.1.2 for more details). The labels are also grouped in logical, hierarchical blocks, separating the levels with an underscore.

Physical markers

violinN scroll  Marker put on the scroll of the violin, after the peg box
violinN pl bottom left  Bottom left side of the top plate
violinN pl bottom right  Bottom right side of the top plate

Figure 3.1: Violins used during the recordings, ordered from left to right.

violinN pl top left  Top left side of the top plate
violinN pl top right  Top right side of the top plate

Virtual markers

violinN st G bridge  Intersection of the bridge and the G string (beginning of the string)
violinN st D bridge  Intersection of the bridge and the D string (beginning of the string)
violinN st A bridge  Intersection of the bridge and the A string (beginning of the string)
violinN st E bridge  Intersection of the bridge and the E string (beginning of the string)
violinN st G nut  Intersection of the nut and the G string (end of the string)
violinN st E nut  Intersection of the nut and the E string (end of the string)
violinN fb center  Point on the center of the fingerboard, near its end closest to the bridge

3.2.3 Bow markers

The bow, as already mentioned, is modelled by using 2 rigid bodies. They are identified with the prefixes bowCarb frog and bowCarb tip respectively; bowCarb stands for carbon fiber bow, and we put it to keep the naming generic in case we recorded with more than one bow in the future. The handedness convention is to look at the bow with the hair ribbon in front of the stick and with the frog at the bottom.

Figure 3.2: Bow that was used for the recordings.

Physical markers

bowCarb tip stick  Part of the tip rigid body; the marker closest to the frog.
bowCarb tip corner  Also part of the tip rigid body, placed at the end of the bow, on the angle it has at the tip.
bowCarb tip tip  The last of the tip rigid body group, placed on the very tip of the bow.
bowCarb frog ant right  Part of the frog rigid body group, placed on the right antenna.
bowCarb frog ant left  Another frog marker, placed on the left hand side of the left antenna.
bowCarb frog stick  The last of the frog group, placed on the stick; the one closest to the tip.

Virtual markers

bowCarb tip hr left  Left hand side of the hair ribbon's tip boundary.
bowCarb tip hr right  Right hand side of the hair ribbon's tip boundary.
bowCarb frog hr left  Left hand side of the hair ribbon's frog boundary.
bowCarb frog hr right  Right hand side of the hair ribbon's frog boundary.

3.2.4 Load cell markers

The load cell markers are laid out similarly to the violin ones, with the difference that there is a single string, the fingerboard marker does not exist, and there is an extra marker for building the rigid body:

cell st nut  Placed on the end of the bowing bar furthest from the player
cell st bridge  Placed on the end of the bowing bar closest to the player
cell loadcell  Placed on the load cell itself
cell pl scroll  Equivalent to the violin one; the marker on the plate nearest the top
cell pl bottom left  Equivalent to the one on the violin
cell pl top left  Equivalent to the one on the violin
cell pl bottom right  Equivalent to the one on the violin
cell pl top right  Equivalent to the one on the violin

3.3 Audio

For a given take, we had 3 audio sources that we wanted to record: a piezoelectric pickup placed on the bridge of the violin to capture its vibration, a close-up microphone placed on top of the tailpiece, and an ambience microphone.

The pickup that was available was the Fishman V1002, which is mounted between the hip and the shoulder of the bridge. We chose to put it on the side where the chin rest is because, being the side above the bass bar (and not the sound post), it is the one vibrating with the most amplitude. However, one of the musicians (who also happens to be a violin maker) advised putting it on the other side in future studies, because the sound is more bassy. The pickups were fed to the preamplifier through a Direct Injection (DI) box.

The close-up microphone we chose was the DPA 4099V3, an industry standard for close miking of many classical instruments, especially the violin. We wrapped the connectors of both transducers (the pickup and the microphone) together, and placed the microphone above the tailpiece, facing the bridge, as can be seen in Figure 3.3.

The ambience microphone was a Schoeps Colette kit4 with a wide cardioid capsule (MK 21). We used it to have a reference recording of the violins as a listener would have heard them, and to record ourselves talking, so that we could later identify what had happened during the recordings.

Since the performers had to change violins fairly often (approximately every 20 minutes), we decided to have 3 pickups and 3 close-up microphones of the same brand and model, one pair for each violin, and to record all of them. We could discard the unused ones afterwards, and this also served as a safeguard against misspelled take labels.

2http://www.fishman.com/product/v-100-classic-series-violinviola-pickup
3http://www.dpamicrophones.com/en/products.aspx?c=item&category=118&item=24346
4http://www.schoeps.de/en/products/categories/overview-mod-mi

Figure 3.3: Pickup and close-up microphone mounted on the violin.

The next elements in the chain were an RME Micstasy5 8-channel microphone preamplifier and an RME Fireface 8006 audio interface for doing the analog to digital conversion. All this was recorded using the Reaper7 Digital Audio Workstation (DAW).

3.4 Video

The video was recorded with a Sony PMW-EX38 professional video camera. We avoided moving it during the recordings, so that the motion capture 3D scene could later be rendered overlaid on top of the video without much hassle.

The camera was set to a resolution of 720p at 25 frames per second (PAL standard), compressing to a bitrate of around 35 Mbps.

5http://www.rme-audio.de/en_products_micstasy.php
6http://www.rme-audio.de/en_products_fireface_800.php
7http://www.reaper.fm/
8http://www.sony.co.uk/pro/product/broadcast-products-camcorders-xdcam/pmw-ex3/

3.5 Load cell

At random points during the recording sessions, calibration takes were performed by placing weights on the load cell in order to characterize its response, as will be described in Section 4.4.1.

The load cell was recorded with the Qualysis Analog acquisition box, which provides 64 input channels and has the advantage of integrating with the Qualysis motion capture system, both in terms of software control and of timing and synchronization. Between the load cell and the acquisition box there was a signal conditioner containing an instrumentation amplifier and a voltage divider for powering the sensor, because the acquisition box expects the signal to be in the ±5 V range, and the load cell delivers a much smaller voltage.

We also tried using an analog-to-digital converter plugged into the audio interface through the S/PDIF digital protocol, amplitude-modulating the signal with a carrier of around 10 kHz so that we could record it from the sound card and synchronize it with the audio instead of through the Qualysis system. Both alternatives performed worse than the acquisition box in terms of signal-to-noise ratio and other artifacts.

Figure 3.4: Load cell mounted on a support with motion tracking markers.

Figure 3.5: Calibrated weights that were used for characterizing the load cell.

3.6 Synchronization

We implemented two levels of synchronization in the setup: clock-level synchronization, which guarantees that samples of different modalities are acquired in parallel without drift, and timestamp-based synchronization, which enables us to align data from different sources post-hoc without having to start recording at the exact same frame, which would have been very difficult, if not impossible, to achieve given the resources available. Both levels were controlled by a master synchronization board, a Rosendahl Nanosyncs HD9, which can generate high resolution audio (Word Clock) and video clocks, as well as Linear Time Code (LTC) timestamps, among other signals, with a choice of sampling rates and frequency multipliers.

The clock synchronization was quite straightforward, because it's only a matter of supplying each device with a clock signal at the rate it requires. The audio interface and the Qualysis system were driven from Word Clock outputs at 44100 Hz. We set the internal Qualysis frequency divisor to 147, so that the motion capture data and

9http://www.rosendahl-studiotechnik.de/nanosyncs.html

the load cell were recorded at 300 Hz, which happens to be a multiple of the video sampling rate as well. The video camera was driven from a Framelock output with PAL formatting (25 fps).
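As a sanity check, the clock arithmetic above can be verified directly (a trivial sketch using only the figures quoted in the text):

```python
# Clock relationships used in the setup (values from the text).
AUDIO_RATE = 44100       # Hz, Word Clock driving the audio interface and mocap
QUALYSIS_DIVISOR = 147   # internal frequency divisor on the mocap system
VIDEO_RATE = 25          # fps, PAL Framelock

mocap_rate = AUDIO_RATE / QUALYSIS_DIVISOR
assert mocap_rate == 300.0           # mocap and load cell sampled at 300 Hz
assert mocap_rate % VIDEO_RATE == 0  # 300 Hz is an integer multiple of 25 fps
```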

The timecodes required more tinkering, because they are less standardized. We deliberately chose a professional video camera so that it would accept LTC input, and the Qualysis system already had an input for that as well. The audio signals were timestamped by recording the LTC signal as another audio channel, relying on the recording software to start recording all channels at the same time. Recording LTC signals as if they were audio is common practice, and there is software available for decoding them. Some basic analysis shows that a sampling rate of around 20 kHz already gives enough frequency resolution to decode the signal properly.

3.7 Recording protocol

Most of the recording process was manually operated, because the Qualysis system, the audio recorder and the video camera had to be activated and stopped independently by pressing or clicking a button; besides, calibration recordings had to be performed regularly to guarantee the reliability of the data. We recorded more than 150 takes, so it was equally important to have a very consistent naming scheme in order to make batch processing of the recordings easier and to avoid losing or mixing up data. To minimize human error on all these factors, we established a recording protocol and followed it rigorously:

• At the beginning of the session:
  – Reset the SMPTE clock generator
  – Calibrate the Qualysis cameras with the provided wand
  – Check that all the hardware is running
  – Start video capture
  – Calibrate the load cell by recording the weights
  – Record a bow calibration with the load cell

• At the end of the session:
  – Record another bow calibration with the load cell
  – Stop video capture

Chapter 4

FEATURE EXTRACTION

Once the data has been acquired, the descriptors are computed as follows. First, a vector of frame-based descriptors is extracted from each frame of 3D data. Then, bow velocity is derived from the obtained vectors, and all the data is smoothed using quadratic interpolation. After that, each recording is segmented into parts where the violin is actually playing and parts where it is not. Finally, in the case of force baseline recordings, a Support Vector Machine (SVM) regression model is trained against the recorded force; in the case of violin recordings, the SVM model is applied to the descriptors in order to obtain an approximation of the bow force.

Figure 4.1: Block diagram of the feature extraction process

4.1 Overview

The descriptors are extracted in two steps: first, a set of low-level features is extracted from the 3D data, comprising vectors and vector bases; then, the high-level features that make up the dataset are derived from both the low-level data and other high-level features. High-level features are presented first, because they are the most relevant in further analysis.

4.1.1 High-level features

Current string String that is currently being played (1 - G, 2 - D, 3 - A, 4 - E). Notice that the ordering is reversed from usual violin conventions.

Bow-bridge distance Distance between the bridge and the sounding point.

Bow position Distance from the frog to where the bow contacts the string (i.e. low when playing near the frog and high near the tip).

Bow velocity Velocity of the bow in the playing direction (derived from the position).

Bow inclination Vertical angle of the bow, the one that changes when playing different strings.

Bow tilt Angle that defines how much of the hair ribbon contacts the string (i.e. 0 when the whole hair ribbon contacts the string).

Bow skew Angle between the bow and the bridge (i.e. 0 when they are parallel).

Bow force Force that the bow exerts on the string, derived from various descriptors using machine learning methods (see Section 4.4).

4.1.2 Low-level features

Violin vector basis Vectorial basis that follows the violin, with its components being the bridge vector, the string vector and the plate normal vector.

Bow vector basis Vectorial basis that follows the bow, with its components being the hair ribbon length vector, the hair ribbon width vector and the hair ribbon normal vector.

Bow to string distances Length of the shortest segment between the bow and each of the strings.

Bow position (centered) Bow position, without taking the current string into account (computed with an imaginary string at the center of the bridge).

Bow pseudoforce (both sides) Bow to string distance for the current string, computed separately with the left and right sides of the hair ribbon.

Bow deformation (angle) Angle between vectors related to the bow length at the frog and at the tip. Measures the arching of the bow.

Bow deformation (distance) Distance between the antennas and the tip end of the hair ribbon. It decreases when the bow is arched.

4.2 Computation of low-level descriptors

4.2.1 Vector basis

Two vector bases, one for the violin and another for the bow, are defined to make the computation of some descriptors easier. Both of them are computed by fitting a plane to a set of markers using a least-squares minimization, and then finding the spanning vectors by taking projections and cross products of other vectors defined by markers; Figure 4.2 illustrates their definition.

Figure 4.2: Illustration of the violin and bow planes, the sounding points and some of the basis vectors (borrowed from (28)).

Figure 4.3: Illustration of the hair ribbon and string deflection, together with the shortest segment joining the sounding points (borrowed from (28)).

The least-squares plane fitting works as follows: given a set of points (or markers) p_i, it finds an optimal plane that minimizes the sum of squared distances from every point to the plane. It is computed by performing a Singular Value Decomposition (SVD) on the covariance matrix of the data, given by the expression X^T X, where X is a matrix with a row for each (mean-centered) observation and as many columns as dimensions (3 in our case). The plane is spanned by the two major singular vectors, and the normal to the plane is the singular vector with the smallest singular value.

The least-squares method does not guarantee the orientation of the plane, that is, the sign of the normal vector. We probed the plane with a point whose side we knew in advance, and then adjusted the sign by multiplying the normal by -1 or 1 accordingly.
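The plane fit plus sign probe described above can be sketched as follows (a minimal illustration with NumPy; the function name and interface are ours, not taken from the thesis toolchain):

```python
import numpy as np

def fit_plane(points, probe=None):
    """Least-squares plane fit: returns (centroid, unit normal).

    points: (N, 3) array of marker positions.
    probe:  optional 3-vector known to lie on the positive side of the
            plane, used to disambiguate the sign of the normal.
    """
    points = np.asarray(points, float)
    centroid = points.mean(axis=0)
    X = points - centroid                  # mean-center the observations
    # The right singular vector with the smallest singular value of X is
    # the plane normal (equivalently, the smallest eigenvector of X^T X).
    _, _, vt = np.linalg.svd(X)
    normal = vt[-1]
    if probe is not None and np.dot(np.asarray(probe) - centroid, normal) < 0:
        normal = -normal                   # flip so the probe is on the + side
    return centroid, normal
```

The probe point plays exactly the role described in the text: any marker known to sit above the plate (e.g. a bridge marker) fixes the sign of the normal.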

Violin vector basis

From the markers on the violin, a vector basis is defined that serves as a frame of reference for the computation of other descriptors. The basis is constructed from the string plane, which is the plane defined by the markers on the body of the violin with the normal pointing up, parallel to the bridge. It is comprised of the following 3 vectors:

• Bridge vector (v1): Unit vector that goes from the bridge end of the G string to the bridge end of the E string, projected onto the violin plane.

• String vector (v2): Unit vector that points in the direction of the cross product of the bridge and normal vectors, which is parallel to the strings, from the bridge to the nut.

• Normal vector (v3): Unit vector that points up from the violin top plate. It is the normal of the violin plane, computed with a least-squares fitting as described above, with the sign adjusted to be consistent.

Bow vector basis

From the virtual markers on the 4 edges of the hair ribbon, a plane is fitted. Then, the bow vector basis is defined on this plane, similarly to the violin vector basis. The plane has the normal pointing out of the bow; when playing, it points towards the violin.

• Hair length vector (b1): Unit vector that goes from the frog end of the hair ribbon to the tip end, projected onto the hair ribbon plane. The right side of the hair ribbon is taken for the computation, although it shouldn't matter because the left and right sides are assumed to be parallel.

• Hair width vector (b2): Unit vector that, given the hair vector, spans the hair ribbon plane; computed as the cross product between the hair length vector and the hair normal vector.

• Hair normal vector (b3): Unit vector normal to the hair plane, as given by the least-squares fitting.
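Putting the plane fit and the cross products together, the construction of the bow basis from the four hair-ribbon corner markers might look like this (a hedged sketch with NumPy; argument names are ours, and the actual pipeline may handle markers differently):

```python
import numpy as np

def bow_basis(frog_left, frog_right, tip_left, tip_right):
    """Orthonormal (b1, b2, b3) basis from the four hair-ribbon corners.

    b3: plane normal from the least-squares fit; b1: frog-to-tip vector
    of the right side, projected onto the plane; b2: cross product of
    the length and normal vectors, spanning the plane.
    """
    corners = np.array([frog_left, frog_right, tip_left, tip_right], float)
    centered = corners - corners.mean(axis=0)
    _, _, vt = np.linalg.svd(centered)
    b3 = vt[-1]                                           # hair normal vector
    raw = np.asarray(tip_right, float) - np.asarray(frog_right, float)
    raw = raw - np.dot(raw, b3) * b3                      # project onto plane
    b1 = raw / np.linalg.norm(raw)                        # hair length vector
    b2 = np.cross(b1, b3)                                 # hair width vector
    return b1, b2, b3
```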

4.2.2 Bow to string distances

As already mentioned, the sounding point is the point where the bow touches the string. A relaxed definition allows viewing it from three different perspectives: as the point on the bow hair, either on its left or right side, closest to the string, and as the point on the string closest to the bow. This allows talking of sounding points even when the bow is off the string; they can be found by intersecting the shortest segment that crosses both a line representing the string and a line representing the bow length (29).

The distance between bow and string, understood as the length of the segment that joins the two sounding points between a string and the right side of the bow hair ribbon, is computed for every string. This distance is made negative if the bow is below the string and positive if above, according to the violin plane. The estimated string being played is the one closest to the bow. This method differs from the traditional approach of setting thresholds on the bow inclination.
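The shortest segment between two 3D lines, and with it the pair of sounding points, can be computed in closed form; the routine below is a standard geometric sketch (argument names are ours), not the thesis implementation:

```python
import numpy as np

def sounding_points(p1, d1, p2, d2):
    """Closest points between two lines p1 + t*d1 (bow hair) and
    p2 + s*d2 (string). Returns (point on line 1, point on line 2, t, s).
    Assumes the lines are not parallel (denom != 0)."""
    p1, d1 = np.asarray(p1, float), np.asarray(d1, float)
    p2, d2 = np.asarray(p2, float), np.asarray(d2, float)
    r = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ r, d2 @ r
    denom = a * c - b * b          # approaches 0 when the lines are parallel
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom
    return p1 + t * d1, p2 + s * d2, t, s
```

The signed bow-to-string distance is then the length of the segment joining the two returned points, with the sign taken from which side of the violin plane the bow point falls on.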

Bow position (centered) The bow position (centered) is a simplified version of the bow position high-level descriptor. It is defined as the distance between the bow frog and the bow sounding point on the right side of the hair ribbon, estimated with a virtual string that starts at the mean of the four bridge string ends and ends at the mean of the two measured nut string ends. It has the advantage of not presenting discontinuities on string changes, which makes it more suitable for estimating the bow velocity.

Bow pseudoforce The bow pseudoforce is the distance between bow and string, computed as

described above in string detection, for the current string. It is computed for both sides of the bow. When the bow is pressing the string, the bow hair and the stick bend, so the bow appears to be below the string if the distance is computed from the frog and tip ends of the hair ribbon alone. The sign is the opposite of the original distance definition: positive is below the string. We expect it to be correlated with the force when positive, with a dependence on other parameters such as bow position. See Figure 4.3 for more details.

4.2.3 Bow deformation

The bow deformation features measure how much the bow stick bends when high forces are applied. They are used during the regression process as a rough estimate of the bow force, because the pseudoforce alone is not enough in regions where the hair ribbon deflection is very small, such as near the tip. The measurements are very noisy, due to the small variability of the feature compared to the noise of the markers, but with appropriate smoothing they can improve the regression performance.

Bow deformation (angle) The angle between the bow stick near the tip and near the frog is also a good indicator of the force, because it measures the deformation of the stick.

Bow deformation (length) The distance between hair ribbon tip (mean) and frog antenna (mean).

4.3 Computation of high-level features

Current string

The current string is estimated by computing, for each of the 4 strings, the distance between the right-side bow sounding point and the string sounding point, and picking the string with the smallest distance.

Bow-bridge distance

Bow-bridge distance is the distance between the string sounding point (same as above) and the bridge end of the current string. It is what most violinists actually mean to control when they talk about the sounding point.

Bow position

Similarly to the centered bow position, it is computed as the distance between the bow frog and the bow sounding point for the currently played string. Therefore, it is 0 when playing at the frog and up to around 70 cm when playing near the tip, which is the length of the bow.

Bow velocity

The bow velocity in the bowing direction is the derivative of the bow position, computed taking into account the following considerations:

• The bow position is not computed from the current string but from a virtual string in the middle of the bridge, to avoid discontinuities on string changes.

• To reduce the effect of noise, we use a 9-point approximation of the derivative. This approximation has been extended to be consistent under non-uniform sampling rates, because some of the points are not present due to tracking errors.
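One way to realize a multi-point derivative that stays consistent under non-uniform sampling is to fit a local polynomial to each window of timestamped samples and differentiate the fit at the window center. The sketch below assumes a 9-point window by default; the exact scheme used in the thesis may differ:

```python
import numpy as np

def local_derivative(t, x, half_width=4, degree=2):
    """Derivative of x(t) at each sample, from a local polynomial fit.

    For each sample, a polynomial of the given degree is fitted to the
    surrounding window (up to 2*half_width + 1 = 9 points by default)
    using the actual timestamps, so gaps from dropped samples
    (non-uniform t) are handled naturally.
    """
    t, x = np.asarray(t, float), np.asarray(x, float)
    v = np.empty_like(x)
    for i in range(len(x)):
        lo, hi = max(0, i - half_width), min(len(x), i + half_width + 1)
        # Fit around the local time origin for numerical stability.
        p = np.poly1d(np.polyfit(t[lo:hi] - t[i], x[lo:hi], degree))
        v[i] = p.deriv()(0.0)    # derivative of the fit at the center
    return v
```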

Bow angles

We computed an orthonormal basis that represents the violin and another one that represents the bow. By computing the transformation from one basis to the other (in terms of pitch, yaw and roll), we can obtain the relative orientation of the bow with respect to the string:

Bow inclination Bow inclination is the angle that varies when the musician plays one string or

another. Zero is flat, negative values mean closer to the G string and positive closer to E.

Bow tilt Bow tilt defines how the bow gets in contact with the string, regarding the surface of the hair ribbon that is on the string. It is normally 0; positive values mean the hair ribbon is facing the bridge.

Bow skew Bow skew makes the bow not perpendicular to the string. It is normally 0. Negative values mean the tip gets closer.

Bow force

The bow force is calculated by performing a regression using some other descriptors and a model previously built by recording a load cell, as described in the next section.

4.3.1 Noise reduction

Some descriptors are smoothed using a quadratic frame-based technique, and saved with the suffix smooth. This smoothing is applied to all descriptors when training the force regression model.

The smoothing is performed as follows: the data is divided into overlapping frames, and within each frame the data is replaced by a quadratic polynomial fit to it; this allows working with non-uniform sampling rates. Then, each frame is multiplied by a window, and all samples are put back together with the overlap-add method. Finally, each sample is divided by the sum of the window values at that sample.
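The overlap-add smoothing just described can be sketched as follows (frame and hop sizes are illustrative choices, not the values used in the thesis):

```python
import numpy as np

def quadratic_smooth(x, frame=64, hop=16):
    """Overlap-add smoothing: each frame is replaced by its quadratic
    polynomial fit, multiplied by a window, and the overlapping frames
    are summed; every output sample is then divided by the sum of the
    window values covering it."""
    x = np.asarray(x, float)
    out = np.zeros(len(x))
    wsum = np.zeros(len(x))
    win = np.hanning(frame)
    for start in range(0, len(x) - frame + 1, hop):
        idx = np.arange(start, start + frame)
        rel = idx - idx.mean()                         # centered for conditioning
        fit = np.polyval(np.polyfit(rel, x[idx], 2), rel)  # quadratic fit
        out[idx] += fit * win
        wsum[idx] += win
    covered = wsum > 0
    out[covered] /= wsum[covered]      # normalize by the window overlap
    out[~covered] = x[~covered]        # edge samples no window reached
    return out
```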

4.4 Force estimation

The bow force feature extraction process is a bit more complex than the others, because it involves some machine learning techniques. As introduced before, it is based on finding the mapping between the force applied on the string (measured with a load cell for training) and the physical deformation of the bow, measured through

the motion capture sensors.

Figure 4.4: Plot of the force measured during the load cell characterization.

First of all, a method for recovering the actual force in Newtons applied to the sensor from the reading in Volts is given. Then, we detail how we performed the data pre-processing prior to training a model that can predict the bow force from some of the motion capture descriptors. Finally, the model is presented and some results are discussed.

4.4.1 Load cell calibration

The load cell used for recording the bowing pressure has a high quality transducer, which yields a voltage with a nearly linear response when a certain force is applied to it. This voltage is scaled from -5 to 5 Volts with an instrumentation amplifier across the dynamic range of the transducer. Moreover, when converted into the digital domain using an acquisition board, the value is an integer number that depends on the gain of the analog-to-digital converter. Most of the analysis could have been done in these units, but in order to give more scientific value to the data, standardized units are much more useful, in the same way that other descriptors are expressed in millimeters or degrees. Therefore, a method for converting these integer values back into Newtons is required.

Before and after every recording session, the load cell was recorded alone, with different known weights on top of it. This data was then used for fitting a two-degree polynomial that transforms the measured signal back into Newtons, knowing that the force applied by the weights is their mass times the gravity.
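A minimal version of this calibration fit, assuming for simplicity that the tare weight of the support structure is known (in the thesis it is unknown but constant), could look like:

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def fit_calibration(volts, masses_g, tare_g=0.0):
    """Fit the degree-2 polynomial mapping load cell readings (V) to
    force (N). masses_g are the known calibration weights in grams;
    tare_g is the (here assumed known) weight of the support structure."""
    force_n = (np.asarray(masses_g, float) + tare_g) / 1000.0 * G
    return np.polyfit(np.asarray(volts, float), force_n, 2)

def volts_to_newtons(coeffs, volts):
    """Apply the fitted calibration polynomial to a reading in Volts."""
    return np.polyval(coeffs, volts)
```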

For fitting the polynomial, however, the total force being applied to the load cell has to be known. At any moment, the force was the known weight we had put on it,

plus the weight of the supporting structure (unknown but constant), all of this positive or negative depending on whether we were pushing on or pulling from the sensor. In our case, we only put weight on the sensor, i.e., we did not pull from it, as would be the case during the recordings with the bow.

Figure 4.5: Fitted polynomial for the loadcell Newtons to Volts conversion, with the measured samples displayed as dots on top.

This method assumes that throughout the recordings (both for calibrating and with the violin) the load cell is aligned with the gravity vector. During the recordings with weights, spirit levels were used to verify this, but when the load cell is being bowed there is little control. In that case, however, the force of the bow is active, not due to gravity, so that component is not affected by the alignment, and we assume the difference in force due to changes of orientation, in the range they occur, is negligible.

4.4.2 Regression model

The algorithm we chose for estimating the force function from the motion capture data is Support Vector Regression (30). A radial basis function seems to be the best choice for the kernel, as pointed out by (29). This particular kernel is very prone to overfitting, so parameters such as the γ coefficient must be chosen carefully; in our case, we chose a value of γ = 0.1.

The features the model was trained with were:

• Bow position
• Bow pseudoforce (left)
• Bow pseudoforce (right)
• Bow tilt
• Bow deformation (angle)

Figure 4.6: Measured bow pressure (blue) and predicted (orange) using the method discussed above. Notice that, while the algorithm clearly distinguishes when the bow is on the load cell and follows the force contour, the peaks are not exactly where they should be; in this specific case, there is an underestimation.

We also tried adding bow deformation (length), but the results were qualitatively worse. Figure 4.6 shows an excerpt of a regression test on load cell data.
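To illustrate the shape of this regression step without depending on an SVM library, the sketch below uses kernel ridge regression with the same RBF kernel as a lightweight stand-in for SVR: the kernel and its γ parameter match the text, but the loss function does not, so this is not the thesis implementation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_rbf_ridge(X, y, gamma=0.1, lam=1e-3):
    """Kernel ridge regression with an RBF kernel: a simple stand-in
    for SVR, mapping bowing descriptors to force. Returns a predictor."""
    X = np.asarray(X, float)
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), np.asarray(y, float))
    return lambda Xq: rbf_kernel(np.asarray(Xq, float), X, gamma) @ alpha
```

In the real pipeline each training row would hold the five descriptors listed above (bow position, both pseudoforces, tilt, deformation angle) and the target would be the calibrated load cell force.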

Sample rejection and feature distribution

The values the descriptors can take when the violinist is not playing (i.e. when the bow is not on the string) can vary a lot; some of the descriptors themselves are actually ill-defined there. This doesn’t normally matter, because the goal of the project is to analyze gesture and sound together, not what the gestures are when there is no sound. All data that is not part of a note being played is discarded during the segmentation step (see Section 5.1).

This is critical, however, when estimating the bow force function we are seeking, which is expressed in terms of the bowing descriptors. Regardless of the value of the descriptors, whenever the bow is not in range the force has to be at most zero, and ideally exactly zero. This introduces a strong non-linearity in the function, which makes it much harder to estimate using machine learning techniques.

This effect can be mitigated by first estimating in which frames the violinist is not playing, and discarding them for the training. For instance, one could train a machine learning classifier such as a Support Vector Machine to identify frames where the violinist is playing, given the bowing descriptors and some basic audio features.

In our case, we already knew the force ground truth for the training recordings, so we decided to apply a threshold on it for deciding whether the bow was on the load cell or not. As shown in Figure 4.7, the histograms of the calibration recordings have a very steep peak in the leftmost region. This peak corresponds to time instants when the bow was not on the load cell, and by zooming in one can see that it is quite Gaussian. From this, we deduced that it was indeed the zero-force reading of the cell plus some additive Gaussian noise. We defined the threshold as the local minimum closest to the right of this zero-force peak, as pictured in the figure with vertical lines. The threshold detection had a hysteresis of 0.005 V on both sides, meaning that the state did not jump from not playing to playing unless the force was 0.005 V above the threshold, and vice versa.

This worked most of the time, but sometimes the load cell was bumped by something else while nobody was playing. To filter these samples out, we also applied thresholds, this time without hysteresis, on the values of the descriptors, excluding, for instance, anything with a bow position higher than the length of the bow, or a pseudoforce greater than 1 cm or smaller than -3 cm. The result was further cleaned with a median filter of length 5, to remove spikes of only one or two samples.
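The hysteresis thresholding and the median cleaning can be sketched as follows (function and variable names are ours; the 0.005 V hysteresis matches the text):

```python
import numpy as np

def hysteresis_threshold(signal, threshold, hyst=0.005):
    """Binary playing / not-playing decision with hysteresis: the state
    only flips once the signal moves hyst beyond the threshold."""
    state = signal[0] > threshold
    out = np.empty(len(signal), bool)
    for i, v in enumerate(signal):
        if state and v < threshold - hyst:
            state = False
        elif not state and v > threshold + hyst:
            state = True
        out[i] = state
    return out

def median5(mask):
    """Length-5 median filter on a boolean mask, removing short spikes."""
    padded = np.pad(mask.astype(int), 2, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, 5)
    return np.median(windows, axis=1) > 0.5
```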

Offset compensation

The regression model has a fundamental problem: when working with data from the violin, the measured bow-to-string distance when the bow contacts the string may be different from the one used for training on the load cell. This happens because the offset between the marker position and the actual contact point can vary, and because the string is deformed much more than the metal piece on the load cell that is bowed. Since the model is non-linear, this can have a large impact

51 1

Day 1 (start) 0.5

0 −0.5 −0.25 0 0.25 0.5 0.75 1 1.25 1

Day 1 (end) 0.5

0 −0.5 −0.25 0 0.25 0.5 0.75 1 1.25 1 Day 2 (start) 0.5

0 −0.5 −0.25 0 0.25 0.5 0.75 1 1.25 1

0.5 Day 2 (end)

0 −0.5 −0.25 0 0.25 0.5 0.75 1 1.25 1

Day 3 (start) 0.5

0 −0.5 −0.25 0 0.25 0.5 0.75 1 1.25 1 Day 3 (end) 0.5

0 −0.5 −0.25 0 0.25 0.5 0.75 1 1.25

Figure 4.7: Histograms of the loadcell recordings

52 on the prediction.

To compensate for this effect, we added a fixed offset to the pseudoforce features, which are the ones most affected; the offset was established by trial and error, and we found +2 mm to work best.

4.4.3 Evaluation of the force estimation process

Training take       Iterations   Support Vectors   Bounded Support Vectors
Day 1, beginning    31256        14052             13826
Day 1, end          42007        31233             30985
Day 2, beginning    47423        25605             25313
Day 2, end          46100        22031             21769
Day 3, beginning    40809        18877             18632
Day 3, end          42978        18126             17878

Table 4.1: Training statistics for the final regression parameters.

Cross-validation

In machine learning, it is common to evaluate a system with cross-validation, which means splitting a dataset with a known ground truth into several parts, then using all parts but one for training and the remaining one for testing the performance. However, we had samples that were very close to each other, because of the high sampling rate, while at the same time there were huge gaps in the gesture space, because of its high dimensionality. Therefore, a traditional cross-validation approach would probably have suffered from heavy overfitting.

Since we recorded a calibration take at the beginning and at the end of each session, we had two independent sets of data. We recorded both of them because we expected some variation in the response of the bow; but since the bow tension was not modified and the room had the same conditions, these variations can be expected to be small. We reasoned that if the variations were indeed small, a good regression would generalize enough to accommodate them and still provide a

reliable estimate, so we could use a model trained with data from the beginning of the session to test on the data from the end, and vice versa. This hand-made cross-validation would be much more reliable than a traditional one in terms of over-fitting.

Correspondence with the Schelleng diagram

Since we have no measured ground truth for bow force on the actual violins, we can only indirectly assess how well the regression is performing. Nevertheless, it is critical to have some metric, because parts of the regression process only affect data recorded on violins, such as the offset compensation.

Knowing that the bow force and the bow-bridge distance should be not only correlated but distributed in a way similar to the Schelleng diagram, we established, as a qualitative metric for evaluating bow force regressions, how closely the scatter plot of bow-bridge distance vs. bow force resembles a Schelleng diagram. See Section 6.1.1 for some details on the results.

4.5 Audio features

A number of audio features were extracted from the signal captured with the DPA microphone (see Section 3.3). Some of them, such as the Mel-Frequency Cepstral Coefficients, are still being worked on at the time of writing and will not be presented. The ones ready for analysis are pitch, aperiodicity and energy, all computed with an open-source implementation of the YIN algorithm (31).

All the audio features are computed frame-wise, by dividing the audio signals into a series of overlapping frames and weighting them with a window. Therefore, the sampling rate of the audio features is given by the audio sampling rate and the overlap between frames, understood as the increment in samples between them:

r_f = r_a / o

where r_f is the feature sampling rate, r_a the audio sampling rate, and o the increment in samples between frames.

In our case, the audio sampling rate was 44100 Hz and the window increment was 32 samples, so the effective audio feature sampling rate was 1378.125 Hz.
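These numbers are easy to double-check (the window length below is a hypothetical value; only the hop size matters for the feature rate):

```python
AUDIO_RATE = 44100   # Hz, audio sampling rate r_a
HOP = 32             # increment o between frames, in samples
WINDOW = 1024        # analysis window length (hypothetical value)

feature_rate = AUDIO_RATE / HOP
assert feature_rate == 1378.125      # Hz, as quoted in the text

# Number of full analysis frames for a 10-second recording:
n_frames = 1 + (10 * AUDIO_RATE - WINDOW) // HOP
```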

4.5.1 Pitch

Although we already know the pitch of the notes from the score, computing the actual pitch that was being played can be very useful. Its purpose is two-fold: first, it can verify that the musician was indeed following the score and that the segmentation and alignment are correct; and second, it gives information about finer-grain aspects of pitch, such as how it behaves in note transitions and within notes. However, for the parts of the analysis that compare how certain variables vary depending on the pitch, the score pitch, with one value per note, will be used.

4.5.2 Aperiodicity

The aperiodicity gives a measure of how noisy the sound is: the more aperiodic, the noisier it will sound, without a specific pitch, and the less aperiodic, the clearer its pitch will be. In our case, it is very useful for separating attack transients from the pitched part of the note, and for identifying ghost notes where the Helmholtz regime is not well established and the sound is not what one would expect.

4.5.3 Energy

The energy is related to the amplitude of the audio signal. Although it doesn't directly correspond to the perceived loudness, because that would require weighting certain frequencies more, as our auditory system does, it can give an idea of it. It will be more useful than the score dynamics in some cases, because a dynamics marking is a very subjective instruction and will most likely be executed differently in different repetitions of the experiments.
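A frame-wise energy computation consistent with the windowing scheme of Section 4.5 might look like this (a generic sketch, not the YIN implementation actually used; window and hop values are illustrative):

```python
import numpy as np

def frame_energy(x, window=1024, hop=32):
    """Frame-wise energy: the signal is split into overlapping frames,
    weighted with a Hann window, and the energy of each frame is the
    sum of its squared windowed samples."""
    x = np.asarray(x, float)
    win = np.hanning(window)
    n_frames = 1 + (len(x) - window) // hop
    return np.array([np.sum((x[i * hop : i * hop + window] * win) ** 2)
                     for i in range(n_frames)])
```

Because energy is quadratic in amplitude, halving the signal amplitude divides every frame's energy by four.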


Chapter 5

DATABASE

With all the recorded data, a database has been constructed, and it will be published in the near future as the MUSMAP I dataset. The database comprises around 25,000 violin notes, grouped in 150 takes. It will be published in two forms: as a large database of notes queryable by metadata, and as a group of 150 datapacks comprising the raw video, motion capture, audio, and all the computed descriptors and segmentations for each take.

5.1 Score-performance alignment

Once the descriptors were computed, the next step towards building the database was to align the performances to the score with a note-based segmentation, in order to obtain a set of segments with the time series of descriptors for individual notes.

Since there were over 150 recordings to align, each of them with over 100 notes, we decided to take a semi-automatic approach: we designed a segmentation algorithm that guessed the boundaries between notes without taking the score into account, and separately generated a set of metadata from the score of that particular recording. Then, we marked by hand the first note and the timestamp of the end of the last note of the recording on the plot. The system counted the number of notes and, if it matched that of the score, the segmentation was written to disk.

5.1.1 Zero-crossing finder

In order to detect the bow direction changes, we implemented an algorithm that finds zero-crossing points in the velocity signal. Since simply checking the sign of pairs of neighboring samples would produce many spurious detections because of noise, we implemented a somewhat more sophisticated algorithm.

The algorithm is given two margins, left and right. It then finds points where (1) there is a zero-crossing at their right; (2) the Nl samples on their left, where Nl is the left margin, have the same sign as the point; and (3) the Nr samples on their right, where Nr is the right margin, have the opposite sign to the point.

In our case, we ran the algorithm without a left margin, and with a right margin of around one third of the mean duration of the notes. In this way, it found zero-crossings right before the next note started; see Figure 5.1 for an example.
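The three criteria above can be sketched as follows; this is our own reimplementation, with invented function and parameter names.

```python
import numpy as np

def find_zero_crossings(v, left_margin=0, right_margin=10):
    """Find indices i such that the signal crosses zero between i and i+1,
    requiring the left_margin samples before i to share i's sign and the
    right_margin samples after i to have the opposite sign."""
    s = np.sign(v)
    crossings = []
    for i in range(left_margin, len(v) - right_margin - 1):
        if s[i] == 0:
            continue
        if s[i + 1] == s[i]:          # (1) no zero-crossing to the right
            continue
        if left_margin and not np.all(s[i - left_margin:i] == s[i]):
            continue                  # (2) left margin sign mismatch
        if not np.all(s[i + 1:i + 1 + right_margin] == -s[i]):
            continue                  # (3) right margin sign mismatch
        crossings.append(i)
    return crossings
```

The margins make the detector robust to single-sample noise blips, which a plain neighbor-sign check would report as direction changes.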

5.1.2 Graphical User Interface and program flow

As mentioned, the zero-crossing finder only finds bow direction changes, without taking the score into account. We implemented a Graphical User Interface where the user is presented with a plot of the bow velocity, with vertical lines marking the detected note boundaries. The steps performed for every take are the following:

• Open relevant files with the generated filenames
• Guess note boundaries with the default thresholds
• Plot the bow velocity and the pitch, with vertical lines on the detected boundaries
• Ask the user whether to continue, adjust thresholds or discard the take
• If the user validates the segmentation, ask for the beginning and end marks of the segment
• Check that the number of notes matches that of the score
• If not, throw an error and restart
• When the segmentation is done, write it to disk together with the auto-generated metadata.


Figure 5.1: Output of the automatic segmentation software, displaying the found boundaries (red), the pickup waveform (green), the bow velocity (blue), and the pitch (black). Notice that pitch does not vary within several notes, so using bowing gestures was very helpful.

5.2 Annotations

Some of the annotations are constant for a given take, i.e., they describe the take as a whole. They are extracted from the filename of any of the files belonging to that take, using the following regular expression:

day([0-9])_(\w+)_perm([A-Z])_inst([0-9])_([A-G])_([0-9]+).*

The others are calculated on a note-by-note basis, based on the score generation algorithm (see Section 2.2). In both cases, some transformations are applied to supply the user with more useful information during data exploration. The following is a list of all the fields the database contains:

Performer This is extracted with the first capture group of the regular expression. The field actually says day, because we recorded a different performer every day. They are numbered from 1 to 3, as they have been presented throughout the report.

Articulation Also extracted from the regular expression (second capture group), it is either legato or martele as detailed in Section 2.1.3.

Permutation Another regular expression group, the third, which can be A or B. See Section 2.2 for more details on their meaning.

String (name) The next capture group is the string; the first annotation derived from it, string name, can be G, D, A, or E.

String (MIDI) The MIDI code of the note that sounds when the string is not stopped, also known as the open string note.

Take This is the last capture group, which indicates the take within the exercise. It is usually 0001, unless the recording went wrong and had to be repeated, in which case it is 0002, 0003, etc.

Tone Can be ord (ordinary), pont (sul ponticello), or tasto (sul tasto), de- pending on where the score says the note should be played.

Dynamic Can be p (piano), mf (mezzoforte), or f (forte), depending on what the score says about the current note.

Pitch (MIDI) MIDI code of the pitch being played, according to the score. The pitch in Hz can be obtained as p_hz = f_A · 2^((p_midi − 69)/12), where f_A is the frequency of the note A4 (440 Hz in our case) and p_midi is the MIDI code of the pitch.

Pitch (name) Extracted from the MIDI code, of the form A4, F#3, etc. This is very useful for musicians, who are more used to this representation, and for the repovizz platform, which parses it for displaying scores (see Section 5.3).

Pitch (interval) Number of semitones between the open string note and the current pitch, computed as the MIDI code of the current note minus the MIDI code of the current open string note.

Pitch (finger) Which finger the musician was told to use to stop the string, following usual violin conventions: the numbering goes from 1, the index finger, up to 4, the little finger.

Pitch (position) Left hand position the musician was told to use to play the note, following usual violin conventions: 1 is the lowest, closest to the nut, and the increments are in natural notes.

String length Relative effective string length, which is the ratio between the distance from the stopping point to the bridge and the length of the string. Given by the expression 2^(−ps/12), where ps is the pitch interval in semitones described previously.
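As a concrete sketch, the following fragment parses the take-level fields from a filename and computes the derived pitch fields; the example filename, the underscore separators, and all function names are our own illustrative assumptions.

```python
import re

# Separators in the filename pattern are assumed; the capture groups
# follow the fields described above.
TAKE_PATTERN = re.compile(
    r"day([0-9])_(\w+)_perm([A-Z])_inst([0-9])_([A-G])_([0-9]+).*")

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
OPEN_STRINGS = {"G": 55, "D": 62, "A": 69, "E": 76}  # open string MIDI codes

def parse_take(filename):
    """Take-level annotations from a (hypothetical) filename."""
    m = TAKE_PATTERN.match(filename)
    return {"performer": int(m.group(1)), "articulation": m.group(2),
            "permutation": m.group(3), "instrument": int(m.group(4)),
            "string_name": m.group(5), "take": int(m.group(6))}

def pitch_hz(p_midi, f_a=440.0):
    """p_hz = f_A * 2^((p_midi - 69) / 12)."""
    return f_a * 2.0 ** ((p_midi - 69) / 12)

def pitch_name(p_midi):
    """Note name of the form A4, F#3, etc."""
    return f"{NOTE_NAMES[p_midi % 12]}{p_midi // 12 - 1}"

def string_length(p_midi, string_name):
    """Relative effective string length 2^(-ps/12), with ps the pitch
    interval in semitones above the open string."""
    ps = p_midi - OPEN_STRINGS[string_name]
    return 2.0 ** (-ps / 12)
```

For instance, a hypothetical filename such as "day2_legato_permA_inst1_G_0001.wav" would yield performer 2 playing legato on the G string, first take.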

5.2.1 Look-up tables

In order to navigate the bow stroke database with ease, we saved the chunks of descriptors corresponding to each note separately. Then, we built an index that, given a value of one of the metadata fields, tells which notes have it. The index is saved as a collection of boolean vectors, one for every possible value of every

parameter. This allows easy logical filtering (e.g. select all notes that are forte, played on the G string, and not sul ponticello) by using element-wise boolean operations on the vectors.
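The look-up scheme can be sketched with NumPy boolean vectors; the toy index below is invented for illustration.

```python
import numpy as np

# One boolean vector per possible value of each metadata field,
# with one entry per note (6 notes in this toy example).
index = {
    ("dynamic", "f"): np.array([1, 0, 1, 0, 1, 0], dtype=bool),
    ("string", "G"):  np.array([1, 1, 0, 0, 1, 1], dtype=bool),
    ("tone", "pont"): np.array([0, 0, 1, 1, 0, 0], dtype=bool),
}

# "Forte, on the G string, and not sul ponticello":
mask = (index[("dynamic", "f")]
        & index[("string", "G")]
        & ~index[("tone", "pont")])
selected_notes = np.flatnonzero(mask)  # indices of the matching notes
```

Each query then reduces to a few vectorized AND/OR/NOT operations, regardless of how many notes are indexed.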

5.3 The repovizz platform

Repovizz (32) is an integrated online system capable of structural formatting and remote storage, browsing, exchange, annotation, and visualization of synchronous multi-modal, time-aligned data. Motivated by a growing need for data-driven collaborative research, repovizz aims to resolve commonly encountered difficulties in sharing or browsing large collections of multi-modal data. In its current state, repovizz is designed to hold time-aligned streams of heterogeneous data, such as audio, video, motion capture, physiological signals, extracted descriptors, or annotations. Most popular formats for audio and video are supported, while CSV formats are adopted for streams other than audio or video (e.g. motion capture or physiological signals). The data itself is structured via customized XML files, allowing the user to (re-)organize multi-modal data in any hierarchical manner. Datasets are stored in an online database, allowing the user to interact with the data remotely through a powerful HTML5 visual interface accessible from any standard web browser; this feature can be considered a key aspect of repovizz, since data can be explored, annotated, or visualized from any location or device.

All the data that we recorded is being uploaded to the repovizz platform. We created a dataset named MUSMAP I, with the roughly 150 takes as different datapacks inside it. Each datapack contains the 3 relevant audio channels (ambience, close-up and pick-up), some audio descriptors computed automatically with the Essentia extractor (33) for each channel, the motion capture data, the reference video, the gestural and audio descriptors that we extracted, and the segmentations at the note, dynamics group, tone group and duration group levels.

Figure 5.2: Screenshot of the repovizz visualizer displaying a datapack with one of the takes from MUSMAP I.


Chapter 6

PRELIMINARY DATA ANALYSIS

In this chapter, we present three main kinds of representations of the data based on simple descriptive statistics: scatter plots, which show how the descriptors are spread on projections of the gestural and auditory spaces; curves, which show how descriptors change over time during a single bow stroke; and box plots, which compare the statistical distribution of scalar features that take a single value for the whole note in different groupings. In some cases, different groups of data are presented at the same time by using different colors. In general, the objective of these graphic representations is to check how the musicians performed when asked to play with different parameters, such as dynamics or articulations.

6.1 Introduction

Before getting into the analysis, we will define the kinds of plots that will appear for visualizing the dataset, together with a brief discussion on the subset choice for plotting.

Box plots

Some of the descriptors, such as the bow-bridge distance, do not change much within a single note, but comparing their average values across different notes can be very interesting. In these cases, we have computed their median value

along all frames belonging to each note, and then drawn box plots for sets of notes, showing the median (horizontal line), sample mean (circle), 25-75% quartile box (colored box), whiskers extending to the most extreme samples within 1.5 times the inter-quartile range beyond the quartiles (T-shaped lines), and outliers beyond them.

Curves (functional box plots)

The curve plots need some explanation: what is pictured in them is the median and an estimation of the spread of one of the descriptors, computed for different strokes of the same group, versus normalized time. This can be understood as a functional box plot, or the box plot of a series of functional observations (34). There were several tradeoffs when representing these dimensions, which will now be discussed.

The normalized time is defined as a linear transformation of the relative time axis (the time in seconds since the beginning of the note) to the range (0,1). This normalization allows plotting several bow strokes, which have different durations, at the same time, and hence comparing between them. This is not the best way of aligning the curves, a process known as curve registration, but it is very easy and fast to perform, and the results obtained are good enough for the points we wanted to make. A better approach would be to align each possible pair of curves with each other using a non-linear warping technique such as Dynamic Time Warping (DTW), find the average curve as the one with the minimum alignment cost to all the others, and then align all the curves to that one, as discussed in Chapter 7.

The descriptor being represented has two features drawn on the plot: the median curve, computed as the median of all the points with the same normalized timestamp and drawn with the brightest color; and the 25-75% quartile region, filled with a bright color.
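The construction of these curves can be sketched as follows; linear interpolation onto a common normalized-time grid stands in for proper curve registration, as discussed above, and the names are ours.

```python
import numpy as np

def median_curves(strokes, n_points=100):
    """Resample variable-length per-note descriptor series onto a common
    normalized-time axis in (0, 1), then take the point-wise median and
    25-75% quartiles across strokes."""
    t = np.linspace(0.0, 1.0, n_points)
    resampled = np.stack([
        np.interp(t, np.linspace(0.0, 1.0, len(s)), s) for s in strokes])
    median = np.median(resampled, axis=0)
    q25, q75 = np.percentile(resampled, [25, 75], axis=0)
    return t, median, q25, q75
```

Plotting `median` as a bright line over the shaded band between `q25` and `q75` reproduces the functional box plots described above.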

Scatter plots

The scatter plots allow comparing two dimensions at the same time, and expose correlations between them that might be missed when doing the other kinds of plots. They are mainly used to picture how different groups of data cover different regions or clusters in space, regardless of the time dimension.


Figure 6.1: Bow-bridge distance vs bow force scatter plots for Players 1, 2 and 3 (left to right).

Each note has approximately between 150 and 300 frames; that means that the scatter plots often contain on the order of millions of points. To deal with that, we drew them with a rasterizer: what is plotted is essentially a 500×500-bin two-dimensional histogram, which can be precomputed much more efficiently than trying to draw all the points.
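A sketch of this rasterization using NumPy's 2-D histogram (the function name is ours; the actual implementation may differ):

```python
import numpy as np

def rasterize_scatter(x, y, bins=500):
    """Precompute a 2-D histogram of the points instead of drawing them
    individually; the result can be displayed as an image."""
    hist, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    return hist, x_edges, y_edges
```

The histogram counts every point exactly once, so no information relevant to the density plot is lost, while rendering cost becomes independent of the number of points.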

6.1.1 Player selection

We decided to do the main analysis on one of the players, because the force estimation process gave quite different results for different calibrations. In order to choose the best one, we plotted the bow-bridge distance vs bow force scatter plot for each player, hoping to identify the one with the characteristic closest to the Schelleng diagram, as shown in Figure 6.1. We chose Player 2 because their plot gives the best approximation: both the high-force region far from the bridge and the low-force region close to the bridge are cut out.

6.2 Bowing technique

The first thing to analyze in the dataset is the difference between the two bow strokes the musicians were asked to perform, legato and martele (see Section 2.1.3). As a reminder, legato is supposed to be more continuous, in the sense that the beginning of the note should be smooth and the intensity should be kept constant, while martele is supposed to have a strong and fast attack, with a significant energy decay until

the sound stops completely. This raises very different expectations about what the temporal evolution of parameters such as the bow velocity will look like.

Figure 6.2: Bow-bridge distance vs bow force (left) and bow velocity vs bow force (right) scatter plots for legato (blue) and martele (orange) bow strokes.

To begin with, the two articulation types cover different regions of the gesture space; Figure 6.2 shows the bow-bridge distance vs bow force and bow velocity vs bow force planes to illustrate that. Martele strokes are centered around low velocities for all force ranges and high velocities with low force, while legato strokes have more weight at moderate velocities, and go higher in force there. Since martele has a sharp attack, where the bow begins the stroke with a very high force but no velocity and then accelerates quickly, this is the shape we expected.

As Figure 6.3 shows, there is a huge difference in the bow velocity curve profile for the different articulations. This averages over all musicians; the difference is even bigger when separating them, as will be discussed later. As expected, martele has a strong peak in the first half of the note, corresponding to the attack, followed by a decay, while legato is more or less constant except at the beginning and end, where the bow changes direction.

Audio features also validate these findings; the curve plot for the energy of the audio signal, in Figure 6.4, is very similar to the bow velocity profile, which is expected because bow velocity is strongly correlated with the amplitude of the resulting audio signal (see Section 1.3.2).

Aperiodicity is more interesting for this audio feature analysis; as seen in Figure 6.4, it is very low at the middle of the note, when the bow is in full Helmholtz

Figure 6.3: Bow velocity (left) and bow force (right) temporal profiles for legato (blue) and martele (orange) bow strokes. The x axis is normalized time, as explained before.

Figure 6.4: Audio energy (left) and audio aperiodicity (right) temporal profiles for legato (blue) and martele (orange) bow strokes. The x axis is normalized time, as explained before.

regime and there is little noise, but different things happen at the boundaries. At the beginning, martele has a much higher aperiodicity due to the noise in its sharp attack transient, but it also reaches the minimum much faster. However, it returns to noise (high aperiodicity) faster as well: as the decay softens the note, the signal-to-noise ratio decreases and there are more chances of slipping out of the Helmholtz regime.

Figure 6.5: Bow-bridge distance vs bow force (left) and bow velocity vs bow force (right) scatter plots for forte (blue), mezzoforte (orange) and piano (green) bow strokes.

6.3 Dynamics

Another interesting aspect of the dataset is how musicians (in this case, Player 2) interpret dynamics indications. Dynamics are clearer than articulation, because there is a well-defined metric, but the range is still subjective. Figure 6.5 shows that different dynamics populate clearly separated regions of the gesture space, especially along the bow force dimension, which makes sense: given the same note duration, which can lead the musicians not to change the bow velocity much, what has the greatest effect on the perceived loudness is the bow force, because it modifies the high-frequency content of the sound; unsurprisingly, forte has the highest bow force and piano the lowest. Bow-bridge distance also shows an interesting pattern: it is usually lower (the bow closer to the bridge) for forte bow strokes. This can be explained with the Schelleng diagram: when playing forte the musician is using more bow force, and therefore playing closer to the bridge is needed in order to maintain the Helmholtz regime.


Figure 6.6: Bow-bridge distance vs bow force (left) and bow velocity vs bow force (right) scatter plots for forte (blue), mezzoforte (orange) and piano (green) bow strokes. The top plots are legato, the bottom ones martele.

When separating legato and martele strokes in the above plot, as in Figure 6.6, some interesting features arise: the Schelleng region effect is more accentuated in legato strokes, maybe because in martele the strongest part of the note is the attack, where being in a stable Helmholtz regime is not as important. As for the bow velocity, in legato strokes it is low only for piano and, to some extent, mezzoforte notes, which suggests the musician was using the whole bow for playing forte and only part of it when playing piano, effectively playing notes of the same duration with lower velocity. In the martele strokes, the bow velocity has a very flat profile for piano notes, suggesting that these notes are much closer to legato than to martele. Additionally, the strong peak at velocity 0 that we had detected in martele strokes previously appears almost exclusively in forte strokes, suggesting that these are the notes with an actual sharp attack and fast decay. The same happens with very high velocities, probably corresponding to the segments after the attack, where the bow is drawn very fast to create a loud transient.

Figure 6.7: Bow velocity (left) and bow force (right) temporal profiles for legato (top) and martele (bottom) notes, separating forte (blue), mezzoforte (orange) and piano (green) bow strokes. The x axis is normalized time, as explained before.

These findings agree with the temporal profiles of the strokes for gestural features. As pictured in Figure 6.7, both bow velocity and bow force grow across the whole profile as the dynamics indications get louder. Piano strokes have smoother shapes, probably because the musician was trying to avoid bumps and attacks, and because the range of acceptable bow force or velocity was smaller. Especially interesting is the bow velocity for legato strokes, which when playing forte has very accentuated peaks at the beginning and at the end of the note, corresponding to the musician accelerating too much and rectifying.

Temporal profiles of audio features also follow the same trend. As happened before, audio energy is highly correlated with bow velocity. Interestingly, the energy difference between piano and mezzoforte is about the same as between mezzoforte and forte, which means the musician had good control and consistency over this parameter and could plan the distribution of the different dynamics correctly. There is an exception: the end of legato and forte strokes, where there is an energy peak. This is probably what violin professionals refer to as inflation, overanticipation or the

shark effect: at the end of the note, the loudness increases because the musician is already thinking about the next note, which is probably as loud as the preceding one, and this is considered bad practice when playing expressive music.

Figure 6.8: Audio energy temporal profiles for legato (left) and martele (right) notes, separating forte (blue), mezzoforte (orange) and piano (green) bow strokes. The x axis is normalized time, as explained before.

Finally, comparing legato and martele bow strokes side by side for each dynamics level, using note-level aggregate features, reveals some new insights, as shown in Figure 6.9. For instance, the velocity range is much greater in martele strokes: while both articulation types reach similar maxima, piano strokes in martele are much slower. However, the opposite happens with bow force: legato strokes have more range, especially on the forte side, where the force is larger. This could be because the musician has more time to adapt and find the limits, thanks to the note being more static. The same happens with bow-bridge distance, which additionally happens to be consistently closer to the bridge for legato strokes. The reason is probably the same: since the note is more static, the musician can push the limits of the Schelleng diagram further with enough confidence.

6.4 Tone

Figure 6.10 shows the general distribution of the three main sounding points in the gesture space. There is a clear distinction in bow-bridge distance when asking the musicians to play at different sounding points,


Figure 6.9: Bow velocity, bow force and bow-bridge distance for legato (blue) and martele (orange) bow strokes, grouped by dynamics.


and they play with more pressure when asked to play closer to the bridge. This could be explained by the fact that playing closer to the bridge actually requires more pressure in order to stay within the Helmholtz regime; however, we had asked them not to change dynamics when changing the sounding point. The figure also shows that when playing closer to the bridge the musicians use less bow velocity, obtaining a smaller amplitude and therefore compensating for the effect of the increased bow force.

Figure 6.10: Bow-bridge distance vs bow force (left) and bow velocity vs bow force (right) scatter plots for sul tasto (blue), ordinary (orange) and sul ponticello (green) bow strokes.

When separating legato and martele bow strokes, as in Figure 6.11, we can observe that in legato the bow-bridge distance for different tones is more separated; in martele, everything is more mixed up, although the regions are still identifiable. In contrast with what happened with dynamics, there is no distinction in tone in the 0


velocity region of the martele scatter plot. The only thing worth noting is that, as already mentioned, sul tasto notes are played with higher bow velocity.

Figure 6.11: Bow-bridge distance vs bow force (left) and bow velocity vs bow force (right) scatter plots for sul tasto (blue), ordinary (orange) and sul ponticello (green) bow strokes. The top plots are legato, the bottom ones martele.

When putting dynamics into the equation, things become much clearer. Figure 6.12 shows that when playing mezzoforte, tone correlates with bow-bridge distance, as expected, but when playing piano or forte there is little difference in any of the features when telling the musician to play in another region. This could be explained by the facts that (1) combinations such as piano plus sul ponticello or forte plus sul tasto may fall outside the Schelleng region for the range the musician has chosen; (2) the musician is willing to sacrifice what the score says in order to maintain the Helmholtz regime; and (3) dynamics have more importance in western music than tone, especially in violin playing. Actually, the bow-bridge distance for forte is quite a bit closer to the bridge than for piano, albeit independent of the requested tone. The main difference between legato and martele seen in the box plots is, as already mentioned, that legato is played with substantially higher force and closer

75 1.5 90 1.25 75 1 60 0.75 45 0.5 30 0.25 15 Bow force estimation (V) 0 Bow − bridge distance (mm) 0 f mf p f mf p

1.5 90 1.25 75 1 60 0.75 45 0.5 30 0.25 15 Bow force estimation (V) 0 Bow − bridge distance (mm) 0 f mf p f mf p

to the bridge; what we can add is that this is especially true for louder dynamics: forte and, to a certain extent, mezzoforte.

Figure 6.12: Bow force and bow-bridge distance in legato (top) and martele (bottom) bow strokes, for sul tasto (blue), ordinary (orange) and sul ponticello (green) notes, grouped by dynamics.

6.5 Duration

The musicians were asked to perform notes with two different durations, quarter and half notes. The expected duration d of the notes, in seconds, is given by:

d = 60 n / t

where n is the number of beats per note, 2 for half notes and 1 for quarter notes, and t is the tempo in beats per minute (bpm); in our case, 120 for legato strokes and 132 for martele strokes. Therefore, we expect the following note durations:


Figure 6.13: Duration of half (blue) and quarter (orange) notes, for legato and martele strokes, and bow-bridge distance vs bow force and bow velocity vs bow force scatter plots, also for half and quarter notes.

Stroke type   Quarter notes   Half notes
Legato        0.5 s           1 s
Martele       0.455 s         0.909 s

Table 6.1: Expected duration of the notes.
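The values in the table follow directly from the formula above (the function name is ours):

```python
def expected_duration(beats, tempo_bpm):
    """Expected note duration in seconds: d = 60 * n / t."""
    return 60.0 * beats / tempo_bpm

# From the table: legato at 120 bpm, martele at 132 bpm.
legato_quarter = expected_duration(1, 120)   # 0.5 s
legato_half = expected_duration(2, 120)      # 1 s
martele_quarter = expected_duration(1, 132)  # ~0.455 s
martele_half = expected_duration(2, 132)     # ~0.909 s
```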

A box plot with the same note grouping, in Figure 6.13, shows that the musicians actually played faster than asked in half notes, but were on time for quarter notes. This could be explained by the fact that we gave them a reference beat with quarter note clicks at the beginning. When playing half notes, they had to count to 2 for every note, which encourages drifting compared to changing notes at every beat. Notice that the mean and median are very close to each other, and the percentiles are quite symmetrical; this is an indicator of a Gaussian distribution, which will not be the case for other groupings.

Another interesting thing to look at when analyzing changes related to duration is the bow force and velocity plane. As also shown in Figure 6.13, half note strokes tend to have lower velocity in general, but more force at the beginning of the note. This could be explained by the fact that we did not tell the musicians how much bow they should use when playing, that is, whether to go from the tip to the frog or vice versa at each stroke, or stay closer to the tip all the time, for instance. When playing quarter notes, they probably tried to use more than half of the bow length they used for half notes, and therefore had to move faster to meet the time constraints. The increased force at the beginning of half notes could simply be because they had

more time to stop the note and prepare the attack; notice that this region is only covered by martele strokes, as detailed before.

6.6 Player

To close this analysis chapter, Figure 6.14 presents two of the most interesting box plots presented previously, each of them repeated for each one of the players. This figure clearly shows how misleading using only one musician for the recordings could be: most of the conclusions drawn from the plots for Player 2 are not valid for the other musicians. For instance, Players 1 and 3 always change the bow-bridge distance when playing different tones, regardless of dynamics, although Player 3 also changes it when playing different dynamics and Player 1 does not. The bow velocity range when playing with different dynamics, which differed between legato and martele when analyzing Player 2, does not change within articulation types for Players 1 and 3 either, and Player 1 plays considerably slower in general. This is just a very brief outlook of the capabilities of a dataset such as MUSMAP I, which does not even consider the different violins, and which aims at illustrating the kind of research that it will allow in the near future.

Figure 6.14: Comparison of bow-bridge distance (left) for sul tasto (blue), ordinary (orange) and sul ponticello (green) and bow velocity (right) for legato (blue) and martele (orange) notes, all grouped by dynamics, separating Player 1 (top), Player 2 (middle) and Player 3 (bottom).


Chapter 7

CONCLUSION

In this last chapter, we present a review of the whole project. Since it is part of a longer-term Marie Curie IOF action, many aspects were always expected to remain open, and will probably be addressed within the next months. In short, the MUSMAP I dataset has been successfully designed, recorded and processed, although there is still some work to be done before it can be published. Some improvements to the force estimation methodology have been proposed, and the whole process has been extensively documented, although this documentation is also still to be cleaned up before being made available.

7.1 Achievements

All proposed objectives have been accomplished, from the methodology and setup definitions for multimodal experiments involving violin performance to the development of specific software tools for data analysis. Parts of the MUSMAP I dataset are already online, and part of the software for synchronizing multimodal data and organizing the database will be published soon under an open license.

We have also presented a promising preliminary analysis of the MUSMAP I dataset, which should be a starting point for further research and encourage the usage of the data we have generated both within and outside the MUSMAP project.

7.2 Future work

As already mentioned, the potential of the MUSMAP I dataset alone is huge compared to the depth of the analysis performed in this project. The most immediate future work would be to expand the analysis in depth:

• Better curve registration for the time profiles
• Mathematical modeling of the bow strokes as some family of curves
• Further audio analysis (e.g., Mel-Frequency Cepstral Coefficients)
• Dimensionality reduction on the auditory and motor spaces
• Modeling of the mapping between auditory and motor spaces
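As a sketch of the dimensionality-reduction item, principal component analysis over a matrix of motor descriptors could look as follows. The matrix here is a synthetic placeholder (200 notes by 6 descriptors), not actual MUSMAP I data:

```python
import numpy as np

# Hypothetical motor-space matrix: one row per note, columns are bowing
# descriptors (velocity, force, bow-bridge distance, ...). Illustrative only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))

# PCA via SVD: centre the data, decompose, keep the top-k components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
scores = Xc @ Vt[:k].T             # notes projected onto a 2-D motor space
explained = (S ** 2) / (S ** 2).sum()  # variance ratio per component
```

A low-dimensional `scores` space like this is also a natural input for modeling the mapping between the motor and auditory spaces.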

Besides that, there are many elements of the feature extraction process that could be refined. For instance:

• Systematic search for the optimal load cell regression offset, and evaluation of alternatives
• Better documentation of the software and publication as open source
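The first item could be approached as a one-dimensional search: sweep candidate offsets and keep the one that minimizes the residual of the calibration fit. The sketch below uses synthetic calibration data and a simple no-intercept linear model; it illustrates the search strategy only, not the actual load-cell calibration procedure:

```python
import numpy as np

# Synthetic stand-in for calibration data: raw load-cell readings paired
# with a reference force. In reality these would come from the sensor rig.
rng = np.random.default_rng(2)
true_offset = 0.35
raw = rng.uniform(0.0, 1.0, 200)
force = 2.0 * (raw - true_offset) + rng.normal(0.0, 0.01, 200)

def fit_residual(offset):
    """RMS residual of a no-intercept linear fit after removing `offset`."""
    x = raw - offset
    gain = (x @ force) / (x @ x)
    return np.sqrt(np.mean((force - gain * x) ** 2))

# Systematic (grid) search over candidate offsets instead of manual tuning.
candidates = np.linspace(0.0, 1.0, 1001)
best = min(candidates, key=fit_residual)
```

The grid could be replaced by a scalar minimizer, and alternative calibration models evaluated by comparing their residuals on held-out calibration takes.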

Finally, the MUSMAP project still has a long way to go, and future work will most likely follow that direction. That includes, apart from what has already been mentioned:

• Recording similar experiments with more repetitions, more musicians and more instruments
• Developing audio synthesis techniques that take advantage of the gesture descriptors already computed

7.3 Acknowledgments

This project has been funded by the Agència de Gestió d'Ajuts Universitaris i de Recerca (AGAUR) through the COLAB scholarship program, and by the MUSMAP Marie Curie IOF action. I would also like to thank the people from the Music Technology department at McGill University for hosting me, especially Gary Scavone and the Computational Acoustics Modeling Laboratory. Finally, I would like to thank Esteban Maestre for co-supervising this project, always putting in as much time as needed and often more, and guiding me through this journey from college classes to the academic world.



Appendices


Appendix A

MUSICAL SCORES


Appendix B

QUESTIONNAIRE

Computational Acoustics Modeling Lab, Music Technology, Music Research, Schulich School of Music, McGill University

MUSMAP Project Auditory-motor patterning of music performance via multi-modal analysis of playing technique

QUESTIONNAIRE FOR BOWING ANALYSIS EXPERIMENT

Principal Investigator
Dr. Esteban Maestre
Computational Acoustics Modeling Lab, Music Technology, Schulich School of Music, McGill University
Marie Curie Postdoctoral Fellow
[email protected]

Faculty Supervisor
Prof. Gary Scavone
Computational Acoustics Modeling Lab, Music Technology, Schulich School of Music, McGill University
Associate Professor
[email protected]

Please answer the following questions:

1. Age (must be over 18):
2. Sex:
3. Years of formal musical training:
4. Years of formal violin performance training:

Instrument #1:
a. From 1 to 10, please assess how comfortable you felt while playing this instrument:
b. Please briefly describe the main traits of this instrument in terms of playability:

c. Please briefly describe the main traits of this instrument in terms of tone (or timbre):

Instrument #2:
a. From 1 to 10, please assess how comfortable you felt while playing this instrument:
b. Please briefly describe the main traits of this instrument in terms of playability:

c. Please briefly describe the main traits of this instrument in terms of tone (or timbre):

Instrument #3:
a. From 1 to 10, please assess how comfortable you felt while playing this instrument:
b. Please briefly describe the main traits of this instrument in terms of playability:

c. Please briefly describe the main traits of this instrument in terms of tone (or timbre):

5. From 1 to 10, please assess how comfortable you felt while playing in these experimental conditions:

Name Date

Signature

CODE (to be filled by the researcher in charge):