DEGREE PROJECT IN THE FIELD OF TECHNOLOGY MEDIA TECHNOLOGY AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2017

Mapping physical movement parameters to auditory parameters by using human body movement

OLOF HENRIKS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Mapping physical movement parameters to auditory parameters by using human body movement

Mappning av fysiska rörelseparametrar till ljudparametrar genom användning av mänsklig kroppsrörelse

By: Olof Henriks, [email protected]

Submitted for the completion of the KTH programme: Master of Science in Engineering in Media Technology, Master’s programme in Computer Science.

Supervisor: Ludvig Elblaus, KTH, School of Computer Science and Communication, Department of Media Technology and Interaction Design.

Examiner: Roberto Bresin, KTH, School of Computer Science and Communication, Department of Media Technology and Interaction Design.

Work commissioned by: EU Dance Project.

Date of submission: 2017-01-15

Abstract

This study focuses on evaluating a system containing five different mappings of physical movement parameters to auditory parameters. Physical parameter variables such as size, location, among others, were obtained by using a motion tracking system, where the two hands of the user would work as rigid bodies. Translating these variables to auditory parameter variables gave the ability to control different parameters of MIDI files. The aim of the study was to determine how well a total of five participants, all with prior musical knowledge and experience, could adapt to the system concerning both user generated data as well as overall user experience. The study showed that the participants developed a positive personal engagement with the system and this way of audio and music alteration. Exploring the initial mappings of the system established ideas for future development of the system in potential forthcoming work.

Sammanfattning

Denna studie fokuserar på utvärdering av ett system som innehåller fem olika mappningar av fysiska rörelseparametrar till ljudparametrar. Variablerna för de fysiska rörelseparametrarna inhämtades via ett motion tracking-system, där en användares bägge händer verkade som stela kroppar. Översättning av dessa variabler till variabler för ljudparametrar möjliggjorde styrning av olika parametrar inom MIDI-filer. Målet med studien var att besluta hur väl fem användare, alla med tidigare musikalisk kunskap och erfarenhet, kunde anpassa sig till systemet gällande både användargenererad data samt övergripande användarupplevelse. Studien visade att användarna utvecklade ett positivt personligt engagemang med systemet och detta sätt att påverka ljud och musik. Undersökning av systemets initiala mappningar etablerade idéer för framtida utveckling av systemet inom potentiellt kommande arbete.

Acknowledgements

EU Dance Project, my principal. Thank you for allowing me to work with you.

Giampiero Salvi, my supervisor at KTH for the first parts of my thesis.

Ludvig Elblaus, my supervisor at KTH for the later parts of my thesis.

Roberto Bresin, my examiner at KTH.

Jimmie Paloranta, my partner in crime and endless source of assistance.

Glossary

BPM: Beats Per Minute
MIDI: Musical Instrument Digital Interface
OSC: Open Sound Control
Velocity (MIDI): Representation of how forcefully, or loud, a note is played

Contents

1 Introduction
  1.1 Purpose
  1.2 Question
  1.3 Project limitations

2 Background
  2.1 Historical background
  2.2 Mapping physical parameters to auditory parameters
  2.3 Similar studies
  2.4 Description of development tools

3 Method
  3.1 Equipment and experimental setup
  3.2 System development
  3.3 User tests
  3.4 Interview procedure

4 Results
  4.1 Mapping 1: Size-Pitch
  4.2 Mapping 2: Location-Spatialisation
  4.3 Mapping 3: Orientation-Tempo
  4.4 Mapping 4: Motion-Loudness
  4.5 Mapping 5: Distance-Duration
  4.6 Overall user experience

5 Discussion
  5.1 System evaluation
  5.2 Possible improvements
  5.3 Future studies

6 Conclusion

References

Appendix

1 Introduction

A great amount of research and development has been done on using contemporary digital technology to aid human musical expression, and it is still ongoing to this day, as made evident by the NIME conferences (NIME, 2016). One such approach is the mapping of physical parameters to auditory parameters within the context of music making.

Physical parameters in this context are features derived from the movement of the human body, or of separate body parts or limbs, as well as location-based features. All of these are connected to the way humans interact or express themselves, whether in a group or alone. These kinds of body movements can be mapped within a system to a sound representation that corresponds to the action taken.

Auditory parameters include the ways music, or audio in general, can be altered. Some examples would be pitch alteration, amplitude change (loudness), the duration of single or several segments of audio, or the pace of a musical piece (tempo).

1.1 Purpose

The subject of the thesis is the mapping of five physical parameters to five auditory parameters, based on common parameters derived from a meta-study by Dubus & Bresin (2013). This was done by translating human body movement to physical parameters and mapping these to auditory parameters, which in turn were used to alter MIDI files. The purpose is to determine the performance of each mapping by means of user tests and interviews.

1.2 Question

Given five mappings based on the meta-study by Dubus & Bresin (2013), how well can a user adapt to any of these mappings in terms of both measured numerical data and user experience?

Sub-questions:

1. Can the user generated data show a measurable increase in adaptation as the experiment progresses?

2. How do the users perceive their own skill for each of the mappings?

The outcome of the experiments will be an evaluation of the physical user movement parameters, the auditory parameters, and their respective mappings. The hypothesis will be tested by determining the functionality of a system used by a person who has prior music experience but does not hold any former knowledge of the system.

1.3 Project limitations

Correctly determining all possible parameter variables and mappings will not be covered by the thesis; instead, the thesis gives an initial evaluation of implementing physical-to-auditory parameter mappings. The thesis will focus on the mapping of five physical parameters to five auditory parameters, as it would not be feasible to expand into the vast number of possible parameter mappings described by Dubus & Bresin (2013). These parameters, as well as the mappings, are listed below.

Physical parameters: Distance, Location, Motion, Orientation, Size
Auditory parameters: Duration, Loudness, Pitch, Spatialisation, Tempo

Mappings, labelled M1-M5.

• M1: Size-Pitch

• M2: Location-Spatialisation

• M3: Orientation-Tempo

• M4: Motion-Loudness

• M5: Distance-Duration

2 Background

2.1 Historical background

Humans have always had ways to express themselves through music and physical body movement. The different forms of expression range from social gatherings and artistic creations to war dances and religious rituals, as well as several other fields of application.

2.1.1 History of embodied musical practice

One of the simplest human musical expressions is using the voice to sing, scream, whisper, and so on. According to a study by Arensburg et al (1989), a hyoid bone that dates back around 60,000 years has been discovered in Israel. This bone suggests that Neanderthals would have had the same physiological voice capabilities as modern day humans, as the bone has stayed relatively unchanged since.

Musical instruments can be traced back tens of thousands of years; a flute crafted from a bear bone, found in 1996, is possibly around 43,000 years old. This has been disputed, however, as possible predator bite marks could have created the holes of the flute, which would render the crafting of the flute unintentional (Bower, 1998). It is also discussed by Conard et al (2009) that flutes discovered in Germany date back approximately 40,000 years.

The usage of human body movement in combination with music is ever present in society. It ranges from rave clubs to ballroom dancing, where movement of the body is used to further the appreciation of musical pieces. There are, however, other uses for dance and similar bodily expressions. In Native American culture, war dances were used to increase morale before an upcoming confrontation (Wilson et al, 1988), as these exhibitions would acknowledge the warriors' place within the tribe and define a sense of purpose.

In the 1970s, a system named Selspot was developed to aid in the study of human motion (Welch & Foxlin, 2002). Human body movement was tracked with the use of several position-sensing detectors in the form of analog optical sensors, while the user would wear a number of light-emitting diodes.

In 1984, Michel Waisvisz gave a performance with a computer music controller called "The Hands" (Waisvisz, 1999). This controller used sensors attached to the hands of the user, which could manipulate music by translating different physical

parameters (distance between hands, bowing gestures, etc.) to input data for MIDI data sequences (Krefeld & Waisvisz, 1990).

2.1.2 Sonification

In Hermann & Hunt (2011), sonification is defined as the concept of delivering information using non-speech audio.

The oldest known usage of sonification can be traced back to 3,500 B.C.E. in the cradle of civilization, Mesopotamia, where auditors (from the Latin word auditus, meaning "hearing") could compare different reports of inbound and outbound goods, as well as those currently in stock, to determine any deficiencies (Worrall, 2009). These reports could then be deemed correct or inaccurate by listening to the other auditors, i.e. hearing them.

A doorbell is a machine-age example of sonification, as it generates a sound to convey information about a visitor, where that sound is delivered to a specific location to inform about the guest at hand. The first model of an electromagnetic doorbell, which used electric wires to send a signal from a button to ring a bell, was invented by Joseph Henry in 1831 and used the same principles that would later be used in the development of the telegraph (Copp & Zanella, 1993).

According to Capp & Picton (2000), a sonification device called the optophone was invented in 1913 and was used to interpret printed text as musical notes. The device scanned a line of text by using a row of five light-emitting spots that would reflect onto the page surface. The device would then register the reflection and generate different notes depending on the character that was scanned, thereby helping blind persons to read.

These are a few examples of sonification. However, to avoid confusion, the thesis will be based on the sonification method called parameter mapping sonification (Hermann & Hunt, 2011), as the term sonification covers several other methods of usage.

2.2 Mapping physical parameters to auditory parameters

Physical parameters define objects or humans in time and space, whereas auditory parameters define a sound source or an audio output in relation to the human auditory system, or simply the sense of hearing. These different parameters are described in the two sections below, 2.2.1 and 2.2.2. The classification of the parameters was done by Dubus & Bresin (2013) by summarising a total of 179 scientific publications within the field. All the high-level categories, as well as the descriptions of the parameters, are derived from that summary if not explicitly stated otherwise.

2.2.1 Physical parameters

The physical parameters were divided into a total of five high-level categories.

1. Kinematics

2. Kinetics

3. Matter

4. Time

5. Dimensions

The first two categories are linked to the motion of the human body, and thus relevant for the study. The category Kinematics is based on motion of the human body, and includes the location of the object, the distance between objects, the orientation of the object, and several more. Kinetics is also linked to motion of the human body, but focuses on the reason for the movement and the forces that cause the motion. Such reasons can be the energy or force of the object, the current state of activity, or pressure.

Matter is essentially the description of the physical attributes of the body, or object. Time is the linear arrangement of intermediate events, which includes the time elapsed between two specific events, the rate of occurring events, and the frequency of the signal, among others. The final category, Dimensions, is essentially the geometry of the object and its surroundings.

2.2.2 Auditory parameters

The auditory parameters follow the same framework as the physical parameters, where the parameters were partitioned into five high-level categories. The main difference here is that the category named Spatial refers to a set of parameters that are commonly used in combination with each other, as they are usually not addressed individually. There are also auditory parameters that belong to more than one high-level category.

1. Pitch-related

2. Timbral

3. Loudness-related

4. Spatial

5. Temporal

Pitch-related parameters are pitch, i.e. how high or low a note is perceived, and pitch range, the lower and upper bounds of the output audio. Plack et al (2005) discuss that pitch can be interpreted both as the frequencies connected to specific musical notes and as simply the frequency of the sound, which means that the parameters can be used in both a musical and a non-musical sense. This was also mentioned by Stevens & Volkmann (1940), who suggested that pitch is interpreted by the human auditory system whereas a frequency is measured with audio equipment. These are two highly quantifiable parameters, as they do not include any personal interpretations.

The second category is named Timbral, and covers the factors that separate sounds that have interchangeable pitch and loudness. This is usually the definition of the parameter timbre, but timbre is instead used in the report by Dubus & Bresin (2013) as a general parameter within the category Timbral, used when the timbre is unspecified. An example of a specified timbral parameter is enhancing the brightness of an audio clip (Zölzer, 2011). It is also discussed by Patel (2010) that the perceptual features of a properly built instrument are of greater aesthetic value in comparison to a cheaper counterpart.

Loudness-related contains the parameters used to describe the loudness (the amplitude) of the signal as well as the loudness dynamics. Loudness uses the human auditory system as reference (Rumsey & McCormick, 2012), as it concerns whether a person experiences certain frequencies of the output audio as loud, moderate, or near silent.

The category named Spatial refers to the location of the sound source, and is described by using several connected sub-parameters that are usually present as a combination of two or more settings. The sub-parameters include different kinds of panning (moving the audio output between sound sources), ranging from stereo to surround panning, i.e. the usage of two or several loudspeakers (or other sound sources). The category also includes parameters of head-related transfer functions, such as the difference in audio arrival time between the two ears, as well as the frequency and amplitude differences that occur when taking the positioning of the ears into account (Ziegelwanger & Majdak, 2014).

The final category is named Temporal, and defines aspects of the output audio that are related to tempo, which can be described as the number of beats per minute (BPM) of a musical piece, or to different durations, which can be rhythmic, ambient, or other duration properties.

Some auditory parameters can belong to more than one category. Melody, the ordering of groups of musical notes, belongs to both the Pitch-related and the Temporal category. Also, the parameter labelled Harmony, i.e. chords that can be combined within the rules of music theory, belongs to both the Pitch-related and the Timbral category.

2.2.3 Examples of parameter mappings

The idea of mapping a physical parameter to an auditory counterpart means that the physical parameter is derived from any of the parameters described in section 2.2.1, and that the parameter output data is then used as the input to a system that matches that data to an auditory parameter. This will allow a user to alter the output audio by moving their body, or by any other physical attribute.
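
To make the principle concrete, the following minimal sketch (written in Python purely for illustration; the system described later in this thesis was built in Pure Data) shows one common way to realise such a mapping: the physical value is clamped and normalised, then rescaled to the auditory parameter's range. All names and ranges below are hypothetical.

    def map_parameter(value, phys_min, phys_max, aud_min, aud_max):
        """Linearly map a physical parameter value onto an auditory parameter range."""
        # Clamp the physical value to its valid range.
        value = max(phys_min, min(phys_max, value))
        # Normalise to 0..1, then rescale to the auditory range.
        normalised = (value - phys_min) / (phys_max - phys_min)
        return aud_min + normalised * (aud_max - aud_min)

    # Example: a hand separation of 0..1 m mapped onto a pitch offset of 0..12 semitones.
    print(map_parameter(0.5, 0.0, 1.0, 0.0, 12.0))   # -> 6.0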

According to the summary by Dubus & Bresin (2013), the following are the most common mappings within the field. These mappings were present in at least 8 of the 60 projects analysed within the 179 papers included in the summary.

• Location to Spatialisation

• Location to Pitch

• Distance to Loudness

• Density to Pitch

• Distance to Pitch

• Density to Duration

As we can see from the list above, the usage of location as a physical parameter and pitch as an auditory parameter is common. The first mapping listed is distinct, as the location parameter is based on the location of an object, which could be a user, and spatialisation is, as described in section 2.2, connected to the location of the sound source. Location is commonly mapped to pitch, which density and distance are also commonly mapped to. This could be explained by the notion that pitch is one of the fundamental variables of musical expression (Terhardt, 1978).

The mapping of distance to loudness is clear, as the distance to a sound source would affect the perceived sound and thereby effectively affect the audio loudness. Distance can also be mapped to pitch, where the frequency of the audio would be altered based on how far the user is from the sound source, or any other object.

Density can work in many different ways; it can be how densely populated a room is, i.e. how many people there are in the room in comparison to the room size, but it is usually used to measure the amount of matter per volume unit. This parameter is most commonly used with the auditory parameters pitch and duration.

2.3 Similar studies

The related work of this thesis is divided into various areas: motion tracking, mapping of physical and auditory parameters, and the fabrication of prototypes and systems that combine both of these ideas.

A commercial motion tracking system, known as Kinect, has been developed by Microsoft. Kinect sold over 10 million units in the first three months after its launch (Zhang, 2012). The system is mainly used with the Xbox video game console to play video games, but it is also able to detect depth and colour as well as perform facial and voice recognition. The system has been used for the mapping of parameters: Yoo et al (2011) utilised this technology to translate a user's joint velocity and current joint position to MIDI data, consequently generating a mapping of parameters.

Chan et al (2011) proposed a motion tracking system, in combination with virtual reality, that would be used to teach users to dance. It would work so that the student would try to mimic a dance teacher and receive feedback on their own performance. The system would also be capable of recording motion tracking data, and thus be able to determine any student improvements.

A bachelor's thesis by Paloranta & Renström (2012) describes mapping the location of a musician to the loudness level of stage monitor loudspeakers. The thesis discusses how musicians can change the loudness level of the stage monitor system by moving closer to a specific monitor, or by moving from one monitor to another. This is achieved by using a motion sensor system that feeds output data to a Pure Data patch, and the system also supports more than one musician simultaneously. The idea of a multi-user system was that a musician could have a dedicated monitor loudspeaker, and thereby have a prioritised loudness level at that speaker even if another musician moved closer to it.

2.4 Description of development tools

2.4.1 Pure Data

Pure Data is a dataflow programming platform that uses a visual user interface, rather than text-based lines of code as in Java, C++, and so forth. The interface consists of several boxes with different functions (oscillators, arithmetic operations, and much more) that can be connected to each other, similar to the audio signal flow of guitar effect units and pedals. Pure Data can be used by artists as well as researchers or developers to create visual or audio based creations. The platform handles different types of input (including the OSC format), makes alterations to that input or renders additional sounds, and then outputs the generated audio in different data formats to additional software or audio emitting hardware (Pure Data, n.d.).

2.4.2 MobMuPlat

The name MobMuPlat derives from "Mobile Music Platform", and it is a smartphone version of Pure Data. MobMuPlat is used to create audio interfaces that allow users to make use of the sensors of their smartphone (accelerometer, gyroscope, and so forth) in combination with sound and MIDI data to create music and alter audio. The input data to the platform is derived from the sensors of the smartphone by using the OSC format. The platform allows for the creation of several graphical objects including, but not limited to, faders, knobs, buttons, and trigger buttons (MobMuPlat, n.d.).

2.4.3 OptiTrack Arena motion tracking

OptiTrack Arena is a program designed to receive data from a number of NaturalPoint motion tracking cameras. These cameras are calibrated to determine a physical tracking space, i.e. a room, and can thus, in addition to motion tracking, be used to determine a user's location within the physical space. Furthermore, the program can broadcast motion data using NatNet network streaming (OptiTrack, n.d.). This broadcast stream can then be translated to the OSC format, so that the data can be received on another computer or device as OSC data.
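
As a hedged illustration of how such OSC data could be received on another computer, the short Python sketch below uses the third-party python-osc package; it is not part of the tool chain described above, and the OSC address /rigidbody/righthand as well as the port number are hypothetical and depend entirely on how the NatNet-to-OSC bridge is configured.

    from pythonosc.dispatcher import Dispatcher
    from pythonosc.osc_server import BlockingOSCUDPServer

    def on_rigid_body(address, x, y, z):
        # Print the streamed position of one tracked rigid body.
        print(f"{address}: x={x:.3f} y={y:.3f} z={z:.3f}")

    dispatcher = Dispatcher()
    # Hypothetical address; the actual address depends on the OSC translation layer.
    dispatcher.map("/rigidbody/righthand", on_rigid_body)

    # Listen on all interfaces, port 9000 (assumed), and block while handling messages.
    server = BlockingOSCUDPServer(("0.0.0.0", 9000), dispatcher)
    server.serve_forever()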

3 Method

3.1 Equipment and experimental setup

Figure 1. Representation of the room (approximately 4x4 m) with indicative x- and y-axes.
Picture 1. Glove used for motion tracking.

The test area consisted of a room with a total of 16 motion tracking cameras. The cameras were placed in different parts of the room so that the participants could face any part of the room, i.e. be turned in any orientation. As the motion tracking system did not cover the entire room, with blind spots close to the ceiling, in corners, and close to the walls, the system boundaries were marked with white tape on the floor (represented by the octagon in fig 1). Furthermore, the participants would wear a set of gloves that worked as rigid bodies, thereby allowing motion tracking of both hands separately. The gloves were fitted with a set of three spherical silver markers (see picture 1) that would allow the motion tracking system to obtain a rigid body in the form of a triangle.

3.1.1 Mappings

We have investigated a total of five mappings within the study.

• M1 - Size to Pitch

• M2 - Location to Spatialisation

• M3 - Orientation to Tempo

• M4 - Motion to Loudness

• M5 - Distance to Duration

3.1.2 Participants in the tests

The participants allowed in the test were limited to people with prior musical experience, as the idea was that knowledge within the field of music would assist in modifying auditory parameters.

All five participants in the tests were male and between 22 and 31 years of age. They all were, or had been, guitar players. Two of them had experience with the trumpet, one of the participants was a percussionist, and one was a bass player. They had some experience with motion tracking systems such as Kinect, but mainly on a recreational level. Three of the participants were familiar with Pure Data, and three had experience with the OSC format.

3.2 System development

3.2.1 First system

The development process was divided into two different phases. The first phase consisted of connecting an Android smartphone to a Pure Data platform using the OSC format, as this would enable the translation of body movement to computer data. Additionally, the Android device would be strapped on with a training wristband to simulate tracking of a specific limb. The second step was to combine the Pure Data platform with the motion tracking system (described in the section below). The major drawback of using a smartphone is that it is somewhat limited compared to the motion tracking system, as physical parameters such as size, and similar, would be hard, if not impossible, to measure. Parameters such as motion and orientation are based on the three-dimensional coordinates that the smartphone outputs, where variable changes represent body movement and placing the phone face down simulates turning away from facing the speakers.

Audio was generated by using MIDI files within the system, and the Pure Data platform used the output from the smartphone as input parameters to alter the auditory parameters according to the movement of the phone. The platform could use any kind of song that is represented by a MIDI file, but was mainly based on files similar to the ones described in section 3.3. The system also used a monotone file, as this would allow alteration of the pitch in the MIDI sequence to create melodies.

3.2.2 System with motion tracking

When reworking the first system into the final version, the idea of input and output data remained somewhat similar. Input data for the Pure Data platform would still be received using the OSC format, where the main difference was how the motion tracking data was obtained. The participants would be outfitted with two gloves, acting as rigid bodies, where three tracking points on each hand would be tracked by the motion tracking system.

3.3 User tests

A total of 15 different files were used within the system, all based around the MIDI synthesiser known as Microsoft GS Wavetable. These files were divided into three categories: reference, test, and training files, where each mapping would have its own set of these three files.

The reference and test files were similar, with the exception that the reference file had one auditory parameter altered. For instance, concerning the parameter pitch, the reference file would have a certain number of MIDI notes that all had differences in pitch. The corresponding test file would have all notes set to the same pitch, but be alike in every other variable (duration of notes, loudness, etc.). This was so that the pitch of the notes in the test file could be altered by a user in an attempt to mimic the reference file. The user-altered test file and the reference file would then be used to determine how well the user was able to reproduce the auditory parameters of the reference file. Lastly, the training file was similar to the test file (in that it, in this example, had all MIDI notes set to the same pitch), but was approximately 30 seconds long. These files were used at the beginning of each mapping test to help the participants understand what amount of body movement corresponded to a certain shift in an auditory variable.

The user tests began with a formal introduction to the system, where the different physical and auditory parameters were explained to determine the participants'

prior knowledge as well as to assist in understanding how the system generates input and output data. Thereafter the participants would be subjected to the mappings M1-M5 in consecutive order.

Each mapping was tested by following a two-part formula.

1. First part.

(a) Acclimatise to the system, using the training file.
(b) Listen to the reference file once.

2. Second part, repeated five times.

(a) Listen to the reference file once.
(b) In two consecutive takes, try to alter the test file to reproduce the alterations of the reference file.

First the participant would do a trial run to acclimatise to the current mapping, using the training file to learn how the system responds to the current physical movement parameters. Then they would listen to the reference file to determine how the auditory parameter variables should be altered. The second part consisted of listening to the reference file, and then trying to reproduce the alterations in two consecutive attempts with the test file. The participant would repeat the second part a total of five times, which would generate 10 different user-created parameter alterations.

3.3.1 M1: Size-Pitch

The physical parameter size was represented by vertically holding an imaginary spherical ball. If the participant had their hands together, the size of the ball would be considered to be zero, and the size would increase when the participant moved their hands further apart. The largest possible size was 12, which occurred when the hands were approximately one meter apart. These values represented 12 semitones, or simply an octave. The idea is that increasing the size of the imaginary ball also increases the pitch.
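
A minimal sketch of how the size value could be computed and turned into a pitch offset is given below, assuming the tracked hand positions are available as (x, y, z) coordinates in metres. The one-meter span and the 12-semitone range follow the description above; the function name and the rounding to whole semitones are illustrative assumptions, not the exact implementation of the Pure Data patch.

    import math

    def size_to_semitones(left_hand, right_hand, max_size_m=1.0, semitone_range=12):
        """Map the distance between the two hands (the imaginary ball size) to a semitone offset."""
        dx, dy, dz = (l - r for l, r in zip(left_hand, right_hand))
        size = math.sqrt(dx * dx + dy * dy + dz * dz)
        # Hands together -> 0 semitones; roughly one meter apart -> 12 semitones (one octave).
        size = min(size, max_size_m)
        return round(size / max_size_m * semitone_range)

    # Example: hands about 0.5 m apart give a pitch offset of roughly 6 semitones.
    print(size_to_semitones((0.0, 1.2, 0.0), (0.5, 1.2, 0.0)))   # -> 6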

In the reference file for this mapping, the first seven notes of the lullaby "Twinkle, Twinkle, Little Star" were included. The participant would then use the ball representation to alter a test file that played seven consecutive notes, all with the same pitch, to mimic the reference file.

3.3.2 M2: Location-Spatialisation

Location was based on the indicative x-axis of the motion tracking system (see fig 1). This axis was defined in relation to the y-axis, which pointed in the direction the participant faced when facing the audio speaker system. The system would track the right hand to determine the participant's location, and thereby move the location of the sound source with the location of the user. Initially a hat was considered, as this would force the participant to move their entire body and not an extremity that could easily be stretched away from the body, but the system was unable to track a hat sufficiently. This was because the head would be at the edge of the system's tracking capabilities, and consequently the idea was scrapped.

As discussed in section 2.2.2, spatial refers to the perceived location of the sound source. In this mapping, stereo panning was used to represent spatialisation. The reference file consisted of a total of seven notes that were panned differently, and the participant would affect the location parameter to generate stereo panning variables in the corresponding test file. This test was also the only test to use wav files instead of MIDI files, as stereo panning was simpler to program this way: the two channels could simply be routed to different speakers. The files contained recordings of the same MIDI synthesiser as the other tests, so the participant would not be subjected to a different kind of audio. Finally, the participant was informed that simply extending their right arm and pointing at a specific speaker would change the spatialisation variables, as this would make the right hand move on the x-axis as well.
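
To illustrate the panning step, the sketch below converts the x-coordinate of the tracked right hand into a pair of left/right channel gains. The linear gain law and the assumed room extent of roughly ±2 m (based on the approximately 4x4 m test area) are illustrative assumptions rather than the exact values used in the system.

    def location_to_pan(x, x_min=-2.0, x_max=2.0):
        """Map the right hand's x-position (metres) to left/right channel gains in 0..1."""
        x = max(x_min, min(x_max, x))
        pan = (x - x_min) / (x_max - x_min)   # 0 = far left, 1 = far right
        return 1.0 - pan, pan                 # (left gain, right gain)

    print(location_to_pan(0.0))   # centred   -> (0.5, 0.5)
    print(location_to_pan(2.0))   # far right -> (0.0, 1.0)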

3.3.3 M3: Orientation-Tempo

Similar to M2, the third mapping only utilised the right hand. The difference was that the participant was asked to hold their hand straight down and instead turn their entire body, in a way similar to a volume knob on a common household hi-fi system. Turning the body in the way that a knob would increase the volume on the hi-fi system would in this case instead increase the tempo. Tempo was represented by altering the time difference between two subsequent notes to increase or decrease the tempo. When the participant faced the left speaker the time difference would be high, rendering a slow tempo. Facing the right speaker would set a lower time difference and thus give a fast tempo.

The reference file contained a total of 15 notes, and thereby a total of 14 time differences. The time differences in the reference file would go from larger to smaller, generating a transition from a slow to a fast tempo.

Furthermore, the test file would contain 15 notes with a time difference of zero between each set of consecutive notes, allowing the participant to create their own time differences.
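
A sketch of how the body orientation could be translated into a time difference between consecutive notes is shown below. The direction of the mapping (left speaker slow, right speaker fast) follows the description above, while the ±90 degree span and the 200-1000 ms interval bounds are illustrative assumptions.

    def orientation_to_interval(angle_deg, angle_min=-90.0, angle_max=90.0,
                                slow_ms=1000.0, fast_ms=200.0):
        """Map body orientation (degrees between the two speakers) to the time between notes."""
        angle_deg = max(angle_min, min(angle_max, angle_deg))
        turn = (angle_deg - angle_min) / (angle_max - angle_min)
        # Facing the left speaker -> long interval (slow tempo);
        # facing the right speaker -> short interval (fast tempo).
        return slow_ms + turn * (fast_ms - slow_ms)

    print(orientation_to_interval(-90.0))   # -> 1000.0 ms (slow)
    print(orientation_to_interval(90.0))    # -> 200.0 ms (fast)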

3.3.4 M4: Motion-Loudness

Generating motion variables involved a different approach than the previous mappings, as it was not based on the placement of either hand but rather on the movement of them. This was carried out by having the system track two "trigger points" for each hand, approximately 1.4 m and 1.7 m above the floor. Each trigger point could only be triggered if the other one for that hand had been triggered strictly prior to the first point, forcing the participant to move their hands in a manner reasonably similar to running. Consequently, the trigger points could not individually be triggered multiple times in a row. The idea was that greater motion would increase the loudness, and lesser movement would decrease it.

Representing loudness was done so that each of the four trigger points would increase the velocity of a MIDI note by six, while the system would automatically decrease the output velocity by six every 250 milliseconds. Hence, if four trigger points were activated per second, the loudness would remain the same. Triggering more than four points per second would continuously increase the loudness, and triggering fewer would steadily decrease the loudness. The system spanned a MIDI velocity offset of -30 up to +30, compared to the original file.
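
The velocity bookkeeping described above can be summarised in a small sketch. The +6 per trigger, the -6 decay every 250 ms, and the clamping of the offset to -30..+30 follow the text; the class structure, the method names, and the timer wiring are illustrative assumptions.

    class LoudnessTracker:
        """Keep a MIDI velocity offset that rises with trigger events and decays over time."""

        def __init__(self, step=6, lo=-30, hi=30):
            self.offset = 0
            self.step = step
            self.lo, self.hi = lo, hi

        def on_trigger(self):
            # Each activated trigger point raises the velocity offset by six.
            self.offset = min(self.hi, self.offset + self.step)

        def on_decay_tick(self):
            # Called every 250 ms by a timer: the offset drifts back down by six.
            self.offset = max(self.lo, self.offset - self.step)

        def note_velocity(self, base_velocity):
            # Apply the current offset to a note's original MIDI velocity (0..127).
            return max(0, min(127, base_velocity + self.offset))

With four decay ticks per second, exactly four triggers per second keep the offset constant, which matches the behaviour described above.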

The test phase was based around trying to generate a steady increase in loudness, as the reference file started off with notes of minimal velocity and after a few notes slowly increased the note velocity to finally land on maximal velocity. The reference file contained a total of 15 notes with continuously increasing velocity.

3.3.5 M5: Distance-Duration

Distance was based on how far the participant was from the sound source, i.e. movement on the y-axis (see fig 1). However, the system could only track a specific area of the room, so the minimal and maximal distances were set slightly inside the boundaries of the test area to prevent the participant from entering a location with inadequate motion tracking. Only the right hand was tracked in this test, for the same reason as in M2. When calculating the duration of separate notes, a short distance would correspond to a short duration and a longer distance to a longer duration.
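
A sketch of the distance-to-duration computation is shown below. The direction of the mapping (short distance gives short notes) follows the description above; the distance boundaries and the duration range in milliseconds are illustrative assumptions.

    def distance_to_duration(y, y_min=0.5, y_max=3.5, dur_min_ms=100.0, dur_max_ms=2000.0):
        """Map the right hand's distance from the speakers (y-axis, metres) to a note duration."""
        y = max(y_min, min(y_max, y))
        frac = (y - y_min) / (y_max - y_min)
        # Close to the speakers -> short notes; far away -> long notes.
        return dur_min_ms + frac * (dur_max_ms - dur_min_ms)

    print(distance_to_duration(0.5))   # -> 100.0 ms
    print(distance_to_duration(3.5))   # -> 2000.0 ms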

Testing the mapping was based on the participant trying to generate a total of eight different note durations, where the reference file contained both near-maximal and near-minimal durations as well as values in between.

3.4 Interview procedure

Before testing the first mapping, a short interview gathered information about the participant's own experience with music and music theory, as well as determining any experience with any kind of motion tracking system (commercial or noncommercial). This part would also determine whether the participant was familiar with Pure Data, OSC, or the OptiTrack Arena program.

The participant would then test mappings M1-M5 in consecutive order. After the participant had been subjected to a mapping, the participant would be asked a number of questions concerning that specific mapping. These questions would firstly determine whether the parameters, both physical and auditory, met the participant's expectations and whether they could be considered a sufficient representation. The next few questions included assessing how well they adapted to the mapping, how well the system responded to their movement, and whether there was anything in need of fine tuning or being reworked altogether. Lastly, the participant was asked to rank the overall experience on a scale from 1-10.

The last interview part took place after all mappings and their corresponding interviews were completed. The main focus was to determine how well the participant adapted to the different parts of the system, and a series of questions based on the acclimatisation to the system were answered and discussed. The idea here was to determine what factors, if any, were essential to initially begin using the system. These answers would also matter for the overall success of the user test, as any hardship in understanding the system would affect the results, as would whether any alternative mappings would be preferred instead of the ones within the study.

4 Results

4.1 Mapping 1: Size-Pitch

In fig 2, the results show participants P1-P5 and the total error per participant and take. The distance to the correct note is based on how many semitones off the participant was for a given note, and the total error of a take is the sum of all note distances within that take.
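
The error metric can be illustrated with a short sketch: each user note is compared with the corresponding reference note, and the absolute semitone differences are summed over the take. The MIDI note numbers in the example are hypothetical, not the actual test material.

    def take_error(reference_notes, user_notes):
        """Total error of one take: summed semitone distance between reference and user notes."""
        return sum(abs(ref - usr) for ref, usr in zip(reference_notes, user_notes))

    # Illustrative seven-note example (MIDI note numbers are hypothetical).
    reference = [60, 60, 67, 67, 69, 69, 67]
    attempt   = [60, 62, 67, 65, 69, 69, 67]
    print(take_error(reference, attempt))   # -> 4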

Figure 2. Results from M1.

The box plots and trend line show that the participants would decrease the total distance to the correct notes, indicating that the users adapted to the mapping during the experiment. Also, when comparing the first take with the last ones, we can see a clear decrease in errors. Take 10 does, however, show an increase in errors compared to takes 8 and 9. As previously mentioned, pitch is considered the most fundamental variable in music, so the correlation here could be that something that is well known might be considered easy to adapt to.

During the interview procedure, all participants agreed that the representation of pitch was correct (that the alterations made to the test file were indeed pitch alterations). When it came to size, though, not all were in agreement. P1 felt that the physical movement parameter was more the placement, or location, of the hands, rather than representing the size of an imaginary ball.

Most participants felt that the mapping was fairly easy to understand and to adapt to, but that it took some time getting used to. This correlates well with the trend line in fig 2, as it suggests that the number of errors gradually decreased over the takes. P1, P3, and P5 did feel that the 12 different size values were too narrow, and that they should span a larger horizontal width. P3 and P5 also suggested that some sort of visual aid would assist in finding the correct size values, suggesting a screen that would display the current size. Finally, P2 felt that the mapping was more based on muscle memory than on musical ear training.

4.2 Mapping 2: Location-Spatialisation

The results from M2 are the total sum of decibel errors for each participant P1-P5 during each take. Decibel values were derived by acquiring the difference between the left and right channel in both the reference file and the file generated by the participant, where the decibel values of both files would be compared to obtain the decibel errors.

Figure 3. Results from M2.

In regards to the box plots for M2, only a modest trend shift can be observed. The results for the first take compared to the last takes show some user adaptation, but the trend line only has a slight downward slope. This suggests that the participants did not significantly improve their test results during the experiment. The mean values for the first and second takes were 80.5 and 74.3 dB. The rest of the takes ranged between approximately 58-60 dB, with the exception of two takes at 53.4 and 55 dB. As seen in fig 3, even if some values did improve, the data values from the mapping do not show a clear increase in user adaptation on their own. On the other hand, all agreed that the mapping was easy to understand and that the system response

was splendid. This suggests that the mapping might have been easy to comprehend right away, something that could explain the results in fig 3.

Furthermore, all participants felt that location and spatialisation were represented in a correct manner. The mapping was also approved, though it was sometimes harder to understand in comparison to M1. The system response time was also unanimously deemed very fast, and the mapping felt logical and easy to understand. P5 did suggest, though, that there could have been a visual aid, within the line of sight, to help find the point where the difference in decibels between the speakers was zero.

4.3 Mapping 3: Orientation-Tempo

Perhaps the clearest increase in performance comes from M3. The total errors are based on the time difference between sets of notes, as explained earlier, where the y-axis displays the sum of the errors for all 14 time differences.

Figure 4. Results from M3.

For the third mapping, the trend line can be interpreted to mean that there has been a steady decrease of errors during the test phase. The last two takes, however, show an increase in user errors compared to previous takes. They are still of lower error values than the first take, with the exception of one whisker that suggests otherwise. This pattern could also be seen in M1, although less clearly, and it might be due to user fatigue or boredom.

The participants were all in agreement that the mapping was easy to adapt to, with the exception of P1, who felt that the mapping got harder to control as the tempo increased. Additionally, the representation of tempo was deemed correct by all participants. The orientation variable reminded P1, P4, and P5 of a knob from a home

stereo system, which was the idea of the mapping, but P3 felt that "rotation" might have been a better name for the parameter.

When commenting on system performance, P1 and P4 felt that the orientation area should include orientation points further to the left and right than the current boundaries. P2 and P5 suggested that measuring a more static part of the body, i.e. the chest or head, would feel more intuitive. P2 also suggested that a visual aid in the line of sight would be desirable.

4.4 Mapping 4: Motion-Loudness

The y-axis shows the sum of the total errors in decibels when comparing all 15 notes of the reference file with the participant-generated test files.

Figure 5. Results from M4.

In fig 5 above, we see a multitude of values. However, 13 of the total 50 takes (five participants, 10 takes each) were made with individual errors below 20 dB, and over half are below 30 dB. The trend line also suggests that there was a slight decrease of errors. Comparing the first three takes with the last three, there is a decrease of errors. The last take, similar to M1 and M3, shows an increase in errors compared to the preceding take. According to the participants, the representation of loudness was done correctly. The motion parameter was represented sufficiently, as the participants felt that motion could easily be related to M1-M3 as well. P1 suggested that this was indeed one kind of motion, but not an overall representation of motion.

Concerning adapting to the system, the common notion was that the mapping was not sufficient. The participants felt that a certain amount of motion should be connected to a specific amount of loudness, instead of affecting the speed at which the

loudness changes. Not moving at all would still generate sound, something that P5 first noted at the end of the test phase. P3 also felt that it was hard to grasp how the system decreased the loudness over time.

The minimal and maximal values should have been increased, according to P2, to be able to create even larger differences in loudness. When referencing maximal loudness values, P4 felt that it was easy to determine the maximal value whereas P5 believed it to be hard.

4.5 Mapping 5: Distance-Duration

For figure 6, the sum of all errors for each take is represented on the y-axis in milliseconds. A total of 8 notes were used in the experiments.

Figure 6. Results from M5.

Figure 6 shows a distinct decrease in errors made by the participants, as made clear by the trend line. Furthermore, the first three takes all had individual mean values above 1,300 ms, compared to the last two takes, which had mean values below 600 ms. The participants were satisfied with the representation of the parameter duration, but had some concerns with distance. P1 and P3 expressed that distance felt more like positioning in a room than the distance to a speaker. This was mainly because the minimal distance was determined by the boundaries of the test area and not the speakers themselves.

The mapping was deemed sufficient by the participants, with a few exceptions. P1 and P3 talked about the limitation of not being able to change the duration of a note once it had been triggered. This might explain the clear decrease in user errors, as the participants could not alter a triggered note and thereby not change their initial

decision. Suggested by P5, obtaining the minimal and maximal distances (the boundaries of the motion tracking system) would be made easier by some sort of physical boundary (a string, fence, or similar obstruction) to further the participants' apprehension of the represented physical limitations.

Another concern expressed was that remembering the duration of eight notes was a bit much, and mimicking that many in a row felt strenuous.

4.6 Overall user experience

The data from M1 (Size-Pitch), M3 (Orientation-Tempo), and M5 (Distance-Duration) shows that user adaptation increased with the number of takes. Concerning M2 (Location-Spatialisation), the data was inconclusive, but M2 was nevertheless considered the easiest mapping by the participants. The data for M4 (Motion-Loudness) was also without a perfectly clear trend. Four out of the five participants agreed that M1 was easy, due to the limited number of parameter values. M3 and M5 were also noted as sufficient mappings by the participants, as they felt easy to comprehend and adapt to, and as it was simple to find their maximal and minimal values. Regarding the size parameter in M1, the participants felt that it was hard to grasp. The suggestions ranged from visual aids to determining the size of an imaginary vertical object, where the floor would be one reference point and thus only one hand would be used to determine the parameter size.

Furthermore, the participants were asked to assign a numerical value to the mappings depending on how well they worked. The summation of these values gave that M2 and M5 were the most adaptable, with M3 ranking in the middle. For mappings M1 and M4, though, four of the participants ranked them as the bottom two. This is in contrast to the fact that M1 was considered easy, but it shows that M4 is the weak link within the user tests.

All in all, the participants expressed that the system met their expectations, was an interesting approach to musical expression, and that they enjoyed themselves during the user tests. There was also a consensus that the system should be able to change mappings, i.e. allow physical parameters to be connected to auditory parameters in other ways than the mappings in this thesis.

5 Discussion

5.1 System evaluation

Observing the graphs for M1-M5, obvious improvements during the user tests can easily be seen for M1 and M5. For M3, the trend is not as clear, and for M2 and M4 there is a lack of this kind of evident improvement. Comparing the first take, or takes, with the last ones did show improvement. For M3, though, the last two takes suffered an increase in errors compared to the previous ones. M1 and M4 also showed a slight increase in errors when comparing the last two takes. This might be, as previously mentioned, because the participants grew tired of or bored with the tests.

The interviews concluded that M2 was the most easily adaptable mapping, with M5 as a close second. The combination of the improvement seen in the graph for M5 and the high ranking by the participants indicates that M5 worked in a satisfactory way. The lack of improvement for M2 could be explained by an extremely easy adaptation to the mapping, but this is speculation. M2 is thereby deemed to work in a probably satisfactory way, as the participants did not need an extensive learning process.

M3 was ranked in the middle by the participants, and the graph shows a steady decrease of errors. This means that M3 should be considered at least adequate. M1, on the other hand, was considered easy, but the physical variable representation could be improved. The results from M1 were good, though, so this suggests that the mapping was sufficient but in need of alteration.

Finally, mapping M4 did not show any major improvements from repeated user interaction and was rejected by the participants. This shows that the current approach for M4 was inadequate and needs to be reworked.

5.2 Possible improvements

The first thing to focus more on is determining the mapping of parameters. Concerning M1, an increase of size does not necessarily correspond to an increase in pitch; Spence (2011) discusses that a smaller size is connected to a higher pitch. Further development in this area should be considered a top priority.

A major drawback of the way the system used MIDI was that once a note is triggered, it cannot be altered in any way. This is because once a note has been triggered, the system would not allow any changes in input variables for that note. Many

participants felt that it would be desirable to change a parameter of a note while it is ringing, as this would further the musical creativity.

During the interviews after M4, there was a lot of criticism that the mapping did not work in the anticipated way. It was suggested that the loudness should match the current amount of motion, instead of the motion determining the rate of the loudness change.

For mappings M2, M3, and M5, it could have been desirable to use a fixed tracking point on the chest or the head. This is in contrast to using the hands, which can be extended away from the body; using a fixed point would make the participant move their entire body. If the test area could be increased to include further vertical height, this would make way for motion tracking of the head.

The boundaries of the test area could have been marked by better physical barriers, instead of white tape on the floor. Examples would be a railing or fence around the test area, or extending the current system with additional cameras to cover the entirety of the test facilities.

In M1, the representation of size could have been different. As pitch is the most common auditory parameter, the dissatisfaction with M1 is probably due to the size parameter. Altering the mapping could provide a higher level of satisfaction and willingness to explore the system.

The definitions of the physical parameters could have been more precise, as the participants confused physical parameters such as location and distance, and even made up parameters of their own (positioning, and similar).

5.3 Future studies

The first mapping, M1, was determined to be in need of alteration, as the participants did not agree with the representation of the physical parameter size. M2 received the grade of probably satisfactory, as the mapping was approved by the participants. However, considering that the results for M2 only showed the slightest improvement, the mapping would need further study. M3 showed a decrease of errors made by the participants, and was valued at a middle ground, suggesting that the mapping was of adequate quality. Furthermore, M4 showed a need to be reworked altogether, due to only minor improvements in user generated data and being rejected by the participants. Lastly, M5 was deemed to work in a satisfactory way, as both the participants and the results pointed in a positive direction.

Future improvements of the system would be to increase the number of mappings, as well as to allow mappings between different parameters. Allowing different mappings could determine which mappings, if any, a majority of users would prefer. Increasing the number of participants in user tests would make it possible to better conclude which variable representations of a physical parameter are preferred, and which are not. It could also improve the success rate of different physical parameters, even if not preferred by the majority. There is certainly a vast number of projects of different magnitude that could be undertaken within the field, where the only limitation would be the amount of resources available.

6 Conclusion

This study evaluated five different parameter mappings using five physical and five auditory parameters. The physical parameters were obtained by using a motion tracking system that in turn sent data to a Pure Data patch, which mapped them to their corresponding auditory parameters. The evaluation was done with two different approaches: how well the participants adapted to the system based on generated data, as well as their own opinions of the system. All participants within the user tests had prior musical experience.

Concerning how the physical and auditory parameters were represented, there was a clear conclusion that the auditory parameters were represented correctly. This suggests that auditory parameters cannot be interpreted in a great number of ways, and thus evaluating different kinds of physical parameter representations should be the main focus of future studies.

The participants were all in agreement that this is an amusing way of musical expression, and that adapting to this kind of system depends on how interested the user is. However, the system would benefit from a possibility to personalise the user experience, i.e. being able to alter the parameter mappings, change how the parameters are represented, modify the motion tracked area, activate some sort of visual aid, and similar. All in all, the participants felt sufficiently satisfied with the system as a whole, but further studies in the field are needed to address the imperfections of this study and to further the understanding of parameter mapping.

References

Arensburg, B., Tillier, A. M., Vandermeersch, B., Duday, H., Schepartz, L. A., & Rak, Y. (1989). A Middle Palaeolithic human hyoid bone. Nature, 338 (6218), 758-760.

Bower, B. (1998). Doubts aired over Neandertal bone ‘flute’. Science News, 153 (14), 215-215.

Capp, M., & Picton, P. (2000). The optophone: an electronic blind aid. Engineering Science & Education Journal, 9 (3), 137-143.

Chan, J. C., Leung, H., Tang, J. K., & Komura, T. (2011). A virtual reality dance training system using motion capture technology. IEEE Transactions on Learning Technologies, 4 (2), 187-195.

Conard, N. J., Malina, M., & Münzel, S. C. (2009). New flutes document the earliest musical tradition in southwestern Germany. Nature, 460 (7256), 737-740.

Copp, N., & Zanella, A. (1993). Discovery, innovation, and risk: case studies in science and technology. MIT Press.

Dubus, G., & Bresin, R. (2013). A systematic review of mapping strategies for the sonification of physical quantities. PloS one, 8 (12), e82491.

Hermann, T., Hunt, A., & Neuhoff, J. G. (Eds.). (2011). The sonification handbook. Berlin: Logos Verlag.

Krefeld, V., & Waisvisz, M. (1990). The hand in the web: An interview with Michel Waisvisz. Computer music journal, 14 (2), 28-33.

MobMuPlat. (n.d.). MobMuPlat Guide. Retrieved April 26, 2016, from http://music.columbia.edu/~daniglesia/mobmuplat/doc/index.htm

NIME. (2016). International Conference on New Interfaces for Musical Expression. Retrieved November 2, 2016, from http://www.nime.org/

OptiTrack. (n.d.). OptiTrack. Retrieved April 26, 2016, from https://www.optitrack.com

Paloranta, J., & Renström, I. (2012). Automized optimization of stage monitor hearing: Moving the monitor hearing sweet spot in real time - a solution for the moving musicians on stage? (Bachelor's thesis). KTH Royal Institute of Technology, Stockholm, Sweden.

Patel, A. D. (2010). Music, language, and the brain. Oxford university press.

Plack, C. J., Oxenham, A. J., Fay, R. R., & Popper, A. N. (2005). Pitch: neural coding and perception (Vol. 24). Springer Science & Business Media.

Pure Data. (n.d.). Pure Data. Retrieved April 26, 2016, from https://puredata.info/

Rumsey, F., & McCormick, T. (2012). Sound and recording: an introduction. CRC Press.

Spence, C. (2011). Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics, 73 (4), 971-995.

Stevens, S. S., & Volkmann, J. (1940). The relation of pitch to frequency: A revised scale. The American Journal of Psychology, 53 (3), 329-353.

Terhardt, E. (1978). Psychoacoustic evaluation of musical sounds. Perception & Psychophysics, 23 (6), 483-492.

Waisvisz, M. (1999). Gestural Round Table. STEIM Writings.

Welch, G., & Foxlin, E. (2002). Motion tracking survey. IEEE Computer Graphics and Applications, 24-38.

Wilson, J. P., Harel, Z., & Kahana, B. (1988). Human adaptation to extreme stress: From the Holocaust to Vietnam. Springer Science & Business Media.

Worrall, D. (2009). Sonification and Information: Concepts, instruments and techniques. (Doctoral dissertation). University of Canberra, Canberra, Australia.

Yoo, M. J., Beak, J. W., & Lee, I. K. (2011). Creating Musical Expression using Kinect. In NIME (pp. 324-325).

Zhang, Z. (2012). Microsoft Kinect sensor and its effect. IEEE MultiMedia, 19 (2), 4-10.

Ziegelwanger, H., & Majdak, P. (2014). Modeling the direction-continuous time-of-arrival in head-related transfer functions. The Journal of the Acoustical Society of America, 135 (3), 1278-1293.

Zölzer, U. (2011). DAFX: Digital Audio Effects. Wiley.

Appendix

Raw data graphs for P1

Raw data graphs for P2

Raw data graphs for P3

Raw data graphs for P4

Raw data graphs for P5
