A Gestural Interface in a Computer-based Conducting System

A Thesis

Submitted to the Faculty of Graduate Studies and Research

In Partial Fulfillment of the Requirements

For the Degree of

Master of Science

In

Computer Science

University of Regina

By Lijuan Peng

Regina, Saskatchewan

October, 2008

©Copyright 2008: L. Peng


UNIVERSITY OF REGINA

FACULTY OF GRADUATE STUDIES AND RESEARCH

SUPERVISORY AND EXAMINING COMMITTEE

Lijuan Peng, candidate for the degree of Master of Science in Computer Science, has presented a thesis titled, A Gestural Interface in a Computer-Based Conducting System, in an oral examination held on September 22, 2008. The following committee members have found the thesis acceptable in form and content, and that the candidate demonstrated satisfactory knowledge of the subject material.

External Examiner: Dr. Thomas J. Conroy, Faculty of Engineering

Supervisor: Dr. David Gerhard, Department of Computer Science

Committee Member: Dr. Daryl Hepting, Department of Computer Science

Committee Member: Professor Brent Ghiglione, Department of Music

Chair of Defense: Dr. Pauline Minevich, Department of Music

Abstract

Over the past few years, a number of computer-based conducting systems have been designed and implemented. However, only a few of them have been developed to help a user learn and practice musical conducting gestures. Few systems provide both visual representation for a conducting gesture and aural feedback. This thesis is intended to address research related to this area. It focuses on a gestural interface designed and developed for a computer-based conducting system.

This gestural interface utilizes an infrared technique to track the motions of the right arm and an acceleration sensor for the gestures of the left arm. The infrared sensor enables the system to be used in a natural environment and has little influence on the conducting. The gesture recognition is based on the inherent characteristics of conducting gestures, including positions and amplitudes. It is an accurate and relatively simple process. The conducting is interpreted using a few visual items that show a conducting gesture very clearly and straightforwardly reveal its quality. In addition, this gestural interface supports both tempo following and dynamics following and provides straightforward visual representations for them. The aural representation included in the interface informs users of the occurrence of beats or errors.

Acknowledgements

I would like to take this opportunity to express my sincere gratitude to my supervisor,

Dr. David Gerhard. His guidance, encouragement, and financial aid ensured the completion of my thesis. His invaluable suggestions and comments placed me on the right path and were essential to the process of completing my graduate studies.

I also want to thank the members of my thesis committee, Professor Brent Ghiglione and Dr. Daryl Hepting, for their time and comments.

In addition, I would like to thank the University of Regina, the Department of

Computer Science, and the Faculty of Graduate Studies and Research for giving me the opportunity to study here and for providing me with financial support.

Finally, I would like to thank my family for their love and for always being there for me.

Contents

Abstract i

Acknowledgements iii

List of Tables viii

List of Figures x

1 INTRODUCTION 1

1.1 Motivation and contribution 2

1.2 Thesis overview 3

2 BACKGROUND AND RELATED RESEARCH 4

2.1 Human Computer Interface 4

2.2 Gestural interface in musical systems 7

2.3 Visual representation of musical parameters 8

2.4 Conducting 9

2.5 Overview of computer-based conducting systems 11

2.5.1 Summary table 11

2.5.2 Summary 12

3 DESIGN OF A GESTURAL INTERFACE 18

3.1 Gestures 18

3.1.1 Conducting gestures 19

3.1.2 Beat patterns 20

3.1.3 Dynamics 21

3.2 Gesture tracking 22

3.3 Gesture analysis 22

3.3.1 Segmentation 23

3.3.2 Feature extraction 24

3.4 Gesture recognition/following 25

3.4.1 Recognition 25

3.4.2 Following 27

3.5 Response 28

3.5.1 Visual representation 28

3.5.2 Aural representation 29

4 IMPLEMENTATION AND EVALUATION 31

4.1 Development and runtime environment 31

4.1.1 Wii Remote 32

4.1.2 A baton-like infrared stick 35

4.1.3 WiTilt v2.5 37

4.1.4 Software 39

4.2 Main window 40

4.3 Gesture tracking 42

4.4 Gesture analysis 44

4.4.1 Coordinates 44

4.4.2 Beats 44

4.5 Gesture recognition/following 46

4.5.1 Separate beat pattern recognition and accuracy 46

4.5.2 Mixed beat pattern recognition and accuracy 50

4.5.3 Tempo tracking and accuracy 52

4.5.4 Dynamics tracking 53

4.6 Aural Representation 55

5 DISCUSSION 57

5.1 Video camera 57

5.1.1 Segmentation 58

5.1.2 Feature extraction 60

5.1.3 Comparison between video camera and Wii Remote 62

5.2 WiTilt v2.5 64

5.2.1 Gesture tracking 64

5.2.2 Feature extraction 65

5.2.3 Gesture recognition 66

5.2.4 Comparison between WiTilt v2.5 and Wii Remote 69

6 CONCLUSION AND FUTURE RESEARCH 70

6.1 Conclusion 70

6.2 Future research 71

Glossary 75

List of Tables

2.1 Computer-based conducting systems 13

3.1 Gesture recognition rules for three beat patterns 26

3.2 The mapping between beats and MIDI notes in the system 30

4.1 The results of recognition only based on the downbeat detection ... 51

4.2 Comparison between the calculated average tempo and the real average tempo 54

5.1 Comparison between video camera and Wii Remote 63

List of Figures

2.1 The relationship between the interface and the computer system . . 5

2.2 4-beat patterns (drawn based on the pictures in [34]) 10

2.3 4-beat patterns (drawn based on the pictures in [27] and [21]) .... 11

3.1 Five aspects to design a gestural interface 19

3.2 Three beat patterns 21

3.3 An example of visual representations 29

4.1 The setup 32

4.2 Wii Remote (from its website) 33

4.3 The interface of the DarwiinremoteOSC 34

4.4 A baton-like infrared stick 35

4.5 The WiTilt 2.5 used in our system 37

4.6 The coordinates of the WiTilt v2.5 38

4.7 The interfaces of W2O 38

4.8 An example of a Max/MSP patch 40

4.9 A snapshot of the main window 41

4.10 A Max/MSP patch to receive sample data via UDP 43

4.11 A correct gesture for a 2-beat pattern 45

4.12 A correct gesture for a 3-beat pattern 47

4.13 A correct gesture for a 4-beat pattern 47

4.14 An error downbeat for a 2-beat pattern 49

4.15 An incorrect gesture for a 2-beat pattern 49

4.16 An example of 12-beat patterns 50

4.17 A screenshot of tempo practice 53

4.18 Examples of the changes of dynamics 55

5.1 The segmentation of a moving hand 60

5.2 The blobs in a moving hand at a certain time 61

5.3 A trajectory before and after smoothing 61

5.4 The effect of linear interpolation 64

5.5 The effect of the smoothing 65

5.6 Three HMMs 66

5.7 A visible states sequence generated using different methods 68

Chapter 1

INTRODUCTION

Gestures are widely used to aid face-to-face communication between people. Such gestures include hand movements, body language, and eye contact. They deliver information to others without relying on speech.

Conducting is the act of leading a musical performance with conducting gestures, such as hand gestures and eye contact. These gestures convey the understanding and intent of a conductor to the members of an orchestra or ensemble. Therefore, conducting gestures contain not only standard structures, like tempo and volume, but also personal characteristics and interpretations.

Since the 1980s, various computer-based conducting systems have been designed and developed [2] [4] [31] [42] to allow a user to conduct a piece of music using a digital system. They retrieve tempo (and sometimes support dynamics as well) and then use it to control the playback of a piece of music. Most of them are manipulated using a gestural interface.

1.1 Motivation and contribution

In conducting an ensemble or orchestra, gestures play a crucial role in delivering information and organizing players. Consequently, it is important for a conductor to understand these gestures and present them correctly.

Although a lot of computer-based conducting systems [3] [13] [18] [25] [31] have been developed in recent years, most of them focus on the conducting itself rather than on conducting gestures. They may not be able to help users learn and practice conducting gestures, and few provide gestural interfaces specifically designed for pedagogical purposes. The research described in this thesis is intended to present a gestural interface that is designed and implemented for pedagogy.

Most current systems support only audio feedback while conducting. Visual representation as a straightforward interpretation of a gesture has only been implemented in one system [9]. For pedagogy, it is valuable to include visual representation in a conducting gesture recognition/following system because conducting textbooks always contain some pictures to show gestures. Conducting gestures are a kind of visual demonstration of a piece of music. Aural feedback is still required because conducting is intended to direct musical performance. Aural and visual feedback explain a conducting gesture from two different perspectives.

The gestural interface presented in this thesis can be applied to pedagogical music purposes, such as the training of conducting students, as well as to some musical systems for entertainment.

1.2 Thesis overview

This thesis presents a novel gestural interface designed and developed for a computer-based conducting system. The input device consists of a Wii Remote, an infrared stick, and a WiTilt v2.5. Aural and visual representations are provided by the system.

Chapter 2 will introduce the related background, including the Human Computer Interface (HCI), gestural interfaces in musical systems, visual representation of musical parameters, and conducting. After that, an overview of current computer-based conducting systems will be presented. Chapter 3 will describe the design of the gestural interface used in this thesis, and the details of the implementation and evaluation of the system will be contained in Chapter 4. Chapter 5 will compare the tracking of the right hand based on a Wii Remote with tracking based on a video camera and a WiTilt v2.5. The thesis will conclude with a summary of this gestural interface and suggestions for future research.

Chapter 2

BACKGROUND AND RELATED RESEARCH

2.1 Human Computer Interface

The way that you accomplish tasks with a product—what you do and how it responds— that's the interface.

—Jef Raskin, On The Humane Interface, 2000

In this thesis, Human Computer Interface (HCI) is a broad term including not only the Graphical User Interface (GUI), but also any interaction between a user and a computer system. Figure 2.1 demonstrates the relationship between an interface and a computer system. A user manipulates a computer system via certain input devices. Then, the input data is processed in functional modules and some results are generated and presented back to the user.

[Figure 2.1 diagram: the user sends operations to the computer system through the input side of the Human Computer Interface; the system's function modules process the input and return visual and aural feedback through the output side.]

Figure 2.1: The relationship between the interface and the computer system

Various interfaces have been designed and developed over the past few years. In accordance with how to operate computer systems, they can be grouped into four primary categories. Their usability is analyzed with respect to "ease of learning, ease of use, and user satisfaction" [33].

Graphical User Interface (GUI) Today, a GUI is widely used in desktop applications that use the computer's keyboard and mouse as the main input devices.

Users click on menus or buttons to trigger functions provided by a computer system.

WIMP (window, icon, menu, pointing device) are fundamental items in traditional

GUI systems. Since the popularization of GUIs, users no longer need to remember complex commands. A person familiar with one GUI may learn to use another similar

GUI easily and quickly.

Conversational Interface Agents Conversational interface agents provide animated characters to communicate with users, mimicking conversation between people. Animated characters are able to talk, make arm and head movements, and even use facial expressions. Personified interaction is easy for users to learn. It brings a kind of active communication between users and computer systems.

Gestural Interface A gestural interface makes use of diverse gestures to control computer systems. Hand gestures, body movements, and even facial actions can be used. Generally, a computer system recognizes a certain set of gestures whose meanings are defined beforehand. Once these gestures are captured by sensors such as a video camera, and features are extracted, they will be identified by the computer and a corresponding response will be given out. It is natural to interact with a computer system using gestures because gestures are widely used in daily life. People acquire additional information from body language or eye contact during face to face conversation. Children "talk" with their parents using gestures when they are still little babies and are not able to speak. The hearing impaired communicate with others using a gestural language called sign language. Therefore, gestural interfaces are very effective and powerful.

Tangible User Interface (TUI) A TUI utilizes physical objects as an interface to communicate with a computer system. These physical objects represent certain information in the digital world by referring to their inherent characteristics and are used to manipulate that information naturally. The representation and the control

are seamlessly integrated and both input and output are part of the same device [41].

Consequently, people acquire a natural, unique, and haptic1 interaction with computer systems. It is easy for users to perceive the interaction because of their experiences with physical objects in the physical world. [16]

All interfaces mentioned above have been applied in musical systems with a variety of applications. These applications may provide interactive musical education systems for young children by using an animated character [17]. They can also be graphical tutor systems [45] or courseware [37]. Some applications are music composition systems controlled by hand movement [14] or interactive toys for children to manipulate physical notes [44].

2.2 Gestural interface in musical systems

Gestural interfaces have been used in musical performance, composition, and conducting systems. Almost all conducting systems employ a gestural interface (details will be presented in Section 2.5). Applications related to a certain instrument may be used in a way that is similar to playing the corresponding acoustic instrument, such as Le SuperPolm [11]. Furthermore, gestural interfaces can inspire creativity in musicians or provide opportunities for the public to join in music activities (e.g. improvisation). BodyMusic [14] is such a system, allowing the composition of music using hand gestures.

1Of or relating to the sense of touch

In most cases, gestures cannot be entered using a keyboard and mouse, the most common input devices for a computer. Generally, gestures are captured by sensors such as a video camera or an acceleration sensor. After processing, gestures are mapped to some musical parameters and are used to control the response to users. For example, in BodyMusic [14], hand gestures, which are acquired using a pair of data gloves and a tracker, control a music piece composed by the user. A sound synthesizer or sequencer is involved in many musical systems because of its inherent ability to generate sound effects or musical pieces. Therefore, it is not surprising that most musical systems give aural feedback to users.

2.3 Visual representation of musical parameters

In the real world, music is captured by our ears. Therefore, it is natural for computer- based musical systems to provide an aural response to users. However, the expressive information contained in a music piece is difficult for everyone to perceive. As a result, some musical systems support visual representation for musical information and try to explain it from a different perspective. Certainly, visual representation is supplementary to the aural representation and cannot replace it totally.

pianoFORTE [38] is a good example of such a system. It is intended to improve the communication between a teacher and a student during education and improve the expressiveness of a student's performance. Traditionally, teachers and students figure out the difference between their performances just by listening. Visual representations, such as different colors and shapes on the original score, provide a more accurate and easier way. It allows musical parameters, including articulation, dynamics, synchronisation, and tempo, to be visually represented. As a result, the difference between teachers and students is straightforwardly shown on screen.

2.4 Conducting

A conductor using conducting gestures directs multiple musicians in an orchestra or ensemble to play with unified timing and delivers his/her understanding of a musical score to control the expressiveness of the performance. The arms and hands provide explicit communication, but body language, eye contact and even breathing are used to convey more subtle and intrinsic interaction. Generally, the right arm and hand are responsible for timing information such as tempo, and the left arm contributes to macro-level expressive information. This thesis primarily focuses on gestures delivering timing information but still addresses gestures for expressive information.

The conducting window, a chest-high virtual rectangular area, contains movements in four directions (up, down, left and right). The movements follow standard conducting patterns and may be continuous or have a stop between two motions. Legato and staccato are two of the primary types of beat patterns. Legato consists of continuous and curved motions. Staccato has a momentary stop at each count and contains relatively straight motions. Each beat pattern has two variants. Legato can be neutral, with plain motions, or expressive, with curved motions. For staccato, the pattern with straight motions is called light staccato; full staccato employs slightly curved motions. [34] [21]

(a) Neutral legato (b) Expressive legato (c) Light staccato (d) Full staccato

Figure 2.2: 4-beat patterns (drawn based on the pictures in [34])

The style of a certain beat pattern is different in different textbooks. Figure 2.2 is drawn based on the pictures in a textbook written by Max Rudolf [34]. Another author, Brock McElheran [27], uses beat patterns shown in Figure 2.3. Joseph Labuta

[21] utilizes a similar style to those presented by McElheran.

(a) Legato (b) Staccato

Figure 2.3: 4-beat patterns (drawn based on the pictures in [27] and [21])

2.5 Overview of computer-based conducting systems

A summary of computer-based conducting systems is given with respect to different aspects of a gestural interface.

The first computer-based conducting system, the Microcomputer-Based Conducting System [4], was developed in 1980 and did not implement a gestural interface. It accepted input from a graphics tablet, switches, or slides. After that, researchers began to utilize gestures to manipulate systems.

2.5.1 Summary table

Table 2.1 compares computer-based conducting systems with respect to different aspects of a gestural interface.

2.5.2 Summary

As shown in Table 2.1, different sensors have been employed in computer-based conducting systems. Most systems support only aural feedback to users. Few systems involve functions for music pedagogical purposes.

Conducting gestures Gestures used in a gestural interface should be defined beforehand. Either standard conducting gestures or simplified gestures are utilized in these systems. Up and down are the two simplest gestures used to conduct a piece of music. They provide opportunities for non-musicians to join in a conducting procedure.

Sensors Various sensors have been used in computer-based conducting systems.

A system may employ more than one sensor or a complex sensor system bought from a company. Sensors that have been used can be grouped into four categories: acceleration sensors, cameras, infrared sensors, and other sensors.

Several systems utilized acceleration sensors [26] [42]. These sensors can be mounted on baton-like devices to track the movement, which may change the weight and balance of the controller. The acceleration values have to be mapped to certain conducting gestures to reveal what kind of gesture has been performed.

Table 2.1: Computer-based conducting systems

Year | System name | Author | Sensor | Analysis/Recognition/Following | Musical parameters | Response | Pedagogical purpose
1983 | Conductor Follower [12] | Haflich | Ultrasonic reflection, range-finders | 2D trajectory | Tempo, Dynamics | MIDI | -
1989-1991 | MIDI Baton [19] [20] | Keane | A baton controller (acceleration) | Pulse signals | Tempo | MIDI | -
1989 | Computer Music System that Follows a Human Conductor [29] | Morita | CCD camera, white glove/marker | Special feature extraction hardware, 2D trajectory | Tempo, Volume | MIDI | -
1990-1991 | Updated version of the above system [13] | Morita | CCD camera, infrared source, VPL data glove | 2D position, velocity and acceleration | Tempo, Dynamics | MIDI | Self-evaluation
1991 | Radio Baton [28] | Mathews | 2 radio batons, a plate with antennas | 2D, position above metal plate | Tempo, Dynamics, Voice balance | MIDI | -
1992 | Light Baton [1] | Bertini | A strong LED baton, CCD camera | 2D trajectory, an image acquisition board | Tempo, Volume | MIDI | -
1992-1995 | Adaptive Conductor Follower / Conductor Follower [22] [3] | Lee | A Mattel Power Glove, a Buchla baton | 2D position, tempo tracker (time difference / characteristic points / neural network (NN)) | Tempo | MIDI | -
1995 | The Ensemble Member and the Conducted Computer [40] | Tobey | A Buchla baton | 2D position | Tempo | MIDI | Rehearsal
1996 | Extraction of Conducting Gestures in 3D Space [39] | Tobey | Two Buchla batons | 3D position | Tempo, Dynamics, Beat pattern/style, Accentuation, Timbral balance | MIDI | -
1997 | Digital Baton [26] | Marrin | A baton (acceleration and pressure sensors, an infrared LED), a position-sensitive photodiode, a camera | - | Volume (several tracks) | MIDI | -
1998 | Multi-Modal Conducting Simulator [42] | Usa | Acceleration sensor, a camera (eyes), a breathing sensor | 2D, Hidden Markov Model (HMM), fuzzy logic production rules | Tempo, Volume, Staccato/legato beat patterns | MIDI | -
1998 | Conductor's Jacket [30] | Marrin | A multitude of sensors built into a jacket, UltraTrak motion capture system | Physiological and motion data | Tempo, Note, Channel volumes, Articulations, Accents, Pitch, Number of voices, Instruments balance | MIDI | -
1999 | Virtual Conducting Practice Environment (continued work on Conductor Follower) [9] | Garnett | A Buchla baton | 2D | Tempo | Aural, Visual | Learning
1998-1999 | Conductor Following with Artificial Neural Networks [15] | Ilmonen | MotionStar magnetic sensors | 3D positions, neural network | Tempo, Volume, Staccato/legato beat patterns | MIDI | -
2000 | Virtual Dance and Music [35] | Segen | 2 synchronized cameras | 3D trajectory, polynomial predictor | Tempo | MIDI | -
2001 | Interactive Virtual Ensemble [8] | Garnett | MotionStar magnetic sensor system, wireless receiver | 3D position, hand/head motion, neural networks | - | - | -
2002 | Personal Orchestra [5] | Borchers | A Buchla baton | 2D position | Tempo, Volume, Instrument groups | Audio, Video | -
2003 | You're The Conductor (Personal Orchestra 2) [24] | Lee | A baton-like device (light), camera | (as above) | Playback speed, Volume | Audio, Video | -
2003 | Conducting Audio Files via Computer Vision [25] | Murphy | 1 or 2 cameras | Position and velocity, parametric template functions | Tempo, Dynamics | Audio | -
2004 | Conducting Gesture Recognition, Analysis and Performance System [18] | Kolesnik | 2 USB cameras, colored glove | Beat and amplitude, expressive information, HMM | Tempo | Audio, Video | -
2006 | iSymphony [23] | Lee | A Buchla Lightning II | 2D trajectory (position, velocity, acceleration) | Tempo, Volume, Instrument emphasis | Audio, Video | -
2006 | Wii Music Orchestra (a game demo)2 [31] | - | Wii Remote | Gestures: up and down | Tempo, Volume | Music, Video | -
2007 | Wireless sensor interface and gesture follower [2] | Bevilacqua | Accelerometers, gyroscopes | HMM (gesture following) | - | Time-warped gestures | Gestural pedagogy

2http://www.youtube.com/watch?v=MVNIU0475dc, Retrieved May 18, 2008

Some systems set up one or two cameras to capture the front view, or both the front and side views, of a conductor. In general, 2-dimensional information is acquired and analyzed, such as in the system from Murphy [25] and the system from Kolesnik [18]. They show a 2-dimensional trajectory of a conductor's motions. Cameras can also provide 3-dimensional information. For example, in the Virtual Dance and Music system [35], 2 synchronized cameras were used to get a 3-dimensional trajectory. The third dimension shows the distance from the conductor's hand to his/her body and reveals some expressive information. In addition, a camera can work under some restrictions to improve the accuracy of recognition, but this reduces the natural feel that this method might have for a user.

Infrared sensors have been used in a few systems. They can be an infrared LED on a baton [26] or a Buchla Lightning baton3, which is a MIDI controller that can track movement based on infrared light [40]. Infrared sensors only track the movement of infrared light sources, thus avoiding the influence of anything in the background. This is a kind of wireless method, providing more freedom for a user to manipulate a system at a relatively long distance from the computer.

Apart from the above sensors, conducting systems have applied many other sensors that are not easily categorized. For example, a sensor based on ultrasonic range finding was used in Conductor Follower [12]. Conductor Following with Artificial Neural Networks [15] utilized a MotionStar magnetic sensor system. Recently, the Wii Remote, a controller for Nintendo's Wii console, has been used to conduct a virtual orchestra [31].

3http://www.buchla.com/lightning/descript.html, Retrieved May 18, 2008

Analysis/Recognition/Following After gestures are captured by sensors, sample data is analyzed and features are extracted. Generally, 2-dimensional data is acquired.

Sometimes, 3-dimensional data is used to provide more information [15]. The Hidden

Markov Model (HMM) and the Neural Network (NN) are two primary techniques for gesture recognition/following [42] [18] [2] [15].

Response In a conducting system, an aural response, which may be either MIDI or audio, is the essential response sent to users. Video [23] or an animated virtual orchestra [31] provides supplementary feedback to the sound. Only one system, the Virtual Conducting Practice Environment [9], provides visual representations of conducting gestures, such as beat windows, for conducting students.

Purpose Most conducting systems are developed for research. A few systems have a pedagogical purpose. For example, the updated version of the Computer Music System that Follows a Human Conductor [29] implemented a self-evaluation function. The Virtual Conducting Practice Environment [9], a continuation of the work on Conductor Follower, tried to support appropriate aural and visual feedback according to the skill level and learning goals of a student. The Wireless sensor interface and gesture follower [2] tried to identify problems in a student's gesture by comparison with a teacher's gesture.

Chapter 3

DESIGN OF A GESTURAL INTERFACE

Generally, there are five aspects to the design of a gestural interface: gestures, gesture tracking, gesture analysis, gesture recognition/following, and response. Figure 3.1 shows the relationship among these five aspects.

3.1 Gestures

There is a group of gestures regarded as standard and agreed on by all people in the area of conducting. These gestures or their simplified versions are able to be utilized in a gestural interface.

[Figure 3.1 diagram: Users -> Gestures -> Gesture Tracking -> Gesture Analysis -> Gesture Recognition/Following]

Figure 3.1: Five aspects to design a gestural interface

3.1.1 Conducting gestures

Conducting gestures contain right arm movement, left arm movement, eye contact, breathing, and other gestures. The movement of the right arm is the fundamental one for tempo and beat detection. Therefore, all conducting systems support the tracking of the right arm. The movement of the left arm is for expressive information and is tracked by several systems. Eye contact, body movements, and breathing are supplementary to hand movements but are not supported by most conducting recognition systems at all. The gestural interface presented in this thesis is designed to track the movement of both the right hand and the left hand.

During conducting, a conductor can hold a baton in his/her right hand or just conduct without it [34] [21]. Using a baton in the right hand makes it easier to attract the attention of performers in a large group. The use of a baton is standard for most ensembles and orchestras. In a computer-based conducting system, the situation is a little different: whether a baton is employed depends in part on what kind of sensors are to be used.

The conducting gestures used in current computer-based conducting systems can be complete conducting gestures or simplified ones such as down and up. In general, the systems for children or for the public support only simplified gestures to improve usability. In this thesis, complete conducting gestures are considered and used to operate the system because of its purpose, mentioned in Chapter 1.

3.1.2 Beat patterns

The movement of the right arm is pre-defined for a certain beat-pattern. Figure 3.2

(drawn based on the pictures in [27]) shows the expressive legato gestures for three beat patterns: a 2-beat pattern, a 3-beat pattern and a 4-beat pattern. Even for a certain beat-pattern, the style of the movement is not identical. Generally, a legato gesture involves more curved motions, whereas the movement of a staccato gesture follows almost straight lines as shown in Figure 2.2 and Figure 2.3. Therefore, it is necessary to specify the scope of beat patterns to be supported. In this thesis, beat patterns between 2 and 12 beats are implemented. A general beat recognition mode is implemented that counts beats but does not identify whether or not the beat pattern is correct. Another mode is a specific one that only supports three beat patterns

(from 2 to 4 beats per measure). It is intended to figure out the correctness of a gesture according to a desired pattern. Full details will be presented in Section 3.4.1.

The style for these beat patterns is expressive legato, as shown in Figure 3.2.

(a) 2-beat pattern (b) 3-beat pattern

(c) 4-beat pattern

Figure 3.2: Three beat patterns

3.1.3 Dynamics

Dynamics control the loudness and softness of a piece of music. They can be conveyed using gestures of the left hand. When a conductor wants an orchestra to play louder, he/she holds the left hand with the palm facing up. If the conductor's left hand is held vertically, with the palm facing out, the playing should be softer. In addition, the more intense each of these gestures is, the greater the intended change in dynamics.

3.2 Gesture tracking

After the gestures used to control the system have been determined, it is time to choose sensors to convert hand gestures into digital signals. Various sensors applied in computer-based conducting systems were reviewed in Section 2.5.2. It is difficult to say that one sensor is better than another without experimentation.

For conducting, the conducting window is a 2-dimensional space. Therefore, a sensor capturing 2-dimensional sample data can be a potential option for this gestural interface. We tried all three types of sensors summarized in Section 2.5.2. They are an acceleration sensor named WiTilt v2.5 that supports both 2 dimensional and 3 dimensional data, an infrared sensor in the Wii Remote, and a video camera. For beat information, we employed the infrared camera contained within the Wii Remote.

The details will be presented in Chapter 4. Chapter 5 contains the analysis and experiments related to using the video camera and the WiTilt v2.5 for the right hand. It justifies the choice of the infrared camera in the Wii Remote. For dynamics, we chose the WiTilt v2.5 to reveal the changes in dynamics because of its relationship with the moving direction and speed. This will also be addressed in Chapter 4.

3.3 Gesture analysis

Gesture analysis may consist of segmentation and feature extraction [32] or only feature extraction depending on what kind of sample data is acquired.

3.3.1 Segmentation

Segmentation is the process of extracting the required target from noise. After that, the gestures can be analyzed to extract features for further processing. Capturing gestures using a video camera is a good example of gesture analysis involving two parts.

In general, a video camera records all objects and their movements within its field of view and cannot figure out by itself which one is the object to be tracked.

Consequently, it is necessary to perform segmentation before features are extracted.

The segmentation is not easy to achieve because of the complexity of real world data.

A widespread method to make segmentation easier is to add a few restrictions to the gesture tracking procedure such as a complete white background, a specific color object, and so on [32]. These restrictions normally do not affect the gestures, resulting in a user's interaction with the computer system feeling just as natural as it would feel without such restrictions. However, in some situations, these restrictions do present some limitations. For example, a white background means the environment in which the system can be used is restricted.

Segmentation is not essential for all gesture tracking approaches. Some tracking methods, such as infrared tracking, can extract features directly, because the infrared source is the only object being tracked and ordinary background objects do not emit comparable infrared light. Therefore, background information is not involved in the tracking.
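As a simple illustration of the restricted setups mentioned above, the sketch below subtracts a stored background frame from each incoming grayscale frame and keeps only the pixels that differ by more than a threshold. It is a generic example of segmentation by background differencing; the frame format and the threshold value are assumptions, not details of the camera-based approach examined later in this thesis.

```java
// Generic background-subtraction segmentation for grayscale frames stored as
// int arrays (pixel values 0..255). Pixels that differ from the background by
// more than THRESHOLD are kept as foreground (e.g. a moving hand).
public class Segmenter {
    private static final int THRESHOLD = 30;   // assumed difference threshold
    private final int[] background;

    public Segmenter(int[] backgroundFrame) {
        this.background = backgroundFrame.clone();
    }

    /** Returns a binary mask: 1 for foreground pixels, 0 for background. */
    public int[] segment(int[] frame) {
        int[] mask = new int[frame.length];
        for (int i = 0; i < frame.length; i++) {
            mask[i] = Math.abs(frame[i] - background[i]) > THRESHOLD ? 1 : 0;
        }
        return mask;
    }
}
```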

3.3.2 Feature extraction

Feature extraction is intended to generate representative data based on the captured sample data. The features and the extraction approaches depend on the application.

In this gestural interface, for the right hand, no matter which sensor is chosen, the low-level features are the coordinates of that hand, or of the tip of a baton, at each point in time. The coordinates of the right hand are taken as its average centroid.

For the left hand, 3-dimensional acceleration data are the low-level features and will be used directly in gesture recognition/following, because acceleration data can reveal the dynamics.

Low-level features may be used to generate high-level features. For the right hand, high-level features (beats) are produced based on the low-level features (position over time). Beats are the fundamental components of a beat pattern and have some inherent characteristics. The characteristic presented in the textbooks by Brock McElheran [27] and Joseph Labuta [21] is employed in this gestural interface to identify a beat: a beat always occurs at the vertical minimum of a movement, as shown in Figure 3.2. At that position, the movement changes its direction.
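To make this characteristic concrete, the sketch below marks a beat whenever the vertical coordinate stops falling and starts rising again, i.e., at a local vertical minimum of the trajectory. It is only an illustrative sketch; the assumption that the vertical coordinate increases upward and the small noise threshold are not taken from this thesis.

```java
// Minimal beat detector: reports a beat at each local vertical minimum.
// Assumptions (not from the thesis): y grows upward, and jitters smaller
// than NOISE are ignored.
public class BeatDetector {
    private static final double NOISE = 2.0;  // assumed jitter threshold
    private double prevY = Double.NaN;
    private boolean descending = false;

    /** Feed one vertical sample; returns true when a beat (vertical minimum) occurs. */
    public boolean onSample(double y) {
        boolean beat = false;
        if (!Double.isNaN(prevY)) {
            if (prevY - y > NOISE) {
                descending = true;             // hand is still moving down
            } else if (y - prevY > NOISE && descending) {
                beat = true;                   // direction flipped: vertical minimum passed
                descending = false;
            }
        }
        prevY = y;
        return beat;
    }
}
```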

3.4 Gesture recognition/following

3.4.1 Recognition

This gestural interface supports two kinds of gesture recognition for a beat pattern conducted by the right hand.

The first kind of gesture recognition is able to tell whether a beat pattern is conducted correctly or not, on the basis of the assumption that the beat pattern is known beforehand. Generally, the beat pattern to be used during conducting is known by the conductor and the performers in an orchestra because they have the score, which indicates the time signature. Therefore, it is reasonable to provide a kind of gesture recognition that decides on the correctness of a gesture with respect to a given beat pattern. Because the pattern is selected in advance, a correctly conducted gesture is also of a known type.

Currently, it only supports 2-beat patterns, 3-beat patterns, and 4-beat patterns.

It is not difficult to extend it to other patterns. Table 3.1 shows the predefined rules for the three beat patterns; they are the recognition criteria for this mode. Initially, the downbeat1 is detected as a downward motion and represented by a straight line, as shown in Figure 3.2. In this way, the downbeat is differentiated from the other beats easily [27]. Subsequent beats are detected according to their position and amplitude relative to the downbeat.

1A downbeat is the first beat of all conducting gestures, indicated by a strong downward vertical motion.

Table 3.1: Gesture recognition rules for three beat patterns

Beat pattern | Beat number | Position | Size
2-beat pattern | 1 | A vertical straight line | A large motion2
2-beat pattern | 2 | On the right side of the downbeat | No larger than the downbeat
3-beat pattern | 1 | A vertical straight line | A large motion
3-beat pattern | 2 | On the right side of the downbeat | No larger than the downbeat
3-beat pattern | 3 | On the right side of the downbeat | No larger than the downbeat
4-beat pattern | 1 | A vertical straight line | A large motion
4-beat pattern | 2 | On the left side of the downbeat | No larger than the downbeat
4-beat pattern | 3 | On the right side of the downbeat | No larger than the downbeat
4-beat pattern | 4 | On the right side of the downbeat | No larger than the downbeat

2In this table, a large motion means the downbeat has a much larger amplitude than the other beats, as shown in Figure 3.2. The method used to measure "large" will be presented in Chapter 4.

A second gesture recognition mode provides a more general recognition. It determines which beat pattern is employed based on the number of beats between downbeats. It identifies the downbeat first and then accumulates the number of beats until the next downbeat is found. The number of beats reveals the beat pattern performed. This gesture recognition is suitable for any expressive legato beat pattern. It supports beat patterns between 2 and 12 beats in this interface. The quality of the downbeat affects the result of gesture recognition directly.
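As an illustration of this counting mode, the sketch below tallies beats between consecutive downbeats and reports the resulting pattern. It assumes that each incoming beat has already been classified as a downbeat or not (for instance by its amplitude relative to the downbeat, as in Table 3.1); that classification step, and everything else in the sketch, is illustrative rather than the system's actual implementation.

```java
// General recognition mode: identify the beat pattern from the number of
// beats counted between two consecutive downbeats (valid patterns: 2..12).
// Assumes beats arrive one at a time, already labelled as downbeat or not.
public class PatternCounter {
    private int count = 0;           // beats since the last downbeat (inclusive)
    private boolean started = false;

    /** Returns the recognised pattern (2..12) when a measure completes, or -1 otherwise. */
    public int onBeat(boolean isDownbeat) {
        if (isDownbeat) {
            if (started && count >= 2 && count <= 12) {
                int pattern = count;   // e.g. 3 beats since the last downbeat -> 3-beat pattern
                count = 1;             // the new downbeat starts the next measure
                return pattern;
            }
            started = true;            // first downbeat, or an out-of-range measure: restart
            count = 1;
            return -1;
        }
        if (started) {
            count++;
        }
        return -1;
    }
}
```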

3.4.2 Following

Gestures can also be mapped to musical parameters to perform gesture following. The extracted features are employed to control parameters. Consequently, the changes in gestures result in changes to parameters and produce some specific effects. Gesture following has been used in music composition systems to create music pieces using various gestures [14]. The gesture following supported in this gestural interface includes tempo tracking (right hand) and dynamics tracking (left hand).

Tempo is a musical term that indicates the speed of a piece of music. It can be represented in beats per minute (BPM). The tempo tracking in this system is intended to estimate the value of the tempo and reveal the speed of the conducting gestures. Once a beat occurs, the value of the tempo is calculated. As the speed of the gestures changes, the value of the tempo increases or decreases correspondingly. The tempo is represented by both an average value and an instant value in this system.

The average value reveals the trend of the speed of the conducting gestures, while the instant value shows the current speed of conducting. Equations 3.1 and 3.2 are the formulas used to calculate them.

Tempo_a = N_b / T_b    if N_b <= 10
Tempo_a = 10 / T_b     if N_b > 10                    (3.1)

Tempo_i = 1 / T_one                                   (3.2)

Among them, Tempo_a and Tempo_i are the average value and the instant value of the tempo, respectively. N_b denotes the number of beats conducted so far. T_b and T_one represent the duration (in minutes) needed to produce the last N_b beats (or the last 10 beats, once N_b exceeds 10) and the last beat, respectively. As shown in Equation 3.1, the average value is a moving average estimated over the past 10 beats once more than 10 beats have been conducted. Therefore, it can follow changes in tempo, especially significant changes, quickly, and it also shows the speed trend of the conducting clearly.
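In code, the two values of Equations 3.1 and 3.2 can be maintained incrementally from beat timestamps. The sketch below is a minimal illustration rather than the Max/MSP/Java patch actually used in the system; in particular, it assumes that T_b is measured as the time spanned by the most recent beats (at most 10), since that bookkeeping detail is not spelled out here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Tempo tracking in BPM from beat timestamps (milliseconds), in the spirit of
// Equations 3.1 and 3.2. Assumption: T_b is the time spanned by the most
// recent beats (at most 10); the thesis does not specify this detail.
public class TempoTracker {
    private final Deque<Long> recent = new ArrayDeque<>();  // timestamps of recent beats
    private long previous = -1;                              // timestamp of the previous beat

    /** Register a beat; returns {averageBpm, instantBpm}, zeros until two beats exist. */
    public double[] onBeat(long timeMs) {
        double average = 0.0, instant = 0.0;
        recent.addLast(timeMs);
        if (recent.size() > 10) {
            recent.removeFirst();                            // keep only the past 10 beats
        }
        if (previous >= 0 && timeMs > previous) {
            instant = 60000.0 / (timeMs - previous);         // Tempo_i = 1 / T_one
            long oldest = recent.peekFirst();
            int beatsOverSpan = recent.size() - 1;           // intervals covered by T_b
            average = 60000.0 * beatsOverSpan / (timeMs - oldest);
        }
        previous = timeMs;
        return new double[] {average, instant};
    }
}
```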

For the left hand, this gestural interface controls dynamics in accordance with the orientation of the palm of the left hand as introduced in Section 3.1.3.
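The actual tracking details are given in Chapter 4. Purely as an illustration of the idea, the sketch below infers palm orientation from the gravity component reported by a hand-mounted accelerometer and nudges a dynamics level up or down accordingly. The axis assignment, the thresholds, and the step size are all assumptions and are not the mapping used in this system.

```java
// Illustrative dynamics follower: when the palm faces up the dynamics level
// rises; when the palm is held vertically (facing out) it falls.
// Assumptions (not from the thesis): acceleration is in g, the z axis points
// out of the back of the hand, and |z| near 1 g means the palm faces up.
public class DynamicsFollower {
    private double level = 0.5;                 // 0.0 = pp ... 1.0 = ff

    /** Only the z component is used in this toy mapping; x and y are ignored. */
    public double onSample(double xG, double yG, double zG) {
        if (zG > 0.8) {
            level += 0.01;                      // palm up: ask for louder playing
        } else if (Math.abs(zG) < 0.3) {
            level -= 0.01;                      // palm vertical: ask for softer playing
        }
        level = Math.max(0.0, Math.min(1.0, level));
        return level;
    }
}
```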

3.5 Response

After gestures are fed into a computer, users may reasonably expect to get some responses. In this gestural interface, visual and aural feedback will be presented once a gesture is recognized.

3.5.1 Visual representation

Visual feedback is the primary response supported by this gestural interface because it is intended to present a more straightforward interpretation to gestures. As mentioned in Section 2.5.2, few researchers implemented visual representations for conducting gestures. Video and animated characters have been used in several computer-based conducting systems to work as a virtual orchestra.

In this gestural interface, visual representations, as shown in Figure 3.3, include a conducting window, a gesture trajectory consisting of sample points and lines to connect them, beat dots, beat numbers, and so on. The details will be presented in Chapter 4.

Figure 3.3: An example of visual representations

3.5.2 Aural representation

Visual representation is not sufficient for a conducting system. Aural representation is still required.

The aural feedback in this gestural interface is the playing of an individual MIDI note for each beat, instead of a piece of music. A certain MIDI note corresponds to a certain beat. For example, C4 is played when the downbeat is found, the second beat triggers a D4, and so on. Table 3.2 shows the relationship between beats and MIDI notes in the system. This kind of aural feedback gives users the feeling of conducting. It also informs the user of any mistake in a gesture because a specific note has been associated with each beat.

In addition, aural feedback is utilized to inform the user of system errors as well.

So far, only one system error sound is considered in this gestural interface. It is played after a certain number of incorrect gestures have been made; the exact number will be chosen through experimentation. The MIDI note G7 is chosen to represent it. G7 has a much higher pitch than the beat sounds and can be differentiated from them easily. Once this sound is played, users know that they have made too many errors. The system will then be reset.

Table 3.2: The mapping between beats and MIDI notes in the system

Beat number MIDI note Used by Beat Pattern

1 C4 from 2 to 12 beat patterns

2 D4 from 2 to 12 beat patterns

3 E4 from 3 to 12 beat patterns

4 F4 from 4 to 12 beat patterns

5 G4 from 5 to 12 beat patterns

6 A4 from 6 to 12 beat patterns

7 B4 from 7 to 12 beat patterns

8 C5 from 8 to 12 beat patterns

9 D5 9, 10, 11, 12 beat patterns

10 E5 10, 11, 12 beat patterns

11 F5 11, 12 beat patterns

12 G5 12 beat pattern
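Expressed in code, the mapping of Table 3.2, together with the G7 error note introduced above, might look like the sketch below. It assumes the common convention in which middle C (C4) is MIDI note 60; the octave convention used by the actual Max/MSP patch may differ.

```java
// Beat-to-MIDI mapping from Table 3.2, assuming middle C (C4) = MIDI note 60.
// Beats 1..12 map to C4, D4, E4, F4, G4, A4, B4, C5, D5, E5, F5, G5.
public final class BeatNotes {
    private static final int[] NOTES = {60, 62, 64, 65, 67, 69, 71, 72, 74, 76, 77, 79};
    public static final int ERROR_NOTE = 103;   // G7, clearly higher than any beat note

    /** MIDI note number for beat 1..12 of a measure. */
    public static int noteForBeat(int beatNumber) {
        if (beatNumber < 1 || beatNumber > NOTES.length) {
            throw new IllegalArgumentException("beat number must be 1..12");
        }
        return NOTES[beatNumber - 1];
    }
}
```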

Chapter 4

IMPLEMENTATION AND EVALUATION

4.1 Development and runtime environment

The implementation of this gestural interface makes use of a specific set of hardware.

An iMac personal computer is used for development and as a runtime environment.

A Wii Remote and a baton-like infrared stick work together to capture the movement of a conductor's right hand. A WiTilt v2.5 is utilized for the motions of the left hand. The development of this system involves Max/MSP, Jitter, and Java. The Wii

Remote, the stick, the WiTilt v2.5, and the programming environment/language will be addressed in detail. Figure 4.1 shows the setup of this system.

Figure 4.1: The setup

4.1.1 Wii Remote

The Wii Remote, a controller for Nintendo's Wii game system, is employed as an infrared camera in this gestural interface. Figure 4.2 shows a picture of Wii Remote from Nintendo's website1.

Generally, the Wii Remote is used as a game controller. It has a motion sensor that can acquire 3-dimensional acceleration data. The acceleration sample data can be displayed as a graph using a piece of OS X software named DarwiinremoteOSC2.

Figure 4.3 shows the interface of the DarwiinremoteOSC.

DarwiinremoteOSC receives sample data from the Wii Remote via a Bluetooth connection. After that, it transfers both acceleration and infrared data into Open Sound Control3 (OSC) messages. OSC is a protocol for communication among multimedia applications, similar to MIDI. These OSC messages are sent out by DarwiinremoteOSC to gesture analysis for further processing via the User Datagram Protocol (UDP), one of the core protocols of the Internet. Because UDP does not check the delivery of each packet, it is fast and suitable for time-sensitive applications.

1http://www.nintendo.com/wii/what/controllers/remote, Retrieved May 18, 2008
2http://code.google.com/p/darwiinosc/, Retrieved May 18, 2008
3http://archive.cnmat.berkeley.edu/OpenSoundControl/, Retrieved May 18, 2008

Figure 4.2: Wii Remote (from its website)

Using the Wii Remote, users do not need to pay for a specific sensor bought or developed only for this gestural interface. Max/MSP supports both the OSC protocol and UDP, so it is easy to acquire the sample data with the help of DarwiinremoteOSC. More importantly, the infrared camera only captures the movement of infrared sources and has no specific requirements for the background. Therefore, it is suitable for use in a natural environment.

Figure 4.3: The interface of the DarwiinremoteOSC

One problem with the Wii Remote is its field of view: the horizontal and vertical fields of view are approximately 31 degrees and 41 degrees, respectively4. A motion beyond this range cannot be tracked and will be lost. Thus, the position of the Wii Remote has to be adjusted according to the position of the user before conducting.

4.1.2 A baton-like infrared stick

A baton-like infrared stick (shown in Figure 4.4) is used as an infrared light source in this interface. It consists of a conducting baton, a button, a one-cell battery holder, a 1.5v battery and an infrared LED. The handle of the conducting baton is removed and replaced with the battery. The infrared LED is put on the tip of the baton to emit infrared radiation. The infrared LED used in this gestural interface is chosen because of its large viewing angle of 110 degrees. If the viewing angle and radiant intensity are too small, the tracking will be easily lost while conducting. This infrared

LED has a 950 nm wavelength and top view orientation.

Figure 4.4: A baton-like infrared stick

While using this stick, users have to hold it in their right hand and press the button with their thumb. The infrared LED on the tip of the stick will emit infrared radiation, which is captured by the infrared sensor in the Wii Remote. This stick can be employed not only as a baton but also as a mouse. During conducting, the stick works like a baton: the trajectory of the infrared LED reveals the beat pattern conducted by the user. If the stick is used as a mouse, the position of the stick helps users choose one of five options. A uniform controller eliminates the extra time spent switching between the stick and a mouse and provides smooth operation of the system.

4http://www.wiili.org/index.php/Wiimote#IR_Sensor, Retrieved May 18, 2008

This stick is a compact and lightweight controller. It looks very much like a real baton, as shown in Figure 4.4. In fact, it is made from a real baton with the handle removed. A one-cell battery is fitted at the bottom of the baton where the handle was, so the stick is slightly longer than a real baton. Most of the stick's weight comes from the 1.5 V dry battery, which also makes it heavier than a real baton.

During conducting, the difference from a real baton is that the button on the stick has to be pressed by the user's thumb. This button controls the connection between the infrared LED and the battery. However, while holding a real baton, a conductor's thumb already has to rest on some position of the baton with a little pressure. Thus, the button may not affect the user's conducting, except that a little more pressure is required.

The technical problem for this stick is its viewing angle. Although 110 degrees is a very large viewing angle, it still may lose some tracking data if too large a motion is used. However, this is not a big problem because users rarely conduct with such a large gesture, and further, such a large conducting gesture would be incorrect. A loss of tracking data can therefore be considered an incorrect gesture.

4.1.3 WiTilt v2.5

Figure 4.5: The WiTilt 2.5 used in our system

The WiTilt v2.5 produced by Sparkfun Electronics5 is an acceleration sensor. It contains an accelerometer and a Bluetooth module. An accelerometer is a device that measures the acceleration. In general, the output value of a static accelerometer is 1 g since it is always in the earth's gravitational acceleration field. The output value changes when the accelerometer is moving or vibrating. The Bluetooth module is responsible for transmitting captured raw data to a host computer. Figure 4.5 shows the WiTilt v2.5 used in our system. It is fixed on a battery box. The WiTilt v2.5 will be tied on the back of a user's left hand using a velcro strap during conducting as shown in Figure 4.6.

The WiTilt v2.5 can provide 2-dimensional or 3-dimensional data, depending on whether the corresponding status in its configuration menus is set to active or inactive. In this gestural interface, 3-dimensional sample data are used. Before conducting, the WiTilt v2.5 should be calibrated using the configuration menus. In addition, the WiTilt v2.5 should be connected to a host computer using Bluetooth. This connection can be established with the help of a piece of software named WiTilt to OSC (W2O)6. W2O was developed in Objective-C by Woon Seung Yeo, a Ph.D. student at Stanford University. It receives sample data from the WiTilt v2.5 and then sends it out in OSC messages via UDP.

5http://www.sparkfun.com, Retrieved May 18, 2008

Figure 4.6: The coordinates of the WiTilt v2.5

The same calibration values should be used in W2O as in the configuration menus of the WiTilt v2.5. Figure 4.7 shows the interfaces of W2O.


Figure 4.7: The interfaces of W2O

6http://ccrma.stanford.edu/woony/software/w2o, Retrieved May 18, 2008

Actually, the Wii Remote could be employed instead of the WiTilt v2.5, since it can also capture 3-dimensional acceleration data. However, the WiTilt v2.5 is smaller than the Wii Remote and can be tied onto the left hand. The Wii Remote is bulkier and cannot be easily tied onto the user's hand; therefore, the user would have to hold the Wii Remote while conducting.

4.1.4 Software

Both Max/MSP and Jitter are products from Cycling '747. Max/MSP is a graphical programming environment for real-time multimedia applications. During programming, a user selects objects from the palette and connects them with cords as shown in Figure 4.8. Jitter is an extension to Max/MSP specifically designed for graphic manipulations and is very valuable for developing realtime video applications and processing matrix data.

The Java programming language is utilized to develop functions that Max/MSP does not support or that are difficult to implement in it, such as data sorting. Users can use any editor to write the code. Cycling '74 provides an integrated environment named mxj quickie to edit and compile Java code.

cv.jit8, a collection of tools for computer vision, was used in some trials but is not employed in the final version. It is written in Max/MSP/Jitter.

7http://www.cycling74.com, Retrieved May 18, 2008
8http://www.iamas.ac.jp/~jovan02/cv/, Retrieved May 18, 2008

Figure 4.8: An example of a Max/MSP patch

4.2 Main window

The main window is intended to display the menu, the conducting window, and the information related to tempo and dynamics as shown in Figure 4.9.

For the right hand, there are two modes in this gestural interface, the option selection mode and the conducting mode. The stick can be used as a mouse or an infrared light source as described in Section 4.1.2. The focus, indicated by the color of two short bars, Menu and Reset, is switched between the menu and the conducting window in accordance with two modes. The "Menu" bar in the conducting window mimics a switch button and is used to go back to the menu area. There is no specific switch bar in the menu. If users stay on an option, for example 2-Beat, for a period of time, the focus will be moved back to the conducting window. In addition, a "Reset" bar is located at the left side of the conducting window. It is utilized to reset the system at any time if the focus is in the conducting window. This bar provides the

40 option for users to restart their conducting as they like.

[Figure 4.9 screenshot: the main window with a receiving-data indicator (infrared and acceleration data), the Dynamics slider, the Reset bar, the conducting window, and the Average Tempo (BPM) and Instant Tempo (BPM) displays.]

Figure 4.9: A snapshot of the main window

The menu contains five options to support the two types of gesture recognition for the right hand and the gesture following described in Section 3.4. The first three options belong to the gesture recognition based on an assumed beat pattern of 2, 3, or 4 beats. The fourth option corresponds to the generalized gesture recognition, which is based only on the detection of the downbeat and is capable of identifying 2- to 12-beat patterns. The last option is for tempo practice only, without beat pattern or gesture identification. Any option can be chosen using the infrared stick while the focus is in the menu area. The menu of a Windows or Mac OS application is generally located at the top of the window; therefore, the menu area of this system is placed at the top of the main window as well.

The visual representation of a beat pattern is shown in the conducting window, and the result of gesture recognition is displayed in this area as well. It is therefore the main work area and will attract users' attention most of the time. The conducting window is located in the center of the main window and is a large area for showing a lot of information.

Dynamics are represented using a slider, a common way of displaying volume. For aesthetic reasons, a horizontal slider is utilized instead of a vertical one. From the left side to the right side, dynamics change from soft to loud; pp, p, f, and ff are therefore marked from left to right to show the trend of the dynamics.

The indicator of data receiving and the visual representations of both the dynamics and the tempo are all on the left side of the main window. Dynamics are conveyed by the left hand of a conductor, so it is natural to display them on the left side. The data-receiving indicator and the tempo information are placed there for the sake of aesthetics.

4.3 Gesture tracking

The Wii Remote and the baton-like infrared stick are used to track the movement of the right hand. Before gesture tracking, the Wii Remote has to be connected to a computer using Bluetooth. While conducting, users press the button on the stick and move their right hand as though they were holding a baton. DarwiinremoteOSC gets sample data containing the position of the infrared LED from the Wii Remote and then sends them out in OSC messages via UDP.

Figure 4.10 shows the Max/MSP patch used to receive sample data via UDP. Max/MSP provides two objects, [udpreceive] and [udpsend], for UDP. The contents of an OSC message are retrieved using a [route] object; it is therefore not difficult to capture the motions of a conductor and acquire the corresponding sample data. For the right hand, the sample data retrieved by a [route] object consists of three parts: the horizontal coordinate, the vertical coordinate, and the infrared intensity. The first two parts are the basis of gesture analysis.
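For readers unfamiliar with Max/MSP, the sketch below shows, in plain Java, roughly what the [udpreceive 5600] and [route /wii/irdata] objects accomplish together: it listens on UDP port 5600 and, for a bare OSC message addressed to /wii/irdata, extracts the big-endian float arguments. The message layout assumed here (a single non-bundled message with a ",fff"-style type tag) is an illustrative assumption, not a description of the actual patch internals or of DarwiinremoteOSC's exact output.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.ByteBuffer;

// Minimal illustration of receiving OSC-over-UDP data, similar in spirit to
// the [udpreceive 5600] -> [route /wii/irdata] chain in the Max patch.
public class IrDataReceiver {

    // OSC strings are null-terminated and padded to a multiple of 4 bytes.
    private static String readOscString(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder();
        byte b;
        while ((b = buf.get()) != 0) {
            sb.append((char) b);
        }
        while (buf.position() % 4 != 0) {
            buf.get();                               // skip padding bytes
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        DatagramSocket socket = new DatagramSocket(5600);
        byte[] data = new byte[1024];
        DatagramPacket packet = new DatagramPacket(data, data.length);
        while (true) {
            socket.receive(packet);
            ByteBuffer buf = ByteBuffer.wrap(packet.getData(), 0, packet.getLength());
            String address = readOscString(buf);     // e.g. "/wii/irdata"
            if (!address.equals("/wii/irdata")) {
                continue;                            // the "route" step
            }
            String typeTags = readOscString(buf);    // e.g. ",fff"
            float[] values = new float[typeTags.length() - 1];
            for (int i = 0; i < values.length; i++) {
                values[i] = buf.getFloat();          // big-endian by default
            }
            if (values.length >= 2) {
                // values[0] and values[1] would be the horizontal and vertical
                // coordinates; a third value would be the infrared intensity.
                System.out.printf("x=%.3f y=%.3f%n", values[0], values[1]);
            }
        }
    }
}
```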

Figure 4.10: A Max/MSP patch to receive sample data via UDP (the patch chains [udpreceive 5600], [route /wii/irdata], and [unpack 0. 0. 0.])

The WiTilt v2.5 has been employed for the tracking of the left hand. It also needs to establish a Bluetooth connection with a computer before conducting. While a user's left hand is moving, the W2O software captures the acceleration data and sends it to a Max/MSP patch in OSC messages via UDP. A [route] object is used to retrieve the sample data as well. The sample data is composed of the acceleration values of the x, y, and z axes.

4.4 Gesture analysis

For the right hand, no segmentation is required for the sample data from the Wii Remote. The coordinates are the output of the gesture tracking and can be used to produce high-level features: beats. For the left hand, the acceleration data are acquired from gesture tracking and sent to gesture following to be processed.

4.4.1 Coordinates

The coordinates from gesture tracking reveal the position of the infrared LED within the field of view of the Wii Remote.

All coordinates are represented using small dots on the main window, as shown in Figure 4.11. A curve connects these dots and represents the trajectory of the movement of the right hand. By displaying the trajectory of a motion, a conducting gesture of the right hand is presented graphically. This is a kind of quantitative interpretation and lets users know exactly what their gestures look like; the representation is therefore clear and accurate. The graph can be compared with the corresponding diagram in a textbook to improve users' gestures. Figure 3.2(a) is a reference for Figure 4.11.

4.4.2 Beats

Beats are extracted based on the coordinates of the sample data. As addressed in Section 3.3.2, all beats occur at the vertical minima of a trajectory. An algorithm was developed to detect the peaks and troughs of a trajectory by looking for the local maxima and minima of the vertical coordinate. Sample points with the same vertical coordinate are discarded because they do not assist in finding troughs on the trajectory of users' movements. The algorithm uses a threshold to remove noise that would result in unexpected peaks or troughs. After a few trials, we chose 0.029 as the threshold, because a large threshold will omit real peaks and troughs and reduce the sensitivity of the interface, whereas a small one may incorrectly interpret noise as peaks or troughs and give incorrect recognition/following results.

Figure 4.11: A correct gesture for a 2-beat pattern

As shown in Figure 4.11, the highest point before a downbeat is displayed using a square on the screen. A trough (beat) is represented with a larger dot than that used for a sample point. It is easy to differentiate one feature from another because their visual representations have different shapes and sizes.

9 The boundaries of the conducting window are from -1 to 1 in both the x and y directions.
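The following is a minimal sketch of one way to realize the peak/trough detector described above, assuming the vertical coordinates arrive one sample at a time in the range -1 to 1; the class and method names are illustrative, and the 0.029 noise threshold is the value quoted in the text.

```java
// Streaming peak/trough detector for the vertical coordinate of the baton tip.
// A trough (a beat) is reported only when the trajectory has fallen more than
// the noise threshold below the last recorded extremum and then turns upward;
// samples with the same vertical coordinate as the previous one are discarded.
public class BeatDetector {

    private static final float THRESHOLD = 0.029f;  // noise threshold from the text

    private float lastY = Float.NaN;   // previous sample
    private float extremeY;            // value of the last recorded peak/trough
    private boolean falling = false;   // current direction of motion

    /** Returns true when the current sample confirms a trough (a beat). */
    public boolean addSample(float y) {
        if (Float.isNaN(lastY)) {            // first sample: just remember it
            lastY = y;
            extremeY = y;
            return false;
        }
        if (y == lastY) {                    // equal vertical coordinates are discarded
            return false;
        }
        boolean beat = false;
        if (y < lastY) {                     // now moving downward
            if (!falling && lastY - extremeY > THRESHOLD) {
                extremeY = lastY;            // lastY was a peak
            }
            falling = true;
        } else {                             // now moving upward
            if (falling && extremeY - lastY > THRESHOLD) {
                beat = true;                 // lastY was a trough: report a beat
                extremeY = lastY;
            }
            falling = false;
        }
        lastY = y;
        return beat;
    }
}
```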

4.5 Gesture recognition/following

4.5.1 Separate beat pattern recognition and accuracy

The first three options in the main window are employed for the recognition of 2-beat patterns, 3-beat patterns, and 4-beat patterns, respectively. They identify a gesture as an expected beat pattern according to the position and size of beats. If the gesture does not correspond to the desired beat pattern, the system recognizes it as an "error" gesture. Too many error gestures will trigger the system to reset and wait for a small period of time. This causes the user to take notice and improve their conducting gestures.

Figure 4.11, Figure 4.12, and Figure 4.13 show three examples of correct gestures. They follow the rules presented in Table 3.1. The recognition result is displayed on the upper left side of the conducting window; beside it, the numbers of correct and error gestures are shown to indicate the quality of users' gestures.

Figure 4.12: A correct gesture for a 3-beat pattern

Figure 4.13: A correct gesture for a 4-beat pattern

This recognition involves downbeat detection, which is based on horizontal coordinates. Each beat occurs at a trough of the trajectory of the right-hand motion and has a corresponding peak. The horizontal coordinate of a peak is stored and serves as a baseline value. The horizontal coordinates of all sample points between this peak and its corresponding trough (beat) are compared with that baseline value to generate difference values. A beat is the downbeat if and only if all these differences are below a threshold; that is, the line connecting a peak and its trough (beat) is almost a vertical straight line. Currently, a fixed threshold of 0.08 (on coordinates ranging from -1 to 1) is used for experimentation. Figure 4.14 shows an example of an incorrect downbeat whose trajectory is not a vertical straight line. In the future, a more flexible method, such as a relative threshold, may be considered to provide a better decision.

An incorrect gesture may contain an error in the downbeat or in the other beats. A downbeat is not correct if its trajectory is not a straight line. An incorrect downbeat can also be a motion so small that its magnitude is below a threshold of 0.3 (on coordinates ranging from -1 to 1). If an incorrect downbeat occurs, the system informs users with the beat number "0" and the message "error first beat", as shown in Figure 4.14. The second beat cannot be accepted until a correct downbeat is performed. A gesture with a correct downbeat but other incorrect beats is identified as an error gesture, as shown in Figure 4.15, which demonstrates a 2-beat pattern in which the second beat is in the wrong position; it should be on the right side of the downbeat. "Error gesture" is displayed in the conducting window as the recognition result, and the count of error gestures increases by one.

The accuracy of this recognition depends on the downbeat identification and on the quality of the other beats. It cannot always give correct results because, for example, some conductors may not perform a straight downward motion for a downbeat. However, it works well and produces correct identification results for gestures that follow the rules employed in this recognition.
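A sketch of the straight-line test described above is given below, assuming the horizontal coordinates of the samples between a peak and its trough are available as an array; the 0.08 and 0.3 thresholds are the values quoted in the text, and the class and method names are illustrative.

```java
// Illustrative check for whether the beat ending at a trough is a downbeat:
// the horizontal deviation from the peak must stay within a small threshold
// (an almost vertical line) and the vertical drop must be large enough to
// count as a real downbeat rather than a small motion.
public final class DownbeatTest {

    private static final float HORIZONTAL_THRESHOLD = 0.08f; // from the text
    private static final float MIN_MAGNITUDE = 0.3f;         // from the text

    /**
     * @param x       horizontal coordinates of the samples from the peak
     *                (inclusive) to the trough (inclusive), range -1 to 1
     * @param peakY   vertical coordinate of the peak
     * @param troughY vertical coordinate of the trough
     */
    public static boolean isDownbeat(float[] x, float peakY, float troughY) {
        if (peakY - troughY < MIN_MAGNITUDE) {
            return false;                       // motion too small to be a downbeat
        }
        float baseline = x[0];                  // horizontal coordinate of the peak
        for (float xi : x) {
            if (Math.abs(xi - baseline) > HORIZONTAL_THRESHOLD) {
                return false;                   // trajectory bends away from vertical
            }
        }
        return true;                            // almost a vertical straight line
    }
}
```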

Figure 4.14: An error downbeat for a 2-beat pattern

Figure 4.15: An incorrect gesture for a 2-beat pattern

4.5.2 Mixed beat pattern recognition and accuracy

Mixed beat pattern recognition corresponds to the fourth option, "Mixed", on the main window. The same downbeat detection algorithm is employed in this recogni­ tion. Once a downbeat is found, it triggers a counter to count the number of beats until another downbeat occurs. The number of beats between these two downbeats is the beat pattern performed. Because only twelve beat patterns are supported in this option, a gesture containing more than 12 beats cannot be identified and is regarded as an incorrect gesture.
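A minimal sketch of the counter driving the "Mixed" option is shown below, assuming each detected beat has already been classified as a downbeat or a non-downbeat; the class name and return convention are illustrative, not taken from the actual implementation.

```java
// Counts the beats between two consecutive downbeats; the count is the beat
// pattern performed. Patterns longer than 12 beats are rejected, matching the
// limit described for the "Mixed" option.
public class MixedPatternCounter {

    private static final int MAX_BEATS = 12;
    private int beatsSinceDownbeat = -1;   // -1: no downbeat seen yet

    /**
     * Call once per detected beat.
     *
     * @param isDownbeat whether the new beat was classified as a downbeat
     * @return the recognized pattern length (2..12), 0 if still counting,
     *         or -1 for an unrecognizable gesture (more than 12 beats)
     */
    public int onBeat(boolean isDownbeat) {
        if (beatsSinceDownbeat < 0) {          // waiting for the first downbeat
            if (isDownbeat) {
                beatsSinceDownbeat = 1;
            }
            return 0;
        }
        if (isDownbeat) {                      // a full gesture has been conducted
            int pattern = beatsSinceDownbeat;
            beatsSinceDownbeat = 1;            // the new downbeat starts the next gesture
            return pattern <= MAX_BEATS ? pattern : -1;
        }
        beatsSinceDownbeat++;
        return 0;
    }
}
```

As the text notes, a missed or incorrect downbeat simply fails to stop the counter, which is why the recognized pattern then overshoots (see Table 4.1).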

Figure 4.16: An example of 12-beat patterns

Figure 4.16 shows an example of a 12-beat pattern. The recognition result is still displayed on the upper left side of the conducting window. No counts of correct and error gestures are shown, because in this option the system does not know in advance which beat pattern will be performed. If the system is waiting for a downbeat but error downbeats are performed, a second beat cannot be accepted; the beat number "0" and the message "error first beat" are shown in the conducting window to inform users.

Table 4.1 shows a few recognition results for three beat patterns. It illustrates how the downbeat of a subsequent gesture affects the recognition accuracy. Because this gesture recognition is intended to count the number of beats between two downbeats, once a downbeat is conducted, the counter is triggered and has to be increased until a subsequent downbeat is found. An incorrect downbeat will be identified as other beats, so it cannot stop the counter. Finally, an incorrect beat pattern will be returned to the user as a recognition result.

Table 4.1: The results of recognition only based on the downbeat detection

Current beat pattern | Current downbeat | Next beat pattern | Next downbeat | Current beat pattern recognized as
2-beat pattern | Correct | Any beat pattern | Correct | 2-beat pattern
2-beat pattern | Correct | 2-beat pattern | Incorrect | 4-beat pattern or above
2-beat pattern | Incorrect | - | - | No result
3-beat pattern | Correct | Any beat pattern | Correct | 3-beat pattern
3-beat pattern | Correct | 2-beat pattern | Incorrect | 5-beat pattern or above
3-beat pattern | Incorrect | - | - | No result
4-beat pattern | Correct | Any beat pattern | Correct | 4-beat pattern
4-beat pattern | Correct | 2-beat pattern | Incorrect | 6-beat pattern or above
4-beat pattern | Incorrect | - | - | No result

4.5.3 Tempo tracking and accuracy

Tempo tracking consists of both average tempo and instant tempo estimation. These correspond to the last option, "Tempo", on the main window. The tempo is in fact calculated even when users are working with other options; users can therefore either choose the last option to practice controlling the tempo, or check the tempo information after conducting when another option is chosen.

Figure 4.17 shows a screenshot taken while "Tempo" is chosen. Only a single beat is displayed in the conducting window because the user's attention should be on the two tempo areas. The numerical tempo value is given in BPM and provides a concrete measure of the moving speed at a given moment. The diagram clearly shows whether the tempo remains steady, increases, or decreases; it helps users know the history of the tempo and control their speed as desired.
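The thesis does not spell out the tempo formula, but the values in Table 4.2 are consistent with the standard definition of beats per minute. The sketch below shows that computation under the assumption that beat times are collected in milliseconds; it is one plausible reading of the reported numbers, not the exact code used.

```java
// Illustrative tempo estimation from beat timestamps (milliseconds).
// Average tempo uses the total beat count and elapsed time; instant tempo
// uses only the most recent inter-beat interval.
public class TempoTracker {

    private long firstBeatMillis = -1;
    private long lastBeatMillis = -1;
    private int beatCount = 0;
    private double instantBpm = 0.0;

    /** Call with the system time of each detected beat. */
    public void onBeat(long nowMillis) {
        if (firstBeatMillis < 0) {
            firstBeatMillis = nowMillis;
        } else {
            double intervalSeconds = (nowMillis - lastBeatMillis) / 1000.0;
            instantBpm = 60.0 / intervalSeconds;     // instant tempo (BPM)
        }
        lastBeatMillis = nowMillis;
        beatCount++;
    }

    /** Average tempo in beats per minute over the whole session. */
    public double averageBpm() {
        double elapsedSeconds = (lastBeatMillis - firstBeatMillis) / 1000.0;
        return elapsedSeconds > 0 ? beatCount * 60.0 / elapsedSeconds : 0.0;
    }

    public double instantBpm() {
        return instantBpm;
    }
}
```

For example, 30 beats conducted in 15.6 seconds give 30 * 60 / 15.6 = 115.38 BPM, which matches the first row of Table 4.2.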

Figure 4.17: A screenshot of tempo practice

The accuracy of "Tempo" involves two aspects: numerical accuracy and graphical accuracy. The tempo diagram is a graphical interpretation of the numerical tempo, so its accuracy relies on the numerical tempo. The numerical tempo is calculated by our program and depends on the computation speed. An experiment was designed to estimate the difference between the tempo calculated by the system and the tempo calculated by a human being; it was conducted using a debug status. First, the time taken by the system to find a beat was compared with that of a human being. A person cannot perceive the difference between them, so the number of beats counted by a human being is regarded as the same as that counted by the system. Second, a stopwatch was used to record the time taken to conduct a certain number of beats, and this time was compared with the time recorded by the system. Three groups of conducting gestures were performed, consisting of 15, 30, and 45 2-beat pattern gestures, respectively. The comparison shown in Table 4.2 reveals no obvious differences.

4.5.4 Dynamics tracking

Dynamics tracking is done on the basis of the three-dimensional acceleration values. As the orientation of the left palm changes, the acceleration values change correspondingly and the dynamics change from loud to soft, or vice versa. Loudness is associated with a negative value on the z axis and a value of zero on the other two axes; a positive value on the y axis and a value of zero on both the x and z axes correspond to the softness gesture. Through experimentation, a range from -0.38 to +0.38 (the acceleration data is mapped to a range of -1 to 1) was chosen as a tolerance area: an axis is considered to have a zero value if and only if its value falls into this area. Larger or smaller acceleration values also reveal the degree of loudness or softness.
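A sketch of the mapping described above follows, assuming the three acceleration values have already been scaled to the range -1 to 1; the plus/minus 0.38 tolerance is the value quoted in the text, while the class name and return convention are illustrative.

```java
// Illustrative classification of the left-hand orientation into loud/soft
// gestures from scaled acceleration values (each in -1..1). An axis counts
// as "zero" when it lies inside the +/-0.38 tolerance band from the text.
public final class DynamicsTracker {

    private static final float TOLERANCE = 0.38f;

    private static boolean nearZero(float v) {
        return Math.abs(v) <= TOLERANCE;
    }

    /**
     * @return a negative value for loud gestures, a positive value for soft
     *         gestures, and 0 when neither orientation is clearly present.
     *         The magnitude reflects the degree of loudness or softness.
     */
    public static float classify(float ax, float ay, float az) {
        if (az < -TOLERANCE && nearZero(ax) && nearZero(ay)) {
            return az;            // loud: palm orientation gives a negative z
        }
        if (ay > TOLERANCE && nearZero(ax) && nearZero(az)) {
            return ay;            // soft: palm orientation gives a positive y
        }
        return 0.0f;              // neither gesture clearly present
    }
}
```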

Table 4.2: Comparison between the calculated average tempo and the real average tempo

Group | Amount | Style | Time by the system (s) | Time by the stopwatch (s) | Calculated tempo (BPM) | Real tempo (BPM)
1 | 15 | 2-beat pattern | 15.6 | 15.56 | 115.38 | 115.68
2 | 30 | 2-beat pattern | 36.2 | 35.93 | 99.45 | 100.19
3 | 45 | 2-beat pattern | 42.1 | 41.91 | 128.27 | 128.85

Dynamics tracking is an independent function in this system. It can be separately used to practice the control of dynamics using the WiTilt v2.5. It can also work with gesture recognition and tempo tracking together so that a user may practice conducting with two hands.

Figure 4.18 shows the changes in dynamics using a slider. This is a conventional visual representation of volume and is easy for users to understand.

Figure 4.18: Examples of the changes of dynamics

4.6 Aural Representation

The implementation of aural feedback makes use of the MIDI-related objects in Max/MSP.

The system error limit was set to ten errors through experimentation. Once the number of error gestures accumulates to ten while a user is conducting, the system plays the system error sound and restarts the system counter, which then counts down from three to zero. During this period, beat tracking is disabled, giving users enough time to adjust their gestures. After the system counter reaches zero, the system begins to track the conducting again. The limit of ten errors was chosen because it may be annoying for a user if too many errors are reported during conducting, whereas a smaller limit, such as two or three errors, would waste too much time on restarting the system counter.

Chapter 5

DISCUSSION

Chapter 4 describes an implementation using the Wii Remote to track the motions of the right hand. A video camera and the WiTilt v2.5 were also considered as potential sensors for tracking the right hand. This chapter presents the trials with these two sensors and discusses the reasons for abandoning them.

5.1 Video camera

While a user is standing in front of a video camera and conducting, the movement of the right hand is tracked to produce 2-dimensional position data (coordinates).

There are three possible approaches for a video camera to track the movement in this interface.

Video camera alone. A video camera can be employed alone to track the position of the moving hand. It is more convenient than the Wii Remote for a user who is used to conducting without a baton. Because the information captured by a video camera includes both the moving hand and the background, segmentation techniques are necessary to extract the moving hand, and the quality of the segmentation directly affects the gesture recognition/following. Through experimentation, we found that it is difficult to extract the moving hand from a natural background; this is described in Section 5.1.1 in detail.

Video camera plus a color object. A video camera is still used to track the position at any time. The difference from the above scenario is that a color object is held by or attached to the right hand; as a result, the tracked position is that of the moving color object rather than of the right hand itself. This method places a restriction on the moving hand by requiring a pure color object whose color does not appear in the background. Through experimentation, we found that it is easy to lose the moving object; therefore, it is not a good choice.

Video camera plus an infrared light source. This scenario is similar to the Wii Remote approach. A video camera with an infrared filter captures the movement of an infrared light source held by the right hand, and the 2-dimensional data is the position of the infrared light source. We chose to use the Wii Remote as an infrared camera because it is designed for infrared tracking and therefore has higher sensitivity.

5.1.1 Segmentation

Segmentation techniques have been utilized for the first approach, a video camera alone, to filter out a moving hand from the complex natural background.

The extraction of the moving right hand makes use of a foreground detection technique [6]. We assume that all objects in the background are static during conducting. Therefore, if a frame at time t is used as the background, the difference between this frame and the next frame at time t + 1 can detect the moving object. A pixel in the frame at time t + 1 belongs to the moving object if and only if Equation 5.1 is satisfied.

|P_{t+1} - P_t| > T    (5.1)

P_{t+1} and P_t denote the values of a pixel at time t + 1 and t, respectively. T is the threshold that decides whether a pixel is moving; it was determined through experimentation to be 80. During conducting, in the ideal case, only the right hand is moving. In reality, however, the conductor's head and body often move a little as well, adding noise to this method; it is difficult to keep the whole body static while only the right hand is moving. Screenshot (a) in Figure 5.1 shows a detection result containing a user's head and body as well as her hand: all moving objects are detected. By increasing the threshold and trying to keep the rest of the body moving only within a small range, we can resolve this problem and obtain screenshot (b) in Figure 5.1. Increasing the threshold to get rid of noise reduces the sensitivity of the foreground detection a little.

1 Here, 80 is the intensity value of a pixel in a frame.

Figure 5.1: The segmentation of a moving hand ((a) hand detection; (b) optimized hand detection)
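A minimal sketch of the frame-difference test of Equation 5.1 is given below, assuming 8-bit grayscale frames stored as int arrays; the threshold of 80 is the experimentally determined value from the text, and the class name is illustrative.

```java
// Per-pixel foreground detection by frame differencing (Equation 5.1):
// a pixel belongs to the moving object when its intensity changes by more
// than the threshold T between consecutive frames.
public final class ForegroundDetector {

    private static final int T = 80;   // intensity threshold from the text

    /**
     * @param previous grayscale frame at time t   (values 0..255)
     * @param current  grayscale frame at time t+1 (values 0..255)
     * @return a mask where true marks pixels of the moving object
     */
    public static boolean[] detect(int[] previous, int[] current) {
        boolean[] moving = new boolean[current.length];
        for (int i = 0; i < current.length; i++) {
            moving[i] = Math.abs(current[i] - previous[i]) > T;
        }
        return moving;
    }
}
```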

5.1.2 Feature extraction

After the moving hand is segmented, low-level features (coordinates) are extracted for further processing. For a hand, the coordinates representing its position are the coordinates of the centroid of the hand. The detected object usually consists of a number of separate "blobs" (regions of interest). Equation 5.2 and Equation 5.3 illustrate the calculation of the centroid of a hand from these blobs.

\bar{x} = \frac{A_1 x_1 + A_2 x_2 + \dots}{A_1 + A_2 + \dots}    (5.2)

\bar{y} = \frac{A_1 y_1 + A_2 y_2 + \dots}{A_1 + A_2 + \dots}    (5.3)

Here, \bar{x} and \bar{y} represent the horizontal and vertical coordinates of the centroid, respectively. x_1 and y_1 are the coordinates of the centroid of the first blob, and A_1 denotes the area of the first blob; x_2, y_2, and A_2 are the corresponding quantities for the second blob. The circles in Figure 5.2 represent the blobs of a moving hand. The idea of these two equations is that \bar{x} and \bar{y} are the area-weighted averages of the horizontal and vertical coordinates of the blobs.
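A sketch of Equations 5.2 and 5.3 follows, assuming each detected blob is reported with its area and its own centroid coordinates; the Blob class is illustrative and does not correspond to any particular blob-tracking library.

```java
// Area-weighted centroid of a set of blobs (Equations 5.2 and 5.3):
// each blob contributes its own centroid, weighted by its area.
public final class HandCentroid {

    /** Illustrative blob representation: area and centroid coordinates. */
    public static final class Blob {
        final double area, cx, cy;
        public Blob(double area, double cx, double cy) {
            this.area = area; this.cx = cx; this.cy = cy;
        }
    }

    /** @return {x, y}: the centroid of the whole hand (blobs must be non-empty). */
    public static double[] centroid(Blob[] blobs) {
        double sumArea = 0.0, sumX = 0.0, sumY = 0.0;
        for (Blob b : blobs) {
            sumArea += b.area;
            sumX += b.area * b.cx;    // A_i * x_i
            sumY += b.area * b.cy;    // A_i * y_i
        }
        return new double[]{sumX / sumArea, sumY / sumArea};
    }
}
```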

Figure 5.2: The blobs in a moving hand at a certain time

Since the coordinates of the centroid have been calculated, a trajectory of the moving hand can be drawn as shown in screenshot (a) in Figure 5.3. However, it is not smooth and hence not suitable for further processing. Therefore, an average filter is employed to smooth the trajectory.

(a) Before smoothing (b) After smoothing

Figure 5.3: A trajectory before and after smoothing

Equation 5.4 gives the formula for a simple finite impulse response (FIR) averaging filter. Here, x(n) is the current input signal, and x(n-1) and x(n-2) are previous input signals; they are averaged to generate the output signal y(n), where N is the number of samples in the average. The number of previous signals used is called the order of the filter. Through experimentation, a 3-order average filter was determined to be the best filter for smoothing the trajectory of the centroid. Screenshot (b) in Figure 5.3 illustrates the trajectory after smoothing.

y(n) = \frac{x(n) + x(n-1) + x(n-2) + \dots}{N}    (5.4)
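A sketch of the averaging filter of Equation 5.4 is shown below. It assumes that a "3-order" filter averages the current sample with the three previous ones; the thesis does not state the exact window convention, so that choice is an assumption, as are the class and method names.

```java
// Simple FIR averaging (moving-average) filter as in Equation 5.4:
// the output is the mean of the current sample and the previous `order`
// samples. With order = 3 this averages four consecutive samples. Until the
// window fills, the mean of the samples seen so far is returned.
public class AverageFilter {

    private final float[] window;
    private int filled = 0;     // how many samples have been seen so far
    private int next = 0;       // circular-buffer write position

    public AverageFilter(int order) {
        this.window = new float[order + 1];
    }

    /** Feed one sample x(n); returns the smoothed output y(n). */
    public float filter(float x) {
        window[next] = x;
        next = (next + 1) % window.length;
        if (filled < window.length) {
            filled++;
        }
        float sum = 0.0f;
        for (int i = 0; i < filled; i++) {
            sum += window[i];
        }
        return sum / filled;
    }
}
```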

5.1.3 Comparison between video camera and Wii Remote

If a video camera is utilized alone for this gestural interface, it requires more computation time than the system based on the Wii Remote. The Wii Remote performs in hardware several steps that the system requires, whereas a camera-based system has to perform these steps in software: the moving hand must be segmented from the background image, the coordinates of the motion have to be calculated instead of being acquired directly, and the coordinates must then be smoothed by a filter to remove the noise introduced in these initial steps. All of this processing adds extra computation time compared to using the Wii Remote. In addition, although this method is good for a conductor who does not like a baton, it was not chosen as optimal because the purpose of this system is to help a user learn and practice conducting gestures. A conducting student should start by using a baton, because "it seems more difficult to begin using a baton after learning to conduct without one than to take the opposite approach" [21].

A video camera plus a color object is not suitable for a fast-moving hand: when an object moves quickly through the field of view of a camera, the color blurs and the RGB value of the color changes, causing a loss of tracking. A video camera plus an infrared light source is similar to, but slower than, the system based on the Wii Remote. Thus, neither is employed for this gestural interface.

Table 5.1 summarizes the differences between a system based on a video camera and one based on the Wii Remote.

Table 5.1: Comparison between video camera and Wii Remote

Sensor | Function | Accessory | Field of view | Gesture sensitivity | Price | Technical limit
Wii Remote | Infrared sensor | A stick equipped with an infrared LED on the tip | Horizontal: 41 degrees; vertical: 31 degrees | Any motion, smooth tracking | About $40 | Not a very large field of view
Video camera 1 | Motion sensor | None | Approximately 45 degrees | Large motions, smoothing required | About $40 or above | Everything except the moving hand must be kept static
Video camera 2 | Motion sensor | A color object | Approximately 45 degrees | Not-too-fast movement | About $40 or above | Great contrast required between the color object and the background; limited by the speed of motion
Video camera 3 | Motion sensor | An infrared filter and a stick equipped with an infrared LED on the tip | Approximately 45 degrees | Any motion, smooth tracking | About $40 or above | A short distance, depending on the intensity of the infrared light source

5.2 WiTilt v2.5

The WiTilt v2.5 is another potential sensor to track the movement of the right hand for this gestural interface. It makes use of 3-dimensional acceleration data to reveal the gestures of the right hand.

5.2.1 Gesture tracking

The WiTilt v2.5 is employed to track the movement of the right hand in a way similar to that used for the left hand, as described in Section 4.3. The WiTilt v2.5 is tied to the back of the right hand as shown in Figure 4.6. The W2O software receives the sample data, translates it into OSC messages, and then sends them out via UDP to a Max/MSP patch. The Max/MSP patch retrieves the 3-dimensional data and sends it to feature extraction.

Figure 5.4: The effect of linear interpolation (the z-axis signal before and after interpolation)

5.2.2 Feature extraction

During feature extraction, two preprocessing steps are performed first: linear interpolation and smoothing. The purpose of linear interpolation is to regenerate dropped sample points and produce a complete set of sample data. Figure 5.4 shows a signal of the z axis before and after the interpolation; the pattern is more obvious after the interpolation, but the signal is still not smooth enough. Thus, a 10-order average filter is utilized to remove noise in the signal, the order having been determined through experimentation. Figure 5.5 shows three signals for a certain beat pattern before and after the smoothing.
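A sketch of the interpolation step is shown below, under the assumption that dropped samples are marked as NaN in the received acceleration stream; the in-place filling strategy and the class name are assumptions about how the preprocessing might be realized, not the thesis's actual code.

```java
// Fill dropped samples (marked as NaN) by linear interpolation between the
// nearest valid neighbours, producing a complete set of sample data.
// Leading or trailing gaps are left untouched in this sketch.
public final class LinearInterpolation {

    public static void fill(float[] samples) {
        int lastValid = -1;                         // index of the last real sample
        for (int i = 0; i < samples.length; i++) {
            if (Float.isNaN(samples[i])) {
                continue;
            }
            if (lastValid >= 0 && i - lastValid > 1) {
                // interpolate every dropped sample between lastValid and i
                float step = (samples[i] - samples[lastValid]) / (i - lastValid);
                for (int j = lastValid + 1; j < i; j++) {
                    samples[j] = samples[lastValid] + step * (j - lastValid);
                }
            }
            lastValid = i;
        }
    }
}
```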

Figure 5.5: The effect of the smoothing (the x-, y-, and z-axis signals before and after smoothing)

The sample data preprocessed by these two steps is ready for feature extraction.

Principal components analysis (PCA) [36] was chosen to convert the 3-dimensional sample data to 1- or 2-dimensional data, depending on the subsequent processing in gesture recognition. We chose PCA because conducting motion typically lies in a 2-dimensional plane in front of the conductor. PCA is a technique for reducing the dimensionality of a data set. It constructs a new coordinate system composed of the eigenvectors of the covariance matrix of the data set, with the origin of the new coordinate system at the mean of the data set. Ordering the eigenvectors reveals the importance of the axes. Finally, the data is transformed into the new coordinate system, and the dimensions of the new data set are re-arranged according to their importance.
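The sketch below illustrates this dimensionality reduction for the 1-D case: the covariance matrix of the mean-centred 3-D acceleration samples is computed, its dominant eigenvector is estimated by power iteration, and the data is projected onto that axis. Power iteration is used here only to keep the example self-contained; the thesis does not state which eigen-decomposition routine was actually used, and the class name is illustrative.

```java
// Reduce 3-D acceleration samples to 1-D with PCA: centre the data, build
// the 3x3 covariance matrix, estimate its dominant eigenvector by power
// iteration, and project each sample onto that principal axis.
public final class Pca1D {

    public static double[] project(double[][] data) {       // data: N x 3
        int n = data.length;
        double[] mean = new double[3];
        for (double[] row : data) {
            for (int d = 0; d < 3; d++) mean[d] += row[d] / n;
        }
        double[][] cov = new double[3][3];
        for (double[] row : data) {
            for (int i = 0; i < 3; i++) {
                for (int j = 0; j < 3; j++) {
                    cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]) / (n - 1);
                }
            }
        }
        // Power iteration for the dominant eigenvector of the covariance matrix.
        double[] v = {1.0, 1.0, 1.0};
        for (int iter = 0; iter < 100; iter++) {
            double[] w = new double[3];
            for (int i = 0; i < 3; i++) {
                for (int j = 0; j < 3; j++) w[i] += cov[i][j] * v[j];
            }
            double norm = Math.sqrt(w[0] * w[0] + w[1] * w[1] + w[2] * w[2]);
            for (int i = 0; i < 3; i++) v[i] = w[i] / norm;
        }
        // Project each centred sample onto the first principal component.
        double[] projected = new double[n];
        for (int k = 0; k < n; k++) {
            for (int d = 0; d < 3; d++) {
                projected[k] += (data[k][d] - mean[d]) * v[d];
            }
        }
        return projected;
    }
}
```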

5.2.3 Gesture recognition

The system based on the WiTilt v2.5 employs the Hidden Markov Model (HMM) to identify the beat pattern conducted by a user. The HMM is a classic approach for gesture recognition and is suitable for temporal data because it tracks the transition from the state at time t to the state at time t + 1 according to a set of probabilities. We established a simple HMM for 2-beat patterns, 3-beat patterns, and 4-beat patterns, respectively, as shown in Figure 5.6. The visible state sequence is generated from the sample data using two methods: vector quantization (VQ) [43] and a self-designed method.

Figure 5.6: Three HMMs (the hidden states and hidden state transitions of the 2-beat, 3-beat, and 4-beat HMMs)

VQ is widely used for data compression. It divides a data set into several parts and represents each data point in a given part by the centroid of that part; as a result, the data set is transformed into a series of representative centroids. If these centroids are assigned integer numbers, the data set becomes a sequence composed of integer numbers. The first diagram in Figure 5.7 shows the sequence generated by VQ. The self-designed method was developed based on observation of the signal: a data signal always has four states, namely the positive signal going up, the positive signal going down, the negative signal going up, and the negative signal going down. A signal can therefore be represented by these four states. The sequence generated by this method is shown in the second diagram in Figure 5.7.

The number of subsets in the VQ can be changed, whereas the self-designed method supports only four states. However, the sequence generated by the self-designed method is shorter than that of the VQ because several sample points are assigned to one state in the self-designed method. Since we could not decide which one is better, the sequences generated by both methods have been fed into the HMMs.
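A sketch of the self-designed four-state coding follows, under the assumption that "positive/negative" refers to the sign of the (smoothed) signal value and "going up/down" to the sign of its first difference, and that consecutive samples in the same state are collapsed so the resulting sequence is shorter than the raw data, as described above. The class name and state numbering are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative four-state observation coding: 0 = positive & rising,
// 1 = positive & falling, 2 = negative & rising, 3 = negative & falling.
// Runs of identical states are collapsed into a single symbol.
public final class FourStateCoder {

    public static int[] encode(float[] signal) {
        List<Integer> sequence = new ArrayList<>();
        for (int i = 1; i < signal.length; i++) {
            boolean positive = signal[i] >= 0.0f;
            boolean rising = signal[i] >= signal[i - 1];
            int state = (positive ? 0 : 2) + (rising ? 0 : 1);
            if (sequence.isEmpty() || sequence.get(sequence.size() - 1) != state) {
                sequence.add(state);             // collapse repeated states
            }
        }
        int[] result = new int[sequence.size()];
        for (int i = 0; i < result.length; i++) result[i] = sequence.get(i);
        return result;
    }
}
```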

The Baum-Welch algorithm was implemented for training the HMMs, and the Forward algorithm was used for testing. The test results of the HMMs based on VQ are not very good for 3-beat pattern gestures, and the HMMs based on the self-designed method do not work well for 2-beat patterns.
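For reference, a minimal sketch of the Forward algorithm follows: it computes the likelihood of an observation sequence under an HMM with initial probabilities pi, transition matrix a, and emission matrix b. No scaling is applied, so it is suitable only for short sequences; the actual implementation used in the thesis is not shown, and the class name is illustrative.

```java
// Forward algorithm: likelihood of an observation sequence under an HMM.
// pi[i]   : initial probability of hidden state i
// a[i][j] : transition probability from state i to state j
// b[i][k] : probability of emitting observation symbol k in state i
// obs[t]  : index of the observed symbol at time t
public final class ForwardAlgorithm {

    public static double likelihood(double[] pi, double[][] a, double[][] b, int[] obs) {
        int states = pi.length;
        double[] alpha = new double[states];
        for (int i = 0; i < states; i++) {
            alpha[i] = pi[i] * b[i][obs[0]];              // initialization
        }
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[states];
            for (int j = 0; j < states; j++) {
                double sum = 0.0;
                for (int i = 0; i < states; i++) {
                    sum += alpha[i] * a[i][j];            // induction step
                }
                next[j] = sum * b[j][obs[t]];
            }
            alpha = next;
        }
        double total = 0.0;
        for (double v : alpha) total += v;                // termination
        return total;
    }
}
```

In testing, the observation sequence produced by VQ or by the four-state coder would be scored against each trained HMM, and the model with the highest likelihood would indicate the recognized beat pattern.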

Figure 5.7: A visible state sequence generated using different methods ((a) a sequence generated using VQ; (b) a sequence generated using the self-designed method)

5.2.4 Comparison between WiTilt v2.5 and Wii Remote

According to the above description, the process using the Wii Remote for the right hand is clearly quite different from that using the WiTilt v2.5, because the two sensors capture different types of sample data for analysis and recognition.

The Wii Remote tracks the trajectory of an infrared light source, whereas the WiTilt v2.5 uses acceleration data. The sample data from the Wii Remote can be analyzed and identified using a relatively simple feature extraction process. Furthermore, a system based on the Wii Remote readily supports visual representations, because it is easy to draw on the screen in terms of coordinates; it is difficult to produce the same effect with a system based on the WiTilt v2.5. Therefore, the WiTilt v2.5 has not been utilized for the tracking of the right hand.

Chapter 6

CONCLUSION AND FUTURE RESEARCH

6.1 Conclusion

The purpose of this thesis is to develop a gestural interface for computer-based conducting systems. The interface aims to help conducting students learn and practice conducting gestures. The sensors used in the interface are inexpensive and easy to acquire, and a baton-like controller makes users feel comfortable while conducting. Users spend little time learning how to operate the system because real conducting gestures are employed. Visual and aural representations work together to help users practice a particular conducting gesture.

The gesture recognition implemented in this interface is based on fundamental rules of conducting gestures rather than on a complex model such as the HMM. It can give an identification result for each gesture in a gesture stream; it is thus a real-time gesture recognition system, and it can recognize or follow users' gestures quickly and accurately.

6.2 Future research

The primary goal of this research is to develop a gestural interface to be employed for music pedagogical purposes. Thus, user testing is the first task of future research. This includes administering a questionnaire related to this gestural interface and observing its use, collecting users' opinions as guidance for revising and improving the interface. More functions will also be designed and implemented for this system. For example, a graphical indicator (a different color) could be used for a beat that begins to speed up or slow down, and colors and sounds could reveal specific errors. Capturing and storing a list of coordinates would be valuable for re-rendering later, so that a student could check his or her movements after conducting.

Support of a preparatory beat would also be very helpful for practicing conducting.

In addition, the gesture analysis/recognition/following approach can be improved to involve more complex rules. Currently, a few fundamental rules are applied to make decisions based on beats and legato beat patterns; they cannot differentiate a legato beat pattern from a staccato beat pattern. In the future, by including more information, this gestural interface could support both legato and staccato beat patterns.

This gestural interface focuses on the beat patterns performed by the right hand and involves the left hand only for dynamics. In future systems, more gestures will be considered, and more sensors may be required to collect the corresponding information for musical parameters.

Glossary

Beat patterns: consist of motions of the right hand and visually show the musical structure of a piece of music. Legato and staccato are two of the primary types.

Computer-based Conducting System: allows a user to conduct a piece of music using a digital system. Such systems retrieve the tempo (and sometimes dynamics as well) and use it to control the playback of a piece of music. Most of them are manipulated using a gestural interface.

Conducting: leading a musical performance with conducting gestures, such as hand gestures and eye contact.

Conducting window: a chest-high, virtual rectangular area. It contains movements in four directions (up, down, left, and right).

Downbeat: the first beat of all conducting gestures, indicated by a strong downward vertical motion.

Dynamics: control the loudness and the softness of a piece of music.

Forte: shortened as f. It represents a level of dynamics and means loud or strong.

Legato: contains continuous and curved motions. It can be neutral, with plain motions, or expressive, with curved motions.

Piano: shortened as p. It represents a level of dynamics and means soft.

Staccato: has a momentary stop at each count and contains relatively straight motions. The pattern with straight motions is called light-staccato; full-staccato employs slightly curved motions.

Tempo: reveals the speed and mood of a piece of music.

Bibliography

[1] Graziano Bertini, Paolo Carosi, Light baton: A system for conducting computer music performance, In Proceedings of the International Computer Music Conference, pages 73-76. ICMA, 1992.

[2] Frederic Bevilacqua, Fabrice Guedy, Norbert Schnell, Emmanuel Flety and Nicolas Leroy, Wireless sensor interface and gesture-follower for music pedagogy, In Proceedings of the 7th International Conference on New Interfaces for Musical Expression, pages 124-129, 2007.

[3] Bennett Brecht and Guy E. Garnett, Conductor Follower, In Proceedings of the International Computer Music Conference, pages 185-186. ICMA, 1995.

[4] W. Buxton, W. Reeves, G. Fedorknow, K. C. Smith and R. Baecker, A microcomputer-based conducting system, Computer Music Journal, Vol. 4, No. 1, pages 8-21, 1980.

[5] Jan O. Borchers, Wolfgang Samminger, Max Muhlhauser, Engineering a realistic real-time conducting system for the audio/video rendering of a real orchestra, In Proceedings of the 4th International Symposium on Multimedia Software Engineering, pages 352-362.

[6] Sen-Ching S. Cheung and Chandrika Kamath, Robust techniques for background subtraction in urban traffic video, Visual Communications and Image Processing 2004, Proceedings of the SPIE, Vol. 5308, pages 881-892, 2004.

[7] Richard O. Duda, Peter E. Hart and David G. Stork, Pattern classification (second edition), John Wiley & Sons, Inc., 2001.

[8] Guy E. Garnett, Mangesh Jonnalagadda, Ivan Elezovic, Timothy Johnson, Kevin Small, Technological advances for conducting a virtual ensemble, In ICMC Proceedings 2001, pages 167-169. ICMA, 2001.

[9] Guy E. Garnett, Fernando Malvar-Ruiz and Fred Stoltzfus, Virtual conducting practice environment, In Proceedings of the International Computer Music Conference, pages 371-374. ICMA, 1999.

[10] Ingo Grull, conga: a conducting gesture analysis framework, Master's thesis, University of Ulm, 2005.

[11] Suguru Goto and Takahiko Suzuki, The case study of application of advanced gesture interface and mapping interface, In Proceedings of the 2004 Conference on New Interfaces for Musical Expression (NIME04), pages 207-208, 2004.

[12] Stephen Haflich, Markus Burns, Following a conductor: the engineering of an input device, In Proceedings of the International Computer Music Conference. ICMA, 1983.

[13] Hideyuki Morita, Shuji Hashimoto and Sadamu Ohteru, A computer music system that follows a human conductor, Computer, Vol. 24, No. 7, pages 44-53, 1991.

[14] Horace H S Ip, Belton Kwong and Ken C K Law, Bodymusic: a novel framework design for body-driven music composition, In ACE'05: Proceedings of the 2005 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, pages 342-345, New York, NY, USA, 2005. ACM Press.

[15] Tommi Ilmonen and Tapio Takala, Conductor following with artificial neural networks, In Proceedings of the International Computer Music Conference, pages 367-370. ICMA, 1999.

[16] Hiroshi Ishii and Brygg Ullmer, Tangible bits: towards seamless interfaces between people, bits and atoms, In CHI'97: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 234-241, New York, NY, USA, 1997. ACM Press.

[17] Lisette Jansen, Betsy van Dijk and Jose Retra, Musical multimodal child computer interaction, In IDC'06: Proceedings of the 2006 Conference on Interaction Design and Children, pages 163-164, New York, NY, USA, 2006. ACM Press.

[18] Paul Kolesnik, Conducting gesture recognition, analysis and performance system, Master's thesis, McGill University, 2004.

[19] David Keane, Peter Gross, The MIDI Baton, In Proceedings of the International Computer Music Conference, pages 151-154. ICMA, 1989.

[20] David Keane, Kevin Wood, The MIDI Baton III, In Proceedings of the International Computer Music Conference, pages 541-544. ICMA, 1991.

[21] Joseph A. Labuta, Basic conducting techniques, fourth edition, Prentice Hall, 2000.

[22] Michael A. Lee, Guy Garnett and David Wessel, An adaptive conductor follower, In Proceedings of the International Computer Music Conference, pages 454-455. ICMA, 1992.

[23] Eric Lee, Henning Kiel, Saskia Dedenbach, Ingo Grull, Thorsten Karrer, Marius Wolf, Jan Borchers, iSymphony: an adaptive interactive orchestral conducting system for digital audio and video streams, In CHI'06: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 259-262, 2006.

[24] Eric Lee, Teresa Marrin Nakra and Jan Borchers, You're the conductor: a realistic interactive conducting system for children, In International Conference on New Interfaces for Musical Expression, pages 68-73, 2004.

[25] Declan Murphy, Tue Haste Andersen and Kristoffer Jensen, Conducting audio files via computer vision, In Proceedings of the 2003 International Gesture Workshop, 2003.

[26] Teresa Anne Marrin, Toward an understanding of musical gesture: mapping expressive intention with the digital baton, Master's thesis, Massachusetts Institute of Technology, 1996.

[27] Brock McElheran, Conducting technique for beginners and professionals, revised edition, Oxford University Press, 1989.

[28] Max V. Mathews, The radio baton and the conductor program, or: pitch, the most important and least expressive part of music, Computer Music Journal, Vol. 15, No. 4, pages 37-46.

[29] H. Morita, S. Ohteru, S. Hashimoto, Computer music system which follows a human conductor, In Proceedings of the International Computer Music Conference, pages 207-210. ICMA, 1989.

[30] Teresa Marrin and Rosalind Picard, The conductor's jacket: a device for recording expressive musical gestures, In Proceedings of the International Computer Music Conference, pages 215-219. ICMA, 1998.

[31] Nintendo (2006), Wii Music Orchestra, http://www.gamespot.com/wii/puzzle/wiimusicorchestra/index.html, Retrieved May 18, 2008.

[32] Vladimir I. Pavlovic, Rajeev Sharma, Thomas S. Huang, Visual interpretation of hand gestures for Human-Computer Interaction: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, pages 677-695, 1997.

[33] Mary Beth Rosson and John M. Carroll, Usability engineering: scenario-based development of human-computer interaction, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.

[34] Max Rudolf, The grammar of conducting: a comprehensive guide to baton technique and interpretation (3rd edition), Schirmer Books, 1994.

[35] Jakob Segen, Aditi Majumder, Joshua Gluckman, Virtual dance and music conducted by a human conductor, Eurographics 2000.

[36] Lindsay I Smith (2002), A tutorial on principal components analysis, http://csnet.otago.ac.nz/cosc453/studenttutorials/principalcomponents.pdf, Retrieved May 18, 2008.

[37] David Sonnenschein, Technology as a tool for attaining music cognition, In IPCC/SIGDOC'00: Proceedings of the IEEE Professional Communication Society International Professional Communication Conference and Proceedings of the 18th Annual ACM International Conference on Computer Documentation, pages 361-366, Piscataway, NJ, USA, 2000. IEEE Educational Activities Department.

[38] Stephen W. Smoliar, John A. Waterworth and Peter R. Kellock, pianoFORTE: a system for piano education beyond notation literacy, In MULTIMEDIA'95: Proceedings of the 3rd ACM International Conference on Multimedia, pages 457-465, New York, NY, USA, 1995. ACM Press.

[39] Forrest Tobey and Ichiro Fujinaga, Extraction of conducting gestures in 3D space, In Proceedings of the International Computer Music Conference, pages 305-307. ICMA, 1996.

[40] Forrest Tobey, The ensemble member and the conducted computer, In Proceedings of the International Computer Music Conference, pages 529-530. ICMA, 1995.

[41] B. Ullmer and H. Ishii, Emerging frameworks for tangible user interfaces, In Human Computer Interaction in the New Millenium, Addison-Wesley, pages 579-601, 2001.

[42] Satoshi Usa and Yasunori Mochida, A multi-modal conducting simulator, In Proceedings of the International Computer Music Conference, pages 25-32. ICMA, 1998.

[43] Vector Quantization, http://www.data-compression.com/vq.shtml, Retrieved June 14, 2008.

[44] Tsung-Hsien Wang, Ellen Yi-Luen Do, Mark D Gross (2006), Tangible Notes, http://code.arc.cmu.edu/lab/html/projectl07.html, Retrieved May 18, 2008.

[45] Jun Yin, Ye Wang and David Hsu, Digital tutor: an integrated system for beginning violin learners, In MULTIMEDIA'05: Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 976-985, New York, NY, USA, 2005. ACM Press.