
ICUB TRIES TO PLAY AN ELECTRONIC KEYBOARD

BY

PEIXIN CHANG

THESIS

Submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Electrical and Computer Engineering in the Undergraduate College of the University of Illinois at Urbana-Champaign, 2017

Urbana, Illinois

Adviser:

Professor Stephen E. Levinson

Abstract

Can a robot understand natural language? This thesis discusses the first stage of a project that tries to answer this question through the approach of musical performance. Musical performance is one of the most demanding cognitive challenges for the human mind and is closely related to language acquisition. The first stage of the project aims at exploring the potential of an iCub humanoid robot, the platform used in this project, to play an electronic keyboard. The robot is expected to analyze the electronic keyboard with its visual ability, listen to a sequence of musical notes a human plays on the keyboard, and press the same musical notes in the same order on the keyboard. The three modules built in this project are introduced and discussed, and the next stage of the project is outlined.

Key words: Humanoid robot, music.


Acknowledgments

I would like to thank my adviser, Professor Stephen Levinson, who has supported me throughout my research and given me a lot of constructive advice. I would like to thank Felix Wang, a Ph.D. student, who has always been supportive of me; he helped me schedule every experiment on the robot, gave me many useful and insightful suggestions, and was enthusiastic about making my project work. I would like to thank Luke Wendt, who gave me advice on the Motor Control Module. I would also like to thank Yuchen He, who taught me about YARP and icub-main, the software for controlling the robot, and gave me advice on the Computer Vision Module.


Contents

1. Introduction
   1.1 Motivation and Aim
   1.2 Basic Music Theory and Notation
   1.3 Platform
   1.4 Keyboard Playing Procedure
2. The Computer Vision Module
   2.1 Step 1: Pre-processing
   2.2 Step 2: Contours
   2.3 Step 3: Post-processing
3. The Pitch Detector Module
   3.1 Onset Detection
   3.2 Pitch Detection
4. The Motor Control Module
   4.1 Arm and Hand Movement
   4.2 Cartesian Control
5. Performance and Discussion
6. Conclusion
References


1. Introduction

1.1 Motivation and Aim

Can a robot understand natural language? We believe that if it is going to acquire human language, it has to build a mental model of the world by physically interacting with the environment around it. This interaction enriches the robot's sensorimotor experiences and helps the robot develop an associative memory, the central component of its cognition. Together with its other cognitive abilities, the robot can then acquire linguistic competence [1]. The members of the Language Acquisition and Robotics Group at UIUC have successfully made the iCub, a humanoid robot, finish tasks such as ball balancing [2] and solving 3D mazes [3]. In both cases, the robot learned fine-motor skills and gained sensorimotor experience. However, no one in the group has tried a music-related project, which in fact has potential value. Music performance is one of the most demanding cognitive challenges for the human mind; it requires basic motor control functions such as timing, sequencing, and spatial organization of movement. Fulfilling the requirements of music performance leads to an extraordinary and unique sensory-motor interplay that ordinary actions, such as reaching and grasping, do not capture [4]. Music is also entangled with the language acquisition process in early life. Brandt et al. [5] describe language as a special type of music and reject the viewpoint that music is a by-product of cognitive and behavioral "technology" adapted for language.

A project with the aim of teaching the iCub robot to play an electronic keyboard was proposed by the author of this thesis. The robot is expected to analyze the electronic keyboard with its visual ability, listen to a sequence of musical notes a human plays on the keyboard, and perform the same musical notes in the same order on the keyboard. This project provides the iCub with a unique opportunity to interact with the environment in a new way. It is expected that music learning can reveal some clues about the question of robot language acquisition and perhaps provide an analogy to language acquisition. It is also a challenging project, which is why it is divided into several stages. In the first stage, the stage this thesis is about, the potential of the robot to play an electronic keyboard is explored. It is also a good opportunity for the human advisor, the author of this thesis, to become familiar with his student, iCub. There is no need to apply any learning techniques in this stage of the project, so that the limits of hard-coded methods can be clearly seen. In the next stage, however, learning algorithms will be applied to the system built in the first stage. In the second stage, iCub will begin learning music, and the skills it learns will be incorporated into an associative memory.

1.2 Basic Music Theory and Notation

This section informally introduces the basic music theory and music-related notation used in this thesis.

Figure 1: An example keyboard labelled with scientific pitch notation and integer numbers on the pitch axis. C4 is used as the origin.

All the keys on an electronic keyboard are divided into octaves. The dashed rectangle in Figure 1 identifies one of these octaves. An octave consists of 7 white keys and 5 black keys. The next 12 keys to the right of the dashed rectangle form another octave. Each white key in an octave is denoted by a capital letter from A, B, C, D, E, F, G with an integer subscript. C4 means the first key of the octave numbered 4. C4 is called middle C and is always associated with a frequency of about 261.6 Hz. The sharp symbol # means that a black key has a pitch, or frequency, a half step higher than the white key to its left and a half step lower than the white key to its right. In general, a key always has a higher pitch than all the keys to its left and a lower pitch than all the keys to its right. Each key or pitch is modeled as a point on the integer number line with middle C as the origin. The key number increases by 1 for every transposition up by a half step. F3, for example, has key number -7 because it is 7 keys to the left of middle C. The speed of playing musical notes is called tempo, measured in beats per minute (bpm).


1.3 Platform

The iCub robot was chosen as the platform for this project. It is a child-like humanoid robot designed for research in embodied cognition. It is expected to interact with humans and the environment and to perform tasks useful for learning, as a child does [6]. The iCub has 53 degrees of freedom (DOF) allocated to its head, torso, right and left arms, and right and left legs. Each of its arms, including the hand, has 16 controlled DOF, and its head has 6 controlled DOF [7]. The iCub is equipped with 2 PointGrey Dragonfly cameras. The images taken from these cameras have a resolution of 320x240 in RGB color mode. It also has two microphones for ears. All the software for the iCub is written in the C++ programming language, and an embedded PC104 card is used to handle all sensory and motor-state information [6]. The researchers in the Language Acquisition and Robotics Group at UIUC call their iCub Bert. Bert is short for Bertrand Russell, a great mind who contributed to the early days of cybernetics. Figure 2 shows Bert.

Figure 2: A photo of Bert.

The electronic keyboard Bert is going to play is a CTK2400 61-key portable keyboard.

1.4 Keyboard Playing Procedure

This project has three modules: the Computer Vision Module, the Pitch Detector Module, and the Motor Control Module. The Computer Vision Module processes visual input from Bert's eyes, while the Pitch Detector Module processes auditory input. The Motor Control Module sends commands to Bert's joints so that Bert can move his arms and hands to the desired places. All three modules interact with each other and with the environment around Bert. The process by which Bert plays the sequence of musical notes specified by the user is illustrated in the flowchart in Figure 3. Only one of the three modules works in each step; the other two modules are either waiting for a reply from the working module or for user input. The name of the working module is indicated before the colon in each step.

1. Motor Control: Go to the initial pose.
2. Motor Control: Achieve the preparation pose and save an image of the keyboard.
3. Motor Control: Notify the Computer Vision Module that the image has been saved.
4. Computer Vision: Analyze the image for the 2D pixel coordinates of all keys.
5. Computer Vision: Query the Motor Control Module for 3D locations based on the 2D coordinates.
6. Motor Control: Reply to the Computer Vision Module with the calculated 3D locations.
7. Pitch Detector: Analyze the musical notes a user played on the keyboard.
8. Pitch Detector: Inform the Computer Vision Module of the keys played by the user.
9. Computer Vision: Send the Motor Control Module the 3D locations of the keys played, in order.
10. Motor Control: Reach the right arm to the location of the first key the user played.
11. Motor Control: Perform a key press.
12. Motor Control: Move the right arm back to the preparation pose.
13. Motor Control: Repeat the last 3 steps for the rest of the musical notes played by the user.

Figure 3: The process flowchart for playing the electronic keyboard.

In Chapter 2, the Computer Vision Module will be introduced. That chapter will show the tasks that the Computer Vision Module needs to finish and will reveal the method used to complete these tasks. Chapter 3 will introduce the Pitch Detector Module and the algorithm used for pitch detection. The Motor Control Module will be presented in Chapter 4, which will state the expectations for this module and explain how Bert moves his arm to the keys he should play. The methods used in this project will be discussed in Chapter 5. Finally, Chapter 6 will conclude this thesis.


2. The Computer Vision Module

The task of this module is to help Bert obtain the 3D location of each electronic keyboard key in his sight. The method used in this module should be able to finish the following tasks autonomously or with the help of a human:

(a) Filter out irrelevant background objects in sight and keep only the electronic keyboard keys

(b) Successfully identify each key that Bert can see on the keyboard

(c) Distinguish between white keys and black keys

(d) Figure out which keys are the musical C notes (C3, C4, C5, etc.), following teaching conventions

(e) Approximate the 3D locations of those keys based on what Bert sees

Bert is expected to be a skilled keyboard player, so no additional markers such as colored tape, symbols, or characters are attached to the keys. Figure 4 shows an image taken by Bert's right-eye camera. This image has been converted to grayscale.

Figure 4: A scene that Bert may see when he is going to play the electronic keyboard.

This chapter presents the method step by step to show how the tasks mentioned above are achieved. The main idea is to find the contour of each key. The input images must be pre-processed to obtain plausible contours, and the resulting contours must be post-processed to achieve all the tasks mentioned earlier.


2.1 Step 1: Pre-processing

As the simplest segmentation process, gray-level thresholding is the transformation of an input image into a segmented binary image [8]. A threshold value ranges from 0 to 255. For each pixel in the image, if the intensity of that pixel is greater than the threshold value, the pixel becomes black; otherwise, it becomes white. Thresholding can be effective if the keyboard is evenly lit. The task of choosing a reasonable threshold value is left to the user.

Figure 5: (a) A successful thresholding with a threshold value of 218. (b) An unsuccessful thresholding with a threshold value of 170.

The left image is an example of successful thresholding because each key is clearly separated by black lines and a complete octave is kept. Preserving a complete octave is important for the later steps. Users need to specify two more threshold values to draw two white lines and one black line, based on the figure obtained after thresholding (see Figure 5). The two white lines isolate the black-key region from the rest of the black pixels belonging to the background, so that the contour generated later for each black key will not encompass those unwanted black pixels. For convenience and robustness, the percentage of white pixels is calculated for each row of the image; this prevents the lines from moving freely in the vertical direction and helps them snap to the desired place. Similarly, the black line clearly bounds the white-key region so that no unwanted white pixels are included.
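To make this pre-processing step concrete, the following is a minimal sketch of gray-level thresholding and the per-row white-pixel percentage, written with OpenCV; the thesis does not give its implementation, and the function names and parameters here are illustrative assumptions only.

// Illustrative pre-processing sketch (not the exact code used on Bert):
// gray-level thresholding plus the per-row white-pixel fraction that is used
// to decide where the two white separator lines and the black line go.
#include <opencv2/opencv.hpp>
#include <vector>

// Pixels brighter than grayThresh become black and the rest white,
// matching the convention described in the text.
cv::Mat thresholdKeyboard(const cv::Mat &gray, double grayThresh)
{
    cv::Mat binary;
    cv::threshold(gray, binary, grayThresh, 255, cv::THRESH_BINARY_INV);
    return binary;
}

// Fraction of white pixels in each row of the binary image.  Rows dominated
// by white keys have a high fraction, which keeps the separator lines from
// drifting vertically and attracts them to the desired place.
std::vector<double> rowWhiteFraction(const cv::Mat &binary)
{
    std::vector<double> frac(binary.rows);
    for (int r = 0; r < binary.rows; ++r)
        frac[r] = cv::countNonZero(binary.row(r)) /
                  static_cast<double>(binary.cols);
    return frac;
}

// The two white lines bounding the black keys and the black line 5 pixels
// below the lower white line can then be drawn with, for example,
//   cv::line(binary, cv::Point(0, row), cv::Point(binary.cols - 1, row),
//            cv::Scalar(255), 1);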


Figure 6: Based on the binary image obtained earlier, one white line at the top of the black keys, one white line just below the bottom of the black keys, and a black line 5 pixels below the lower white line are drawn to separate the black-key region from the white-key region.

2.2 Step 2: Contours

Contours for each black key and white key can be generated by following the border following algorithm described in [9]. The algorithm does a TV raster scan of the input binary image and tries to find a starting point that satisfies certain conditions for border following. From that starting point, it marks the pixels on the border between a connected component of pixels with value 0 and a connected component of pixels with value 1. After the entire border is marked, the raster scan resumes, and the algorithm tries to find the next starting point. The algorithm stops when the raster scan reaches the lower right corner of the image.

Since the connected components were well grouped in the pre-processing step, the generated contours encompass each key. After the contours based on Figure 6 are generated, many unwanted contours exist. They come from the residues of keys that have incomplete shapes due to thresholding. By restricting the minimum contour area, these noisy contours can easily be erased: only the contours with areas larger than the restriction are kept. There is also a restriction on the maximum contour area, which guarantees erasing contours that encompass parts of the background. As shown in Figure 7, there then exists a one-to-one correspondence between contours and keys, and each contour can be processed individually.
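As an illustration of this contour step, here is a minimal sketch assuming OpenCV's cv::findContours, which implements the border following algorithm of [9]; the area bounds are placeholder parameters, not the values tuned for Bert's camera.

// Contour extraction and area filtering (illustrative sketch).
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<std::vector<cv::Point>> keyContours(const cv::Mat &binary,
                                                double minArea, double maxArea)
{
    std::vector<std::vector<cv::Point>> contours, kept;
    // cv::findContours implements the border following algorithm of Suzuki [9].
    cv::findContours(binary.clone(), contours, cv::RETR_LIST,
                     cv::CHAIN_APPROX_SIMPLE);

    // Keep only contours whose area is plausible for a key; this removes both
    // small residues left by thresholding and large background regions.
    for (const auto &c : contours) {
        double area = cv::contourArea(c);
        if (area >= minArea && area <= maxArea)
            kept.push_back(c);
    }
    return kept;
}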


Figure 7: After the generation of contours for each key, unwanted contours can be erased if their areas do not satisfy the restrictions set by the parameters. The parameters chosen in (a) keep all the contours. In (b), unwanted white key contours are erased by choosing a proper minimum area restriction. (c) shows a plausible processing result: each red contour corresponds to a white key, and each cyan contour corresponds to a black key.

2.3 Step 3: Post-processing

Bert needs to know at least where he should press within these contours. The mass center of each contour can be calculated with the image moments described in [10]. The result is shown in Figure 8.

Figure 8: Center of mass for each contour marked by purple dots.

The purple dots in Figure 8 represent the desired positions where Bert needs to press within the contours. Their 2D pixel coordinates are saved for later use. The next step is to identify the musical C notes among these dots. The method is to mark every point inside a black-key contour as 0 and every point inside a white-key contour as 1, sort all the points by their distance from the left edge of the image, and number them according to the sorting result. The leftmost point thus has number 1 after the sort. Then, for an octave consisting of 12 keys, the binary sequence must be "101011010101", and the leftmost "1" in the sequence must be a musical C note.


Figure 9: (a) White keys are marked as 1 and black keys are marked as 0. The binary sequence for an octave must be "101011010101". (b) Musical C notes are successfully found and marked by white points.

The C note in the incomplete octave in Figure 9 (b) can also be easily detected because it is 12 contours away from the first white point. In this project, it is assumed that the first marked white point represents middle C. The method therefore identifies, among all contours, the contour that corresponds to middle C.
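The labeling just described can be sketched as follows, again assuming OpenCV; the structure and function names are illustrative and not taken from the module's actual code.

// Post-processing sketch (illustrative): centroids via image moments, sorting
// by horizontal position, and locating the first C note from the key pattern.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <string>
#include <vector>

struct KeyPoint2D {
    cv::Point2f center;   // mass center of the key contour
    bool isWhite;         // white key = 1, black key = 0 in the binary pattern
};

// Mass center of a contour from its spatial moments [10].
cv::Point2f massCenter(const std::vector<cv::Point> &contour)
{
    cv::Moments m = cv::moments(contour);
    return { static_cast<float>(m.m10 / m.m00),
             static_cast<float>(m.m01 / m.m00) };
}

// Sort keys left to right and return the index of the first C note, i.e. the
// leftmost '1' of the pattern "101011010101" formed by one complete octave.
int firstCNoteIndex(std::vector<KeyPoint2D> &keys)
{
    std::sort(keys.begin(), keys.end(),
              [](const KeyPoint2D &a, const KeyPoint2D &b) {
                  return a.center.x < b.center.x;
              });

    std::string pattern;
    for (const auto &k : keys)
        pattern.push_back(k.isWhite ? '1' : '0');

    std::size_t pos = pattern.find("101011010101");
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}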

The final task is to approximate the 3D locations $(X, Y, Z)$ in the robot root frame of the mass centers of those contours, given their 2D pixel coordinates $(u, v)$. Each 3D location is associated with a keyboard key position that Bert is going to press. A pixel coordinate $(u, v)$ corresponds to a 3D point $(x', y', z')$ in the camera frame. The relationship between the 3D point $(x', y', z')$ in the camera frame and the 2D pixel coordinates $(u, v)$ can be expressed as

$$u = f_x \frac{x'}{z'} + o_x \qquad (2.1)$$

$$v = f_y \frac{y'}{z'} + o_y \qquad (2.2)$$

Equation (2.1) and Equation (2.2) can also be expressed in matrix form,

$$z' \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_x & 0 & o_x & 0 \\ 0 & f_y & o_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x' \\ y' \\ z' \\ 1 \end{pmatrix} \qquad (2.3)$$

where $f_x$, $f_y$, $o_x$, and $o_y$ are camera intrinsic parameters, which can be obtained from the camera calibration files for the iCub robot. The transformation between the camera frame and the robot root frame is described by Equation (2.4),

$$\mathbf{H}^{camera}_{root} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} x' \\ y' \\ z' \\ 1 \end{pmatrix} \qquad (2.4)$$

where $\mathbf{H}^{camera}_{root}$ is the homogeneous transformation between the robot root frame and the camera frame. With the help of the iCub Gaze Interface [11], which combines the information of the torso joint angles, head joint angles, and other internal robot parameters, $\mathbf{H}^{camera}_{root}$ can be calculated. Figure 10 shows these two coordinate systems, as provided by [7].


Figure 10: (a) Robot camera coordinates. The x-axis is in red. The y-axis is in green. The z-axis is in blue. (b) Robot root reference frame.

The electronic keyboard that Bert is going to play is usually higher than his root reference frame origin, but Equation (2.4) does not consider this fact, and thus the $X$, $Y$, and $Z$ values obtained from Equation (2.4) alone are not 3D coordinates located on the keyboard. This problem can be easily solved if the surface that contains all the white keys on the electronic keyboard is modeled as a plane $ax + by + cz + d = 0$, where $a = b = 0$, $c = 1$, and $d$ is the height difference between the root reference frame origin and the plane where the white keys are located. The plane function for the surface containing all the black keys has a larger height difference $d$. The 3D location of each key projected onto these two planes can be obtained by applying simple geometric properties of triangles and vector addition and subtraction, provided that the value of $d$ is measured and supplied by the user. At this point, all the tasks mentioned at the beginning of this chapter are done.
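The thesis does not spell out this computation, but a minimal sketch of one way to realize the projection is given below; the helper name, the use of OpenCV matrix types, and the assumption that the homogeneous transform maps camera-frame points into the root frame (possibly the inverse of the $\mathbf{H}$ in Equation (2.4), depending on convention) are all illustrative.

// Illustrative back-projection sketch: a pixel (u, v) defines a ray in the
// camera frame; the ray is expressed in the robot root frame and intersected
// with the horizontal plane z = planeZ that models the white-key (or
// black-key) surface.  H_root_camera is assumed to map camera-frame points
// into the root frame.
#include <opencv2/core.hpp>

cv::Vec3d keyLocationOnPlane(double u, double v,
                             double fx, double fy, double ox, double oy,
                             const cv::Matx44d &H_root_camera,
                             double planeZ)
{
    // Inverting Equations (2.1)-(2.2): the viewing ray in the camera frame.
    cv::Vec4d rayCam((u - ox) / fx, (v - oy) / fy, 1.0, 0.0);   // direction
    cv::Vec4d originCam(0.0, 0.0, 0.0, 1.0);                    // camera center

    cv::Vec4d rayRoot = H_root_camera * rayCam;
    cv::Vec4d originRoot = H_root_camera * originCam;

    // Intersect the ray with the plane z = planeZ (a = b = 0, c = 1 in the text).
    double t = (planeZ - originRoot[2]) / rayRoot[2];
    return { originRoot[0] + t * rayRoot[0],
             originRoot[1] + t * rayRoot[1],
             planeZ };
}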

Middle C is usually used as a reference in music. It is also the origin of the pitch axis shown in Figure 1. Identifying which of these mass centers corresponds to middle C is a necessary step, because musical meaning is embedded into each mass center point through this step. This step makes the pitch relationships between the points well defined; otherwise, each point is only individually associated with a 3D location without any other musical information. Besides, this step simplifies the communication between the Computer Vision Module and the Pitch Detector Module. For example, if a musical note C5 is detected, the pitch detector only needs to report a 12, because C5 is 12 keys to the right of middle C, and the Computer Vision Module can efficiently find out which mass center corresponds to the musical note C5, collect its 3D Cartesian coordinates $(U_{C5}, V_{C5}, W_{C5})$, and send the information to the Motor Control Module.


3. The Pitch Detector Module

In this stage of the project, the iCub was not expected to autonomously learn the auditory-motor relation. Instead, a pitch detector was used. Given a recording, in Microsoft WAVE PCM sound file format, of several musical notes played on an electronic keyboard, this module should inform the Computer Vision Module of which musical notes were played by the user and in what order. The input to this module goes through 3 processing steps. In the first step, the module decodes the WAVE format and extracts the sample data. In the second step, the module analyzes the sample data and decides when, or at which sample, a new note has been played. In the final step, a pitch detection algorithm is applied to the samples between every two neighboring onset events to compute the frequency of the corresponding musical note.

For testing purposes, a sequence of musical notes was played on the electronic keyboard and the sound was recorded as a WAVE sound file. The sampling frequency is 44,100 Hz. The musical notes are C4, D4, E4, F4, and G4. Because the microphone installed inside the robot is seriously affected by CPU fan noise, an external microphone was used for recording.

Figure 11: A time-domain input that consists of five musical notes with frequencies 261.6 Hz (C4), 293.66 Hz (D4), 329.63 Hz (E4), 349.23 Hz (F4), and 392.00 Hz (G4), respectively, from left to right along the timeline. The onset events happen at about the 44,070th, 85,250th, 126,100th, 164,100th, and 205,200th samples. These estimated values were manually selected and are usually greater than the true values. The tempo is about 69 bpm.


The discussion in this chapter will be focused on onset detection, the second step, and pitch detection, the third step.

3.1 Onset Detection

The aim of onset detection is to find out when each musical note starts [12]. There are many approaches to onset detection, such as the spectral difference method, the phase deviation method, and the negative log-probability method. The procedures used in most of these methods are similar. The original signal is pre-processed first. Then, a detection function is obtained whose peaks should coincide with the samples at which musical notes have been played. Finally, the locations of the onset events are found by applying a peak-picking algorithm [13]. Based on Dixon's experimental results [12], spectral flux turns out to be the method that best satisfies the needs of this project.

A short-time Fourier transform (STFT) is applied first to the time-domain input signal. The discrete-time STFT can be expressed as

$$X_k(n) = \sum_{m=-\frac{N}{2}}^{\frac{N}{2}-1} w[m]\, x[hn + m]\, e^{-\frac{2 j \pi m k}{N}} \qquad (3.1)$$

where $n$ is the frame index, $k$ is the frequency bin index, $w[m]$ is the window applied to the input, $N$ is the window length, and $h$ is the hop size signifying how much $w[m]$ is shifted for a given $n$. Spectral flux calculates the difference in magnitude between each frequency bin in the current frame and the corresponding frequency bin in the previous frame, emphasizing only the frequencies where the energy increases, and adds the differences together over all frequency bins in a frame. It is given by [12]

$$SF(n) = \sum_{k=-\frac{N}{2}}^{\frac{N}{2}-1} H\big(|X(n, k)| - |X(n-1, k)|\big) \qquad (3.2)$$


where $H(x) = \frac{x + |x|}{2}$. $H(x)$ makes sure that frequencies where there is a decrease in energy have no effect. Bello et al. [13] suggest the spectral difference method, which is similar to Equation (3.2) and is given by

$$SD(n) = \sum_{k=-\frac{N}{2}}^{\frac{N}{2}-1} \Big[ H\big(|X(n, k)| - |X(n-1, k)|\big) \Big]^2 \qquad (3.3)$$

The square in Equation (3.3) magnifies the magnitude difference and is helpful for the peak-picking process, so it was chosen to implement the onset detection. The result of processing the input mentioned at the beginning of this chapter is shown in Figure 12. A Hamming window of length $N = 2048$ is used, and the hop size $h$ is 882.
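A minimal sketch of the detection function in Equation (3.3) is given below, written as plain C++ over precomputed STFT magnitude frames; the thesis does not show its implementation, and computing the Hamming-windowed STFT itself (N = 2048, h = 882) is assumed to be done elsewhere.

// Spectral difference detection function, Equation (3.3) (illustrative sketch).
// magnitudes[n][k] holds |X(n, k)| for frame n and frequency bin k, computed
// from a Hamming-windowed STFT (N = 2048, hop h = 882) elsewhere.
#include <vector>

std::vector<double> spectralDifference(
        const std::vector<std::vector<double>> &magnitudes)
{
    std::vector<double> sd(magnitudes.size(), 0.0);
    for (std::size_t n = 1; n < magnitudes.size(); ++n) {
        double sum = 0.0;
        for (std::size_t k = 0; k < magnitudes[n].size(); ++k) {
            // H(x) = (x + |x|) / 2 keeps only the energy increases.
            double diff = magnitudes[n][k] - magnitudes[n - 1][k];
            double h = diff > 0.0 ? diff : 0.0;
            sum += h * h;                      // square as in Equation (3.3)
        }
        sd[n] = sum;
    }
    return sd;
}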

Figure 12: Detection function obtained by following Equation (3.1) and Equation (3.3) for the input signal mentioned at the beginning of the chapter. Peaks are manually selected. The horizontal dashed line represents the threshold value.

Because other learning programs will probably replace the pitch detector and limited time was available for this project, the peak-picking algorithm suggested in [12] was not followed. Instead, a simpler threshold method was applied. The method accepts as candidates any data points around the real peak whose magnitude is higher than the threshold and always selects the first candidate as the peak, ignoring the next several frames after it. It was found that the onset detection gave acceptable results when the threshold value was set to 20,000, as long as the musical notes were played at an adagio tempo (around 70 bpm) or slower. The sample $s$ at which an onset event happens can be computed approximately as

$$s = h \cdot F \qquad (3.4)$$

where $h$ is the hop size and $F$ is the frame number. The onset events happen approximately at the 44,100th, 84,672nd, 125,244th, 164,052nd, and 204,642nd samples reported by the program. Although these results may differ from the true onset samples, the estimates are enough for dividing the entire recording into five pieces, each of which includes the samples of only one musical note.
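The simple threshold method can be sketched as follows; the threshold and the number of frames skipped after a detection are illustrative parameters, not necessarily those used in the module.

// Simple threshold-based peak picking (illustrative sketch): take the first
// frame whose detection-function value exceeds the threshold, skip the next
// few frames so the same note is not detected twice, and convert the frame
// number to a sample index with Equation (3.4), s = h * F.
#include <vector>

std::vector<std::size_t> onsetSamples(const std::vector<double> &sd,
                                      double threshold,
                                      std::size_t hop,         // h = 882
                                      std::size_t skipFrames)  // frames ignored after a hit
{
    std::vector<std::size_t> onsets;
    for (std::size_t f = 0; f < sd.size(); ++f) {
        if (sd[f] > threshold) {
            onsets.push_back(hop * f);   // Equation (3.4)
            f += skipFrames;             // ignore neighbors of the same peak
        }
    }
    return onsets;
}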

3.2 Pitch Detection

The remaining task after onset detection is to analyze the samples within each piece delimited by the estimated onset samples and to find the fundamental frequency of the samples within each piece. A wide variety of algorithms for pitch detection has been proposed in the literature. Noll [14] suggests the cepstrum pitch determination method. Ross et al. [15] describe the average magnitude difference function (AMDF), a variation of the autocorrelation function (ACF), for pitch detection. The pitch detection algorithm used in this project is YIN [16], an improved method based on the classic ACF. This algorithm divides the process of pitch detection into 6 steps. In the first step, the autocorrelation function is calculated. The autocorrelation function $a_t(\tau)$ of lag $\tau$ at time $t$ is defined as

$$a_t(\tau) = \sum_{n=t+1}^{t+N} x_n \, x_{n+\tau} \qquad (3.5)$$

where $N$ is the integration window size and $x$ represents a sample. In this project, the value of $N$ is again 882 and the maximum lag is 588, which is 2/3 of $N$. The sampling frequency is 44,100 Hz. The pitch period is the lag between zero and a strong peak [15], and the fundamental frequency is then the inverse of the period. However, the ACF method alone makes a lot of errors. In the second step, a difference function is calculated, defined as


$$d_t(\tau) = \sum_{n=1}^{N} (x_n - x_{n+\tau})^2 = a_t(0) + a_{t+\tau}(0) - 2a_t(\tau) \qquad (3.6)$$

which is derived by supposing the signal is periodic with period $T$, so that $x_t - x_{t+T} = 0$ for all $t$. Any multiple of the pitch period is then a $\tau$ that makes $d_t(\tau) = 0$, or nearly zero because of imperfect periodicity (see Figure 13). The error rate is reduced when the difference function is used because the ACF is sensitive to amplitude changes. Figure 13 shows the difference function for the F4 in the input file.

Figure 13: The difference function for the F4 in the input file. The two data tips show two dips where $d_t(\tau)$ is nearly zero. They happen at lag values of 126 and 252, respectively. The zero-lag dip has a $d_t(\tau)$ value equal to zero. Because of imperfect periodicity, $d_t(\tau)$ is non-zero at the period of F4.

In step 3, the error rate is further reduced by replacing the difference function with the cumulative mean normalized difference function, which is defined as

$$d'_t(\tau) = \begin{cases} 1, & \tau = 0 \\ d_t(\tau) \Big/ \Big[ \frac{1}{\tau} \sum_{n=1}^{\tau} d_t(n) \Big], & \text{otherwise} \end{cases} \qquad (3.7)$$

The cumulative mean normalized difference function prevents the algorithm from choosing the zero-lag dip. The zero-lag dip also has $d_t(\tau)$ equal to zero, but using this dip does not give the correct fundamental frequency. Step 3 also benefits the next step.


Figure 14: The cumulative mean normalized difference function for F4 in the input file. The dashed line indicates the absolute threshold value, 0.2.

Step 4 mainly solves the well-known octave-error problem of autocorrelation-based pitch detection. An "octave error" usually happens when one of the higher-order dips is deeper than the dip that corresponds to the true period. In Figure 14, for example, 252 instead of 126 may be chosen as the lag value, and the fundamental frequency is then incorrectly approximated as 175 Hz, which is F3, a musical note one octave below F4. In step 4, an absolute threshold is set. All dips below the threshold are considered candidates, and the algorithm chooses the first dip below the threshold, so 126 instead of 252 is chosen as the lag value. The threshold value was chosen to be 0.2 for this project. Step 5, parabolic interpolation, and step 6, best local estimate, suggested in [16], give the algorithm an even lower error rate. It was found that the first four steps already produce results with error rates lower than 5%, which satisfies the needs of this project; therefore, only the first four steps were implemented. The same steps are performed for each musical note in the input file, and the result is a set containing the fundamental frequencies of all musical notes played by the user, in order.
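The following sketch shows steps 2 through 4 as described above (difference function, cumulative mean normalized difference, and the absolute threshold); it is an illustration written for this summary rather than the author's code, with the window size, maximum lag, and threshold taken from the text.

// YIN steps 2-4 (illustrative sketch): difference function, cumulative mean
// normalized difference function, and absolute-threshold dip selection.
// Parameters follow the text: N = 882, maximum lag 588, threshold 0.2,
// sampling frequency 44,100 Hz.  The input x must contain at least
// start + N + maxLag + 1 samples.
#include <cstddef>
#include <vector>

double yinPitch(const std::vector<double> &x, std::size_t start,
                std::size_t N = 882, std::size_t maxLag = 588,
                double threshold = 0.2, double fs = 44100.0)
{
    // Step 2: difference function d_t(tau), Equation (3.6).
    std::vector<double> d(maxLag + 1, 0.0);
    for (std::size_t tau = 1; tau <= maxLag; ++tau)
        for (std::size_t n = 1; n <= N; ++n) {
            double diff = x[start + n] - x[start + n + tau];
            d[tau] += diff * diff;
        }

    // Step 3: cumulative mean normalized difference d'_t(tau), Equation (3.7).
    std::vector<double> dPrime(maxLag + 1, 1.0);
    double runningSum = 0.0;
    for (std::size_t tau = 1; tau <= maxLag; ++tau) {
        runningSum += d[tau];
        dPrime[tau] = d[tau] * tau / runningSum;
    }

    // Step 4: the first dip below the absolute threshold gives the pitch period.
    for (std::size_t tau = 2; tau <= maxLag; ++tau)
        if (dPrime[tau] < threshold) {
            // walk to the bottom of this dip before reporting the lag
            while (tau + 1 <= maxLag && dPrime[tau + 1] < dPrime[tau])
                ++tau;
            return fs / static_cast<double>(tau);   // fundamental frequency
        }
    return 0.0;   // no dip below the threshold: report failure
}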

As mentioned in the previous chapter, the Pitch Detector Module should report to the Computer Vision Module which musical notes are played, as integer numbers. This requirement can be met with a lookup table. The keys of the entries are the true fundamental frequencies of each piano key. Because the keyboard is in twelve-tone equal temperament, the true frequencies $f(n)$ can be calculated by

$$f(n) = \left(\sqrt[12]{2}\right)^{\,n-49} \times 440 \text{ Hz} \qquad (3.8)$$

where $n$ is the $n$th key on an 88-key piano. The content under each key in the table is the integer number indicating how far to the left or right the key is from middle C, the origin of the integer number line in Figure 1. Each detected frequency is compared with the keys in the lookup table, and the content under the key whose value is closest to the detected frequency is sent to the Computer Vision Module. Because of this lookup table, the errors produced in onset detection and pitch detection are tolerated to some degree.
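A small sketch of this mapping is given below; the thesis only describes the lookup table, so the closed-form inversion of Equation (3.8) used here (middle C is key 40 on an 88-key piano) is an equivalent illustrative shortcut.

// Mapping a detected frequency to the integer key number used by the Computer
// Vision Module (illustrative sketch).  Equation (3.8) gives the frequency of
// the nth key of an 88-key piano; middle C is key 40, so subtracting 40
// yields the offset from middle C that the module reports (e.g. C5 -> 12).
#include <cmath>

int keyNumberFromFrequency(double frequencyHz)
{
    // Invert Equation (3.8): n = 49 + 12 * log2(f / 440 Hz), rounded to the
    // nearest key, which is what comparing against the lookup table achieves.
    double n = 49.0 + 12.0 * std::log2(frequencyHz / 440.0);
    int nearestKey = static_cast<int>(std::lround(n));
    return nearestKey - 40;   // key 40 is middle C (C4)
}

// Example: 262.5 Hz (the detected C4) maps to 0, and 393.7 Hz (G4) maps to 7.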

The Pitch Detector Module reports that there are 5 musical notes in the input file and that their fundamental frequencies are 262.5 Hz, 294 Hz, 331.5 Hz, 350 Hz, and 393.7 Hz. Compared to the true values of the musical notes recorded in the input file, namely 261.6 Hz (C4), 293.66 Hz (D4), 329.63 Hz (E4), 349.23 Hz (F4), and 392.00 Hz (G4), the results are close enough. Finally, the Pitch Detector Module reports to the Computer Vision Module that the keys played are 0, 2, 4, 5, and 7.


4. The Motor Control Module

This module is designed to accomplish any motor-related tasks. In this project, Bert is expected to play the electronic keyboard with only his right hand; he does not move his left hand, left arm, or torso during the performance. The focus of this chapter is on describing in detail how the last four steps in Figure 3 are done. One method to achieve the goal of moving the arm and hand to the right position and performing a key press is to manually move the arm to an electronic keyboard key, sample all joint angles of the right arm along the trajectory, and do the same for all electronic keyboard keys in Bert's sight. When Bert is asked to play a certain key, he only needs to play back the corresponding trajectory recorded earlier. The arm movement performed would be human-like and graceful. However, the method fails if the relative position between Bert and the keyboard changes, and this project expects Bert to play the keyboard even if the position of the keyboard is different every time. In addition to the difficulties raised by this expectation, the fact that there are no touch sensors on the fingertips of iCub v1.0 and v1.1 also complicates the problem. As planned for the next stage of the project, his vision may be primarily occupied by the process of reading musical scores. All these facts suggest that his motions have to depend on accurate position control.

4.1 Arm and Hand Movement

When Bert is going to show his skill in playing an electronic keyboard, he first goes from the initial pose after start-up to a preparation pose. Bert does this by going through a sequence of intermediate poses, or steps; for each pose, he follows the joint angles specified for his head and left and right arms. As shown in Figure 15 (b), he has moved his left arm backward and his right arm upward so that neither arm blocks his sight when he looks downward at the electronic keyboard in front of him. What he sees could be similar to Figure 4 and is saved as an image in one of the intermediate steps.

After he achieves the preparation pose, his end effector is changed from the palm of his right hand to the middle fingertip of his right hand. The iCub Cartesian Interface [17] helps achieve this goal. Because the motor at the proximal end of his index finger does not respond, his middle finger is used instead to perform a key press.


Figure 15: (a) The initial pose after Bert's start-up. (b) The preparation pose. Bert is ready to play a musical note on the keyboard. Every time Bert is going to play a note, he should be in this preparation pose.

Every time Bert is going to play a note, he starts from his preparation pose (see Figure 15 (b)). Bert is asked to first reach with his right middle finger to a position several centimeters above the key he will play. While moving his right arm and hand to that position, he keeps his hand pose as it is in the preparation pose, although his fingers may point in different directions. Because the palm of his right hand is always facing down, he is able to press the desired key using his right middle finger once he successfully reaches above it. When his middle finger presses down, the other fingers do not move. He raises his middle finger after he finishes the key press and moves his arm and hand back to the preparation pose. He repeats the same procedure, starting from the preparation pose, for the next musical note until all the notes are played.

4.2 Cartesian Control

Given the 3D Cartesian coordinates several centimeters above a key to be played and the 3D Cartesian coordinates of his end effector in the preparation pose, Bert needs to perform a point-to-point motion for each musical note. Since the 3D Cartesian coordinates of the target key are known, the joint configuration for that position can be approximated by inverse kinematics, supposing that the Cartesian coordinates of these keys are in the workspace and that the inverse kinematics calculations have solutions. Bert is thus able to move his arm and hand from the starting position to the target position by changing the joint angles at the starting point to those at the end point.

However, it is well known that this type of movement suffers from problems such as multiple solutions and unsafe paths. Luckily, even though inverse kinematics may return multiple joint configurations, the arm and hand movements are still reasonable enough for Bert to finish the reaching task, especially when the palm of his right hand is required to face downward. Besides, the workspace is almost free of obstacles, except for the electronic keyboard, so the possibility that the path his arm and hand follow is unsafe is a secondary issue.
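As an illustration of how such a point-to-point reach could be commanded through the iCub Cartesian Interface [17], here is a minimal sketch; the port names, motion timing, and error handling are simplified placeholders, and the exact calls used on Bert are not given in the thesis.

// Illustrative point-to-point reach through the iCub Cartesian Interface [17].
// Port names and motion times are placeholders; the real module also changes
// the end effector to the right middle fingertip and fixes the hand orientation.
#include <yarp/dev/CartesianControl.h>
#include <yarp/dev/PolyDriver.h>
#include <yarp/os/Network.h>
#include <yarp/os/Property.h>
#include <yarp/sig/Vector.h>

int main()
{
    yarp::os::Network yarp;

    yarp::os::Property options;
    options.put("device", "cartesiancontrollerclient");
    options.put("remote", "/icub/cartesianController/right_arm");  // placeholder
    options.put("local", "/keyboardPlayer/right_arm");

    yarp::dev::PolyDriver driver(options);
    yarp::dev::ICartesianControl *cart = nullptr;
    if (!driver.isValid() || !driver.view(cart))
        return 1;

    // Target a few centimeters above the key, in the robot root frame,
    // e.g. the point above C5 reported in Figure 16 (values from the text).
    yarp::sig::Vector above(3);
    above[0] = -0.312; above[1] = 0.1821; above[2] = 0.06803;

    cart->goToPositionSync(above);      // reach above the key
    cart->waitMotionDone(0.1, 5.0);     // poll every 0.1 s, time out after 5 s

    above[2] = 0.01876;                 // press down onto the key surface
    cart->goToPositionSync(above);
    cart->waitMotionDone(0.1, 5.0);

    return 0;
}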

Figure 16 shows the paths that Bert's right middle finger followed to play four musical notes on the electronic keyboard in an experiment. In this experiment, the four musical notes were played in the order C4, G4, C5, and A4.

Figure 16: Paths that Bert's right middle finger followed to play the four musical notes C4, G4, C5, and A4 on the electronic keyboard. The samples were recorded in the robot root reference frame (see Figure 10 (b)). The three data tips show the coordinates of three significant points along the path for playing C5.


The path for C5, the curve farthest from the X axis, is used as an example to illustrate the figure. In his preparation pose, Bert's middle fingertip had the 3D Cartesian coordinates (-0.3469 m, 0.2734 m, 0.1749 m) in the robot root reference frame. The end effector moved toward a position about 5 centimeters above the key C5; the coordinates of this position were (-0.312 m, 0.1821 m, 0.06803 m). A key press was performed after the end effector had reached just above the desired key, and the coordinates of the end effector changed to (-0.2866 m, 0.1784 m, 0.01876 m). Finally, the robot moved back to the preparation pose along the same path.


5. Performance and Discussion

This short chapter discusses the methods used in this project, especially the potential problems of the three modules. The method used in the Computer Vision Module depends heavily on the relative positions between the dots inside the black-key contours and the dots inside the white-key contours. Since Bert does not see the keyboard from a perfect orthographic view, this method can return a wrong result due to the convergence effect of parallel lines. Because Bert keeps only one entire octave after thresholding, this problem did not occur in practice. However, this potential problem can make Bert fail at recognizing C notes and, more importantly, play the wrong key.

Figure 17: The white points shown in the right image are believed to be the musical C notes. The method gives a wrong result because of the convergence effect. In the yellow circles in the left image, the black key, which should be to the left of a white key, has a center of mass to the right of that white key. As a result, the order of the pitches differs from Figure 1 and is scrambled.

One solution could be to calculate a homography matrix $H$ [18]; at least four corresponding points must be provided for this method to work. Another solution is to use a C note that is known to be correct. The second white point, for example, is correct because the center part of the image is distorted less by the effect. Representing the black and white keys as binary numbers also eases the process of error checking: the correct pattern of an octave, "101011010101", and the correct C note can be used as a reference to fix the wrong pattern.
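If the homography correction were attempted, a sketch with OpenCV's cv::findHomography might look like the following; the corner correspondences and the helper name are hypothetical placeholders, since the thesis only mentions this as a possible solution.

// Sketch of the proposed homography correction (illustrative only).
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat keyboardHomography(const std::vector<cv::Point2f> &imageCorners,
                           const std::vector<cv::Point2f> &frontalCorners)
{
    // At least four corresponding points are required [18]; here they would be
    // keyboard corners in the camera image and in a frontal, undistorted view.
    return cv::findHomography(imageCorners, frontalCorners);
}

// The mass-center points could then be mapped into the frontal view with
// cv::perspectiveTransform before the "101011010101" pattern is checked.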

It was found that the problems in the Pitch Detector Module usually relate to onset detection. It has low accuracy in finding the onset samples when the notes are played above 90 bpm. The problem probably occurs because of the naive peak-picking method used for onset detection; a more reliable peak-picking algorithm may solve it. In Figure 18, for example, this method believes frame 142 has a higher magnitude than frame 143 does.

Figure 18: An enlarged view of the detection function showing only the portion corresponding to the fourth musical note, F4. Data points near the peak may have very close magnitude values.

In the Motor Control Module, although inverse kinematics generates plausible joint configurations and the path taken by Bert's arm and hand is safe and natural, this simple method may not lead to natural and reasonable arm and hand movements when the starting point is not fixed at certain Cartesian coordinates. After all, pianists do not go back to a preparation pose after they press a key. More sophisticated motion planning and trajectory planning algorithms should be used so that Bert can give a more human-like musical performance.


6. Conclusion

This thesis discusses the first stage of a project that aims at teaching a humanoid robot, Bert, to play an electronic keyboard. The project is expected to provide the iCub with a unique opportunity to interact with the environment around him. The tasks of the first stage were marked as accomplished when Bert accurately pressed the desired electronic keyboard key for the first time; this moment showed that the system described in this thesis works. However, as suggested in Chapter 5, the potential issues of the system also indicate its limitations. In the next stage of the project, Bert is expected to perform better because more modern methods will replace the classic ones. The skills he learns will be incorporated into an associative memory. The ultimate task will be to map the world around Bert with an associative memory so that symbolic and abstract ideas can be constructed into a language framework.


References

[1] S.E. Levinson, K. Squire, R.S. Lin, and M. McClain, "Automatic language acquisition by an autonomous robot," in AAAI Spring Symposium on Developmental Robotics, 2005.

[2] A. F. Silver, "Multi-dimensional pre-programmed and learned fine motor control for a humanoid robot," M.S. Thesis, Urbana, 2012.

[3] V. N. Kamalnath, "Usage of computer vision and machine learning to solve 3D mazes," M.S. Thesis, Urbana, 2013.

[4] R. J. Zatorre, J. L. Chen, and V. B. Penhune, "When the brain plays music: auditory-motor interactions in music perception and production," Nature Reviews Neuroscience, vol. 8, no. 7, pp. 547-558, 2007.

[5] A. Brandt, M. Gebrian, and L. R. Slevc, "Music and early language acquisition," Frontiers in psychology, vol. 3, p. 327, 2012.

[6] G. Metta, G. Sandini, D. Vernon, L. Natale, F. Nori, "The iCub humanoid robot: an open platform for research in embodied cognition," in Proceedings of the 8th Workshop on Performance Metrics for Intelligent Systems, Gaithersburg, Maryland, 2008.

[7] "Wiki for iCub and Friends," [Online]. Available: http://wiki.icub.org/wiki/Main_Page. [Accessed March 2017].

[8] M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision, Cengage Learning, 2014.

[9] S. Suzuki, "Topological structural analysis of digitized binary images by border following," Computer vision, graphics, and image processing, vol. 30, no. 1, pp. 32-46, 1985.

[10] M.-K. Hu, "Visual pattern recognition by moment invariants," IRE Transactions on Information Theory, vol. 8, no. 2, pp. 179-187, 1962.

[11] A. Roncone, U. Pattacini, and L. Natale, "A cartesian 6-DoF gaze controller for humanoid robots," in Proceedings of Robotics: Science and Systems, Ann Arbor, MI, 2016.

[12] S. Dixon, "Onset detection revisited," in Proceedings of the 9th International Conference on Digital Audio Effects, Montreal, Quebec, Canada, 2006.


[13] J. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. Sandler, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 1035-1047, 2005.

[14] A. M. Noll, "Cepstrum pitch determination," The Journal of the Acoustical Society of America, vol. 41, no. 2, pp. 293-309, 1967.

[15] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, "Average magnitude difference function pitch extractor," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 22, no. 5, pp. 353-362, 1974.

[16] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917-1930, 2002.

[17] U. Pattacini, F. Nori, L. Natale, G. Metta, and G. Sandini, "An experimental evaluation of a novel minimum-jerk cartesian controller for humanoid robots," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 2010.

[18] G. Bradski and A. Kaehler, Learning OpenCV: Computer vision with the OpenCV library, O'Reilly Media, Inc., 2008.
