Visualising Singing Style Under Common Musical Events Using Pitch-Dynamics Trajectories and Modified TRACLUS Clustering

†Kin Wah Edward Lin, †Hans Anderson, †Natalie Agus, ‡Clifford So, †Simon Lui
†Singapore University of Technology and Design, ‡Chinese University of Hong Kong
{edward_lin, hans_anderson, natalie_agus}@mymail.sutd.edu.sg, [email protected], simon_[email protected]

Abstract—We present a novel method for visualising the singing style of vocalists. To illustrate our method, we take 26 audio recordings of a cappella solo vocal music from two different professional singers and we visualise the performance style of each vocalist in a two-dimensional space of pitch and dynamics. We use our own novel modification of a trajectory clustering algorithm called TRACLUS to generate four representative paths, called trajectories, in that two-dimensional space. Each trajectory represents the characteristic style of a vocalist during one of four common musical events: (1) Crescendo, (2) Diminuendo, (3) Ascending Pitches and (4) Descending Pitches. The shapes of these trajectories characterize the singing style of each vocalist with respect to each of these events. We present the details of our modified version of the TRACLUS algorithm and demonstrate graphically how the resulting plots indicate distinct stylistic differences between singers. Potential applications of this method include (a) automatic identification of singers and automatic classification of singing styles, and (b) automatic retargeting of performance style, to add human expression to computer-generated vocal performances and to allow singing synthesisers to imitate the styles of specific famous professional vocalists.

Keywords-Visualising; Singing Style; Music Event; TRACLUS Clustering

This work is supported by the SUTD-MIT International Design Center (IDC) Research Grant (IDG31200107 / IDD11200105 / IDD61200103).

I. INTRODUCTION

Automatic characterization of musical performance is important primarily for two applications. First, it provides a set of rules that can be applied in musical synthesis, permitting computers to produce more expressive, more human-sounding performances. Second, it is useful for information retrieval applications such as automatic recognition of individual performers and automatic classification of musical styles. Many existing methods of characterizing vocal performance style have provided insight into which audio features are most significant, and into how these features should be processed so that they can be applied usefully to the applications mentioned above.

Lui et al. [1] show that the expressive performance style of violin playing can be represented using dynamics, tempo and articulation as features. The style can be extracted and retargeted by a support vector machine (SVM); Lui et al. use each feature separately during the synthesis stage. Dannenberg et al. [2] place a strong emphasis on modelling high-quality frequency and amplitude envelopes (pitch and dynamics) for wind instrument synthesis. Widmer et al. [3] also use these features to represent piano performance style. Nakano et al. [4] propose a singing synthesis system called VocaListener, in which they model singing style based on pitch, dynamics and timbre. Saitou et al. [5] verify that micro-tonal fluctuations in tone, such as vibrato and pitch bends, are essential to human perception of singing style. In this paper, we focus on two of the most significant performance features, dynamics and macro-tonal pitch, and on their relationship over time.

Widmer et al. [6] demonstrate a machine learning algorithm capable of identifying pianists by their performance style. They use a technique called the performance worm [7], in which variations in tempo and loudness are plotted against each other. The performance worm separates an entire audio recording into segments two beats long and constructs a symbolic representation of the stylistic content in each segment. The complete set of stylistic elements in a given song is represented symbolically, and the collection of all these symbols is called an alphabet. In this way, the style of an entire performance can be symbolically represented as a string of alpha-numeric symbols. After collecting a set of performance strings from different pianists, they find the sub-strings that occur most frequently in the performances of a particular pianist.
They claim that these frequently occurring sub-strings characterize the performance style. For example, in a string representing a particular pianist's performances of Mozart, they frequently find a sub-string representing a crescendo followed by a slight accelerando, followed again by a decrescendo at nearly constant tempo. Inspired by their method, the method we present in this paper is an alternative way to visualise singing style, using paths in pitch-dynamics space.

In this paper, a trajectory is a plot of the dynamics and pitch (in a two-dimensional space) over the entire duration of a song. Specifically, a trajectory is a set of ordered pairs (p_t, d_t) representing a time series of samples of pitch p_t and loudness d_t at time t, taken at regular intervals. We use a modified version of Jae-Gil Lee's trajectory clustering algorithm, TRACLUS [8], to compare vocal performances of two or more songs and to identify similar sub-trajectories, representing portions where the performances share stylistic similarities. We then group these similar sub-trajectories into clusters and represent each cluster with a representative trajectory. Each representative trajectory of a cluster of sub-trajectories indicates a stylistic performance event that occurs in a similar way across several performances.

The remainder of the paper is organised as follows. In Section II, we describe the audio samples in the database we used to illustrate our method. In Section III, we explain how we process each sample to generate the pitch-dynamics trajectories. Since our design makes use of the TRACLUS algorithm to further process these trajectories, we briefly state the TRACLUS algorithm and justify its use in this application in Section IV. Our main contribution, which is to use a modified version of the TRACLUS algorithm to visualise singing style through representative trajectories in a two-dimensional plot of pitch and dynamics, is presented in Section V. Finally, several possible future directions are discussed in Section VI.

Table I: 13 a cappella solos of Amber Riley.

Song Name                               Duration (m:ss)   Tempo   Key        Time sig.
…                                       0:28              112     B♭ major   4/4
And I Am Telling You (I'm Not Going)    0:28              118     B♭ major   4/4
Disco Inferno                           1:01              126     E♭ major   4/4
Don't Wanna Lose You                    1:40              80      F major    4/4
Hate On Me                              1:15              120     E♭ major   4/4
Hell To The No                          1:35              133     G major    4/4
…                                       1:27              104     G major    4/4
…                                       1:11              66      A major    4/4
Shake It Out                            0:18              112     F major    4/4
Sweet Transvestite                      1:23              104     E major    4/4
…                                       0:28              118     F major    4/4
Try A Little Tenderness                 1:48              80      G major    4/4
USA National Anthem                     1:09              104     A♭ major   4/4

Table II: 13 a cappella solos of Lea Michele.

Song Name                       Duration (m:ss)   Tempo   Key        Time sig.
Being Good Isn't Good Enough    2:00              108     G major    4/4
Don't Cry For Me Argentina      2:50              87      D major    4/4
Firework                        1:47              124     A♭ major   4/4
Get It Right                    3:22              84      D major    4/4
Go Your Own Way                 1:52              136     F major    4/4
Jar Of Hearts                   2:26              75      E♭ major   4/4
My Man                          2:01              102     E major    4/4
Oops!...I Did It Again          2:09              95      E major    4/4
Take A Bow                      2:01              85      E major    4/4
The Only Exception              1:40              95      B major    6/8
Torn                            1:07              96      F major    4/4
What I Did For Love             2:32              72      C major    4/4
Without You                     2:31              128     D major    4/4

II. SINGING SAMPLES

Using a publicly available database is always desirable, as it provides a benchmark against which researchers can scientifically justify their findings. However, for the current research, no publicly accessible database was ideally suitable for our needs. The database would at best be similar to the one G. Widmer used [6], wherein there is a set of singers S = {s_1, ..., s_n} and a set of songs (or performances) P = {p_1, ..., p_m} such that ∀ p_j ∈ P, p_j is performed by each s_i ∈ S. The requirement that all vocalists sing the same songs makes it impossible to assemble such a database from existing recordings.

In addition to the above requirement, clean recordings without excessive reverberation or other mixed-in vocal effects were a major criterion for selection into our database. We searched and found that two popular female vocalists have such sets of clean recordings: Amber Riley1 and Lea Michele2. Table I lists all 13 a cappella solos of Amber Riley, and Table II lists all 13 a cappella solos of Lea Michele. We selected these 26 recordings because they represent diverse musical genres with different dynamics, tempos and keys. The recordings came in standard audio CD format (44.1 kHz sampling rate, 16-bit PCM samples), and we converted them to mono.

III. PITCH-DYNAMICS TRAJECTORIES

In this section, we illustrate how the pitch-dynamics trajectories are generated, using the songs Disco Inferno and Being Good Isn't Good Enough as examples. We first cut the songs into frames of fixed length (4096 sample points, or 92.9 ms) with 50% frame overlap. Then we measure the loudness and pitch of each frame and represent each measurement as a point in pitch-dynamics space.

We measure the dynamics on the sone scale [9] using Glasberg and Moore's loudness model [10]. Our implementation is based on the one from Genesis Audio3, using 4096 sample points per frame and a sampling frequency of 44.1 kHz. We use this same frame size for the pitch detection step. For our measurement of pitch we use YIN4 [11], a common algorithm for pitch detection in melodic music. Since we are measuring the pitch of female voices, we discard the frames for which the estimated pitch is outside the average female vocal range, F3 (MIDI 53) to C6 (MIDI 84). It is typical for pitch estimation algorithms to generate spurious output when processing frames in which the input signal is in transition between silence and tone or between tone and silence, or when the singing voice is not present at all; discarding out-of-range frames helps avoid the negative effects of these occasional spurious data points.

1 http://en.wikipedia.org/wiki/Amber_Riley
2 http://en.wikipedia.org/wiki/Lea_Michele
3 http://www.genesis-acoustics.com/en/loudness_online-32.html
4 Matlab implementation: http://audition.ens.fr/adc/
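To make this preprocessing concrete, here is a minimal sketch of how such trajectory points could be computed. It is an illustration under stated assumptions, not the authors' pipeline: it assumes the librosa library, whose YIN implementation stands in for the Matlab one cited above, and it substitutes frame RMS in decibels for the Glasberg-Moore sone loudness model, which has no standard Python implementation. The function name pitch_dynamics_trajectory is ours.

import librosa
import numpy as np

FRAME = 4096      # 92.9 ms at 44.1 kHz, as described above
HOP = FRAME // 2  # 50% frame overlap
F3, C6 = 53, 84   # average female vocal range, as MIDI note numbers

def pitch_dynamics_trajectory(path):
    y, sr = librosa.load(path, sr=44100, mono=True)
    # Frame-wise fundamental frequency via YIN [11]. The search range is
    # deliberately wider than F3-C6 so that out-of-range (likely spurious)
    # estimates can be detected and discarded afterwards.
    f0 = librosa.yin(y, fmin=librosa.midi_to_hz(40),
                     fmax=librosa.midi_to_hz(96),
                     sr=sr, frame_length=FRAME, hop_length=HOP)
    pitch = librosa.hz_to_midi(f0)
    # Loudness proxy: frame RMS energy in decibels (the paper uses sones).
    rms = librosa.feature.rms(y=y, frame_length=FRAME, hop_length=HOP)[0]
    loudness = librosa.amplitude_to_db(rms, ref=np.max)
    n = min(len(pitch), len(loudness))
    pitch, loudness = pitch[:n], loudness[:n]
    vocal = (pitch >= F3) & (pitch <= C6)  # keep only the "vocal frames"
    return np.column_stack([pitch[vocal], loudness[vocal]])  # (p_t, d_t) rows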

We call the remaining frames, which contain pitches within the normal frequency range of the female voice, vocal frames. There are 997 vocal frames in the song Disco Inferno and 1,783 in the song Being Good Isn't Good Enough. Overall, there are 14,425 vocal frames in the entire database of a cappella solos of Amber Riley and 25,014 frames in the recordings of Lea Michele.

After measuring the dynamics and the pitch of all vocal frames in a song, we plot each pitch-loudness measurement pair in a two-dimensional space and connect the points to form a path. The complete path for one song is called the pitch-dynamics trajectory of that song. Figure 1 (a) shows the pitch-dynamics trajectory of Amber Riley's performance of the song Disco Inferno in the time interval from 17.25 to 19.57 seconds; the solid line between the two round circles indicates the dynamics and pitch of the singing performance extracted up to the end of the 19.57 seconds. Similarly, Figure 1 (b) shows the pitch-dynamics trajectory of Lea Michele's performance of the song Being Good Isn't Good Enough during the interval from 47.91 to 50.23 seconds; the solid line between the two round circles indicates the dynamics and pitch of the singing performance extracted up to the end of the 50.23 seconds.

Figure 1: (a) The pitch-dynamics trajectory of Amber Riley's performance of the song Disco Inferno from 17.25 to 19.57 seconds. (b) The pitch-dynamics trajectory of Lea Michele's performance of the song Being Good Isn't Good Enough from 47.91 to 50.23 seconds.

IV. THE TRACLUS ALGORITHM

In this section, we first summarise the TRACLUS algorithm; then we describe how we carry out the TRACLUS algorithm to analyse both Amber Riley's and Lea Michele's singing styles, along with a description of some technical issues. At the end of this section, we justify the use of TRACLUS. For the details of the TRACLUS algorithm and a C++ implementation, please refer to [8] and Jae-Gil Lee's personal website5.

5 http://dm.kaist.ac.kr/jaegil/

Given a set of trajectories, the TRACLUS algorithm computes a set of clusters and a set of representative trajectories. To do so, the TRACLUS algorithm has two phases, namely the partitioning phase and the grouping phase.

In the partitioning phase, it partitions the trajectories into a set of line segments using the minimum description length (MDL) principle [12]. In plain English, the work the algorithm does in the partitioning phase is to take a trajectory path full of bends, twists and discontinuities and cut it into a set of small, smooth line segments, also called sub-trajectories. To do so, the algorithm iterates point-by-point along the original trajectory and, with each new point, decides whether it is worth cutting the original trajectory at the previous point to form a new sub-trajectory. If the cost of describing the points seen so far with a single segment exceeds the cost of keeping the original points (the cost of not cutting), the algorithm cuts the original trajectory in two at the previous point; otherwise, it continues on to the next point.
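As a rough illustration of this partitioning logic, the sketch below applies a simplified MDL test: unlike the full formulation in [8], only the perpendicular encoding error is charged to L(D|H). All names here are ours, and the NOCUT_WEIGHT factor anticipates the 30% increase in the cost of not cutting that we apply later in this section.

import numpy as np

NOCUT_WEIGHT = 1.3  # increase the cost of not cutting by 30%

def _log2len(a, b):
    return np.log2(max(np.linalg.norm(b - a), 1.0))

def _perp(a, b, p):
    # Perpendicular distance from point p to the line through a and b.
    d = b - a
    n = np.linalg.norm(d)
    if n == 0:
        return np.linalg.norm(p - a)
    return abs(d[0] * (p - a)[1] - d[1] * (p - a)[0]) / n

def _mdl_par(t, i, j):
    # Cost of describing points i..j by the single segment (t[i], t[j]):
    # L(H) plus a simplified L(D|H) term for the approximation error.
    err = sum(np.log2(max(_perp(t[i], t[j], t[k]), 1.0))
              for k in range(i + 1, j))
    return _log2len(t[i], t[j]) + err

def _mdl_nopar(t, i, j):
    # Cost of keeping every original point between i and j.
    return sum(_log2len(t[k], t[k + 1]) for k in range(i, j))

def partition(t):
    # Cut trajectory t, an (n, 2) array of (pitch, loudness) points,
    # into sub-trajectories at its characteristic points.
    cps, start, cur = [0], 0, 1
    while cur + 1 < len(t):
        if _mdl_par(t, start, cur + 1) > NOCUT_WEIGHT * _mdl_nopar(t, start, cur + 1):
            cps.append(cur)  # approximating further would cost too much: cut
            start = cur
        cur += 1
    cps.append(len(t) - 1)
    return [t[a:b + 1] for a, b in zip(cps, cps[1:])]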
After the set of line segments (sub-trajectories) is computed, it passes on to the grouping phase for density-based clustering. Simply speaking, a cluster is formed around core line segments. A line segment is called a core line segment if the number of its neighbours is more than a threshold, MinLns. A line segment L_i is a neighbour of line segment L_j if the distance between them, dist(L_i, L_j), is less than a threshold ε. The distance dist(L_i, L_j) is symmetric and is defined in terms of the perpendicular distance, the parallel distance and the angle distance between L_i and L_j.
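For concreteness, here is a sketch of that distance in Python, following the component definitions of Lee et al. [8]; the equal default weights are our assumption rather than a tuned choice.

import numpy as np

def traclus_distance(Li, Lj, w_perp=1.0, w_par=1.0, w_ang=1.0):
    Li, Lj = np.asarray(Li, float), np.asarray(Lj, float)
    # Measure relative to the longer segment, so the distance is symmetric.
    if np.linalg.norm(Li[1] - Li[0]) < np.linalg.norm(Lj[1] - Lj[0]):
        Li, Lj = Lj, Li
    (si, ei), (sj, ej) = Li, Lj
    d = ei - si
    length = np.linalg.norm(d)
    u = d / length                     # unit vector along the longer segment
    ts, te = np.dot(sj - si, u), np.dot(ej - si, u)
    ps, pe = si + ts * u, si + te * u  # projections of sj and ej onto Li
    # Perpendicular component.
    l1, l2 = np.linalg.norm(sj - ps), np.linalg.norm(ej - pe)
    d_perp = 0.0 if l1 + l2 == 0 else (l1 ** 2 + l2 ** 2) / (l1 + l2)
    # Parallel component: how far the projections fall past Li's end points.
    d_par = min(abs(ts), abs(length - te))
    # Angle component: |Lj| * sin(theta), or |Lj| when theta >= 90 degrees.
    v = ej - sj
    lv = np.linalg.norm(v)
    cos_t = np.clip(np.dot(v, u) / lv, -1.0, 1.0) if lv > 0 else 1.0
    d_ang = lv * np.sqrt(1.0 - cos_t ** 2) if cos_t >= 0 else lv
    return w_perp * d_perp + w_par * d_par + w_ang * d_ang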

The optimal ε is obtained by minimizing the entropy (defined as in information theory) via a simulated annealing technique [13]. With the optimal ε, the neighbourhood N_ε(L) of each line segment L can be calculated; MinLns is then simply avg|N_ε(L)| + 6.

Now we describe how we carry out the TRACLUS algorithm to analyse both Amber Riley's and Lea Michele's singing styles. First, we calculate the pitch-dynamics trajectory for each of their songs, as described in Section III. Given these two sets of trajectories, TRACLUS partitions the trajectories into line segments. To improve the clustering quality, as suggested by Jae-Gil Lee et al. [8], we increase the cost of not cutting the original trajectory by 30% to prevent over-clustering caused by short line segments. Before we use TRACLUS's density-based clustering to group the line segments, the optimal ε is obtained via the Matlab implementation6 of the simulated annealing technique. For the set of Amber Riley's trajectories, the optimal ε is 1.2899 and MinLns is 11; for the set of Lea Michele's trajectories, the optimal ε is 1.2214 and MinLns is 11. Figures 2 (a) and 2 (b) show the entropy for Amber Riley's and Lea Michele's trajectory sets respectively, as the threshold ε varies from 0 to 100; both figures verify the result of the simulated annealing technique. One cluster is identified in each set of trajectories after carrying out TRACLUS's density-based clustering.

6 http://www.mathworks.com/help/gads/simulannealbnd.html

Figure 2: (a) Entropy for Amber Riley. (b) Entropy for Lea Michele.

Now we justify the use of TRACLUS. Observe how TRACLUS cuts the trajectory into line segments: a new separate line segment (sub-trajectory) is cut from the original trajectory wherever it has a sharp turning point or corner. This is most likely to occur when the melodic contour changes direction (i.e. it switches from ascending to descending pitch or vice versa) or when a musical phrase starts or ends. These events indicate the onset of each musical phrase in the song. Also, observe that a line segment (sub-trajectory) is a core line segment if it has more than MinLns neighbours. Since the line segments represent the singing performance style, the core line segments represent the stylistic characteristics that the singer most frequently employs in her songs. In other words, the threshold MinLns determines how frequently a stylistic event must occur in a singing performance before we consider it representative of the artist's style. Despite the different musical genres of the songs, human listeners are always able to recognise the unique style of a famous singer; it is very difficult for a famous singer to exhibit two or more very different and unique styles. Hence, the partitioning and grouping procedures of TRACLUS reveal the most representative performance information across all of the songs. This justifies why only one cluster is formed from each set of trajectories.
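To make the ε-selection step described above concrete, the sketch below minimises the neighbourhood-size entropy with SciPy's dual_annealing, standing in for the Matlab simulannealbnd routine cited above; traclus_distance is the sketch from earlier in this section, and the "+ 6" offset for MinLns follows the formula given above.

import numpy as np
from scipy.optimize import dual_annealing

def neighbourhood_sizes(segments, eps):
    dists = np.array([[traclus_distance(a, b) for b in segments]
                      for a in segments])
    return (dists <= eps).sum(axis=1)  # |N_eps(L)| for every segment L

def entropy(x, segments):
    sizes = neighbourhood_sizes(segments, x[0])
    p = sizes / sizes.sum()            # each segment neighbours itself, p > 0
    return float(-(p * np.log2(p)).sum())

def optimal_eps(segments):
    # Search eps over [0, 100], the same range plotted in Figure 2.
    res = dual_annealing(entropy, bounds=[(0.0, 100.0)],
                         args=(segments,), seed=0)
    eps = float(res.x[0])
    min_lns = int(round(neighbourhood_sizes(segments, eps).mean())) + 6
    return eps, min_lns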
V. INTERPRETING SINGING STYLE

Now we present our main contribution, which is to use a modified version of the TRACLUS algorithm to visualise singing style through representative trajectories in a two-dimensional plot of pitch and dynamics. At the end of this section, we highlight the differences between our method and the original TRACLUS algorithm.

Algorithm 1 shows how the four representative trajectories are generated. Beginning with a cluster of line segments, we classify each segment according to its direction into the following groups: (1) Crescendo (upward orientation), (2) Diminuendo (downward orientation), (3) Ascending Pitches (heading right) and (4) Descending Pitches (heading left). Pseudocode for this step is shown in line 01 of Algorithm 1. Note that a line segment may belong to two of these groups simultaneously; for example, a segment may simultaneously make a crescendo in loudness and ascend in pitch. Next, we sort the starting and ending points of the line segments in each of the four groups; this is line 02 of the pseudocode. When we scan these points in sorted order, we are in effect sweeping a line across the line segments, moving in the direction of motion corresponding to the group and pausing each time we reach a starting or ending point of a line segment. For example, in the Ascending Pitches group, we sweep from left to right across the graph. While sweeping, we count the number of line segments that cut across the sweep line. If this number is greater than or equal to MinLns, and the previous position of the sweep line is not closer to the current one than the smoothing parameter γ (which we set to ε), we compute the average coordinate of those line segments. This is lines 03-11 of the pseudocode. After running the algorithm four times, once for each direction group DIR, we obtain four representative trajectories for each singer.

Algorithm 1 Representative Trajectory Generation for Each of Four Common Musical Events

INPUT: (1) A cluster C of line segments, where a line segment L is defined by its two end points (x1, y1) and (x2, y2) and is oriented facing from (x1, y1) to (x2, y2); (2) MinLns; (3) a smoothing parameter γ; and (4) the specific direction DIR
OUTPUT: A representative trajectory RTR for C in the specific direction DIR
Algorithm:
01: Gather all line segments in C with the specific direction DIR; let G be this set of line segments;
    Switch DIR
      Case 'Up':    /* Crescendo */          gather all line segments for which y1 < y2;
      Case 'Down':  /* Diminuendo */         gather all line segments for which y1 > y2;
      Case 'Right': /* Ascending Pitches */  gather all line segments for which x1 < x2;
      Case 'Left':  /* Descending Pitches */ gather all line segments for which x1 > x2;
02: Let P be the set of the starting and ending points of the gathered line segments; sort the points in P by the specific direction DIR;
    Switch DIR
      Case 'Up':    ignore the x values in P; sort P in ascending order by y value;
      Case 'Down':  ignore the x values in P; sort P in descending order by y value;
      Case 'Right': ignore the y values in P; sort P in ascending order by x value;
      Case 'Left':  ignore the y values in P; sort P in descending order by x value;
03: for each (p ∈ P) do
04:   Count num_p := the number of line segments in G that cut across the sweep line at p;
05:   if (num_p ≥ MinLns and diff ≥ γ) then
        /* diff := the difference between p and the immediately preceding point */
06:     if (DIR == 'Up' or DIR == 'Down')
07:       Compute the average x-coordinate avg_p;
08:       Append (avg_p, p) to the end of RTR;
09:     if (DIR == 'Right' or DIR == 'Left')
10:       Compute the average y-coordinate avg_p;
11:       Append (p, avg_p) to the end of RTR;
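For readers who prefer running code, here is a compact Python rendering of Algorithm 1. It is our interpretation of the pseudocode: in particular, it reads "compute the average coordinate" in lines 07 and 10 as averaging the coordinates at which the counted segments cross the sweep line, and it measures diff against the previously appended sweep position.

import numpy as np

# direction -> (swept axis: 0 = x, 1 = y; sweep ascending?; membership test)
DIRS = {
    'Up':    (1, True,  lambda s: s[0][1] < s[1][1]),  # crescendo
    'Down':  (1, False, lambda s: s[0][1] > s[1][1]),  # diminuendo
    'Right': (0, True,  lambda s: s[0][0] < s[1][0]),  # ascending pitches
    'Left':  (0, False, lambda s: s[0][0] > s[1][0]),  # descending pitches
}

def _crossing(s, p, axis):
    # Other-axis coordinate where segment s crosses sweep position p.
    a, b = np.asarray(s, float)
    if b[axis] == a[axis]:
        return (a[1 - axis] + b[1 - axis]) / 2.0
    t = (p - a[axis]) / (b[axis] - a[axis])
    return a[1 - axis] + t * (b[1 - axis] - a[1 - axis])

def representative_trajectory(cluster, min_lns, gamma, direction):
    axis, ascending, in_group = DIRS[direction]
    G = [s for s in cluster if in_group(s)]                   # line 01
    P = sorted({pt[axis] for s in G for pt in s},
               reverse=not ascending)                         # line 02
    rtr, prev = [], None
    for p in P:                                               # line 03
        hits = [s for s in G if                               # line 04
                min(s[0][axis], s[1][axis]) <= p <= max(s[0][axis], s[1][axis])]
        if len(hits) >= min_lns and (prev is None or abs(p - prev) >= gamma):
            avg = float(np.mean([_crossing(s, p, axis) for s in hits]))
            rtr.append((avg, p) if axis == 1 else (p, avg))   # lines 06-11
            prev = p
    return rtr

Calling representative_trajectory once per direction, with MinLns = 11 and γ set to the optimal ε, yields four representative trajectories of the kind plotted in Figure 3.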
Figure 3 (a) shows the four representative trajectories of Amber Riley. The vertical crescendo and diminuendo trajectories reveal her overall dynamic range: the minimum and maximum values are 4 and 41 sones. The horizontal trajectories reveal the dynamics over the ascending and descending pitches she sings. For ascending pitches, the low notes start with a narrow dynamic range (25 to 29 sones), which widens to 16 to 32 sones by the middle note (MIDI 70). For descending pitches, the high notes start at a concentrated dynamic level (25 sones), and the range widens to 22 to 35 sones at the low notes.

Similarly, Figure 3 (b) shows the four representative trajectories of Lea Michele. Her dynamic range extends from 3 to 45 sones, a little wider than Amber Riley's. For ascending pitches, her dynamic range is stable at 20 to 30 sones, narrower than in Amber Riley's performances. For descending pitches, the high notes are consistently close to 20 sones, but the range opens up at the low notes, from 17 to 31 sones. The most obvious and basic difference we notice between these two singers is that Amber Riley has a wider dynamic range than Lea Michele.

One key difference between our algorithm and the original TRACLUS algorithm is that we generate four representative trajectories aligned with the four common musical events. The original TRACLUS algorithm extracts in only one direction, and it therefore assumes that all trajectories move in a single direction; Jae-Gil Lee et al. [8] acknowledge that the method should be enhanced to accommodate circular motion. Our method can easily be extended to support multiple directions for more musical features, or to more dimensions to support more audio features.

VI. FUTURE WORK AND DISCUSSION

These preliminary experimental results suggest that a modification of the TRACLUS algorithm may provide a viable alternative to the performance worm for the purpose of characterising musical performance style. Before this method can be employed, several improvements are necessary. We enumerate the main ones below.

1) We will add more expressive performance parameters, such as deviations of tempo, pitch and timbre at the beat level, as well as more commonly found musical events, such as intervals or small local musical structures, so that the perceived singing performance can be better characterized in a musical context.
2) We will study how well the TRACLUS algorithm's partitioning process detects and responds to the onset of new musical phrases.
3) We will study the relationship between MinLns and the number of trajectories needed to form a meaningful cluster. Observe that TRACLUS's density-based clustering may not form any cluster at all; TRACLUS only reports the meaningful clusters, which occur in at least MinLns trajectories (songs).

Once these studies of cluster quality are completed, the next phase of this research is to use this data to imitate the style of real singers through musical performance synthesis, or to perform music information retrieval using this visualisation.

REFERENCES

[1] S. Lui, A. Horner, and C. So, "Re-targeting expressive musical style from classical music recordings using a support vector machine," Journal of the Audio Engineering Society, vol. 58, no. 12, pp. 1032–1044, 2010.
[2] R. B. Dannenberg and I. Derenyi, "Combining instrument and performance models for high-quality music synthesis," Journal of New Music Research, vol. 27, no. 3, pp. 211–238, 1998.
[3] G. Widmer and W. Goebl, "Computational models of expressive music performance: The state of the art," Journal of New Music Research, vol. 33, no. 3, pp. 203–216, 2004.
[4] T. Nakano and M. Goto, "VocaListener2: A singing synthesis system able to mimic a user's singing in terms of voice timbre changes as well as pitch and dynamics," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing, 2011.
[5] T. Saitou, M. Unoki, and M. Akagi, "Development of an F0 control model based on F0 dynamic characteristics for singing-voice synthesis," Speech Communication, vol. 46, no. 3, pp. 405–417, 2005.
[6] G. Widmer and P. Zanon, "Automatic recognition of famous artists by machine," in Proc. 16th Euro. Conf. on Artificial Intelligence, 2004.
[7] G. Widmer, S. Dixon, W. Goebl, E. Pampalk, and A. Tobudic, "In search of the Horowitz factor," AI Magazine, vol. 24, no. 3, pp. 111–130, 2003.
[8] J.-G. Lee, J. Han, and K.-Y. Whang, "Trajectory clustering: A partition-and-group framework," in Proc. ACM SIGMOD Int. Conf. on Management of Data, 2007.
[9] H. Fastl and E. Zwicker, Psychoacoustics: Facts and Models, 3rd ed. Springer, 2007.
[10] B. R. Glasberg and B. C. J. Moore, "A model of loudness applicable to time-varying sounds," Journal of the Audio Engineering Society, vol. 50, no. 5, pp. 331–342, 2002.
[11] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," Journal of the Acoustical Society of America, vol. 111, pp. 1917–1930, 2002.
[12] P. Grünwald, "A tutorial introduction to the minimum description length principle," in Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.
[13] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, 1983.