Popular Music Analysis: Chorus and Emotion Detection

Chia-Hung Yeh 1, Yu-Dun Lin 1, Ming-Sui Lee 2 and Wen-Yu Tseng 1
1 Department of Electrical Engineering, National Sun Yat-sen University, 804 Taiwan
E-mail: [email protected], Tel: +886-7-5252000 Ext. 4112
2 Department of Computer Science and Information Engineering, National Taiwan University, 106 Taiwan
E-mail: [email protected], Tel: +886-2-33664888 Ext. 520

Abstract—In this paper, a chorus detection and an emotion detection algorithm for popular music are proposed. First, a piece of popular music is decomposed into chorus and verse segments based on its color representation and MFCCs (Mel-frequency cepstral coefficients). Four features, including intensity, tempo and rhythm regularity, are extracted from these structured segments for emotion detection. The emotion of a song is classified into four classes of emotions: happy, angry, depressed and relaxed via a back-propagation neural network classifier. Experimental results show that the average recall and precision of the proposed chorus detection are approximately 95% and 84%, respectively; the average precision rate of emotion detection is 88.3% for a test database consisting of 210 popular music songs.

Keyword: Chorus, MFCC, music emotion, neural network

I. INTRODUCTION

Multimedia information has evolved over the last decade, and digital popular music has taken a significant place in our daily lives. Knowing how to manage a large number of music files has become increasingly important because of their mounting quantity and availability. Therefore, a great number of methods for music database management and retrieval have been investigated [1]-[8]. Music retrieval systems can find specific songs based on users' demands. In general, music retrieval systems can be classified into three categories: content-based, text-based, and emotion-based. Content-based retrieval systems enable users to retrieve a musical piece by humming, singing, whistling or playing a fragment of the piece; the result is an output list containing the best matching (e.g., Top 5) songs. Text-based music retrieval systems classify songs by music information such as artist, alphabet, title, genre and so on. However, manual classification is time-consuming and not suitable for some cases. For example, exciting or happy music with a fast tempo and steady rhythms is always in demand for a party, and cheerful music might be useful for soothing a depressed person. There are also cases when users feel like listening to a series of songs, rather than specific ones, as long as these songs go with their mood at that time. To be able to retrieve music with the tag "happy" from a database, it is reasonable and necessary to categorize songs by emotion, so that users do not have to waste time searching files in order to find suitable ones. In this paper, we aim at establishing a music retrieval system which classifies songs by emotions so as to provide the user proper music for certain scenarios.

The rest of this paper is organized as follows. Section II reviews the background of music emotion detection. Section III describes the proposed scheme, including chorus detection and emotion detection. Experimental results are shown in Sec. IV to evaluate the performance of the proposed method. Finally, concluding remarks and recommendations for future work are given in Sec. V.

II. BACKGROUND REVIEW

Friedrich Nietzsche, a famous German philosopher, once said that "Without music, life would be a mistake." Music does represent a significant part of our daily life. Besides, music can express emotion and produce emotion for listeners. Therefore, we want to figure out what kind of emotions a song conveys, especially for popular songs. Most popular music has a simple musical structure that includes the repetition of a chorus. Stein [9] defines the chorus as a section of a song that is heard several times, repeating the same material. Choruses are the most noticeable and easily remembered parts of a song. Once the repeated part, i.e., the chorus, is detected, we obtain a clear picture of the chorus/verse structure of a song. In general, the verse sections of a song lay out its theme, while the chorus sections allow one to remember and sing along by repeating its motifs. The main objective of this research is to analyze the verse-chorus structure of popular songs so as to dig out their emotion for quickly browsing music databases and for other purposes. The problem of the structure analysis of a song has been addressed recently [10]-[14]. Most studies employ a similarity matrix to analyze the structure of a song. In this research, we examine this problem via a color representation.

As mentioned in [15], "Music arouses strong emotions in people, and they want to know why." One of the major difficulties in music emotion classification is that emotions are hard to express. Thayer's model [16] is commonly used to categorize emotions; it defines emotion as a two-dimensional model. As for the taxonomy of music emotion, to simplify the emotion detection procedure, we classify emotions into four classes, which will be described in detail in Sec. IV.



III. PROPOSED METHOD

A. Chorus Detection

A framework for extracting the chorus from popular music based on structural content analysis is proposed. Fig. 1(a) shows the flowchart of chorus detection. For each frame of audio data, we extract a feature vector by calculating the energy of three features (the overall intensity, and the high-band and low-band energy in the frequency domain). We map the feature vectors to the R, G, B color space to obtain a music color map that represents the structure of the song. The color image adaptive clustering segmentation algorithm [17] is employed to cluster regions with similar color distributions. Then, MFCCs are extracted from each region obtained from the color map as the feature for classifying the verse and chorus sections of a popular song. Finally, some post-processing steps exclude fragile regions in order to enhance the detection result. Experimental results show the efficiency of the proposed system for chorus detection. The following are the details of the proposed chorus detection algorithm.

Fig. 1 The proposed chorus detection method. (a) Flowchart of chorus detection. (b) Color map of the song "Let It Go". (c) Structure information of the song "Let It Go" obtained by employing RPCL. (d) Illustration of MFCCs of the combined chorus sections. (e) Result of chorus detection after post-processing.

Color Map Generation

Color information is utilized to represent the music structure. Three audio signal features are extracted from the music and converted into the RGB color space so as to generate a color map. The color map not only helps us find repeating patterns but also gives an overall impression of a song. First, the summation of the energy of a song, F1, is calculated in (1). The summations of the energy of the low and high frequency bands are also calculated as F2 and F3. Based on numerous experiments, the ranges of the low and high frequency bands are set as 1~2048 Hz and 2049~22050 Hz, respectively, as demonstrated in the following equations:

$F_1(i) = \sum_{h=1}^{22050} f_i(h) \times W(h), \quad F_2(i) = \sum_{h=1}^{2048} f_i(h) \times W(h), \quad F_3(i) = \sum_{h=2049}^{22050} f_i(h) \times W(h),$   (1)

where $f_i$ is the $i$-th frame of audio data, and $W(h)$ is a window function at $h$ Hz. We convert the values of F1, F2 and F3 to R, G, and B, respectively, by a mapping function M, as seen in (2). The simplest mapping function is a normalization operation.

$(R, G, B) = M(F_1, F_2, F_3).$   (2)

A color map is then generated as shown in Fig. 1(b); the input song is "Let It Go". We can observe that the color map varies with time and that there are three similar regions in the color map. Therefore, a color image adaptive clustering method [18] is employed to find similar sections of the color image and classify them into several clusters. Fig. 1(c) shows the segmentation result of the color map obtained by RPCL. Without determining the cluster number in advance, the color map is classified into 3 clusters.
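The color-map construction can be summarized in a short sketch. The following Python snippet is an illustrative sketch rather than the authors' implementation: it frames the audio, computes the full-band, low-band and high-band spectral energies of each frame, and min-max normalizes them into one RGB triple per frame. The frame length, the Hann window and the use of min-max normalization as the mapping M are assumptions.

```python
import numpy as np

def music_color_map(samples, sr=44100, frame_len=2048):
    """Per-frame band energies (F1, F2, F3) mapped to RGB, one color per frame."""
    n_frames = len(samples) // frame_len
    feats = np.zeros((n_frames, 3))
    window = np.hanning(frame_len)
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len] * window
        spec = np.abs(np.fft.rfft(frame))                          # magnitude spectrum
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        feats[i, 0] = spec.sum()                                   # F1: full-band energy
        feats[i, 1] = spec[(freqs >= 1) & (freqs <= 2048)].sum()   # F2: 1~2048 Hz
        feats[i, 2] = spec[freqs > 2048].sum()                     # F3: 2049~22050 Hz
    # simplest mapping M: min-max normalization of each feature to [0, 1] -> (R, G, B)
    lo = feats.min(axis=0)
    span = feats.max(axis=0) - lo
    return (feats - lo) / (span + 1e-12)
```

Stacking the returned rows over time gives the color map: frames with a similar balance of band energies receive similar colors, which is what makes repeated sections such as choruses visually apparent.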

Chorus and Verse Designation

For each of the combined segments, its MFCCs are calculated. MFCCs provide a simple way to represent band-frequency energy, creating a simplified spectrum with 13 coefficients as follows:

$C_s(n) = \sum_{l=1}^{L} Y(l) \cos\left[\frac{\pi s}{L}\left(l - 0.5\right)\right], \quad s = 0, 1, \ldots, 12,$   (3)

where $C_s(n)$ is the $s$-th coefficient of the $n$-th frame, $Y(l)$ is the $l$-th filter bank output, which is representative of the critical bands in the human auditory system, and $L$ is the number of filter banks. $C_0(n)$ is the energy band, and the other twelve coefficients $C_s(n)$, $s = 1, 2, \ldots, 12$, are generally adopted in audio signal analysis. Fig. 1(d) demonstrates the MFCCs of the chorus sections. Then, (4) measures the similarity [19] of each cluster as follows:

$S(p, q) = \frac{1}{w} \sum_{n} C_p(n) \cdot C_q(n),$   (4)

where $w$ is the size of the cluster and $C_p$ and $C_q$ are MFCC coefficient vectors of segments within the cluster. To obtain a high similarity value, the coefficients in a cluster should be similar. In this step, the cluster with the greatest similarity value is regarded as the chorus because of the definition of a chorus [9]. A time-constraint mechanism is employed to enhance the accuracy of the result by removing any leftover fragments. Theoretically, a chorus should last for a certain amount of time, say 12 seconds. If a section is shorter than that, it is regarded as a miss detection and set as verse, as shown in Fig. 1(e). The black and white sections in Fig. 1(e) represent the verse and chorus, respectively.
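As a concrete illustration of this step, the sketch below computes 13 MFCCs per segment with librosa (used here as an off-the-shelf substitute for the MFCC computation in (3)) and scores a cluster of segments by the average pairwise similarity of their mean MFCC vectors. The cosine-style similarity, the helper names and the 12-second minimum-duration parameter follow the description above but are otherwise assumptions.

```python
import numpy as np
import librosa

def segment_mfcc(y, sr):
    """Mean 13-dimensional MFCC vector of one audio segment."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
    return mfcc.mean(axis=1)

def cluster_similarity(segments, sr):
    """Average pairwise similarity of the segments' MFCC vectors, in the spirit of (4)."""
    vecs = [segment_mfcc(seg, sr) for seg in segments]
    w = len(vecs)
    if w < 2:
        return 0.0
    score, pairs = 0.0, 0
    for i in range(w):
        for j in range(i + 1, w):
            a, b = vecs[i], vecs[j]
            score += np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            pairs += 1
    return score / pairs

def pick_chorus(clusters, sr, min_len_s=12.0):
    """Choose the most self-similar cluster; drop segments shorter than ~12 s."""
    best = max(clusters, key=lambda segs: cluster_similarity(segs, sr))
    return [seg for seg in best if len(seg) / sr >= min_len_s]
```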

B. Emotion Detection

The proposed scheme is shown in Fig. 2. First, the neural network classifier is trained to be our emotion detection classifier. In the preprocessing stage, several audio features are derived from the chorus of the song. These features are used to train a feed-forward back-propagation neural network, and the trained model is saved into the database. In the test stage, the aforementioned features are extracted and then mapped into 4 emotion classes. In the proposed emotion detection scheme, we adopt Thayer's 2-D emotion model, as shown in Fig. 3; each quadrant denotes one emotion class. The following are the details of the proposed emotion detection algorithm.

Preprocessing

In the preprocessing stage, a song is segmented into 20 ms frames with 50% overlap, and the Fast Fourier Transform (FFT) of each frame is calculated. The details of feature extraction are explained in the following.

Fig. 2 The proposed emotion detection algorithm.

Fig. 3 Thayer's emotion model.

Feature Extraction

The music intensity is introduced first. Instead of using the waveform amplitude in the time domain as the energy feature, the energy summation in the frequency domain is used:

$E(n) = \sum_{i=1}^{N} f(n, i),$   (5)

where $E(n)$ is the intensity of the $n$-th frame, $f(n, i)$ is the absolute value of the $i$-th FFT coefficient of the $n$-th frame, and $N$ represents the frame size.
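The preprocessing and the intensity feature together amount to a few lines of NumPy. The sketch below is illustrative only: the 20 ms frames with 50% overlap follow the preprocessing described above, while the Hann window is an assumption.

```python
import numpy as np

def frame_intensity(samples, sr=44100, frame_ms=20, overlap=0.5):
    """Segment into 20 ms frames with 50% overlap and compute E(n) = sum_i |FFT(n, i)|."""
    frame_len = int(sr * frame_ms / 1000)          # 20 ms -> 882 samples at 44.1 kHz
    hop = int(frame_len * (1 - overlap))           # 50% overlap
    window = np.hanning(frame_len)                 # analysis window (assumed)
    intensities = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))      # |FFT coefficients| of this frame
        intensities.append(spectrum.sum())         # E(n), Eq. (5)
    return np.asarray(intensities)
```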

Second, the feature extraction process of tempo and rhythm regularity includes three parts, as shown in Fig. 4(a). A detection function is employed to calculate the difference of the spectrum between adjacent frames. Equation (6) defines the detection function as

$D(n) = \sum_{k=1}^{N} \bigl|\, |f(n, k)| - |f(n-1, k)| \,\bigr|,$   (6)

where $n$ is the frame number, $k$ is the index of the FFT, and $N$ is the frame size. The result of the detection function is shown in Fig. 4(b), in which some significant peaks can be observed. These peaks are used to calculate the rhythm regularity. In order to detect these significant peaks, every peak whose amplitude is less than 50% of the maximum is set to zero, as shown in Fig. 4(c). The distances between neighboring peaks form a sequence, and the standard deviation of this sequence is calculated as the rhythm regularity.
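A compact sketch of the rhythm-regularity computation is given below. It assumes the complex spectral frames produced in the preprocessing step, and it interprets the 50% threshold as being relative to the largest peak of the detection function, which is an assumption.

```python
import numpy as np

def detection_function(spectra):
    """D(n): summed absolute spectral difference between adjacent frames, as in (6)."""
    mags = np.abs(spectra)                              # |f(n, k)| for each frame n
    return np.abs(np.diff(mags, axis=0)).sum(axis=1)    # one value per frame transition

def rhythm_regularity(d, rel_threshold=0.5):
    """Standard deviation of the distances between significant peaks of D(n)."""
    kept = np.where(d >= rel_threshold * d.max(), d, 0.0)   # zero peaks below 50% of max
    peaks = [i for i in range(1, len(kept) - 1)
             if kept[i] > 0 and kept[i] >= kept[i - 1] and kept[i] >= kept[i + 1]]
    if len(peaks) < 2:
        return 0.0
    gaps = np.diff(peaks)             # distances between neighboring peaks (in frames)
    return float(np.std(gaps))        # rhythm regularity feature
```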

Meanwhile, the DFT of the detection function is also calculated; the DFT transfers the detection function into the frequency domain. The peaks of the detection function form a periodic signal and can be approximated by a cosine wave, as shown in (7). Equation (7) describes the approximation function as

$A(n) = C \times \cos\!\left(2\pi \times f_0 \times \frac{n}{f_s}\right),$   (7)

where $C$ is a constant that represents the amplitude of $A(n)$, $n$ is the frame number, $f_0$ is the dominant frequency obtained from the DFT of the detection function, and $f_s$ represents the sampling rate. The result of the approximation function is shown in Fig. 4(d). The ratio between the number of peaks and the corresponding time duration is then calculated to obtain the beats per second (bps), which serves as our tempo feature. The fundamentals of acoustics [20] reveal that the octave is a useful feature for frequency analysis; therefore, the frequency band is divided into 8 sub-bands as our feature.
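The tempo estimate can be sketched directly from the detection function. The snippet below is illustrative; the frame rate used to convert frame indices to seconds is an assumption carried over from the 20 ms / 50%-overlap preprocessing. It estimates beats per second both by counting significant peaks over the analyzed duration and by reading the dominant frequency of the DFT of the detection function, i.e., the frequency of the cosine that best approximates the peak train.

```python
import numpy as np

def tempo_bps(d, frame_rate=100.0):
    """Beats per second from the detection function d (one value per frame transition).

    frame_rate is the number of frames per second of d; 100 corresponds to a 10 ms
    hop (20 ms frames with 50% overlap), which is assumed here.
    """
    # Option 1: ratio of the significant-peak count to the analyzed duration.
    mask = d >= 0.5 * d.max()
    peaks = [i for i in range(1, len(d) - 1)
             if mask[i] and d[i] >= d[i - 1] and d[i] >= d[i + 1]]
    bps_from_peaks = len(peaks) / (len(d) / frame_rate)

    # Option 2: dominant frequency of the DFT of the detection function.
    spectrum = np.abs(np.fft.rfft(d - d.mean()))
    freqs = np.fft.rfftfreq(len(d), d=1.0 / frame_rate)
    bps_from_dft = freqs[1:][np.argmax(spectrum[1:])]   # skip the DC bin

    return bps_from_peaks, bps_from_dft
```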

Fig. 4 Rhythm and tempo extraction. (a) Flowchart. (b) Example of the detection function. (c) Example of significant peak extraction. (d) Example of the approximation function.

Neural Network Classifier

The aforementioned features form an input vector. These vectors are used to train a feed-forward back-propagation neural network classifier. The output is an emotion vector with four classes: [Happy, Anxious, Depressed, Relaxed].
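As a rough illustration of the classifier stage, the sketch below trains a small feed-forward network with backpropagation-based learning using scikit-learn's MLPClassifier as a stand-in for the authors' own network; the hidden-layer size, the feature ordering and the label encoding are assumptions rather than values reported in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["Happy", "Anxious", "Depressed", "Relaxed"]   # the four output classes

def train_emotion_classifier(features, labels):
    """features: (n_songs, n_features) array, e.g. [intensity, tempo, rhythm regularity,
    8 octave sub-band energies]; labels: indices into EMOTIONS."""
    clf = MLPClassifier(hidden_layer_sizes=(16,),   # one small hidden layer (assumed)
                        activation="logistic",
                        solver="adam",              # gradient-based backpropagation training
                        max_iter=2000,
                        random_state=0)
    clf.fit(np.asarray(features), np.asarray(labels))
    return clf

def predict_emotion(clf, feature_vector):
    """Map one song's feature vector to one of the four emotion classes."""
    return EMOTIONS[int(clf.predict([feature_vector])[0])]
```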

IV. EXPERIMENTAL RESULTS

A. Chorus Detection

In this section, a test database that consists of 210 popular songs is used to evaluate the performance of chorus detection. All songs are sampled at 44100 Hz with 16 bits per sample. We quantitatively measure the recall (8) and precision (9) rates by comparing the detection results with the ground truth as follows:

$\text{Recall} = \frac{|S_d \cap S_g|}{|S_g|},$   (8)

$\text{Precision} = \frac{|S_d \cap S_g|}{|S_d|},$   (9)

where $S_d$ represents the segments detected by the proposed algorithm and $S_g$ represents the segments identified as chorus by a human. Table I shows the experimental results for the 210 popular songs. The precision is 83.93% and the recall rate is up to 95.27%.

Table I. Results of chorus detection
                  Recall     Precision
Chorus section    95.27%     83.93%

B. Emotion Detection

The test database has 210 popular songs, and the emotion of each test song is designated by human perception. 150 popular songs are used to train the neural network and 60 songs are used for testing. The precision of emotion detection for each class is shown in Table II. The average precision rate is 88.3%.

Table II. Precision of emotion detection for each class
Class        I        II       III      IV       Average
Precision    88.8%    83.3%    88.8%    88.8%    88.3%

C. Emotion Detection of Cover Songs

In popular music, it is common to hear a cover version of a song. In our experiment, we collect 10 cover songs which are sung in different languages. We want to figure out whether the same melody with different lyrics generates the same emotions under the proposed algorithm. Table III shows the result of the emotion detection of the 10 test cover songs. The precision rate is 88.89%.

Table III. The precision of emotion detection for cover songs
             Cover version
Precision    90%

V. CONCLUSION AND FUTURE WORK

A new method to classify popular songs by emotion is proposed. The proposed scheme includes two phases. First, a chorus detection algorithm is proposed to extract the choruses of songs. The second part of the proposed scheme finds the emotion of a song: audio features are trained by a neural network classifier and mapped into 4 emotion classes. A test database consisting of 210 popular songs is built. Experimental results show that the precision rates of chorus detection and emotion detection are 83.9% and 88.3%, respectively. In addition, an emotion detection experiment on cover songs is also conducted, and the results show that the proposed algorithm remains robust across different languages and lyrics. In the future, more high-level features are expected to help detect emotion more precisely.

REFERENCES

[1] N. Kosugi, Y. Nishihara, S. Kon'ya, M. Yamanuro, and K. Kushima, "Music retrieval by humming," in Proceedings of the Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 404-407, 1999.
[2] N. Kosugi, Y. Nishihara, T. Sakata, M. Yamamuro, and K. Kushima, "A practical query-by-humming system for a large music database," in Proceedings of the 8th ACM International Conference on Multimedia, pp. 333-342, 2000.
[3] M. Islam, H. Lee, A. Paul, and J. Baek, "Content-based music retrieval using beat information," in Proceedings of the International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 317-321, 2007.
[4] R. McNab, L. Smith, I. Witten, C. Henderson, and S. Cunningham, "Towards the digital music library: tune retrieval from acoustic input," in Proceedings of ACM Digital Libraries '96, pp. 11-18, 1996.
[5] S. Blackburn and D. DeRoure, "A tool for content based navigation of music," in Proceedings of the 6th ACM Multimedia, pp. 361-368, 1998.
[6] F. Kuo and M. Shan, "Music retrieval by melody style," in Proceedings of the International Symposium on Multimedia, pp. 613-618, 2009.
[7] T. Mulder, J. Martens, S. Pauws, F. Vignoli, M. Lesaffre, M. Lenman, B. Baets, and H. Meyer, "Factors affecting music retrieval in query by melody," IEEE Transactions on Multimedia, vol. 8, pp. 728-739, 2006.
[8] Y. Zhu, C. Xu, and M. Kankanhalli, "Melody curve processing for music retrieval," in Proceedings of the International Conference on Multimedia and Expo, pp. 285-288, 2003.
[9] D. Stein, Engaging Music: Essays in Music Analysis. New York: Oxford University Press, 2005.
[10] Y. Shiu, H. Jeong, and C.-C. Jay Kuo, "Similar segment detection for music structure analysis via Viterbi algorithm," in Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 789-792, 2006.
[11] M. A. Bartsch and G. H. Wakefield, "To catch a chorus: using chroma-based representations for audio thumbnailing," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 15-18, 2001.
[12] M. Cooper and J. Foote, "Automatic music summarization via similarity analysis," in Proceedings of the International Conference on Music Information Retrieval, pp. 81-85, 2002.
[13] M. Cooper and J. Foote, "Summarizing popular music via structural similarity analysis," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 127-130, 2003.
[14] J. Foote, "Visualizing music and audio using self-similarity," in Proceedings of ACM Multimedia, pp. 77-80, November 1999.
[15] W. Dowling and J. Harwood, Music Cognition, Academic Press, p. 202, December 1985.
[16] R. Thayer, The Biopsychology of Mood and Arousal, Oxford University Press, May 1989.
[17] G. Li, C. An, J. Pang, M. Tan, and X. Tu, "Color image adaptive clustering segmentation," in Proceedings of the Third International Conference on Image and Graphics, pp. 104-107, 2004.
[18] L. Xu and A. Krzyzak, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, July 1993.
[19] S. Grossberg, "Competitive learning: from interactive activation to adaptive resonance," Cognitive Science, vol. 11, pp. 23-63, 1987.
[20] A. Schutz and D. Slock, "Periodic signal modeling for the octave problem in music transcription," in Proceedings of Digital Signal Processing, pp. 1-6, 2009.
