Pop Music Highlighter: Marking the Emotion Keypoints
Huang, Y.-S., et al. (2018). Pop Music Highlighter: Marking the Emotion Keypoints. Transactions of the International Society for Music Information Retrieval, 1(1), pp. 68–78. DOI: https://doi.org/10.5334/tismir.14

RESEARCH

Pop Music Highlighter: Marking the Emotion Keypoints
Yu-Siang Huang*,†, Szu-Yu Chou*,† and Yi-Hsuan Yang*
* Research Center of IT Innovation, Academia Sinica, TW
† Graduate Institute of Networking and Multimedia, National Taiwan University, TW
Corresponding author: Yu-Siang Huang ([email protected])

The goal of music highlight extraction, or thumbnailing, is to extract a short consecutive segment of a piece of music that is somehow representative of the whole piece. In a previous work, we introduced an attention-based convolutional recurrent neural network that uses music emotion classification as a surrogate task for music highlight extraction, assuming that the most emotional part of a song usually corresponds to the highlight. This paper extends our previous work in the following two aspects. First, methodology-wise we experiment with a new architecture that does not need any recurrent layers, making the training process faster. Moreover, we compare a late-fusion variant and an early-fusion variant to study which one better exploits the attention mechanism. Second, we conduct and report an extensive set of experiments comparing the proposed attention-based methods to a heuristic energy-based method, a structural repetition-based method, and three other simple feature-based methods. Due to the lack of public-domain labeled data for highlight extraction, following our previous work we use the RWC-Pop 100-song data set to evaluate how the detected highlights overlap with any chorus section of the songs. The experiments demonstrate the superior effectiveness of our methods over the competing methods. For reproducibility, we share the code and the pre-trained model at https://github.com/remyhuang/pop-music-highlighter/.

Keywords: Music thumbnailing; highlight extraction; chorus detection; structure analysis; convolutional neural network; attention mechanism

Introduction
With the growing amount of multimedia data available on the Internet, the ability to efficiently browse this data is important. Music highlight extraction, or music thumbnailing, is the task of finding a short, continuous segment of a piece of music that can nicely represent the whole piece. It can be understood as an audio preview chosen by machines. Successful algorithms for music highlight extraction are useful for many music information retrieval (MIR) tasks, such as indexing, retrieval (Lee et al., 2014), trial listening (Goto, 2003), radio podcasts (Mehrabi et al., 2017) and DJing (Bittner et al., 2017). For example, we can use highlight extraction as a pre-processing step to pick a representative segment from each song to facilitate the subsequent labeling, processing or analysis, instead of dealing with the whole song or taking a random segment per song (e.g., the middle 30 seconds).

The selection of the highlighted music piece is a subjective matter and different people may have different opinions. However, for Pop/Rock music, there is a stronger agreement that the chorus (refrain) section of a song is a representative sample of the whole song.1 It is a common practice to play the chorus when introducing a Pop music hit chart or when using music in commercials (Goto, 2003). A user study conducted by Meintanis and Shipman III (2008) showed that, for getting familiar with a song, the chorus section was regarded as the most important part of a song, compared with other parts such as the intro or the verse. Some MIR researchers view chorus sections as the most memorable and emotional part of a song (Goto, 2006; Eronen, 2007). By extracting features only from the chorus (instead of the verse or the whole song), Wang et al. (2013) demonstrated improved accuracy in recognizing song-level arousal- and dominance-related emotions.

Related Work
Due to the absence of well-defined and public-domain labeled data for music highlights, an important stream of related work focuses on chorus detection instead. Many previous researchers assumed that the most-repeated patterns or melody motifs correspond to the chorus sections. They firstly used methods such as the self-similarity matrix (SSM) (Cooper & Foote, 2003; Bartsch and Wakefield, 2005) or the hidden Markov model (HMM) (Logan & Chu, 2000; Peeters et al., 2002) for segmenting a music piece into several parts, and then analyzed (e.g., by clustering) the resulting segments to identify the most frequent ones. Commonly used feature representations include time-frequency representations such as the spectrogram and the chromagram (Müller and Ewert, 2011). Because the chorus repeats several times in a song, it is still necessary to employ some method to pick one of the repetitions as the highlight. A simple solution is to randomly pick one, if no other heuristics are used. We see two limitations in such an approach. First, the most-repeated part of a song may not always correspond to the chorus; it can also be the verse or even other short melodies. Second, there are still some variations between different chorus sections of a song (such as key change (Goto, 2006)) and it is better to give these sections different "highlight scores" to distinguish them. In other words, a classifier or a regression model is needed. However, little work has been performed regarding this, to the best of our knowledge, again due to the lack of labeled data for training a supervised model that estimates such highlight scores.
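As a rough illustration of this repetition-based line of work (and not of the method proposed in this paper), the sketch below shows how such a pipeline could be assembled with an off-the-shelf library. It assumes librosa is available; the fixed number of segments and the off-segment similarity score used to rank segments are placeholder heuristics, not choices taken from the cited work.

```python
# Illustrative repetition-based candidate ranking (not the proposed method).
import numpy as np
import librosa

def repeated_segment_candidates(audio_path, n_segments=10):
    """Segment a song and rank segments by how much their content recurs elsewhere."""
    y, sr = librosa.load(audio_path, sr=22050)

    # Chromagram: a common time-frequency representation for repetition analysis.
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)

    # Self-similarity (recurrence) matrix over chroma frames.
    ssm = librosa.segment.recurrence_matrix(chroma, mode='affinity', sym=True)

    # Contiguous structural segments (a stand-in for the HMM- or
    # clustering-based segmentation used in the cited work).
    bounds = librosa.segment.agglomerative(chroma, n_segments)
    bounds = np.concatenate([bounds, [chroma.shape[1]]])

    # Score each segment by its average similarity to frames outside it:
    # segments whose content recurs elsewhere (e.g., a repeated chorus)
    # receive higher scores.
    scores = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        outside = np.ones(ssm.shape[1], dtype=bool)
        outside[start:end] = False
        scores.append(float(ssm[start:end][:, outside].mean()))

    times = librosa.frames_to_time(bounds, sr=sr)
    return list(zip(times[:-1], times[1:], scores))
```

Note that such a procedure only ranks repeated material; as argued above, the top-ranked segment is not guaranteed to be the chorus, and the ranking heuristic does not by itself provide the kind of learned "highlight score" discussed next.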
Another important stream of related work attempts to detect highlights by leveraging traces of user behavior logged by online music services. For example, working on electronic dance music (EDM), Yadati et al. (2014) aimed to detect drops, the specific moments where there is a drastic change in a song. Such drops are usually associated with strong emotional release. Because a drop point often implies the most interesting part of an EDM song, we can take the segment starting from a drop point as the highlight. Yadati et al. achieved drop detection by feeding the spectrogram, Mel-Frequency Cepstral Coefficients (MFCC) and features related to rhythm to a support vector machine (SVM). The labeled data needed to train the SVM were mined from SoundCloud,2 which allows users to leave "timed comments" annotating specific moments (e.g., the drop point) of a song. This idea seems to work for EDM, but for other music genres it is unclear whether such timed comments are available.

Another approach utilizing user logs was presented by Bittner et al. (2017). It has been pointed out3 that people may "scrub" (i.e., move) the playhead of a music player to start a song at a specific moment, and such moments usually occur just before the best parts of the song. Therefore, if one has access to scrubbing behavior data as Bittner et al. did, highlight extraction can be performed by simply running a peak picking algorithm over the scrubbing data to find the key moments. In this crowd-sourced method, no classifiers are needed. Streaming companies such as Spotify4 can anonymously log such scrubbing behavior data. However, to our knowledge such data are not yet open to the research community. Moreover, a classifier may still be needed if we want to have the highlight of unpopular songs (i.e., songs in the "long tail").

We took a different approach in a previous work (Huang et al., 2017b) by utilizing the possible connections between emotions and song highlights. The idea is to firstly use a data set with emotion labels to train a neural network model for music emotion classification (Yang and Liu, 2013), and then to add on top of the network a so-called "attention mechanism" to weigh the contribution of different short time chunks of a song (e.g., each 3 seconds in length) in predicting the song-level emotion labels of that song. The attention mechanism has been widely used in many natural language processing (NLP) problems (Bahdanau et al., 2014; Shen et al., 2017; Gehring et al., 2017; Vaswani et al., 2017) to learn to "pay attention" to specific parts of an input. We used an attention mechanism to assign "attention scores" (i.e., weights) to audio chunks to facilitate a weighted aggregation of the emotion predictions from different chunks in making the final song-level prediction, as illustrated in Figure 1(a). While the model was trained using emotion labels only (i.e., not using any annotations of the chorus sections or the highlights), we investigated whether peaks of the attention scores help detect the chorus sections of songs. A preliminary experiment showed that this idea is promising (Huang et al., 2017b).
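To make the chunk-level, late-fusion attention idea of Figure 1(a) more concrete, the following is a minimal PyTorch sketch. It is only an illustration of the general mechanism, not the released model: the chunk encoder, the dimensionalities (feat_dim, hidden_dim, n_emotions) and the single-layer attention head are placeholder assumptions.

```python
# Minimal sketch of late-fusion attention over audio chunks (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkAttentionClassifier(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=64, n_emotions=8):
        super().__init__()
        # Placeholder chunk encoder; a real system would encode each short
        # audio chunk (e.g., ~3 seconds) with a convolutional network.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.emotion_head = nn.Linear(hidden_dim, n_emotions)   # per-chunk emotion logits
        self.attention_head = nn.Linear(hidden_dim, 1)          # per-chunk attention score

    def forward(self, chunks):
        # chunks: (batch, n_chunks, feat_dim), one feature vector per chunk
        h = self.encoder(chunks)                                 # (B, T, H)
        chunk_pred = self.emotion_head(h)                        # (B, T, n_emotions)
        attn = F.softmax(self.attention_head(h), dim=1)          # (B, T, 1), sums to 1 over chunks
        # Late fusion: the song-level prediction is the attention-weighted
        # sum of the per-chunk predictions.
        song_pred = (attn * chunk_pred).sum(dim=1)               # (B, n_emotions)
        return song_pred, attn.squeeze(-1)

# Training uses only song-level emotion labels (a loss on song_pred); at
# inference time the attention scores can be reused as per-chunk
# "highlight scores", e.g., by taking the chunks around their peak.
```

Because the attention weights receive gradients only through the song-level emotion loss, this setup mirrors the indirect supervision described above: no chorus or highlight annotations are involved in training.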
Figure 1: Architecture of two attention-based models using different fusion methods for highlight extraction. We note that model (a) was used by Huang et al. (2017b) and model (b) by Ha et al. (2017).

In parallel to our work, Ha et al. (2017) independently proposed a similar attention-based neural network model for music highlight extraction. There are three notable differences between their model and ours. First, they used music genre classification as the surrogate task, while we used emotion classification. Their attention mechanism assessed how each audio chunk contributes to predicting the genre of a song. Second, while we made an emotion prediction per audio chunk

Music highlight extraction has not received much