DEEPSRGM - SEQUENCE CLASSIFICATION AND RANKING IN INDIAN CLASSICAL MUSIC WITH DEEP LEARNING

Sathwik Tejaswi Madhusudhan, University of Illinois, Urbana Champaign, [email protected]
Girish Chowdhary, University of Illinois, Urbana Champaign, [email protected]

ABSTRACT

A vital aspect of Indian Classical Music (ICM) is Raga, which serves as a melodic framework for compositions and improvisations alike. Raga recognition is an important music information retrieval task in ICM, as it can aid numerous downstream applications ranging from music recommendation to organizing huge music collections. In this work, we propose a deep learning based approach to Raga recognition. Our approach employs efficient preprocessing and learns temporal sequences in music data using Long Short Term Memory based Recurrent Neural Networks (LSTM-RNN). We train and test the network on smaller sequences sampled from the original audio, while the final inference is performed on the audio as a whole. Our method achieves accuracies of 88.1% and 97% during inference on the CompMusic Carnatic dataset and its 10-Raga subset respectively, making it the state of the art for the Raga recognition task. Our approach also enables sequence ranking, which aids us in retrieving melodic patterns from a given music database that are closely related to a presented query sequence.

1. INTRODUCTION

Carnatic and Hindustani music are the two main branches of Indian Classical Music (ICM), which underlies most of the music emanating from the Indian subcontinent. Owing to the contemplative and spiritual nature of these art forms and their varied cultural influences, a lot of emphasis is placed on melodic development. Raga, which governs various melodic aspects of ICM, serves as a framework for compositions and improvisations. Krishna et al. [20] define a Raga as the "collective melodic expression that consists of phraseology which is part of the identifiable macro-melodic movement". A major portion of contemporary Indian music, including film, folk and other forms, draws heavily on ICM [10, 12].

Numerous melodic attributes make Ragas distinctive in nature, such as the svara (roughly, a musical note), the gamaka (for example, an oscillatory movement about a given note, or a slide from one note to another [17]), the arohana and avarohana (upward and downward melodic movement), and melodic phrases/motifs [12, 20]. Emotion is another attribute that makes Ragas distinctive. Balkwill et al. [2] conducted studies on the perception of Ragas and observed that even an inexperienced listener is able to identify the emotions portrayed by various Ragas. Given the importance of Raga in ICM, machine learning (ML) based automatic Raga recognition can aid in organizing large audio libraries, labeling the recordings within them, and so on.

We believe automatic Raga recognition has tremendous potential in empowering content-based recommendation systems for ICM and contemporary Indian music. However, Raga recognition is not a trivial problem. There are numerous examples where two or more Ragas have the same or a similar set of notes but are worlds apart in the musical effect they produce, due to factors like the gamaka, temporal sequencing (which has to abide by the constraints presented in the arohana and avarohana), and the places of svara emphasis and rest. Raga identification is an acquired skill that requires significant training and practice. As such, automatic Raga classification methods have been widely studied. However, existing methods based on Pitch Class Distributions (PCD) disregard temporal information and are highly error-prone, while other methods (reviewed in detail in Section 2) depend heavily on preprocessing and handcrafted features, which limits their performance.
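To see why a pitch-class distribution discards exactly the information that distinguishes such Ragas, consider the minimal sketch below. It is an illustration under our own assumptions, not code from the paper: two toy phrases that use the same notes in opposite orders, loosely analogous to an arohana and an avarohana, produce identical histograms.

```python
import numpy as np

def pitch_class_distribution(midi_notes):
    """Collapse a note sequence into a normalized 12-bin pitch-class histogram (a PCD)."""
    pitch_classes = np.asarray(midi_notes) % 12        # octave information is discarded
    hist = np.bincount(pitch_classes, minlength=12)    # temporal ordering is discarded
    return hist / hist.sum()

# Two toy phrases with the same notes in opposite orders (ascent vs. descent):
ascent = [62, 64, 66, 69, 71, 74]
descent = [74, 71, 69, 66, 64, 62]
# Their PCDs are identical, so a purely PCD-based classifier cannot tell them apart.
assert np.allclose(pitch_class_distribution(ascent), pitch_class_distribution(descent))
```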
In this work, we introduce a deep learning based solution to Raga recognition and Raga content-based retrieval. First, we reformulate the problem of Raga recognition as a sequence classification task performed using an LSTM-RNN based architecture with attention. We then show that the same network, with minimal fine-tuning based on the triplet margin loss [27], can be used for sequence ranking. We introduce sequence ranking as a new sub-task of automatic Raga recognition. In this configuration, the model can be used to perform Raga content-based retrieval: a user provides the model with a query sequence, and the model returns sequences that are closely related to the presented query. This can aid downstream applications such as recommendation systems.

Deep learning has proven to be an indispensable asset owing to its flexibility, scalability, learning abilities, end-to-end training, and, most importantly, its unprecedented success in modeling unstructured spatial and temporal data over the past few years. Our work draws inspiration from the successes in [26, 27], which introduce a convolutional neural network based recommendation system for music that overcomes various drawbacks of traditional recommendation systems such as collaborative filtering and bag-of-words methods, and which leverage LSTM-RNN and attention based models that have proven extremely effective in tasks like sequence classification, sequence-to-sequence learning, neural machine translation, and image captioning, all of which directly deal with sequences.

To summarize, the main contributions of our work are as follows:

• We present a new approach to Raga recognition using a deep learning based LSTM-RNN architecture.

• With our approach we obtain 97.1% accuracy on the 10-Raga classification task and 88.1% accuracy on the 40-Raga classification task on the CompMusic Carnatic Music Dataset (CMD), hence improving the state of the art on the same.

• We introduce sequence ranking as a new sub-task of Raga recognition, which can be used to build Raga content-based recommendation systems.

Note that we refer to our Raga recognition (i.e., sequence classification) model as SRGM1 and to the sequence ranking model as SRGM2 throughout the paper.

Figure 1. Preprocessing steps and model architecture for SRGM1 (refer to Section 3). The audio passes through deep learning based audio source separation, pitch tracking, tonic normalization and note quantization, and random sampling; the model consists of a word embedding (embedding dimension 128), an LSTM layer (hidden size 768), an attention layer, two dense layers, and a softmax producing the predicted class label.
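For concreteness, the following is a minimal PyTorch-style sketch of the SRGM1 stack outlined in Figure 1, assuming the layer sizes shown there (embedding dimension 128, LSTM hidden size 768, an attention layer, two dense layers, softmax). The attention formulation, the width of the first dense layer, the vocabulary size, and the sequence length are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRGM1(nn.Module):
    """Sketch of the Figure 1 stack: embedding -> LSTM -> attention -> dense layers -> softmax."""

    def __init__(self, vocab_size, num_ragas, embed_dim=128, hidden_size=768, dense_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)           # embedding dimension 128
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)  # hidden size 768
        self.attn = nn.Linear(hidden_size, 1)                          # per-timestep attention score
        self.dense1 = nn.Linear(hidden_size, dense_dim)                # dense layer 1 (width assumed)
        self.dense2 = nn.Linear(dense_dim, num_ragas)                  # dense layer 2 -> Raga classes

    def forward(self, note_ids):                      # note_ids: (batch, seq_len) integer note tokens
        x = self.embedding(note_ids)                  # (batch, seq_len, 128)
        h, _ = self.lstm(x)                           # (batch, seq_len, 768)
        w = F.softmax(self.attn(h), dim=1)            # attention weights over time steps
        context = (w * h).sum(dim=1)                  # attention-weighted summary of the sequence
        z = F.relu(self.dense1(context))
        return F.log_softmax(self.dense2(z), dim=-1)  # log-probabilities over Raga classes

# Example: classify a batch of two tokenized note subsequences among 40 Ragas.
model = SRGM1(vocab_size=200, num_ragas=40)
scores = model(torch.randint(0, 200, (2, 1500)))      # -> tensor of shape (2, 40)
```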
2. RELATED WORKS

Broadly speaking, ML methods for Raga classification are either PCD based or sequence based. Many previous works have focused on PCD based methods for Raga recognition [6, 7, 18], with Chordia et al. [4] currently holding the best performing results among PCD based methods at an accuracy of 91%. [17] presents an in-depth study of various distribution based methods for Raga classification. Although an intuitive and effective approach, the major shortcoming of PCD based methods is that they disregard temporal information and are hence error-prone. To overcome this, previous works have used Hidden Markov Model based approaches [19], convolutional neural network based approaches [21], and arohana-avarohana based approaches [23], among various other methods.

[9, 24] employ a Raga phrase-based approach for Raga recognition. [11] uses an interesting phrase-based approach inspired by the way in which seasoned listeners identify a Raga; this method is currently the state of the art on a 10-Raga subset of CMD. [12] introduces the Time Delayed Melodic Surface (TDMS), a feature based on delay coordinates that is capable of describing both the tonal and the temporal characteristics of the melody of a song. With TDMS, the authors achieve state-of-the-art performance on the CMD.

While PCD based methods ignore temporal information, many other works assume that the audio is monophonic. Works like [19, 23] attempt to take advantage of temporal information by extracting the arohana and avarohana. Although the arohana and avarohana are critical to identifying a Raga, they do not contain as much information as the audio excerpt itself. Moreover, most previous works rely heavily on preprocessing and feature extraction, much of it handcrafted, which can prove to be a limitation as it only gets harder to devise ways to extract complex and rich features.

3. RAGA RECOGNITION AS A SEQUENCE CLASSIFICATION PROBLEM (SRGM1)

Sequence classification is a predictive modeling problem in which an input, a sequence over space or time, is to be classified as one among several categories. In modern NLP literature, sentiment classification is one such example, where models predict the sentiment of sentences. We argue that Raga recognition is closely related to sequence classification, since any audio excerpt in ICM involving vocals or melody instruments can be broken down into a sequence of notes spread over time. To reformulate Raga recognition as a sequence classification task, we treat the notes obtained via predominant melody estimation as words, the set of all words as the vocabulary, and each Raga as a class. We detail our approach to Raga recognition below.

3.1 Preprocessing

3.1.1 Deep Learning based audio source separation

Audio recordings available as part of CMD accurately represent live ICM concerts, with vocalists accompanied by various string, wind, and percussion instruments. However, for the purposes of Raga recognition, analyzing the vocals is sufficient.
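To make the reformulation of Section 3 concrete, the sketch below shows one plausible way to build the note vocabulary, encode a quantized note sequence as integer "words", and randomly sample fixed-length subsequences for training (the "Random Sampling" step of Figure 1), with inference performed on the recording as a whole. The helper names and the subsequence length are hypothetical choices of ours, not taken from the paper.

```python
import random

def build_vocab(note_sequences):
    """Assign an integer id to every distinct quantized note: the 'words' of the formulation."""
    notes = sorted({n for seq in note_sequences for n in seq})
    return {note: idx for idx, note in enumerate(notes)}

def encode(seq, vocab):
    """Map a quantized note sequence onto the integer tokens consumed by the classifier."""
    return [vocab[n] for n in seq]

def sample_subsequences(tokens, num_samples, length=1500):
    """Randomly sample fixed-length subsequences from one recording for training;
    final inference is instead performed over the recording as a whole."""
    starts = [random.randrange(0, len(tokens) - length) for _ in range(num_samples)]
    return [tokens[s:s + length] for s in starts]

# Toy example standing in for a tonic-normalized, quantized pitch track of one recording.
recording = [random.randrange(0, 36) for _ in range(10000)]
vocab = build_vocab([recording])
tokens = encode(recording, vocab)
training_batch = sample_subsequences(tokens, num_samples=4)   # four training subsequences
```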