Tamil lyrics corpus: Analysis and Experiments

Dhivya Chinnappa Praveenraj Dhandapani Thomson Reuters Intellect Design Arena Ltd. [email protected] [email protected]

Abstract ernment buildings in , . Polit- ical leaders, orators and even common people In this paper, we present a new Tamil quote Thirukkural excerpts in their daily lives. lyrics corpus extracted from Tamil movies Since the inception of the Tamil movie captured across a range of 65 years (1954 industry in the early 1930s, Tamil literary to 2019). We present a detailed corpus analysis showing the nature of Tamil lyrics works have taken a different form. From the with respect to lyricists and the year which very beginning, Tamil movies used the plat- it was written. We also present similar- form to showcase the language’s rich litera- ity score across different lyricists based ture through movie dialogues and song lyrics. on their song lyrics. We present experi- Tamil literary works as old as Sangam litera- mental results based on the SOTA BERT ture have been depicted in early movies, and Tamil models to identify the lyricists of a the recent movies showcase a mix of old and song. Finally, we present future research directions encouraging researchers to pur- contemporary literary works. Thus, Tamil sue Tamil NLP research. movies have become an inseparable part of most Tamilians’ lives. While independent lit- 1 Introduction erary works (poems, hykoos, novels, etc.) ex- ist, Tamil movies consistently act as impor- Tamil is a classical Dravidian language spoken tant medium in showcasing rich Tamil literary in the states Tamil Nadu and Pondicherry, In- works. dia. It is also widely spoken in Srilanka and The ideologies portrayed in Tamil movies is one of the official languages of the country. closely reflect Tamilians’ lifestyles. Tamil There are considerable Tamil speaking people movies and Tamil politics go hand in hand in countries like Singapore, Malaysia, Canada, (Kalorth, 2021). There are several people in and Mauritius. the Tamil movie industry with strong political Historians note that Tamil is as old as 2600 opinions, and it is not uncommon for Tamil years. The earliest identified movie dialogues and Tamil songs to reflect known as the Sangam literature dates back their political ideologies. to 600 B.C. (Abraham, 2003). Tamilians as The Tamil movie industry releases a movie an ethnic group appreciate literary works and every four days (Krishnan and Sakkthivel, refer to literary excerpts in their daily lives. 2010). That is, on average there are 90 movies 1 For example, Thirukkural , one of the popu- released every year. Similar to other Indian lar literary works, also called the Ulaga Pothu language movies, Tamil movies include several Marai (translation: Universal book of princi- songs. Each song in a Tamil movie is written ples) includes short literary phrases emphasiz- by a lyricist coordinating with the movie di- ing on principles for a prosperous life. Schools rector and the music director. The movie di- teach Thirukkural to their students as early rector usually describes the situation and the as kindergarten. Thirukkural excerpts are of- emotion where the song is supposed to fit in. ten found in public transportations and gov- Depending on the given situation by the movie 1https://thirukkural133.wordpress.com/ director, the lyricist writes the song lyrics, and contents/ the music director composes the music for the

1

Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 1–9 April 20, 2021 ©2021 Association for Computational Linguistics song. Though the lyrics of a Tamil movie song Thamizhan5 from the movie Mersal describes can stand alone, they are usually an integral the beauty of the language Tamil and glorifies part of the movie enhancing an emotion. Tamilians life styles. In this paper, we introduce a new Tamil Recently with the advent of youtube, there lyrics corpus2. Specifically, we focus only on are several songs that get popular even before Tamil movie lyrics and not Tamil lyrics by in- the movie release. While releasing songs be- dependent lyricists. The main contributions fore releasing a movie has been a practice in of this paper are (a) a corpus including Tamil the Tamil movie industry for a while, youtube song lyrics, lyricists, and other details; (b) has taken these songs sooner to the audience a detailed corpus analysis focusing on Tamil than ever before. Until recently, all movie lyrics; (c) experimental results to identify the songs were released together in a Compact lyricist of a given song lyric; and (d) future Disk, cassette tape or an online streaming site. research directions about how to utilize the However, in the past few years Tamil movies Tamil lyrics dataset. have started releasing singles from movies. These songs typically include a catchy lyric, 2 Background music or dance moves. They act as a promo- tion for the movie and usually have millions of Songs are an integral part of Tamil movies. views before the movie release. Kutty story6 They represent, enhance, and complement an from the movie Master and Bujji7 from the emotion in the movie. We explain the role of movie Jagamae thanthiram fall under this cat- these songs and exemplify with some recent egory. Interestingly, the song has Tamil movie songs. large amount of English songs, which is a phe- One of the most common emotion expressed nomena observed in many recent Tamil songs. in Tamil songs is romance. This includes We present youtube links in the footnote falling in love, breakup, marriage, etc. For to each song exemplified in this section. The example, the song Nenjukul peidhidum3 from English translation of the song lyrics can be the movie Vaaranam Aayiram describes a sit- viewed by turning on the subtitles in the uation when a man meets a woman for the youtube video. first time. This song emphasizes the emotion How are Tamil song lyrics written? In of love at first sight. The visualization and most cases, the movie director comes up with the lyrics are dramatic, and do not depict the a situation in the movie.. The music director actual story of the movie. While the flow of composes a tune capturing the emotion of this the movie is not affected by the song, it plays situation. The lyricist writes the lyrics, given a vital role in emphasizing a man’s emotion the tune. However, there are few music direc- towards a woman the first time he sees her. tors who compose tune to a given lyric. Some Tamil movie songs are created to complement the scenes in the movie. The 3 Related work movie would not be complete without the song. These songs include sequential scenes from the There exists a Tamil lyrics dataset (Subrama- movie. For instance, the song Kadhaipoma4 nian, 2020) in Kaggle for Tamil lyrics. The from the movie Oh my kadavulae describes the dataset includes the lyrics of Tamil songs, emotion of taking someone for granted while alongside song name and movie name. How- aptly fitting in the movie scenes. ever, we could not track any further experi- There are several Tamil movie songs writ- ments or research related to this dataset. It ten with political and social intentions. Tamil does not include the name of the lyricist or actors with political ideologies often hint their other details. Unlike this dataset, our corpus political intentions in their songs. Recently, includes the name of the lyricist, the name of there are several songs released emphasiz- the music director, and the year the song was ing on social equity. The song Aalaporan released. We also include the English translit-

2https://github.com/praveenraj0904/tamillyricscorpus 5Aalaporan Thamizhan 3Nenjukul peithidum 6Kutty Story 4Kadhaipoma 7Bujji

2 erated song lyrics. We believe these informa- Statistics tion would be useful to researchers. Songs 5,449 There are some prominent Tamil NLP works Lyricists 324 associated with Tamil lyrics. Sundara Kan- Music directors 151 chana and Ganapathy (2017) work with 1, 000 Movies 1,147 Tamil songs to identify three aspects for a Words 1,309,425 given Tamil song lyric. The three aspects are Unique words 129,614 base, mood, and style. The base aspect in- Years range 1954-2019 cludes the types character, festival, nature, Maximum words/song 653 relationship, romance, occasion, spiritual, pa- Minimum words/song 17 triotic and misc. The emotion aspect includes Average words/song 188 the types happy, sad, excited, tender, scared Table 1: Statistics of the Tamil lyrics dataset and angry. The styles aspect includes tradi- tional, folk, contemporary and mixed. Their work is word based, and they use TF-IDF as and tune; and (ii) context text and images. In a feature in their classifier to predict these as- their next work (Sridhar et al., 2016) they fol- pects. low a neuro-linguistic approach for lyric gen- Ramakrishnan A et al. (2009) work on auto- eration intending to add dialect, image, and matically generating Tamil lyrics for melodies. author specific features. In their recent work They approach the problem as a sequence la- (Sridhar et al., 2018) they generate lyrics, com- beling problem targeting to guess the syllabic bining their two previous works. They make pattern. Then they forward this pattern to a use of visual images, derived tune, and other sentence generation module, where they use specification such as the dialect of the speaker. Dijkstra’s algorithm to choose a meaningful Unlike the works described, our work tries to phrase. In their latter work Ramakrishnan A capture the characteristics of the lyric and the and Lalitha Devi (2010), follow a mapping lyricist. In an analysis perspective, our work is scheme for matching melody with words. They analogous to (Ranganathan et al., 2011). We also use a knowledge-based text generation al- go beyond, and present analysis with respect gorithm on an existing Ontology and Tamil to the year the lyrics were written, and simi- Morphology Generator. larity between pairs of lyricists. E. et al. (2011) design a pleasantness scor- 4 Corpus creation ing model based not only based on phonetic aspects. Ranganathan et al. (2011) present We create a corpus of Tamil lyrics from analysis over a given set of lyrics. In their www.allnewlyrics.com. The corpus consists later work, they (Ranganathan et al., 2013) of 5,449 Tamil songs lyrics and their English present a visualization tool for lyrics based on transliteration. Apart from the lyrics of a song, the characteristics of the lyrics. the corpus includes the name of the lyricist, Sridhar et al. (2013) have extensively music director, movie, and the year the song worked in automatic Tamil lyrics generation. was released. We also present a link to the In their first work, (Sridhar et al., 2013) they web page where the data was extracted. The follow a trigram approach using the emotion statistics of the corpus are presented in Table 1. of a given scene. The use this emotion as a The corpus includes songs written by 324 lyri- seed word and follow Tamil grammatical rules cists across 1,147 movies for which the music to generate lyrics. In their next work, Sridhar was composed by 151 music directors. et al. (2014) they follow a similar approach get- From a linguistics perspective, lyrics are the ting 77% accuracy. Following that, they (Srid- most interesting. The lyrics of a song reflects har et al., 2015) intend to generate Tamil lyrics the () situation in the movie, (ii) writing style given a sequence of images and a derived tune. of the lyricist, (iii) language style during which They generate a tune automatically based on the lyrics were released, and (iv) the emotion carnatic music. Their system generates two of the song. On an average each song includes sets of Tamil lyrics based on (i) context text 188 words ranging from minimum 17 words to

3 Lyricist Music director Movie Year M.S. Vish- Kasethan Kadavu- 1972 wanathan lada Tamil ெம쯍ல ேப毁柍க쿍 பிற쏍 ேக翍க 埂டா鏁 ெசா쯍쮿鏍 தா쏁柍க쿍 யா쏁믍 பா쏍埍க 埂டா鏁 [...] English Mella pesungal pirar ketkka koodadhu solli thaarungal yaarum paarkka koodadhu [...] Lyricist Music director Movie Year A. R. Rahman Love Birds 1996 Tamil மலா்கேள மலா்கேள இ鏁 எꟍன கனவா மைலகேள மைலகேள இ鏁 எꟍன நிைனவா [...] English Malargalae malargalae idhu enna kanavaa malaigalae malaigalae idhu enna ninaivaa aa… [...] Lyricist Music director Movie Year Na. Muthu Yuvan Shankar Kaadhal kondein 2003 Kumar Raja Tamil ெந篍ேசா翁 கலꏍதி翁 உறவாேல கால柍க쿍 மறꏍதி翁 அꟍேப நிலேவா翁 ெதꟍற쯁믍 வ쏁믍 ேவைள காய柍க쿍 மறꏍதி翁 அꟍேப [...] English Nenjodu kalanthidu uravaalae kaalangal maranthidu anbae nilavodu thendralum varum velai kaayangal maranthidu anbae [...] Lyricist Music director Movie Year Gangai Ama- Ilayaraja Paneer Pushpangal 1981 ran Tamil ேகாைட埍கால கா쟍ேற 埁ளி쏍鏍ெதꟍற쯍 பா翁믍 பா翍ேட மன믍 ேத翁믍 毁ைவேயா翁 தின믍ேதா쟁믍 இைசபா翁 [...] English Kodai kaala kaatrae kulir thendral paadum paattae manam thedum su- vaiyodu dhinandhorum isai paadu [...] Lyricist Music director Movie Year Darbuka Siva Enai Noki Paayum 2019 Thota Tamil எ鏁வைர ேபாகலா믍 எꟍ쟁 நீ ெசா쯍லேவ迍翁믍 எꟍ쟁தாꟍ விடாம쯍 ேக翍கிேறꟍ [...] English Ethuvarai pogalaam endru nee sollavendum endruthaan… vidamal ketkiren… [...]

Table 2: Examples from the Tamil lyrics corpus. Each instance includes the name of the lyricist, music director, movie, and the year the movie was released. Additionally, the Tamil lyrics and the equivalent English transliterated lyrics is also available. maximum 653 words. The total number of tion of the data in our corpus and may not re- Tamil words for all songs combined in the cor- flect the contribution of a lyricist to the Tamil pus is more than a million. The total number movie or music industry8. of unique Tamil words in the corpus is close to Figure 1 depicts the number of songs written 130, 000. Table 2 presents examples from the by a lyricist across different years. Evidently, corpus. Vaali has written songs across a wide range of years starting from the mod 1960s to the late 5 Corpus Analysis 2010s. Vairamuthu has reached the peak of his lyric writing in the early 2000s. Interest- We present detailed corpus analysis focusing on the Tamil lyrics our corpus. All statistics 8All statistics correspond to the corpus, and may presented in this paper represent the distribu- not reflect the contribution of a lyricist in real time.

4 Lyricist No. Top 10 words songs All (with 5449 நீ, எꟍ, நாꟍ, உꟍ, ஒ쏁, எꟍன, காத쯍, வா, தாꟍ, எꟍைன stopwords) All (No 5449 தா, ேபா鏁, ேமேல, பா쏍鏍鏁, ெபா迍迁, இவꟍ, உலக믍, ேதꟍ, stopwords) க迍ேண, 毁க믍 Vaali 873 தா, வꏍத鏁, நா쿁믍, ேதꟍ, மாமா, ெதா翍翁, ெம쯍ல, பி쿍ைள, வா폁믍, அவꟍ 797 க迍ேடꟍ, உ쿍ள믍, ெகா迍ட, வꏍத鏁, அவ쿍, ஆட, க迍ணா, பி쿍ைள, க迍翁, ஒேர Vairamuthu 390 ேநா, 믁鏍த믍, க迍ேண, க迍翁, ஆ迍, உலக믍, கிளி, ꯂமி, சிாி, ச鏍த믍

Na. Muthu 340 இதய믍, உயிேர, இவꟍ, உலக믍, ராஜா, அவ쿍, பற埍埁鏁, எ柍埁믍, Kumar ேவ迁믍, 믁ꟍேன Gangai 279 மாமா, ஊ쏁, ெபா迍迁, ேக翍翁, 埁யிேல, ேக쿁, கைத, ஒ迍迁, Amaran ராக믍, ராசா Pa.Vijay 263 ேபபி, 믁த쯍, ேநா, ேபா翁, வாퟍ, 毁믍மா, தானா, தி쯍, 믁னி, தீ迍ட 179 ஜி埍கி, அழ埁, ꯁ쿍ள, அவ, ேபா, ேபச, க迍迁, ஆச, வாழ, அꯍபா Madhan 163 ெவறி, ைம, ராஜா, ராஜ, அவ쿍, ச母சிꟍ, ேமேல, தல, ஏ鏍தி埍க, நா믍 Karky Thamarai 107 ஏேனா, யாேரா, அவ쿍, 믁த쯍, நிலா, என, எைன, ேமக믍, ேமேல, தா迍羿 Viveka 105 ர埍கரா, கப羿, சி埍埁, ட埍埁ன埍கா, ம翍ட埍埁, ேகா, ைகய, ஒ뿍ய, மியாퟍ, ச埍காꟍ

Table 3: Number of songs written by a lyricist and the most frequent 10 words by them are presented. The most frequent 10 words are obtained after removing the stop words. The most frequent 100 words from the entire corpus are treated as stop words. ingly, Na. Muthukumar has written the most words, and few stop words after the most fre- number of songs in the shortest span of years. quent 100 words. The top 10 most frequent words per lyricist (Table 3) are extracted after We treat each word in the corpus as the sim- removing the stop words. Interestingly, the plest unit for our analysis. That is, we do not word காத쯍 (love) always is present in the top conduct analysis based on lemmas or stems of 10 words despite not being a traditional stop words. The top 10 lyricists with the highest word. We attribute this to the nature of Tamil number of songs and their most frequent 10 movie songs which usually circles around the words are presented in Table 3. Vaali ranks emotion, romance or love (காத쯍). first with 873 songs followed by Kannadasan (797 songs), and Vairamuthu (390 songs). Table 4 presents the number of songs per decade and the top 10 most frequent words We noticed that the top 10 most fre- per decade. For example, there are 312 songs quent words across the entire corpus (Table:3, included in the corpus belonging to the decade 1st row), and the top 10 most frequent words (1970), that is from the year 1961 to 1970. The per lyricist were all pronouns or interrogative top 10 words for the decade 1970 are also given. words. Our further investigation revealed that The top 10 most frequent words are extracted the top 100 words mostly included pronouns, after removing the stop words. interrogative words, prepositions, and conjunc- tions which are considered as stop words in In Table 5, we present similarity scores be- English. Thus we decided to treat the top 100 tween pairs of lyrics among the top 5 artists. most frequent words as stop words. We ac- The Jaccard similarity computes similarity be- knowledge that this process is not perfect, as tween two sets by calculating the intersection there are a few non stop words in the top 100 over union. First, we combine all songs by

5 Year No. Top 10 words songs 1960 71 க迍ேடꟍ, கனퟁ, எ柍க쿍, சிவக柍ைக, சீைம, ெசாꏍத믍, வா폁믍, அꯍபா, ெமாழி, 毁க믍 1970 312 உ쿍ள믍, அவ쿍, ெகா迍ட, பா쏍鏍鏁, நிலா, ஆட, அவꟍ, பா쏍, க迍翁, க迍ேடꟍ 1980 296 毁க믍, பி쿍ைள, வꏍத鏁, ஆ翁믍, இ쏁, க迍ணꟍ, உ쿍ள믍, ெம쯍ல, ராக믍, பா 1990 820 ராக믍, கைத, மாேன, மாமா, ேதꟍ, ேதக믍, நா쿁믍, பாட, வꏍத鏁, 毁க믍 2000 1071 ேநா, ꯂேவ, தா, ெபா迍迁, நிலா, கிளி, க迍ேண, ேதꟍ, மாமா, ஒꟍꟁ 2010 1087 믁த쯍, வாퟍ, ேபா翁, ஏேதா, ெதா翍翁, இதய믍, ேபா鏁, பா쏍鏍鏁, ேமேல, ெம쯍ல 2020 1593 இவꟍ, ேபா, நா柍க, இதய믍, அவ, வா羿, ராஜா, அவꟍ, நா믍, ேபாக

Table 4: Number of songs written in a decade and the most frequent 10 words per decade is given. The most frequent 10 words are obtained after removing the stop words. The most frequent 100 words from the entire corpus are treated as stop words.

Jacc. TFIDF sim. cos sim Vaali Kannadasan 0.76 0.956 Vairamuthu 0.94 0.972 Na. Muthu 0.89 0.956 Kumar Gangai 0.76 0.969 Amaran Kannadasan Vairamuthu 0.71 0.948 Na. Muthu 0.77 0.910 Kumar Gangai 0.85 0.922 Figure 1: The number of songs written by a lyricist Amaran across 65 years (1954 to 2019) is presented. Vaali Vairamuthu Na. Muthu 0.9 0.964 has written Tamil song lyrics covering the widest range of years. Na. Muthukumar has written the Kumar most number of songs in the shortest span of years. Gangai 0.74 0.939 Amaran Na. Muthu Gangai 0.81 0.924 each lyricist into a large text. Thus we obtain Kumar Amaran five large texts for the top 5 lyricists. Were- move duplicate words from each text to create Table 5: Words based similarity between pairs of a set for each lyricist. We calculate the Jac- the top 5 lyricists from the corpus. Both Jaccard similarity and cosine similarity on the TF-IDF vec- card similarity over these sets to measure their tors indicate that the lyrics of Vaali and Vaira- similarity. Results show that there is higher muthu are the most similar. overlap between the words used by Vaali and Vairamuthu (Jacc. sim.: 0.94), and Vaira- muthu and Na. Muthu Kumar (Jacc. sim.: we represent these texts using their TF-IDF 0.9) vector. We calculate the cosine similarity be- We also present the cosine similarity be- tween these vectors to measure their similar- tween the TF-IDF representations of the lyrics ity. Based on this TF-IDF cosine similarity by the top 5 lyricists. We process the lyrics measure, the lyrics Vaali and Vairamuthu (TF- similar to the processing for calculating Jac- IDF cos. sim: 0.972 ) are very similar which is card similarity, but do not form a unique set observed in the Jaccard similarity score (Jacc. of words for each lyricist. First, we combine all sim.: 0.94) as well. Similarly Vaali and Gan- songs by each lyricist into a large text. Next, gai Amaran (TFIDF cos. sim: 0.969 ), and

6 Multingual Indic-NLP Tamillion BERT BERT uncased AI4Bharat P R F1 P R F1 P R F1 Vaali 0.62 0.57 0.59 0.65 0.75 0.69 0.68 0.74 0.70 Kannadasan 0.56 0.62 0.59 0.67 0.55 0.60 0.68 0.61 0.64 W.Avg 0.59 0.59 0.59 0.66 0.66 0.65 0.68 0.68 0.68

Table 6: Results of binary classifiers to classify between Vaali and Kannadasan given a song Tamil lyric is presented. All models are fine tuned to obtain the results.

Vairamuthu and Na. Muthu Kumar (TFIDF .65 vs. .68). The results obtained using Indic- cos. sim: 0.964 ). also have high TFIDF co- NLP and Tamillion BERT are comparable (F1: sine similarity scores. .65 vs. .68). Regarding the results for Vaali, Tamillion 6 Experiments and Results BERT obtains the highest precision (.68), and Indic-NLP obtains the highest recall (.75). We conduct experiments to identify the lyri- Overall, the performance for Vaali is better cist of a given lyric. We formulate a binary when using the Tamillion BERT (F1: .70). classification problem by generating a dataset including only the top two lyricists with the In case of the results for Kannadasan, the most number of songs, Vaali and Kannadasan. highest performance is obtained both for pre- This dataset consists of 1,670 instances from cision and recall by fine tuning the Tamillion the Tamil lyrics corpus, where 873 instances BERT model (P: .68, R: .61). We note that correspond to Vaali and 797 instances corre- there is still room for improvement in this bi- spond to Kannadasan. We used 80% of this nary lyricist identification task. The task be- dataset for training and withheld the remain- comes even more challenging when the number ing 20% for testing. We ensured the lyricists of lyricists are increased changing the binary distribution is stratified across the train and classification task into a multi class classifica- test sets. tion task.

6.1 Experiments 7 Future research In an era where BERT (Devlin et al., 2019) We use the Tamil lyrics corpus to identify lyri- models are used as a baseline, we present re- cists formulating a binary classification task. sults using three BERT models supporting The same corpus could be used for several Tamil. We conducted all our experiments us- other tasks. We discuss the previous works re- ing the simpletransformers (Rajapakse, 2019) lating to Tamil lyrics in Section 3. In this sec- python package. First, we fine tuned the tion, we describe a few possible research that multilingual uncased BERT (Reimers and could conducted using the Tamil lyrics corpus. Gurevych, 2020) model. We obtained the best results at a learning rate of 1e-5 with 3 epochs. 7.1 Language trend analysis Next, we fine tuned with the indic-bert model Language trend analysis intends to capture from AI4Bharat (Kunchukuttan et al., 2020). the change in a language over years. As Tamil Here we obtained the best results at a learn- song lyrics are closely tied to the political ing rate of 1e-5 with 2 epochs. Finally, we and personal lives of Tamilians, we expect the fine tuned the Tamillion BERT (Doiron, 2020) change of language style in the lyrics to reflect model. We reached the best results in this the lives of Tamilians. We hypothesize that model at learning rate of 1e-5 with 7 epochs. the songs released after the year 2000 would include more English words (code-switching) 6.2 Results than the songs from the 1960s. For instance, We present results in Table 6. The results ob- the song Why this kolaveri? has more than tained using multilingual BERT is lower than 50% English words. Additionally, as Tamil- Indic-NLP and Tamillion BERT (F1: .59 vs. ians have moved across the world, Tamil songs

7 have gained international attention, enabling the similarity of the lyrics between different changes in Tamil music. lyricists using standard similarity approaches. We also hypothesize that the songs re- Next, we formulate a binary classification task leased after the year 2000 would include sev- to identify the lyricist, given a song lyric. We eral meaningless words, whereas the songs re- fine tuned the state-of-the-art BERT models leased around the 1960s followed strict liter- to predict the lyricists. Our experimental re- ary style. For example, phrases like jimjikka, sults show that identifying lyricists is a chal- rangu rakkara, tasaku tasaku, etc. have no lenging task, and there is scope for improve- meaning (also called Rettai Kilavi in Tamil ment. We go beyond the experiments, and grammar) are often used in Tamil songs, as elaborate on the possible future research direc- they are fun and catchy. tions with the Tamil lyrics corpus. We intend to add more instances to our corpus to encour- 7.2 Emotion detection age research on Tamil lyrics. As Tamil song lyrics are usually written for an emotion in the movie, it is possible to use the References Tamil lyrics dataset to identify the emotion in them. This task would require an additional Shinu Anna Abraham. 2003. Chera, chola, pandya: layer of annotations for emotions. We also be- Using archaeological evidence to identify the tamil kingdoms of early historic south india. In lieve a distant supervision mechanism to an- Asian Perspectives, volume 42, pages 207–223. notate for the labels should be possible. The University of Hawai’i Press. possible emotions would include, but not limit Manoj Kumar Chinnakotla and Om P. Damani. to love, breakup, hero introduction, happy, and 2009. Experiences with English-Hindi, English- sorrow. While the work by (Sundara Kan- Tamil and English-Kannada transliteration chana and Ganapathy, 2017) closely relates to tasks at NEWS 2009. In Proceedings of the the emotion detection space, we believe there 2009 Named Entities Workshop: Shared Task on is so much to be explored in this area. Transliteration (NEWS 2009), pages 44–47, Sun- tec, Singapore. Association for Computational Linguistics. 7.3 Evaluating transliteration models Jacob Devlin, Ming-Wei Chang, Kenton Lee, and There are many works relating to English Kristina Toutanova. 2019. BERT: Pre-training transliteration of Tamil (Chinnakotla and of deep bidirectional transformers for language Damani, 2009; Vijayanand, 2009). As our cor- understanding. In Proceedings of the 2019 Con- pus includes the English transliterated text of ference of the North American Chapter of the the Tamil lyrics, it could be used as a valida- Association for Computational Linguistics: Hu- man Language Technologies, Volume 1 (Long tion set to evaluate a transliteration model. and Short Papers), pages 4171–4186, Minneapo- We hypothesize that Tamil lyrics are rich lis, Minnesota. Association for Computational in linguistic phenomena such as cliches, Linguistics. metaphors, etc. A model identifying such phe- Nick Doiron. 2020. Tamillion bert. https:// nomena in the Tamil corpus would be interest- huggingface.co/monsoon-nlp/tamillion. ing. Giruba Beulah S. E., V., Karthika Ranganathan, and Suriyah M. 2011. Pleasant- 8 Conclusion ness scoring models for tamil lyrics. In Proceed- ings of the 5th Indian International Conference In this paper, we introduce a new Tamil lyrics on Artificial Intelligence, IICAI 2011, Tumkur, dataset. We collect Tamil movie lyrics along Karnataka State, India, December 14-16, 2011, with the names of the lyricist, the movie, and pages 1561–1571. IICAI. the music director over a range of 65 years. We Nithin Kalorth. 2021. Identities and ideologies also present the English transliteration of the in understanding within the fram- Tamil lyrics. We present an elaborate litera- work of socio political movements in tamil nadu. ture review on studies relating to Tamil lyrics. In Into the Nuances of Cultural - Essays on Cul- We conduct detailed corpus analysis revealing tural Studies, pages 95–103. Yking books. the top 10 frequent words per decade, and top Trichy V. Krishnan and A.M. Sakkthivel. 2010. To 10 frequent words per lyricist. We also study push for stardom or not: A rookie’s dilemma in

8 the tamil movie industry. IIMB Management Rajeswari Sridhar, G Jalin Gladis, K Ganga, and Review, 22(3):80–92. G Dhivya Prabha. 2013. N-gram based ap- proach to automatic tamil lyric generation by Anoop Kunchukuttan, Divyanshu Kakwani, identifying emotion. pages 919–926. Proceed- Satish Golla, Gokul N.C., Avik Bhattacharyya, ings of International Conference on Advances in Mitesh M. Khapra, and Pratyush Kumar. 2020. Computing. Ai4bharat-indicnlp corpus: Monolingual cor- pora and word embeddings for indic languages. Rajeswari Sridhar, G Jalin Gladis, K Ganga, and arXiv preprint arXiv:2005.00085. G Dhivya Prabha. 2014. Automatic tamil lyric generation based on ontological interpretation T. C. Rajapakse. 2019. Simple transform- for semantics. pages 97–121. Academy Proceed- ers. https://github.com/ThilinaRajapakse/ ings in Engineering Sciences. simpletransformers. Siva Subramanian. 2020. Tamil songs Ananth Ramakrishnan A, Sankar Kuppan, and lyrics dataset. Data retrieved from Sobha Lalitha Devi. 2009. Automatic genera- Kaggle http://www.kaggle.com/sivaskvs/ tion of tamil lyrics for melodies. In Proceedings tamil-songs-lyrics-dataset. of the NAACL HLT Workshop on Computational Approaches to Linguistic Creativity, pages 40– Meenakshi K Sundara Kanchana and Velappa 46, Boulder, Colorado. Association for Compu- Ganapathy. 2017. Comparison of genre based tational Linguistics. tamil songs classification using term frequency and inverse document frequency. Research Jour- Ananth Ramakrishnan A and Sobha Lalitha Devi. nal of Pharmacy and Technology, 5:80–92. 2010. An alternate approach towards meaning- ful lyric generation in Tamil. In Proceedings Kommaluri Vijayanand. 2009. Testing and perfor- of the NAACL HLT 2010 Second Workshop on mance evaluation of machine transliteration sys- Computational Approaches to Linguistic Creativ- tem for . In Proceedings of the ity, pages 31–39, Los Angeles, California. Asso- 2009 Named Entities Workshop: Shared Task on ciation for Computational Linguistics. Transliteration (NEWS 2009), pages 48–51, Sun- tec, Singapore. Association for Computational K. Ranganathan, T. V. Geetha, R. Parthasarathi, Linguistics. and Madhan Karky. 2011. Lyric mining : Word , rhyme & concept co-occurrence analysis. Karthika Ranganathan, B. Barani, and T. V. Geetha. 2013. A tamil lyrics search and visu- alization system. volume 8281, pages 513–527. Lecture Notes in Computer Science. Nils Reimers and Iryna Gurevych. 2020. Mak- ing monolingual sentence embeddings multilin- gual using knowledge distillation. In Proceed- ings of the 2020 Conference on Empirical Meth- ods in Natural Language Processing. Association for Computational Linguistics. R. Sridhar, N. Dharmaraj, J. Damodaran, and S. Sampath. 2015. Automatic tamil lyric gener- ation based on image sequence and derived tune. In 2015 IEEE International Advance Computing Conference (IACC), pages 861–866. R. Sridhar, S. Venkatsubhramaniyen, and S. S. Rashmi. 2018. Automatic singable tamil lyric generation for a situation and tune based on causal effect. In 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 1152–1159. Rajeswari Sridhar, S. R. Surya Dev, V. Sankar, and K. Sivakumar. 2016. Automatic tamil lyric generation based on neuro-linguistics. In Pro- ceedings of the International Conference on In- formatics and Analytics, ICIA-16, New York, NY, USA. Association for Computing Machin- ery.

9