Tamil lyrics corpus: Analysis and Experiments Dhivya Chinnappa Praveenraj Dhandapani Thomson Reuters Intellect Design Arena Ltd.
[email protected] [email protected] Abstract ernment buildings in Tamil Nadu, India. Polit- ical leaders, orators and even common people In this paper, we present a new Tamil quote Thirukkural excerpts in their daily lives. lyrics corpus extracted from Tamil movies Since the inception of the Tamil movie captured across a range of 65 years (1954 industry in the early 1930s, Tamil literary to 2019). We present a detailed corpus analysis showing the nature of Tamil lyrics works have taken a different form. From the with respect to lyricists and the year which very beginning, Tamil movies used the plat- it was written. We also present similar- form to showcase the language’s rich litera- ity score across different lyricists based ture through movie dialogues and song lyrics. on their song lyrics. We present experi- Tamil literary works as old as Sangam litera- mental results based on the SOTA BERT ture have been depicted in early movies, and Tamil models to identify the lyricists of a the recent movies showcase a mix of old and song. Finally, we present future research directions encouraging researchers to pur- contemporary literary works. Thus, Tamil sue Tamil NLP research. movies have become an inseparable part of most Tamilians’ lives. While independent lit- 1 Introduction erary works (poems, hykoos, novels, etc.) ex- ist, Tamil movies consistently act as impor- Tamil is a classical Dravidian language spoken tant medium in showcasing rich Tamil literary in the states Tamil Nadu and Pondicherry, In- works.