Language Resources and Evaluation Journal manuscript No.
(will be inserted by the editor)

DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

Bharathi Raja Chakravarthi¹ ⋅ Ruba Priyadharshini² ⋅ Vigneshwaran Muralidaran³ ⋅ Navya Jose⁴ ⋅ Shardul Suryawanshi¹ ⋅ Elizabeth Sherly⁴ ⋅ John P. McCrae¹

Received: date / Accepted: date

Bharathi Raja Chakravarthi* E-mail: [email protected]
Ruba Priyadharshini E-mail: [email protected]
Vigneshwaran Muralidaran E-mail: [email protected]
Navya Jose E-mail: [email protected]
Shardul Suryawanshi E-mail: [email protected]
Elizabeth Sherly E-mail: [email protected]
John P. McCrae E-mail: [email protected]

¹ Insight SFI Research Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, Galway, Ireland
² ULTRA Arts and Science College, Madurai, Tamil Nadu, India
³ School of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
⁴ Indian Institute of Information Technology and Management-Kerala, Kerala, India

arXiv:2106.09460v1 [cs.CL] 17 Jun 2021

Abstract This paper describes the development of a multilingual, manually annotated dataset, generated from social media comments, for three under-resourced Dravidian languages. The dataset was annotated for sentiment analysis and offensive language identification and covers more than 60,000 YouTube comments. It consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country.