Hostility Detection Dataset in Hindi Mohit Bhardwajy, Md Shad Akhtary, Asif Ekbalz, Amitava Das?, Tanmoy Chakrabortyy yIIIT Delhi, India. zIIT Patna, India. ?Wipro Research, India. fmohit19014,tanmoy,
[email protected],
[email protected],
[email protected] Abstract Despite Hindi being the third most spoken language in the world, and a significant presence of Hindi content on social In this paper, we present a novel hostility detection dataset in Hindi language. We collect and manually annotate ∼ 8200 media platforms, to our surprise, we were not able to find online posts. The annotated dataset covers four hostility di- any significant dataset on fake news or hate speech detec- mensions: fake news, hate speech, offensive, and defamation tion in Hindi. A survey of the literature suggest a few works posts, along with a non-hostile label. The hostile posts are related to hostile post detection in Hindi, such as (Kar et al. also considered for multi-label tags due to a significant over- 2020; Jha et al. 2020; Safi Samghabadi et al. 2020); however, lap among the hostile classes. We release this dataset as part there are two basic issues with these works - either the num- of the CONSTRAINT-2021 shared task on hostile post detec- ber of samples in the dataset are not adequate or they cater tion. to a specific dimension of the hostility only. In this paper, we present our manually annotated dataset for hostile posts 1 Introduction detection in Hindi. We collect more than ∼ 8200 online so- The COVID-19 pandemic has changed our lives forever, cial media posts and annotate them as hostile and non-hostile both online and offline.