Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada

Total Page:16

File Type:pdf, Size:1020Kb

Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada Bharathi Raja Chakravarthi1, Ruba Priyadharshini2, Navya Jose3 Anand Kumar M 4,Thomas Mandl 5, Prasanna Kumar Kumaresan3, Rahul Ponnusamy3, Hariharan R L 4, John Philip McCrae 1 and Elizabeth Sherly3 1National University of Ireland Galway, 2Madurai Kamaraj University, 3Indian Institute of Information Technology and Management-Kerala, 4National Institute of Technology Karnataka Surathkal, 5University of Hildesheim Germany [email protected] Abstract Research in hate speech detection (Kumar et al., Detecting offensive language in social me- 2018) or offensive language detection (Zampieri dia in local languages is critical for mod- et al., 2020; Mandl et al., 2020) using Natural Lan- erating user-generated content. Thus, the guage Processing (NLP) has significantly improved field of offensive language identification for in recent years. However, the work on under- under-resourced languages like Tamil, Malay- resourced languages is still limited (Chakravarthi, alam and Kannada is of essential impor- 2020). For example, under-resourced languages tance. As user-generated content is often such as Tamil, Malayalam, and Kannada lack tools code-mixed and not well studied for under- resourced languages, it is imperative to cre- and datasets (Chakravarthi et al., 2020a,c; Thava- ate resources and conduct benchmark stud- reesan and Mahesan, 2019, 2020a,b). Recently, ies to encourage research in under-resourced shared sentiment analysis for Tamil and Malay- Dravidian languages. We created a shared alam by Chakravarthi et al.(2020d) and offensive task on offensive language detection in Dra- language identification in Tamil and Malayalam vidian languages. We summarize the dataset by Chakravarthi et al.(2020b) paved the wave for this challenge which are openly avail- for more research on Dravidian languages. Tamil, able at https://competitions.codalab. Malayalam and Kannada belong to the Dravidian org/competitions/27654, and present an overview of the methods and the results of the language family and are spoken by some 220 mil- competing systems. lion people in the Indian subcontinent, Singapore and Sri Lanka (Krishnamurti, 2003). It is essen- 1 Introduction tial to create NLP systems such as hate speech The dawn of social media helped to bridge the gap detection or offensive language detection in local between the borders and paved the way for the peo- languages since most user-generated content is in ple to communicate or express their opinions more such languages. easily than any other time in human history (Edo- The Dravidian language family is reported as somwan et al., 2011). But with this advent of social being approximately 4500 years old by Bayesian media, the platforms like YouTube, Facebook, and phylogenetic analysis of cognate-coded lexical data Twitter not only helped information sharing and (Kolipakam et al., 2018). The Dravidian languages networking but also became the place that target, scripts are first attested in the 580 BCE as Tamili1 defame and marginalize people merely on the basis script inscribed on the pottery of Keezhadi, Siva- of their physical appearance, religion or sexual ori- gangai and Madurai district of Tamil Nadu, India entation (Keipi et al., 2016; Benikova et al., 2018; (Sivanantham and Seran, 2019)2 by Tamil Nadu Pamungkas et al., 2020). Social media has moulded State Department of Archaeology and Archaeo- itself into its specialised tool with which the people logical Survey of India. The modern Dravidian are verbally threatened and cornered not for what languages have its own scripts evolved from Tamili they have done but for who they are (Maitra and scripts. The scripts are alpha-syllabic, belonging McGowan, 2012; Patton et al., 2013). Undoubt- to a family of the abugida writing systems that edly, the infiltration depth and width of this ‘digital are partially alphabetic and partially syllable-based wonder’ equipped the erstwhile ‘invisible and so- cially paralysed’ communities to play cards, if not 1also called Damili or Dramili or Tamil-Brahmi extravagantly noticeable (Barnidge et al., 2019). 2Keeladi-Book-English-18-09-2019.pdf 133 Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 133–145 April 20, 2021 ©2021 Association for Computational Linguistics (Albert et al., 1985; Rajendran, 2004). The writing 2 Task Description system of Tamili was explained in the old grammar Offensive language identification for Dravidian lan- text Tolkappiyam which dates are various proposed guages at different levels of complexity were devel- between 9th century BCE to 6nd century BCE (Pil- oped following the work of (Zampieri et al., 2019). lai, 1904; Swamy, 1975; Zvelebil, 1991; Takahashi, It was customized to our annotation method from 1995) and in the Jaina work Samavayanga Sutta three-level hierarchical annotation schema. A new and Pannavana Sutta, these two Jain works date to category Not in intended language was added to 3rd-4th century BCE (Salomon, 1998). include comments written in a language other than Up until around the 15th-century CE, Malay- the Dravidian languages. Annotations decision for alam was the west coast dialect of Tamil (Black- offensive language categories were split into six burn, 2006). Due to geographical isolation by the labels to simplify the annotation process. steep Western Ghats from the main speech commu- • Not Offensive: Comment/post does not have nity, the dialect eventually evolved into a distinct offence, obscenity, swearing, or profanity. language by the 15th century (Sekhar, 1951). The first literary work in Malayalam was Ramacari- • Offensive Untargeted: Comment/post have tam (”Deeds of Rama”) was written using Tamil offence, obscenity, swearing, or profanity not Grantha script which was used to write Pali, Prakrit directed towards any target. These are the and foreign words in Tamil (Andronov, 1996). The comments/posts which have inadmissible lan- first printed work in an Indian language and script guage without targeting anyone. was a Roman Catholic catechism translated by Hen- rique which is Thambiran Vanakkam (Doctrina • Offensive Targeted Individual: Com- Christam en Lingua Malauar Tamul in Portuguese) ment/post have offence, obscenity, swearing, on 20 October 1,578 CE at Quilon, Venad (present or profanity which targets an individual. day Kerala) (George, 1972). Similarly, Kannada • Offensive Targeted Group: Comment/post also split from Tamil by 9th century CE; for exam- have offence, obscenity, swearing, or profan- ple, Taayviru is a term of Kannada origin and is ity which targets a group or a community. found in a Tamil inscription from the 4th-century CE. The Kappe Arabhatta record of 700 CE is the • Offensive Targeted Other: Comment/post oldest extant form of Kannada poetry in the Tripadi have offence, obscenity, swearing, or profan- metre, it is based in part on Kavyadarsha (Rawlin- ity which does not belong to any of the previ- son, 1930; Hande et al., 2020). ous two classes. However, user-generated content is often • Not in indented language: If the comment adopted to Roman script for typing due to histor- is not in the intended language. For example, ical reason and modern computer keyboard lay- in the Malayalam task, if the sentence does outs. Hence, the majority of user-generated data for not contain Malayalam written in Malayalam these Dravidian languages are code-mixed (Priyad- script or Latin script, then it is not Malayalam. harshini et al., 2020; Jose et al., 2020). Due to Sample of not-offensive comments in our dataset the growth of social media platforms across the provided for the participants is given below with world and the possibility of writing content without the corresponding English glosses.. any moderation, users write content in multilingual code-switching without grammatical restrictions • Tamil-English: Paka thana poro movie la and using non-native scripts (Chakravarthi et al., Enna irukunu baki ellam 2019). In linguistics, code-switching is switching English: “You will see what is the movie” between two or more language in the same utter- ance. This explosion of code-mixed user-generated Sample offensive language text from the dataset content in the social media platforms aroused in- terest in NLP. In this paper, we dicuss about the • Tamil-English: Ghibran p—a vitta ungalukku offensive language identification shared task for vera aale kedaika laya da. Tamil, Malayalam and Kannada languages and the English: “Dont you get anyone else other than participant’s submissions. Ghibran V—a” 134 Language Tamil Malayalam Kannada Number of words 511,734 202,134 65,702 Vocabulary size 94,772 40,729 20,796 Number of comments 43,919 20,010 7,772 Number of sentences 52,617 23,652 8,586 Average number of words per sentence 11 10 8 Average number of sentences per comment 1 1 1 Table 1: Corpus statistics for Offensive Language Identification Class Tamil Malayalam Kannada Not Offensive 31,808 17,697 4,336 Offensive Untargeted 3,630 240 278 Offensive Targeted Individual 2,965 290 628 Offensive Targeted Group 3,140 176 418 Offensive Targeted Others 590 - 153 Not in indented language 1,786 1,607 1,898 Total 43,919 20,010 7,772 Table 2: Offensive language Identification Dataset Distribution. This is an example of Offensive Targeted Indi- Comment Scraper tool3 was used to collect com- vidual. Mr. Ghibran is a popular music composer ments from YouTube social media. These com- in Tamil films, and the author tries to insult him. ments were utilized to make the datasets for of- fensive language identification shared task. We • Malayalam-English: Verupikkal manju ul- collected comments that contain code-mixing at lathu ozhichal baki ellam super aanu ee various levels of the text, with enough representa- padam. tion for each sentiment in all the three languages. 4 English: “Except the presence of disgusting Langdetect library was used to detect different Manju, everything is super in this movie” languages apart and eliminate the unintended lan- guages.
Recommended publications
  • BSW 043 Block 1 English.Pmd
    UNIT 4 TRIBES OF TAMIL NADU Structure 4.0 Objectives 4.1 Introduction 4.2 About Tamil Nadu 4.3 Tribes of Tamil Nadu 4.4 Social Hierarchy of the Tribes in Tamil Nadu 4.5 Tribal Languages in Tamil Nadu 4.6 Let Us Sum Up 4.7 Further Readings and References 4.0 OBJECTIVES This unit gives a description of the tribes of Tamil Nadu State which is a part of South India. It provides information about their origin, social, cultural and economic characteristics and their present status with the object of developing an understanding in the learner about the distinct features of the tribes located in the heart of the nation. After reading this unit, you should be able to: Describe the tribal areas of Tamil Nadu; Trace the origin of the tribes and understand their culture and occupation; Understand the different tribes of the region and their social, economic and cultural characteristics; Discuss the social hierarchy of the people in Tamil Nadu; and Outline their present status in terms of literacy, occupation, etc. 4.1 INTRODUCTION Tribes of Tamil Nadu are mainly found in the district of Nilgiris. Of all the distinct tribes, the Kotas, the Todas, the Irulas, the Kurumbas and the Badagas form the larger groups, who mainly had a pastoral existence. The men from each family of this tribe are occupied in milking and grazing their large herds of buffaloes; a very common form of pastoral farming. This tribe is distinguished by their traditional costume; a thick white cotton cloth having stripes in red, blue or black, called puthukuli worn by both women and men over a waist cloth.
    [Show full text]
  • BSW 043 Block 1 English.Pmd
    BSW-043 TRIBALS OF SOUTH Indira Gandhi AND CENTRAL INDIA National Open University School of Social Work Block 1 TRIBES OF SOUTH INDIA UNIT 1 Tribes of Andhra Pradesh and Telangana 5 UNIT 2 Tribes of Karnataka 17 UNIT 3 Tribes of Kerala 27 UNIT 4 Tribes of Tamil Nadu 38 UNIT 5 Tribes of Lakshadweep and Puducherry 45 EXPERT COMMITTEE Prof. Virginius Xaxa Dr. Archana Kaushik Dr. Saumya Director – Tata Institute of Associate Professor Faculty Social Sciences Department of Social Work School of Social Work Uzanbazar, Guwahati Delhi University IGNOU, New Delhi Prof. Hilarius Beck Dr. Ranjit Tigga Dr. G. Mahesh Centre for Community Department of Tribal Studies Faculty Organization and Development Indian Social Institute School of Social Work Practice Lodhi Road, New Delhi IGNOU, New Delhi School of Social Work Prof. Gracious Thomas Dr. Sayantani Guin Deonar, Mumbai Faculty Faculty Prof. Tiplut Nongbri School of Social Work School of Social Work Centre for the Study of Social IGNOU, New Delhi IGNOU, New Delhi Systems Dr. Rose Nembiakkim Dr. Ramya Jawaharlal Nehru University Director Faculty New Delhi School of Social Work School of Social Work IGNOU, New Delhi IGNOU, New Delhi COURSE PREPARATION TEAM Block Preparation Team Programme Coordinator Unit 1 Anindita Majumdar Dr. Rose Nembiakkim and Dr. Aneesh Director Unit 2 & 3 Rubina Nusrat School of Social Work Unit 4 Mercy Vungthianmuang IGNOU Unit 5 Dr. Grace Donnemching PRINT PRODUCTION Mr. Kulwant Singh Assistant Registrar (P) SOSW, IGNOU August, 2018 © Indira Gandhi National Open University, 2018 ISBN-978-93-87237-69-8 All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission in writing from the Indira Gandhi National Open University.
    [Show full text]
  • Shiva's Waterfront Temples
    Shiva’s Waterfront Temples: Reimagining the Sacred Architecture of India’s Deccan Region Subhashini Kaligotla Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2015 © 2015 Subhashini Kaligotla All rights reserved ABSTRACT Shiva’s Waterfront Temples: Reimagining the Sacred Architecture of India’s Deccan Region Subhashini Kaligotla This dissertation examines Deccan India’s earliest surviving stone constructions, which were founded during the 6th through the 8th centuries and are known for their unparalleled formal eclecticism. Whereas past scholarship explains their heterogeneous formal character as an organic outcome of the Deccan’s “borderland” location between north India and south India, my study challenges the very conceptualization of the Deccan temple within a binary taxonomy that recognizes only northern and southern temple types. Rejecting the passivity implied by the borderland metaphor, I emphasize the role of human agents—particularly architects and makers—in establishing a dialectic between the north Indian and the south Indian architectural systems in the Deccan’s built worlds and built spaces. Secondly, by adopting the Deccan temple cluster as an analytical category in its own right, the present work contributes to the still developing field of landscape studies of the premodern Deccan. I read traditional art-historical evidence—the built environment, sculpture, and stone and copperplate inscriptions—alongside discursive treatments of landscape cultures and phenomenological and experiential perspectives. As a result, I am able to present hitherto unexamined aspects of the cluster’s spatial arrangement: the interrelationships between structures and the ways those relationships influence ritual and processional movements, as well as the symbolic, locative, and organizing role played by water bodies.
    [Show full text]
  • Jakanachari: an Artisan Or a Collective Genius?
    Jakanachari: An artisan or a collective genius? deccanherald.com/spectrum/jakanachari-an-artisan-or-a-collective-genius-862581.html July 18, 2020 1/10 2/10 3/10 4/10 5/10 6/10 Myths about legendary artisans who built spectacular temples abound in many parts of India. Those mysterious, faceless personalities who built the considerable architectural wealth of the Indian subcontinent have always fascinated us lucky inheritors of this treasure. In Kerala, we have the legendary Peruntacchan, to whom many a temple has been attributed, and everyone in Odisha is familiar with Dharmapada, the young son of the chief architect of the Sun Temple at Konark, who solved the puzzle of the crowning stone of the temple shikhara. And in Karnataka, we have the legendary Jakanachari, whose chisel is said to have given form to innumerable temples of this land. Most of these artisan-myths appear to have the same plot, with more than a whiff of tragedy. Peruntacchan, supposedly jealous of the rising talents of his own son, accidentally drops a chisel while the duo was working on the timber roof of a temple, killing the son. Dharmapada, who solves the riddle which allowed the completion of the tower of the Konark Temple, which had vexed twelve thousand artisans, sacrifices himself so that their honour is not diminished in the eyes of the King. There are other myths in other lands which echo this tragic course of events. In Tamil country, the unfinished state of the rock-cut temple called Vettuvan Kovil is believed to be the consequence of the architect of this structure striking his son dead, because the son had boasted upon finishing his own temple project ahead of the father.
    [Show full text]
  • Tamil Teaching and Learning Process
    Vol.1 No.3 July, 2013 ISSN : 2320 –2653 Tamil Teaching and Learning Process R.Nithya Research Scholar, Govt. Arts & Science College, Ooty, India Dr. Senavarayar Professor, Govt. Arts & Science College, Ooty, India Abstract ‘Tamil’ is one of the classical languages of India and the other one is Sanskrit Language. Tamil is one of the major languages of the Dravidian family of languages. Most of the persons who speak Tamil Language reside in the Indian state of Tamil Nadu, Tamil is also spoken in many parts of the world. Large concentration of Tamil speaking people is located in Sri Lanka, Malaysia, and Singapore, as also in other South Indian states of Andhra Pradesh, Karnataka, and Kerala and major cities of India. In fact, Tamil speaking people are found in all the time zones of the world. There are more than 74 million people who speak Tamil as their native language. Tamil is one of the few living classical languages and has an unbroken literary tradition of over two millennia. The written language has changed little during this period, with the result that classical literature is as much a part of everyday Tamil as modern literature. Tamil school-children, for example, are still taught the alphabet using the átticúdi, an alphabet rhyme written around the first century A.D. Teaching and learning are related terms. In teaching - learning process, the teacher, the learner, the curriculum and other variables are organized in a systematic way to attain some pre-determined goal. A language to be studied must be the mother tongue or the regional language.
    [Show full text]
  • Genetic Affinity of Muslim Population in South India Based on HLA-DQB1 and Relationship with Other Indian Populations
    97 International Journal of Modern Anthropology Int. J. Mod. Anthrop. 2019. Vol. 2, Issue 12, pp: 97- 113 DOI: http://dx.doi.org/10.4314/ijma.v2i12.4 Available online at: www.ata.org.tn & https://www.ajol.info/index.php/ijma Research Report Genetic affinity of Muslim population in South India based on HLA-DQB1 and relationship with other Indian Populations Koohyar Mohsenpour1* and Adimoolam Chandrasekar2 1 Department of Studies in Anthropology, University of Mysore, Manasa Gangothri, Mysore-570006, Karnataka, India. 2 Anthropological Survey of India, Southern Regional Center, Government of India, Mysore-570026, India. * Correspondant author: Koohyar Mohsenpour. E. mail: [email protected] (Received 15 January 2019; Accepted 25 February 2019; Published 2 April 2019) Abstract - The present study made an attempt to observe genetic affinity of the Muslim population in South India with other neighbor populations. In this regard, DQB1 loci of HLA class II gene as a common genetic marker in phylogenetic assessment has been examined in 45 unrelated healthy individuals using sequence-based typing. The result of this study indicates a close genetic similarity among Indian sub-populations, in spite of segregation with other Muslim populations in North India. Although results of present study indicates genetic relationship of selected populations, all HLA loci or at least all loci of each classes to be assessed in order to attain highly probability of estimates. Keywords: South Indian Muslims, Anthropology, HLA 98 International Journal of Modern Anthropology (2019) Introduction Human leukocyte antigen (HLA) genes are highly polymorphic genes in human which are considered widely as a useful autosomal genetic marker beside the traditional sex chromosomes markers like as mtDNA and Y chromosome for population relationship and phylogenetics.
    [Show full text]
  • An Analysis of Kannada Language Newspapers, Magazines and Journals: 2008-2017
    International Journal of Library and Information Studies Vol.8(2) Apr-Jun, 2018 ISSN: 2231-4911 An Analysis of Kannada Language Newspapers, Magazines and Journals: 2008-2017 Dr. K. Shanmukhappa Assistant Librarian Vijayanagara Sri Krishnadevaraya University Ballari, Karnataka e-mail: [email protected] Abstract – The paper presents the analysis of Kannada language newspapers, magazines and journals published during the 2008-2017 with help of Registrar of Newspapers for India (RNI) database is controlled by The Ministry of Information and Broadcasting. Major findings are highest 13.23% publications published in 2012, following 11.85, 11.12% in 2014 and 2011 year, 10.97% in 2015, 10.90% in 2009, 10.17% in 2016, 10.38% in 2013 year published. Among 2843 publications, majority 46.4% monthly publications, 20.30% of the publications fortnight, 15.30% weekly publications, 14.46% daily publications. Among 30 districts, majority 26.70% of the publications published in Bangalore and 14.03% in Bengaluru urban district, and least 0.28% published in Vijayapura (Bijapur) district. It is important to study the Kannada literature publications for identifying the major developments and growth of the Kannada literature publications of Annual, Bi-Monthly, Daily, Daily Evening, Fortnightly, Four Monthly, Half Yearly, Monthly, Quarterly and Weekly’s were analysed in the study. Keywords: Kannda language publications, Registrar of Newspapers for India (RNI), Journalism, Daily newspapers, Magazines, Journals, Bibliometric studies. Introduction Information is an important element in every sector of life, be it social, economic, political, educational, industrial and technical development. In the present world, information is a very valuable commodity.
    [Show full text]
  • Kodagu District, Karnataka
    GOVERNMENT OF INDIA MINISTRY OF WATER RESOURCES CENTRAL GROUND WATER BOARD GROUND WATER INFORMATION BOOKLET KODAGU DISTRICT, KARNATAKA SOMVARPET KODAGU VIRAJPET SOUTH WESTERN REGION BANGALORE AUGUST 2007 FOREWORD Ground water contributes to about eighty percent of the drinking water requirements in the rural areas, fifty percent of the urban water requirements and more than fifty percent of the irrigation requirements of the nation. Central Ground Water Board has decided to bring out district level ground water information booklets highlighting the ground water scenario, its resource potential, quality aspects, recharge – discharge relationship, etc., for all the districts of the country. As part of this, Central Ground Water Board, South Western Region, Bangalore, is preparing such booklets for all the 27 districts of Karnataka state, of which six of the districts fall under farmers’ distress category. The Kodagu district Ground Water Information Booklet has been prepared based on the information available and data collected from various state and central government organisations by several hydro-scientists of Central Ground Water Board with utmost care and dedication. This booklet has been prepared by Shri M.A.Farooqi, Assistant Hydrogeologist, under the guidance of Dr. K.Md. Najeeb, Superintending Hydrogeologist, Central Ground Water Board, South Western Region, Bangalore. I take this opportunity to congratulate them for the diligent and careful compilation and observation in the form of this booklet, which will certainly serve as a guiding document for further work and help the planners, administrators, hydrogeologists and engineers to plan the water resources management in a better way in the district. Sd/- (T.M.HUNSE) Regional Director KODAGU DISTRICT AT A GLANCE Sl.No.
    [Show full text]
  • Names of Foodstuffs in Indian Languages
    NAMES OF FOODSTUFFS IN INDIAN LANGUAGES CEREAL GRAINS AND PRODUCTS 1. Pearl Millet: Pennisetum typhoides Bajra (Bengali, Hindi, Oriya), Bajri (Gujarati, Marathi), Sajje (Kannada), Bajr’u (Kashmiri), Cambu (Malayalam, Tamil), Sazzalu (Telugu). Other names : Spiked millet, Pearl millet 2. Italian millet: Setaria italica Syama dhan (Bengali), Ral Kang (Gujarati), Kangni (Hindi), Thene (Kannada), Shol (Kashmiri), Thina (Malayalam), Rala (Marathi), Kaon (Punjabi), Thenai (Tamil), Korralu (Telugu), Other names: Foxtail millet , Moha millet, Kakan kora 3. Sorghum: Sorghum bicolor Juar (Bengali , Gujarati , Hindi), Jola (Kannada), Cholam (Malayalam , Tamil), Jwari (Marathi), Janha (Oriya), Jonnalu (Telugu), Other names: Milo , Chari 4. Maize: Zea mays Bhutta (Bengali), Makai (Gujarati), Maka (Hindi , Marathi , Oriya), Musikinu jola (Kannada), Makaa’y (Kashmiri), Cholam (Malayalam), Makka Cholam (Tamil), Mokka jonnalu (Telugu) 5. Finger Millet: Eleusine coracana Madua (Bengali , Hindi), Bhav (Gujarati), Ragi (Kannada) , Moothari (Malayalam), Nachni (Marathi), Mandia (Oriya), Kezhvaragu (Tamil), Ragulu (Telugu), Other names: Korakan 6. Rice, parboiled: Oryza sativa Siddha chowl (Bengali) Ukadello chokha (Gujarati), Usna chawal (Hindi), Kusubalakki (Kannada), Puzhungal ari (Malayalam), Ukadla tandool (Marathi), Usuna chaula (Oriya), Puzhungal arisi (Tamil), Uppudu biyyam (Telugu) 7. Rice raw: Orya sativa Chowl (Bengali), Chokha (Gujarati), Chawal (Hindi), Akki (Kannada), Tomul (Kashmiri), Ari (Malayalam), Tandool (Marathi), Chaula (Oriya), Arisi
    [Show full text]
  • Exploring Language Similarities with Dimensionality Reduction Techniques
    EXPLORING LANGUAGE SIMILARITIES WITH DIMENSIONALITY REDUCTION TECHNIQUES Sangarshanan Veeraraghavan Final Year Undergraduate VIT Vellore [email protected] Abstract In recent years several novel models were developed to process natural language, development of accurate language translation systems have helped us overcome geographical barriers and communicate ideas effectively. These models are developed mostly for a few languages that are widely used while other languages are ignored. Most of the languages that are spoken share lexical, syntactic and sematic similarity with several other languages and knowing this can help us leverage the existing model to build more specific and accurate models that can be used for other languages, so here I have explored the idea of representing several known popular languages in a lower dimension such that their similarities can be visualized using simple 2 dimensional plots. This can even help us understand newly discovered languages that may not share its vocabulary with any of the existing languages. 1. Introduction Language is a method of communication and ironically has long remained a communication barrier. Written representations of all languages look quite different but inherently share similarity between them. For example if we show a person with no knowledge of English alphabets a text in English and then in Spanish they might be oblivious to the similarity between them as they look like gibberish to them. There might also be several languages which may seem completely different to even experienced linguists but might share a subtle hidden similarity as they is a possibility that languages with no shared vocabulary might still have some similarity.
    [Show full text]
  • Malayalam Range: 0D00–0D7F
    Malayalam Range: 0D00–0D7F This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 14.0 This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See https://www.unicode.org/errata/ for an up-to-date list of errata. See https://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See https://www.unicode.org/charts/PDF/Unicode-14.0/ for charts showing only the characters added in Unicode 14.0. See https://www.unicode.org/Public/14.0.0/charts/ for a complete archived file of character code charts for Unicode 14.0. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 14.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 14.0, online at https://www.unicode.org/versions/Unicode14.0.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, #44, #45, and #50, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online. See https://www.unicode.org/ucd/ and https://www.unicode.org/reports/ A thorough understanding of the information contained in these additional sources is required for a successful implementation.
    [Show full text]
  • The Making of Modern Malayalam Prose and Fiction: Translations from European Languages Into Malayalam in the First Half of the Twentieth Century
    The Making of Modern Malayalam Prose and Fiction: Translations from European Languages into Malayalam in the First Half of the Twentieth Century K.M. Sherrif Abstract Translations from European languages have played a crucial role in the evolution of Malayalam prose and fiction in the first half of the Twentieth Century. Many of them are directly linked to the socio- political movements in Kerala which have been collectively designated ‘Kerala’s Renaissance.’ The nature of the translated texts reveal the operation of ideological and aesthetic filters in the interface between literatures, while the overwhelming presence of secondary translations indicate the hegemonic status of English as a receptor language. The translations never occupied a central position in the Malayalam literature and served mostly as mere literary and political stimulants. Keywords: Translation - evolution of genres, canon - political intervention The role of translation in the development of languages and literatures has been extensively discussed by translation scholars in the West during the last quarter of a century. The proliferation of diachronic translation studies that accompanied the revolutionary breakthroughs in translation theory in the mid-Eighties of the Twentieth Century resulted in the extensive mapping of the intervention of translation in the development of discourses and shifts of ideological paradigms in cultures, in the development of genres and the construction and disruption of the canon in literatures and in altering the idiomatic and structural paradigms of languages. One of the most detailed studies in the area was made by Andre Lefevere (1988, pp 75-114) Lefevere showed with convincing 118 Translation Today K.M.
    [Show full text]