Development of a Micro Telugu Opinion Wordnet and Aligning with TELOWN Ontology for Automatic Recognition of Opinion Words from Telugu Documents
INTERNATIONAL JOURNAL OF RESEARCH, ISSN NO: 2236-6124

Benarji Tharini (1), Dr. Vishnu Vardhan Bulusu (2)
(1) Research Scholar, Rayalaseema University, Kurnool, AP, India. [email protected]
(2) Professor, Department of CSE, Manthany JNTUH, TS, India. [email protected]

Abstract: With the advent of the Unicode standard, Indian-language documents have emerged on the web in the recent past. Yet the volume of Indian-language content, and its accessibility for linguistic processing, remains minimal. Unicode-based language tools are being created to prepare language-specific repositories and online dictionaries. Constructing wordnets, language constructs, and semantically rich lexical synsets is useful for the linguistic processing of Indian languages. Over the past two decades, research has trended towards the construction of semantic models. A rich knowledge base has to be created as a starting point for semantically rich linguistic models. With this aim, a micro Telugu opinion wordnet is created and mapped to the Telugu Opinion WordNet Ontology (TELOWN), which holds semantic knowledge about positive and negative Telugu opinion words. The objective is to build an opinion wordnet in Telugu, together with its synsets, for the automatic recognition of opinion words in Telugu documents. SPARQL is used as the query language for retrieval at the back end.

Keywords: semantic web, ontology, Telugu, WordNet, opinion words, SPARQL

I. Introduction

In a multilingual nation like India, translation between Indian languages, and between English and Indian languages, is an essential task.
Equally essential is the task of cross-lingual search, where the query is made in an Indian language and documents are retrieved in English or Telugu (see Figure 1). All of these activities depend on lexical knowledge of high quality and coverage. This lexical knowledge takes the form of machine-readable dictionaries, ontologies (hierarchical organizations of concepts), and wordnets (large graph-like structures of words).

Volume 7, Issue VI, JUNE/2018. Page No:197

[Figure 1: Cross Lingual Search. A query in an Indian language (IL) passes through input processing and is then searched against IL, English, or Telugu documents; documents retrieved in English or Telugu are optionally translated back into the IL before the output is returned in the IL.]

Most of the information on the World Wide Web is encoded as natural-language text intended for humans and is difficult for machines to understand. With the Internet boom of recent years, large volumes of unstructured text in many languages and forms are being added to data stores every day. With the advent of Unicode, this phenomenon has also been observed for texts in Indian languages such as Telugu, Tamil and Bengali [1]. In general, these languages are poor in terms of the availability of well-established corpora, natural language processing tools, and so on, and have therefore become an important area of research in the Indian community.

Over the last two decades, the world has seen significant growth in web content in Indian languages, which has made people comfortable using their native language online.
In particular, the last few years have seen a large increase in Telugu content on the web. Telugu is the fifth largest spoken language and has 250 million speakers across the world, the majority of whom are from India [2]. To process native-language content for meaningful information retrieval, a language-specific WordNet is required. WordNet [3] has emerged as a valuable resource for Natural Language Processing applications on English documents. Following the English WordNet, WordNets have been built for many languages of the world. IndoWordNet [4] is the principal WordNet resource for Indian languages.

Wordnets are lexical structures composed of synsets and semantic relations. Synsets are sets of synonyms. They are linked by semantic relations such as hypernymy (is-a), meronymy (part-of), and so on. Wordnets have emerged as important resources for Natural Language Processing (NLP). The first wordnet in the world was built for English at Princeton University. Wordnets for European languages followed with EuroWordNet. Since 2000, wordnets for a number of Indian languages have been built, led by the IndoWordNet effort at the Indian Institute of Technology Bombay (IITB).

Opinionated content in Telugu is important to analyze for use by industry and government. Automated reasoning over such natural-language documents requires the support of a Telugu WordNet. When such a lexical resource is integrated with the concepts of an ontology, opinion words in Telugu documents can be recognized automatically and effectively.

II. Related Work

A good amount of research has been done on determining the orientation of opinion words in the Telugu language. The development of lexical resources for both traditional information retrieval and opinion mining tasks is the first step in this research.
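To make the idea of an opinion lexical resource concrete, the following is a minimal sketch of how synsets (sets of synonyms) tagged with a polarity could be used to spot opinion words in a tokenized Telugu sentence. It is not the TELOWN implementation: the class name `OpinionSynset`, the synset identifiers, the helper functions, and the sample Telugu entries and sentence are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class OpinionSynset:
    """Hypothetical synset record: a set of synonyms sharing one polarity."""
    synset_id: str
    lemmas: list      # synonyms, most common first
    polarity: str     # "positive" or "negative"


def build_index(synsets):
    """Map each lemma to the synsets that contain it."""
    index = {}
    for synset in synsets:
        for lemma in synset.lemmas:
            index.setdefault(lemma, []).append(synset)
    return index


def recognize_opinion_words(tokens, index):
    """Return (position, token, polarity) for every token found in the index."""
    hits = []
    for position, token in enumerate(tokens):
        for synset in index.get(token, []):
            hits.append((position, token, synset.polarity))
    return hits


# Illustrative entries only; they are not taken from TELOWN.
SYNSETS = [
    OpinionSynset("pos-001", ["మంచి", "ఉత్తమ"], "positive"),
    OpinionSynset("neg-001", ["చెడు", "చెడ్డ"], "negative"),
]
INDEX = build_index(SYNSETS)

# Whitespace tokenization is an assumption; real Telugu text would need
# morphological analysis before lookup.
tokens = "ఈ సినిమా మంచి కానీ పాటలు చెడు".split()
hits = recognize_opinion_words(tokens, INDEX)
for position, token, polarity in hits:
    print(position, token, polarity)
```

A plain dictionary lookup is used here only to keep the sketch short; in the setting the paper describes, the lookup would instead be a SPARQL query against the ontology.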
IndoWordNet is a linked lexical knowledge base of the wordnets of 18 scheduled languages of India, namely Assamese, Bangla, Bodo, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Meitei (Manipuri), Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu and Urdu. The project took off in 2000 with the Hindi WordNet, created by the Natural Language Processing group at the Center for Indian Language Technology (CFILT) in the Computer Science and Engineering Department at IIT Bombay [5]. It was made publicly available in 2006 under the GNU license. The Hindi WordNet was created with support from the TDIL project of the Ministry of Communication and Information Technology, India, and partially from the Ministry of Human Resources Development, India.

The wordnets follow the principles of minimality, coverage and replaceability for the synsets. Minimality means that a synset should contain at least a 'core' set of lexemes that uniquely identifies the concept it represents, e.g., {house, family} standing for the concept of 'family' ("she is from a noble house"). Coverage means that the synset should include all the words representing the concept in the language, e.g., the word 'ménage' has to appear in the 'family' synset, albeit towards the end, since its usage is rare. Replaceability means that the words towards the beginning of the synset should be able to replace one another in a reasonable amount of corpus text, e.g., 'house' and 'family' can replace each other in the sentence "she is from a noble house".

IndoWordNet is highly similar to EuroWordNet. However, its pivot language is Hindi, which is in turn linked to the English WordNet. Typical Indian language phenomena such as complex predicates and causative verbs are also captured in IndoWordNet. IndoWordNet is publicly browsable.
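The three synset principles described above (minimality, coverage, replaceability) can be illustrated with a small sketch built on the 'family' example from the text. The representation of a synset as an ordered list, and the helper names `core_lemmas`, `missing_for_coverage` and `substitute`, are assumptions made for illustration and do not reflect how IndoWordNet stores or validates its synsets.

```python
def core_lemmas(synset, size=2):
    """Minimality: the leading lemmas that identify the concept."""
    return synset[:size]


def missing_for_coverage(synset, concept_words):
    """Coverage: words known to express the concept but absent from the synset."""
    return sorted(set(concept_words) - set(synset))


def substitute(sentence, old, new):
    """Replaceability: swap one lemma for another in a whitespace-tokenized sentence."""
    return " ".join(new if word == old else word for word in sentence.split())


# Ordered synset for the concept 'family'; the rare word comes last.
family_synset = ["house", "family", "ménage"]

core = core_lemmas(family_synset)
gaps = missing_for_coverage(["house", "family"], ["house", "family", "ménage"])
swapped = substitute("she is from a noble house", "house", "family")

print(core)
print(gaps)
print(swapped)
```

Whether two lemmas are truly replaceable is a corpus-level judgment; the `substitute` helper only performs the swap so that a human or a downstream check can assess the result.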
The Indian language wordnet building efforts that form the subcomponents of the IndoWordNet project are the North East WordNet project, the Dravidian WordNet project and the Indradhanush project, all of which are funded by the TDIL project. Wordnets of other languages of India followed the Hindi WordNet, and the large nationwide project of building Indian language wordnets was called the IndoWordNet project. The wordnets are being created using the expansion approach from the Hindi WordNet. The Hindi WordNet was created from first principles (the principles described above) and was the first wordnet for an Indian language. The method adopted was the same as that of the Princeton WordNet for English. The Polish WordNet is being mapped to the Princeton WordNet based on the strategy followed by IndoWordNet [6].

[Figure: Telugu word response on IndoWordNet]

Amitava Das and Bandyopadhyay [6] created a SentiWordNet for the Bengali language; their experiment reported 35,805 Bengali entries. Joshi et al. [7] created a SentiWordNet for the Indian language Telugu (T-SWN) by using the English SentiWordNet and English-Telugu WordNet mappings. Bakliwal et al. [8] built a Telugu subjective lexicon for Telugu text polarity classification, consisting of Telugu adjectives and adverbs with their polarity scores.

The research work on linking WordNet with ontology is motivated by the goal of automated reasoning over natural-language resources. The benefits of linking WordNet with ontology are manifold [10]. These are: (I) The formal details of the ontology are conceivable to