Hindi Named Entities Recognition (NER) Using Natural Language Processing and Machine Learning

International Journal of Advances in Electronics and Computer Science, ISSN(p): 2394-2835, Volume-6, Issue-9, Sep.-2019, http://iraj.in

1RIA MEHTA, 2DWEEP PANDYA, 3PRATIK CHAUDHARI, 4DEVIKA VERMA, 5KRISHNANJAN BHATTACHARJEE, 6SHIVA KARTHIK S, 7SWATI MEHTA, 8AJAI KUMAR
1,2,3,4Vishwakarma Institute of Information Technology, Pune, India
5,6,7,8Centre for Development of Advanced Computing, Pune, India

Abstract - Named Entity Recognition (NER) is an important task in Natural Language Processing (NLP) that aims to automatically identify and annotate Named Entities in text, such as Person, Location, Organization etc. NER is an essential component in various applications such as Information Extraction and Retrieval, Machine Translation, Question Answering (Q-A), Text Summarization etc. Although a number of studies have been carried out on NER in Hindi, the literature survey indicates that no high-accuracy tool has yet been developed. In this research, a methodology for Hindi Named Entities Recognition using NLP algorithms with RDF and Conditional Random Fields is proposed. The results show that the hybrid approach to NER achieves a recognition accuracy of 90.7% on Hindi texts.

Keywords - Named Entity Recognition, Machine Learning, Natural Language Processing, CRF

I. INTRODUCTION

The process of identifying Named Entities (NEs) in a textual document and classifying them into different conceptual categories (Name, Place, Party, Designation) is an important step in the task of Natural Language Processing (NLP). This process is called Named Entity Recognition (NER). In this age of the World Wide Web, information is available in abundance, but for Indian languages very little work has been done on NLP and analytics per se. NER is an especially crucial prerequisite in applications such as Information Retrieval, Machine Translation, Question Answering, Text Summarization, Efficient Search Engines etc. NER development for Indian languages remains a challenge owing to the syntactic and semantic complexities of Indian languages and the lack of processing tools. Since Hindi is the official language of India, it has been selected for this project to develop an NER tool with a hybrid of NLP ontology, rule-based and Machine Learning based approaches, in order to gain better accuracy than the available tools. Since Hindi is part of the Indo-Aryan family, the current approach can be used for other Indo-Aryan languages such as Gujarati, Bengali, Marathi etc., as optimal NER tools are not available for these languages either.

There are three main ways to perform NER: using linguistic rule-sets along with an ontology, using Machine Learning, or using a hybrid approach comprising both. Machine Learning has been the most successful in predicting unknown entities, and rule-based systems give the highest accuracy. Thus hybrid NER systems are the most efficient for Indian languages.

This system was developed to improve upon some of the issues in the current systems. Some of the issues identified are as follows:

- The accuracy of NER systems varies with the text.
- Rule-based systems have higher accuracy but are not customizable.
- Only a limited tagset is considered in existing systems.
- The context of a word is not taken into account, especially for ambiguous named entities (for example, a Location and a Person Name being the same).
- Proper nouns are used as common nouns in certain cases.
- No standard Gazetteer list is available for names of people, organizations, cities, states, companies, ruling parties etc.
- Spelling variations.
- Lack of a high-accuracy Part of Speech Tagger.

In the current system, these problems are addressed by combining an indigenously built NLP algorithm, which uses linguistic rules and an ontology represented in RDF, with the Machine Learning (ML) model of Conditional Random Fields (CRF). RDF has been used to create a list of standard named entities, compiled from various sources, as a cue to the NLP algorithm and the ML model. 10 tags for NER have been considered, whereas most of the existing systems consider only the 3 main tags: person, location, organization. The named entities are also phrase marked to better elicit information from the text.

II. PREVIOUS WORK AND GAP ANALYSIS

The most common approaches to NER are Machine Learning (ML) algorithms and hybrids of rule-based and ML algorithms.

The Stanford NER tagger is an existing open-source NER tool. Stanford uses CRF as its classifier. In the analysis and testing of this tool on 5400 Hindi words, the precision and recall obtained were 0.45 and 0.5 respectively, and the F-score was 0.47, whereas the accuracy for English NER was 0.9223.

Chopra, Joshi and Mathur [1] have used the Hidden Markov Model (HMM) for NER in Hindi. As they explain, HMM helps develop language-independent NER systems. It is also easy to scale and analyze these systems. An F-Measure of 97.14% was obtained with training data of 2343 tokens and testing on 105 tokens. However, HMM has the label bias problem. The size of the training and testing corpus used in [1] is very limited.

Sharma and Goyal [2] considered a total of 29 features such as context word, word prefix, word suffix, POS information and Gazetteer lists (lists of person names, location names, organization names etc.). The overall precision, recall and F-Score they obtain are 72.78%, 65.82% and 70.45% respectively.

Sinha [3] focuses on disambiguating Ambiguous Proper Names (APN) in Hindi using CRF. An APN is a name which can also be used as a common noun. In his paper, Sinha [3] also mentions the need for derivation of a relevant corpus. Subsequently, a relevant corpus was created and tagged into 3 categories, namely: Names (NEP), Words in the dictionary not used as names (NNE) and other words (OTH). A combined output of CRF and Rule-Based, where the outputs are essentially OR-ed, gives an overall F-Score of 71.16%.

While surveying NER for the Marathi language, we found that A. S. Patil, B. V. Pawar and N. V. Patil [4] address the problem of assigning the correct named entity class tag to each word using the Hidden Markov Model (HMM), a probabilistic model, trained on a manually tagged corpus for the language. The system proposed in [4] reports an overall F1-Score of 62.70% when no preprocessing was applied, and an overall F1-Score of 77.79% when preprocessing was applied to the same data. The system described in [4] recognizes persons, locations, numbers and measures well, but other named entities are not recognized satisfactorily. Reference [4] uses pre-processing techniques like lemmatization to improve the efficiency, which can be a bit costly.

Association rule mining aims to find rules, given a set of transactions. It helps discover relations between variables of a large database. Jain, Yadav and Tayal [5] have proposed a system for Hindi where, for every pair of itemsets A and B, they find a set of rules where Support >= min (Support) and Confidence >= min (Confidence). They have identified 3 types of rules: Type 1: Dictionary rule, Type 2: Bi-gram rule and Type 3: Feature rule. From the results, it is observed that rules of Type 2 and Type 1 combined give the highest precision of 91.1% and recall of 67.89%.

HMM has also been used for the Gujarati language by Vora, Vasant and Adhvaryu [6]. The paper mentions no details on the accuracy of their system. Named Entity Recognition in Gujarati is, however, rare and very little explored. HMM is easy to implement for any language, which made it desirable for Vora et al. to use it for their NER in Gujarati. As input they take printed versions of Gujarati text. The drawback they point out is that their system would not recognise handwritten documents, as writing differs from person to person.

Furthermore, Das and Dhar [7] not only classify Named Entities like Organization, Person and Location, but also assign Before, Internal and End of multiword tags to entities, represented by B-tag, I-tag and E-tag. The aim is to improve the performance of Question Answering, Auto Summarization, Information Retrieval etc. with a combination of POS Tagging and Entity Recognition. The Entity Recognition task was done using Conditional Random Field (CRF). For training of the CRF, they use the ICON 2013 training datasets in open source with the following web resource: the CRF3 (https://crfsharp.codeplex.com/sourceControlilatest) classifier, which is written in C#. The entity set proposed is Name, Location, Organization, Symbol, Number, Date and Abbreviation. They report a maximum of 0.9156 and a minimum of 0.5714 precision for the symbol and abbreviation entity classes respectively, and a maximum of 0.8172 and a minimum of 0.4444 recall for these two entity classes, on 300 sentences randomly chosen from the ICON 2013 training dataset.

By combining the outputs of classifiers like Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM) using a majority voting approach, Named Entity Recognition (NER) for the Bengali language is done by Ekbal and Bandopadhyay [8]. They use four major NE tags: Person name, Location name, Organization name and Miscellaneous name. The training set consists of about 150K wordforms. Evaluation results of the voted system for the gold standard test set of 30K wordforms have demonstrated the overall recall, precision, and F-Score values of 87.11%, 83.61%,
