International Journal of Advances in Electronics and Computer Science, ISSN(p): 2394-2835, Volume-6, Issue-9, Sep.-2019, http://iraj.in

NAMED ENTITIES RECOGNITION (NER) USING NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING

1RIA MEHTA, 2DWEEP PANDYA, 3PRATIK CHAUDHARI, 4DEVIKA VERMA, 5KRISHNANJAN BHATTACHARJEE, 6SHIVA KARTHIK S, 7SWATI MEHTA, 8AJAI KUMAR

1,2,3,4Vishwakarma Institute of Information Technology, Pune, 5,6,7,8Centre for Development of Advanced Computing, Pune, India

Abstract - Named Entity Recognition (NER) is an important task in Natural Language Processing (NLP) that aims to automatically identify and annotate Named Entities in text, such as Person, Location and Organization. NER is an essential component of various applications such as Information Extraction and Retrieval, Machine Translation, Question Answering (Q-A) and Text Summarization. Although a number of studies have been carried out on NER in Hindi, the literature survey indicates that no high-accuracy tool has yet been developed. This research proposes a methodology for Hindi Named Entity Recognition that combines NLP algorithms, the Resource Description Framework (RDF) and Conditional Random Fields (CRF). The results show that the hybrid approach achieves a recognition accuracy of 90.7% on Hindi texts.

Keywords - Named Entity Recognition, Machine Learning, Natural Language Processing, CRF

I. INTRODUCTION

The process of identifying Named Entities (NEs) in a textual document and classifying them into different conceptual categories (Name, Place, Party, Designation) is an important step in Natural Language Processing (NLP). This process is called Named Entity Recognition (NER). In the age of the World Wide Web, information is available in abundance, but very little work has been done on NLP and analytics for Indian languages. NER is a crucial prerequisite for applications such as Information Retrieval, Machine Translation, Question Answering, Text Summarization and efficient search engines. NER development for Indian languages remains a challenge owing to the syntactic and semantic complexities of these languages and the lack of processing tools. Since Hindi is the official language of India, it has been selected for this project to develop an NER tool that combines NLP with ontology, rule-based and Machine Learning based approaches to gain better accuracy than available tools. Since Hindi is part of the Indo-Aryan family, the current approach can be used for other Indo-Aryan languages such as Gujarati, Bengali and Marathi, for which optimal NER tools are also not available.

There are three main ways to perform NER: linguistic rule-sets along with an ontology, Machine Learning, or a hybrid approach comprising both. Machine Learning has been the most successful in predicting unknown entities, while rule-based systems give the highest accuracy. Thus hybrid NER systems are the most efficient for Indian languages.

This system was developed to improve upon some of the issues in the current systems. Some of the issues identified are as follows:
• The accuracy of NER systems varies with the text.
• Rule-based systems have higher accuracy but are not customizable.
• Existing systems consider only a limited tagset.
• The context of a word is not taken into account, especially for ambiguous named entities (for example, a Location and a Person Name being the same).
• Proper nouns are used as common nouns in certain cases.
• No standard Gazetteer list is available for names of people, organizations, cities, states, companies, ruling parties etc.
• Spelling variations.
• Lack of a high-accuracy Part of Speech tagger.

In the current system, these problems are addressed by combining an indigenously built NLP algorithm, which uses linguistic rules and an ontology represented through RDF, with the Machine Learning (ML) model of Conditional Random Fields (CRF). RDF has been used to create a list of standard named entities, compiled from various sources, as a cue to the NLP algorithm and the ML model. 10 tags for NER have been considered, whereas most existing systems consider only the 3 main tags: person, location and organization. The named entities are also phrase marked to better elicit information from the text.

II. PREVIOUS WORK AND GAP ANALYSIS

The most common approaches to NER are Machine Learning (ML) algorithms and hybrids of rule-based and ML algorithms.

The Stanford NER tagger is an existing open-source NER tool. Stanford uses CRF as its classifier.

Hindi Named Entities Recognition (NER) using Natural Language Processing and Machine Learning

In the analysis and testing of this tool on 5400 Hindi words, the precision and recall obtained were 0.45 and 0.5 respectively, and the F-score was 0.47, whereas the accuracy for English NER was 0.9223.

Chopra, Joshi and Mathur [1] have used a Hidden Markov Model (HMM) for NER in Hindi. As they explain, HMM helps develop language-independent NER systems, and these systems are easy to scale and analyze. An F-Measure of 97.14% was obtained with training data of 2343 tokens and testing on 105 tokens. However, HMM has the label bias problem, and the training and testing corpus used in [1] is very limited in size.

Sharma and Goyal [2] considered a total of 29 features such as context word, word prefix, word suffix, POS information and Gazetteer lists (lists of person names, location names, organization names etc.). The overall precision, recall and F-Score they report are 72.78%, 65.82% and 70.45% respectively.

Sinha [3] focuses on disambiguating Ambiguous Proper Names (APN) in Hindi using CRF. An APN is a name which can also be used as a common noun. In his paper, Sinha [3] also mentions the need for derivation of a relevant corpus. Subsequently, a relevant corpus was created and tagged into 3 categories: names (NEP), words in the dictionary not used as names (NNE) and other words (OTH). A combined output of CRF and rule-based tagging, in which the two outputs are essentially OR-ed, gives an overall F-Score of 71.16%.

While surveying NER work on inflectional languages, we found that A. S. Patil, B. V. Pawar and N. V. Patil [4] address the problem of assigning the correct named entity class tag to each word using a Hidden Markov Model (HMM), a probabilistic model, trained on a manually tagged corpus for the language. The system proposed in [4] reports an overall F1-Score of 62.70% when no preprocessing is applied and 77.79% when preprocessing is applied to the same data. The system described in [4] recognizes persons, locations, numbers and measures well, but other named entities are not recognized satisfactorily. Reference [4] uses pre-processing techniques such as lemmatization to improve efficiency, which can be somewhat costly.

Association rule mining aims to find rules, given a set of transactions. It helps discover relations between variables of a large database. Jain, Yadav and Tayal [5] have proposed a system for Hindi in which, for every pair of itemsets A and B, they find a set of rules where Support >= min(Support) and Confidence >= min(Confidence). Support and Confidence are calculated as follows:

\[ \text{Support}(A \Rightarrow B) = \frac{\text{number of transactions containing both } A \text{ and } B}{\text{total number of transactions}} \quad \text{(I)} \]

\[ \text{Confidence}(A \Rightarrow B) = \frac{\text{number of transactions containing both } A \text{ and } B}{\text{number of transactions containing } A} \quad \text{(II)} \]

They have identified 3 types of rules: Type 1, dictionary rule; Type 2, bi-gram rule; and Type 3, feature rule. From the results, it is observed that rules of Type 2 and Type 1 combined give the highest precision of 91.1% and a recall of 67.89%.

HMM has also been used for the Gujarati language by Vora, Vasant and Adhvaryu [6]. The paper mentions no details on the accuracy of their system. Named Entity Recognition in Gujarati is, however, rare and very little explored. HMM is easy to implement for any language, which made it desirable for Vora et al. to use for NER in Gujarati. As input they take printed versions of Gujarati text. The drawback they point out is that their system would not recognise handwritten documents, as writing differs from person to person.

Furthermore, Das and Dhar [7] not only classify Named Entities such as Organization, Person and Location, but also mark entities with Beginning, Internal and End of multiword tags, represented by B-tag, I-tag and E-tag. The aim is to improve the performance of Question Answering, Auto Summarization, Information Retrieval etc. with a combination of POS tagging and Entity Recognition. The Entity Recognition task was done using Conditional Random Fields (CRF). For training the CRF, they use the ICON 2013 training datasets with an open-source CRF classifier written in C#, available at the following web resource: (https://crfsharp.codeplex.com/sourceControlilatest). The entity set proposed is Name, Location, Organization, Symbol, Number, Date and Abbreviation. They report a maximum precision of 0.9156 and a minimum of 0.5714 for the symbol and abbreviation entity classes respectively, and a maximum recall of 0.8172 and a minimum of 0.4444 for these two entity classes, on 300 sentences randomly chosen from the ICON 2013 training dataset.

Ekbal and Bandyopadhyay [8] perform Named Entity Recognition (NER) for the Bengali language by combining the outputs of classifiers such as Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM) using a majority voting approach. They use four major NE tags: Person name, Location name, Organization name and Miscellaneous name. The training set consists of about 150K wordforms. Evaluation of the voted system on the gold standard test set of 30K wordforms showed overall recall, precision and F-Score values of 87.11%, 83.61% and 85.32%, respectively.

A hybrid approach has been used by Patawar and Potey [9] for NER in Marathi language tweets. Twitter, being a widely used social media platform, is naturally used in many different languages. In their system, Patawar et al. have used a hybrid of


CRF and K-Nearest Neighbour. Initially, the normalized tweets are assigned a confidence value 'cf' using a K-value of 4. A CRF labeler is then used to assign a label. The CRF calculates the probability; however, if 'cf' exceeds the threshold, the base label is assigned directly, the assigned token is added to the clusters, and the system uses it for further training. Otherwise, the CRF label is assigned. Only Location and Name entities are identified, with a precision of 39.80 and recall of 85.11 for location and a precision of 59.72 and recall of 25.28 for name tags. CRF makes it easier to add prefixes and suffixes, which is necessary for NER in Indian languages.

A deep neural based Named Entity Recognizer and Classifier for the English language has been developed by S. P. Singh et al. [10]. Their system was tested on 500 sentences, with accuracy varying from 60% to 70%. The system uses a Deep Neural Network for Parts-of-Speech (POS) tagging. POS is an essential clue in identifying Named Entities in Indian languages, but this approach has not yet been used for Indian languages. For prediction of Named Entities, the sentences are first chunked into Noun Phrases and Verb Phrases. Handcrafted rules are applied to determine the chunk type of a word (wi) in a sentence, and a memory-based classifier is then consulted to check whether the rules are satisfied. A database (DB2) is maintained which contains chunk types and their corresponding categories. The handcrafted rules also check whether the first letter of the entity in question is in uppercase. This advantage does not exist in Indian languages.

Thus, NER systems for the English language:
• Have the advantage of the first letter of a Named Entity being capitalized.
• Have the advantage of a rigid Subject-Verb-Object sentence structure.
• Have accurate tools for Parts of Speech (POS) tagging.

However, for the Hindi language, no system has yet proved to be both robust and accurate. The above-mentioned advantages of NER systems for English do not exist for Hindi. Thus an NER system for Hindi needs a novel approach to achieve better accuracy than existing ones, despite lacking the capitalization rule of English, an accurate POS tagger (the Stanford POS tagger being more than 97% accurate) and a standard classified dataset for ML training or labeling.

III. SCOPE OF THE SYSTEM

A. Tagset for Named Entities
The system uses the following 10 tags for Named Entities:

Sr. No. | Tag  | Meaning                   | Examples
1       | NEP  | Named Entity Person       | आदिनाथ, छाया
2       | NEC  | Named Entity Currency     | रुपया, यूरो, डॉलर
3       | NED  | Named Entity Designation  | अध्यक्ष, प्रधान मंत्री
4       | NEO  | Named Entity Organization | राज्य सरकार, गूगल
5       | NEA  | Named Entity Abbreviation | आईबीएम, सीआरएफ
6       | NETP | Named Entity Title-Person | श्री, कुमारी, महाशय
7       | NEL  | Named Entity Location     | भारत, पुणे
8       | NETI | Named Entity Time         | फरवरी, शाम 5 बजे
9       | NEN  | Named Entity Number       | १९९७, २३
10      | NEM  | Named Entity Measure      | ३ दिन, ५ किलो
Table 1 NER Tagset

B. Named Entity Resource Creation
Resource Description Framework (RDF):
RDF is used as the framework to store the Gazetteer lists of names of various entities such as persons, locations (states, districts, cities etc.), organizations (private, government etc.), days, airports, companies, hills, forests etc. RDF is a World Wide Web Consortium (W3C) standard. It is useful for storing data in a hierarchical form in which principles of ontology can be incorporated for entity linking. For example, Country->State/Province->City->Taluka/Village is a linked hierarchical entity best represented in RDF. The framework produces files with the extension OWL, which stands for Web Ontology Language. Here the ontology represents hierarchies of real-world entities as part of world-knowledge linking. The elements are represented as URIs in these OWL files, which makes access faster compared to a relational database. The URIs are used to name the relationship between two things, forming what are called triples. The triples are in the form Subject-Predicate-Object. Finally, the linking structure forms a labeled directed graph. The edges represent the name of the relationship between two resources, and the resources are represented as nodes of the graph. Thus, the Named Entities get represented in linked form


with well-defined connections. This not only helps the process of NER but also has various uses in Information Retrieval.

C. CoNLL 2003 data format:
The training corpus used for our Machine Learning algorithm (CRF) is in the CoNLL (Conference on Computational Natural Language Learning) 2003 shared task format. The training data consists of three columns separated by a tab. The first column of each line is the word, the second column is its POS tag and the third column is the NER tag for the word. Sentences are delimited by an empty line in the training data.
For example,

Word        | POS Tag | NER Tag
केंद्रीय      | JJ      | 0
वित्तमंत्री   | NNP     | NED
अरुण        | NNPC    | NEP
जेटली       | NNP     | NEP
बुधवार      | NNP     | NETI
Table 2 Training Data Format
[0 in the third column represents a word that is not a Named Entity.]

IV. PROPOSED SYSTEM

As mentioned in Section I, the proposed system is a novel approach combining indigenously built NLP algorithms and Hindi Karaka Theory based rule-sets, supplemented and supported by world knowledge collected for Named Entities, which is represented through RDF. A CRF-based approach is applied alongside these to achieve better accuracy in the culmination of this three-pronged approach. The following diagram depicts the stages of the proposed approach.

A. System Architecture
The flow of the system is as follows:

Figure 1: System Architecture

V. METHODOLOGY

A. Raw Text
The system takes as input raw unstructured Hindi text in the reporting style of English, in general domains (like newspaper articles, reports, social media, blogs etc.).

B. RDR POS Tagger
The text input is then passed to the open-source POS tagger, which is a Ripple Down Rule-based (RDR) Parts of Speech tagger. This POS tagger was thoroughly trained on crawled data of more than 2 lakh (200,000) corpora provided by C-DAC, which increased its accuracy to approximately 80% after training, compared with a native accuracy of 65%.

C. Preprocessing
This phase deals with recognizing two or more consecutive words that have a collective meaning (Multi Word Expressions, MWEs) and phrase-marking them as a single entity delimited by a '-' (hyphen).
• RDF-based Phrase marking: Phrase marking done using the data collected for the system comes under the scope of RDF-based phrase marking.
• POS-based Phrase marking: Entities that are not phrase marked through RDF classified data lookup have a good chance of being phrase marked using sentence grammar, i.e. POS tags. For example, two consecutive words that have the POS tag 'NNP' (Proper Noun), like नरेंद्र and मोदी, can be phrase marked as नरेंद्र-मोदी.

D. Hybrid System
In this phase, the actual named entities are recognized using a hybrid approach of RDF and CRF.
• World Knowledge (RDF-based list lookup): The POS tagged and phrase marked input is passed for RDF classified data lookup, where Named Entities are matched against the available dataset and tagged directly. RDF is used for its fast retrieval of data, and since RDF has built-in support for data relationships, it is a well-suited database for our system.
• Natural Language Processing (NLP) based approach: NLP rules are used to predict the Named Entities that are not identified by RDF lookup. The rules were formed using Karaka Theory principles, where the Vibhaktis (case markers) in Hindi are taken as semantic markers for preceding and succeeding words. Case markers in Hindi such as की/के/को/का/ने are considered for named entity category identification. They give clues as to whether a particular entity is a person, a location etc. As mentioned in Section I, Indian names are also used as common nouns. The vibhaktis give clues in deciding whether the word is used as a name or a common noun. For example,
o If a word is followed by ने (ne):


It is likely a proper noun.
o Using predicate logic and formal semantics, the representation of this rule is:
∀x(n) (follows(ने, x(n)) → tag(x(n)) = Proper noun)
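As a minimal sketch of how the ने rule above might be applied, the following Python function walks a tokenized sentence and marks any token that is immediately followed by ने as a proper noun. The function name `apply_ne_rule`, the `(token, tag)` output shape and the sample sentence are illustrative assumptions, not the authors' implementation; `NNP` is the proper-noun POS tag used elsewhere in this paper.

```python
from typing import List, Optional, Tuple

NE_MARKER = "ने"  # the case marker (vibhakti) named in the rule


def apply_ne_rule(tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Tag a token as a proper noun (NNP) when the next token is ने.

    Mirrors: for all x(n), follows(ने, x(n)) -> tag(x(n)) = Proper noun.
    Tokens on which the rule does not fire get None, so that a later
    stage can still decide their tag.
    """
    tagged = []
    for i, token in enumerate(tokens):
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        fires = next_token == NE_MARKER and token != NE_MARKER
        tagged.append((token, "NNP" if fires else None))
    return tagged


# Hypothetical example sentence: "मोदी ने कहा" ("Modi said").
print(apply_ne_rule(["मोदी", "ने", "कहा"]))
```

Because the rule only says that the word is likely a proper noun, a sketch like this is best read as producing candidates that later rules can confirm or reject, rather than final tags.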

Similarly, rule-sets are formed for choosing a tag, predicting a tag and verifying a tag. The following table gives one representative rule of each type, expressed in predicate logic:

Sr. No. | Rule | Predicate Representation
Rules for choosing a tag:
1 | If word contains multiple tokens separated by hyphen: tag(word) = tag of last token | ∀x(0…n) (tag(x(0…n)) = tag(x(n)))
Rules for predicting a tag:
2 | If word(n+1) == की or के or को or का: tag(word) = NEP | ∀x(n) ((follows(की, x(n)) ∨ follows(का, x(n)) ∨ follows(को, x(n)) ∨ follows(के, x(n))) → tag(x(n)) = NEP)
Rules for verifying the tag:
3 | If tag(word) == NEP: if (word(n-1) == propernoun): tag(word) = NEP | ∀x(n) ((propernoun(x(n-1)) → tag(x(n)) = NEP) ∨ (¬propernoun(x(n-1)) → tag(x(n)) = ∅))
Table 3 List of Rules

• Prediction of remaining named entities with CRF:
Named Entities that are not classified using the list-lookup approach or the rules are further classified using the Machine Learning algorithm CRF. A Conditional Random Field (CRF) is a sequential statistical modeling method often applied in NER and used for structured prediction. CRFs fall into the sequence modeling family. A discrete classifier predicts a label for a single entity without considering the entities preceding and succeeding it, whereas a CRF can take context into account. CRFs are often used for labeling or parsing sequential data, such as in NLP or biological sequences, and in computer vision. A CRF considers many features. The POS tag of each entity is used as one of the features while predicting its NER tag. The format of the data is shown in Table 2. For training, a large corpus of around 500,000 words from the IJCNLP-08 workshop on NER [11] shared task, converted into the CoNLL-2003 shared task format, is used.

Table 4 Tag Information in Training and Testing Files for CRF

The training data for CRF has a total of around 500,000 tokens. The RDF data collected has a total of around 15,000 Named Entities for lookup.

VI. OUTPUT AND SYSTEM INTERFACE

Finally, the tagged text obtained is as follows:

बिहार/NNP के/PSP अन्य/JJ जिलों/NN -/SYM पूर्णिया/NNP ,/SYM मुंगेर/NN ,/SYM भागलपुर/NNP में/PSP भी/RP इनकी/PRP आबादी/NN काफ़ी/INTF है/VAUX

The above auto-tagging of Named Entities is shown in the following screenshot of the system's User Interface. For analysis, entities recognised from the data in RDF are displayed in red, those predicted using the rules in green, and those predicted by CRF in blue, as follows:

Figure 2: Named Entity Recognition Output

Along with the output, some statistics that provide better insight into the processed text are also displayed, such as the count of entities recognized and a graphical representation of the same:

Figure 3: Entity Count
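The cascade of Section V.D (RDF lookup first, then rules, then CRF), together with the per-source display and entity counts described above, can be sketched as follows. This is a minimal illustration under stated assumptions: the small `GAZETTEER` dictionary stands in for the RDF lookup, `rule_predict` implements only rule 2 of Table 3, and `crf_predict` is a hypothetical placeholder for a trained CRF model. None of these names come from the authors' system.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Stand-in for the RDF gazetteer lookup (hypothetical sample entries).
GAZETTEER: Dict[str, str] = {"भारत": "NEL", "पुणे": "NEL", "गूगल": "NEO"}

# Case markers used by rule 2 of Table 3.
PERSON_MARKERS = {"की", "के", "को", "का"}

Tagged = Tuple[str, str, str]  # (token, tag, source)


def rule_predict(tokens: List[str], i: int) -> Optional[str]:
    """Rule 2: a word followed by की/के/को/का is predicted as NEP."""
    if i + 1 < len(tokens) and tokens[i + 1] in PERSON_MARKERS:
        return "NEP"
    return None


def hybrid_tag(tokens: List[str],
               crf_predict: Callable[[List[str], int], str]) -> List[Tagged]:
    """Tag each token, recording which stage ('rdf', 'rule', 'crf') decided."""
    out: List[Tagged] = []
    for i, tok in enumerate(tokens):
        if tok in GAZETTEER:                      # stage 1: list lookup
            out.append((tok, GAZETTEER[tok], "rdf"))
            continue
        tag = rule_predict(tokens, i)             # stage 2: linguistic rule
        if tag is not None:
            out.append((tok, tag, "rule"))
            continue
        out.append((tok, crf_predict(tokens, i), "crf"))  # stage 3: CRF
    return out


def entity_counts(tagged: List[Tagged]) -> Counter:
    """Count recognized entities per tag; '0' marks a non-entity."""
    return Counter(tag for _, tag, _ in tagged if tag != "0")


# Hypothetical sentence; the lambda stands in for a trained CRF.
demo = hybrid_tag(["राम", "को", "पुणे", "जाना", "है"], lambda toks, i: "0")
print(demo)
print(entity_counts(demo))
```

Keeping the deciding stage alongside each tag is what would let an interface colour entities by source, and the counter corresponds to the entity-count statistic. A fuller version would also run the choosing and verifying rules of Table 3, which this sketch omits.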


Figure 4: Graphical Representation of Named Entities

VII. RESULTS

Testing was performed on 50,000 documents obtained by crawling articles of different Hindi newspapers, acquired from the Centre for Development of Advanced Computing (C-DAC). The Precision and Recall values of the identified named entities are 90.3% and 91.4% respectively. The F-Measure, the harmonic mean of precision and recall, is thus 90.8%. The accuracy of the Named Entity identification process was determined by manually evaluating the system-generated Named Entities from representative sets of documents, after grouping the documents into clusters in which each document follows a similar pattern.

Figure 5: Depiction of F-Measure

ACKNOWLEDGEMENT

We are indebted to the guides of the Applied Artificial Intelligence Group (AAI), Centre for Development of Advanced Computing (C-DAC), Pune: Dr. Krishnanjan Bhattacharjee, Mr. Shiva Karthik S and Mrs. Swati Mehta, for all the invaluable guidance and support at every step of the research project. We also want to thank our internal guide Mrs. Devika Verma for her constant guidance, evaluation and feedback on this project.

REFERENCES

[1] D. Chopra, N. Joshi and I. Mathur, "Named Entity Recognition in Hindi Using Hidden Markov Model," 2016 Second International Conference on Computational Intelligence & Communication Technology (CICT), Ghaziabad, 2016, pp. 581-586.
[2] Sharma R., Goyal V. (2011) Name Entity Recognition Systems for Hindi Using CRF Approach. In: Singh C., Singh Lehal G., Sengupta J., Sharma D.V., Goyal V. (eds) Information Systems for Indian Languages. ICISIL 2011. Communications in Computer and Information Science, vol 139. Springer, Berlin, Heidelberg.
[3] R. M. K. Sinha, "Learning Recognition of Ambiguous Proper Names in Hindi," 2011 10th International Conference on Machine Learning and Applications and Workshops, Honolulu, HI, 2011, pp. 178-182.
[4] N. V. Patil, A. S. Patil and B. V. Pawar, "HMM based Named Entity Recognition for inflectional language," 2017 International Conference on Computer, Communications and Electronics (Comptelix), Jaipur, 2017, pp. 565-572.
[5] Jain, D. Yadav and D. K. Tayal, "NER for Hindi language using association rules," 2014 International Conference on Data Mining and Intelligent Computing (ICDMIC), New Delhi, 2014, pp. 1-5.
[6] K. Vora, A. Vasant and R. Adhvaryu, "Named entity recognition and classification for Gujarati language," 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, 2016, pp. 2269-2272.
[7] S. K. Das and S. Dhar, "Entity Recognition in Bengali language," 2015 International Symposium on Advanced Computing and Communication (ISACC), Silchar, 2015, pp. 157-160.
[8] A. Ekbal and S. Bandyopadhyay, "Bengali Named Entity Recognition Using Classifier Combination," 2009 Seventh International Conference on Advances in Pattern Recognition, Kolkata, 2009, pp. 259-262.
[9] M. L. Patawar and M. A. Potey, "Extending hybrid Conditional Random Fields approach of Named Entity Recognition for Marathi tweets," 2016 International Conference on Computing Communication Control and automation (ICCUBEA), Pune, 2016, pp. 1-5.
[10] S. P. Singh, A. Kumar and H. Darbari, "Deep neural based name entity recognizer and classifier for English language," 2017 International Conference on Circuits, Controls, and Communications (CCUBE), Bangalore, 2017, pp. 242-246.
[11] Ltrc.iiit.ac.in. (2019). IJCNLP-08 Workshop on NER for South and South East Asian Languages. [online] Available at: http://ltrc.iiit.ac.in/ner-ssea-08/ [Accessed 14 May 2019]


