Language Independent Named Entity Recognition
LANGUAGE INDEPENDENT NAMED ENTITY RECOGNITION

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science by Research in Computer Science

by

MAHATHI BHAGAVATULA
201007004

SEARCH INFORMATION EXTRACTION AND RETRIEVAL LAB
International Institute of Information Technology
Hyderabad - 500 032, INDIA
DECEMBER 2012

Copyright © Mahathi Bhagavatula, 2012
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Language Independent Named Entity Recognition” by Mahathi Bhagavatula, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date
Adviser: Prof. Vasudeva Varma

To my mother Anantha Lakshmi, father Kutumbarao and all my dear ones

Acknowledgments

First of all, I would like to thank my advisor, Prof. Vasudeva Varma, for everything he has done for me: for the freedom he gave me to pursue my research, and for the support he gave me at every stage when I was deviating from my research work. His regular suggestions have been of great value. It was a pleasure and a joy working with him. His constant guidance and motivation throughout the course were invaluable and kept me going in research.

I would also like to take this opportunity to thank my parents, B. Kutumba Rao and B. Anantha Lakshmi, for their continuous encouragement and support during the course. I thank them for the freedom they have given me throughout my research. I would also like to thank my brother Yashaswi and my sister Ramayendu for their encouragement throughout the course.

I sincerely thank my lab mate Santosh GSK, without whom it would have been difficult to get through my thesis so early. I thank him for his moral support on dull days and for the knowledge he has shared with me throughout my research.
I would also like to thank my friends Ruchi, Deepthi, Swagathika, Vikram, Jatin, Nikhil and Sushma for all the motivation and encouragement they have given me throughout my course. I would like to extend my gratitude to my other labmates Kiran, Sudheer, Srikanth and Aditya, who guided me at various stages.

Abstract

The role of the Internet in personal, economic and political advancement is growing at a fast pace. By the turn of the century, data on the web reaches petabytes or exabytes, or may even scale up to unimaginable quantities. Extracting precise, structured information from such large amounts of unstructured or semi-structured data is a major concern on the web; this task is known as Information Extraction.

Named entity recognition (NER), also known as entity identification and entity extraction, is one of the important subtasks of information extraction. It seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, monetary values, percentages, expressions of time, etc. NER has many applications in NLP, e.g., in data classification, question answering, cross-language information access, machine translation, query processing, etc.

Recognizing Named Entities (NEs) in English has reached accuracies nearing 98%. For English, many cues reveal the structure of the language (one important cue for identifying NEs is capitalization), which has made such high accuracies possible. In Indian languages, no such cues are available, and moreover each Indian language differs from the others in grammatical structure. Hence, developing a language independent NER system is a challenging task.

Previous work includes developing NER systems using language dependent tools such as POS taggers, dictionaries, chunk taggers, gazetteer lists, etc.; using linguistic experts to manually tag the training and testing data; or having linguistic experts write rules for recognizing NEs.
Language independent approaches include supervised machine learning techniques such as CRF, HMM, MEMM, SVM, etc. These techniques need large amounts of manually tagged data, which is again a point of concern. Other approaches exploit external knowledge such as Wikipedia, but those methods do not utilize Wikipedia fully. Hence, the main objective of this work is to build a language independent NER system without any manual intervention and without any use of language dependent tools.

The approach specified throughout this work includes language independent methods to identify, extract and recognize NEs. Identification of NEs is done using external knowledge, namely Wikipedia. More specifically, English Wikipedia is used as an aid to derive the NEs of Indian languages. The hierarchical structure of Wikipedia is explored and its documents are divided into specific domains. For each domain, the corresponding English and Indian language documents are clustered. The English documents are tagged using the Stanford NER Tagger and the non-NEs are removed. Using term co-occurrences between the tagged English words and the untagged Indian language words, the corresponding NEs in the Indian language and in English are mapped to each other. The tag of each English NE is thus copied to the corresponding Indian language NE, and in this way the Indian language data is tagged.

The tagged data generated in the previous step is used to recognize NEs in sets of monolingual Indian language documents. In this step, a set of features is generated from the words of these documents, and these features are used to recognize NEs in a new document. For each document, the tagged data is first extracted using the data from the previous step.
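As an illustration of the identification step described above (copying the tag of an English NE to the Indian language word it co-occurs with), the following is a minimal sketch and not the thesis's implementation. The function name `project_tags`, the use of the Dice coefficient as the co-occurrence score, the threshold value, and the romanized toy words are all assumptions made for the example; the abstract does not specify the scoring function.

```python
from collections import Counter, defaultdict


def project_tags(aligned_docs, min_score=0.9):
    """Copy English NE tags to target-language words by co-occurrence.

    aligned_docs: list of (english_entities, target_words) pairs, one pair
    per linked English/target-language document pair (hypothetical format).
      english_entities: dict mapping an English NE string to its tag
      target_words: set of words in the linked target-language document
    Returns a dict mapping target-language words to projected tags.
    """
    cooc = defaultdict(Counter)  # target word -> counts of co-occurring NEs
    ne_freq = Counter()          # number of document pairs containing each NE
    word_freq = Counter()        # number of document pairs containing each word
    ne_tag = {}

    for english_entities, target_words in aligned_docs:
        words = set(target_words)
        for word in words:
            word_freq[word] += 1
        for ne, tag in english_entities.items():
            ne_freq[ne] += 1
            ne_tag[ne] = tag
            for word in words:
                cooc[word][ne] += 1

    projected = {}
    for word, counts in cooc.items():
        # Dice coefficient (an assumed scoring choice): 2*|A and B| / (|A| + |B|)
        best_ne, best_score = None, 0.0
        for ne, together in counts.items():
            score = 2 * together / (word_freq[word] + ne_freq[ne])
            if score > best_score:
                best_ne, best_score = ne, score
        if best_ne is not None and best_score >= min_score:
            projected[word] = ne_tag[best_ne]
    return projected


# Toy data: romanized placeholder words stand in for Indian language text.
docs = [
    ({"Delhi": "LOCATION", "Gandhi": "PERSON"}, {"dilli", "gandhi", "mein"}),
    ({"Delhi": "LOCATION"}, {"dilli", "mein", "shahar"}),
    ({"Gandhi": "PERSON"}, {"gandhi", "mein", "neta"}),
]
result = project_tags(docs, min_score=0.9)
```

In this toy run, a frequent function word such as `mein` co-occurs with every entity and therefore scores below the threshold, while `dilli` and `gandhi` each align strongly with a single English NE and inherit its tag.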
Next, from the remaining words of the document, a Naive Bayes classifier is built, which uses these words to generate a set of features for each class (the features here are simply the important words of a particular class in that document). The importance of these features is calculated statistically using different metrics (the metrics for classification). Given a new document, the presence of these features, along with their scores, is calculated. If the score exceeds a threshold, this implies the presence of NEs in the document. The process is then repeated on progressively smaller portions of the document until the NE is found. In this way, the monolingual Indian language document is tagged.

The approach to identifying and recognizing NEs is language independent and can be extended to any language, since no language dependent tools are used and no linguistic experts are involved. The work was carried out for Hindi, Marathi and Telugu. PERSON, LOCATION and ORGANIZATION were the NE tags used throughout the identification and recognition process.

Wikipedia is used as the dataset for identifying NEs. Around 305,574 English documents, 100,000 Hindi documents, 83,000 Marathi documents and 85,000 Telugu documents are used to generate the results. The results are evaluated on 2,328 Hindi, 1,658 Marathi and 2,200 Telugu manually tagged Wikipedia documents. The F-measure scores are 80.42 for Hindi, 81.25 for Marathi and 79.98 for Telugu.

The dataset for recognition of NEs is a set of 33,435 documents from the FIRE corpus for Hindi and 46,892 Telugu documents crawled from the web. The F-measure scores for Hindi and Telugu are 81.8 and 81.6, evaluated on 9,000 Hindi and 12,000 Telugu manually tagged documents respectively. The baseline systems used here have F-measure scores of about 56.81 for Hindi and 44.91 for Telugu.

These results are quite encouraging and outperform the baseline systems.
Moreover, the approach is language independent, unlike the baseline systems, which depend on language resources at some point in their process. Despite being language independent, the approach reaches these accuracies, which makes the system successful.

Contents

1 Introduction
  1.1 Language Independent Named Entity Recognition
  1.2 Problem Definition
    1.2.1 Motivation
    1.2.2 Problem Statement
    1.2.3 Challenges
      1.2.3.1 Variation in NEs
      1.2.3.2 Spell variations in NEs
      1.2.3.3 Disambiguation in the forms of NE
      1.2.3.4 Ambiguity with common noun
  1.3 Overview of proposed solutions
    1.3.1 Named Entity Identification
    1.3.2 Named Entity Recognition
  1.4 Contributions
  1.5 Thesis Organization

2 Related Work
  2.1 Language-Dependent Approaches
    2.1.1 Rule-Based approaches
    2.1.2 Approaches making use of Dictionaries and gazetteer lists
    2.1.3 Advantages
    2.1.4 Disadvantages
  2.2 Semi-Language-Dependent Approaches
    2.2.1 Hidden Markov Models (HMMs)
    2.2.2 Maximum Entropy Markov Models (MEMMs)
    2.2.3 Conditional Random Fields (CRF)
    2.2.4 Support Vector Machine (SVM)
    2.2.5 Decision Tree (DT)
    2.2.6 Hybrid of above approaches
    2.2.7 Advantages
    2.2.8 Disadvantages
  2.3 Language-Independent Approaches
    2.3.1 Approaches using Wikipedia
    2.3.2 Advantages
    2.3.3 Disadvantages

3 Named Entity Identification
  3.1 Role of Wikipedia in Identification of Named Entities
    3.1.1 Limitations of Previous Approaches
    3.1.2 Enhancements of this Approach
    3.1.3 Structure of Wikipedia
      3.1.3.1 Category links
      3.1.3.2 Inter-Language links
      3.1.3.3 Subtitles of the document
      3.1.3.4 Abstract
      3.1.3.5 Infobox
  3.2 Overview of the Approach
  3.3 Clustering of Similar documents
    3.3.1 Hierarchical Clustering without using Category Information of Wikipedia