Named Entity Recognition for a Resource Poor Indo-Aryan Language
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by
Padmaja Sharma
Enrollment No. CSP-09006
Registration No. 059 of 2010

Department of Computer Science and Engineering
School of Engineering, Tezpur University
Tezpur, Assam, India - 784028
September, 2015

Dedicated to my PARENTS

Abstract

Named Entity Recognition (NER), a subfield of information extraction, is one of the most important topics in Natural Language Processing. It is the process through which machines identify the proper nouns in text and associate them with appropriate tags. NER has made significant progress in European languages, but it remains a challenging task in Indian languages because of the lack of proper resources. Because natural language is polysemous, ambiguity exists among name references. Recognizing the ambiguity and assigning the proper tag to each name is the goal of NER. NER is thus a two-stage process: identification of the proper nouns, and classification of these nouns into different classes such as person, location, organization and miscellaneous, which includes date, time, year, etc. The main aim of our work is to develop a computational system that can perform NER on text in Assamese, a resource-poor Indo-Aryan language. The thesis discusses the issues related to NER in general and in Indian languages, along with the different approaches to NER. We survey the work carried out by researchers in various Indian languages, along with the datasets and tagsets they used. We first focus on a rule-based approach, which involves identifying the root word from an inflected form, known as stemming, and we then derive hand-coded rules for the different classes of NEs. We also discuss the tagging of NEs using a gazetteer list.
Then we experiment with two machine learning approaches, namely CRF and HMM, and find that the system performs reasonably well for our language. Lastly, we implement a hybrid approach that combines the rule-based and machine learning approaches, and find that the combined approach improves the performance of the system compared to a single approach. Finally, we conclude our work and suggest some future work along this line.

Declaration

I, Padmaja Sharma, hereby declare that the thesis entitled Named Entity Recognition for a Resource Poor Indo-Aryan Language, submitted to the Department of Computer Science and Engineering under the School of Engineering, Tezpur University, in partial fulfillment of the requirements for the award of the degree of Doctor of Philosophy, is based on bona fide work carried out by me. The results embodied in this thesis have not been submitted, in part or in full, to any other university or institute for the award of any degree or diploma.

(Padmaja Sharma)

Department of Computer Science & Engineering
Tezpur University
Napaam, Tezpur - 784028, Assam, India.

CERTIFICATE

This is to certify that the thesis entitled “Named Entity Recognition for a Resource Poor Indo-Aryan Language” submitted to the School of Engineering, Tezpur University in partial fulfillment for the award of the degree of Doctor of Philosophy in the Department of Computer Science and Engineering is a record of research work carried out by Ms. Padmaja Sharma under my supervision and guidance. All help received by her from various sources has been duly acknowledged. No part of this thesis has been submitted elsewhere for award of any other degree.
(Utpal Sharma)
Supervisor
Department of Computer Science and Engineering
Tezpur University
Assam, India - 784028

CERTIFICATE

This is to certify that the thesis entitled “Named Entity Recognition for a Resource Poor Indo-Aryan Language” submitted to the School of Engineering, Tezpur University in partial fulfillment for the award of the degree of Doctor of Philosophy in the Department of Computer Science and Engineering is a record of research work carried out by Ms. Padmaja Sharma under my supervision and guidance. All help received by her from various sources has been duly acknowledged. No part of this thesis has been submitted elsewhere for award of any other degree.

(Jugal K. Kalita)
Co-Supervisor
Dept of Computer Science
University of Colorado at Colorado Springs
Colorado Springs, Colorado.

CERTIFICATE

This is to certify that the thesis entitled “Named Entity Recognition for a Resource Poor Indo-Aryan Language” submitted by Ms. Padmaja Sharma to Tezpur University in the Department of Computer Science and Engineering under the School of Engineering in partial fulfillment for the award of the degree of Doctor of Philosophy in Computer Science and Engineering has been examined by us on ........................................... and found to be satisfactory. The committee recommends the award of the degree of Doctor of Philosophy.

Signature of Principal Supervisor
Signature of External Examiner
Signature of Doctoral Committee

Heartfelt thanks ...

I wish to record my deep sense of gratitude to my supervisor Prof. Utpal Sharma from Tezpur University and co-supervisor Prof. Jugal K. Kalita from University of Colorado, Colorado Springs for their inspiring guidance and constant supervision at each and every stage during the entire period of the present work.

I wish to thank the Head of the Dept, Computer Science and Engineering, Tezpur University, Tezpur for providing me the necessary laboratory facilities to carry out the work.
I shall ever remain grateful to the members of the doctoral committee of my research: Prof. Utpal Sharma, Prof. Shyamanta M. Hazarika and Dr. Sarat Saharia, and to the departmental research committee and other faculty members for their insightful comments and valuable feedback. I extend my sincere thanks to the staff members of the Department: Ajay da, Arun da, Raghavendra, Gautam da, Pranati Baidew, Ginu Baidew and Gulap da for their help and suggestions during my work.

I thank all my co-research workers: Adity, Navanath and Dushyanta for their help and co-operation during the entire period of my work.

Lastly, my thanks go to my parents, husband and my family, without whose help and inspiration it would not have been possible to complete my work.

- Padmaja Sharma

Contents

Abstract
Declaration
Acknowledgment

1 Introduction
  1.1 Introduction
    1.1.1 Definition of Named Entity Recognition
    1.1.2 Application of Named Entity Recognition
  1.2 Problems in Named Entity Recognition
  1.3 Approaches to Named Entity Recognition
  1.4 Conclusion

2 NER Research in Indian Languages
  2.1 Introduction
  2.2 NER work in Indian languages
    2.2.1 NER in Bengali
    2.2.2 NER in Hindi
    2.2.3 NER in Telugu
    2.2.4 NER in Tamil
    2.2.5 NER in Oriya
    2.2.6 NER in Urdu
    2.2.7 NER in Punjabi
    2.2.8 NER in Kannada
    2.2.9 International Work done on NER
    2.2.10 Work done on NER using Gazetteer
  2.3 Tagsets
    2.3.1 The IJCNLP-08 Shared Task Tagset
    2.3.2 The Named Entity Tagset by Ekbal and Bandopadhyay
    2.3.3 CoNLL 2003 Shared Tagset
  2.4 Evaluation Metrics used in NER
  2.5 Cross-validation
  2.6 Conclusion

3 Rule-based and Gazetteer-based NER for Assamese
  3.1 Introduction
  3.2 Rule-based Approach
    3.2.1 Different Approaches to Stemming
    3.2.2 Previous Work on Stemming
    3.2.3 Assamese Language and its Morphology
    3.2.4 Assamese Corpora
    3.2.5 Our Approach
    3.2.6 Results and Discussions
  3.3 Gazetteer-based Approach
    3.3.1 Results and Discussions
  3.4 Conclusion

4 Machine Learning based NER for Assamese
  4.1 Introduction
  4.2 Features used in Named Entity Recognition
  4.3 CRF Approach
    4.3.1 Experiment
    4.3.2 Results and Discussion
  4.4 The HMM Approach
    4.4.1 Our Experiment
    4.4.2 Results and Discussion
  4.5 A Hybrid approach for NER
    4.5.1 Previous work on NER using Hybrid Approach
    4.5.2 Results and Discussions
  4.6 Conclusion

5 Conclusion and Future Directions

Publications

A Gazetteer List

List of Tables

2.1 NER approaches in Indian languages
2.2 NER approaches in Indian languages
3.1 Example of suffixes found in locations and organizations
3.2 Examples of location named entities
3.3 Suffix stripping experiment for location named entity
3.4 Examples of issues in preparing gazetteer list
3.5 Data for gazetteer list
3.6 Results obtained using gazetteer list
4.1 Example of training file for CRF
4.2 Format of the Serialized file
4.3 NER results for Set 1 using CRF
4.4 NER results for Set 2 using CRF
4.5 NER results for Set 3 using CRF
4.6 Average CRF Results
4.7 NER results for Set 1 using HMM
4.8 NER results for Set 2 using HMM
4.9 NER results for Set 3 using HMM
4.10 Average HMM Results
4.11 NER results for Set 1 using HMM and Smoothing
4.12 NER results for Set 2 using HMM and Smoothing
4.13 NER results for Set 3 using HMM and Smoothing
4.14 Average HMM Results after applying Smoothing
4.15 NER results for Bodo Dataset
4.16 NER results for BPM Dataset