LANGUAGE INDEPENDENT NAMED ENTITY RECOGNITION

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science by Research in Computer Science

by

MAHATHI BHAGAVATULA
201007004
[email protected]

SEARCH INFORMATION EXTRACTION AND RETRIEVAL LAB
International Institute of Information Technology
Hyderabad - 500 032, INDIA
DECEMBER 2012

Copyright © Mahathi Bhagavatula, 2012
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Language Independent Named Entity Recognition” by Mahathi Bhagavatula, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                                  Adviser: Prof. Vasudeva Varma

To my mother Anantha Lakshmi, father Kutumbarao and all my dear ones

Acknowledgments

First of all, I would like to thank my advisor, Prof. Vasudeva Varma, for everything he has done for me: for the freedom he gave me in pursuing my research, and for the support he gave me at every stage where I was deviating from my research work. His regular suggestions have been of great value. It was a pleasure and joy working with him. His constant guidance and motivation throughout the course were invaluable and kept me going in research.

I would then take the opportunity to thank my parents, B. Kutumba Rao and B. Anantha Lakshmi, for their continuous encouragement and support during the course, and for the freedom they gave me throughout my research. I would also like to thank my brother Yashaswi and my sister Ramayendu for their encouragement throughout the course.

I sincerely thank my lab mate Santosh GSK, without whom it would have been difficult to get through my thesis so early. I thank him for his moral support on dull days and for the knowledge he shared with me throughout my research. I would also like to thank my friends Ruchi, Deepthi, Swagathika, Vikram, Jatin, Nikhil and Sushma for all the motivation and encouragement they gave me throughout my course. I would like to extend my gratitude to my other labmates Kiran, Sudheer, Srikanth and Aditya, who guided me at various stages.

Abstract

The role of the Internet in personal, economic and political advancement is growing at a fast pace. By the turn of the century, data on the web will reach petabytes or exabytes, or may scale up to even larger quantities. Extraction of precise and structured information from such large amounts of unstructured or semi-structured data is the major concern on the web, known as Information Extraction.

Named entity recognition (NER) (also known as entity identification and entity extraction) is one of the important subtasks of information extraction. It seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations and locations, monetary values, percentages, expressions of time, etc. NER has many applications in NLP, e.g., in data classification, question answering, cross language information access, machine translation systems, query processing, etc.

Recognizing Named Entities (NEs) in English has reached accuracies nearing 98%. For English, many cues reveal the structure of the language (one important cue for identifying NEs is capitalization), which has pushed accuracies high. In Indian languages, no such cues are available, and moreover each Indian language differs from the others in grammatical structure. Hence, developing a language independent NER is a challenging task.

Previous works include developing NER systems using language dependent tools such as a POS Tagger, dictionaries, a Chunk Tagger, gazetteer lists, etc.; or they have used linguistic experts to manually tag the training and testing data, or to generate rules for recognizing NEs. Language independent approaches include supervised machine learning techniques such as CRF, HMM, MEMM, SVM, etc. These techniques need high amounts of manually tagged data, which is again a point of concern. Some other approaches exploit external knowledge such as Wikipedia. But in those methods the utilization of Wikipedia is not complete. Hence, the main objective of this work is to build a language independent NER system without any manual intervention and without any usage of language dependent tools.

The approach specified throughout this work includes language independent methods to identify, extract and recognize NEs. Identification of NEs is done using an external knowledge source, namely


Wikipedia. More specifically, English Wikipedia is used as an aid to derive the NEs of Indian languages. Wikipedia's hierarchical structure is explored and the documents in it are divided into specific domains. Each domain is considered, and the corresponding English and Indian language documents are clustered. The English documents are tagged using the Stanford NER Tagger and the non-NEs are removed. Using the term co-occurrences between the tagged English words and the untagged Indian language words, the corresponding NEs in the Indian language and English are mapped. Thus the tag of the English NE is duplicated to the Indian language NE, and the Indian language data is tagged.

The tagged data generated in the previous step is used for recognition of NEs in sets of monolingual Indian language documents. In this step, a set of features is generated from the words of these documents, and these features are used for recognition of NEs in a new document. Consider each document: the tagged data is extracted from it using the data from the previous step. Then, from the remaining words of the document, a Naive Bayes classifier is built which uses these words to generate a set of features for each class (the features here are simply the important words of a particular class in that document). The importance of these features is calculated statistically by different metrics (the metrics for classification). Now, given a new document, the presence of these features along with their scores is calculated. If the score exceeds a threshold, this implies the presence of NEs in the document. By decreasing the size of the document, the process is repeated until the NE is obtained. Hence, the monolingual Indian language document is tagged.

The approach specified for identifying and recognizing NEs is language independent and can be extended to any language, as no language dependent tools are used and there is no involvement of linguistic experts. Hindi, Marathi and Telugu are the languages on which the work was done. PERSON, LOCATION and ORGANIZATION are the NE tags used throughout the identification and recognition process.

Wikipedia is used as the dataset for identifying NEs. Around 305,574 English documents, 100,000 Hindi documents, 83,000 Marathi documents and 85,000 Telugu documents are used to generate the results. The results are evaluated on 2,328 Hindi, 1,658 Marathi and 2,200 Telugu manually tagged Wikipedia documents respectively. The F-Measure scores are 80.42 for Hindi, 81.25 for Marathi and 79.98 for Telugu.

The dataset for recognition of NEs is a set of 33,435 documents of the FIRE corpus for Hindi and 46,892 Telugu documents crawled from the web. The F-measure scores for Hindi and Telugu are 81.8 and 81.6, evaluated on 9,000 and 12,000 manually tagged Hindi and Telugu documents respectively. The baseline systems used here have F-Measure scores of nearly 56.81 and 44.91 for Hindi and Telugu respectively.

The above results are quite encouraging and they outperform the baseline systems. Moreover, the approach specified is language independent, unlike the baseline systems, which depend on language resources at some point in their process. In spite of being language independent, the approach reaches accuracies which make the system successful.

Contents


1 Introduction
  1.1 Language Independent Named Entity Recognition
  1.2 Problem Definition
      1.2.1 Motivation
      1.2.2 Problem Statement
      1.2.3 Challenges
            1.2.3.1 Variation in NEs
            1.2.3.2 Spell variations in NEs
            1.2.3.3 Disambiguation in the forms of NE
            1.2.3.4 Ambiguity with common noun
  1.3 Overview of proposed solutions
      1.3.1 Named Entity Identification
      1.3.2 Named Entity Recognition
  1.4 Contributions
  1.5 Thesis Organization

2 Related Work
  2.1 Language-Dependent Approaches
      2.1.1 Rule-Based approaches
      2.1.2 Approaches making use of Dictionaries and gazetteer lists
      2.1.3 Advantages
      2.1.4 Disadvantages
  2.2 Semi-Language-Dependent Approaches
      2.2.1 Hidden Markov Models (HMMs)
      2.2.2 Maximum Entropy Markov Models (MEMMs)
      2.2.3 Conditional Random Fields (CRF)
      2.2.4 Support Vector Machine (SVM)
      2.2.5 Decision Tree (DT)
      2.2.6 Hybrid of above approaches
      2.2.7 Advantages
      2.2.8 Disadvantages
  2.3 Language-Independent Approaches
      2.3.1 Approaches using Wikipedia
      2.3.2 Advantages
      2.3.3 Disadvantages


3 Named Entity Identification
  3.1 Role of Wikipedia in Identification of Named Entities
      3.1.1 Limitations of Previous Approaches
      3.1.2 Enhancements of this Approach
      3.1.3 Structure of Wikipedia
            3.1.3.1 Category links
            3.1.3.2 Inter-Language links
            3.1.3.3 Subtitles of the document
            3.1.3.4 Abstract
            3.1.3.5 Infobox
  3.2 Overview of the Approach
  3.3 Clustering of Similar documents
      3.3.1 Hierarchical Clustering without using Category Information of Wikipedia
      3.3.2 Clustering by considering the Category Information of Wikipedia
  3.4 Identification of NEs from Infobox
      3.4.1 Map corresponding Keys across Languages
      3.4.2 Tagging non-English Data with NE tags
  3.5 Identification of NEs from Subtitle
      3.5.1 Mapping the Subtitles of Hindi with English subtitles
      3.5.2 Clustering of Similar Subtitles
      3.5.3 Term Co-occurrences
  3.6 Identification of NEs from Abstract
  3.7 Evaluation
      3.7.1 Dataset and Test Set
      3.7.2 Baseline System
      3.7.3 Evaluation Metrics
  3.8 Experiments and Results
      3.8.1 Experiment 1: Exploitation of structure of the Wikipedia page
      3.8.2 Experiment 2: Similarity metrics for Clustering
      3.8.3 Experiment 3: Variations in Lambda Scores
      3.8.4 Experiment 4: Varying of beta values
  3.9 Discussions
  3.10 Conclusions and Future Work

4 Named Entity Recognition
  4.1 Named Entity Recognition Vs Named Entity Identification
  4.2 Building of Statistical Model
      4.2.1 Naive Bayes Classification
  4.3 Feature Generation and Selection
      4.3.1 Mutual Information
      4.3.2 χ2 Feature Selection
      4.3.3 Frequency Based Feature Selection
      4.3.4 Point-wise Mutual Information
      4.3.5 Why only these Features?
  4.4 Recognition and Tagging of NEs
  4.5 Challenges and Enhancements
      4.5.1 Grouping of Similar NE's to overcome Variations in NE's
      4.5.2 Edit Distances to overcome Variations in Spellings
      4.5.3 Ambiguity in Tagging and Identifying NE's
  4.6 Evaluation
      4.6.1 Dataset and Test set
      4.6.2 Metrics
      4.6.3 Baseline
  4.7 Experiments and Results
      4.7.1 Experiment 1: Variation of α values
      4.7.2 Experiment 2: Threshold for Feature Selection
      4.7.3 Experiment 3: Experiment on stabilizing the size window of words
      4.7.4 Experiment 4: Threshold Vs F-Measure
      4.7.5 Experiment 5: Edit distance Vs F-Measure
      4.7.6 Experiment 6: Baseline Comparison
  4.8 Discussions
  4.9 Conclusions

5 Summary and Conclusions

Bibliography

List of Figures


1.1 Content of Languages for Websites

3.1 Variation of λ1 values
3.2 Variation of β1 values

4.1 Variations in α values
4.2 Number of features Vs F-Measure for different categories
4.3 Variation of window size with F-Measure
4.4 Threshold Vs F-Measure
4.5 Edit distance Vs F-Measure

List of Tables


3.1 Experiment to note the contribution of structure of Wikipedia in Hindi
3.2 Experiment to note the contribution of structure of Wikipedia in Telugu
3.3 Experiment to note the contribution of structure of Wikipedia in Marathi
3.4 Experiment to compare Similarity Metrics

4.1 Calculation of MI
4.2 Calculation of χ2 test
4.3 Experiment on Hindi data
4.4 Experiment on Telugu data

Chapter 1

Introduction

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, like automatic annotation and content extraction from images, audio or video, can also be seen as IE. Automatic extraction of entities, relationships between entities, and attributes describing entities from data on the web is the major goal of IE. These entities enable much richer forms of queries on the abundant unstructured sources than are possible with keyword searches alone. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.

Named Entity Recognition (NER) involves identifying names within text and classifying each such identified instance. This processing has become a standard component of IE, enabling the extraction of useful information from documents.

NER is an important component in many NLP applications. In question answering, the NE tag helps recognize the answer to a question in a given sentence. In query processing, NEs carry more importance than the other words in a search query. In cross lingual information retrieval and machine translation, where translation of words between different languages is required, NE's are transliterated whereas the remaining words are translated. Data classification accuracies can be improved by using NE tagged data. Apart from the above, there are many more NLP applications where NER is an anchor for improvement.

Recognition of NE's can be done using several approaches. Grammatical rules can be written for each language. These rules are written and maintained by linguistic experts, so the accuracies of such systems are very high. Other approaches are methods that use supervised machine learning techniques. For these methods, the data is annotated using various features; the features can be

language dependent or independent. This data is used for training and testing. Several techniques are used, such as CRF, HMM, MEMM, SVM, etc. Another well-known family of approaches uses external knowledge for recognizing NE's. The external knowledge can be Wikipedia, DBpedia, the FIRE corpus, or data from shared tasks of conferences like CoNLL-03, IJCNLP-08, MUC-6, MUC-7, etc.

NE's are identified and classified into different categories. There are several types into which a given entity can be categorized. Some of them are string categories, which involve the names of persons, locations and organizations; numerical categories like percentages and monetary values; or categories which involve different formats of date, year, month, age, etc. A word or entity which belongs to any of the above categories needs to be identified from the given piece of text and classified correctly according to the context. Thus, NER is a challenging task.

Stanford NER is the NE-tagger tool for English, which identifies NE's based on 125 pre-written rules; the tool is built on a sequence model, namely a Conditional Random Field (CRF), with various features. This system tags a given word with one of four representations: /PERSON, which denotes that the corresponding word is the name of a person; /ORGANIZATION, which denotes that the word is the name of an organization; /LOCATION, which denotes that the word is the name of a place; and finally /O, which denotes that the word is none of the above. Hence, Stanford NER considers only the names of persons, locations and organizations as NE's, and throughout this thesis Stanford NER is used as a support to recognize NE's.
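For illustration, output in this slash-tag format can be grouped back into entity spans. The sketch below assumes whitespace-tokenized `word/TAG` output; the example sentence is invented, not taken from the thesis data.

```python
def extract_entities(tagged_text):
    """Group consecutive tokens sharing an NE tag into entity spans.

    Assumes Stanford-NER-style slash-tagged input, e.g.
    "Mahatma/PERSON Gandhi/PERSON lived/O in/O India/LOCATION".
    """
    entities = []
    current_words, current_tag = [], None
    for token in tagged_text.split():
        word, _, tag = token.rpartition("/")
        if tag == current_tag and tag != "O":
            current_words.append(word)          # extend the running entity
        else:
            if current_tag not in (None, "O"):  # close the previous entity
                entities.append((" ".join(current_words), current_tag))
            current_words, current_tag = [word], tag
    if current_tag not in (None, "O"):          # flush the last entity
        entities.append((" ".join(current_words), current_tag))
    return entities

print(extract_entities(
    "Mahatma/PERSON Gandhi/PERSON lived/O in/O India/LOCATION"))
# → [('Mahatma Gandhi', 'PERSON'), ('India', 'LOCATION')]
```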

1.1 Language Independent Named Entity Recognition

Research on the recognition of NE's has been going on for decades, and there have been several improvements over that time. Research on NER for English has reached saturation, with excellent accuracies nearing 100%. Stanford NER is one such tool which gives good accuracy. NER for Indian languages, in contrast, is still an active area of research, for the following reasons:

1. Resource poor languages: From Figure 1.1, it is evident that English is the most popular language used world-wide. The content of Indian languages compared to English is negligible. With such scarce data, applying the same techniques used for English is not appropriate. So, there is a need to develop techniques for languages which are resource-poor, and the main aim of such techniques is to utilize the available data to the maximum extent. Thus, the shortage of resources is a challenge in developing NER systems for Indian languages.

2. Multilingual NER: As the name suggests, it is the task of extracting NE’s from different languages. The approaches

Figure 1.1 Content of Languages for Websites

for the multilingual NER task are similar to those mentioned above: writing grammatical rules, using supervised machine learning approaches, or using external knowledge. The task of multilingual NER is to deal with languages which have neither a sufficient amount of resources nor sufficient tools to develop an NER system. Moreover, a major consideration is that the grammatical structures of Indian languages vary a lot from each other, which in turn restricts the development of a generic NER system.

3. Processing of Unstructured Data: Recognition of NE's is done on data which is highly unstructured. The data on the web consists of documents in natural language; that is, the documents on the web are content-oriented rather than structure-oriented. Hence, given a document, it is highly difficult to process such unstructured data. Extraction of structured information, and generation of patterns from the extracted structured information, is the major concern of the NER task.

4. Generic NER System: The main aim of developing a generic NER system is to address different languages by overcoming the above stated limitations. That is, the NER system developed needs to utilize the data available on the web to the maximum extent, so that the availability of data is not a concern. The NER system needs to be independent of the languages used and should not have any dependency on the tools or linguistic experts of any language. Recognition of NE's should give the maximum possible accuracy even with high volumes of unstructured data.

1.2 Problem Definition

1.2.1 Motivation

NER is the process of recognizing proper nouns (in short, NE's) from the given text. The state-of-the-art NER systems for English produce near-human performance. However, for non-English languages the state-of-the-art NER systems perform below par. And for languages that lack resources (e.g., Indian languages), an NER system with near-human performance is a distant goal.

NER systems developed so far have involved linguistic grammar-based techniques as well as statistical models. The grammar-based techniques require linguistic expertise and strenuous effort to build an NER system for every new language. Such techniques can be safely avoided when the requirement is to build a generic NER system for several languages (e.g., Indian languages).

Wikipedia is a free, web-based, collaborative, multilingual encyclopaedia. There are 283 language editions available as of now. Wikipedia has both structured (e.g., Infoboxes, Categories, Hyperlinks, Inter-Language links, etc.) and semi-structured (the content and organization of the page) information. Hence, the richly linked structure of Wikipedia, present across several languages (e.g., English, Hindi, Marathi, Telugu), has been used to build and enhance many NLP applications, including NE identification systems. However, the existing approaches that exploit Wikipedia for recognizing NEs concentrate only on the structured parts, which results in low recall.

Statistical NER systems typically require a large amount of manually annotated training data. Generating such tagged data for every language is a tedious process. With the serious lack of such manually annotated training data, building a high-performance NER system is a major challenge for Indian languages. Moreover, these systems need language dependent tools such as a POS Tagger, a Chunk Tagger, gazetteer lists, etc.; the availability of these resources in all Indian languages is also a major limitation for statistical NER systems.

There is a need to focus on building a general-purpose NE recognition system for Indian languages. Given the constraints of resource-poor languages, a conventional NE recognition system is what would be expected. However, the goal here is to recognize as many NEs as are available in Indian languages without using any language-dependent tools or resources.

1.2.2 Problem Statement

Given a monolingual document d containing n(d) NE's, the generic NER system developed through this thesis needs to recognize x NE's from the document, where n(d) ≈ x. The NE's from that document need to be recognized without using any language dependent tools or

without any human intervention. The identification of NE's is the extraction of NE's from Wikipedia. In a language l, let the total number of NE's be n(l), and let y be the number of NE's identified through the approach; then the system needs to achieve maximum accuracy, i.e., n(l) ≈ y. The main aim is to concentrate on resource-poor Indian languages like Telugu, Hindi, etc., and reach accuracies close to those for English.
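The accuracies referred to here, and reported throughout the later chapters, are F-Measure scores. For reference, a minimal computation of that metric from invented counts:

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, the metric used to
    evaluate the NER system in later chapters."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# e.g. 80 NE's tagged correctly, 20 spurious tags, 20 NE's missed:
print(round(f_measure(80, 20, 20), 2))  # → 0.8
```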

1.2.3 Challenges

NEs are recognized from monolingual documents which are not tagged, meaning that throughout the process the approach deals with natural Indian language text. Hence, there are many challenges involved, as in any NLP task. Some of the challenges are listed below:

1.2.3.1 Variation in NEs

The same NE is written in various forms within a document or set of documents. Also, if the NE is a nested NE, i.e., if it contains more than one NE, then the nested NE may be referred to differently at different places using any of the NEs present in it. For example: Dr. Kailash Srinathan Prasad is written as Dr. Kailash in some places, Dr. Srinathan in some places and Dr. Prasad in some places, where all these NE's refer to the same person.

1.2.3.2 Spell variations in NEs

In Indian languages, the same word can often be written with different spellings. The words have slight variations in their prefixes or suffixes, but are treated as different words. Hence recognizing the same NE under different spell variations is a challenging task. For example: two differently spelled forms may both refer to the same word “Hindi”.
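Such variants can be caught by an edit-distance comparison, the idea taken up in Section 4.5.2. Below is a standard Levenshtein distance sketch, with romanized example words chosen purely for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Transliterated spell variants of the same NE differ by only a few edits:
print(edit_distance("Kailash", "Kailas"))      # → 1
print(edit_distance("Telangana", "Telengana")) # → 1
```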

1.2.3.3 Disambiguation in the forms of NE

There are three categories of NEs considered, namely PERSON, LOCATION and ORGANIZATION. In some cases the same word may belong to more than one category, i.e., the same word can be considered a PERSON, LOCATION or ORGANIZATION. A detailed explanation is given below:

Person vs location: If a place is named after a famous personality, then the name of the place and the name of the person will be the same. In such situations there is a need for disambiguation between the name of a person and a location. Example: Washington is the name of both a location and a person.

Location vs organization: If an organization is situated in a place, then there are high chances that the organization is named after the location. The task of disambiguating between location and organization then plays an important role. Example: Liberty can be the name of both an organization and a location.

Organization vs person: If a person owns an organization, then the organization can be named after the person. In these cases, the disambiguation between the name of a person and an organization comes into the picture. Example: TATA or BIRLA is the name of both an organization and a person.

In all three cases above, the main objective is not only to identify the NE but also to disambiguate it and assign it the correct category.

1.2.3.4 Ambiguity with common noun

A major source of ambiguity is common nouns. Many common nouns are used as the names of persons or organizations. Hence, distinguishing common nouns from proper nouns is a difficult task and needs to be handled carefully.

Appearance in various parts of speech: Words in Indian languages have varied meanings. Moreover, the names given to persons, locations and organizations can come from a varied collection of words. So, there are high chances of overlap between an ordinary word and an NE.

Person vs Adverb: If the name of a person and an adverb are the same, then the NE needs to be disambiguated from the adverb.

Organization vs Noun: This is a common case, where a word needs to be disambiguated as the name of an organization rather than a common noun.

Location vs Verb: If the name of a location overlaps with a verb, then there is an ambiguity in assigning the NE, which needs to be taken care of.

These, then, are the challenges in identifying and tagging the NEs.

1.3 Overview of proposed solutions

The main objective of the thesis is to recognize NEs from Indian language documents in a language independent way. To achieve this goal, first a list of NEs is identified from Wikipedia, using English Wikipedia as a key, through a term co-occurrence model. Then, using the list of NEs generated, a statistical model is built from monolingual Indian language documents by generating features from the words of those documents. These features are further used to recognize NEs in a given document.

1.3.1 Named Entity Identification

Wikipedia has structured and semi-structured information. This approach concentrates on exploiting both the structured and semi-structured parts of Wikipedia. The main idea is to use English Wikipedia in identifying the NEs of Indian languages. All the English Wikipedia documents are clustered based on the similarity between the documents. Then, considering each cluster separately, NEs are identified from each structural aspect of Wikipedia through term co-occurrences between English and Indian language data. Finally, tagging of NEs is done by replicating the tags of the maximally co-occurring English data. A detailed explanation is given below:

First, the English documents are clustered by exploring the inherent hierarchical structure of Wikipedia. A hierarchical clustering algorithm is also run on the documents as part of the experimentation. The documents are clustered until a threshold is reached; the threshold requires that the subtitles (the titles of the sub-parts into which a document is divided) of all documents within a cluster overlap. Each cluster is then replicated in the Indian languages by following the inter-language links from English to the respective Indian language, thus forming Indian language clusters and avoiding a separate clustering step for each language.
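The subtitle-overlap criterion can be pictured as a greedy grouping step. This is a simplified sketch: the Jaccard threshold of 0.5 and the subtitle sets are invented for illustration, and the actual procedure (Section 3.3) additionally uses Wikipedia's own hierarchy.

```python
def jaccard(a, b):
    """Overlap between two subtitle sets."""
    return len(a & b) / len(a | b)

def cluster_by_subtitles(docs, threshold=0.5):
    """Greedy grouping: a document joins a cluster only while its
    subtitles overlap sufficiently with the cluster's common subtitles.
    docs: mapping of document id -> set of subtitles."""
    clusters = []                            # entries: [common_subtitles, member_ids]
    for doc_id, subs in docs.items():
        for entry in clusters:
            if jaccard(entry[0], subs) >= threshold:
                entry[0] = entry[0] & subs   # tighten the shared subtitles
                entry[1].append(doc_id)
                break
        else:
            clusters.append([set(subs), [doc_id]])
    return [members for _, members in clusters]

docs = {
    "Gandhi": {"Early life", "Politics", "Legacy"},
    "Nehru":  {"Early life", "Politics", "Death"},
    "Ganges": {"Geography", "Tributaries", "Pollution"},
}
print(cluster_by_subtitles(docs))  # → [['Gandhi', 'Nehru'], ['Ganges']]
```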

Next, NEs are identified from each structural aspect of Wikipedia: the infobox, the sub-titles and the abstract. For identification of NEs from the infobox, the infobox is divided into key-value pairs. The keys of English and Indian language pairs are mapped across languages by the term co-occurrence model, and based on the keys, the values of the Indian language document are mapped to the English document values. Thus, the corresponding NEs in English and the Indian language are mapped, and finally the Indian language data is tagged.

Identification of NEs from the semi-structured parts of Wikipedia is done by first mapping subtitles across languages and then mapping the content under those subtitles by term co-occurrences (i.e., the number of times an English word appears alongside an Indian language word is the term co-occurrence of the two words). Each Indian language word is mapped to its maximally co-occurring English word. Finally, the tag of the English word is replicated to the Indian language word.
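A minimal sketch of this co-occurrence mapping, under simplifying assumptions: the section contents are already aligned across languages, the English tags come from an NER tagger, and the romanized `*_hi` strings stand in for actual Hindi words — all of the data below is invented for illustration.

```python
from collections import Counter, defaultdict

def map_tags(aligned_sections, english_tags):
    """aligned_sections: (english_words, indic_words) pairs for matching
    subtitle sections. english_tags: tagged English word -> NE tag.
    Each Indian-language word inherits the tag of the English word it
    co-occurs with most often."""
    cooc = defaultdict(Counter)
    for en_words, in_words in aligned_sections:
        for en in en_words:
            if en in english_tags:           # only tagged English words count
                for iw in in_words:
                    cooc[iw][en] += 1
    return {iw: english_tags[counts.most_common(1)[0][0]]
            for iw, counts in cooc.items()}

sections = [
    (["Gandhi", "born", "Porbandar"], ["gandhi_hi", "porbandar_hi"]),
    (["Gandhi", "movement"], ["gandhi_hi", "andolan_hi"]),
    (["Porbandar", "city"], ["porbandar_hi", "shahar_hi"]),
]
tags = map_tags(sections, {"Gandhi": "PERSON", "Porbandar": "LOCATION"})
print(tags["gandhi_hi"], tags["porbandar_hi"])  # → PERSON LOCATION
```

Note that untagged noise words (e.g. `andolan_hi`) also pick up a tag here; the actual approach filters such cases, as described in Chapter 3.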

The approach used is simple, efficient and easily reproducible, and can be extended to any language, as it does not use any language-specific resources.

1.3.2 Named Entity Recognition

Recognizing NEs from monolingual Indian language documents is referred to as NER. The NER can be language dependent or independent. The main objective of this approach is to develop a language independent NER.

A statistical model is developed from the monolingual documents and the list of NEs generated during identification of NEs. The statistical model used here is a Naive Bayes classifier. This model generates a list of features from the words of these monolingual documents. The features are the most probable words whose presence or absence affects the occurrence of an NE. These features are calculated by dividing the document into several windows of words. The weight given to the features varies with the size of the window.
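A stripped-down sketch of the feature side of such a classifier: per-tag word likelihoods with add-one smoothing, from which the highest-scoring context words can be read off as features. The training pairs are invented, and the thesis model additionally varies feature weights with window size.

```python
from collections import Counter

def train_nb(examples):
    """examples: list of (context_words, tag) pairs. Returns per-tag word
    likelihoods P(word | tag) with add-one (Laplace) smoothing."""
    counts = {}
    for words, tag in examples:
        counts.setdefault(tag, Counter()).update(words)
    vocab = {w for c in counts.values() for w in c}
    likelihood = {}
    for tag, c in counts.items():
        total = sum(c.values()) + len(vocab)         # add-one denominator
        likelihood[tag] = {w: (c[w] + 1) / total for w in vocab}
    return likelihood

# Invented romanized context words around PERSON and LOCATION mentions:
lik = train_nb([
    (["shri", "ji", "born"], "PERSON"),
    (["shri", "minister"], "PERSON"),
    (["district", "river", "city"], "LOCATION"),
])
# "shri" is a stronger PERSON feature than "river":
assert lik["PERSON"]["shri"] > lik["PERSON"]["river"]
```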

Now, these features and the weights assigned to them are used to recognize NEs in other documents. That is, given another set of monolingual documents, the words in those documents are considered and their overlap with the generated features is calculated. This process is also done by varying the size of the windows of words in a document and varying their weights. The threshold at which the occurrence of an NE is possible is calculated. These thresholds, along with the feature sets, are the output of the statistical model.

Finally, given a new document, the overlap of the document with the feature sets at different windows of words is calculated, along with the weights of the features in each window. If this weight exceeds the threshold (calculated in the previous step), the probable occurrence of an NE is detected. By decreasing the size of the window and iteratively recalculating the thresholds, the NE is recognized.
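The shrinking-window search can be sketched roughly as follows: score each window of words against the feature weights, keep the best-scoring window while it still clears the threshold, and halve the window size. The feature weights, threshold and sentence below are invented; the sketch returns the narrowest span in which an NE is indicated, within which tagging then takes place.

```python
def find_ne_window(words, feature_weights, threshold, window=8):
    """Zoom in on the span most likely to contain an NE: repeatedly keep
    the best-scoring window of the current span and halve the window size,
    stopping once no window clears the threshold."""
    span = words
    while window >= 1 and len(span) > 1:
        best_score, best_chunk = -1.0, None
        for start in range(0, len(span), window):
            chunk = span[start:start + window]
            score = sum(feature_weights.get(w, 0.0) for w in chunk)
            if score > best_score:
                best_score, best_chunk = score, chunk
        if best_score < threshold:     # no evidence of an NE at this scale
            break
        span = best_chunk
        window //= 2
    return span

# Invented feature weights for PERSON context words:
weights = {"shri": 2.0, "ji": 1.5}
words = "the river flows near shri ramachandra ji temple".split()
print(find_ne_window(words, weights, threshold=3.0, window=4))
# → ['shri', 'ramachandra', 'ji', 'temple']
```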

1.4 Contributions

The contributions of this thesis are as follows:

1. Language Independent Approach: The approach specified does not deal with just a single language; it deals with many Indian languages whose grammatical structures differ greatly from one another. Though the experiments are conducted on Hindi, Telugu and Marathi, the approach can be extended to any language. Hence, development and maintenance of such systems is easy compared to language dependent systems, which involve human resources or linguistic experts. This approach can be used for new Indian languages, where research has just started, as no language dependent tools are needed. The only limitation is the availability of data in Wikipedia and on the web. Hence, throughout the work, there is no involvement of language experts, no human intervention, and no use of language dependent tools.

2. Improvement in accuracies: The accuracy of the English Stanford NER is 96%, and the accuracies of Indian language NER systems are nowhere near it. One of the major contributions of this work is to improve the accuracy from 60% (the language-independent NER baseline) to 85%. Though this does not reach the level of English NER systems, it improves the accuracy to a large extent without degrading the quality of the NEs produced (precision scores remain substantially high).

3. Complete Exploration of Wikipedia: Wikipedia has structured data (information from Infoboxes, Categories, Inter-language links, etc.) and semi-structured data (natural language text from abstracts, subtitles, etc.). Previous approaches explored mostly the structured information and covered little of the semi-structured information. This thesis covers both the structured and semi-structured parts of the data, with different yet simple approaches.

4. Building a statistical model: A statistical model is built from a set of documents crawled from the web. The model generates features from the words of the documents, and these features are used for the recognition of NEs. From each document, important words are picked as features, and the same process is applied to the corresponding document in the other language; the words from the two languages thus form a bipartite graph that can be reused in many cross-lingual applications. Moreover, the model built from these features recognizes NEs with more than 83% accuracy.

5. Mapping of Similar Content: Through the approach suggested in this thesis, data that is similar across languages is mapped, which has the potential to be applied elsewhere (refer to Section 3.3.2).

1.5 Thesis Organization

The rest of the thesis is organized as follows:

Chapter 2 explains in detail the work already done in the area of NER. It details the processes involved, their limitations, the tools used, the need for human resources, the datasets used, the evaluation strategies and the results of existing approaches. The chapter covers language-dependent, semi-language-dependent and language-independent approaches, reflecting the degree of involvement of language resources in each.

Chapter 3 describes the approach used for Named Entity Identification from Wikipedia. It explains in detail the steps involved in identifying NEs in Indian languages using the English Wikipedia as an aid: how the English Wikipedia is clustered and tagged, how the corresponding Indian language pages are clustered, and how the Indian language words are tagged using a term co-occurrence approach. The chapter also discusses the results of the conducted experiments.

Chapter 4 describes the approach for Named Entity Recognition. It explains in detail how web documents are crawled, how features are extracted from the words of those documents, and finally how NEs are recognized in new documents using these features. It includes the experiments conducted and an analysis of the results.

Chapter 5 summarizes the whole work. It presents the conclusions, discusses the variations in the results, explains the overall approach and how the system is language independent, and outlines future directions for the problem addressed.


Chapter 2

Related Work

In this chapter, we describe past research that is relevant to this thesis.

Named entity recognition (NER), also known as entity identification or entity extraction, is one of the important subtasks of information extraction; it seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of time, etc. NER has many applications in NLP, e.g., data classification, question answering, cross-language information access, machine translation, etc.

Unlike for Indian languages, a lot of work has been done on NER for English. The existing approaches can be classified into language-dependent, semi-language-dependent and language-independent approaches.

2.1 Language-Dependent Approaches

2.1.1 Rule-Based approaches

Rule-based systems, also known as knowledge-based systems, define heuristics in the form of regular expressions or linguistic patterns and make use of dictionaries and lexicons for extracting named entities. Language-dependent rules involve writing a number of grammar rules for every language. An example of a rule or heuristic is that the presence of words like Incorporated, Corporation or Limited indicates an Organization entity, or that a string with an @ symbol ending in .com, .org or .edu is an e-mail address. Many such grammar rules can be crafted to build NER systems for different languages; they are written and maintained by linguistic experts. Rule-based systems also make use of dictionaries or lexicons containing commonly occurring terms or trigger words.

Linguistic experts of the particular language are involved in crafting such rules for generating NEs and are expected to have deep knowledge of the chosen language. Such rules have been written for various languages: an example of a rule-based approach for Greek is presented by Demiros et al. (see also Grishman, 1995; McDonald, 1996; Wakao et al., 1996), and there are rule-based systems for Asian languages such as Urdu, Hindi, Telugu and Tamil.

2.1.2 Approaches making use of Dictionaries and gazetteer lists

Language-dependent approaches also rely on language-dependent resources in addition to linguistic experts. These resources include dictionaries: approaches that depend heavily on bilingual dictionaries between two Indian languages, or between an Indian language and English, cannot build an NER system for a language in which such dictionaries are unavailable. The same holds for gazetteer lists, i.e., lists of NEs collected from various sources including web pages. The availability of such resources may give a language a better NER system, but their unavailability makes the accuracy of the system very low.

Some other resources have been produced and released through shared tasks at conferences. For example, the NEWS corpus of the ACL 2012 conference contains transliterations between non-English languages and English; it was released for only 5 languages, which restricts the availability of such resources for other languages. Hence, one of the major challenges of the NER task is to avoid using any such language-dependent resources.

2.1.3 Advantages

Some of the early systems were rule-based, and there are several reports claiming good performance. The accuracies of rule-based systems are considered a gold standard, as they come close to perfection. Rule-based systems that do not rely on deep syntactic parsing or deep knowledge of the language have also been shown to perform well. The main advantage of hand-crafted rule-based systems is that the extraction logic for complex entities can be fine-tuned, and they do not require a large amount of pre-annotated data. Rule-based systems can work very well in restricted domains or specific applications where there is enough implied structure in the underlying unstructured data.

2.1.4 Disadvantages

The disadvantages often associated with rule-based systems are the effort required to customize or adapt the system to new domains or languages. The creation, modification and maintenance of hand-crafted rules and lexicons is a time-consuming process and depends on the availability of a domain expert; the performance of the system may also depend on the expert. Grammar-based techniques require linguistic expertise and strenuous effort to build an NER system for every new language. Moreover, for English and many European languages, capitalization is a major clue for crafting grammar rules, a feature that Indian languages do not have. Hence, crafting rules becomes an arduous task for Indian languages and is best avoided when a generic NE system for several Indian languages is required.

2.2 Semi-Language-Dependent Approaches

Semi-language-dependent techniques developed for the NER task are predominantly statistical approaches. Statistical NER systems typically require a large amount of manually annotated training data, and hence are described as semi-language-dependent. Several machine learning techniques have been successfully used for the NER task; some of them are explained in detail below.

2.2.1 Hidden Markov Models (HMMs)

The HMM is a generative model. It assigns a joint probability to paired observation and label sequences, and its parameters are trained to maximize the joint likelihood of the training set.

P(X, Y) = ∏_i P(X_i | Y_i) P(Y_i | Y_{i−1})    (2.1)

HMMs use the forward-backward algorithm, the Viterbi algorithm and the expectation-maximization method for modeling. The basic theory is elegant and easy to understand, so HMMs are easy to implement and analyze. However, to define a joint probability over observation and label sequences, an HMM must enumerate all possible observation sequences. It therefore makes several assumptions about the data, such as the Markovian assumption that the current label depends only on the previous label. It is also impractical to represent multiple overlapping features and long-term dependencies, and the number of parameters to be estimated is huge, so a large training set is needed. Relevant works include [Bikel et al. 1999] and [Zhou and Su 2002].
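Decoding with the Viterbi algorithm mentioned above can be sketched as follows. The state set ("NE" vs. "O") and all probabilities are invented toy numbers, purely to show how the most probable label sequence is recovered under Eq. (2.1).

```python
# Minimal Viterbi decoder for an HMM: at each step keep, for every state,
# the best-scoring path ending in that state, then read off the overall best.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable label sequence for the observations."""
    # Layer 0: start probability times emission of the first observation.
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-9), [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(o, 1e-9),
                 V[-1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

states = ["NE", "O"]
start_p = {"NE": 0.2, "O": 0.8}
trans_p = {"NE": {"NE": 0.5, "O": 0.5}, "O": {"NE": 0.2, "O": 0.8}}
emit_p = {"NE": {"dhoni": 0.6, "plays": 0.01},
          "O": {"dhoni": 0.01, "plays": 0.5}}
labels = viterbi(["dhoni", "plays"], states, start_p, trans_p, emit_p)
```

In a real NER tagger the parameters would be estimated from annotated data, which is precisely the annotation requirement that makes these approaches semi-language-dependent.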

2.2.2 Maximum Entropy Markov Models (MEMMs)

The MEMM is a conditional probabilistic sequence model. It can represent multiple features of a word and can also handle long-term dependencies. It is based on the principle of maximum entropy, which states that the least biased model considering all known facts is the one that maximizes entropy. Each source state has an exponential model that takes the observation features as input and outputs a distribution over possible next states; output labels are associated with states. MEMMs solve the multiple-feature-representation and long-term-dependency problems faced by HMMs and generally achieve higher recall and precision. However, they suffer from the label bias problem: the probabilities of the transitions leaving any given state must sum to one, so the model is biased towards states with fewer outgoing transitions, and a state with a single outgoing transition effectively ignores the observation. The label bias problem can be handled by changing the state-transition structure, or by starting with a fully connected model and letting the training procedure decide a good structure. Previous work using MEMMs includes [Sujan et al. 2008].

2.2.3 Conditional Random Fields (CRF)

The CRF is a discriminative probabilistic model. It has all the advantages of MEMMs without the label bias problem. CRFs are undirected graphical models (also known as random fields) used to calculate the conditional probability of values at designated output nodes given the values assigned to the input nodes.

Random field: Let G = (V, E) be a graph where each vertex v is associated with a random variable Y_v. If P(Y_v | all other Y) = P(Y_v | neighbors(Y_v)), then Y is a random field. Let X be a random variable over the data sequences to be labeled, and Y a random variable over the corresponding label sequences. Definition: Let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G.

Work related to CRFs includes [Shobhana et al. 2010].

2.2.4 Support Vector Machine (SVM)

SVMs are among the best-known supervised machine learning algorithms for binary classification; they give good results even on small datasets, and with extended algorithms they can be used for multi-class problems. Solving a classification task with a supervised learner such as an SVM involves training and testing data consisting of data instances. Each instance in the training set contains one target value (a class label, +1 for positive and -1 for negative) and several attributes (features). The goal of the supervised SVM classifier is to produce a model that predicts the target value from the attributes: the SVM builds a classifier model from the training set and then classifies the test set based on this model using its features.

SVMs are used for the recognition of NEs in [Asif and Shivaji 2010].
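The +1/-1 training setting described above can be illustrated with a toy linear SVM trained by sub-gradient descent on the hinge loss. This is only a sketch under simplifying assumptions (linear kernel, hand-picked learning rate and regularization, tiny separable dataset); practical NER systems use optimized SVM libraries.

```python
# Toy linear SVM: minimize (lam/2)*||w||^2 + hinge loss by stochastic
# sub-gradient descent over instances (x, y) with y in {+1, -1}.

def train_linear_svm(data, epochs=200, lr=0.1, lam=0.01):
    """data: list of (feature_vector, label) with label in {+1, -1}."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # inside the margin: hinge loss is active
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:            # only the regularization term contributes
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: positives upper-right, negatives lower-left.
data = [([2.0, 2.0], 1), ([3.0, 1.5], 1), ([-2.0, -1.0], -1), ([-1.5, -2.5], -1)]
w, b = train_linear_svm(data)
```

For NER, each feature vector would encode properties of a word and its context, and the label would indicate whether the word belongs to an NE.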

2.2.5 Decision Tree (DT)

The DT is a powerful and popular tool for classification and prediction [7]. Its attractiveness is due to the fact that, in contrast to a neural network, it produces rules. Rules can readily be expressed so that humans can understand them, or even used directly in a database access language like SQL so that records falling into a particular category can be retrieved. A decision tree is a classifier in the form of a tree structure where each node is either a leaf node, indicating the value of the target attribute (class), or a decision node, specifying a test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test. It is an inductive approach to acquiring classification knowledge. A paper that used this technique is [Georgios et al. 2010].

2.2.6 Hybrid of above approaches

Some relevant works use hybrids of the above approaches, combining them in ways that yield better accuracies than the individual models used independently.

2.2.7 Advantages

Machine-learning based approaches overcome some of the limitations of rule-based approaches. Machine learning techniques (both supervised and unsupervised) fall into the category of inductive approaches, whereas rule-based approaches are deductive: a model or rules are learned from pre-annotated data, and once trained, the model can be applied to classify unseen data. The advantage of machine-learning based approaches over rule-based ones is that it is comparatively easier to port and adapt machine-learning systems to new domains and applications.

2.2.8 Disadvantages

Machine learning based approaches require large volumes of pre-annotated data for training and testing. In the absence of sufficient manually annotated data, statistical techniques do not promise commendable results. They also require input from domain and technology experts to identify and select the right features and an appropriate machine learning algorithm. The heart of a rule-based system is its rule repository and lexicon, whereas the heart of a machine learning system is the annotated corpus and the statistical or inductive model; both approaches need human involvement at one stage or another of their development.

2.3 Language-Independent Approaches

2.3.1 Approaches using Wikipedia

As an alternative, research has moved towards language-independent techniques that rely on an external source, mainly Wikipedia, as the major source of data. Wikipedia has been the subject of a considerable amount of research in recent years; some relevant papers include (Kazama and Torisawa, 2007) and (Richman and Schone, 2008).

Research on Wikipedia includes [Gabrilovich and Markovitch 2007], [Milne et al. 2006], [Timothy Weale 2006], [Zesch et al. 2007] and [Richman and Schone 2008]. The work most relevant to this thesis is [Kazama and Torisawa 2007], [Toral and Munoz 2006], [Cucerzan 2007] and [Richman and Schone 2008]. More details follow; however, it is worth noting that all known prior research is fundamentally monolingual, often developing algorithms that can be adapted to other languages pending the availability of the appropriate semantic resources.

[Toral and Munoz 2006] used Wikipedia to create lists of NEs. They treated the first sentence of a Wikipedia article as a likely definition of the article title and used it to classify titles as people, locations, organizations, or none. Unlike the method presented in this thesis, their algorithm relied on WordNet (or an equivalent resource in another language). The authors noted that their results would need a manual supervision step before being useful for NER, and thus did not evaluate them in the context of a full NER system.

[Kazama and Torisawa 2007] used Wikipedia, particularly the first sentence of each article, to create lists of entities. Rather than building entity dictionaries associating words and phrases with the classical NE tags (Person, Location, etc.), they used the noun phrase following a form of the verb 'to be' to derive a label. For example, from the sentence 'Franz Fischler ... is an Austrian politician' they associated the label 'politician' with the surface form 'Franz Fischler'. They showed that the dictionaries generated by their method are useful when integrated into an NER system. Note that their technique relied on a part-of-speech tagger.

[Cucerzan 2007], by contrast, used Wikipedia primarily for Named Entity Disambiguation, following the path of [Bunescu and Pasca 2006]. As in this thesis, and unlike the works mentioned above, it made use of the explicit Category information found within Wikipedia; Category and related list-derived data were key pieces of information used to differentiate between the various meanings of an ambiguous surface form. However, [Cucerzan 2007] did not use the Category information to identify a given entity as a member of any particular class. Note that the NER component was not the focus of that research and was specific to the English language.

[Richman and Schone 2008] emphasized the use of links between articles in different languages, specifically between the English Wikipedia (the largest and most densely linked) and other languages. Their approach used the structure of the English Wikipedia, namely categories and hyperlinks, to get NEs, and then used language-specific tools to derive multilingual NEs.

2.3.2 Advantages

Language-independent approaches, unlike the previous ones, do not need tools or resources that limit their application to other languages. Moreover, Wikipedia is a resource spread over different languages, with at least a considerable amount of data in every language, so these approaches can be extended to any language.

2.3.3 Disadvantages

These approaches still used at least one language resource or tool, such as a dictionary, POS tagger or gazetteer list, to construct the NER system, and such resources are limited across Indian languages. They also used only the first sentence, title, or category information from Wikipedia articles, while there is scope for wider exploitation of many other structural characteristics of Wikipedia. Furthermore, the derived NEs are only those entities present in some Wikipedia document: if an NE has no Wikipedia page and never occurs in any Wikipedia document, recognizing it is difficult.

Thus, there is a need for an approach that overcomes the disadvantages of all the above approaches while retaining their advantages.

Chapter 3

Named Entity Identification

In this chapter, NEs are identified from an external knowledge source, namely Wikipedia. The approach computes term co-occurrences between words of different languages, where term co-occurrence is defined as the probability with which two terms occur together. The structure of Wikipedia is explored completely to obtain the term co-occurrences, and the Stanford NER is used to obtain the appropriate tags for NEs. Thus, the Indian language data is tagged using Wikipedia and the Stanford NER.

3.1 Role of Wikipedia in Identification of Named Entities

The approaches described in the previous chapter have several limitations:

3.1.1 Limitations of Previous Approaches

1. Linguistic experts are needed at every stage of building the NER system; the development and maintenance of such a system is costlier than automatic processing.

2. At least one language resource or tool, such as a dictionary, POS tagger or gazetteer list, was used to construct the NER system; such resources are limited across Indian languages.

3. In the absence of sufficient manually annotated data, statistical techniques do not promise commendable results.

4. They used only the first sentence, title, or category information from Wikipedia articles. However, there is scope for wider exploitation of many other structural characteristics of Wikipedia.

3.1.2 Enhancements of this Approach

The approach specified in this chapter aims at automating the task of NE identification across languages: given a document in any language, the system needs to identify the NEs in the absence of language experts, language-dependent tools or sufficient manually tagged data. Wikipedia, a dataset of major interest in recent years, has both structured and semi-structured information in every document. The division of a Wikipedia document into different structures, each playing an important role in extracting specific information, makes the dataset convenient to use. The structure of Wikipedia is described in the next section.

3.1.3 Structure of Wikipedia

The format of the dataset is as follows: within Wikipedia, we make use of five major structures.

3.1.3.1 Category links

Links from an article to special 'Category' pages are represented in forms such as [[Category:One Day Internationals]], [[Category: International Matches]] and {{Sports}}. The first two are direct links to Category pages; the third is a link to a Template, which links the article to 'Category: Sports'. We will typically say that the article belongs to all of these category pages.
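Extracting these links from raw wikitext can be sketched with two regular expressions, one for the [[Category:...]] form and one for the {{...}} template form shown above. This is a simplified sketch: real wikitext allows sort keys and template parameters that are ignored here.

```python
import re

# Pull Category links and Template links out of raw wikitext, matching
# the two forms shown above. Sort keys ([[Category:X|key]]) and template
# parameters ({{T|arg}}) are cut off at the '|' for simplicity.

CATEGORY = re.compile(r"\[\[Category:\s*([^\]|]+)\]\]")
TEMPLATE = re.compile(r"\{\{([^}|]+)\}\}")

def category_links(wikitext):
    cats = [c.strip() for c in CATEGORY.findall(wikitext)]
    templates = [t.strip() for t in TEMPLATE.findall(wikitext)]
    return cats, templates

text = ("...[[Category:One Day Internationals]] "
        "[[Category: International Matches]] {{Sports}}...")
cats, templates = category_links(text)
```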

3.1.3.2 Inter-Language links

Links from an article to a presumably equivalent article in another language. For example, the Hindi language article 'Mahendra Singh Dhoni' contains a set of links including [[en:Mahendra Singh Dhoni]] and [[mr: ]], which link to the English and Marathi articles on 'Mahendra Singh Dhoni' respectively. In almost all cases, articles linked in this manner cover the same subject in different languages.

3.1.3.3 Subtitles of the document

These are the semi-structured parts of a Wikipedia article. Every Wikipedia page consists of a title and subtitles; the data below each subtitle can be treated as a sub-part of the article, and these sub-parts are partitioned and processed separately. For example, in the Wikipedia article on Rahul Dravid the subtitles include 'Early life', 'Cricketing Career', 'Praises and Accolades', 'Teams' and 'Captaincy Record'.

3.1.3.4 Abstract

The abstract is the initial few lines of a Wikipedia article, providing the gist of the entire page. It is also a semi-structured part of Wikipedia and can be treated as a subtitle section without a specific title.

3.1.3.5 Infobox

An Infobox is typically a tabular representation of key statistics covering all important aspects of the title of the Wikipedia article. For example, the Infobox of the Sachin Tendulkar article includes his name, full name, nationality, test debut opponent, place of birth, etc.

Hence, unlike any of the previous approaches, all the structural aspects of Wikipedia are considered, which makes the approach extensible to any resource-poor language: since the available data is utilized completely, the approach can be used even when the amount of data is small.

3.2 Overview of the Approach

The approach can be divided into several steps. First, documents are clustered: the entire Wikipedia is considered, and sets of clusters are formed in which the documents within each cluster are highly similar to each other; these clusters are formed in various languages. The next step is to explore every structural aspect of Wikipedia, such as the Infobox, subtitles and abstract, and extract the NEs from the data in these structures. In each structural aspect, an Indian language NE is identified from the corresponding English NE, and the Indian language NE is tagged by replicating the tag of the English NE. These steps are explained in detail in the following sections.

3.3 Clustering of Similar documents

Clustering is performed in two ways in this chapter:

3.3.1 Hierarchical Clustering without using Category Information of Wikipedia

Clustering is an unsupervised technique that groups similar documents together and keeps dissimilar documents apart. There are two families of clustering algorithms: hierarchical clustering and flat clustering. Hierarchical clustering outputs a hierarchy, a structure more informative than the unstructured set of clusters returned by flat clustering; it does not require pre-specifying the number of clusters, and most hierarchical algorithms used in IR are deterministic. This work deals with large amounts of semi-structured data and requires structured rather than unstructured clusters as output; moreover, specifying the number of clusters beforehand is difficult. Hence, hierarchical clustering is preferred over flat clustering in the rest of this thesis. The algorithm in this chapter performs clustering up to a certain threshold.

Within hierarchical clustering there are two kinds of algorithms: agglomerative, a "bottom-up" approach where each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy, and divisive, a "top-down" approach where all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. Bottom-up algorithms can reach a cluster configuration with better homogeneity than top-down clustering, so bottom-up clustering is preferred here.

Within bottom-up clustering, several similarity measures can be employed: single-linkage, complete-linkage, group-average and the centroid measure. The single-link merge criterion is local: priority is given solely to the area where the two clusters come closest to each other, and more distant parts of the clusters and their overall structure are not taken into account. In complete-link clustering, the similarity of two clusters is the similarity of their most dissimilar members. In centroid clustering, the similarity of two clusters is defined as the similarity of their centroids.

Group-average agglomerative clustering (GAAC) evaluates cluster quality based on all similarities between documents, thus avoiding the pitfalls of the single-link and complete-link criteria. Hence, group-average agglomerative clustering is used here.

We have considered the English Wikipedia articles that contain inter-language links to Hindi articles. The English articles are clustered based on the overlap of terms, i.e., the number of common terms between articles. The clustering algorithm is as follows.

Initially, each article in the English Wikipedia dataset is considered a single-document cluster. The distance between two clusters is then calculated using

SIM-GA(ω_i, ω_j) = 1 / [(N_i + N_j)(N_i + N_j − 1)] · Σ_{d_m ∈ ω_i ∪ ω_j} Σ_{d_n ∈ ω_i ∪ ω_j, d_n ≠ d_m} d⃗_m · d⃗_n    (3.1)

where d⃗ is the length-normalized vector of document d, · denotes the dot product, and N_i and N_j are the numbers of documents in ω_i and ω_j respectively. Using group-average agglomerative clustering, the merging process is repeated until a certain threshold is reached, and thus the hierarchical clusters of the English data are formed. To cluster documents of other languages, we made use of the inter-language links and the structure of the English clusters: the inter-language links are used to replicate the cluster structure of the English Wikipedia articles across the other language's articles, so the clustering step need not be repeated for non-English articles. These interconnected clusters in different languages are utilized further in our approach.
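Eq. (3.1) can be computed directly as below. This is an illustrative sketch with two one-document clusters represented as term-count dictionaries; the toy vocabulary is invented.

```python
import math

# Direct computation of the group-average similarity of Eq. (3.1):
# documents are length-normalized bag-of-words vectors, and the dot
# products are summed over all ordered pairs of distinct documents
# in the merged cluster, then averaged.

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()}

def sim_ga(cluster_i, cluster_j):
    """Group-average similarity of two clusters of term-count dicts."""
    docs = [normalize(d) for d in cluster_i + cluster_j]
    n = len(docs)
    total = sum(sum(dm.get(t, 0.0) * dn[t] for t in dn)
                for i, dm in enumerate(docs)
                for j, dn in enumerate(docs) if i != j)
    return total / (n * (n - 1))

a = [{"cricket": 2, "india": 1}]
b = [{"cricket": 1, "match": 1}]
score = sim_ga(a, b)
```

During clustering, the pair of clusters with the highest SIM-GA would be merged at each step until the similarity falls below the chosen threshold.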

3.3.2 Clustering by considering the Category Information of Wikipedia

The other way of clustering documents is to use the inherent structure of Wikipedia, i.e., to cluster articles based on their Category links. For each document, the categories to which it belongs are extracted, and then the documents in those categories are extracted. For example, consider the English Wikipedia article on Rahul Dravid: the page has the category link Indian Cricketers, implying that the page is about a cricketer from India. The category Indian Cricketers has a subtitle named Pages in Category, which lists links to Wikipedia articles about Indian cricketers; together, these links form an English cluster containing 800 pages on Indian cricketers. Continuing this process for the cricketers of all countries yields around 3,853 documents, forming the cluster Cricketers. This process is repeated for all pages until sets of clusters are obtained.

Next, the corresponding Indian language article for each page in an English cluster is fetched using the inter-language links, forming clusters in the Indian language for the same category (Cricketers in this example). Hence, repeating the clustering process for other languages is avoided.
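The replication step can be sketched as a simple lookup: given the English clusters and a table of inter-language links, the corresponding clusters in the other language are read off directly. The cluster contents and link table below are hypothetical toy data.

```python
# Replicate English cluster structure into another language via
# inter-language links, without re-running the clustering algorithm.

def replicate_clusters(english_clusters, interlang_links):
    """Map each English cluster to the linked articles in the other language.

    Articles without an inter-language link are simply dropped from the
    replicated cluster.
    """
    return {
        name: [interlang_links[doc] for doc in docs if doc in interlang_links]
        for name, docs in english_clusters.items()
    }

english_clusters = {"Cricketers": ["Rahul Dravid", "Sachin Tendulkar", "MS Dhoni"]}
interlang_links = {"Rahul Dravid": "hi:Rahul Dravid", "MS Dhoni": "hi:MS Dhoni"}
hindi_clusters = replicate_clusters(english_clusters, interlang_links)
```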

3.4 Identification of NEs from Infobox

Indian languages such as Telugu, Tamil, Hindi and Marathi are short not only of resources but also of data on the web. For 10 GB of English Wikipedia data, the corresponding data in Hindi is around 346 MB, in Marathi around 156 MB and in Telugu around 183 MB. Hence, there is a need to utilize all possible structural information available for Indian language data.

The Infobox is a tabular representation of the important information in a Wikipedia document. According to our observation, almost all Wikipedia documents in Indian languages (approximately 89%) have an Infobox, and almost all entries in an Infobox are NEs. Hence, the Infobox is of considerable importance for this study.

An Infobox can be represented as (Key, Value) pairs, where a Key is a common attribute shared across different pages in a cluster and a Value is an attribute specific to a Wikipedia document; every Value in a document is mapped to a specific Key. For example, the page on Mahendra Singh Dhoni has Keys such as Name, Place of Birth, Date of Birth, Year of Birth, debut against and last test against, which are shared across Wikipedia documents, while the Values, such as Dhoni, Bihar, 7 July, 1981, Sri Lanka and Australia, are specific to the Mahendra Singh Dhoni document.
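Reading the (Key, Value) pairs out of a raw infobox template can be sketched as below. The wikitext fragment is a simplified, hypothetical example; real infoboxes have nested templates and links that a production parser would need to handle.

```python
import re

# Parse '| key = value' lines of a simplified infobox template into
# a dictionary of (Key, Value) pairs.

def parse_infobox(wikitext):
    """Return {key: value} for lines of the form '| key = value'."""
    pairs = {}
    for line in wikitext.splitlines():
        m = re.match(r"\|\s*([^=]+?)\s*=\s*(.+)", line.strip())
        if m:
            pairs[m.group(1)] = m.group(2).strip()
    return pairs

infobox = """{{Infobox cricketer
| name = Dhoni
| birth_place = Bihar
| testdebutagainst = Sri Lanka
}}"""
pairs = parse_infobox(infobox)
```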

We have considered the cluster List of Cricketers from Wikipedia as a running example for the sake of explanation. The Key attributes of an Infobox play a crucial part in the identification of NEs. Our observation is supported by the following three facts:

1. The Key attributes across Wikipedia pages in a cluster (e.g., Indian Cricketers) are almost similar. For example, the Wikipedia pages on Sachin Tendulkar, Mahendra Singh Dhoni, Rahul Dravid, etc., all have the same Keys, such as Full Name, Test Debut Against, Nationality, etc.

2. The Key attributes of Indian-language articles (e.g., Hindi, Marathi) are mostly translated versions of the Key attributes of English articles from the same cluster. For example, considering English and Hindi, the Keys map as Name - Nam (meaning Name in Hindi), and so on.

3. In a given cluster, the order of occurrence of the Key attributes is also similar across Wikipedia documents in different languages. E.g., Name appears first, followed by Date of Birth, etc.; likewise, Nam occurs first and then Desh, etc.

The approach for identifying NEs from the Infobox is detailed in the two sections below.

3.4.1 Map corresponding Keys across Languages

The main aim of this step is to map the Keys across different languages. For simplicity, Hindi is the Indian language considered in the rest of the explanation.

The Key attributes in an English Wikipedia document can be Name, Country, Test Debut Against, etc. Similarly, the Key attributes in a Hindi Wikipedia document can be Nam, Desh, etc. Mapping corresponding Keys across languages means that Name in English should be mapped to Nam (meaning Name) in Hindi.

To achieve such mappings, consider a cluster, say List of Cricketers. Extract a Wikipedia document, say Sachin Tendulkar, in both English and the Indian language. Fetch the lists of Key attributes from the English and Hindi pages, and map each Key attribute in English with every Key attribute in Hindi. This results in mappings like (Name, Nam, 1, 1), (Name, Desh, 1, 0), (Country, Nam, 1, 0), (Country, Desh, 1, 1), etc. Along with each attribute pair, we record the number of occurrences (+1 on every new occurrence of the pair) and the order of occurrence (1 if the attributes occur at the same position, 0 otherwise). This process is repeated with the remaining pages in the cluster, and on every new occurrence of an existing pair, the corresponding counts are accumulated. Finally, we obtain mappings such as (Name, Nam, 5, 4), (Name, Desh, 4, 0), (Country, Nam, 5, 1), (Country, Desh, 5, 4), etc. For each pair, a score is assigned based on a weighted linear combination of these co-occurrence statistics.

Score(ei, hj) = (λ1 ∗ number of occurrences) + (λ2 ∗ order of occurrence) (3.2)

subject to λ1 + λ2 = 1 (3.3)

λ1 and λ2 indicate the relative importance assigned to the two statistics, constrained by Eq. (3.3). Their values are determined experimentally.

The pair with the highest score is taken as a valid mapping (e.g., Name - Nam). Similarly, mappings are identified for the remaining pairs. The procedure is repeated to find mappings between English and Marathi Keys.
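The counting and scoring procedure of this section can be sketched as follows. This is a simplified sketch under our own naming; each document is assumed to be given as its ordered list of Infobox Keys, and the λ values are those reported later in the experiments.

```python
from collections import defaultdict

def score_key_pairs(doc_pairs, lam1=0.65, lam2=0.35):
    """doc_pairs: list of (english_keys, hindi_keys), one per document,
    each an ordered list of Infobox Key attributes.
    Returns {(e_key, h_key): score} using the weighted combination
    Score = lam1 * co-occurrence count + lam2 * same-position count."""
    cooc = defaultdict(int)    # number of occurrences of the pair
    order = defaultdict(int)   # times the pair occurs at the same position
    for en_keys, hi_keys in doc_pairs:
        for i, e in enumerate(en_keys):
            for j, h in enumerate(hi_keys):
                cooc[(e, h)] += 1
                if i == j:
                    order[(e, h)] += 1
    return {p: lam1 * cooc[p] + lam2 * order[p] for p in cooc}

def best_mapping(scores, e_key):
    """Pick the Hindi Key with the highest score for a given English Key."""
    cands = {h: s for (e, h), s in scores.items() if e == e_key}
    return max(cands, key=cands.get)

# Toy cluster of five identical documents (Nam/Desh as in the text).
docs = [(["Name", "Country"], ["Nam", "Desh"])] * 5
scores = score_key_pairs(docs)
```

Here `best_mapping(scores, "Name")` selects Nam, since the (Name, Nam) pair dominates in both co-occurrence count and position agreement.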

3.4.2 Tagging non-English Data with NE tags

As a result of the previous step, we have obtained mappings of Keys across languages. When applied to a particular Wikipedia article, say Mahendra Singh Dhoni, these Key mappings take values specific to that page, and can then be extended to map their associated Values. E.g., Name - Nam, when applied to the Dhoni page, results in a Value map of Dhoni (Dhoni in Hindi).

As a preprocessing step, the Stanford NER is run on the English Wikipedia to identify all NEs and their associated tags. With this domain knowledge of English NEs and their tags, the Value maps can be tagged accordingly.

As in the previous example, since Dhoni is tagged as PERSON in English, its Hindi counterpart takes the same NE tag. This process is repeated for all other pages and all Keys, to obtain as many tags as possible in the Hindi data. Hence, Hindi words are tagged without any language expertise.
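The tag propagation described above can be sketched as follows, assuming the English-side NE tags (e.g., from the Stanford NER) are available as a dictionary; the function and variable names are ours, and "dhoni_hi" stands in for the Hindi form.

```python
def propagate_tags(value_maps, english_tags):
    """value_maps: list of (english_value, hindi_value) pairs produced by
    applying the Key mappings to one page, e.g., ("Dhoni", "dhoni_hi").
    english_tags: dict mapping English value -> NE tag (from Stanford NER).
    Returns {hindi_value: tag} for values whose English side is a tagged NE."""
    tagged = {}
    for en_val, hi_val in value_maps:
        tag = english_tags.get(en_val)
        if tag is not None:          # non-NE values are skipped
            tagged[hi_val] = tag
    return tagged

tags = propagate_tags([("Dhoni", "dhoni_hi"), ("7 July 1981", "date_hi")],
                      {"Dhoni": "PERSON"})
```

Only Values whose English counterpart carries an NE tag are propagated; dates and other non-NE entries are left untagged.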

3.5 Identification of NEs from Subtitle

In Wikipedia, we have observed that only around 10% of the pages in Indian languages (approximately; the figure varies slightly across languages) have a considerable amount of data. That is, of all the available pages, only a few have subtitles and data corresponding to those subtitles. Hence, these small amounts of data need to be handled carefully.

Subtitles divide the content of an article into precise and specific sub-parts, such that each sub-part covers a definite aspect of the article. For example, given an article on History of India, its subtitles are Pre-Historic Era, Early Historic Period, Early Modern Period, etc. That is, the important aspects of the article are divided into sub-parts, and the title assigned to each part is what is referred to as a subtitle throughout this thesis. For the remainder of this section, the cluster List of Cricketers is again used as an example for ease of understanding.

There are some interesting observations about subtitles, described below:

1. In a given English cluster (List of Cricketers in this case), the subtitles are largely similar. For example, Cricketing Career, IPL, etc., exist in almost all articles related to cricketers. Some of these subtitles distinguish the cluster from other clusters: the subtitle Cricketing Career marks an article as belonging to the cluster Cricketers and not to clusters like Chemicals or Astrology. Hence, this subtitle similarity is exploited to cluster similar documents.

2. The above observation also applies to Indian-language clusters. That is, the subtitles within a cluster are similar in Indian languages as well, since the important aspects of a cluster are the same regardless of the individual pages it contains. However, the data associated with Indian-language articles is comparatively sparse; the number of subtitles obtained in Indian languages is around 57% of the number obtained in English.

3. For a given cluster (say List of Cricketers), since the important aspects or concepts the cluster deals with are similar across languages, the subtitles are also similar across languages. That is, given a cluster in English and the same cluster in an Indian language, the subtitles deal with similar concepts in different languages, sometimes in different forms (i.e., with different words; e.g., Early Life in Cricketing Career and Early Cricketing Career are similar).

These observations aid the identification of NEs from subtitles, which can be done in three steps. Briefly: in the first step, each Indian-language subtitle is mapped to the corresponding English subtitle by calculating the co-occurrences between subtitles. In the second step, each English subtitle, along with its corresponding Indian-language subtitle, is mapped to its corresponding data across the pages of the cluster. Finally, in the third step, the data of each subtitle in English and the Indian language is considered and term co-occurrences are calculated; each Indian-language word is mapped to its maximally co-occurring English word, and the tag of the English word is replicated onto the Indian-language word.

The whole approach can be explained in the following steps:

3.5.1 Mapping the Subtitles of Hindi with English subtitles

From the observations above, there are two factors of concern. The first is the limitation of data: the number of subtitles in Hindi differs greatly from the number of subtitles in English. The second is developing a language-independent mapping, which is a challenging task.

Before explaining the process in detail, a language-independent resource needs to be developed: a dictionary that aids in mapping the subtitles across languages. The dictionary is created from the titles of Wikipedia articles. For single-word titles, the dictionary directly maps the English word to the Indian-language word. For titles with more than one word, if some of the words in the Indian-language title already exist in the dictionary, those words and their corresponding English words (in the English title) are removed; the remaining words are paired. Hence, a maximum number of single-word-to-single-word mappings is obtained from the titles of Wikipedia to form the dictionary.
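The dictionary construction might look like the following sketch. Titles are assumed to be given as parallel (English, Indian-language) word lists, and only unambiguous leftover pairings are kept; this is our simplification, and the transliterations ("Bharat", "Itihas") are illustrative stand-ins.

```python
def build_dictionary(title_pairs):
    """title_pairs: list of (english_words, hindi_words) tuples, each the
    word list of one Wikipedia title and its cross-language counterpart.
    Single-word titles map directly; in longer titles, words already known
    from the dictionary are removed and the remainder paired if unambiguous."""
    dictionary = {}
    # Pass 1: single-word titles give direct mappings.
    for en, hi in title_pairs:
        if len(en) == 1 and len(hi) == 1:
            dictionary[en[0]] = hi[0]
    # Pass 2: strip known words from longer titles, pair the leftovers.
    for en, hi in title_pairs:
        if len(en) == 1 and len(hi) == 1:
            continue
        rest_en = [w for w in en if w not in dictionary]
        rest_hi = [w for w in hi if w not in dictionary.values()]
        if len(rest_en) == 1 and len(rest_hi) == 1:
            dictionary[rest_en[0]] = rest_hi[0]
    return dictionary

d = build_dictionary([(["India"], ["Bharat"]),
                      (["History", "India"], ["Itihas", "Bharat"])])
```

The single-word title yields India - Bharat directly; removing that known pair from the two-word title leaves History - Itihas as the residual mapping.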

Given the cluster List of Cricketers, consider each article in English and the corresponding article in Hindi, and collect the subtitles from both languages. Now map each English subtitle with every Hindi subtitle. Each such mapping carries the following weights:

1. The number of co-occurrences: the number of times the particular pair occurs together throughout the dataset.

2. The order of occurrence:

|pos1 − pos2| (3.4)

where pos1 is the position of the English subtitle and pos2 the position of the Hindi subtitle in their respective articles. For each pair, a score is assigned based on a weighted linear combination of these co-occurrence statistics.

3. The mapping from the dictionary: if the English word and the Indian-language word have a complete match in the dictionary, a weight of 1 is assigned. Otherwise, for a partial match, i.e., when the English phrase and the Hindi phrase each contain more than one word and the dictionary maps at least one pair of terms between them, a weight of 0.5 is assigned. This is an additional score used as an aid, but it is not included in the experimentation.

Score(ei, hj) = (β1 ∗ number of occurrences) + (β2 ∗ order of occurrence) (3.5)

subject to β1 + β2 = 1 (3.6)

The values of β1 and β2 assign relative importance to the two factors in Eq. (3.5), and are determined experimentally.

Finally, the Hindi subtitle with the maximum score against an English subtitle is mapped to it. Subtitles that are similar but not exact matches (from the example above, Early Life in Cricketing Career and Early Cricket Career are the same subtitle with slight modifications) are also mapped together, in English and in the Indian languages, by calculating the match of words between them: if they vary by less than 10%, they are considered the same subtitle. Hence, the mapping of subtitles is done while overcoming the limitations mentioned above.
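The text does not fully specify how the 10% variation between near-duplicate subtitles is measured; one plausible reading, sketched here as an assumption, is a character-level similarity check. The helper name and threshold semantics are ours.

```python
from difflib import SequenceMatcher

def same_subtitle(a, b, max_diff=0.10):
    """Treat two subtitles as the same when they differ by less than
    max_diff. Here the difference is measured with a character-level
    similarity ratio; the exact measure in the thesis (a word-match
    ratio) is left unspecified, so this is an illustrative stand-in."""
    sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return (1.0 - sim) < max_diff
```

Under this reading, minor surface variants (a trailing plural, a case change) fall under the threshold, while unrelated subtitles do not.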

3.5.2 Clustering of Similar Subtitles

From the previous step, we obtain a list of subtitle mappings, each consisting of an English subtitle and its corresponding Indian-language subtitle. Now, for every English Wikipedia document in a given cluster, search for a subtitle, say International Cricket Career; if it exists, fetch the corresponding Hindi Wikipedia document and search for the mapped Hindi subtitle. The data corresponding to both subtitles is grouped together as a sub-cluster and assigned an Id. This process is repeated for all documents, so that the English and Indian-language data corresponding to that subtitle are mapped together, each with its Id. The procedure is then repeated for all subtitles of the cluster and, finally, for all clusters. Thus, this step outputs a set of subtitles and, for each subtitle, a set of sub-clusters related to it in English and the Indian language. This is one of the important contributions of this work (specifically in the medical domain) and is useful for further developments.

As an example, consider the output of this step in the medical domain for the cluster List of Diseases. The output is a list of subtitles, such as Causes of Disease, Precautions of Disease, etc., in different languages and for different Wikipedia documents like Cancer, Multiple Sclerosis, etc. (here, Causes of Cancer in English and the Indian language forms one sub-cluster, and Causes of Multiple Sclerosis another, under the subtitle Causes of Disease). Hence, if a query asks for the causes of cancer in English, the output can be produced in any desired language. This makes the output a good contribution, as it maps precise content from different languages together.
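The sub-cluster construction can be sketched as follows; `subtitle_map` is the output of the subtitle-mapping step, `get_section` is a hypothetical accessor returning the text under a subtitle (or None when absent), and "Kariyar" is an illustrative stand-in for a Hindi subtitle.

```python
def build_subclusters(doc_pairs, subtitle_map, get_section):
    """doc_pairs: list of (english_doc, hindi_doc) pairs in one cluster.
    subtitle_map: dict {english_subtitle: hindi_subtitle}.
    get_section(doc, subtitle) -> section text, or None if missing.
    Returns {english_subtitle: [(id, en_text, hi_text), ...]}."""
    out = {}
    next_id = 0
    for en_sub, hi_sub in subtitle_map.items():
        subclusters = []
        for en_doc, hi_doc in doc_pairs:
            en_text = get_section(en_doc, en_sub)
            hi_text = get_section(hi_doc, hi_sub)
            if en_text is not None and hi_text is not None:
                subclusters.append((next_id, en_text, hi_text))
                next_id += 1
        out[en_sub] = subclusters
    return out

# Toy documents represented as {subtitle: section text} dicts.
pair_docs = [({"Career": "en text 1"}, {"Kariyar": "hi text 1"}),
             ({"Career": "en text 2"}, {"Kariyar": "hi text 2"})]
sub = build_subclusters(pair_docs, {"Career": "Kariyar"}, lambda d, s: d.get(s))
```

Each sub-cluster keeps its Id alongside the aligned English and Indian-language section text, matching the output described above.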

3.5.3 Term Co-occurrences

From the previous step, we obtain a list of subtitles, each with a set of sub-clusters in English and the Indian language. The identification of NEs from a subtitle can be explained in three steps: pre-processing, word co-occurrences, and tagging of words.

In the pre-processing step, consider each subtitle, which has different sub-clusters of English and Indian-language data. Given that the use of English tools does not hurt the extensibility of the approach to other languages, the English data is annotated with the Stanford NER and the NEs are retrieved. The Indian-language data is preprocessed by removing stop words. The stop-word list is generated by collecting words that occur above a certain frequency in the overall dataset; a separate list is generated for each Indian language.

In the word co-occurrences step, consider the preprocessed data of each sub-cluster, pair each NE-tagged English word with every non-tagged Hindi word, and assign a default weight (= 1). The process is repeated for the other English and Indian-language sub-clusters of the same subtitle in a cluster: whenever an existing pair of a tagged English word and a non-tagged Hindi word occurs again in another sub-cluster, the weight of that pair is incremented (by 1). Thus, the term co-occurrences between English and other-language words are calculated.

Finally, in the tagging-of-words step, each Indian-language word is mapped to its maximally co-occurring English word, and the tag of the English word is duplicated onto the corresponding Indian-language word. Hence, Indian-language data is tagged without using any language-dependent tools. The process is repeated for all subtitles, and the corresponding Indian-language data is tagged. Using the English Stanford NER as an anchor, any non-English data can thus be tagged. The Identification phase therefore outputs a list of NEs in which each English NE is associated with its corresponding NEs in other languages.

For example, consider two small mappings, each with two English NEs and one sentence in Hindi. In the first map, with Alexander/PERSON and India/LOCATION as the English NEs, each English NE is paired with each Hindi word of the sentence (except the stop words), in all combinations. In the second map, with Alexander/PERSON and Philip/PERSON as the English NEs, the pairs are formed likewise. Hence, the maximally co-occurring pair is that of Alexander and its Hindi equivalent (Alexander in Hindi), and the NE tag of Alexander/PERSON is attached to that Hindi word. Similarly, for the remaining English NEs and Hindi terms, the maximally co-occurring pair is identified and the Hindi term is tagged.
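The three steps above, pairing tagged English NEs with untagged Indian-language words, accumulating pair weights across sub-clusters, and copying the tag of the maximally co-occurring English NE, can be sketched as follows. The "_hi" tokens stand in for the Hindi words elided in the example.

```python
from collections import defaultdict

def cooccurrence_tags(subclusters, stopwords=frozenset()):
    """subclusters: list of (english_nes, hindi_words), where english_nes is
    a list of (word, tag) pairs from Stanford NER and hindi_words a list of
    words. Returns {hindi_word: tag} via maximum co-occurrence."""
    weight = defaultdict(int)
    tag_of = {}
    for en_nes, hi_words in subclusters:
        for en_word, tag in en_nes:
            tag_of[en_word] = tag
            for hw in hi_words:
                if hw not in stopwords:
                    weight[(en_word, hw)] += 1   # default weight 1, then +1
    # For each Hindi word, keep the English NE it co-occurs with most.
    best = {}
    for (en_word, hw), w in weight.items():
        if hw not in best or w > best[hw][1]:
            best[hw] = (en_word, w)
    return {hw: tag_of[en] for hw, (en, _) in best.items()}

subs = [([("Alexander", "PERSON"), ("India", "LOCATION")],
         ["alexander_hi", "india_hi"]),
        ([("Alexander", "PERSON"), ("Philip", "PERSON")],
         ["alexander_hi", "philip_hi"])]
tags = cooccurrence_tags(subs)
```

The pair (Alexander, alexander_hi) appears in both sub-clusters and so wins with weight 2, transferring the PERSON tag; words seen only once are resolved by whichever pairing was recorded first, which is why larger clusters give more reliable tags.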

3.6 Identification of NEs from Abstract

The abstract frequently occurs next to the Infobox in non-English Wikipedia pages. The abstract is the first few lines of a Wikipedia article, which summarize the content of the document; hence, it is a good source of NEs. The abstract is treated as an anonymous subtitle of the article. Though the identification procedure for the abstract is similar to that for subtitles, the abstract is considered separately for two reasons:

(a) The abstract is one of the important structures of Wikipedia.

(b) The number of Wikipedia documents having an abstract is considerably larger than the number having subtitles with corresponding data. Hence, it is easier to measure the contribution of the abstract when it is considered separately.

To identify NEs from the abstract, each Wikipedia document in a cluster, represented in English and a non-English language (say Hindi), is considered. Stop words are removed from the Indian-language abstract; the English abstract is tagged with the Stanford NER and non-NEs are removed.

The abstract is around 3-4 sentences, and after stop-word removal approximately 10 words remain. Since the English data is tagged with the Stanford NER, each tagged English word from the abstract is paired with each of the remaining non-tagged Indian-language words (from the abstract), and a default weight (= 1) is assigned to each such pair.

The process is repeated with the abstracts of the other pages in the cluster. If the same pair of a tagged English word and a non-tagged Indian-language word occurs again in another page's abstract, the weight of that pair is increased by 1. Thus, the term co-occurrences between English and other-language words are calculated. Finally, each Indian-language word is mapped to its maximally co-occurring English word and is assigned the tag of that English word. Hence, the Indian-language data is tagged.

3.7 Evaluation

3.7.1 Dataset and Test Set

Experiments were performed on Wikipedia datasets in English, Hindi, Marathi and Telugu. The English Wikipedia contains 4,128,536 documents and is approximately 9.7 GB. The Hindi Wikipedia contains nearly 100,000 documents and is approximately 386 MB. The Marathi Wikipedia contains nearly 34,000 documents and is approximately 156 MB. The Telugu Wikipedia contains 45,273 documents and is approximately 183 MB. There are 22,300, 12,000 and 15,000 articles in Hindi, Marathi and Telugu, respectively, that have interlanguage links to English Wikipedia pages. We manually tagged the Hindi, Marathi and Telugu articles of 50, 45 and 63 random clusters, respectively (as cluster size can affect accuracies), with three NE tags (Person, Organization, Location), resulting in 2,328 Hindi articles with around 11,000 NE tags, 1,658 Marathi articles with 9,000 NE tags, and 2,200 Telugu documents with 12,000 NE tags. All further experiments were performed on this tagged dataset.

3.7.2 Baseline System

For Telugu, the model is compared with the system developed by [Gali et al. 2008], which uses a CRF with language-independent features such as prefixes, suffixes, the previous and next 3 tokens, and compound features, together with some language-dependent features such as POS tags and chunk tags. The accuracy of that system is around 45%. For Hindi, we compared our system with the Hindi NER system developed by LTRC (Language Technologies Research Centre)1, IIIT Hyderabad. It uses Conditional Random Fields (CRF) and achieved an F-measure of 63%. Their system was reproduced on our dataset with 5-fold cross validation, using spell variations, suffix patterns and POS tags as features, and serves as our baseline throughout the experiments. There is no existing system available for Marathi NER; hence, the Marathi results are not compared but simply reported.

3.7.3 Evaluation Metrics

Precision, Recall and F-measure are the evaluation metrics, defined as follows:

(a) Precision: P = c/r
(b) Recall: R = c/t
(c) F-measure: F = 2PR/(P + R)

where c is the number of correctly retrieved (identified) NEs, r is the total number of NEs retrieved by the system being evaluated (correct plus incorrect), and t is the total number of NEs in the reference data.
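The three metrics follow directly from the counts c, r and t:

```python
def prf(c, r, t):
    """Precision, Recall and F-measure from:
    c = correctly retrieved NEs,
    r = total NEs retrieved by the system,
    t = total NEs in the reference data."""
    p = c / r
    rec = c / t
    f = 2 * p * rec / (p + rec)
    return p, rec, f

# E.g., 80 correct out of 100 retrieved, against 120 reference NEs:
p, rec, f = prf(c=80, r=100, t=120)   # P = 0.8, R = 2/3, F = 8/11
```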

3.8 Experiments and Results

The experiments conducted are broadly classified as follows:

3.8.1 Experiment 1: Exploitation of structure of the Wikipedia page

Using the structure of Wikipedia, namely the category terms, we can cluster the articles that share similar category terms. Another approach is to treat the Wikipedia page as unstructured and cluster the articles based on the similarity of the words they contain. We performed hierarchical GAAC-based clustering for these experiments. The structure of Wikipedia is then explored, and the contribution of each structural aspect is reported.

1http://ltrc.iiit.ac.in

No Category: Clustering without using the Category information. As the first experiment, the articles are clustered based on the article text, without using the category terms; that is, the documents are clustered based on the overlap of terms, using the hierarchical clustering algorithm mentioned above.

With Category: Clustering using the Category information. In this experiment, the category terms are used to cluster the documents. The F-measure suggests that category terms capture the semantics of an article better than the article text; in addition, category terms form a compact representation of an article, whereas the text includes noisy terms.

The structure of Wikipedia is then included for identifying and tagging the NEs.

Include Subtitles: Now consider a semi-structural aspect of Wikipedia, the subtitles. The results show an improvement from assigning correct tags to the Hindi/Marathi data, but the increase is small because of the limited data in the articles.

Include Abstract: The abstract is also considered a semi-structural part of Wikipedia. Its inclusion improves the results further, but the limited presence of abstracts in articles is reflected in the scores.

Include Infobox: The convenient tabular representation of the Infobox, along with its high availability, leads to promising results, as shown below. The tables below report the results for Hindi, Telugu and Marathi; as the Marathi system could not be compared against a baseline, its results are simply reported.

The importance of a compact representation of articles is confirmed by our next set of experiments.

                   Precision  Recall  F-measure
NER LTRC           64.9       50.6    56.81
No Category        69.8       62.7    66.05
With Category      73.5       64.3    68.59
Include Subtitles  74.3       65.5    69.6
Include Abstract   80.6       68.9    74.3
Include Infobox    88.5       73.7    80.42

Table 3.1 Experiment to note the contribution of structure of Wikipedia in Hindi

The values of No Category and With Category are similar across languages, as the clustering is performed on English with or without using categories; hence, it does not affect the other languages much.

                   Precision  Recall  F-measure
Baseline System    64.09      34.57   44.91
No Category        69.8       62.7    66.05
With Category      73.5       64.3    68.59
Include Subtitles  74.5       68.4    71.32
Include Abstract   81.7       69.8    75.3
Include Infobox    88.6       72.9    79.98

Table 3.2 Experiment to note the contribution of structure of Wikipedia in Telugu

                   Precision  Recall  F-measure
No Category        69.8       62.7    66.05
With Category      73.5       64.3    68.59
Include Subtitles  76.7       67.4    71.5
Include Abstract   82.5       70.3    75.9
Include Infobox    89.2       74.6    81.25

Table 3.3 Experiment to note the contribution of structure of Wikipedia in Marathi

3.8.2 Experiment 2: Similarity metrics for Clustering

Different clustering metrics yield different accuracies for a given dataset. Here, we measure which similarity metric is most appropriate for the dataset under study, following a category-information-based clustering of articles.

SLAC: Single-linkage Agglomerative Clustering. The single-linkage algorithm uses the minimum distance between clusters as the similarity metric. One drawback of this measure is that a single document related to two clusters is enough to merge them. In Wikipedia there are no unrelated documents; all documents have some overlap of terms with one another. Hence, the number of clusters formed is relatively small compared to the other two similarity measures, and the Precision, Recall and F-measure values are correspondingly low.

CLAC: Complete-linkage Agglomerative Clustering. The complete-linkage algorithm uses the maximum distance between clusters as the similarity metric. This results in a preference for compact clusters with small diameters over long ones; hence, the accuracies improve. Its drawback is sensitivity to outliers.

GAAC: Group Average Agglomerative Clustering. Group-average clustering evaluates cluster similarity as the average over all pairs of documents, falling between the single-linkage and complete-linkage metrics. It therefore combines the advantages of both, overcoming their drawbacks to some extent, and the accuracies improve considerably over the previous experiments.

          Precision  Recall  F-measure
NER LTRC  64.9       50.6    56.81
SLAC      67.6       60.3    63.74
CLAC      70.3       61.1    65.38
GAAC      73.5       64.3    68.59

Table 3.4 Experiment to compare Similarity Metrics

3.8.3 Experiment 3: Variations in Lambda Scores

Recall equations (3.2) and (3.3):

Score1(ei, hj) = (λ1 ∗ number of occurrences) + (λ2 ∗ order of occurrence)

subject to λ1 + λ2 = 1,

where λ1 is the weight assigned to the number of co-occurrences, λ2 is the weight assigned to the order of occurrence, and Score1(ei, hj) is the weight assigned to a pair of English and Hindi Keys.

Changing the values of λ1 and λ2 changes the weight Score1(ei, hj), which in turn changes the selected Key pairs and their associated Values, hence the tags of the Hindi/Marathi data, and thus the F-Measure scores. Therefore, changes in λ1 and λ2 lead to changes in the F-Measure; the graph below shows the variation in F-Measure with λ1.

The value of λ1 is varied from 0.1 to 0.9. As shown in Figure 3.1, with very high values of λ2 the F-Measure scores are quite low: although the order of occurrence of Keys in the Infobox is similar across pages, this statistic is limited by the amount of data, since it refers to a single Hindi/Marathi article at a time. As λ1 increases, this restriction is lifted, because the number of co-occurrences is calculated across all pages; hence, the F-Measure increases. However, considering only the number of co-occurrences does not give the best results either, since one English Key attribute can co-occur with many Hindi/Marathi Key attributes.

Figure 3.1 Variation of λ1 values

Thus the optimum values are λ1 = 0.65 and λ2 = 0.35.
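Determining λ1 experimentally amounts to a simple grid search. A sketch, with a hypothetical `evaluate_f_measure(lam1)` standing in for re-running the Key-mapping pipeline and scoring it against the gold tags:

```python
def grid_search_lambda(evaluate_f_measure, step=0.05):
    """Try lambda1 over [0.1, 0.9] and keep the value with the best
    F-measure (lambda2 = 1 - lambda1 by the constraint of Eq. 3.3)."""
    best_lam, best_f = None, -1.0
    lam = 0.10
    while lam <= 0.90 + 1e-9:
        f = evaluate_f_measure(round(lam, 2))
        if f > best_f:
            best_lam, best_f = round(lam, 2), f
        lam += step
    return best_lam, best_f

# Toy stand-in: a concave score curve peaking at 0.65, mimicking the
# shape reported in Figure 3.1 (not the thesis' actual evaluation).
lam, f = grid_search_lambda(lambda l: -(l - 0.65) ** 2)
```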

3.8.4 Experiment 4: Varying the Beta Values

Recall equations (3.5) and (3.6):

Score2(ei, hj) = (β1 ∗ number of occurrences) + (β2 ∗ order of occurrence)

subject to β1 + β2 = 1,

where β1 is the weight assigned to the number of co-occurrences, β2 is the weight assigned to the order of occurrence, and Score2(ei, hj) is the weight assigned to a pair of English and Hindi subtitles.

Similar to the previous section, a change in the weights β1 and β2 changes Score2(ei, hj), which in turn changes the assignment of Hindi subtitles to English subtitles, and hence the F-Measure scores. The graph below shows the variation of F-Measure with β1.

The value of β1 is varied from 0.1 to 0.9 (with β2 = 1 − β1). As in the previous section, with low values of β1 the F-Measure is quite low, since Score2(ei, hj) then depends mainly on the order of occurrence, and the large mismatch between the numbers of subtitles in English and Hindi/Marathi leads to low scores.

Figure 3.2 Variation of β1 values

As β1 increases, the F-Measure also increases, reaching its maximum at β1 = 0.7. This shows that the number of co-occurrences between Hindi/Marathi and English pairs plays the major role in the F-Measure scores, though the order of occurrence is not insignificant.

3.9 Discussions

For evaluation, Hindi/Marathi documents were tagged manually, and those tags were compared with the tags produced by our approach. The data in Hindi/Marathi is much smaller than in English; thus, utilising the data fully has been the major concern of the method. In the first experiment, the F-Measure is below the baseline when the structure of Wikipedia is not considered; the accuracies then rise step by step with the inclusion of each structural aspect of Wikipedia, and the final results are encouraging. In the second experiment, the similarity metrics are compared among themselves and GAAC is found to be the best choice. From the third and fourth experiments, we derived the values of the unknown variables in equations (3.2) and (3.5).

3.10 Conclusions and Future Work

This chapter identifies and extracts Named Entities for Indian languages, with particular concern for languages whose data is very small compared to English. The suggested approach is simple yet efficient, easily reproducible, and can be extended to any other language, as it is developed within a language-independent framework. Wikipedia pages across languages are merged at the subtitle level, and the non-English NEs are then identified based on term-term co-occurrence frequencies. Each structural aspect of Wikipedia is considered separately, and the NEs of Indian languages are found based on the NEs of English. From the experiments, we conclude that the accuracies increase with proper utilisation of the Wikipedia structure and its data. Hence, the Wikipedia-derived system can be used as a supplement in various applications where language-dependent systems are used. Moreover, the approach can be extended to any language, as no language-dependent tools are used.

Chapter 4

Named Entity Recognition

In the previous chapter, without using any language-dependent tools or involving any language experts, NEs were identified and extracted from an external knowledge source, namely Wikipedia, by a term co-occurrence approach. The output of Chapter 3 is a list of NEs (named the WikiList of NEs) tagged on the basis of their maximally co-occurring English words. However, this word list is restricted to the NEs present in Wikipedia; for NEs not present in any Wikipedia document, the approach of the previous chapter is not sufficient to recognize them. Given that the data on the web for Indian languages is small, the available data should be utilized to the maximum. Hence, approaches are needed to recognize named entities in new documents.

4.1 Named Entity Recognition Vs Named Entity Identification

Named Entity Identification (NEI) is the process of identifying or extracting NEs from given data, i.e., of retrieving NEs that are already known. In our case, the known NEs are those present in the WikiList, the output of the previous chapter. Given a random person name, for example my own name, Mahathi Bhagavatula, the NE Mahathi Bhagavatula is not present in the WikiList, as there is no Wikipedia document in which this name occurs. Hence, this name cannot be identified by an NEI system.

The main task of NER, in contrast, is to recognize entities from the given text. NER is implemented by building a statistical model. The main objective of this chapter is to build such a model from a set of documents, using the output of the previous chapter, to recognize NEs in a language-independent way. The model is built on monolingual web pages crawled from the web, using the WikiList of NEs as an anchor. In this model, we classify words into different categories, and for each category we derive a set of features from the words of the monolingual web documents. These feature sets are used in the recognition and tagging of NEs.

4.2 Building of Statistical Model

The model generated in this chapter is a supervised classification model, trained on the monolingual web pages and the output of the previous chapter. The monolingual web pages are crawled from the web in various languages and used as input for building the model. There are four classification categories: PERSON, LOCATION, ORGANIZATION and Non-NE. The WikiList of NEs is tagged with the categories PERSON, LOCATION and ORGANIZATION using the Stanford NER; words that do not fall into any of these categories are Non-NEs. The Non-NE class plays a crucial role in decreasing the number of misclassifications and increasing the accuracies.

The algorithm used in this chapter can be explained in three steps. During Feature Generation and Selection, the monolingual documents are considered and the NEs in those documents are extracted using the WikiList of NEs. The terms in each document that are not tagged as NEs are then assigned various scores.

4.2.1 Naive Bayes Classification

Naive Bayes classification is a classification technique based on Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable. While calculating scores for the terms that are not NEs, the parameters of naive Bayes classification are used with a slight modification. In this way, a set of features is generated from the given documents.
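As an illustration of the idea only (a minimal sketch; the toy windows, tokens and counts below are hypothetical, not data from the thesis), a naive Bayes classifier over word features can be written as:

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training windows standing in for the 150-word windows
# tagged through the WikiList lookup: (tokens, category) pairs.
TRAIN = [
    (["temple", "river", "district"], "LOCATION"),
    (["village", "temple", "hill"], "LOCATION"),
    (["minister", "born", "scientist"], "PERSON"),
    (["company", "founded", "office"], "ORGANIZATION"),
]

def train_naive_bayes(data):
    """Estimate class priors and per-class term counts."""
    class_counts = Counter(c for _, c in data)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in data:
        term_counts[c].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify(tokens, class_counts, term_counts, vocab):
    """Most probable category under P(c) * prod P(t|c), add-one smoothed."""
    n = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c, cc in class_counts.items():
        lp = math.log(cc / n)
        total = sum(term_counts[c].values())
        for t in tokens:
            lp += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

model = train_naive_bayes(TRAIN)
print(classify(["temple", "hill"], *model))  # LOCATION
```

The independence assumption shows up in the product over tokens: each term's probability is looked up without regard to the other terms in the window.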

Next, during the Recognition and Tagging phase, thresholds are set for various parameters to extract the correct NEs from a new document. Finally, in the challenges and enhancements phase, the challenges described in Section 1.2.3 are revisited and the enhancements to the approach that overcome them are specified. Throughout the process, no language-dependent tools or resources are used. The approach is explained in detail in the following three steps:

4.3 Feature Generation and Selection

Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when added to the document representation, increases the classification error on new data. Suppose a rare term, say arachnocentric, has no information about a class, say China, but all instances of arachnocentric happen to occur in China documents in our training set. Then the learning method might produce a classifier that misassigns test documents containing arachnocentric to China. Such an incorrect generalization from an accidental property of the training set is called overfitting. We can view feature selection as a method for replacing a complex classifier (using all features) with a simpler one (using a subset of the features). The main aim of this step is to categorize the words in the documents and generate features from those words.

Consider each document and divide it into windows of 150 words. For each window, compute the overlap of its words with the list of NEs, i.e., extract the NEs from each window that are present in the WikiList. Note the category of each NE, i.e., whether it is the name of a PERSON, LOCATION or ORGANIZATION. The remaining words of the window (non-NEs and less frequently occurring words) form the features for the category the NEs belong to. There are a number of techniques for calculating the probabilities or scores of these features.

The above approach is explained with the help of an example. Two paragraphs (instead of a window of 150 words, two paragraphs are considered for simplicity) are shown below in Telugu, describing the places Kumarabhimaramam and Somaramam, famous temples of Andhra Pradesh.

Some of the NEs in these paragraphs include

which are identified by the list-lookup approach. The majority of these NEs are categorized as LOCATION; hence, the paragraphs are categorized as LOCATION. Note: LOC is used instead of LOCATION in the above taggings for the sake of simplicity. Now, from the remaining set of words, remove the stop-words (words which occur most frequently). The remaining words of a document are considered the terms of that document. In this case some of the terms are

Repeat this for all windows and collect the terms present in each window. The set of terms from all windows is considered the vocabulary of terms. Now, a term-document matrix TD is built, i.e., TD(t,d) = 1 if there exists a term t in document d, and TD(t,d) = 0 otherwise.
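The term-document matrix can be sketched as follows (a toy illustration; the document contents are invented):

```python
def build_td_matrix(docs):
    """Binary term-document matrix: TD[(t, d)] = 1 iff term t occurs in
    document (or window) d, else 0. docs is a list of token lists."""
    vocabulary = sorted({t for doc in docs for t in doc})
    td = {}
    for i, doc in enumerate(docs):
        present = set(doc)
        for t in vocabulary:
            td[(t, i)] = 1 if t in present else 0
    return vocabulary, td

docs = [["temple", "river"], ["temple", "hill"]]
vocab, td = build_td_matrix(docs)
print(td[("river", 0)], td[("river", 1)], td[("temple", 1)])  # 1 0 1
```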

There are three rules for tagging a window:

(a) The category of the NEs present in the window is the category of the window.
(b) If different categories are present in a window, the category with the maximum number of NEs is the category of the window. If two categories have an equal number of NEs, the window is counted for both of those categories.
(c) If no NEs are present in the window of 150 words, the window is categorized as Non-NE.
Hence, we derive the number of windows for each category.
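The three tagging rules above can be sketched directly (the function and category names are mine):

```python
from collections import Counter

def tag_window(ne_categories):
    """Apply rules (a)-(c): ne_categories is the list of categories of the
    NEs found in one window via the WikiList lookup."""
    if not ne_categories:               # rule (c): no NEs in the window
        return ["Non-NE"]
    counts = Counter(ne_categories)
    top = max(counts.values())
    # rules (a)/(b): majority category; a tie keeps every tied category
    return sorted(c for c, n in counts.items() if n == top)

print(tag_window(["LOC", "LOC", "PER"]))  # ['LOC']
print(tag_window(["LOC", "PER"]))         # ['LOC', 'PER']
print(tag_window([]))                     # ['Non-NE']
```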

From this vocabulary, only certain terms are considered for each category. Such terms are considered the features of that category, and they are derived by assigning scores to all words in the vocabulary.

The scores that are considered for classification are explained below in detail:

4.3.1 Mutual Information

A common feature selection method is to compute A(t, c) as the expected mutual information (MI) of term t and class c. MI measures how much information the presence or absence of a term contributes to making the correct classification decision on c.

Formally:

I(U; C) = Σ_{e_t ∈ {1,0}} Σ_{e_c ∈ {1,0}} P(U = e_t, C = e_c) log2 [ P(U = e_t, C = e_c) / ( P(U = e_t) P(C = e_c) ) ]    (4.1)

where U is a random variable that takes values e_t = 1 (the document contains term t) and e_t = 0 (the document does not contain t), and C is a random variable that takes values e_c = 1 (the document is in class c) and e_c = 0 (the document is not in class c). We write U_t and C_c if it is not clear from context which term t and class c we are referring to.

For MLEs of the probabilities, Equation 4.1 is equivalent to Equation 4.2:

I(U; C) = (N11/N) log2 ( N·N11 / (N1.·N.1) ) + (N01/N) log2 ( N·N01 / (N0.·N.1) )
        + (N10/N) log2 ( N·N10 / (N1.·N.0) ) + (N00/N) log2 ( N·N00 / (N0.·N.0) )    (4.2)

where the Ns are counts of documents with the values of e_t and e_c indicated by the two subscripts. For example, N10 is the number of documents that contain t (e_t = 1) and are not in c (e_c = 0). N1. = N10 + N11 is the number of documents that contain t (e_t = 1), counted independent of class membership (e_c ∈ {0, 1}). N = N00 + N01 + N10 + N11 is the total number of documents.

The main aim of calculating the mutual information of a term t is to measure how much the term contributes to a category by being present as one of its features, and how much it contributes by being absent.

From the earlier example, the category of a window is determined by the categories of the NEs present in it; the category of the previous example is LOCATION, as all its NEs are categorized as LOCATION.

If a term t from the vocabulary is present in window w, and the window in turn is present in category c, then the term t belongs to category c.

For example, if a word belongs to document/window D1 and D1 is under category LOCATION, then that word belongs to category LOCATION.

For the sake of simplicity, let us assume only one category, LOCATION, and one word (kshetram in English):

To derive the MI of the term 'kshetram' and the category LOCATION, consider the following table:

                        e_c = e_location = 1    e_c = e_location = 0
e_t = e_kshetram = 1    N11 = 1506              N10 = 6694
e_t = e_kshetram = 0    N01 = 2300              N00 = 20762

Table 4.1 Calculation of MI

I(U; C) = (1506/31262) log2 ( 31262·1506 / (8200·3806) )
        + (2300/31262) log2 ( 31262·2300 / (23062·3806) )
        + (6694/31262) log2 ( 31262·6694 / (8200·27456) )
        + (20762/31262) log2 ( 31262·20762 / (23062·27456) )    (4.3)

Thus, we have calculated the MI of the term with respect to the category LOCATION. This score determines how much the presence or absence of the word contributes to the category LOCATION.

The process is repeated for all words, yielding similar scores; that is, the MI of each term with every category is determined.
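Equation 4.2 can be computed directly from the four document counts; here is a sketch (the function name is mine), checked against the counts of Table 4.1:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected MI of term t and class c (Eq. 4.2) from document counts.
    n11: docs containing t and in c; n10: containing t, not in c;
    n01: in c without t; n00: neither."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # marginals over the term
    n_1, n_0 = n11 + n01, n10 + n00   # marginals over the class
    mi = 0.0
    for nij, row, col in [(n11, n1_, n_1), (n01, n0_, n_1),
                          (n10, n1_, n_0), (n00, n0_, n_0)]:
        if nij:  # a zero cell contributes nothing
            mi += (nij / n) * math.log2(n * nij / (row * col))
    return mi

# The 'kshetram' / LOCATION counts from Table 4.1: a small positive value,
# since the term is only a weak indicator of the class.
print(mutual_information(1506, 6694, 2300, 20762))
```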

4.3.2 χ2 Feature Selection

Another popular feature selection method is χ2. In statistics, the χ2 test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are the occurrence of the term and the occurrence of the class. We then rank terms with respect to

the following quantity:

χ2(D, t, c) = Σ_{e_t ∈ {1,0}} Σ_{e_c ∈ {1,0}} ( N_{e_t e_c} − E_{e_t e_c} )² / E_{e_t e_c}    (4.4)

where e_t and e_c are defined as in Equation 4.2. N is the observed frequency in D and E the expected frequency. For example, E11 is the expected frequency of t and c occurring together in a document, assuming that term and class are independent.

This score determines the deviation between the observed and expected frequencies. The expected frequency is calculated on the assumption that the term t and class c are independent of each other, i.e., that the occurrence of the term does not affect the occurrence of the category. A high χ2 value means that the observed frequency deviates greatly from the expected frequency, which in turn means that the dependency between term t and category c is high. Hence, if χ2 for a term t and category c is high, the term will be a good feature, because the occurrence of the term implies the occurrence of the category.

The χ2 test is calculated between the word and the category LOCATION as follows. First, E11 is calculated from Table 4.1:

E11 = N × P(t) × P(c) = N × ( (N11 + N10) / N ) × ( (N11 + N01) / N )    (4.5)
    = 31262 × ( (1506 + 6694) / 31262 ) × ( (1506 + 2300) / 31262 ) ≈ 998.3

                  e_location = 1              e_location = 0
e_kshetram = 1    N11 = 1506, E11 ≈ 998.3     N10 = 6694, E10 ≈ 7201.7
e_kshetram = 0    N01 = 2300, E01 ≈ 2807.7    N00 = 20762, E00 ≈ 20254.3

Table 4.2 Calculation of the χ2 test

The above calculation is done for all terms in the vocabulary, and the respective scores, along with the categories, are stored.
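Equation 4.4 can be sketched as follows (function name mine), using the same counts as Table 4.1; each expected frequency is derived from the row and column totals:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for term/class independence (Eq. 4.4).
    Expected counts assume the term and the class occur independently."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # marginals over the term
    n_1, n_0 = n11 + n01, n10 + n00   # marginals over the class
    chi2 = 0.0
    for observed, row, col in [(n11, n1_, n_1), (n10, n1_, n_0),
                               (n01, n0_, n_1), (n00, n0_, n_0)]:
        expected = row * col / n
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# The 'kshetram' / LOCATION counts from Table 4.1:
print(round(chi_square(1506, 6694, 2300, 20762), 1))  # ≈ 398.5
```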

4.3.3 Frequency Based Feature Selection

A third feature selection method is frequency-based feature selection, that is, selecting the terms that are most common in the category. Two frequencies can be defined: document frequency and collection frequency. Document frequency is the number of documents in category c that contain the term t, and collection frequency is the number of tokens of t that occur in documents in c. Frequency here is the sum of document frequency and collection frequency. Normalized frequency is the document frequency divided by the total number of documents in the category, plus the collection frequency divided by the total number of terms in those documents.

For a term t, document set D and category C, the document, collection and normalized frequencies are given by:

CollectionFrequency(cf) = Σ_{d ∈ C} tf(t, d), the number of tokens of t in documents of category C    (4.6)

DocumentFrequency(df) = |{ d ∈ D : d ∈ C and t ∈ d }|    (4.7)

NormalizedFrequency = df / (total documents in C) + cf / (total number of terms in those documents)    (4.8)

Frequency-based feature selection selects some frequent terms that have no specific information about the category, but it is still important, as it yields the most frequent words specific to a category.
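Equations 4.6–4.8 can be sketched as follows; reading "total number of terms" as the token count over the category's documents is my assumption, since the original wording is ambiguous:

```python
def frequency_scores(term, category_docs):
    """Document, collection and normalized frequency of one term within one
    category (Eqs. 4.6-4.8). category_docs is a list of token lists."""
    df = sum(1 for doc in category_docs if term in doc)   # Eq. 4.7
    cf = sum(doc.count(term) for doc in category_docs)    # Eq. 4.6
    total_docs = len(category_docs)
    total_terms = sum(len(doc) for doc in category_docs)
    nf = df / total_docs + cf / total_terms               # Eq. 4.8
    return df, cf, nf

# Toy category of three documents: df=2, cf=3, nf = 2/3 + 3/6
docs = [["temple", "temple", "river"], ["hill", "temple"], ["river"]]
print(frequency_scores("temple", docs))
```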

4.3.4 Point-wise Mutual Information

The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically:

PMI(x, y) = log [ P(x, y) / ( P(x) P(y) ) ]    (4.9)

Unlike all the previous feature selection methods, where the contribution of each term t to the category c is calculated individually, this method calculates the occurrence of a pair of terms (t1, t2) together within the category c. In short, until now only unigrams were considered, but this method also considers bigrams.

Consider the whole vocabulary of terms and pair every two terms in the vocabulary. This parameter is a deviation from naive Bayes classification, as it includes an aspect of dependency between terms. Calculate the presence or absence of each term of the pair (t1, t2) within the window of words. This yields a matrix of term pairs against documents and categories, described below:

Now, let pdf be the number of documents of category c in which the pair of words x and y occur together. Normalizing, P(x, y) = pdf / (number of documents in category c); that is, pdf is divided by the total number of documents in category c. P(x) and P(y) are the normalized frequencies of x and y.

The repeated occurrence of a pair of terms in different documents under the same category implies that the probability of the pair occurring together in the category is high. Hence, this is an important feature selection method.
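A sketch of Equation 4.9 with the normalization described above (the counts are hypothetical; the thesis takes P(x) and P(y) from normalized frequencies, which this sketch approximates by document proportions):

```python
import math

def pmi(pair_count, x_count, y_count, total_docs):
    """PMI of terms x and y within one category (Eq. 4.9). All counts are
    document counts inside the category, normalized by its size."""
    p_xy = pair_count / total_docs
    p_x = x_count / total_docs
    p_y = y_count / total_docs
    return math.log(p_xy / (p_x * p_y))

# Hypothetical counts: x and y co-occur in 30 of 100 category documents;
# x occurs in 50 of them, y in 40.
print(round(pmi(30, 50, 40, 100), 3))  # ≈ 0.405, i.e. log(1.5)
```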

Finally, the total score is a linear combination of all the above scores. That is, the total score of a term t w.r.t. category c is given by:

totalScore(t, c) = α1·MI(t, c) + α2·χ2(t, c) + α3·NormalizedFrequency(t, c) + α4·PMI(t1, t2 | c)    (4.10)

The total score is calculated for all the terms in the vocabulary V and for all the categories of each term t. The weights α1, α2, α3, α4 are 0.41, 0.32, 0.20 and 0.07 respectively, determined experimentally. From these words, only a subset is considered as features: the words whose scores exceed a threshold. The thresholds for the categories PERSON, LOCATION, ORGANIZATION and Non-NE are 63%, 65%, 62% and 67% respectively. If the same term occurs in two different categories, it is considered a feature of both categories.

Thus, a set of features is calculated for every category.
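Putting the pieces together, Equation 4.10 with the reported weights and thresholds can be sketched as follows; reading each percentage threshold as a fraction of the maximum observed score is my assumption, since the normalization is not spelled out:

```python
# Weights and per-category cutoffs as reported in the text.
ALPHA = (0.41, 0.32, 0.20, 0.07)
THRESHOLD = {"PERSON": 0.63, "LOCATION": 0.65,
             "ORGANIZATION": 0.62, "Non-NE": 0.67}

def total_score(mi, chi2, norm_freq, pmi_score):
    """Eq. 4.10: weighted linear combination of the four scores."""
    a1, a2, a3, a4 = ALPHA
    return a1 * mi + a2 * chi2 + a3 * norm_freq + a4 * pmi_score

def is_feature(score, category, max_score):
    """Keep a term as a feature when its score clears the category cutoff
    (assumed here to be a fraction of the maximum observed score)."""
    return score >= THRESHOLD[category] * max_score

s = total_score(0.6, 0.8, 0.4, 0.1)
print(round(s, 3), is_feature(s, "PERSON", 0.8))  # 0.589 True
```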

4.3.5 Why only these Features?

While working with NEs, I also implemented other models, namely a CRF tool. The main disadvantage of any of these statistical approaches is that they need large amounts of manually tagged data for training and testing. The constraint of human involvement is present even when writing language-dependent rules for the recognition of NEs, while developing a system without human intervention tends to suffer in accuracy. Thus, there is a need to develop a language-independent system with good accuracy.

Hence, the problem of NER can be viewed as a classification problem with the categories being names of persons, organizations and locations. To classify the NEs correctly, a set of features needs to be generated for each category. These features can be generated using any classification technique; the naive Bayes classifier, however, assumes no dependency between the terms of documents. An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed independent, only their variances need to be determined for each class, and not the entire covariance matrix. Given that web data for Indian languages is scarce, the naive Bayes classifier works well compared to other classification techniques.

In the context of text mining, the naive Bayes classifier is used with MI, the χ2 test and frequency-based feature selection as parameters. Mutual information measures how much information, in the information-theoretic sense, a term contains about the class. If a term's distribution is the same in the class as in the collection as a whole, then I(U; C) = 0. MI reaches its maximum value if the term is a perfect indicator of class membership, that is, if the term is present in a document if and only if the document is in the class.

χ2 feature selection only ranks features with respect to their usefulness and is not used to make statements about the statistical dependence or independence of variables. In text classification it rarely matters whether a few additional terms are added to the feature set or removed from it; rather, the relative importance of features matters.

Frequency-based feature selection selects some frequent terms that have no specific information about the class, for example, the days of the week (Monday, Tuesday, ...), which are frequent across classes in text. When many thousands of features are selected, frequency-based feature selection often does well. Thus, it can be a good alternative to more complex methods.

PMI is a feature that deviates from the assumption that the terms in a document are independent of each other. Hence, PMI calculates the dependency between the terms in a document.

4.4 Recognition and Tagging of NEs

The purpose of this step is to recognize and tag the NEs present in a document using the features generated in the previous step. The documents used for recognizing NEs are a new set of documents, not used for generating the features, so that the model is not over-fitted to the documents used for feature selection.

The main idea of this step is to use the generated features to identify the existence of NEs, and further to recognize their categories. The whole procedure is explained below in detail:

Consider each document d and divide it into windows of 150 words. From each window, remove the frequently occurring words (words that occur in more than 45% of the whole corpus). Then take the overlap of the remaining (non-frequent) words with the feature sets of each category, and calculate and compare the scores across categories. The window is assigned the maximum-scoring category if that score exceeds a threshold. The thresholds vary with window size and are determined experimentally.

The above process is explained with an example. Let us consider a new window of 150 words. For the sake of better explanation, a window of category LOCATION, describing the place Ksheraramam (another famous temple of Andhra Pradesh), is considered, similar to the previous case:

From the window of 150 words, remove the stop-words (words which occur most frequently); the remaining words are as follows

The scores for these words are calculated. If the score exceeds a threshold of 62%, then an NE exists in the window. Now, reduce the window to half its size and calculate the score for each half. If either half exceeds the 62% threshold, an NE is present in that half, and the other half is eliminated. If both halves exceed the threshold, both are kept. This process is repeated until we reach a window of about 5 words, or a single sentence.
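The recursive halving can be sketched as follows (the scoring function here is a toy stand-in for the category scores described above, and the words and thresholds in the example are invented):

```python
def locate_ne_windows(words, score_fn, threshold, min_size=5):
    """Recursively halve a window, keeping any half whose score exceeds
    the threshold, until windows shrink to roughly min_size words."""
    if score_fn(words) < threshold:
        return []                      # no evidence of an NE here
    if len(words) <= min_size:
        return [words]                 # small enough: report this span
    mid = len(words) // 2
    return (locate_ne_windows(words[:mid], score_fn, threshold, min_size)
            + locate_ne_windows(words[mid:], score_fn, threshold, min_size))

# Toy scorer: fraction of words that belong to a known feature set.
features = {"temple", "kshetram"}
score = lambda ws: sum(w in features for w in ws) / len(ws)
print(locate_ne_windows(["temple", "kshetram", "a", "b", "c", "d", "e", "f"],
                        score, threshold=0.2, min_size=2))
```

The halves that score below the threshold are pruned, so the search narrows down to the spans that actually contain NE evidence.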

For example, consider that in the previous example the window size is reduced to two sentences, wherein the first sentence is

the words which are considered as features are

Hence, by considering each word as an NE and calculating the scores of the remaining words as features, we derive that

are the NEs, and they are tagged as LOCATION, as those words are features of the category LOCATION.

Similarly, for the other sentence

the words considered as features are

Following the same approach of calculating scores to detect NEs, the NEs derived from this sentence are:

This process is continued for the remaining sentences until all the words are tagged. Hence, NEs are recognized in a language-independent way.

4.5 Challenges and Enhancements

The basic idea behind the above steps is that recognition of an NE is done using the remaining words in the document; that is, given that a document contains a particular set of words, the probability of the presence of an NE is calculated. The NEs generated can appear in various ambiguous forms. As mentioned in Section 1.5, there are a number of challenges in performing NER. To overcome those challenges, some modifications to the above algorithm are made. The challenges and the modifications are described below:

4.5.1 Grouping of Similar NE’s to overcome Variations in NE’s

A nested NE (i.e., an NE with more than one word in it) can be referred to in many forms. For example, consider a passage on Abdul Kalam:
"Avul Pakir Jainulabdeen Abdul Kalam (born 15 October 1931), usually referred to as Dr. A. P. J. Abdul Kalam, is an Indian scientist and administrator who served as the 11th President of India. Kalam was born and raised in Rameswaram, Tamil Nadu, studied physics at St. Joseph's College, Tiruchirappalli, and aerospace engineering at the Madras Institute of Technology (MIT), Chennai."
The NE Abdul Kalam is referred to as Kalam, or Avul Pakir Jainulabdeen Abdul Kalam, etc.

The approach specified so far tags each form of an NE separately; in this case Dr. Kalam and Dr. Abdul Kalam are different forms of Abdul Kalam, and each is tagged as /PERSON. A small enhancement to the approach: if a set of NEs fall into the same category, occur in the same window of 150 words, and have a considerable amount of word overlap, then these NEs are linked to each other to indicate that they are different forms of the same NE. Hence, in this case we can identify that all of these are various forms of Dr. Abdul Kalam.

4.5.2 Edit Distances to overcome Variations in Spellings

Though this subsection is headed "variations in spellings", there are several ways of defining the variations:

(a) Different forms of a single word: many suffixes or prefixes can attach to a string, making it different from the original; yet such forms need to be indexed or scored together.
(b) Spell variations: the same word in Indian languages can be written with different spellings, which leads to the same word being scored differently.

The following enhancements to the primary algorithm of the last chapter overcome the challenges listed above to a large extent. The concept of edit distance is defined as follows:

Edit Distance: the edit distance between two strings is the minimum number of spelling changes required to transform one word into the other. It is a metric used to measure the amount of difference between two sequences. To normalize the edit distance, divide it by the length of the longer word.

NormalizedEditDistance(x, y) = EditDistance(x, y) / maxLength(x, y)    (4.11)

The overlap between two strings is the complement of normalized edit distance. That is,

Overlap(x, y) = 1 − NormalizedEditDistance(x, y)    (4.12)

This concept of Overlap(x, y) is used to enhance the primary algorithm of the last chapter, as explained below:

The concept of overlap is needed wherever terms are compared: during comparison of a word with the list of NEs generated from Wikipedia, and during the calculation of scores in the Feature Generation and Selection step, i.e., comparison of two terms in a document. In the first case, words with an overlap of more than 78% with an entry in the Wikipedia NE list are considered the same word and are grouped together. In the second case, if the overlap between two words in a document is more than 78%, the two words are considered a single word and are grouped together.

Hence, in the case of spell variations or different forms of a single word, all such words are grouped together, which helps overcome the specified challenge. Though a huge number of comparisons and calculations is incurred, computation time is not a constraint, as the whole process is offline. The values of the thresholds and overlap are determined experimentally.
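A sketch of Equations 4.11–4.12 with a standard Levenshtein implementation (the example words are transliterations I invented to illustrate the 78% cutoff):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def overlap(a, b):
    """Eqs. 4.11-4.12: complement of the length-normalized edit distance."""
    return 1 - edit_distance(a, b) / max(len(a), len(b))

# Two spell variants of the same word: overlap 1 - 1/9 ≈ 0.89, above the
# 78% cutoff, so the two forms would be grouped together.
print(round(overlap("kshetram", "kshetramu"), 2))  # 0.89
```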

For example, these are the kinds of words which group with each other:

4.5.3 Ambiguity in Tagging and Identifying NE’s

There can be ambiguity in NE categorization in either of the following ways:

(a) A word is correctly identified as an NE but wrongly categorized. For example, the word Washington in the sentence 'Washington is a beautiful place to visit' is identified as an NE, but if it is categorized as /PERSON when it should actually be /LOCATION, there is ambiguity in the tagging of NEs.
(b) A word is not identified as an NE when it should be. For example, the word Pen in the sentence 'Paul Pen is coming to visit our house tomorrow' is part of a person's name, not an object; if it is not categorized as an NE, there is ambiguity in the identification of NEs.

The resolution of this ambiguity in tagging and identifying NEs is implicit in the approach specified: as words are identified and tagged based on the remaining words in the window of 150 words, each NE is identified and tagged correctly.

Thus, the various challenges in recognizing and tagging NEs are overcome by the enhancements to the primary algorithm.

4.6 Evaluation

4.6.1 Dataset and Test set

A set of 1,800 Telugu URLs was crawled to a depth of 3 using Apache Nutch 1.0, which yielded 46,892 Telugu monolingual documents. These documents were preprocessed to remove noise and obtain clean documents. They were then divided into two parts in the ratio 2:1 and given as input to the two steps above: 31,262 documents for Feature Generation and Selection and 15,630 documents for Recognition and Tagging.

The monolingual documents used for Hindi are from the FIRE 2009 corpus. There are 33,435 documents, again divided in the ratio 2:1 between the two steps: 22,290 documents for Feature Generation and Selection and 11,145 for Recognition and Tagging.

12,000 Telugu and 9,000 Hindi documents were manually tagged, and those NEs were compared with the generated NEs. The results are calculated based on these observations.

4.6.2 Metrics

Precision, Recall and F-Measure are the evaluation metrics; they are defined as follows:

(a) Precision: P = c/r
(b) Recall: R = c/t
(c) F-Measure: F = 2PR / (P + R)
where c is the number of correctly retrieved (identified) NEs, r is the total number of NEs retrieved by the system being evaluated (correct plus incorrect), and t is the total number of NEs in the reference data.
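The three metrics can be sketched over sets of (entity, tag) pairs; the example entities echo the Kalam passage in Section 4.5.1, and the system output is invented:

```python
def evaluate(system_nes, gold_nes):
    """Precision, recall and F-measure over sets of (entity, tag) pairs."""
    correct = len(system_nes & gold_nes)          # c
    p = correct / len(system_nes) if system_nes else 0.0   # c / r
    r = correct / len(gold_nes) if gold_nes else 0.0       # c / t
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {("Rameswaram", "LOC"), ("Kalam", "PER"), ("MIT", "ORG")}
system = {("Rameswaram", "LOC"), ("Kalam", "PER"), ("Chennai", "PER")}
print(tuple(round(v, 2) for v in evaluate(system, gold)))  # (0.67, 0.67, 0.67)
```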

4.6.3 Baseline

For Telugu, the model is compared with the system developed by [Gali et al.2008], which uses CRF along with language-independent features such as prefixes, suffixes, the previous and next 3 tokens, and compound features, as well as some language-dependent features such as a POS tagger and chunk tagger. The accuracy of that system is around 45%. For Hindi, we compared our system with a Hindi NER system developed by LTRC (Language Technologies Research Centre)1, IIIT Hyderabad, which makes use of Conditional Random Fields (CRF) and achieves an F-Measure of 63%. Their system is reproduced on our dataset with 5-fold cross-validation, using spell variations, suffix patterns and POS tags as features. We use their system as our baseline throughout our experiments. There is no existing system for Marathi NER; hence, the Marathi results are not compared but simply reported.

4.7 Experiments and Results

4.7.1 Experiment 1: Variation of α values

Given equation 4.10

totalScore(t, c) = α1·MI(t, c) + α2·χ2(t, c) + α3·NormalizedFrequency(t, c) + α4·PMI(t1, t2 | c)    (4.13)

The features of a category are scored based on different parameters: the unigram parameters Mutual Information, the χ2 test and frequency-based feature selection, and the bigram parameter PMI.

To calculate the total score, each parameter is multiplied by a weight, and the results are summed to obtain the overall score of the given feature.

Figure 4.1 shows the variation in F-Measure scores when each parameter is considered. The weights of MI, chi-square, frequency-based selection and PMI are 0.41, 0.32, 0.20 and 0.07 respectively. It can be observed that the weight assigned to PMI is very small compared to the others: since the PMI values of bigrams are negligible compared to the values of the unigram parameters, the weight assigned to PMI is also very small. Among the three unigram parameters, MI is more important than chi-square, which is more important than frequency-based selection; the weights reflect this.

4.7.2 Experiment 2: Threshold for Feature Selection

In the Feature Generation and Selection step, each word in the document is considered along with its score. Only some of these words will be features of a category. The maximum score of words in each category is as follows:

1http://ltrc.iiit.ac.in

Figure 4.1 Variations in α values

From the above table it can be observed that the score for the category PERSON is more than that of the category LOCATION, which is more than the scores of the categories ORGANIZATION and Non-NE.

Figure 4.2 Number of features vs. F-Measure for different categories

The table below shows the list of features along with their scores. The features can be varied by changing the threshold from the maximum, which in turn changes the F-Measure.

Figure 4.2 shows the change in the number of features, and the resulting F-Measure, for every 10% increment of the threshold.

For example, consider the category PERSON. It can be observed from the graph that the F-Measure initially increases with every 10% increment of features, up to 81%. That is, the F-Measure increases with the number of features until the number of features reaches a certain point; up to this point, more features help increase the F-Measure. On further increasing the number of features, there is a chance of adding noise terms to each category; hence, the F-Measure starts to decrease beyond a certain point. Thus, 63% is the maximum threshold that can be used for the given category (PERSON). The same process is repeated for the remaining categories, and Figure 4.2 shows the variation of scores.

4.7.3 Experiment 3: Stabilizing the size of the window of words

In the above procedure, the document is divided into windows of words, and each window is considered for experimentation. Initially, the entire document was considered: all the words of every document were used, and features were generated and selected for each NE in the document. Generating features over the entire document had the following limitations:

(a) There will be more than one NE in a given document; hence, the assignment of the generated features to the NEs was ambiguous. Which feature sets should be assigned to which NE was always an unanswerable question, leading to low precision values.
(b) The presence or absence of each and every word in the document affects the generation of features. If the entire document is considered, it is really hard to notice the noise terms, which affects the accuracy of the system.

Hence, the document is divided into windows of words, and each window is treated as a document to generate the features for NEs. The window size starts at 250 words, and for every decrement of 50 words, the precision and recall are calculated. It has been observed that as the window shrinks to 150 words, the F-Measure scores increase; from 150 to 120 words the F-Measure scores are constant; and from about 75 words the F-Measure scores start decreasing again.

Refer to Figure 4.3. The above observation might be because a window of between 250 and 150 words speaks about various topics, or various aspects of a single topic; hence, the NEs identified are from relatively different domains, and the ambiguity of the NE-to-feature-set mapping persists.

As the window size decreases to 150 words, a window of 150 words speaks about a single topic, so there is no topic drift. The NEs in such a window are almost all of the same category, and hence mapping a feature set to an NE is quite simple.

Now, if we further decrease the window size to 75–80 words, or restrict it to one or two sentences, the presence of an NE in the window becomes questionable. Even when an NE does exist in a window of 80 words, although the words surrounding the NE are considered the most important, they cannot simply be taken as features, since other words that also contribute to the occurrence of the NE may not be present in the window. Thus, from the experimentation, a window of 150 words is set as ideal for the given conditions. The graph below shows the variation of F-Measure scores with the window size.

Figure 4.3 Variation of window size with F-Measure

Though it has been observed that decreasing the window size below 150 words leads to a decrease in F-Measure scores, the features are nevertheless scored by reducing the window size below 150. This is because the features are selected when the window size is 150, and the relative importance of the features is then calculated by reducing the window size.

4.7.4 Experiment 4: Threshold Vs F-Measure

The threshold on the sum of feature scores used to decide the presence of an NE in a given window varies with the F-Measure. This is explained in the 'Recognition and Tagging' step, and Figure 4.4 shows the variation of F-Measure with this threshold. A very low threshold means that a window is accepted as containing an NE even when only a few features occur in it, and the F-Measure values are correspondingly very low. As the threshold increases, the F-Measure increases. At a certain threshold value (62%) the F-Measure reaches its maximum, which means that at this threshold the scores of the reasonably important features are considered. On further increasing the threshold, the requirement on the features in a window keeps growing, which leads to saturation; hence, the F-Measure scores decrease. Thus, the threshold varies with the F-Measure.

Figure 4.4 Threshold Vs F-Measure
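The acceptance decision described above can be sketched as follows. This is an assumed form, not the thesis implementation; the feature words and scores are purely illustrative.

```python
# Sketch of the window-acceptance test (assumed form, not the thesis code):
# a window is accepted as containing an NE when the summed scores of the
# category features found in it meet or exceed a threshold.

def window_has_ne(window_words, feature_scores, threshold):
    """feature_scores: dict mapping feature word -> score (hypothetical values)."""
    total = sum(feature_scores.get(w, 0.0) for w in window_words)
    return total >= threshold

# Illustrative feature scores for some category, and a toy window.
scores = {"minister": 0.4, "elected": 0.3, "office": 0.2}
window = ["the", "minister", "was", "elected", "to", "office"]
print(window_has_ne(window, scores, threshold=0.62))   # -> True
```

With a very low threshold almost every window is accepted (low precision); with a very high one almost none is, which mirrors the rise-and-fall of the F-Measure in Figure 4.4.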

4.7.5 Experiment 5: Edit distance Vs F-Measure

The variation of edit distance also affects the F-Measure scores.

Figure 4.5 Edit distance Vs F-Measure

From Figure 4.5 it can be seen that the lower the edit distance, the higher the overlap between strings; only strings that overlap strongly are clustered together. The F-Measure increases linearly up to an edit distance of 0.22 (i.e., 78% overlap between words). Though the maximum F-Measure in the graph is about 0.8, to avoid over-fitting the overlap threshold is adjusted to 0.78. From an edit distance of 0.81 onwards there is a rapid decrease, which shows that clustering words with less than 20% overlap leads to unpredictable results.

Note: For all the above experiments, the values for Telugu and Hindi are similar and are hence represented in a single graph.

4.7.6 Experiment 6: Baseline Comparison

The parameters of feature selection and the enhancements to the primary algorithm are the major contributors to the encouraging precision values. The feature-selection parameters are important for obtaining accurate features: the words whose parameter values exceed a threshold are considered as features. Hence, the parameters play a major role.

The enhancements performed on the primary algorithm allow the approach to handle the challenges in recognizing NEs. Hence, the enhancements to the algorithm increase the accuracy of obtaining good features, which in turn increases the F-Measure scores of the system.

The table below shows the effect of each feature-selection parameter and each enhancement on the derived approach:

                  Precision   Recall   F-measure
No Features          55         71        62
MI                   61         75        67
χ2 test              66         79        72
Frequency            69         82        75
PMI                  70         84        76
Grouping             71         85        77
Edit distance        75         90        81.8
Baseline System      64.9       50.6      56.81

Table 4.3 Experiment on Hindi data

If MI is not considered, the score for the presence or absence of a feature is not taken into account; hence there is a difference of almost x% in the result, which shows the importance of MI. If chi-square is not considered, the dependency between the occurrence of a word and a particular category is not taken into account, which leads to a decrease of y%. Finally, if the frequency of features is not considered, there is a decrease of z%. From all these results it can be said that, relatively, MI contributes most to the F-Measure score, then chi-square, then the frequency of words. PMI contributes negligibly to the F-Measure, as can be observed above, because the scores for the co-occurrence of a word with another word are very small compared to the scores generated from unigrams.

                  Precision   Recall   F-measure
No Features          55         71        62
MI                   62         72        66.6
χ2 test              67         80        72.9
Frequency            69         81        74.4
PMI                  71         83        76.5
Grouping             72         84        77.5
Edit distance        74         91        81.6
Baseline System      64.09      34.57     44.91

Table 4.4 Experiment on Telugu data

Two enhancements are performed on the basic algorithm. The first is the grouping of categorized words occurring in the same window, quoted as "grouping" in the tables above. Grouping does not contribute much to the F-Measure scores, but it is an important enhancement that addresses the challenge of different forms of a given NE (Section 3). The other enhancement, which contributes considerably to the accuracy of the system, is edit distance. Edit distances improve the accuracy scores by nearly 5%: detecting the spelling variations between two words and obtaining the root of a given word increase the quality of the generated features, which in turn increases the F-Measure scores. Thus, the enhancements affect the accuracy of the system.

When compared with the baseline, the system outperforms it, achieving higher accuracies in a language-independent way, whereas the LTRC system used language-dependent tools to achieve its accuracies.

4.8 Discussions

From all the experiments conducted, it can be concluded that the selection of features, along with their scores, affects the F-Measure scores to a large extent. In experiment 1, the weights of the various feature-generation parameters are calculated; these weights show the relative importance of those parameters. In experiment 2, the number of features, i.e., the score threshold for feature selection, is calculated for every category. In experiment 3, the size of the window of words suitable for recognizing NEs is determined. In experiment 4, the threshold on the scores of the features in a window is calculated, and its variation with the F-Measure scores is studied. In experiment 5, the variation of edit distance with the F-Measure scores is studied. Finally, in experiment 6, the importance of each feature-selection parameter, i.e., how much the presence or absence of each parameter contributes to the F-Measure score, is calculated, along with the contribution of the enhancements. The final F-Measure scores outperform the baseline, which shows that the results are encouraging.

4.9 Conclusions

This chapter aims to recognize NEs using neither language-dependent tools, resources, nor manually annotated data, while avoiding the intervention of manual annotators or linguistic experts. The approach generates features for each language: these features are the words in a document that frequently co-occur with an NE within a certain distance. The features are then used for recognizing NEs in new documents. The challenges are addressed through further improvements to the basic algorithm, and these improvements keep the system language independent. Experiments are conducted at every step so that all the values, and the weights of all the values, are determined. Finally, the task of NER is performed with accuracies well above the baseline. This approach can be extended to any language, as it neither uses any language-specific tools nor depends on the grammatical structure of any language.

Chapter 5

Summary and Conclusions

NER is the task of recognizing proper nouns in a given text. Recognition of NEs can be done by framing grammar rules with the help of language experts. Though the accuracies of such rule-based systems are good, the rules need to be written and maintained by linguistic experts for each language, and the availability of such experts is a major concern. Another approach to NER is supervised machine learning, which uses features such as chunk tags, POS tags, gazetteer lists, etc., to build a model that is language independent. But these approaches need large volumes of training and testing data, which must either be manually tagged or produced with language-dependent tools; hence, the constraint of language dependency still persists. A further set of approaches uses external knowledge as an anchor to obtain the NEs. These are completely language independent, but the NEs are limited to those present in the dataset. Moreover, complete usage of the dataset is important, which many approaches do not achieve.

The approach specified in this thesis includes the creation of a list of NEs from an external knowledge source, Wikipedia, and the use of this list of NEs to generate a statistical model that recognizes NEs. The approach thus tries to overcome the disadvantages of all the approaches specified above. Moreover, it concentrates on the complete utilization of the Wikipedia structure, on accuracy (the generated system produced accuracies far better than the baseline), and on being completely language independent. Another aspect of the specified approach is that it disambiguates NEs and overcomes the challenges of a generic NER system.

Identification of NEs is performed on the Wikipedia dataset using the concept of term co-occurrences. Wikipedia is a collaborative, multilingual, free internet encyclopaedia supported by the non-profit Wikimedia Foundation. It is a comparable corpus with data available in 285 languages, which makes it a major source of multilingual information. Though the data is available in many languages, English forms the major part of Wikipedia: far more documents are available in English than in other languages (especially Indian languages). Hence, in this approach, the English Wikipedia is used as a support for the recognition of NEs in Indian languages. The approach clusters similar English Wikipedia documents based on the inherent category structure of Wikipedia or on the overlap of terms between the documents, producing clusters of documents that are highly similar to each other. For each cluster, the documents in other languages are then retrieved using the inter-language links of Wikipedia. Finally, a set of clusters is formed, where each cluster contains documents in different languages.

For each cluster, the structural aspects of Wikipedia, namely the Infobox, subtitles, and abstract, are considered separately. The Infobox has key-value pairs, where the keys are mapped between English and Indian languages using their co-occurrence and order of occurrence as heuristics. The Infobox values in English are then mapped to the corresponding values in the Indian language (if an English key is mapped to an Indian-language key, the values of those keys are said to correspond), and the tag of the English NE is replicated onto the corresponding word in the other language.

Identification of NEs from subtitles is done by clustering the data under similar subtitles and then calculating the co-occurrences between terms. Clustering of similar subtitles across languages uses co-occurrence, order of occurrence, and the occurrence of similar terms in the subtitles as heuristics. A dictionary generated from Wikipedia titles is used as an anchor for mapping similar subtitles across languages. The data under the mapped subtitles from different documents is aligned, and the co-occurrence of each English NE with each Indian-language word is calculated. Finally, the words with maximum co-occurrence are mapped to each other, and the English tag is replicated onto the Indian-language word. The abstract is treated similarly to the subtitles, and the same approach is followed.

Experiments were conducted in Telugu, Hindi and Marathi, where a set of Wikipedia documents was manually tagged and the results generated by the system were compared against them. The experiments reveal that the F-Measure scores for the different languages are similar; hence, the approach is stable and independent of language. Moreover, the experiment on exploiting the Wikipedia structure revealed that the more the structure is exploited, the higher the F-Measure. The experiments also include varying the lambda and beta values to determine their thresholds, and a comparison between similarity metrics.

Finally, the output of the system is a list of NEs tagged with good precision; the categories used for tagging throughout the thesis are PERSON, LOCATION and ORGANIZATION. The approach specified is simple yet effective and can be extended to any language, as no language-dependent tools or language experts are involved.

Recognition of NEs from documents is done by building a statistical model that generates a set of features; these features are then used for tagging NEs. The dataset used is a set of monolingual documents crawled from the web from a set of URLs to a depth of 3 for Telugu, while the FIRE corpus (a parallel corpus) is used in the case of Hindi. The dataset is divided in the ratio 2:1 between generating the features and tagging the NEs.

The set of monolingual documents and the list of NEs obtained from Wikipedia are used as input to generate the set of features for the categories. Each word of every document is considered and different scores are calculated for it; if a word's score exceeds a threshold, the word is considered a feature. To achieve this, each document is divided into windows of words. From each window, the NEs present in the list of Wikipedia NEs are extracted, and the categories of these NEs are stored. Then, for each of the remaining words, scores are generated for each of the parameters: MI, chi-square, frequency-based selection, and PMI.

MI for a word measures how much the presence or absence of the word affects the presence or absence of an NE. Chi-square measures the dependency of the word on a category: the higher the dependency, the higher the probability of the word occurring in that category. Frequency determines how often the term occurs in a category. PMI measures the dependency of one word given that another word has occurred in a given category. Based on all these parameters, the words scoring more than a threshold are considered features of the categories.
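Two of these scores can be sketched from window counts as below. These are the standard textbook forms of PMI and the chi-square statistic; the thesis may differ in details, and the counts used are purely illustrative.

```python
# Illustrative sketch of two of the scoring parameters (standard textbook
# formulas, assumed here; not necessarily the exact thesis formulation).
import math

def pmi(n_wc, n_w, n_c, n):
    """Pointwise mutual information between a word and a category.
    n_wc: windows containing both; n_w, n_c: marginal counts; n: total windows."""
    return math.log2((n_wc * n) / (n_w * n_c))

def chi_square(n_wc, n_w, n_c, n):
    """Chi-square statistic over the 2x2 word/category contingency table."""
    table = [[n_wc, n_w - n_wc],
             [n_c - n_wc, n - n_w - n_c + n_wc]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi += (table[i][j] - expected) ** 2 / expected
    return chi

# Toy counts: the word appears in 40 of 1000 windows, the category in 50,
# and they co-occur in 30 -- a strong association on both measures.
print(round(pmi(30, 40, 50, 1000), 2))
print(round(chi_square(30, 40, 50, 1000), 1))
```

High values on either measure indicate that the word and the category occur together far more often than chance, which is exactly the property the feature selection looks for.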

Recognition of NEs is done using the generated features. Given a new document, the words of the document that are features of some category are collected. If the sum of the scores of all these words exceeds a threshold, then the probability of an NE occurring is high. This is repeated while reducing the window size, as long as the probability of NE occurrence remains high and above the expected value. The category of the NE is then decided from the categories of the features. Hence, recognition and tagging of NEs is done.
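The category decision can be sketched as follows; this is an assumed reading of the step (the category whose features contribute the highest total score wins), with hypothetical feature words and scores.

```python
# Sketch of the tagging step (illustrative, not the thesis code): once a window
# is accepted as containing an NE, its category is taken to be the one whose
# features contribute the highest total score in the window.
from collections import defaultdict

def pick_category(window_words, features):
    """features: dict mapping word -> (category, score); hypothetical values."""
    totals = defaultdict(float)
    for w in window_words:
        if w in features:
            cat, score = features[w]
            totals[cat] += score
    # Return the highest-scoring category, or None if no features matched.
    return max(totals, key=totals.get) if totals else None

features = {"minister": ("PERSON", 0.4), "elected": ("PERSON", 0.3),
            "river": ("LOCATION", 0.5)}
print(pick_category(["the", "minister", "was", "elected"], features))  # -> PERSON
```

Returning `None` when no features match corresponds to rejecting the window in the earlier thresholding step.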

Many challenges need to be faced during the recognition of NEs, and certain enhancements to the specified approach are required to overcome them. To overcome the challenge of various forms of an NE, the NEs in a given window that are tagged with the same category and are similar in structure are grouped together. Another challenge is the ambiguity of NEs with respect to category, or of NEs acting as different parts of speech; since the approach tags an NE only according to the surrounding features present in the window, this challenge is surpassed. To face the further challenge of spelling variations or different forms of the same word (sharing the same root), edit distance is used. Edit distance measures the variation between two words; given two words, if the edit distance is low, they are grouped as a single word, which brings the different forms and spellings of the same word together.

Several experiments are conducted throughout the process. They include varying the weights in the feature-generation step to learn the relative importance of the scoring parameters, which shows that, in descending order of importance, the parameters are MI, chi-square, frequency and PMI. Another experiment, on the selection of features, sets the threshold above which a word can be considered a feature of a category. In the NE-recognition step, the threshold on the scores of the features is varied against the F-Measure scores and those values are reported. The experiment varying the window size against the F-Measure determines the appropriate size for the window of words.

Finally, the F-Measure values reveal that the system outperforms the baseline, and moreover the system is language independent. F-Measure scores of a language-independent system that are higher than or on par with those of the baseline system are quite encouraging.

The overall approach used in the system is efficient in terms of language independence, as throughout the approach neither language-dependent tools nor language experts are used. Moreover, the accuracies of the system are higher than or on par with those of the existing system. Several issues unique to this thesis are taken care of, such as the complete usage of Wikipedia and overcoming the challenges discussed. Thus, a language-independent NER system is developed.

Related Publications

(a) Mahathi Bhagavatula, Santosh GSK and Vasudeva Varma. Language Independent Identification of Named Entities using Wikipedia. In Proceedings of the workshop of the Association for Computational Linguistics, ACL 2012.
(b) Mahathi Bhagavatula, Santosh GSK and Vasudeva Varma. Exploiting Wikipedia for Named Entity Identification - A Language Independent Approach. In Proceedings of the workshop of CIKM, 2012.

Bibliography

[Bikel et al. 1999] Daniel M. Bikel, Richard Schwartz and Ralph M. Weischedel. 1999. An Algorithm that Learns What's in a Name. Machine Learning, volume 34.
[Cucerzan 2007] Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on EMNLP and CoNLL, pages 708–716.
[Gabrilovich and Markovitch 2007] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611.
[Saha et al. 2008] Sujan Kumar Saha, Partha Sarathi Ghosh, Sudeshna Sarkar, and Pabitra Mitra. 2008. Named Entity Recognition in Hindi using Maximum Entropy and Transliteration.
[Zhou and Su 2002] Guodong Zhou and Jian Su. 2002. Named Entity Recognition using an HMM-based Chunk Tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA.
[Gabrilovich and Markovitch 2006] Evgeniy Gabrilovich and Shaul Markovitch. 2006. Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. In Proceedings of the 21st National Conference on Artificial Intelligence, Volume 2, pages 1301–1306.
[Gabrilovich and Markovitch 2005] Evgeniy Gabrilovich and Shaul Markovitch. 2005. Feature generation for text categorization using world knowledge. In IJCAI 2005, pages 1048–1053.
[Kazama and Torisawa 2007] Jun'ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as External Knowledge for Named Entity Recognition. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 698–707.
[Milne et al. 2006] David Milne, Olena Medelyan and Ian H. Witten. 2006. Mining Domain-Specific Thesauri from Wikipedia: A Case Study. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pages 442–448.
[Toral and Munoz 2006] Antonio Toral and Rafael Munoz. 2006. A proposal to automatically build and maintain gazetteers for named entity recognition by using Wikipedia. In EACL 2006.
[Weale 2006] Timothy Weale. 2006. Utilizing Wikipedia Categories for Document Classification.
[Zesch et al. 2007] Torsten Zesch, Iryna Gurevych and Max Mühlhäuser. 2007. Analyzing and Accessing Wikipedia as a Lexical Semantic Resource. Biannual Conference of the Society for Computational Linguistics and Language Technology.
[Richman and Schone 2008] Alexander E. Richman and Patrick Schone. 2008. Mining Wiki Resources for Multilingual Named Entity Recognition. ACL 2008.
[Bunescu and Pasca 2006] Razvan Bunescu and Marius Pasca. 2006. Using Encyclopedic Knowledge for Named Entity Disambiguation. EACL 2006.
[Gali et al. 2008] Karthik Gali, Harshit Surana, Ashwini Vaidya, Praneeth Shishtla and Dipti M. Sharma. 2008. Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition. IJCNLP 2008.
[Shishtla et al. 2008] Praneeth M. Shishtla, Karthik Gali, Prasad Pingali and Vasudeva Varma. 2008. Experiments in Telugu NER: A Conditional Random Field Approach. In Proceedings of the Workshop on NER for South and South East Asian Languages, IJCNLP 2008, Hyderabad, India.
[Ekbal and Bandyopadhyay 2010] Asif Ekbal and Sivaji Bandyopadhyay. 2010. Named Entity Recognition using Support Vector Machine: A Language Independent Approach. International Journal of Electrical and Electronics Engineering, volume 4, no. 2.
[Sobhana et al. 2010] Sobhana N.V., Pabitra Mitra and S.K. Ghosh. 2010. Conditional Random Field Based Named Entity Recognition in Geological Text. International Journal of Computer Applications (0975-8887), volume 1, no. 3.
[Paliouras et al. 2010] Georgios Paliouras, Vangelis Karkaletsis, Georgios Petasis and Constantine D. Spyropoulos. 2010. Learning Decision Trees for Named-Entity Recognition and Classification. ECRAN (Extraction of Content: Research at Near-market) project.
[Bharadwaj and Tandon 2010] Rohit Bharadwaj G., Niket Tandon and Vasudeva Varma. 2010. An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages. ICON 2010.
[Bikel 1997] D. M. Bikel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing.
[Zhang et al. 2004] Li Zhang, Yue Pan and Tong Zhang. 2004. Focused Named Entity Recognition using Machine Learning. SIGIR 2004.
