Named Entity Recognition in Assamese: A Hybrid Approach

Padmaja Sharma, Department of CSE, Tezpur University, Assam, India 784028. Email: [email protected]
Utpal Sharma, Department of CSE, Tezpur University, Assam, India 784028. Email: [email protected]
Jugal Kalita, Department of CS, University of Colorado at Colorado Springs, Colorado, USA 80918. Email: [email protected]

Abstract—Most NER systems have been developed using one of two approaches, rule-based or machine-learning, each with its own strengths and weaknesses. In this paper, we propose a hybrid NER approach that combines rule-based and ML techniques to improve overall system performance for a resource-poor language like Assamese. The proposed hybrid approach recognizes four types of NEs: Person, Location, Organization and Miscellaneous. The empirical results indicate that the hybrid approach outperforms both the rule-based and the ML approach when each is applied independently. The hybrid Assamese NER obtains an F-measure of 85%-90%.

I. INTRODUCTION

The term Named Entity, which is used extensively in Natural Language Processing, was first introduced at the Sixth Message Understanding Conference [1], whose main goal was to identify entities that can be considered names in a set of documents and classify them into predefined categories. Tagging of Named Entities in text plays an important role in many NLP applications. In the Message Understanding Conferences (MUC) of the 1990s, it became clear that it is necessary to first identify certain classes of information in order to extract meaningful information from a given document. Later the conference established the Named Entity Recognition task, in which systems were asked to identify names, dates, times and numerical information. Thus Named Entity Recognition (NER) can be defined as the identification of proper nouns and their further classification into a set of classes such as person names, location names, organization names and miscellaneous names.

A few conventions for tagging Named Entities were established at the MUC conferences. These include ENAMEX tags for names (organization, person, location), NUMEX tags for numerical entities (monetary expressions, percentages) and TIMEX tags for temporal entities (time, date, year). For example, consider the sentence given below:

Mr. John visited U.S in July 2012.

Using an XML format, it can be marked up as follows:

<ENAMEX TYPE="PERSON">Mr. John</ENAMEX> visited <ENAMEX TYPE="LOCATION">U.S</ENAMEX> in <TIMEX TYPE="DATE">July 2012</TIMEX>.

Here, the markups show the named entities in the document.

NER has been applied in many applications such as Information Extraction, Question Answering and Event Extraction. Besides these, NER can also be applied in co-reference resolution, Web mining, molecular biology, bioinformatics, and medicine.

The rest of the paper is organized as follows. Section 2 describes the characteristics of Assamese and the challenges of NER in Indian languages. Approaches to NER are described in Section 3. Section 4 describes previous work on NER using hybrid approaches. Section 5 describes our work, and the last section concludes.

II. CHARACTERISTICS OF THE ASSAMESE LANGUAGE AND CHALLENGES OF NER

Assamese is a morphologically rich language, like other Indian languages. Although Assamese is an Indo-European language spoken by around 30 million people, very little computational linguistic work has been done for the language. It is written using the Assamese script, which consists of 11 vowels, 34 consonants and 10 digits. There are no uppercase or lowercase letters in the script. Assamese is also a relatively free word order language; for example, the sentence [E: I will go to play] can be written in any of several word orders. (In the examples that follow, E: gives the English gloss.)

The different types of ambiguities that occur in NER are as follows:

1) Person vs. location: In English, a word such as Washington or Cleveland can be the name of a person or of a location. Similarly, in Hindi, a word such as Kashi can be a person name as well as a location name.
2) Common noun vs. proper noun: Common nouns sometimes occur as person names. For example, Surya, which means sun, creates ambiguity between common noun and proper noun.
3) Organization vs. person name: Amulya may be the name of a person as well as that of an organization, creating ambiguity. An English example is Trump, which can be the name of a person as well as the name of a company or a brand.
4) Nested entities: Nested entities such as New York University also create ambiguity because they contain two or more proper nouns.

Such phenomena are abundant in Indian and other South Asian languages as well. These ambiguities in names can be categorized as structural ambiguity and semantic ambiguity.

A number of additional challenges need to be addressed in languages such as Hindi, Bengali, Assamese, Telugu, and Tamil. The key challenges are briefly described below. Although our examples are in specific languages, similar phenomena occur in all Indian languages, and in Assamese in particular.

• Lack of capitalization: Capitalization plays a major role in identifying NEs in English and some other European languages. However, Indian languages do not have the concept of capitalization.
• Ambiguity: In Indian languages, the problem of ambiguity between common nouns and proper nouns is more difficult since names of people are usually dictionary words, unlike Western names. For example, [akax] and [zun] mean sky and moon, respectively, in Assamese, but can also be person names. In fact, most people's names are dictionary words, used without capitalization.
• Nested entities: Indian languages also face the problem of nested entities. Consider the Assamese expression [nagaland bisHobidyaloi] [E: Nagaland University]. It creates a problem for NER in the sense that the word [nagaland] [E: Nagaland] refers to a location, whereas [bisHobidyaloi] [E: University] is a common noun, and thus [nagaland bisHobidyaloi] [E: Nagaland University] is an organization name. It therefore becomes difficult to assign the proper class.
• Agglutinative nature: Agglutination adds features to the root word to produce complex meanings. For example, in Assamese, [monipuR] [E: Manipur] refers to a location named entity, whereas [monipuRi] [E: Manipuri] is not a named entity, as it refers to the people who live in Manipur.
• Ambiguity in suffixes: Indian languages can have a number of postpositions attached to a root word to form a single word. In Assamese, the word [tEzpuR] [E: Tezpur] is a place name, but when the suffix [eeya] is attached, the resulting word has a different meaning from the original: the people of Tezpur.
• Resource constraints: NER approaches are either rule-based or machine-learning (ML) based. In either case, a good-sized corpus of the language under consideration is required. Such corpora of significant size are still lacking for most Indian languages. Basic resources such as part-of-speech (POS) taggers, good morphological analyzers, and name lists do not exist or are at the research stage for most Indian languages, whereas a number of resources are available for English.

III. DIFFERENT APPROACHES TO NER

Three broadly used approaches in NER are:
1) Rule-based,
2) Machine-learning based, and
3) Hybrid.

Rule-based NER focuses on the extraction of names using human-made rules. This approach lacks portability and robustness, and one needs a significant number of rules to maintain optimal performance, resulting in a high maintenance cost. There are several rule-based NER systems for English providing 88%-92% F-measure [2].

The main attraction of the machine learning (ML) approach is that it is trainable and can be adapted to different languages and domains. In addition, its maintenance cost is lower than that of the rule-based approach. The main goal of the ML approach is to identify proper names by employing statistical models that classify them. ML models can be broadly classified into three types:
1) Supervised,
2) Unsupervised, and
3) Semi-supervised.

1) Supervised: In supervised learning, the training data include both the input and the output. In this approach, the construction of proper training, validation and test sets is crucial. This method is usually fast and accurate, and as the program is taught with the right examples, it is "supervised". A large amount of training data is required for good performance of this model. Several supervised models used in NER are: the Hidden Markov Model (HMM) [2], [3], [4]; Conditional Random Fields (CRF) [5]; Support Vector Machines (SVM) [6]; and Maximum Entropy (ME) [7]. In addition, a variant of Brill's transformation-based rules [8] has been applied to the problem [9]. The HMM is widely used in NER due to the efficiency of the Viterbi algorithm [10], which is used to discover the most likely NE class state sequence.

2) Unsupervised: In the unsupervised learning method, the aim of the model is to build a representation from the data. It can be used to cluster the input data into classes on the basis of statistical properties. Unlike the rule-based approach, this approach is portable to different domains and languages. [11] discuss an unsupervised model for NE classification using unlabeled data. Work on NER using unsupervised models can also be found in [12] and [13].

3) Semi-supervised: The semi-supervised model makes use of both labeled and unlabeled data, usually resulting in high accuracy. Expertise is required to obtain labeled data, and the cost of labeling is high. Bootstrapping is a popular approach for this method. Work on NER using the semi-supervised approach can be found in [14], [15].
The gazetteer-based approach involves tagging NEs using look-up lists for location, person, and organization names.

A. Features used in Named Entity Recognition

Different types of contextual information, along with a variety of other features, are used to identify NEs. Prefixes and suffixes of words also play important roles in NER. The features used may be language-independent or language-dependent. Language-independent features used in NER include the following.

1) Context word features: Surrounding words, such as the previous and the next word of a particular word, serve as important features when finding NEs. For example, a word like [zila], [puR] or [paRa] indicates the presence of a location; these words are used to identify location names. Similarly, [ustad] [E: Expert], [kriRabid] [E: Sportsman] and [kobi] [E: Poet] denote that the next word is a person name.

2) NE information: The NE tags of the previous and the following words are important features in deciding the NE tag of the current word. For example, in [Ram oxomoloi gol] [E: Ram went to Assam], [Ram] is a person NE, which helps identify that the next word is likely to be an NE as well. Similarly, the Bengali sentence [Ram asame giyesil] [E: Ram went to Assam] can also help identify the person NE.

3) Digit features: Different types of digit features have been used in NER. These include whether the current token is a two-digit or four-digit number, or a combination of digits and periods, and so on. For example, [5 June 2011].

4) Organization suffix word list: Several known suffixes are used for organizations, and these help identify organization names. For example, a word like Ltd or Co is likely to be part of an organization's name. Similarly, Indian languages also have suffixes used for organization names, such as [got] [E: Group] and [soRkaR] [E: Government].

5) Length of words: Short words of fewer than three characters are usually not NEs. But there are exceptions, e.g., [Ram] [E: Ram], [sita] [E: Sita], [Ron] [E: Ron].

6) POS: Part-of-speech is an important feature in identifying NEs. For example, if two words in sequence are both verbs, the word before them is most likely to be a person name. Example: [komol douRi ahise] [E: Kamal came running]. Similarly, in Bengali we can say [komol kHeye gHumaise] [E: Kamal slept after eating].

Language-dependent features used in NER include the following.

1) Action verb list: Person names generally appear before action verbs. Examples of such action verbs in Assamese are [koisil] [E: told] and [goisil] [E: went], as in [kothatu rame koisil] [E: Ram told it] and [sihotor ghoRoloi hoRi goisil] [E: Hari went to their home]. But since Assamese is a free word order language, the first sentence can also be written as [rame kothatu koisil].

2) Word prefixes and suffixes: A fixed-length prefix or suffix of a word may be used as a feature. Many NEs share common prefix or suffix strings that help identify them. For example, in Assamese, [dada] [E: Older Brother] and [baidEu] [E: Older Sister] are used to identify person NEs. Similarly, in Bengali, [dada] [E: Older Brother] and [didi] [E: Older Sister] are used to identify person NEs.

3) Designation words: Words like Dr., Prof., etc., often indicate the position and occupation of named persons, serving as clues to detect person NEs. For example, in Assamese we can say [profEsor dAs] [E: Professor Das] and [montRi borai koi] [E: Minister Bora says].

IV. PREVIOUS WORK ON NER USING HYBRID APPROACHES

Some of the work on NER in Indian languages using hybrid approaches is briefly described below. [16] proposed NER for Punjabi using a hybrid approach in which rules are used with an HMM. [17] described a hybrid system that applied the Maximum Entropy model, language-specific rules and a gazetteer list to several Indian languages. [18] presented a combination of HMM and gazetteer methods for a tourism corpus. [19] discussed NER for Hindi using CRF, HMM and rule-based approaches. [20] used HMM and rule-based approaches for Kannada. [21] proposed a hybrid approach for Manipuri, combining CRF and rule-based approaches. The accuracies obtained by different authors for different languages using hybrid approaches are shown in Table I. We see that accuracy varies across languages. Differences in the datasets, the sizes of the training data, and the use of POS and morphological information, language-specific rules, and gazetteers are the main reasons for the low performance of some of the systems.

TABLE I
DIFFERENT WORK ON NER USING HYBRID APPROACHES

Reference | Language  | Approach                | F-measure(%)
[16]      | Punjabi   | HMM+Rule-based          | 74.56
[20]      | Kannada   | HMM+Rule-based          | 94.85
[19]      | Hindi     | CRF+ME+Rule-based       | 82.95
[17]      | Hindi     | ME+Rule-based           | 65.13
[17]      | Bengali   | ME+Rule-based           | 65.96
[17]      | Oriya     | ME+Rule-based           | 44.65
[17]      | Telugu    | ME+Rule-based           | 18.74
[17]      | Urdu      | ME+Rule-based           | 35.47
[21]      | Manipuri  | CRF+Rule-based          | 93.3
[22]      | Hindi     | HMM+Rule-based          | 94.61
[18]      | Hindi     | HMM+Gazetteer           | 98.37
[23]      | Hindi     | Rule-based+List-look-up | 96

V. OUR HYBRID APPROACH

A hybrid approach is one in which two or more approaches are combined to improve the performance of an NER system. To the best of our knowledge, there is no prior work on hybrid NER in Assamese. We develop a hybrid NER system that can extract four types of NEs. Each of the component approaches has its own strengths and weaknesses; here, we describe a hybrid architecture that produces better results than the rule-based approach or ML individually.

The processing goes through three main components: machine-learning, rule-based, and gazetteer-based. The machine-learning component involves two approaches, CRF and HMM; various NE features are used when implementing them. The rule-based component involves the rules that we have derived for different classes of NEs, and the gazetteer-based component involves look-up lists for location, person, and organization names.

We have also derived hand-coded rules to identify different classes of NEs.

1) Rules for organizations:
A: Find organization clue words based on the common organization gazetteer list.
B: If found, tag the word as the end word of an organization.
C: Search for middle clue words in the organization middle-words list.
D: If found, search for the previous word in the same gazetteer list; else
E: Search for the previous word in the surname gazetteer list.
F: If found, search for the previous word in the same gazetteer list; else
G: Search for the previous word in the title list.
H: If found, mark it as the beginning of the organization NE tag; else
I: Mark the next word as the beginning.

Here are examples of organization names recognized by the rules given above:
a) [belguRi utSotor madHyomik skul] [E: Belguri Higher Secondary School].
b) [oxom bisHobidyaloi] [E: Assam University].
c) [RadHa gobindo boRuwa kolez] [E: Radha Govinda Baruah College].
d) [sRi sRi soNkoRdeb madHyomik skul] [E: Sri Sri Shankardev Higher Secondary School].

An example of an organization name that is not recognized:
a) [oxom kaziRoNga bisHobidyaloi] [E: Assam Kaziranga University].

2) Rules for person names:
• Rules for multiword person names:
A: Find the surname based on the common surname gazetteer list.
B: If found, tag the word as the end word of a person NE.
C: Search for the previous word in the surname gazetteer list.
D: If found, search for the previous word in the same gazetteer list; else
E: Search for the previous word in the title list.
F: If found, mark it as the beginning of the person NE tag; else
G: Mark the next word as the beginning.

Here are examples of multiword person names recognized by the rules:
a) [sRi hiRen kumaR boRuwa] [E: Sri Hiren Kumar Baruah].
b) [Ram xorma deb] [E: Ram Sharma Deb].

• Rules for single word person names:
– If the previous and succeeding words are verbs, the current word is most likely to be a person name. Examples:
a) [bHat khai Ram kHeliboloi goise] [E: Ram went to play after eating rice]. Here [Ram] [E: Ram] is the NE; the previous word [khai] [E: eat] and the succeeding word [kHeliboloi] [E: play] are both verbs.
b) [kHai utHi Ram poRhi ase] [E: Ram is studying after eating]. Here [Ram] [E: Ram] is the NE; the previous word [utHi] [E: stand] and the succeeding word [poRhi] [E: read] are both verbs.
– If two words in sequence are both verbs, the word before them is most likely to be a person name. Example:
i) [komol douRi ahise] [E: Kamal came running]. Here [komol] [E: Kamal] is the NE, and [douRi ahise] [E: came running] are both verbs.
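The multiword person-name rule (steps A-G above) can be sketched as a small function. This is a simplified reading of the rule: the surname and title gazetteers are tiny stand-ins for the real lists, the tokens use the paper's transliterations, and the B-PER/I-PER tagging scheme is an assumption for illustration.

```python
# Simplified sketch of the multiword person-name rule (steps A-G):
# a surname from the gazetteer ends a person NE (A/B), preceding
# surnames extend it backwards (C/D), and a title word, if present,
# marks the beginning (E/F); otherwise the earliest matched word
# starts the NE (G). Gazetteers here are tiny illustrative stand-ins.

SURNAMES = {"boRuwa", "xorma", "deb", "dAs"}
TITLES = {"sRi", "profEsor", "montRi"}

def tag_multiword_persons(tokens):
    """Return one tag per token: B-PER/I-PER for person NEs, else O."""
    tags = ["O"] * len(tokens)
    i = len(tokens) - 1
    while i >= 0:
        if tokens[i] in SURNAMES:                    # A/B: surname ends a NE
            end = i
            j = i - 1
            while j >= 0 and tokens[j] in SURNAMES:  # C/D: preceding surnames
                j -= 1
            # E/F: a title marks the beginning; G: otherwise start at the
            # earliest matched word.
            start = j if j >= 0 and tokens[j] in TITLES else j + 1
            for k in range(start, end + 1):
                tags[k] = "B-PER" if k == start else "I-PER"
            i = start - 1
        else:
            i -= 1
    return tags

# [profEsor dAs ahise] [E: Professor Das came]
print(tag_multiword_persons(["profEsor", "dAs", "ahise"]))
# -> ['B-PER', 'I-PER', 'O']
```

Note that a given name between a title and a surname (as in [sRi hiRen kumaR boRuwa]) would need an extra bridging step not spelled out in steps A-G; the sketch deliberately stays with the literal rule.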
3) Rules for locations: If there exists a word like [nogoR] [E: town], [zila] [E: district], [sohoR] [E: city] or [paRot] [E: Lane], the previous word represents a location named entity. Examples: [kamRup zila] [E: Kamrup district], [sonitpuR zila] [E: Sonitpur district].

4) Rules for miscellaneous NEs:
– If the current word is a number and the next word represents a unit of measurement such as [kilo] [E: Kilo] or [gRam] [E: Gram], it represents a measurement NE. Example: [E: 1 Kilo].
– If the current word is a digit and the following word is a month name, it represents a date NE. Example: [E: 1 June].
– If the current word is a number and the next word is a month name followed by a digit, it represents a date NE. Example: [E: 5 June 2011].
– If the current word is a digit followed by a word like [bojat], [minit], [ghonta] or [sekend] [E: o'clock, minute, hour, second], it represents a time NE. Example: [E: 3 mins].
– If there exists a month name preceded by a digit, it represents a date NE. Example: [E: 6-7 June].
– If there exists a digit followed by the word [son] or [bosoR] [E: year], it represents a date NE. Examples: [E: In 1992], [E: 10 years].
– If there exists a digit followed by the word [sonR] [E: year], then a digit and a month name, it represents a date NE. Example: [E: 1980, 23 May].
– If a month name is followed by the word for month, it represents a month NE. Example: [E: May month].
– If a digit range is followed by a month name, it represents a date NE. Example: [E: 1-4 June].
– If a dot exists between consecutive letters, the token is most likely an organization NE. Example: [b.j.p].

TABLE II
SIZES OF GAZETTEER LISTS

List                     | Entries
Surnames                 | 6,000
Locations                | 12,000
Organization clue words  | 37
Organization middle words| 24
Location clue words      | 29
Pre-nominal words        | 120
Organization names       | 800
Person names             | 96,000

The sizes of the different gazetteer lists we prepared for our purpose are shown in Table II; the lists are available at http://www.tezu.ernet.in/nlp/.

We have observed that certain methods handle certain issues better than other models, and vice versa. Thus, we established a precedence among the methods, each applied to the output of another. Below are the steps used in our proposed hybrid model.

1) With a large amount of training data, ML approaches normally give better results than the other methods applied individually.
2) Apply ML to the raw test data.
3) The rules for multiword person names, organization names and location names give the best results when dealing with NEs that have clue words. Thus, these rules are applied to the output of the ML approach; this not only tags the left-out data but also overwrites the existing tagged data wherever applicable. This helps us effectively handle the errors encountered in implementing the HMM, i.e., the lack of word features and ambiguity in names.
4) Apply the gazetteer-based approach to the untagged data in the output of Step 2.
5) Apply the ML-based smoothing technique to the remaining left-out words.
6) Apply the rules for single word person names as the last step on the untagged data.

The overall architecture of the hybrid approach is shown in Fig. 1.

We conducted standard 3-fold cross-validation experiments. Cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (the training set) and validating the analysis on the other subset (the test set). In each fold, a learning model is created from the training data and evaluated on the test data. Out of 0.2 million wordforms, a set of 130K wordforms was manually tagged with four tags, namely person, location, organization and miscellaneous. This set is used as the training set for the NER system, and the remaining 70K wordforms are used as test data. Words unseen during the training phase are assigned the class 0. We present the precision, recall and F-measure for each of the 3-fold experiments in Tables III, IV and V, and the average result is shown in Table VI. We saw in Table I that the best results reported for hybrid systems were obtained for Kannada and Manipuri, with 94.85% and 93.3% accuracy; in general, the smaller the test corpus, the higher the accuracy.

TABLE III
NER RESULTS FOR SET 1 USING HYBRID APPROACH

Classes       | Precision | Recall | F-measure(%)
Person        | 87        | 86.1   | 86.4
Location      | 87        | 83.2   | 85.05
Organization  | 88.1      | 86     | 86.4
Miscellaneous | 90        | 88     | 88.8

Fig. 1. Hybrid NER Architecture.

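The 3-fold protocol described above can be sketched as follows; the data here are placeholder strings, not the actual tagged Assamese corpus.

```python
# Sketch of 3-fold cross-validation: the data are split into three
# complementary subsets, and each fold trains on two subsets and tests
# on the third. The data below are placeholders, not the real corpus.

def three_fold_splits(data, k=3):
    """Yield (train, test) pairs for k-fold cross-validation."""
    n = len(data)
    folds = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = [f"word{i}" for i in range(9)]
for train, test in three_fold_splits(data):
    print(len(train), len(test))  # each fold: 6 train, 3 test
```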
TABLE IV
NER RESULTS FOR SET 2 USING HYBRID APPROACH

Classes       | Precision | Recall | F-measure(%)
Person        | 87.2      | 85     | 86
Location      | 86        | 84     | 84.8
Organization  | 85.1      | 86     | 85.5
Miscellaneous | 88        | 87     | 87.4

TABLE V
NER RESULTS FOR SET 3 USING HYBRID APPROACH

Classes       | Precision | Recall | F-measure(%)
Person        | 86.1      | 84     | 85
Location      | 85        | 87     | 85.9
Organization  | 85.1      | 86     | 85.5
Miscellaneous | 89.1      | 87     | 88

TABLE VI
NER AVERAGE RESULTS FOR THE HYBRID APPROACH

Classes       | F-measure(%)
Person        | 85.8
Location      | 85.25
Organization  | 85.8
Miscellaneous | 88.06

VI. CONCLUSION

The proposed hybrid system is capable of recognizing four different types of named entities, namely Person, Location, Organization and Miscellaneous. We have considered a large dataset, and the system has gone through an extensive testing process. Handcrafted rules were also derived for different classes of NEs, and we found that handcrafted rules work well provided they are carefully prepared. The hand-coded rules result in an accuracy of 70%-75%. NER using a gazetteer list was also implemented, resulting in an accuracy of 75%-85%, and the pure ML-based approaches yield an accuracy of 75%-83%, whereas the hybrid approach achieves an accuracy of 85%-88%. The experimental results show that the hybrid approach outperforms both the pure rule-based and the pure ML approach.

REFERENCES

[1] R. Grishman and B. Sundheim, "Message Understanding Conference-6: A Brief History," in Proceedings of the 16th International Conference on Computational Linguistics (COLING), Copenhagen, Denmark, 1996, pp. 466-471.
[2] D. M. Bikel, S. Miller, R. Schwartz, and R. Weischedel, "A High-Performance Learning Name-finder," in Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, DC, USA, 1997, pp. 194-201.
[3] S. Miller, M. Crystal, H. Fox, L. Ramshaw, R. Schwartz, R. Stone, R. Weischedel, and the Annotation Group, "BBN: Description of the SIFT System as Used for MUC-7," in Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia, 1998, pp. 1-17.
[4] S. Yu, S. Bai, and P. Wu, "Description of the Kent Ridge Digital Labs System Used for MUC-7," in Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia, 1998.
[5] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001), Williams College, Williamstown, MA, USA, 2001, pp. 282-289.
[6] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[7] A. Borthwick, "A Maximum Entropy Approach to Named Entity Recognition," Ph.D. thesis, Computer Science Dept., New York University, 1999.
[8] E. Brill, "Transformation-based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging," Computational Linguistics, vol. 21, 1995.
[9] J. Aberdeen, J. Burger, D. Day, L. Hirschman, P. Robinson, and M. Vilain, "MITRE: Description of the Alembic System Used for MUC-6," in Proceedings of the 6th Conference on Message Understanding, Columbia, Maryland, 1995, pp. 141-155.
[10] A. Viterbi, "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm," IEEE Transactions on Information Theory, vol. 13, pp. 260-269, 1967.
[11] M. Collins and Y. Singer, "Unsupervised Models for Named Entity Classification," in Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, Stroudsburg, PA, USA, 1999, pp. 100-110.
[12] E. Alfonseca and S. Manandhar, "An Unsupervised Method for General Named Entity Recognition and Automated Concept Discovery," in Proceedings of the 1st International Conference on General WordNet, Mysore, India, 2002, pp. 34-43.
[13] Y. Shinyama and S. Sekine, "Named Entity Discovery Using Comparable News Articles," in Proceedings of the 20th International Conference on Computational Linguistics, Stroudsburg, PA, USA, 2004.
[14] R. Yangarber, W. Lin, and R. Grishman, "Unsupervised Learning of Generalized Names," in Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, 2002, pp. 1135-1141.
[15] S. Cucerzan and D. Yarowsky, "Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence," in Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, Stroudsburg, PA, USA, 1999, pp. 90-99.
[16] K. S. Bajwa and A. Kaur, "Hybrid Approach for Named Entity Recognition," International Journal of Computer Applications, vol. 118, pp. 36-41, 2015.
[17] S. K. Saha, S. Chatterji, S. Dandapat, S. Sarkar, and P. Mitra, "A Hybrid Approach for Named Entity Recognition in Indian Languages," in Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India, January 2008, pp. 17-24.
[18] N. Jahan and S. M. D. Chopra, "Named Entity Recognition in Indian Languages Using Gazetteer Method and Hidden Markov Model: A Hybrid Approach," International Journal of Computer Science & Engineering Technology (IJCSET), vol. 3, pp. 621-628, 2012.
[19] S. Srivastava, M. Sanglikar, and D. Kothari, "Named Entity Recognition System for Hindi Language: A Hybrid Approach," International Journal of Computational Linguistics (IJCL), vol. 2, pp. 10-23, 2011.
[20] S. Amarappa and S. V. Sathyanarayana, "A Hybrid Approach for Named Entity Recognition, Classification and Extraction (NERCE) in Kannada Documents," in Proceedings of the International Conference on Multimedia Processing, Communication and Information Technology (MPCIT), Shimoga, India, 2013, pp. 173-179.
[21] J. L. and D. Kaur, "Named Entity Recognition System in Manipuri: A Hybrid Approach," Lecture Notes in Computer Science, vol. 8105, pp. 104-110, 2013.
[22] D. Chopra, N. Jahan, and S. Morwal, "Hindi Named Entity Recognition by Aggregating Rule Based Heuristics and Hidden Markov Model," International Journal of Information Sciences and Techniques, vol. 2, pp. 43-52, 2012.
[23] Y. Kaur and E. Kaur, "Named Entity Recognition System for Hindi Language Using Combination of Rule Based Approach and List Look Up Approach," International Journal of Scientific Research and Management (IJSRM), vol. 3, pp. 2300-2306, 2015.