Instructions for ACL 2007 Proceedings
Total Page:16
File Type:pdf, Size:1020Kb
Bengali Named Entity Recognition using Support Vector Machine very crucial and pose a very big challenge to the Abstract NLP researchers. The level of ambiguity in named entity recognition (NER) makes it difficult to at- Named Entity Recognition aims to classify tain human performance. each word of a document into predefined NER has drawn more and more attention from target named entity classes and is nowa- the named entity (NE) tasks (Chinchor 95; Chin- days considered to be fundamental for chor 98) in Message Understanding Conferences many Natural Language Processing (NLP) (MUCs) [MUC6; MUC7]. The problem of correct tasks such as information retrieval, ma- identification of NEs is specifically addressed and chine translation, information extraction, benchmarked by the developers of Information question answering systems and others. Extraction System, such as the GATE system This paper reports about the development (Cunningham, 2001). NER also finds application of a Named Entity Recognition (NER) sys- in question-answering systems (Maldovan et al., tem for Bengali using Support Vector Ma- 2002) and machine translation (Babych and Hart- chine (SVM). Though this state of the art ley, 2003). machine learning method has been widely The current trend in NER is to use the machine- applied to NER in several well-studied lan- learning approach, which is more attractive in that guages, this is our first attempt to use this it is trainable and adoptable and the maintenance of method to Indian languages (ILs) and par- a machine-learning system is much cheaper than ticularly for Bengali. The system makes that of a rule-based one. The representative ma- use of the different contextual information chine-learning approaches used in NER are HMM of the words along with the variety of fea- (BBN’s IdentiFinder in (Bikel, 1999)), Maximum tures that are helpful in predicting the vari- Entropy (New York University’s MENE in ous named entity (NE) classes. A portion (Borthwick, 1999)), Decision Tree (New York of a partially NE tagged Bengali news cor- University’s system in (Sekine 1998) and Condi- pus, developed from the archive of a lead- tional Random Fields (CRFs) (Lafferty et al., ing Bengali newspaper available in the, 2001). Support Vector Machines (SVMs) based web has been used to develop the SVM- NER system was proposed by Yamada et al. based NER system. The training set con- (2001) for Japanese. His system is an extension of sists of approximately 150k words and has Kudo’s chunking system (Kudo and Matsumoto, been manually annotated with sixteen NE 2001) that gave the best performance at CoNLL- tags. Experimental results with the 10-fold 2000 shared tasks. The other SVM-based NER cross validation test have demonstrated the systems can be found in (Takeuchi and Collier, overall average Recall, Precision and F- 2002) and (Asahara and Matsumoto, 2003). Score of 94.3%, 89.4% and 91.8%, respec- Named Entity (NE) identification in Indian tively. It has been shown that this system languages in general and in Bengali in particular is outperforms other existing Bengali NER difficult and challenging. In English, the NE al- systems. ways appears with capitalized letter but there is no concept of capitalization in Bengali. There has 1 Introduction been very little work in the area of NER in Indian Named Entity Recognition (NER) is an important languages. In Indian languages particularly in tool in almost all Natural Language Processing Bengali, the work in NER can be found in (Ekbal (NLP) application areas such as information re- and Bandyopadhyay, 2007a; Ekbal and Bandyop- trieval, machine translation, question-answering adhyay, 2007b). These two systems are based on system, automatic summarization etc. Proper iden- the pattern directed shallow parsing approach. An tification and classification of named entities are HMM-based NER in Bengali can be found in (Ek- bal et al., 2007c). Other than Bengali, a CRF- 3 Named Entity Recognition in Bengali based Hindi NER system can be found in (Li and McCallum, 2004). Bengali is one of the widely used languages all over the world. It is the seventh popular language 2 Support Vector Machines in the world, second in India and the national lan- guage of Bangladesh. A partially NE tagged Ben- Suppose we have a set of training data for a two- gali news corpus (Ekbal and Bandyopadhyay, class problem: {(xy11, ),.....(xNN, y)} , where xi 2007d), developed from the archive of a widely εRD is a feature vector of the i-th sample in the read Bengali news paper available in the web, has training data and yiε{ been used in this work to identify and classify NEs. +1, -1} is the class to which xi belongs. The goal is The corpus contains around 34 million word forms to find a decision function that accurately predicts in ISCII (Indian Script Code for Information Inter- class y for an input vector x. A non-linear SVM change) and UTF-8 format. The location, reporter, classifier gives a decision function f(x)=sign(g(x)) agency and different date tags (date, ed, bd, day) in for an input vector where, the tagged corpus help to identify some of the loca- m tion, person, organization and miscellaneous gx()=+∑ wiK(x,zi) b names, respectively that appear in some fixed i=1 places of the newspaper. These tags cannot detect Here, f(x)=+1 means x is a member of a certain the NEs within the actual news body. The date in- class and f(x)=-1 means x is not a member. zi s are formation obtained from the news corpus provides called support vectors and are representatives of example of miscellaneous names. A portion of this training examples, m is the number of support vec- partially NE tagged corpus has been manually annotated tors. Therefore, the computational complexity of with the sixteen NE tags as described in Table 1. g(x) is proportional to m. Support vectors and 3.1 Named Entity Tagset other constants are determined by solving a certain quadratic programming problem. Kx(,zi)is a ker- An SVM based NER system has been developed in nel that implicitly maps vectors into a higher di- this work to identify NEs in Bengali and classify mensional space. Typical kernels use dot products: them into the predefined four major categories, namely, ‘Person name’, ‘Location name’, ‘Organi- Kx(,zi)= k(x.z). A polynomial kernel of degree d zation name’ and ‘Miscellaneous name’. In order d is given by Kx(,zi)=(1+x) . We can use various to properly denote the boundaries of the NEs and kernels, and the design of an appropriate kernel for to apply SVM in NER task, sixteen NE and one a particular application is an important research non-NE tags have been defined as shown in Table issue. 1. In the output, sixteen NE tags are replaced ap- We have developed our system using SVM propriately with the four major NE tags by some (Jochims, 1999) and (Valdimir, 1995), which per- simple heuristics. forms classification by constructing an N- dimensional hyperplane that optimally separates 3.2 Named Entity Feature Descriptions data into two categories. Our general NER system Feature selection plays a crucial role in Support includes two main phases: training and classifica- Vector Machine (SVM) framework. Experiments tion. Both the training and classification processes 1 were carried out to find out most suitable features were carried out by YamCha toolkit, an SVM for NE tagging task. The main features for the based tool for detecting classes in documents and NER task have been identified based on the differ- formulating the NER task as a sequential labeling ent possible combination of available word and tag problem. Here, the pairwise multi-class decision context. The features also include prefix and suffix method and second degree polynomial kernel func- 2 for all words. The term prefix/suffix is a sequence tion were used. We have used TinySVM-0.07 of first/last few characters of a word, which may classifier that seems to be the best optimized not be a linguistically meaningful prefix/suffix. among publicly available SVM toolkits. The use of prefix/suffix information works well for highly inflected languages like the Indian lan- 1http://chasen-org/~taku/software/yamcha/ 2http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM guages. In addition, various gazetteer lists have •Word suffix: Word suffix information is helpful been developed for use in the NER task. to identify NEs. This feature can be used in two different ways. The first and the naïve one is, a NE tag Meaning Example fixed length word suffix of the current and sur- PER Single-word `JôÝX [sachin] / PER rounding words might be treated as feature. The person name second and the more helpful approach is to modify LOC Single-word ^çV[ýYÇÌ[ý the feature as binary valued. Variable length suf- location name [jadavpur]/LOC fixes of a word can be matched with predefined ORG Single-word +XãZõç×aa [infosys] / lists of useful suffixes for different classes of NEs. organization ORG The different suffixes that may be particularly name helpful in detecting person (e.g., -[ýç[ýÇ [-babu], -Vç [- MISC Single-word 100%[100%]/ MISC miscellaneous da] , -×V [-di] etc.) and location names (e.g., - _îç³Qö [- name land], -YÇÌ[ý [-pur], -×_Ì^ç [-lia] etc.) have been consid- B-PER Beginning, In- `JôÝX Ì[sachin]/ B- ered also. Here, both types of suffixes have been I-PER ternal or the PER Ì[ýã]` [ramesh] / used. E-PER End of a multi- •Word prefix: Prefix information of a word is also word person I-PER åTö³QÇö_EõÌ[ý [ten- name dulkar] / E- PER helpful.