Lao Named Entity Recognition Based on Conditional Random Fields with Simple Heuristic Information
2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'15)
978-1-4673-7681-5 ©2015 IEEE

Mengjie YANG1,2, Lanjiang ZHOU1,2,*, Zhengtao YU1,2, Shengxiang GAO1,2, Jianyi GUO1,2
1 School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2 Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China
* Corresponding author: Lanjiang Zhou ([email protected])

Abstract: Based on the characteristics of Lao named entities, this paper proposes an approach to Lao Named Entity Recognition (NER) based on Conditional Random Fields (CRFs) with knowledge information. First, we segment the text into a word sequence and design three labels, BIO (footnote 1), for personal name and location name recognition. Second, named entity features of the Lao language, such as the clue word feature and the predicate feature, are selected for the CRF model, and candidate named entities are recognized. Third, we extract simple personal name and location name features of Lao to build heuristic information, and use the heuristic information to determine candidate named entities. Finally, named entities that the CRF model has not discovered are further recognized using a named entity word list, which yields the final named entities. The experimental results show that the proposed method is effective and that combining a machine learning method with heuristic information improves named entity recognition.

Keywords: Lao; Named Entity Recognition; Conditional Random Fields; Rules; Entity Feature

1. BIO: B marks the first word of an entity. I marks a word that is part of an entity but not its first word. O marks a word that does not belong to any entity.

I. INTRODUCTION

Named Entity Recognition (NER) is very important in many Natural Language Processing (NLP) tasks, such as Machine Translation (MT) [1], Cross-Language Information Retrieval (CLIR), Information Extraction (IE) [2], and parsing. At present, the level of informatization for Lao is low, and work on lexical, syntactic, and semantic analysis is rare, so Lao NER plays a very important role in promoting machine understanding and machine translation of the Lao language.

There are many studies on NER in languages such as English, Chinese, and Thai, but research on Lao is still very limited. Recognition methods fall into three classes. The first is the rule-based approach [3], which sets rules by studying a large domain-specific corpus, analyzing the characteristics of the language, and relying on the limited set of component elements and the stable formation patterns of its named entities. Recognition is then performed by methods such as rule matching. The second is the machine learning approach [4][5][6][7][8], which recognizes named entities by incorporating features into a machine learning model such as Conditional Random Fields (CRFs) and applying statistical learning. For example, E. Fersini et al. [9] describe how the discovery of semantic information can be viewed as an optimization problem: the task is to assign a sequence of labels to a set of interdependent variables, and the dependencies among variables are efficiently modeled through CRFs. E. Fersini and E. Messina [10] address the problem of extracting structured information from transcriptions generated automatically by an Automatic Speech Recognition (ASR) system by integrating CRFs with available background information. M. Chang, L. Ratinov et al. [11] use a constraint-based approach, which introduces constraints during the inference phase to preserve the necessary relationships over the output prediction. CRFs with constraints are able to capture more complex relationships among output variables when addressing NER problems. For example, Dan Roth [12] presents a new inference procedure based on integer linear programming (ILP) and extends CRF models to efficiently support general constraint structures; the method has achieved good results on semantic role labeling. The third is a hybrid approach that combines rules and machine learning [13][14][15]. Because of the limitations of the rule-based method and the machine learning method when used alone, many researchers now try to improve NER by combining the two, aiming to realize NER quickly by analyzing the features of the Lao language and applying machine learning.

Compared with English named entities, Lao and Chinese named entities are quite similar: neither language has special features such as capitalization to help identify named entities. Moreover, within a sentence there are no spaces to delimit words. The order of subject, predicate, and object is also the same, for example, ທ່ານ (Mr.) ຫຍູ (yu) ໄຊກິງ (Zaijing) ເປັນ (is) ນັກຂຽນ (writer). Lao also has its own characteristics, such as the postposed attributive, for example, ປະຊາຊົນ (People) ຈີນ (Chinese). If a personal name is a local Lao name, the first name comes first and the last name comes last, for example, ສກໃຈ (Soukjai) ລດຕະນະ (Ladtana); otherwise, the last name comes first and the first name comes last, for example, ໂຈ່ວ (Zhou) ເອີ່ນລາຍ (Enlai). In a Lao sentence, the adverbial generally comes last, for example, ຫມູ່ເພື່ອນ (friends) ຂອງຂ້າພະເຈົ້າ (my) ສຶກສາ (learn) ໃນ (in) ກຸງ (city) ປັກກິ່ງ (Beijing). A general Lao location name is preceded by a special distinguishing word, for example, ແຂວງ (Province) ຫຼວງພະບາງ (Luang Prabang). Lao personal names distinguish male from female: the name denotes a young man if “ທ້າວ (Tao)” is added in front of it, and a young woman if “ນາງ (Niang)” is added in front of it, for example, ທ້າວ (Tao) ຄາໍາແພງ (Kanpen), ນາງ (Niang) ມະນີ (Mani). Likewise, the name denotes a man if it is preceded by ທ່ານ (Mr.) and a woman if it is preceded by ນາງ (Mrs.), etc. [16][17]. This paper therefore draws on research results in NER for languages such as Chinese, combines the respective advantages of the rule-based and statistical methods, incorporates the inherent characteristics of Lao named entities, and adopts a method combining rules and statistics to study Lao NER.

The paper is organized as follows. Section 2 describes the Lao NER method based on CRFs. Section 3 introduces the conditional random field model. Section 4 presents the heuristic information. Section 5 describes the experiments and results. Section 6 concludes.

II. LAO NAMED ENTITY RECOGNITION METHOD BASED ON CONDITIONAL RANDOM FIELDS

Because the features of Lao personal names and location names are complex and specific, it is almost impossible, with limited resources, to develop rules that cover most entities if NER relies on the rule-based method alone. We therefore use a statistical method to recognize entity types: we apply the CRF learning algorithm and combine word form, part of speech, the internal structural features of Lao named entities, and rich context to obtain a personal name and location name recognition model by learning from a manually annotated corpus.

A. Definition of the generic feature template

The feature template needs to be defined manually when conditional random fields are adopted.

Word(2)             The second word to the right of the current word
Word(-1)            The first word to the left of the current word
Word(-2)            The second word to the left of the current word
POS(0)              Part of speech (POS) of the current word
POS(-1)             POS of the first word to the left of the current word
POS(-2)             POS of the second word to the left of the current word

TABLE I is the generic feature template for recognizing named entities in Lao. To enrich the description of context information, the template uses positions such as (-2, -1, 1, 2) around the current word. The generic feature template describes the current word and the morphology and part of speech of several context words, so it expresses only limited Lao context information. In many cases, simple morphological and part-of-speech features cannot fully describe complex language phenomena, so we need to find feature templates that are better suited to the inherent regularities of the language.

B. Definition of the composite template

Because a Conditional Random Field is a log-linear model, we can combine features of the generic template to form complex, nonlinear features. The composite template can capture long-range dependencies and rich context information. The composite template is defined in TABLE II.

TABLE II. COMPOSITE TEMPLATE

Template            Template description
Word(0)/POS(0)      The current word and its POS
Word(1)/POS(1)      The first word to the right of the current word and its POS
Word(2)/POS(2)      The second word to the right of the current word and its POS
Word(-1)/POS(-1)    The first word to the left of the current word and its POS
Word(-2)/POS(-2)    The second word to the left of the current word and its POS
Word(0)/POS(1)      The current word and the POS of the first word to its right
Word(0)/POS(2)      The current word and the POS of the second word to its right
Word(0)/POS(-1)     The current word and the POS of the first word to its left
Word(0)/POS(-2)     The current word and the POS of the second word to its left
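As an illustration of how the generic and composite templates could be turned into features for one token, the following is a minimal Python sketch. It is not the authors' implementation: the function name `extract_features`, the helper `at`, the padding symbol `<PAD>`, and the POS tags `N` and `NP` are all hypothetical, and the sketch applies every Word/POS feature at all five offsets (-2 to 2), which is an assumption about the full template. The paper does not name a CRF toolkit, so the features are returned as a plain dictionary.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag names used here are placeholders

PAD = "<PAD>"  # hypothetical symbol for positions outside the sentence


def at(sent: List[Token], i: int, field: int) -> str:
    """Return the word (field=0) or POS tag (field=1) at position i, or PAD."""
    return sent[i][field] if 0 <= i < len(sent) else PAD


def extract_features(sent: List[Token], i: int) -> Dict[str, str]:
    """Build generic and composite template features for the token at index i."""
    feats: Dict[str, str] = {}
    current_word = at(sent, i, 0)
    for off in (-2, -1, 0, 1, 2):
        word = at(sent, i + off, 0)
        pos = at(sent, i + off, 1)
        # Generic template: Word(off) and POS(off)
        feats[f"Word({off})"] = word
        feats[f"POS({off})"] = pos
        # Composite template: Word(off)/POS(off)
        feats[f"Word({off})/POS({off})"] = f"{word}/{pos}"
        # Composite template: Word(0)/POS(off) for the context positions
        if off != 0:
            feats[f"Word(0)/POS({off})"] = f"{current_word}/{pos}"
    return feats


# The location example from the introduction, with assumed POS tags.
sentence = [("ແຂວງ", "N"), ("ຫຼວງພະບາງ", "NP")]
features = extract_features(sentence, 1)
print(features["Word(-1)"], features["Word(0)/POS(-1)"])
```

Here the clue word ແຂວງ (Province) appears as `Word(-1)` for the token ຫຼວງພະບາງ (Luang Prabang), which is the kind of context a CRF can weight when assigning BIO labels.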