2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'15)

Lao Named Entity Recognition based on Conditional Random Fields with Simple Heuristic Information

Mengjie YANG (1,2), Lanjiang ZHOU (1,2,*), Zhengtao YU (1,2), Shengxiang GAO (1,2), Jianyi GUO (1,2)
(1) School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
(2) Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China

Abstract: Based on the characteristics of Lao named entities, this paper proposes an approach to Lao Named Entity Recognition (NER) that combines Conditional Random Fields (CRFs) with knowledge information. First, we segment the text into a word sequence and design BIO labels (see footnote 1) for personal name and location name recognition. Second, named entity features of the Lao language, such as the clue word feature and the predicate feature, are selected for the CRF model, and candidate named entities are recognized. Third, we extract simple personal name and location name features of Lao to build heuristic information, and use it to verify the candidate named entities. Finally, named entities that the CRF model has not discovered are recognized with a named entity word list, which yields the final named entities. The experimental results show that the proposed method is effective and that adding heuristic information improves named entity recognition.

Keywords: Lao; Named Entity Recognition; Conditional Random Fields; Rules; Entity Feature

I. INTRODUCTION

Named Entity Recognition (NER) is important in many Natural Language Processing (NLP) tasks, such as Machine Translation (MT) [1], Cross-Language Information Retrieval (CLIR), Information Extraction (IE) [2] and parsing. At present, the level of informatization for Lao is low, and work on lexical, syntactic and semantic analysis is rare, so Lao NER plays an important role in promoting machine understanding and machine translation of the Lao language.

NER has been studied extensively for many languages, especially English, Chinese and Thai, but research on Lao is still scarce. Recognition methods fall into three classes. The first is the rule-based approach [3]. It studies a large domain-specific corpus, analyzes the characteristics of the language, and sets rules according to the limited set of component elements and the stable formation patterns of its named entities. Recognition is then carried out by methods such as rule matching.

The second is the machine learning approach [4][5][6][7][8], which recognizes named entities by fusing features into a statistical learning model such as Conditional Random Fields (CRFs). For example, E. Fersini et al. [9] describe how the discovery of semantic information can be viewed as an optimization problem: the task is to assign a sequence of labels to a set of interdependent variables, and the dependencies among the variables are modeled efficiently with CRFs. E. Fersini and E. Messina [10] address the extraction of structured information from transcriptions generated by an Automatic Speech Recognition (ASR) system by integrating CRFs with available background information. M. Chang, L. Ratinov et al. [11] introduce constraints during the inference phase to preserve the necessary relationships over the output prediction. CRFs with constraints can capture more complex relationships among output variables when addressing NER problems. For example, D. Roth and W. Yih [12] present an inference procedure based on integer linear programming (ILP) and extend CRF models to support general constraint structures efficiently. The method performs well on semantic role labeling.

The third is a hybrid approach that combines rules with machine learning [13][14][15]. Because both the rule-based method and the machine learning method have limitations, many researchers now improve NER by combining the two. For Lao, this means recognizing named entities quickly by analyzing the features of the Lao language and applying machine learning.

Compared with English named entities, Lao and Chinese named entities are quite similar. Neither language has special features such as capitalization to help identify named entities, and neither uses spaces to delimit the words in a sentence. The order of subject, predicate and object is also the same, for example, ທ່ານ (Mr.) ຫຍູ (Yu) ໄຊກິງ (Zaijing) ເປັນ (is) ນັກຂຽນ (writer). Lao also has characteristics of its own. The attributive follows the word it modifies, for example, ປະຊາຊົນ (People) ຈີນ (Chinese). If the personal name is a native Lao name, the first name comes before the last name, for example, ສກໃຈ (Soukjai) ລດຕະນະ (Ladtana); otherwise the last name comes before the first name, for example, ໂຈ່ວ (Zhou) ເອີ່ນລາຍ (Enlai). In a Lao sentence, the adverbial generally comes last, for example,

ຫມູ່ເພື່ອນ (friends) ຂອງຂ້າພະເຈົ້າ (my) ສຶກສາ (learn) ໃນ (in) ກຸງ (city) ປັກກິ່ງ (Beijing). A Lao location name is generally preceded by a special word that identifies it, for example, ແຂວງ (Province) ຫຼວງພະບາງ (Luang Prabang). Lao personal names also mark gender. The name denotes a young man if “ທ້າວ (Tao)” is added in front of it, and a young woman if “ນາງ (Niang)” is added in front of it, for example, ທ້າວ (Tao) ຄາໍາແພງ (Kanpen) and ນາງ (Niang) ມະນີ (Mani). Likewise, the name denotes a man if it is preceded by ທ່ານ (Mr.) and a woman if it is preceded by ນາງ (Mrs.), etc. [16][17]. This paper therefore draws on the results of NER research for languages such as Chinese, combines the advantages of the rule-based method and the statistical method, fuses the inherent characteristics of Lao named entities, and studies Lao Named Entity Recognition (NER) with a combination of rules and statistics.

The paper is organized as follows. Section II describes the Lao Named Entity Recognition method based on Conditional Random Fields (CRFs). Section III introduces the conditional random field model. Section IV presents the heuristic information. Section V describes the experiments and results. Section VI concludes.

* Corresponding author: Lanjiang Zhou ([email protected])
1. BIO: B marks the first word of an entity, I marks a word that is part of an entity but not its first word, and O marks a word outside any entity.
978-1-4673-7681-5 ©2015 IEEE

II. LAO NAMED ENTITY RECOGNITION METHOD BASED ON CONDITIONAL RANDOM FIELDS

The features of Lao personal names and location names are complex and specific, so with limited resources it is almost impossible to write rules that cover most entities if recognition relies on the rule-based method alone. We therefore use a statistical method to recognize the entity types. We use the learning algorithm of Conditional Random Fields (CRFs), combined with the word form, the part of speech, the internal structural features of Lao named entities and rich context, to learn a personal name and location name recognition model from a manually annotated corpus.

A. Definition of the generic feature template

With conditional random fields, the feature template has to be defined by hand. The generic feature template is defined in TABLE I.

TABLE I. GENERIC FEATURE TEMPLATE

Template    Template description
Word(0)     The current word
Word(1)     The first word to the right of the current word
Word(2)     The second word to the right of the current word
Word(-1)    The first word to the left of the current word
Word(-2)    The second word to the left of the current word
POS(0)      Part of speech (POS) of the current word
POS(-1)     POS of the first word to the left of the current word
POS(-2)     POS of the second word to the left of the current word

TABLE I is the generic feature template for recognizing Lao named entities. To describe more of the context, the template uses the four positions -2, -1, 1 and 2 around the current word. The generic feature template describes the current word together with the word form and the part of speech of several context words, so it expresses only limited Lao context information. In many cases, simple word form and part-of-speech features cannot fully describe complex language phenomena, so we need feature templates that better fit the inherent regularities of the language.

B. Definition of the composite template

Because a Conditional Random Field is a log-linear model, we can combine the features of the generic template to form complex, nonlinear features. The composite template can capture long-range dependencies and rich context information. The composite template is defined in TABLE II.

TABLE II. COMPOSITE TEMPLATE

Template            Template description
Word(0)/POS(0)      The current word and its POS
Word(1)/POS(1)      The first word to the right of the current word and its POS
Word(2)/POS(2)      The second word to the right of the current word and its POS
Word(-1)/POS(-1)    The first word to the left of the current word and its POS
Word(-2)/POS(-2)    The second word to the left of the current word and its POS
Word(0)/POS(1)      The current word and the POS of the first word to its right
Word(0)/POS(2)      The current word and the POS of the second word to its right
Word(0)/POS(-1)     The current word and the POS of the first word to its left
Word(0)/POS(-2)     The current word and the POS of the second word to its left
Word(0)/Word(-1)    The current word and the first word to its left
Word(0)/Word(-2)    The current word and the second word to its left

C. Definition of the entity type relevant information feature template

Although the structure of Lao named entities is very complex, it also provides a wealth of information for Lao Named

Entity Recognition. In addition, common personal names and location names are used to build an entity glossary. TABLE III is the feature template of the entity type relevant information, which is built from the glossary of entity indicator (clue) words and the glossary of common entities.

TABLE III. THE ENTITY TYPE RELEVANT INFORMATION FEATURE TEMPLATE

Template                         Template description
Cur_Person                       Whether the current word is an indicator word for a personal name
Left_person                      Whether the two words to the left of the current word contain an indicator word for a personal name
Right_person                     Whether the two words to the right of the current word contain a verb (is/are)
Cur_location                     Whether the current word is an indicator word for a location name
Left_location                    Whether the two words to the left of the current word contain an indicator word for a location name
Cur_person_Com                   Whether the current word is a well-known personal name
Cur_location_Com                 Whether the current word is a well-known location name
Cur_person_Com/Cur_Person        Whether the current word is a well-known personal name, combined with whether it is an indicator word for a personal name
Cur_location_Com/Cur_location    Whether the current word is a well-known location name, combined with whether it is an indicator word for a location name

D. Labels for Lao Named Entity Recognition

For a given input sentence, we segment the words with the modified CRFPP segmentation tool (see footnote 2), and then design BIO labels for personal name and location name recognition:

PERSON NAME:
PNB: the first word of a personal name
PNI: a middle word of a personal name
O: a word outside any entity

LOCATION NAME:
LOCB: the first word of a location name
LOCI: a middle word of a location name
O: a word outside any entity

Personal name and location name recognition can therefore be viewed as a BIO label classification task.

III. CONDITIONAL RANDOM FIELD MODEL

A Conditional Random Field (CRF) [18] is an undirected graphical model that is trained to maximize the conditional probability of the expected output given the corresponding input. It has proved to be an effective approach to segmenting and labeling sequence data. Developed from Maximum Entropy Markov Models (MEMMs), CRFs inherit all of their advantages while eliminating their main limitation, the label bias problem.

The paper takes $X = x_1, x_2, \ldots, x_N$ as the sequence of words in the Lao corpus with their parts of speech, and $Y = y_1, y_2, \ldots, y_N$ as the entity label sequence to be predicted. Given $X$, the conditional probability of $Y$ is calculated by formula (1):

$$P(Y \mid X) = \frac{1}{Z_X} \exp\Big(\sum_{j=1}^{N} \sum_{i} \lambda_i f_i(y_{j-1}, y_j, x, j)\Big) \tag{1}$$

where

$$Z_X = \sum_{Y'} \exp\Big(\sum_{j=1}^{N} \sum_{i} \lambda_i f_i(y'_{j-1}, y'_j, x, j)\Big)$$

is the normalizing factor summed over all candidate label sequences $Y'$, $f_i(y_{j-1}, y_j, x, j)$ is a feature function evaluated at position $j$, and $\lambda_i$ is the weight of that feature function. For example, when $x_j$ is the word “ມະນີ (Mani)”, $x_{j-1}$ is the word “ນາງ (Niang)”, the label $y_j$ marks a personal name and the label $y_{j-1}$ marks the left boundary word, the feature function equals 1. The specific feature templates are integrated into the model in this way, and the remaining templates are handled likewise.

The weights $\lambda_i$ are estimated by training the model. In general, they are trained by Maximum Likelihood Estimation, with the likelihood function given by formula (2):

$$L = \sum_{i=1}^{M} \log P(Y_i \mid X_i) - \sum_{k} \frac{\lambda_k^2}{2\sigma^2} \tag{2}$$

Here the first sum runs over the $M$ training sequences and $k$ indexes the feature weights. The second term on the right-hand side is the Gaussian prior, where $\sigma^2$ is the prior variance. We adopt L-BFGS (Limited Memory BFGS) [19] to find the optimum of the likelihood function. Once the weights are obtained by training, the CRF model is built. The test corpus is then fed into the model, and the most probable label sequence is obtained with the Viterbi algorithm:

$$Y^* = \arg\max_{Y} P(Y \mid X) \tag{3}$$

The labels in $Y^*$ mark the personal names and the location names.

IV. HEURISTIC INFORMATION

The potential Lao named entities are tagged using the word form and the part of speech of Lao named entities. However, some Lao named entities are still tagged wrongly or not tagged at all. We correct these errors with the following features and with heuristic information built from them.
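The feature templates of Section II (TABLE I to TABLE III) can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the padding token, the POS tags and the small title list are assumptions made for this example, and only two of the TABLE III features are shown.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical placeholder for positions outside the sentence

# Hypothetical clue-word list; the three titles are taken from the paper's examples.
PERSON_CLUES = {"ທ່ານ", "ທ້າວ", "ນາງ"}


def token_at(seq: List[str], idx: int) -> str:
    """Return seq[idx], or the padding token when idx is out of range."""
    return seq[idx] if 0 <= idx < len(seq) else PAD


def extract_features(words: List[str], pos_tags: List[str], j: int) -> Dict[str, object]:
    """Build the template features for the word at position j."""
    feats: Dict[str, object] = {}
    # TABLE I: single words and single POS tags around the current word.
    for off in (0, 1, 2, -1, -2):
        feats[f"Word({off})"] = token_at(words, j + off)
    for off in (0, -1, -2):
        feats[f"POS({off})"] = token_at(pos_tags, j + off)
    # TABLE II: composite features that pair words and POS tags.
    for off in (0, 1, 2, -1, -2):
        feats[f"Word({off})/POS({off})"] = (
            f"{token_at(words, j + off)}/{token_at(pos_tags, j + off)}"
        )
    for off in (1, 2, -1, -2):
        feats[f"Word(0)/POS({off})"] = f"{words[j]}/{token_at(pos_tags, j + off)}"
    for off in (-1, -2):
        feats[f"Word(0)/Word({off})"] = f"{words[j]}/{token_at(words, j + off)}"
    # TABLE III (subset): clue-word indicators for personal names.
    feats["Cur_Person"] = words[j] in PERSON_CLUES
    feats["Left_person"] = any(
        token_at(words, j + off) in PERSON_CLUES for off in (-1, -2)
    )
    return feats


# Example sentence from the paper; the POS tags are invented for illustration.
words = ["ທ່ານ", "ຫຍູ", "ໄຊກິງ", "ເປັນ", "ນັກຂຽນ"]
pos_tags = ["title", "np", "np", "v", "n"]
features = extract_features(words, pos_tags, 1)
```

Each feature is a plain string or boolean, so the dictionary can be passed to any CRF toolkit that accepts per-token feature maps. Padding out-of-range positions keeps the feature set the same size at sentence boundaries.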

2. CRFPP: a segmentation tool.
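Formulas (1) and (3) of Section III can be illustrated on a toy example. The sketch below is only an illustration under stated assumptions: the weights, the start symbol and the reduced label set are hypothetical, each weight stands for one indicator feature of the form f_i(y_{j-1}, y_j, x, j), and the normalizing factor is computed by brute-force enumeration, which is feasible only for very short sentences.

```python
import itertools
import math

LABELS = ["O", "PNB", "PNI"]
START = "<S>"  # hypothetical label preceding the first word

# Hypothetical weights (lambda_i) for indicator features; not trained values.
EMIT = {("ນາງ", "O"): 1.0, ("ມະນີ", "PNB"): 2.0, ("ເປັນ", "O"): 1.5}
TRANS = {("O", "PNB"): 1.0, ("PNB", "PNI"): 0.5,
         ("O", "PNI"): -3.0, (START, "PNI"): -3.0}


def local_score(prev, cur, x, j):
    """Sum of lambda_i * f_i(prev, cur, x, j) over the active features."""
    return EMIT.get((x[j], cur), 0.0) + TRANS.get((prev, cur), 0.0)


def sequence_score(y, x):
    """Unnormalized log-score of label sequence y, summed over positions j."""
    total, prev = 0.0, START
    for j, cur in enumerate(y):
        total += local_score(prev, cur, x, j)
        prev = cur
    return total


def probability(y, x):
    """P(Y|X) as in formula (1); Z_X is enumerated over all label sequences."""
    z = sum(math.exp(sequence_score(c, x))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(sequence_score(y, x)) / z


def viterbi(x):
    """Most probable label sequence, as in formula (3), by dynamic programming."""
    best = {lab: (local_score(START, lab, x, 0), [lab]) for lab in LABELS}
    for j in range(1, len(x)):
        new = {}
        for cur in LABELS:
            score, path = max(
                ((best[p][0] + local_score(p, cur, x, j), best[p][1])
                 for p in LABELS),
                key=lambda t: t[0],
            )
            new[cur] = (score, path + [cur])
        best = new
    return max(best.values(), key=lambda t: t[0])[1]


sentence = ["ນາງ", "ມະນີ", "ເປັນ"]
decoded = viterbi(sentence)
```

Because the normalizing factor does not depend on the label sequence, Viterbi decoding only needs the unnormalized scores. On this toy sentence it selects the same sequence as exhaustive enumeration.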

A. Features

1) Clue feature

a) Regular grammar for personal names: Lao personal names usually start with a title that can be used as a clue. In general, a native Lao personal name is composed of the title, one space, the first name, one space and [the last name], where the last name is optional. The structure is shown in Figure 1. Examples are ທາວ (Mr.) ສກໃຈ (Soukjai) ລດຕະນະ (Ladtana) and ນາງ (Mrs.) ສແນດຕາ (Sanedta) ພະວງສາ (PhavongSa).

[Title] [First Name] [Last Name]

Figure 1. Regular grammar for personal name written in Lao

b) Regular grammar for location names: Location expressions such as institute, company, office, district, city, village, town, province and country also have a title that can be used as a clue. Generally, a location expression is composed of the title, one space and the location name. The structure is shown in Figure 2. Examples are ນກຸງ (area) ປັກກິ່ງ (Beijing) and ບານ (Village) ສສງວອນ (SI-Sung Vone).

[Title] [Location Name]

Figure 2. Regular grammar for location name written in Lao

2) Predicate feature

In Lao, the personal name is generally the subject of the sentence, and the word following the personal name is a verb. Examples are shown in Figure 3.

[ທ່ານ] [ຫຍູ ໄຊກິງ] [ເປັນ][ນັກຂຽນ].
[Mr.] [Yu Zaijing] [is] [writer].

[ປະທານປະເທດ] [ບີ.ສີຈັນທີ ອາດິດຈານ] [ໄປຢ້ຽມຢາມ][ສປປລາວ].
[The President][B.Sivanthi Adityan] [visit to] [Lao].

Figure 3. Predicate feature for Named Entities in Lao language

3) Blank feature

Lao has no explicit boundary indicator between words. However, a blank is often used as a separator between words and phrases, especially between consecutive named entities. An example is shown in Figure 4.

[ບັນດາປະເທດ][ອາຊີເວັນອອກສ່ຽງ][ໃຕ້ເຊັ່ນ]: [ປະເທດ][ໄທ] [ລາວ] [ມຽນມາ] ແລະອື່ນໆ
[countries] [Asian], [such as]: [country][Thai],[Lao],[Myanmar] etc.

Figure 4. Blank Feature in Lao language

B. Heuristic information for correction of Named Entities

1) Correction of NE errors: For the candidate named entities produced by the Conditional Random Fields (CRFs) model, the word following a clue can be used as the start of a Lao named entity, and the entity is extended over its predefined named entity combinations by calculating the boundary probability:

$$P_{\mathrm{boundary}}(W_0) = P_{\mathrm{prior\,NE}}(W_{-1}) \times P_{\mathrm{in\,NE}}(W_0) \times P_{\mathrm{end\,NE}}(W_0) \times P_{\mathrm{after\,NE}}(W_1) \tag{4}$$

where $P_{\mathrm{prior\,NE}}(W_{-1})$ is the probability, estimated from the training corpus, that $W_{-1}$ appears before a named entity; $P_{\mathrm{in\,NE}}(W_0)$ is the probability that $W_0$ is a constituent of a named entity; $P_{\mathrm{end\,NE}}(W_0)$ is the probability that $W_0$ is the last word of a named entity; and $P_{\mathrm{after\,NE}}(W_1)$ is the probability that $W_1$ appears after a named entity. The boundary extension terminates if the following word is a member of a predefined stop word list. We set a named entity boundary probability value $P$, and the named entity tag is accepted if $P_{\mathrm{boundary}}(W_0) \ge P$.

2) Correction of Named Entities without clue: Although personal names and location names are corrected with the information above, some named entities that have no clue are still missed. They can be recognized by matching against the rest of the document.

To correct these errors, we incorporate domain knowledge. Because the number of named entities is finite, we build a named entity word list that stores the proper named entities recognized and corrected in step 1). The post-processing has two steps. The first step looks up the named entity in the word list. If the named entity is in the list, we mark it. If it is not in the list and contains a conjunction such as “ແລະ/ ກໍ່ຄື (and)” or “ກັບ (with)”, the second step splits the named entity into two parts and looks up each part in the word list. If a part is in the list, we mark that part. In all other cases, the named entity or the part is removed.

V. EXPERIMENTS AND RESULTS

The experimental corpus was collected from Lao news websites and tagged with personal names and location names by Lao students and teachers. In particular, we adjusted the segmentation granularity of the corpus after it had been preprocessed by the segmentation procedure. The resulting words depend on our segmentation lexicon. For example, the string “ສະເຫຼີມໄຊ(Sharon) ກົມມະສິດ(Race)/ n” is segmented into “ທ່ານສະເຫຼີມໄຊ (Sharon)/PNB ກົມມະສິດ (Race)/PNI”, and the string “ກະຊວງ (Ministry) ການ (of) ຕ່າງປະ (Foreign)ເທດ (Affairs) ແຫ່ງ (the)ສປປ (country)ລາວ (Lao)/ n” is segmented into

“ກະຊວງການຕ່າງປະເທດ (Ministry of Foreign Affairs)/LOCB ແຫ່ງ (the)/LOCI ສປປ (country)/LOCI ລາວ (Lao)/LOCI”.

The experimental results are evaluated with three metrics.

• Precision (P): the number of correct named entities in the answer file over the total number of named entities in the answer file.
P = the right count / the model count × 100%

• Recall (R): the number of correct named entities in the answer file over the total number of named entities in the corpus.
R = the right count / the corpus count × 100%

• F-measure (F): a weighted combination of the precision and the recall.

$$F\text{-}Measure = \frac{2 \times P \times R}{P + R} \times 100\%$$

The experimental results using Conditional Random Fields (CRFs) alone are shown in TABLE IV.

TABLE IV. EXPERIMENTAL RESULT IN ONLY CRFS

Method    Named Entity     Precision (P) (%)    Recall (R) (%)    F-measure (F) (%)
CRFs      Personal Name    81.62                83.27             82.44
CRFs      Location Name    72.53                76.24             74.34

By combining Conditional Random Fields (CRFs) with the two steps 1) and 2) of Section IV (B), we obtain different results for different settings of the maximum probability value. The results are shown in TABLE V.

TABLE V. CRFS AND HEURISTIC INFORMATION

Method                          Probability value    Named Entity     Precision (P) (%)    Recall (R) (%)    F-measure (F) (%)
CRFs + Heuristic Information    0.40                 Personal Name    81.71                83.59             82.64
CRFs + Heuristic Information    0.40                 Location Name    75.60                76.41             76.00
CRFs + Heuristic Information    0.50                 Personal Name    82.36                84.08             83.21
CRFs + Heuristic Information    0.50                 Location Name    76.26                77.03             76.64
CRFs + Heuristic Information    0.60                 Personal Name    83.73                85.23             84.47
CRFs + Heuristic Information    0.60                 Location Name    77.61                79.56             78.57
CRFs + Heuristic Information    0.70                 Personal Name    84.52                79.19             81.77
CRFs + Heuristic Information    0.70                 Location Name    78.24                68.54             73.07
CRFs + Heuristic Information    0.80                 Personal Name    85.01                78.73             81.75
CRFs + Heuristic Information    0.80                 Location Name    78.44                63.91             70.43

As TABLE V shows, incorporating heuristic information about the Lao language gives higher NER performance, with the best results when the maximum probability value is set to 0.60. The heuristic correction has little effect if the value is too small, and too many named entities are filtered out if it is too large, so the experiment starts at 0.40. When the probability value is too high, recall drops sharply while precision changes little. With the value set to 0.60, the rule-based correction improves both precision and recall over CRFs alone: the F-measure rises from 82.44% to 84.47% for personal names and from 74.34% to 78.57% for location names.

VI. CONCLUSION

In this work, we propose a method of Lao Named Entity Recognition (NER) based on Conditional Random Fields (CRFs) with knowledge information, and address the boundary problem of Lao named entities by using heuristic information about the Lao language. While the experimental results are acceptable, some problems remain. Lao sentences are sequences of words formed from a stream of characters, so the recognition result depends on the correctness of word segmentation. If the characters are grouped differently, the meaning of the word changes too. For example, the string “ປາກກາ (pen)” has two possible segmentations: “ປາກກາ (pen)” and “ປາກ-ກາ (mouth-crow)”. In future work, we will consider recognizing Lao named entities without word segmentation or part-of-speech information, or improving the precision and recall of Lao Named Entity Recognition with other methods and more information.

ACKNOWLEDGMENT

This paper is supported by the National Nature Science Foundation (No. 61472168, 61262041, 61163004), the key project of the Yunnan Nature Science Foundation (2013FA030), and the Science and Technology Innovation Talents Fund projects of the Ministry of Science and Technology (No. 2014HE001). Corresponding Author: Lanjiang Zhou ([email protected]).

REFERENCES

[1] G-D Zhou, J Su. "Product Named Entity Recognition Using Conditional Random Fields". Business Intelligence and Financial Engineering (BIFE), 2011 Fourth International Conference on, Wuhan, 2011: 86-89.
[2] Krupka, G.R., Hausman, K. "IsoQuest Inc: Description of the NetOwl Text Extraction System as used for MUC-7". In Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia, 1998.
[3] Godeny, B. "Rule Based Product Name Recognition and Disambiguation". Workshops (ICDMW), 2012 IEEE 12th International Conference on, Brussels, 2012: 858-860.
[4] Su-Xiang Zhang, Guo-Yang Gao, Yin-Cheng Qi, and Y. Wilks. "Personal name and location name recognition based on conditional random fields (CRFs)". Proceedings of the Eighth International Conference on Machine Learning and Cybernetics, Baoding, 2009: 12-15.
[5] Chao-sheng Zhang, Jian-yi Guo, Yan-tuan Xian, Zheng-tao Yu, Chun-ya Lei, and Hai-xiong Wang. "Named Entity Recognition of the Products with English Based on Conditional Random Fields". Computer Engineering & Science, vol.32(6), 2010: 115-117.
[6] O. Bender, F-J. Och, and H. Ney. "Maximum Entropy Models for Named Entity Recognition". In 7th Conference on Computational

Natural Language Learning (CoNLL-2003), Edmonton, Canada, 2003: 148-151.
[7] Hui Ning, Hua Yang, Ya-zhou Tan, and Hao Wu. "A method of Chinese named entity recognition based on maximum entropy model". Mechatronics and Automation, 2009. ICMA 2009. International Conference on, 2009: 2472-2477.
[8] Su-Xiang Zhang, Guo-Yang Gao, and Yin-Cheng Qi. "Personal name and location name recognition based on conditional random fields". Machine Learning and Cybernetics, 2009 International Conference on, 2009: 2255-2260.
[9] E. Fersini, E. Messina, G. Felici, D. Roth. "Soft-constrained inference for Named Entity Recognition". Information Processing and Management, vol.50(5), 2014: 807-819.
[10] E. Fersini, E. Messina. "Named entities in judicial transcriptions: Extended conditional random fields". Computational Linguistics and Intelligent Text Processing, Samos, Greece, March 24-30, 2013: 317-328.
[11] M. Chang, L. Ratinov, D. Roth. "Structured Learning with Constrained Conditional Models". Machine Learning, vol.88, 2012: 399-431.
[12] Dan Roth, Wen-tau Yih. "Integer Linear Programming Inference for Conditional Random Fields". International Conference on Machine Learning (ICML), 2005: 736-743.
[13] A. Mikheev, M. Moens, and C. Grover. "Named Entity Recognition without Gazetteers". In 9th European Chapter of the Association of Computational Linguistics (EACL), Bergen, Norway, 1999: 1-8.
[14] Choong-Nyoung Seon, Youngjoong Ko, Jeong-Seok Kim, and Jungyun Seo. "Named Entity Recognition using Machine Learning Methods and Pattern-Selection Rules". In Natural Language Processing Pacific Rim Symposium 2001 (NLPRS2001), 2001: 229-236.
[15] Hutchatai Chanlekha, Asanee Kawtrakul. "Thai Named Entity Extraction by incorporating Maximum Entropy Model with Simple Heuristic Information". In Proc. IJCNLP, Bangkok, 2004.
[16] Xing-qiu Huang, Hong-gui Fan. A Comparative Study of Laos Guy and China’s Zhuang Culture. Nationalities Publishing House, 2010.
[17] Lang-min Zhang. Practical Grammar about Lao Language. Foreign Language Teaching and Research Press, 2001.
[18] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. "Conditional random fields: probabilistic models for segmenting and labeling sequence data". Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, 2001: 282-289.
[19] W-N. Zheng, P-B. Bo, Y. Liu, W-P. Wang. "Fast B-spline curve fitting by L-BFGS". Computer Aided Geometric Design, vol.29(7), 2012: 448-462.
