Received: February 17, 2020. Revised: March 15, 2020.

Hybrid Conditional Random Fields and K-Means for Named Entity Recognition on Indonesian News Documents

Joan Santoso 1,3*, Esther Irawati Setiawan 1,3, Eko Mulyanto Yuniarno 1,2, Mochamad Hariadi 1,2, Mauridhi Hery Purnomo 1,2*

1 Department of Electrical Engineering, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
2 Department of Computer Engineering, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
3 Department of Informatics, Institut Sains dan Teknologi Terpadu Surabaya, Surabaya, Indonesia
* Corresponding author's Email: [email protected]; [email protected]

Abstract: The hybrid approach has been widely used in several Natural Language Processing tasks, including Named Entity Recognition (NER). This research proposes a NER system for Indonesian news documents using hybrid Conditional Random Fields (CRF) and K-Means. The hybrid approach incorporates the clusters produced by K-Means as features in the CRF. Word embedding is a word representation technique that can capture the semantic meaning of words, and the clustering result from K-Means shows that words with similar meanings are grouped in the same cluster. We believe this feature can improve the performance of the baseline model by adding the semantic relatedness of words through the cluster features. The word embedding used in this research is an Indonesian Word2Vec model. The dataset consists of 51,241 entities from Indonesian online news. We conducted experiments by dividing the corpus into training and testing datasets using percentage splitting, with four scenarios: 60-40, 70-30, 80-20, and 90-10. The best performance of our model was achieved in the 60-40 scenario with an F1-Score of about 87.18%, an improvement of about 5.01% over the baseline model. We also compare our proposed method with several models from previous research, namely BILSTM and BILSTM-CRF. The experiments show that our model achieves better performance, with the best improvement of around 4.3%.

Keywords: Named entity recognition, Word2Vec, Hybrid approach, CRF, K-means, Indonesian.

1. Introduction

The hybrid approach has been implemented in many Natural Language Processing tasks, and hybrid models have been proven to improve the performance of various models. Suncong et al. [1] use a hybrid LSTM and CNN to obtain entities and their relations. Another hybrid model for sentiment analysis combined a Decision Tree with a Support Vector Machine in [2]. A hybrid method was also developed for summarization of microblog posts [3] and sentiment analysis of political data [4]. The main reason a hybrid approach is chosen as the proposed method is that it can combine each model's strengths to obtain better performance.

The Named Entity Recognition (NER) task has also been developed with hybrid models, as in [5] and [6]. NER is a task to obtain several types of entities, such as person, location, or organization, from text [7]. It has been widely used in several Natural Language Processing applications, such as Fake News Detection [8], Aspect Based Sentiment Analysis [9], and Machine Translation [10]. There are various approaches for obtaining named entities, such as Conditional Random Fields (CRF) [11, 12], Long Short Term Memory (LSTM) [13], and Bidirectional Long Short Term Memory (BILSTM) [14, 15].

Ambiguity is the most substantial concern in developing NER research. Some entity types in a sentence can be misclassified as another type of entity; for example, some person names are also location names in Indonesian. We use a CRF as the baseline system, and it appears that some entities have been misclassified.


The features used in this baseline system are the surrounding words and syntactic features like Part-Of-Speech (POS).

In recent studies, there have been efforts to obtain a word representation from unlabelled data. This word representation is known as word embedding. Word embedding has the advantage that it can capture the semantic meaning of words in a document. Much research has been done to build word embeddings, like Word2Vec in [16] and GloVe in [17]. Numerous NER studies try incorporating word embedding as a feature in a supervised classifier, as in [18] and [19].

This study proposes a hybrid model to obtain Indonesian named entities by incorporating word embedding into the baseline algorithm. Our hybrid model is built by combining the word embedding, as cluster features, into Conditional Random Fields (CRF), like the previous study in [18]. We use K-Means as the clustering algorithm. The clustering results are processed as an additional feature besides the standard contextual word and Part-of-Speech (POS) features in the CRF.

Similar words in a word embedding tend to have similar vectors; therefore, these similar words will be grouped together in the same cluster. To capture this behavior, we use the cluster features in the CRF. It will help the CRF obtain a better result, like the result of the previous study in [18]. So, it can be concluded that CRF and K-Means complement each other's strengths in our proposed hybrid model for Named Entity Recognition. The word embedding in our research is a pre-trained Word2Vec model based on the previous study in [20].

Based on the methodology that we propose in this study, our main contributions are as follows:

• A hybrid CRF and K-Means clustering algorithm is proposed in this study.
• The K-Means algorithm is utilized to obtain word clusters from Word2Vec, which significantly improves the performance of the algorithm by reducing the ambiguity of misclassified entity types from the baseline algorithm.
• A standard dataset that can be used in the Indonesian Named Entity Recognition task is rarely found. In this study, we propose our dataset as one of the standard datasets for the Named Entity Recognition task in Indonesian. Our dataset was taken from Indonesian online news on various topics.

This paper is divided into seven parts and organized as follows. The first part is the introduction in section 1. Section 2 describes the previous studies in Named Entity Recognition. Section 3 describes how the dataset for Indonesian Named Entity Recognition is obtained and constructed. Section 4 explains our hybrid model, and section 5 discusses the experiments. Sections 6 and 7 are the last parts of this study, containing the discussion, conclusion, and further research.

2. Related works on named entity recognition

Named Entity Recognition (NER), or proper name classification, is one of the main components in Information Extraction. NER is widely used to detect a person's name, location, and organization in a document. However, the entity types can be broadened into other kinds of entities according to the needs. As one of the vital research areas in Natural Language Processing, NER has become the foundation of other NLP tasks such as coreference resolution [21], machine translation [10, 22], and question answering systems [23].

There are various methods to solve NER; one of them is rule-based NER, which utilizes a data dictionary that consists of country, city, company, and person names [24]. With the rule-based approach, entity recognition is conducted by defining rules about word position patterns in a sentence [25]. However, with rule-based and dictionary-based approaches, there is a high dependency on the domain where the rules and dictionary were built. Thus, Named Entity Recognition will face challenges on a new domain or new sentence model.

Due to these limitations, the machine learning approach is widely implemented in the Named Entity Recognition task. The Hidden Markov Model (HMM) is one of the machine learning algorithms used besides Conditional Random Fields. NER with HMM does not need a language expert, which means it can be utilized in any language and can achieve a good performance score [26]. However, Conditional Random Fields is advantageous compared to HMM because it receives a sequence and maximizes the conditional probabilities of the labels, so any additional feature can easily be represented. Another machine learning approach is Maximum Entropy in [27], which is used to obtain Czech named entities. Nonetheless, the label bias shortcoming of Maximum Entropy can be covered by CRF. NER has been implemented in various languages, such as Arabic [28], and especially in Indonesian. NER in Indonesian faces similar challenges as in English, though the language structure is different.



Table 1. Related works in Indonesian NER

No. | Authors | Tagset | Dataset | Model
1 | Munarko et al. [11] | Person, Location, Organization, Other | 2000 training data from tweets | Conditional Random Fields
2 | Jaariyah [29] | Person, Location, Organization, Other | 2231 sentences | Conditional Random Fields
3 | Wibawa et al. [30] | Person, God, Organization, Location, Facility, Product, Event, Natural-Object, Disease, Color, Timex, Periodx, Numex, Countx, Measurement | 457 news articles with 1500 sentences | Naive Bayes, SVM, Simple Logistic
4 | Al-Ash et al. [31] | Age, Date, Doctor, Hospital, ID, Location, Patient, Phone | 888 documents | Combination of Long Short Term Memory and Conditional Random Fields
5 | Wintaka et al. [32] | Person, Location, Organization | 600 Indonesian tweets | Combination of Bidirectional Long Short Term Memory and Conditional Random Fields
6 | Wibisono et al. [33] | Person, Organization, Other | 2092 sentences | Combination of Long Short Term Memory and Conditional Random Fields
7 | Rachman et al. [34] | Organization, Person, Location | 480 tweets | Bidirectional Long Short Term Memory
8 | Anggareska et al. [35] | Object, Location, Time, Condition, Cause, Suggestion, Link | 290 tweets | Naïve Bayes, SMO, IBK

When the NER task is done using a machine learning approach, the methods developed for English can also be implemented in Indonesian, as in research [29]. That research still has an opportunity to be enhanced by selecting additional features.

Quite a lot of research on Named Entity Recognition has been proposed for Indonesian; a comparison of some of this research can be seen in Table 1. A study by [30] suggested ensemble supervised learning for 15 classes, which achieved the best performance with the Simple Logistic algorithm at 52%. This accuracy score could also be improved by selecting better features. Another example of Indonesian named entity recognition is Protected Health Information Removal in [31]. In that research, imbalanced data was a problem. Thus, we provide a dataset with more annotations, with a total of 51,241 named entities from 30,407 annotated sentences.

In Anggareska et al. [35], the best combination of features and algorithms for Named Entity Recognition and Relation Extraction was explored with three algorithms, which are Naïve Bayes, SMO, and IBK. The best accuracy was obtained with SMO, but there were still errors due to ambiguity.

Deep learning has also been explored as a model for Named Entity Recognition. A combination of Bidirectional LSTM and CRF was proposed for Indonesian in [32]. However, deep learning has high computation costs. That research also did not incorporate word embedding, which could have improved the performance; our model achieved good results without that disadvantage. Research in Indonesian Named Entity Recognition was also conducted in [11], without a hybrid model and relying on Conditional Random Fields alone; its performance could be enhanced with the hybrid model developed in our research. Wibisono et al. [33] performed Named Entity Recognition by combining Bidirectional LSTM with CRF. That research would give better performance by adding sufficient data to the NER dataset.

3. Indonesian named entity recognition dataset

The dataset used in this research was taken from Indonesian online news, with a total of 29,587 webpages taken from the CNN Indonesia website. Each webpage consists of one news article. Each news article was crawled from the website using a crawler, and preprocessing was performed using the several processes displayed in Fig. 1.

Web content extraction was done using regular expressions and manually formed rules to extract the news content from the HTML of the webpage. Sentence boundary detection and tokenization are done with the help of the Spacy.io library with a pre-trained language model for Indonesian. The results of this process are then stored as unlabeled data, which will later be annotated as the Named Entity Recognition dataset in Indonesian.
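To illustrate the preprocessing pipeline of Fig. 1, the sketch below shows one possible implementation of the content extraction and sentence segmentation steps. It is a minimal sketch under stated assumptions: the regular expression and helper names are illustrative, and spaCy's blank Indonesian pipeline with a rule-based sentencizer stands in for the pre-trained Indonesian model mentioned above, since the authors' exact configuration is not published.

```python
import re
import spacy

# Assumption: a very rough regex for pulling paragraph text out of crawled HTML;
# the authors describe manually crafted rules, which are not reproduced here.
PARAGRAPH_RE = re.compile(r"<p[^>]*>(.*?)</p>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def extract_news_content(html: str) -> str:
    """Extract visible news text from a crawled webpage (illustrative rules only)."""
    paragraphs = [TAG_RE.sub("", p).strip() for p in PARAGRAPH_RE.findall(html)]
    return " ".join(p for p in paragraphs if p)

# Blank Indonesian pipeline plus a rule-based sentencizer as a stand-in for the
# pre-trained Indonesian model used by the authors.
nlp = spacy.blank("id")
nlp.add_pipe("sentencizer")

def split_and_tokenize(text: str):
    """Sentence boundary detection and tokenization, returning lists of tokens."""
    doc = nlp(text)
    return [[token.text for token in sent] for sent in doc.sents]

if __name__ == "__main__":
    html = "<p>Joko Widodo mengunjungi Surabaya. Ia bertemu wali kota.</p>"
    sentences = split_and_tokenize(extract_news_content(html))
    print(sentences)  # unlabeled sentences to be annotated later
```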



Figure 1. Preprocessing for building the Indonesian NER dataset

Figure 2. Dataset example

The labeling process is carried out by two annotators using the Brat Annotation Tools [36]. The annotation is done by labeling the part-of-speech and the named entity label. The total number of part-of-speech tags in this dataset is 35, obtained from the previous research in [37]. The entity labels have four types, namely person (PER), location (LOC), organization (ORG), and miscellaneous (MISC). Person defines the entity for the name of a person. Location defines the entity for a location name. Organization is used to define the name of an organization. As for miscellaneous, it is used to represent entities that do not belong to the other three types.

The agreement of the two annotators determines the label values in the final dataset. If there is a difference between the labels given by the two annotators, the sentence is not saved in the final dataset. The final dataset available in this study consists of 30,407 sentences taken from approximately 2,000 news documents. Examples of sentences used in this research are shown in Fig. 2.

There are 51,241 entities in our dataset. There are 25,817 person entities, or 50.39% of the entire dataset. The location entities amount to about 12,088, or 23.59% of the whole dataset. The number of organization entities is 9,881, or 19.28% of the total entities in the dataset. The miscellaneous entities amount to 3,455, or 6.74% of all entities in the dataset.

The final dataset is converted from the Brat annotation format into the BIO format for each entity, where:
a. B (Begin) marks the beginning of a word sequence that is an entity.
b. I (Inside) marks a word that is part of an entity.
c. O (Outside) marks a word that is not an entity or part of an entity.

The number of labels used for classification is nine, consisting of eight entity labels and one O label. The eight labels cover the four entity types, each represented by B and I. These nine labels will be predicted by the system to produce a set of named entities in the document.
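As an illustration of the Brat-to-BIO conversion described above, the sketch below tags a tokenized sentence given character-level entity spans. It is a simplified sketch: the span format and helper names are illustrative, not the authors' exact Brat parsing code.

```python
from typing import List, Tuple

def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert character-level entity spans (start, end, type) into BIO tags.

    Offsets are computed as if the tokens were joined by single spaces.
    """
    # Character offsets of each token in the space-joined sentence.
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # +1 for the separating space

    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        inside = False
        for i, (t_start, t_end) in enumerate(offsets):
            if t_start >= start and t_end <= end:
                tags[i] = ("I-" if inside else "B-") + etype
                inside = True
    return tags

if __name__ == "__main__":
    tokens = ["Joko", "Widodo", "mengunjungi", "Surabaya", "."]
    # Spans as they might come from a Brat .ann file: (start, end, label)
    spans = [(0, 11, "PER"), (24, 32, "LOC")]
    print(list(zip(tokens, spans_to_bio(tokens, spans))))
    # [('Joko', 'B-PER'), ('Widodo', 'I-PER'), ('mengunjungi', 'O'), ('Surabaya', 'B-LOC'), ('.', 'O')]
```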



Figure 3. Hybrid approach for named entity recognition

4. Hybrid conditional random fields and k-means for named entity recognition

This section discusses the details of the proposed hybrid model. The architecture of the proposed hybrid model is shown in Fig. 3, and the pseudocode of our proposed system is given in Algorithm 1. The training and testing data used in this study have been preprocessed using the steps displayed in Fig. 1. Part-Of-Speech (POS) tagging is carried out after the preprocessing using the tools from [37].

The algorithm takes an input Dtrain, which is a collection of training sentences S, each consisting of words, part-of-speech (POS) tags, and NER labels. The first step clusters each word in Dtrain using the W2V model and stores the result in w_cluster. Each word, POS tag, and label in Dtrain, alongside w_cluster, is used to construct the features for the NER task. The features are used by the CRF to build the model, which is saved as CRFModel. This model is then used to extract the named entities in Dtest, the testing document containing testing sentences S that consist of words and part-of-speech (POS) tags. The named entity labels for the testing documents Dtest are saved in Ytest and returned as the result of the NER task. Detailed explanations of K-Means and CRF are given in subsections 4.1 and 4.2.

Algorithm 1: Pseudocode for Hybrid NER
Input:  Dtrain : Training Dataset
        Dtest  : Testing Dataset
        W2V    : Word2Vec Model
        K      : Number of Clusters in K-Means
Output: Ytest (NER Labels for Dtest)

Pseudocode for Building the Model
1.  w_cluster = K-Means(Dtrain, K, W2V)
2.  features = []
3.  target = []
4.  FOR S IN Dtrain
5.    FOR word, pos, label IN S
6.      features.add(extractFeatures(word, pos, w_cluster))
7.      target.add(label)
8.    NEXT
9.  NEXT
10. CRF = initCRFModel()
11. CRFModel = trainUsingSGD(CRF, features, target)
12. RETURN w_cluster, CRFModel

Pseudocode for Named Entity Recognition
1. Ytest = []
2. FOR S IN Dtest
3.   FOR word, pos IN S
4.     features = extractFeatures(word, pos, w_cluster)
5.     label = CRFModel.predict(features)
6.     Ytest.add(label)
7.   NEXT
8. NEXT
9. RETURN Ytest
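A rough Python rendering of Algorithm 1 is sketched below using gensim for the Word2Vec vectors, scikit-learn for K-Means, and sklearn-crfsuite for a CRF trained with SGD (the l2sgd algorithm). The corpus format, the simplified feature extractor, and the parameter values are illustrative assumptions, not the authors' released code.

```python
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans
import sklearn_crfsuite

def build_word_clusters(vocab, w2v: KeyedVectors, k: int) -> dict:
    """Cluster the Word2Vec vectors of the training vocabulary with K-Means (step 1)."""
    words = [w for w in vocab if w in w2v]
    kmeans = KMeans(n_clusters=k, random_state=0).fit([w2v[w] for w in words])
    return dict(zip(words, kmeans.labels_))

def sentence_features(sentence, w_cluster):
    """One feature dict per token: word, POS, and cluster id (window omitted here)."""
    return [{"word": w, "pos": p, "cluster": str(w_cluster.get(w, -1))}
            for w, p, _ in sentence]

def train_hybrid_ner(train_sents, w2v, k=200):
    """Steps 1-12 of Algorithm 1: cluster words, extract features, train the CRF."""
    vocab = {w for sent in train_sents for w, _, _ in sent}
    w_cluster = build_word_clusters(vocab, w2v, k)
    X = [sentence_features(s, w_cluster) for s in train_sents]
    y = [[label for _, _, label in s] for s in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm="l2sgd", c2=0.1, max_iterations=100)
    crf.fit(X, y)
    return w_cluster, crf

def predict_ner(test_sents, w_cluster, crf):
    """Label testing sentences of (word, pos) pairs with the trained model."""
    X = [[{"word": w, "pos": p, "cluster": str(w_cluster.get(w, -1))}
          for w, p in s] for s in test_sents]
    return crf.predict(X)
```

In this sketch the cluster id of an out-of-vocabulary word simply defaults to -1; the paper itself handles OOV words through the prefix and suffix features described in subsection 4.2.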



Table 2. Top 5 nearest words in the Indonesian Word2Vec model

No. | Word | Entity Type | Nearest Words
1 | Joko | PER | Widodo, Bambang, Susilo, Jokowi, Wiranto
2 | Surabaya | LOC | Malang, Semarang, Jember, Sidoarjo, Madiun
3 | LIPI | ORG | LEKNAS, PUSLIT GFTK, BPPT, PUSLITBANG
4 | SARS | MISC | Coronavirus, Kaposi, Zika, Crohn, Gastroenteritis
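The nearest-word behaviour shown in Table 2 can be reproduced with gensim's most_similar query, as in the short sketch below. The model path is a placeholder: the authors' pre-trained Indonesian Word2Vec model from [20] (a SkipGram negative-sampling model built from Indonesian Wikipedia, per subsection 5.1) is assumed but not bundled with this paper.

```python
from gensim.models import Word2Vec

# Placeholder path: assumes a SkipGram negative-sampling model trained on
# Indonesian Wikipedia, as described in subsection 5.1.
model = Word2Vec.load("idwiki_skipgram.model")

for query in ["Joko", "Surabaya", "LIPI", "SARS"]:
    if query in model.wv:
        neighbours = model.wv.most_similar(query, topn=5)
        print(query, [w for w, _ in neighbours])
```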

Figure 4. Word2Vec illustration, taken from Mikolov [38]

4.1 K-means algorithm

K-Means is used to form word clusters based on the closeness of words in the word embedding. A group of words in a word cluster are words that are considered to have the same semantic meaning. It is believed that the word clusters can improve the performance of Conditional Random Fields.

The type of word embedding used in this study is Word2Vec, which was formed using the SkipGram Negative Sampling model [38]. Word2Vec has several advantages, including being able to describe the semantic closeness between words, which can be seen in Fig. 4. Each word in Fig. 4 is taken from a SkipGram model for English with a dimension size of 1000. The word vectors are reduced from 1000 to 2 dimensions using the PCA algorithm so that they can be visualized in 2D space. The X axis in Fig. 4 indicates the first dimension of the PCA results, and the Y axis indicates the second dimension of the PCA results for each word.

We carried out some example queries on our Indonesian Word2Vec model by giving words of each entity type, namely person (PER), location (LOC), organization (ORG), and miscellaneous (MISC), and retrieving the top 5 closest words found in the Word2Vec model. This shows that those words tend to have the same type of entity. Details of the results are displayed in Table 2. As an example, from Table 2, given the input word Joko, the set of words returned is a series of people's names, including Widodo, Bambang, etc. For other types such as the MISC entity type, given a disease name, the closest words are several types of diseases and viruses, such as coronavirus, zika, and gastroenteritis.

It can be concluded that similar words with the same meaning tend to be clustered together. In our proposed hybrid model, the word clusters are built to help the CRF improve its performance. Cluster sizes in this research range from k = 100 to k = 500. The set of words in the training data is used as the input to the K-Means algorithm, and the clustering process is done using the vector value of each word in the Word2Vec model.

4.2 Conditional random fields

This section describes the Conditional Random Fields (CRF) algorithm for Named Entity Recognition. Conditional Random Fields is a statistical probabilistic model that has been utilized for segmenting and labeling sequence data and shows an advantage over Hidden Markov Models [39].

Given a set of words w1, w2, w3, …, wn with POS tags p1, p2, p3, …, pn in a sentence S with length n, these will be used as features to guess the correct entity label of the word wi. The CRF algorithm uses contextual features from around the word wi, as proposed in [40, 41]. Other features added in our proposed methodology are the prefix+suffix and word cluster features. The features used in the proposed methodology are explained below:

a. Word Features
Words appearing around wi are believed to help determine the type of entity. The word features of Eq. (1) will be used to predict the label of the word wi.

word(wi) = [w(i-j), …, wi, …, w(i+j)]   (1)



where wi is the i-th word in the sentence, and j is the window size for contextual features.

b. Part-Of-Speech (POS) Feature
Part-of-Speech is the grammatical tag of the word wi in the sentence. The use of POS as a feature in NER is very important because it can help the model determine the type of entity according to the context in the sentence. The POS tag feature used can be seen in Eq. (2).

pos(wi) = [p(i-j), …, pi, …, p(i+j)]   (2)

where pi is the POS tag of the i-th word in the sentence, and j is the window size of contextual features.

c. Prefix + Suffix Feature
In addition to the word features, this research adds prefixes and suffixes of the words as features. This technique is used to overcome Out of Vocabulary (OOV) words that never appear in the training data. A collection of words that have the same prefix and suffix are believed to have the same type of entity. Prefixes and suffixes of each word are taken from one to three characters. The prefix and suffix features can be obtained using Eq. (3) and Eq. (4).

prefix(wi) = [npreff(w(i-j), …, w(i+j), 1), npreff(w(i-j), …, w(i+j), 2), npreff(w(i-j), …, w(i+j), 3)]   (3)

suffix(wi) = [nsuff(w(i-j), …, w(i+j), 1), nsuff(w(i-j), …, w(i+j), 2), nsuff(w(i-j), …, w(i+j), 3)]   (4)

where npreff(w, n) is a function to get the first n characters of the word w, while nsuff(w, n) is a function to get the last n characters of the word w. The n value of the nsuff and npreff functions is one to three. The parameter w in the npreff and nsuff functions is the word whose prefix or suffix is being sought. The variable i defines the position of the word in the sentence, and j is the window size of contextual features.

d. Word Cluster Feature
This feature is the clustering result from the K-Means algorithm. The cluster label of each word will be used as a feature. The equation for obtaining this feature is Eq. (5).

word_cluster(wi) = [wc(w(i-j)), …, wc(wi), …, wc(w(i+j))]   (5)

where wc(w) is a function to get the cluster label for a word w from the K-Means clustering results. The variable i shows the position of the word w in the sentence, and j is the window size for contextual features.

Conditional Random Fields (CRF) is a probabilistic graphical model that is commonly used to label sequence data [39]. One of the CRF models that is widely used in Natural Language Processing is the linear-chain CRF. The features used in this research are converted into feature functions based on the example in Eq. (6), where yi is the label of the i-th word in the sentence, yi-1 is the label of the word w(i-1) (the previous word), and xi is the feature of the i-th word. As an example, in Eq. (6), the word appearing at the i-th position in the sentence is Surabaya; therefore the resulting entity type is B-LOC, since Surabaya is the capital city of the East Java province.

f(y_i, y_{i-1}, x_i) = \begin{cases} 1, & \text{if } y_i = \text{B-LOC} \text{ and } y_{i-1} = \text{O} \text{ and } x_i = \text{Surabaya} \\ 0, & \text{else} \end{cases}   (6)

To perform classification, the CRF algorithm looks for the maximum conditional probability P(Y|X), where Y is the sequence of labels and X is the sequence of words. The conditional probability model P(Y|X) can be seen in Eq. (7) and Eq. (8).

P(Y|X) = \frac{1}{Z(X)} \prod_{j=1}^{N} \exp\left( \sum_{i=1}^{M} \lambda_i f_i(y_j, y_{j-1}, x_j) \right)   (7)

Z(X) = \sum_{Y_n} \prod_{j=1}^{N} \exp\left( \sum_{i=1}^{M} \lambda_i f_i(y_j, y_{j-1}, x_j) \right)   (8)

where N is the number of sequence positions, M is the number of features, λ_i is the parameter for the i-th feature, and Y_n contains the label sequences that can be recognized by the CRF. The parameter λ is optimized using the Stochastic Gradient Descent (SGD) algorithm by minimizing the loss function in Eq. (9).

L(\lambda, D) = -\log\left( \prod_{k=1}^{N} P(y_k \mid x_k, \lambda) \right) + C \, \frac{1}{2} \|\lambda\|^2   (9)

In Eq. (9), the variable D is the training data, where D = [(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_N, y_N)]. The variable x in D is a sequence of words, and y is the corresponding sequence of labels. The training data D contains N sequences.



The variable C in Eq. (9) defines the L2 regularization parameter for the loss function calculation.
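To make the feature definitions in Eqs. (1)-(5) concrete, the sketch below builds the per-token feature dictionary (contextual words, POS tags, one-to-three-character prefixes and suffixes, and K-Means cluster labels) in the format accepted by sklearn-crfsuite, expanding the simplified feature function used in the earlier sketch of Algorithm 1. The window size j = 2 and the field names are illustrative assumptions, not the authors' exact configuration.

```python
def token_features(words, pos_tags, w_cluster, i, window=2):
    """Features for the i-th token: Eqs. (1)-(5) over a contextual window."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(words):
            w = words[j]
            feats[f"word[{offset}]"] = w.lower()                      # Eq. (1)
            feats[f"pos[{offset}]"] = pos_tags[j]                     # Eq. (2)
            for n in (1, 2, 3):
                feats[f"prefix{n}[{offset}]"] = w[:n]                 # Eq. (3)
                feats[f"suffix{n}[{offset}]"] = w[-n:]                # Eq. (4)
            feats[f"cluster[{offset}]"] = str(w_cluster.get(w, -1))   # Eq. (5)
    return feats

def extract_features(words, pos_tags, w_cluster):
    """Feature dicts for every token of a sentence, ready for sklearn-crfsuite."""
    return [token_features(words, pos_tags, w_cluster, i) for i in range(len(words))]
```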

5. Experiments

This section discusses how the experiments were carried out. The results of each experiment are analyzed to assess the performance of the proposed model.

5.1 Experiments scenario

In this section, we discuss the experiment scenarios. The Word2Vec model used in this study is Skip-Gram, formed with Negative Sampling. The model was built using Indonesian Wikipedia articles, with a total of 308,227 articles. The experiments are divided into three scenarios, which are:

• Baseline (B): Experiment using the standard features, which are the contextual window word features and part-of-speech.
• Baseline+PrefixSuffix (B+PS): The experiment is carried out by adding the prefix and suffix features to the contextual window features.
• Baseline+PrefixSuffix+W2V (B+PS+W2V): This experiment is done by adding the word cluster features obtained from the K-Means algorithm. In this experiment, the number of clusters k ranges from 100 to 500.

The performance of our proposed model is evaluated using the F1-Score used in CoNLL 2003 [42], which can be seen in Eq. (10) to Eq. (12).

F1-Score = (2 × Precision × Recall) / (Precision + Recall)   (10)

Precision = (total correct entities from the model) / (total entities from the model)   (11)

Recall = (total correct entities from the model) / (total gold entities in the dataset)   (12)

The experiments are carried out by dividing the training and testing data using percentage splitting. The total amount of data used was 30,407 sentences with four types of entities, namely person (PER), organization (ORG), location (LOC), and miscellaneous (MISC). The experiments are divided into four splits, namely 60-40, 70-30, 80-20, and 90-10.

Other experiments were done to compare the proposed model with Bidirectional Long Short Term Memory (Bi-LSTM) [34], which is a standard model in sequential tagging, and also BILSTM-CRF, which is widely used in current sequential tagging frameworks [32, 33]. We have conducted experiments with these previous approaches on our dataset. An illustration of the models compared in this study can be seen in Fig. 5.

Figure 5. Comparable models

The Bi-LSTM was compared using two scenarios, as in the previous research [34]. The first scenario takes words and part-of-speech (POS) tags as input and is called BILSTM+WE+POS, whereas in the second scenario we use words only as input, called BILSTM+WE. The model uses pre-trained Word2Vec embeddings to represent words, while the embedding for part-of-speech is trained together with the model during learning. For the third scenario, the comparison with BILSTM-CRF, we used the same setup as in [33].

5.2 Experiments results

The first experiment was carried out using a training and testing division of 60-40. The total number of training sentences used in this experiment is 18,245, and the number of testing sentences is 12,162. The results of the experiment can be seen in Table 3.

The second experiment was carried out by dividing the sentences 70-30. The total number of training sentences used was 21,284, while the number of testing sentences was 9,123. The results of this experiment can be seen in Table 4.

The third experiment was carried out with a data distribution of 80-20. The total number of sentences used as training data in this experiment was 24,325, and the number of testing sentences was 6,082. The results of the third experiment can be seen in Table 5.

The fourth experiment was carried out by dividing the data 90-10. The number of sentences used as training data is 27,366, whereas for testing there were 3,041. The results of these experiments can be seen in Table 6.



Table 3. Experiments on the 60-40 data distribution

No | Model | PER | ORG | LOC | MISC | ALL
1 | B | 87.06 | 76.52 | 84.17 | 58.92 | 82.31
2 | B+PS | 90.64 | 82.74 | 86.98 | 63.41 | 86.24
3 | B+PS+W2V (100) | 91.92 | 82.64 | 88.21 | 62.93 | 87.12
4 | B+PS+W2V (200) | 92.03 | 82.79 | 88.01 | 63.25 | 87.18
5 | B+PS+W2V (300) | 91.69 | 82.96 | 88.03 | 62.23 | 86.97
6 | B+PS+W2V (400) | 91.58 | 82.23 | 88.56 | 61.44 | 86.84
7 | B+PS+W2V (500) | 91.71 | 83.02 | 88.6 | 63.07 | 87.18

Table 5. Experiments on the 80-20 data distribution

No | Model | PER | ORG | LOC | MISC | ALL
1 | B | 85.52 | 79.3 | 82.78 | 46.22 | 80.88
2 | B+PS | 89.3 | 84.02 | 86.37 | 52.81 | 84.92
3 | B+PS+W2V (100) | 90.41 | 82.84 | 86.65 | 54.64 | 85.3
4 | B+PS+W2V (200) | 90.67 | 83.61 | 86.86 | 52.54 | 85.57
5 | B+PS+W2V (300) | 90.92 | 83.62 | 86.53 | 52.75 | 85.64
6 | B+PS+W2V (400) | 90.84 | 83.98 | 86.66 | 52.86 | 85.71
7 | B+PS+W2V (500) | 90.64 | 83.97 | 87.03 | 52.25 | 85.67

Table 4. Experiments on the 70-30 data distribution

No | Model | PER | ORG | LOC | MISC | ALL
1 | B | 86.45 | 76.37 | 83.77 | 44.62 | 81.29
2 | B+PS | 89.91 | 82.79 | 87.52 | 48.2 | 85.38
3 | B+PS+W2V (100) | 91.22 | 81.93 | 87.63 | 48.21 | 85.89
4 | B+PS+W2V (200) | 91.55 | 82.73 | 87.89 | 48.43 | 86.3
5 | B+PS+W2V (300) | 91.09 | 82.26 | 87.79 | 48.82 | 85.97
6 | B+PS+W2V (400) | 91.26 | 82.33 | 88 | 48.01 | 86.06
7 | B+PS+W2V (500) | 91.08 | 82.35 | 88.22 | 48.07 | 86.03

Table 6. Experiments on the 90-10 data distribution

No | Model | PER | ORG | LOC | MISC | ALL
1 | B | 87.06 | 76.52 | 84.17 | 58.92 | 82.31
2 | B+PS | 88.12 | 84.05 | 85.77 | 52.69 | 84.13
3 | B+PS+W2V (100) | 88.83 | 83.54 | 86.74 | 52.48 | 84.54
4 | B+PS+W2V (200) | 89.01 | 84.53 | 86.72 | 52.63 | 84.89
5 | B+PS+W2V (300) | 89.53 | 85.13 | 86.17 | 51.36 | 85.1
6 | B+PS+W2V (400) | 89.07 | 84.6 | 86.68 | 52.06 | 84.88
7 | B+PS+W2V (500) | 88.8 | 84.71 | 86.78 | 51.71 | 84.78

The fifth experiment was carried out by comparing with other models from previous research. There are three comparison scenarios, as already mentioned in subsection 5.1. The experimental results can be seen in Table 7.

A detailed explanation and discussion of the experiment results is given in section 6. According to the experiment results, our proposed model achieves better performance compared with the baseline and the other models.

6. Discussion

Our experiments are done using two types of scenarios. The first uses percentage splitting, and the second compares our methods with previous research. Each experiment in percentage splitting was done using the scenarios described in subsection 5.1, and each experiment was conducted with combinations of the features proposed in this paper.

The first experiment was done using a 60-40 percentage split. The result is shown in Table 3. The baseline model only achieved 82.31%. The combination of the baseline and the prefix-suffix features for OOV words improves the baseline to 86.24%. The best performance was obtained by two models, with cluster sizes 200 and 500. Experiments with the 70-30 data split in Table 4 gave the best result of 86.3%, using the hybrid CRF and K-Means model with 200 clusters; when compared to the baseline model, the performance of the proposed hybrid model increased by 5.01%.



Table 7. Performance comparison of our proposed methods with other models

No | Model | PER | ORG | LOC | MISC | ALL
1 | BI-LSTM+WE+POS | 89.93 | 76.59 | 81.62 | 59.48 | 83.09
2 | BI-LSTM+WE | 89.33 | 76.29 | 80.91 | 61.10 | 82.77
3 | BI-LSTM+CRF | 88.32 | 76.92 | 84.93 | 62.96 | 83.10
4 | Our Method | 92.03 | 82.79 | 88.01 | 63.25 | 87.18

On the other hand, the performance of the baseline model with prefix and suffix alone is 84.92%, a 1.38% difference compared to the best model obtained in this study.

Table 5 shows the experiments with the 80-20 percentage split, where the best performance of 85.71% was obtained by the hybrid model with 400 clusters. When compared to the baseline model, the best performance of the hybrid model increased by 4.83%. Meanwhile, when compared with the experiments using the baseline and prefix-suffix features, the hybrid model provides an increase of 0.79%.

The final experiment for percentage splitting is the 90-10 split in Table 6. The best result, 85.1%, was obtained by the hybrid model with 300 clusters. Experiments with the baseline model obtained an F1-Score of 82.31%, and the experiments with the baseline and prefix-suffix features obtained an F1-Score of 82.13%. A performance increase of 2.97% was obtained by the hybrid model compared to the baseline.

The experimental results in the percentage splitting scenario show that the best performance is obtained by the proposed model with the Word2Vec cluster features and k = 200. Clusters are believed to improve the performance of the model by grouping words that have similar meanings. The choice of the k value for the K-Means algorithm does not give a significant difference in performance. However, the use of the hybrid model is proven to provide improved performance compared to several other models.

Another contribution offered in this study, besides the use of the hybrid model, is the use of the prefix and suffix of a word as features of the CRF algorithm. The prefix and suffix of a word are believed to help the model determine the type of entity. The same type of entity will tend to have the same prefix and suffix, which can help solve words that are OOV. The addition of the prefix-suffix features has increased the performance over the proposed baseline model, as can be seen in Tables 3, 4, 5, and 6.

In addition to percentage splitting, other experiments were carried out by comparing the proposed methodology with previous research. We make a comparison with two models that are widely used in NER: the BILSTM model and BILSTM-CRF. The BILSTM was trained using two types of input combinations, namely word and POS. The results of our proposed method give better performance compared to this model: our model achieves better performance by 4.09% compared to the BILSTM that uses word embedding and POS embedding. The second comparison was made by taking only the word embedding as the input to the BILSTM; the results show that our model achieves better performance by 4.41%. Another type of model, BILSTM-CRF, taken from previous research, was used as the second model. BILSTM-CRF has been one of the current state-of-the-art algorithms in sequential neural tagging. The results of the proposed model in this study reveal a performance increase of 4.7% over BILSTM-CRF.

Looking at the NER results, the worst performance is achieved by the miscellaneous entity type, and the best performance by the person entity type. The combination of entity types that are categorized as miscellaneous causes ambiguity. The ambiguity in the miscellaneous entity type causes the performance of the model to be unstable, and such entities tend to fail to be recognized.

Based on the experiments, the proposed model provides fairly good performance. Thus, the proposed method is expected to become one of the state-of-the-art methods for NER in Indonesian.

7. Conclusions and further research

From the results of the experiments performed, the proposed hybrid model has the best result of 87.18% with a training and testing data split of 60-40. The methods offered provide the best improvement on average by 4.37% over all the experiments. When we compare the model with the other models, we get better performance on average by 4.25%. The hybrid model achieved better results compared to the baseline single model.

The word clusters in this research greatly affected the performance of the CRF, even though the value of the F1-Score did not change significantly.


The best result from all experiments is obtained with the parameter k = 200 of the K-Means algorithm. In future work, we need to determine the optimal number of clusters in order to obtain maximum results.

The experiments that have been conducted prove an excellent performance in the NER task for Indonesian. The comparison with previous works shows that our model achieves better performance, by 4.3% on average, compared to the existing techniques with the deep learning approach. The comparable models also require a lot of data and had to handle imbalanced data.

Our proposed method is a combination of two models. The first is CRF, which was widely used as the state-of-the-art approach for Named Entity Recognition before the deep learning era. Additionally, word representation in current research is done by applying word embedding. However, it is a challenging task to integrate continuous vectors into a graphical model like the linear-chain CRF. Thus, in our proposed model, we used K-Means to build word clusters as a feature that could be incorporated into the CRF. Based on our experiments in this paper, our proposed method works well compared to the standard CRF that is widely used in Named Entity Recognition. However, for future research, a comparison can be made with other word embedding methods, such as fastText and GloVe.

Acknowledgments

We would like to thank our friends from ITS for their comments and critique while writing this article. The authors also would like to express gratitude to Dr. Gunawan, Christian Nathaniel Purwanto, Amelinda Tjandra Dewi, and the members of the Computational Linguistics Research Group at ISTTS for their help in preparing the initial data.

References

[1] S. Zheng, Y. Hao, D. Lu, H. Bao, J. Xu, H. Hao, and B. Xu, "Joint entity and relation extraction based on a hybrid neural network", Neurocomputing, Vol. 257, pp. 59–66, 2017.
[2] K. Yang, Y. Cai, D. Huang, J. Li, Z. Zhou, and X. Lei, "An effective hybrid model for opinion mining and sentiment analysis", In: Proc. of 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), pp. 465–466, 2017.
[3] H. Akkineni, V. S. L. Papineni, and V. B. Burra, "Hybrid method for framing abstractive summaries of tweets", International Journal of Intelligent Engineering and Systems, Vol. 10, No. 3, pp. 418-425, 2017.
[4] P. Katta and N. P. Hegde, "A Hybrid Adaptive Neuro-Fuzzy Interface and Support Vector Machine Based Sentiment Analysis on Political Twitter Data", International Journal of Intelligent Engineering and Systems, Vol. 12, No. 1, pp. 165–173, 2019.
[5] K. Xu, Z. Zhou, T. Gong, T. Hao, and W. Liu, "SBLC: a hybrid model for disease named entity recognition based on semantic bidirectional LSTMs and conditional random fields", BMC Medical Informatics and Decision Making, Vol. 18, No. 5, pp. 114, 2018.
[6] T. H. Pham and P. Le-Hong, "End-to-End Recurrent Neural Network Models for Vietnamese Named Entity Recognition: Word-Level Vs. Character-Level", Communications in Computer and Information Science, Vol. 781, pp. 219-232, 2018.
[7] G. Aguilar, S. Maharjan, A. P. López-Monroy, and T. Solorio, "A multi-task approach for named entity recognition in social media data", In: Proc. of the 3rd Workshop on Noisy User-generated Text, pp. 148-153, 2017.
[8] H. S. Al-Ash and W. C. Wibowo, "Fake News Identification Characteristics Using Named Entity Recognition and Phrase Detection", In: Proc. of 2018 10th International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 12–17, 2018.
[9] F. Nurifan, R. Sarno, and K. R. Sungkono, "Aspect Based Sentiment Analysis for Restaurant Reviews Using Hybrid ELMo-Wikipedia and Hybrid Expanded Opinion Lexicon-SentiCircle", International Journal of Intelligent Engineering and Systems, Vol. 12, No. 6, pp. 47–58, 2019.
[10] N. Agrawal and A. Singla, Using Named Entity Recognition to Improve Machine Translation, Stanford University Natural Language Processing, San Francisco, 2012.
[11] Y. Munarko, M. S. Sutrisno, W. A. I. Mahardika, I. Nuryasin, and Y. Azhar, "Named entity recognition model for Indonesian tweet using CRF classifier", IOP Conference Series: Materials Science and Engineering, Vol. 403, No. 1, pp. 1-6, 2018.
[12] F. Souza, R. Nogueira, and R. Lotufo, "Portuguese Named Entity Recognition using BERT-CRF", arXiv preprint arXiv:1909.10649, pp. 1-8, 2019.
[13] H. Gasmi, A. Bouras, and J. Laval, "LSTM recurrent neural networks for cybersecurity named entity recognition", In: Proc. of the Thirteenth International Conference on Software Engineering Advances, pp. 1-6, 2018.


[14] B. Y. Lin, F. F. Xu, Z. Luo, and K. Zhu, "Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media", In: Proc. of the 3rd Workshop on Noisy User-generated Text, pp. 160–165, 2017.
[15] L. Luo, Z. Yang, P. Yang, Y. Zhang, L. Wang, H. Lin, and J. Wang, "An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition", Bioinformatics, Vol. 34, No. 8, pp. 1381–1388, 2018.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space", In: Proc. of ICLR Workshop, Arizona, USA, pp. 1-12, 2013.
[17] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global Vectors for Word Representation", In: Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014.
[18] V. Suárez-Paniagua, I. Segura-Bedmar, and P. Martínez, "Word embedding clustering for disease named entity recognition", In: Proc. of the Fifth BioCreative Challenge Evaluation Workshop, pp. 299–304, 2015.
[19] M. Seok, H. J. Song, C. Park, J. D. Kim, and Y. S. Kim, "Named Entity Recognition using Word Embedding as a Feature", International Journal of Software Engineering and Its Applications, Vol. 10, No. 2, pp. 93–104, 2016.
[20] J. Santoso, A. D. B. Soetiono, Gunawan, E. Setyati, E. M. Yuniarno, M. Hariadi, and M. H. Purnomo, "Self-Training Naive Bayes Berbasis Word2Vec untuk Kategorisasi Berita Bahasa Indonesia", Jurnal Nasional Teknik Elektro dan Teknologi Informasi, Vol. 7, No. 2, pp. 158–166, 2018.
[21] Z. Dai, H. Fei, and P. Li, "Coreference aware representation learning for neural named entity recognition", In: Proc. of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pp. 4946-4953, 2019.
[22] A. Jain, B. Paranjape, and Z. C. Lipton, "Entity Projection via Machine-Translation for Cross-Lingual NER", In: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 1083-1092, 2019.
[23] A. Lamurias and F. M. Couto, "LasigeBioTM at MEDIQA 2019: Biomedical question answering using bidirectional transformers and named entity recognition", In: Proc. of the 18th BioNLP Workshop and Shared Task, pp. 523–527, 2019.
[24] R. Grishman, "The NYU System for MUC-6 or Where's the Syntax?", In: Proc. of the Sixth Message Understanding Conference (MUC-6), pp. 167-175, 1995.
[25] T. Eftimov, B. K. Seljak, and P. Korošec, "A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations", PLoS One, Vol. 12, No. 6, pp. 1-32, 2017.
[26] S. Morwal, N. Jahan, and D. Chopra, "Named entity recognition using hidden Markov model (HMM)", International Journal on Natural Language Computing, Vol. 1, No. 4, pp. 15–23, 2012.
[27] M. Konkol and M. Konopík, "Maximum Entropy Named Entity Recognition for Czech Language", In: Proc. of the 14th International Conference on Text, Speech and Dialogue, pp. 203–210, 2011.
[28] I. El Bazi and N. Laachfoubi, "Arabic Named Entity Recognition Using Topic Modeling", International Journal of Intelligent Engineering and Systems, Vol. 11, No. 1, pp. 229-238, 2017.
[29] N. Jaariyah, Pengenalan Entitas Bernama Pada Teks Bahasa Indonesia Menggunakan Conditional Random Fields, Universitas Komputer Indonesia, Indonesia, 2017.
[30] A. S. Wibawa and A. Purwarianti, "Indonesian named-entity recognition for 15 classes using ensemble supervised learning", Procedia Computer Science, Vol. 81, pp. 221–228, 2016.
[31] H. S. Al-Ash, I. Fanany, and A. Bustamam, "Indonesian Protected Health Information Removal using Named Entity Recognition", In: Proc. of 2019 12th International Conference on Information & Communication Technology and System, pp. 258–263, 2019.
[32] D. C. Wintaka, M. A. Bijaksana, and I. Asror, "Named-Entity Recognition on Indonesian Tweets using Bidirectional LSTM-CRF", Procedia Computer Science, Vol. 157, pp. 221–228, 2019.
[33] Y. Wibisono and M. L. Khodra, "Pengenalan Entitas Bernama Otomatis untuk Bahasa Indonesia dengan Pendekatan Pembelajaran Mesin", In: Proc. of Seminar Tahunan Linguistik 2018, pp. 1-5, 2018.
[34] V. Rachman, S. Savitri, F. Augustianti, and R. Mahendra, "Named entity recognition on Indonesian Twitter posts using long short-term memory networks", In: Proc. of 2017 International Conference on Advanced Computer Science and Information Systems (ICACSIS), pp. 228–232, 2017.


[35] D. Anggareska and A. Purwarianti, "Information extraction of public complaints on Twitter text for Bandung government", In: Proc. of 2014 International Conference on Data and Software Engineering, 2014.
[36] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, and J. Tsujii, "BRAT: a web-based tool for NLP-assisted text annotation", In: Proc. of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102–107, 2012.
[37] A. F. Wicaksono and A. Purwarianti, "HMM based part-of-speech tagger for Bahasa Indonesia", In: Proc. of the Fourth International MALINDO Workshop, pp. 1-7, 2010.
[38] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality", In: Proc. of the 26th International Conference on Neural Information Processing Systems, pp. 3111–3119, 2013.
[39] J. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", In: Proc. of the 18th International Conference on Machine Learning (ICML 2001), pp. 282–289, 2001.
[40] M. Tkachenko and A. Simanovsky, "Named entity recognition: Exploring features", In: Proc. of KONVENS, pp. 118–127, 2012.
[41] J. Santoso, G. Gunawan, H. V. Gani, E. M. Yuniarno, M. Hariadi, and M. H. Purnomo, "Noun phrases extraction using shallow parsing with C4.5 decision tree algorithm for Indonesian Language ontology building", In: Proc. of the 2015 15th International Symposium on Communications and Information Technologies (ISCIT 2015), pp. 149-152, 2015.
[42] E. F. T. K. S. Sang and F. De Meulder, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition", In: Proc. of the Seventh Conference on Natural Language Learning at HLT-NAACL, pp. 142-147, 2003.
