F2N-Rank: Domain Keywords Extraction Algorithm
Zhijuan Wang*, Yinghui Feng
The College of Information Engineering, Minzu University of China, No.27 South Street, Zhongguancun, Haidian District, Beijing, 100081, China
Minority Languages Branch, National Language Resource Monitoring & Research Center, No.27 South Street, Zhongguancun, Haidian District, Beijing, 100081, China
Metallurgical and Mining Industry, No. 9 — 2015

Abstract
Domain keywords extraction is very important for information extraction, information retrieval, classification, clustering, topic detection and tracking, and so on. TextRank is a common graph-based algorithm for keywords extraction. In TextRank, only edge weights are taken into account. We propose a new text ranking formula, named F2N-Rank, that takes into account both edge and node weights of words. Experiments show that F2N-Rank clearly outperformed both TextRank and ATF*DF: F2N-Rank has the highest average precision (78.6%), about 16% over TextRank and 29% over ATF*DF in keywords extraction for the Tibetan religion domain.

Keywords: F2N-RANK, TEXTRANK, ATF*DF

1. Introduction
Domain keywords can serve as a highly condensed summary for a domain, and they can be used as labels for a domain. Domain keywords should be ordered by their "importance".

In the study of keywords extraction, supervised methods [2-7] always depend on the trained model and the domain it is trained on. Among unsupervised methods [1, 8-11], algorithms based on term frequency and algorithms based on graphs are the most common. Term-frequency algorithms such as TF, ATF, and ATF*DF are easy to implement, but their precision is not very high. Graph-based algorithms, such as TextRank [1], are more effective than term-frequency algorithms because they take the relationships among words into account.

TextRank is one of the most popular graph-based methods. Each node of the graph corresponds to a candidate keyword from the document, and an edge connects two related candidates. In the entire graph, only edge weights are taken into account. Node weights are also very important. TF*IDF is the common method for measuring node weights. However, TF*IDF is less suitable than ATF*DF for measuring word weights in a domain: the TF*IDF function gives a higher weight to a term if it appears frequently in one document and rarely occurs in the others [12].
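A tiny numeric illustration of this point, using a made-up mini-corpus (the word lists and the `idf` helper are invented for the example; `idf` mirrors the standard log(|D|/df) definition): a term that appears in every document of a domain collection gets an IDF of zero, so TF*IDF suppresses exactly the terms a domain keyword extractor should promote.

```python
# Hypothetical mini-corpus: "lama" is a genuine domain term that
# appears in every document, so its IDF -- log(|D| / df) -- is zero
# and its TF*IDF weight vanishes.
import math

docs = [
    ["lama", "monastery", "prayer"],
    ["lama", "festival", "prayer"],
    ["lama", "pilgrim", "monastery"],
]

def idf(term, docs):
    # Standard inverse document frequency: log(|D| / document frequency).
    df = sum(term in d for d in docs)
    return math.log(len(docs) / df)

print(idf("lama", docs))     # 0.0 -- the best domain term is zeroed out
print(idf("pilgrim", docs))  # log(3) ≈ 1.0986 -- a rare term is boosted
```

This is why the paper argues for ATF*DF-style weighting, which rewards terms that occur frequently across many documents of the domain rather than penalizing them.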
For domain keywords extraction, terms reflecting a domain should appear frequently in a large number of documents [13], so ATF (average term frequency) should be used instead of TF.

In this paper, a graph-based algorithm inspired by TextRank, named F2N-Rank, is proposed. Node weights are taken into account, and the idea of the F2-measure is used for calculating them; the F2-measure formula gives consideration to both ATF and DF.

This paper is organized as follows. Firstly, the TextRank algorithm is introduced. Secondly, the algorithm that we call F2N-Rank is proposed for extracting domain keywords. Thirdly, experiments are performed on a dataset of the Tibetan religious domain, and the results are given. Finally, the conclusion is given.

2. TextRank Algorithm
Graph-based ranking algorithms are an essential way of deciding the importance of a node within a graph, based on global information recursively drawn from the entire graph [1]. The basic idea is to build a graph from the input document and rank its nodes according to their scores. A text is represented by a graph G(V, E), where each node V_i corresponds to a word. The goal is to compute the score of each node according to the formula. The score for V_i, WS(V_i), is initialized with a default value.
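The graph construction just described — words as nodes, co-occurrence counts as edge weights — can be sketched as follows. This is an illustrative sketch, not the authors' code; `build_graph` and the toy token list are invented for the example.

```python
# Build an undirected co-occurrence graph: each word is a node, and the
# edge weight w_ji counts how often two words co-occur within a window
# of `window` consecutive tokens (window=2 means adjacent tokens).
from collections import defaultdict

def build_graph(tokens, window=2):
    """Return {node: {neighbor: weight}} for a word co-occurrence graph."""
    weights = defaultdict(lambda: defaultdict(int))
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + window]:
            if w1 != w2:
                weights[w1][w2] += 1
                weights[w2][w1] += 1
    return weights

g = build_graph(["temple", "prayer", "temple", "monk", "prayer"], window=2)
print(dict(g["temple"]))  # {'prayer': 2, 'monk': 1}
```

With such a graph in hand, the iterative scoring formula of the next paragraph can be applied directly to the nested weight dictionaries.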
Using (1), WS(V_i) is computed in an iterative manner until convergence. The final values are not affected by the choice of the initial value; only the number of iterations to convergence may differ [1].

    WS(V_i) = (1 − d) + d * ∑_{V_j ∈ In(V_i)} [ w_{ji} / ∑_{V_k ∈ Out(V_j)} w_{jk} ] * WS(V_j)    (1)

WS(V_i) is the score of node V_i.
d is the damping factor, which can be set between 0 and 1 and represents the probability of jumping from a given node to another random node in the graph. The value of d is usually set to 0.85.
w_{ji} is the weight of the edge from the previous node V_j to the current node V_i.
In(V_i) is the set of nodes that point to V_i (its predecessors).
Out(V_j) is the set of nodes that node V_j points to (its successors) [1].
∑_{V_k ∈ Out(V_j)} w_{jk} is the sum of the weights of all edges leaving the previous node V_j.
w_{ji} is defined as the number of times the corresponding words V_j and V_i co-occur within a window of at most N words in the associated text, where N ∈ [2, 10] [14].

TextRank only takes edge weights into account. Node weights are also very important for node scores. Several methods can be used for computing node weights.

(a) TF

    TF(V_i) = n_{i,j} / ∑_k n_{k,j}    (2)

TF(V_i) is the term frequency of node V_i.

(b) ATF

    ATF(V_i) = [ ∑_{d_j ∈ D} ( n_{i,j} / ∑_k n_{k,j} ) ] / |{d : t_i ∈ d}|    (3)

ATF(V_i) is the average term frequency of node V_i.

(c) ATF*DF

    ATF*DF(V_i) = [ ∑_{d_j ∈ D} ( n_{i,j} / ∑_k n_{k,j} ) / |{d : t_i ∈ d}| ] * log( |{d : t_i ∈ d}| / |D| )    (4)

DF(V_i) is the document frequency of node V_i. ATF*DF(V_i) is the product of the average term frequency and the document frequency of node V_i.

In Equations (2), (3) and (4), n_{i,j} is the number of occurrences of the word t_i in document d_j; D is a collection of N documents, and |D| is the cardinality of D.

In the next section, a new ranking formula, named F2N-Rank, that takes into account both edge and node weights is proposed.

3. Proposed Algorithm
The TextRank algorithm focuses only on the relationships among nodes; node weights are not taken into account. Equation (5) integrates the TextRank formula with the node weight F(V_i):

    FS(V_i) = (1 − d) * F(V_i) + d * F(V_i) * ∑_{V_j ∈ In(V_i)} [ w_{ji} / ∑_{V_k ∈ Out(V_j)} w_{jk} ] * FS(V_j)    (5)

Several formulas can be used to calculate the value of F(V_i), such as TF, ATF and ATF*DF. ATF*DF is the most suitable of the three because it takes into account both term frequency and document frequency. However, the simple combination of ATF and DF does not account for their proportions. Here, the idea of the F-measure is introduced for calculating F(V_i) [15].
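Assuming an undirected co-occurrence graph and precomputed ATF/DF values, the F-measure-style node weighting and the iterative update of Equation (5) can be sketched as follows. This is a minimal illustration with invented numbers, not the paper's implementation; `f2_weight`, `f2n_rank`, and all corpus values are hypothetical.

```python
# Sketch of the F2N-Rank idea: node weights from an F2-style (beta = 2)
# combination of ATF and DF, plugged into a TextRank-like update.
# All ATF/DF values and the toy graph below are made up for illustration.

def f2_weight(atf, df, beta=2.0):
    """F-measure-style weighted harmonic mean of ATF and DF."""
    return (1 + beta**2) * atf * df / (beta**2 * atf + df)

def f2n_rank(graph, F, d=0.85, iters=100, tol=1e-9):
    """Iterate FS(Vi) = (1-d)*F(Vi) + d*F(Vi)*sum_j [w_ji/sum_k w_jk]*FS(Vj)."""
    scores = {v: 1.0 for v in graph}
    for _ in range(iters):
        new = {
            vi: (1 - d) * F[vi]
            + d * F[vi] * sum(
                graph[vj][vi] / sum(graph[vj].values()) * scores[vj]
                for vj in graph if vi in graph[vj]
            )
            for vi in graph
        }
        if max(abs(new[v] - scores[v]) for v in graph) < tol:
            return new
        scores = new
    return scores

# Toy co-occurrence graph and illustrative (ATF, DF) pairs per word.
graph = {
    "lama":      {"prayer": 2, "monastery": 1},
    "prayer":    {"lama": 2},
    "monastery": {"lama": 1},
}
atf_df = {"lama": (0.4, 0.9), "prayer": (0.3, 0.5), "monastery": (0.2, 0.3)}
F = {w: f2_weight(a, df) for w, (a, df) in atf_df.items()}

scores = f2n_rank(graph, F)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # 'lama' ranks first
```

Note that with beta = 2 the harmonic-mean combination weights ATF more heavily than DF, matching the paper's preference for average term frequency as the dominant signal.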
The formulas are given as follows:

    F(V_i) = (1 + β²) * ATF(V_i) * DF(V_i) / ( β² * ATF(V_i) + DF(V_i) ),  β = 2    (6)

    ATF(V_i) = [ ∑_{d_j ∈ D} ( n_{i,j} / ∑_k n_{k,j} ) ] / |{d : t_i ∈ d}|    (7)

    DF(V_i) = log( |{d : t_i ∈ d}| / |D| )    (8)

The main steps of extracting domain keywords using the F2N-Rank algorithm are as follows:
Step 1: Identify words (nouns, adjectives, and so on) that are suitable for the task, and add them as nodes in the graph.
Step 2: Identify relations that connect such words, and use these relations to draw edges between nodes in the graph. Edges can be directed or undirected, weighted or unweighted.
Step 3: Calculate the weight of each node in the graph.
Step 4: Iterate the graph-based ranking algorithm until convergence.
Step 5: Sort the nodes based on their final scores. The top T words are the domain keywords.

Fig. 1 shows the flow chart of extracting Tibetan religious domain keywords using the F2N-Rank algorithm. The first step is processing the domain documents: sub-step (1) is preparing the domain document dataset, sub-step (2) is word segmentation, and sub-step (3) is removing stop words. As the texts are all Chinese, word segmentation and stop-word removal are a must; the free Chinese word segmentation tool ICTCLAS Segmenter [16] is used. The second step is running the F2N-Rank algorithm on the prepared dataset: sub-step (1) is calculating each word's weight using the F2-measure(ATF, DF), and sub-step (2) is calculating each word's score using the graph-based algorithm. Finally, the top T words are taken as the domain keywords.

Figure 1. The flow chart of the F2N-Rank experiment

4. Experiment and Results
4.1. Experiment Description
To evaluate the proposed algorithm, the Tibetan religious domain is selected. Tibet is a universally religious region, and religious activities have been an integral part of most residents' daily life. Tibetan religious keywords are microcosms of the Tibetan religious domain. The Tibetan religious domain corpora come from three websites.

4.2. Experiment Results
Two experiments are performed in this paper. The aim of experiment 1 is to show which approach is best at extracting domain keywords. After comparing F2N-Rank, TextRank and ATF*DF algo-