
F2N-Rank: Domain Keywords Extraction Algorithm

Zhijuan Wang*, Yinghui Feng

The College of Information Engineering, Minzu University of China, No. 27 South Street, Zhongguancun, Haidian District, Beijing, 100081, China
Minority Languages Branch, National Language Resource Monitoring & Research Center, No. 27 South Street, Zhongguancun, Haidian District, Beijing, 100081, China

Abstract
Domain keywords extraction is very important for information extraction, information retrieval, classification, clustering, topic detection and tracking, and so on. TextRank is a common graph-based algorithm for keywords extraction, but it takes only edge weights into account. We propose a new text ranking formula, named F2N-Rank, that takes into account both the edge weights and the node weights of words. Experiments show that F2N-Rank clearly outperformed both TextRank and ATF*DF: it has the highest average precision (78.6%), about 16% over TextRank and 29% over ATF*DF, in keywords extraction for the Tibetan religious domain.
Keywords: F2N-RANK, TEXTRANK, ATF*DF

1. Introduction
Domain keywords can serve as a highly condensed summary for a domain, and they can be used as labels for a domain. Domain keywords should be ordered by the "importance" of the keywords.
In the study of keywords extraction, supervised methods [2-7] always depend on the trained model and the domain it is trained on. Among unsupervised methods [1, 8-11], algorithms based on term frequency and algorithms based on graphs are the most common. Algorithms based on term frequency, such as TF, ATF and ATF*DF, are easy to implement, but their precision is not very high. Algorithms based on graphs, such as TextRank [1], are more effective than algorithms based on term frequency because they take into account the relationships among words.
TextRank is one of the most popular graph-based methods. Each node of the graph corresponds to a candidate keyword from the document, and an edge connects two related candidates. In the entire graph, only edge weights are taken into account. Node weights are also very important. TF*IDF is the common method for measuring node weights; however, TF*IDF is less suitable than ATF*DF when measuring word weights in a domain. The TF*IDF function gives a higher weight to a term if it appears frequently in a document and rarely occurs in the others [12]. For domain keywords extraction, terms reflecting a domain should appear frequently in a large number of documents [13], so ATF (average term frequency) should be used instead of TF.
In this paper, a graph-based algorithm inspired by TextRank, named F2N-Rank, is proposed. Node weights are taken into account, and the idea of the F2-measure is used for calculating node weights; the F2-measure formula gives consideration to both ATF and DF.
This paper is organized as follows. Firstly, the TextRank algorithm is introduced. Secondly, the algorithm that we call F2N-Rank is proposed for extracting domain keywords. Thirdly, experiments are performed on a dataset of the Tibetan religious domain, and the results are given. Finally, the conclusion is given.

2. TextRank Algorithm
Graph-based ranking algorithms are an essential way of deciding the importance of a node within a graph, based on global information recursively drawn from the entire graph [1]. The basic idea is to build a graph from the input document and rank its nodes according to their scores. A text is represented by a graph G(V, E), in which each node V_i corresponds to a word. The goal is to compute the score of each node according to formula (1). The score of V_i, WS(V_i), is initialized with a default value; using (1), WS(V_i) is then computed in an iterative manner until convergence. The final values are not affected by the choice of the initial value; only the number of iterations to convergence may be different [1].

WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)    (1)

WS(V_i) is the score of node V_i.
d is the damping factor, which can be set between 0 and 1 and represents the probability of jumping from a given node to another random node in the graph; the value of d is usually set to 0.85.
w_{ji} is the weight of the edge from the previous node V_j to the current node V_i.
In(V_i) is the set of nodes that point to V_i (its predecessors).
Out(V_j) is the set of nodes that node V_j points to (its successors) [1].
\sum_{V_k \in Out(V_j)} w_{jk} is the sum of the weights of all edges leaving the previous node V_j.
w_{ji} is defined as the number of times the corresponding words V_j and V_i co-occur within a window of at most N words in the associated text, where N \in [2, 10] [14].
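As a concrete illustration of the iteration in Equation (1), the following sketch computes WS over a weighted co-occurrence graph. This is not the authors' code; the edge representation (a dictionary mapping (j, i) pairs to weights w_ji), the tolerance and the iteration cap are assumptions made here, with d defaulting to 0.85 as in the text.

```python
# Sketch of the TextRank update in Equation (1) on a weighted graph.
# edges: dict mapping (j, i) -> w_ji; for an undirected co-occurrence graph,
# both (j, i) and (i, j) are assumed to be present.

def textrank_scores(nodes, edges, d=0.85, tol=1e-6, max_iter=100):
    ws = {v: 1.0 for v in nodes}          # WS(V_i) initialized with a default value
    out_sum = {v: 0.0 for v in nodes}     # sum of edge weights leaving each node V_j
    in_edges = {v: [] for v in nodes}     # predecessors In(V_i) with their edge weights
    for (j, i), w in edges.items():
        if j in out_sum and i in in_edges:
            out_sum[j] += w
            in_edges[i].append((j, w))

    for _ in range(max_iter):
        new_ws = {}
        for i in nodes:
            rank = sum(w / out_sum[j] * ws[j] for j, w in in_edges[i] if out_sum[j] > 0)
            new_ws[i] = (1 - d) + d * rank            # Equation (1)
        converged = max(abs(new_ws[v] - ws[v]) for v in nodes) < tol
        ws = new_ws
        if converged:
            break
    return ws
```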

TextRank only takes edge weights into account. Node weights are also very important for node scores. There are several methods that can be used for computing node weights.

(a) TF

TF(V_i) = \frac{n_{i,j}}{\sum_{k} n_{k,j}}    (2)

TF(V_i) is the term frequency of node V_i.

(b) ATF

ATF(V_i) = \frac{\sum_{d_j \in D} n_{i,j} / \sum_{k} n_{k,j}}{|\{d : t_i \in d\}|}    (3)

ATF(V_i) is the average term frequency of node V_i.

(c) ATF*DF

ATF*DF(V_i) = \frac{\sum_{d_j \in D} n_{i,j} / \sum_{k} n_{k,j}}{|\{d : t_i \in d\}|} \cdot \log \frac{|\{d : t_i \in d\}|}{|D|}    (4)

DF(V_i) is the document frequency of node V_i, and ATF*DF(V_i) is the product of the average term frequency and the document frequency of node V_i.

In Equations (2), (3) and (4), n_{i,j} is the number of occurrences of the word t_i in document d_j, D is a collection of N documents, and |D| is the cardinality of D.
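The statistics in Equations (2)-(4) can be computed in one pass over a segmented corpus, as in the sketch below. It is an illustration of the reconstructed formulas rather than the authors' implementation; `docs` is assumed to be a list of token lists (one per document, after segmentation and stop-word removal), and the orientation of the log fraction follows Equation (4) as recovered from the source layout.

```python
import math
from collections import Counter

def term_statistics(docs):
    """Per-word statistics: ATF (Eq. 3), the DF log term (Eq. 4/8) and ATF*DF (Eq. 4)."""
    n_docs = len(docs)
    rel_freq_sum = Counter()   # sum over documents of n_ij / sum_k n_kj
    doc_freq = Counter()       # |{d : t_i in d}|
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        for word, n_ij in counts.items():
            rel_freq_sum[word] += n_ij / total   # per-document term frequency, Equation (2)
            doc_freq[word] += 1

    atf = {w: rel_freq_sum[w] / doc_freq[w] for w in rel_freq_sum}   # Equation (3)
    df = {w: math.log(doc_freq[w] / n_docs) for w in doc_freq}       # log term of Equation (4)
    atf_df = {w: atf[w] * df[w] for w in atf}                        # Equation (4)
    return atf, df, atf_df
```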

In the next section, a new ranking formula that takes into account both edge and node weights, named F2N-Rank, is proposed.

3. Proposed Algorithm
The TextRank algorithm only focuses on the relationships among nodes; node weights are not taken into account. Equation (5) integrates the TextRank formula with the node weight F(V_i):

FS(V_i) = (1 - d) \cdot F(V_i) + d \cdot F(V_i) \cdot \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} FS(V_j)    (5)

There are several formulas that can be used to calculate the value of F(V_i), such as TF, ATF and ATF*DF. ATF*DF is the most suitable of the three because it takes into account both term frequency and document frequency. However, the simple combination of ATF and DF does not account for their proportions. Here, the idea of the F-measure is introduced for calculating F(V_i) [15]. The formulas are given as follows:

F(V_i) = \frac{(1 + \beta^2) \cdot ATF(V_i) \cdot DF(V_i)}{\beta^2 \cdot ATF(V_i) + DF(V_i)}, \quad \beta = 2    (6)

ATF(V_i) = \frac{\sum_{d_j \in D} n_{i,j} / \sum_{k} n_{k,j}}{|\{d : t_i \in d\}|}    (7)

DF(V_i) = \log \frac{|\{d : t_i \in d\}|}{|D|}    (8)

Fig. 1 shows the flow chart of extracting Tibetan religious domain keywords using the F2N-Rank algorithm.
The first step is processing the domain documents.
Sub-step (1) is preparing the domain documents dataset.
Sub-step (2) is word segmentation.
Sub-step (3) is removing stop words.
As the documents are all Chinese texts, word segmentation and stop-word removal are a must; the free Chinese word segmentation tool ICTCLAS Segmenter [16] is used.
The second step is running the F2N-Rank algorithm on the prepared dataset.
Sub-step (1) is calculating each word's weight using the F2-measure (ATF, DF).
Sub-step (2) is calculating each word's score using the graph-based algorithm.
Finally, the top T words are taken as the domain keywords.

Figure 1. The flow chart of the F2N-Rank experiment

The main steps of extracting domain keywords using the F2N-Rank algorithm are as follows:
Step 1: Identify words (nouns, adjectives, and so on) that are suitable for the task, and add them as nodes in the graph.
Step 2: Identify relations that connect such words, and use these relations to draw edges between nodes in the graph. Edges can be directed or undirected, weighted or unweighted.
Step 3: Calculate the weight of each node in the graph.
Step 4: Iterate the graph-based ranking algorithm until convergence.
Step 5: Sort nodes based on their final score. The top T words are the domain keywords.
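Steps 1-5 can be sketched end to end as follows. This is an illustrative sketch, not the authors' implementation: the helper names, the omission of part-of-speech filtering, the co-occurrence window of 5 and the convergence settings are assumptions made here. `atf` and `df` are the per-word statistics from the previous section; `fbeta_weight` implements Equation (6) and `f2n_rank_scores` iterates Equation (5).

```python
from collections import Counter

def fbeta_weight(atf, df, beta=2.0):
    """Equation (6): F-beta combination of ATF and DF as the node weight F(V_i)."""
    return {w: (1 + beta ** 2) * atf[w] * df[w] / (beta ** 2 * atf[w] + df[w])
            for w in atf if (beta ** 2 * atf[w] + df[w]) != 0}

def cooccurrence_edges(docs, window=5):
    """Step 2: connect words that co-occur within `window` tokens (undirected, weighted)."""
    edges = Counter()
    for doc in docs:
        for pos, word in enumerate(doc):
            for other in doc[pos + 1: pos + window]:
                if other != word:
                    edges[(word, other)] += 1
                    edges[(other, word)] += 1
    return edges

def f2n_rank_scores(fv, edges, d=0.85, tol=1e-6, max_iter=100):
    """Steps 3-4: iterate Equation (5), weighting each node by F(V_i), until convergence."""
    nodes = list(fv)
    fs = {v: 1.0 for v in nodes}
    out_sum = Counter()
    in_edges = {v: [] for v in nodes}
    for (j, i), w in edges.items():
        if j in fv and i in fv:
            out_sum[j] += w
            in_edges[i].append((j, w))
    for _ in range(max_iter):
        new_fs = {}
        for i in nodes:
            rank = sum(w / out_sum[j] * fs[j] for j, w in in_edges[i] if out_sum[j] > 0)
            new_fs[i] = (1 - d) * fv[i] + d * fv[i] * rank   # Equation (5)
        converged = max(abs(new_fs[v] - fs[v]) for v in nodes) < tol
        fs = new_fs
        if converged:
            break
    return fs

def top_keywords(fs, t=20):
    """Step 5: sort nodes by final score and keep the top T words."""
    return sorted(fs, key=fs.get, reverse=True)[:t]
```

With the statistics of the previous sketch, the pipeline would read: keywords = top_keywords(f2n_rank_scores(fbeta_weight(atf, df), cooccurrence_edges(docs))).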

4. Experiment and Results
4.1. Experiment Description
To evaluate the proposed algorithm, the Tibetan religious domain is selected. Tibet is a nation where religion is universal, and religious activities have been an integral part of most residents' daily life; Tibetan religious keywords are microcosms of the Tibetan religious domain. The Tibetan religious domain corpora come from three websites and can be downloaded from the religion channels of those websites. The corpora are described in Table 1.

Table 1. The Corpora of the Tibetan Religious Domain

Corpora                        Number of texts   Number of words   Kinds of words
http://www.amdotibet.com/      437               368562            22814
http://www.tibetculture.net/   446               228651            18293
http://www.tibet.cn/           1230              683814            31755
Total                          2113              1281027           40722

4.2. Experiment Results
Two experiments are performed in this paper. The aim of experiment 1 is to show which approach is best at extracting domain keywords: after comparing the F2N-Rank, TextRank and ATF*DF algorithms in precision, F2N-Rank showed better results. Experiment 2 is conducted in order to show that F2N-Rank is better than F0.5N-Rank and F1N-Rank, namely that a DF-oriented weight is more suitable for domain keywords extraction. Experiment 2 showed that F2N-Rank has the best performance among F0.5N-Rank, F1N-Rank and F2N-Rank.

Experiment 1
To evaluate the performance of ranking Tibetan religious keywords, we conducted a performance measurement using precision.
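The precision measure can be expressed as a small sketch; the gold-standard keyword list and the choice of cutoffs below are hypothetical placeholders for illustration, not data from the paper.

```python
def precision_at_n(ranked_keywords, gold_keywords, n):
    """Fraction of the top-n ranked keywords that appear in the gold-standard set."""
    top_n = ranked_keywords[:n]
    hits = sum(1 for w in top_n if w in gold_keywords)
    return hits / len(top_n) if top_n else 0.0

def average_precision(ranked_keywords, gold_keywords, cutoffs=(10, 20, 30, 40, 50)):
    """Average precision over several top-N cutoffs (the cutoff values are only an example)."""
    return sum(precision_at_n(ranked_keywords, gold_keywords, n) for n in cutoffs) / len(cutoffs)
```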


Now, we discuss the evaluation of the three ranking algorithms that we compared: F2N-Rank, TextRank and ATF*DF. Results are shown in Fig. 2, which reports the precision for the top N keywords.

Figure 2. Algorithm comparison in precision

We can see that F2N-Rank clearly outperformed both TextRank and ATF*DF. For F2N-Rank, TextRank and ATF*DF, the average precisions are 78.6%, 62.2% and 49.2%, respectively; the improvement over TextRank is around 16% in average precision, and around 29% over ATF*DF. Using F2N-Rank for domain keywords extraction therefore showed better results. Table 2 shows the top 20 keywords of the Tibetan religious domain extracted using the F2N-Rank algorithm.

Experiment 2
The order of domain keywords is also very important because it can reflect features of the domain keywords. In order to illustrate that the keywords extracted using F2N-Rank are more distinguishing, experiments are conducted with β taken as 0.5, 1 and 2. F2N-Rank is the instance of FβN-Rank obtained by taking β as 2, and it is therefore more DF-oriented. Fig. 3 shows the term weights of the top ten keywords in the Tibetan religious domain using the FβN-Rank algorithm with β = 0.5, 1 and 2. Among the three lines, the term weight (Y) decreases most significantly with increasing rank (X) for F2N-Rank, so the keywords in F2N-Rank are more distinguishing. F2N-Rank has the best performance among F0.5N-Rank, F1N-Rank and F2N-Rank; Fig. 4 shows that the precision of F2N-Rank is the highest among F0.5N-Rank, F1N-Rank and F2N-Rank.

Figure 3. Term weights of the top ten keywords for the FβN-Rank algorithm (β = 0.5, 1, 2)

Table 2. Top 20 Tibetan Religious Keywords Extracted Using the F2N-Rank Algorithm

No.   Meaning
1     Tibetan
2     Panchen (an honorific title for a monk)
3     religion
4     Living Buddha
5     Rinpoche (a title of respect for a master in Tibetan Buddhism)
6     Geshe (a religious degree of a school of Tibetan Buddhism)
7     Tibet
8     Gelugpa (one of the four schools of Tibetan Buddhism)
9     Sajia Temple (a temple of the Sakyapas, a faction of Tibetan Buddhism)
10    Lama (a title given to a spiritual leader in Tibetan Buddhism)
11    Tiaoshen (a religious dance used to express stories of gods and ghosts)
12    Lhasa (capital of the Tibet Autonomous Region)
13    lha-rams-pa (the highest degree of Geshe)
14    Bonismo (a Tibetan religion)
15    Dazhao Temple (a Tibetan Buddhism temple)
16    Sun Buddha (a Tibetan traditional festival)
17    Reincarnation (the belief that after somebody's death their soul lives again in a new body)
18    Tar Temple (a Tibetan Buddhism temple)
19    great master (a title of a highly respected monk)
20    activity

Figure 4. FβN-Rank algorithm comparison in precision (β = 0.5, 1, 2)

Conclusions
Domain keywords extraction is important for many applications of Natural Language Processing. Domain keywords relate not only to term frequency but also to the relationships among words. In this paper, the F2N-Rank algorithm, inspired by TextRank, is proposed for extracting domain keywords. In F2N-Rank, word weights are taken into account, and the F2-measure (ATF, DF) is adopted to calculate word weights. Experiments show that F2N-Rank has the highest average precision (78.6%), about 16% over TextRank and 29% over ATF*DF; F2N-Rank clearly outperformed both TextRank and ATF*DF. The method is generic, in the sense that it can be applied to extract keywords in different domains.

Acknowledgements
The project was supported by the Key Program of the National Natural Science Foundation of China (Grant No. 61331013), the National Language Committee of China (Grant Nos. WT125-46 and WT125-11), and the Graduate Students Projects of the Minority Languages Branch, National Language Resource Monitoring & Research Center (Grant No. CML15A02).

References
1. Mihalcea R., Tarau P., "TextRank: Bringing order into texts", Association for Computational Linguistics, 2004.
2. Frank E., Paynter G.W., Witten I.H., Gutwin C., Nevill-Manning C.G., "Domain-specific keyphrase extraction", Proceedings of the 16th International Joint Conference on Artificial Intelligence, 1999, pp. 668-673.
3. Medelyan O., Frank E., Witten I.H., "Human-competitive tagging using automatic keyphrase extraction", Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 1318-1327.
4. Tomokiyo T., Hurst M., "A language model approach to keyphrase extraction", Proceedings of the ACL Workshop on Multiword Expressions, 2003.
5. Turney P., "Learning algorithms for keyphrase extraction", Information Retrieval, 2000, Vol. 2, pp. 303-336.
6. Turney P., "Coherent keyphrase extraction via web mining", Proceedings of the 18th International Joint Conference on Artificial Intelligence, 2003, pp. 434-439.
7. Tomokiyo T., Hurst M., "A language model approach to keyphrase extraction", Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Association for Computational Linguistics, 2003, Vol. 18, pp. 33-40.
8. Zha H., "Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering", Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002, pp. 113-120.
9. Wan X., Yang J., Xiao J., "Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction", Annual Meeting of the Association for Computational Linguistics, 2007, Vol. 45(1), p. 552.
10. Wan X., Xiao J., "Single document keyphrase extraction using neighborhood knowledge", AAAI, 2008, Vol. 8, pp. 855-860.
11. Liu F., Pennell D., Liu F., et al., "Unsupervised approaches for automatic keyword extraction using meeting transcripts", The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 620-628.
12. Sebastiani F., "Machine learning in automated text categorization", ACM Computing Surveys (CSUR), 2002, Vol. 31, pp. 1-47.
13. Gao Y., Liu J., Ma P.X., "The hot keyphrase extraction based on TF*PDF", Trust, Security and Privacy in Computing and Communications (TrustCom), 2011, pp. 1524-1528.
14. Hasan K.S., Ng V., "Conundrums in unsupervised keyphrase extraction: making sense of the state-of-the-art", Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Association for Computational Linguistics, 2010, pp. 365-373.
15. Sasaki Y., "The truth of the F-measure", Teach Tutor Mater, 2007, pp. 1-5.
16. Information on http://ictclas.org

Point Cloud Data Simplification Using Movable Mesh Generation

Huang Ming*, Yang Fang, Zhang Jianguang, Wang Yanmin

Beijing University of Civil Engineering and Architecture, Key Laboratory for Urban Geomatics of National Administration of Surveying, Mapping and Geoinformation, Beijing, 100044, China
Engineering Research Center of Representative Building and Architectural Heritage Database, the Ministry of Education, Beijing, 100044, China

Abstract
In order to realize massive point cloud data simplification, a new movable mesh generation algorithm was proposed. Firstly, the point cloud model was divided into a number of spatial grids. Secondly, the mesh generation was performed again as in the first step, but more finely and respecting the distance threshold, and the appropriate point of each grid generated by the second step was filtered out according to the weight values. At last, the position of the minimum point of the point cloud model's bounding box was moved, the mesh generation was done again, and the final points were filtered out. Experimental results indicate that the proposed simplification method is able to eliminate redundant data effectively. Its reduction performance is clearly superior to the distance simplification method of Geomagic Studio under the same required accuracy. It can be used in the real-time data acquisition process of 3D reconstruction.
Keywords: REVERSE ENGINEERING, POINT CLOUD, SIMPLIFICATION, MESH GENERATION, MOVABLE
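As a rough, minimal illustration of the spatial-grid idea described in the abstract (not the movable mesh generation method itself), the sketch below keeps one representative point per occupied cubic cell; the cell size and the use of the cell centroid as the representative point are assumptions made here.

```python
import numpy as np

def grid_simplify(points, cell_size):
    """Keep one representative point (the cell centroid) per occupied grid cell.

    points: (N, 3) array of x, y, z coordinates; cell_size: edge length of a cubic cell.
    """
    cells = np.floor((points - points.min(axis=0)) / cell_size).astype(np.int64)
    # Group points by cell index and average each group.
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    sums = np.zeros((inverse.max() + 1, 3))
    counts = np.zeros(inverse.max() + 1)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]
```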

1. Introduction
In recent years, the application of 3D laser technology has become more and more widespread in the field of Surveying and Mapping. However, because of the characteristics of large redundancy, the existence of errors and weak rules, processing point cloud data acquired by scanning directly would be a major expenditure of time and resources. So some preprocessing work should be carried out before the subsequent processing of point cloud data, which includes point cloud denoising, point cloud simplification and segmentation of point clouds, etc. Point cloud simplification is not only the most basic and important step, but also a hotspot in the field of Reverse Engineering.
The current methods are mainly concentrated in the following typical methods: the bounding box method, the simplification method based on geometric images, the simplification method based on curvature and
