
2019 International Conference on Computer Science, Communications and Big Data (CSCBD 2019) ISBN: 978-1-60595-626-8

Semantic Analysis of Tourism Vocabulary Based on Similar Words Calculation

Hui PENG¹ and Hong-yan PAN²

¹School of Tourism Sciences, International Studies University, Beijing
²Lingjiu Science and Technology Company, Beijing, China

Keywords: Tourism data mining, Similar words, Semantics, Semantic diagram.

Abstract. Tourism data mining is the process of extracting data relations from large volumes of tourism data. It can discover the implicit knowledge and rules hidden in the data. Discovering the semantic relations between tourism words is an important part of tourism data mining. This paper introduces skip-gram, the classical similar-word calculation model from natural language processing. Skip-gram does not consider part of speech, so when similar words are located close together in a sentence the model cannot identify them accurately. We therefore propose POS-skip-gram, a skip-gram model that incorporates Chinese part-of-speech information. Using this model and tourism data from the elong and ctrip websites, we have established a semantic relation map of tourism words. The map can serve as a basis for tourism data mining.

Introduction

Computing similar words can be used to mine the semantic relationships between words in an application field. At present, there are two main methods for calculating similar words: the dictionary-based method and the statistics-based method. For the dictionary-based method, in the calculation of Chinese similar words, Synonym Word Forest [1] and HowNet [2] can be used to obtain synonyms of target words. The dictionary-based method has high accuracy, but as network information grows, new network words, slang, code names, proper nouns and the like keep increasing and the meanings of existing vocabulary keep expanding, so the thesaurus method suffers from slow updates and a limited number of words. The statistical method calculates the similarity between words by counting how often each word co-occurs with other words. As the statistical corpus is enriched, this method obtains better and better results, which makes it especially suitable for the expanding network vocabulary and for computing statistics between words in a specific application field. In the statistical method, each word is transformed into a vector, and the similarity between two vectors is then used to determine whether the two words are synonyms. CBOW [3] and Skip-gram [4][5] are classical neural network probabilistic language models. Skip-gram predicts the n context words from the current word, and both CBOW and Skip-gram are used to calculate similar words.

Skip-gram Model

The Skip-gram model predicts the n words in the context of the current word. Here n is a constant that determines the size of the context window. The Skip-gram model has three layers: an input layer, a projection layer and an output layer. The input layer is the current word, the projection layer generates the word vector space, and the output layer is the context vocabulary of the current word.
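As a minimal sketch of how Skip-gram pairs the current word with its context, the following Python code enumerates (current word, context word) training pairs. It assumes the window covers n words on each side of the current word, which the text does not state explicitly; the function name `skipgram_pairs` and the sample sentence are illustrative only.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], n: int) -> List[Tuple[str, str]]:
    """Return (current_word, context_word) pairs.

    Assumption: the window spans n words on each side of the current word
    and is truncated at the sentence boundaries.
    """
    pairs = []
    for t, current in enumerate(tokens):
        lo = max(0, t - n)
        hi = min(len(tokens), t + n + 1)
        for k in range(lo, hi):
            if k != t:  # the current word is not its own context
                pairs.append((current, tokens[k]))
    return pairs


# Illustrative sentence, not taken from the paper's corpus.
sentence = ["visit", "the", "summer", "palace", "today"]
pairs = skipgram_pairs(sentence, n=1)
print(pairs)
```

Each pair corresponds to one prediction made by the output layer: given the word on the left, the model is trained to assign high probability to the word on the right.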
To further improve the accuracy of similar-word calculation, a Skip-gram language model with grammatical information, Pos-skip-gram [6], is used; its structure is shown in Figure 1. The model consists of an input layer, a filter layer, a part-of-speech tagging layer, a projection layer and an output layer.


Figure 1. Pos-skip-gram Model.

The filter layer filters out the symbols used in network language, making the input more standardized. The part-of-speech tagging layer uses the Chinese part-of-speech tag set of the Institute of Computing Technology of the Chinese Academy of Sciences to mark the part of speech [7], which yields better similar-word results. The tag set divides words into two categories: content words and function words. Content words include nouns, verbs, adjectives, adverbs, idioms and other special words. Function words include prepositions, conjunctions, interjections, articles, numerals and quantifiers. Nouns further include names of people, places and groups, proper nouns, etc. By marking the part of speech of the word at each position, the model obtains the word sequence posW_1, posW_2, posW_3, ..., posW_T. The 2c context words posW_{t+j} of posW_t (with -c ≤ j ≤ c and j ≠ 0) are then predicted from posW_t according to formula (1). The conditional probability P(posW_{t+j} | posW_t), defined by the SoftMax function in formula (2), is the basis of the whole model.
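The filter layer and the part-of-speech tagging layer can be sketched as follows. This is only an illustration under stated assumptions: the paper tags with the ICTCLAS/NLPIR tool, whereas the code uses a hypothetical toy lexicon (`TOY_LEXICON`), and the "word/tag" token form is one possible encoding of posW, not one confirmed by the source.

```python
import re
from typing import Dict, List

# Hypothetical toy lexicon standing in for a real tagger such as ICTCLAS/NLPIR.
TOY_LEXICON: Dict[str, str] = {
    "temple": "n",
    "park": "n",
    "visit": "v",
    "beautiful": "a",
}


def filter_symbols(tokens: List[str]) -> List[str]:
    """Filter layer: drop tokens with no letter or digit (emoticons, symbol runs)."""
    return [tok for tok in tokens if re.search(r"[^\W_]", tok)]


def tag_tokens(tokens: List[str], lexicon: Dict[str, str], default_tag: str = "x") -> List[str]:
    """Tagging layer: attach a part-of-speech tag to every word.

    Assumption: a tagged word posW is encoded as the string "word/tag".
    """
    return [f"{tok}/{lexicon.get(tok, default_tag)}" for tok in tokens]


raw = ["visit", "~~~", "beautiful", "temple", "^_^", "!!!"]
pos_sequence = tag_tokens(filter_symbols(raw), TOY_LEXICON)
print(pos_sequence)  # ['visit/v', 'beautiful/a', 'temple/n']
```

Because the tag is part of the token, the same surface word used as a noun and as a verb would receive two different vectors in the projection layer.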

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log P(\mathrm{posW}_{t+j} \mid \mathrm{posW}_{t}) \tag{1}$$
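The average log-probability of formula (1) can be evaluated numerically once a conditional probability is available. The sketch below uses the SoftMax form that the text attributes to formula (2), with separate input and output vectors per word. The toy vectors, the function names, and the truncation of the window at sequence boundaries are assumptions made for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax_prob(out_word: str, in_word: str,
                 v_in: Dict[str, Vector], v_out: Dict[str, Vector]) -> float:
    """P(out_word | in_word): softmax over dot products of output and input vectors."""
    scores = {
        w: sum(a * b for a, b in zip(vec, v_in[in_word]))
        for w, vec in v_out.items()
    }
    m = max(scores.values())  # subtract the maximum for numerical stability
    denom = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[out_word] - m) / denom


def average_log_prob(seq: List[str], c: int,
                     v_in: Dict[str, Vector], v_out: Dict[str, Vector]) -> float:
    """Average over positions t of the summed log P(seq[t+j] | seq[t]), j != 0, |j| <= c."""
    total = 0.0
    for t in range(len(seq)):
        for j in range(-c, c + 1):
            # Assumption: offsets that fall outside the sequence are skipped.
            if j != 0 and 0 <= t + j < len(seq):
                total += math.log(softmax_prob(seq[t + j], seq[t], v_in, v_out))
    return total / len(seq)


# Toy two-dimensional vectors, purely illustrative.
v_in = {"park/n": [1.0, 0.0], "lake/n": [0.8, 0.2], "visit/v": [0.0, 1.0]}
v_out = {"park/n": [0.9, 0.1], "lake/n": [1.0, 0.0], "visit/v": [0.1, 0.9]}
seq = ["visit/v", "park/n", "lake/n"]

objective = average_log_prob(seq, 1, v_in, v_out)
print(objective)
```

Training adjusts the vectors so that this quantity increases; it is always negative because every term is the logarithm of a probability below one.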

$$P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} \tag{2}$$

Here w_I is the input word posW_t, w_O is the output word posW_{t+j}, v_w and v'_w are the input and output vector representations of word w, and W is the number of words in the vocabulary.

Calculation of Similar Words Based on Pos-skip-gram Language Model

The high-dimensional word vectors generated by the Pos-skip-gram language model contain certain semantic relations. In addition, this section introduces the analysis of Chinese grammatical relations, so that grammatical-relation conditions are added to the distance calculation of word vectors. In the vector space, cosine similarity, given in formula (3), is used as the similarity measure, grammatical relation is used as the calculation criterion, and the TopN method is used to extract the best results.

wpos vpos Sim(,) wpos vpos  (3) ////wpos vpos TopN algorithm is one of the classical algorithms for selecting the best, and the TopN optimal terms are obtained through ranking as results [8]. In this paper, TopN is used in the calculation of similar words, and the first N words are selected as the result set after the whole word vector space is traversed and calculated by combining cosine similarity and part of speech information.

The Semantics of Tourism Vocabulary Calculated by Pos-skip-gram Model

The experimental data used in this paper are the elong and ctrip travel corpora. The ICTCLAS/NLPIR word segmentation tool of the Chinese Academy of Sciences is used for word segmentation and part-of-speech tagging [9][10]. The size of the corpus is 2 GB. Part of the trained word set is shown in Figure 2.

Attractions Keywords Chinese National Chinese National Park, National Museum, National Exhibition Hall, Sculpture square, Park Tropical forest, Water cave, Panlong Waterfall, Alishan Shenmu, Wuyuan rock painting Western Cherry Garden, Northern Diversion Lake Scenic Area, Southern Zhongshan Yuyuantan Island, Spring Garden Skydiving viewing tower, Huanhu Road, Bungee tower Grand View Garden Courtyard area, Natural scenic spot, Buddhist temple scenic spot, Temple Scenic Area Liantang Huayu Scenic Area, Central Island Scenic Area, Longyige Scenic Area, Longtan Park Wanliutang Scenic Area, Dragon Word Stone Forest Scenic Area Xiang Mountain Biyun Temple, Five hundred Luo Hantang, Mirror Jingzhao Temple, tranquil heart studio Beijing , Hippo Pavilion, Tiger Mountain, Bear mountain, Antelope Hall Wan Chunting, Double octagonal octagonal, Blue tile pavilion with round eaves, Beijing Children's Palace Beijing Pacific Hongshihuan , Seal show, Dynamic cinema Underwater World Yuetan Park LingXing Gate, Dian pool, Served hall, Shenku, Sacrifice pavilion, Clock tower Beijing Garden Expo Yongding Tower, Jinxiu Valley Park Park Altar, Xitianmen, Beitianmen, Shenku Shen Kitchen, Festival mural, Jade Garden Qinglian Island, Mingyue Island, Fuyin Zizhuyuan, Chengbi Mountain House, Zizhuyuan Park Yanshiyuan Anti-Japanese Chinese People's War of Resistance Against , Sculpture group, Central Square, Sculpture Garden Wanping City Wall, Green forest World Flower Grand Exotic flowers, Rare trees, Classic landscape garden View Garden Lingguang Temple, Chang'an Temple, Sanshan An, Dabei Temple, Longquan Temple, Park Xiangjie Temple, Baozhu hole, Zhengguo Temple Ditan Park Fangze Temple, Huangfu Temple, Zhai Palace, Botanical garden, Grove garden, Greenhouse area, Bonsai Garden, Temple of the Beijing Garden Park Reclining Buddha, Cao Xueqin Memorial Hall, Longjiao Temple Site Qiong island, Religious architecture, Chengguang Temple Jade Buddha, Beihai Mission City, One pool sanxian Mountain Xingguo Temple, Songbai Crossing Pavilion, GeYan Pavilion, Xi Li Pavilion, Lanting Eight Columns, Qingyun Tablets, Qinglianduo, Paint moon, Anzhi, Tang Huawu, Discovery Park, Xinghua Village, Huifangyuan Main mountain main lake Jingdong Grand Flying down, Cliff high, Mingtan Lianzhu, The plank road is suspended, Lake Mingjing Canyon Beijing Wildlife Scattered viewing area, Walking viewing area, Animal performance entertainment area, Park Popular science education area, Children's zoo Beijing Xishan East Fourth Tomb, Xishan Eight Hospital, Sanshan Wuyuan Forest Park Yaochi Shilian, Dragon Palace Harp, Flag scroll, Cave three columns, World Cave Shihua Hole Wonders Exhibition, Wildlife Show, Kistler Exhibition

Figure 2. Keywords part of tourist attractions.

The Pos-skip-gram model was adopted for training, with a window value of 5 and 200 layers of network depth for corpus learning, and the similar-word calculation results for the tourism vocabulary were obtained. The results are shown as a semantic relation diagram in Figure 3.


Figure 3. Semantic Relation among Tourism Words.

Summary

The mining and representation of the semantic relationship of tourism words is an important part of tourism data mining. By acquiring the semantic relationship between words, we can mine the internal connection of data, conduct more effective and intelligent management and retrieval of information, and develop better tourism planning and services.

Acknowledgement

This research was financially supported by the National Social Foundation (14BTQ031).

References

[1] Synonym Word Forest: http://baike.baidu.com/view/1355899.htm
[2] http://www.keenage.com/
[3] B. Yoshua, D. Réjean, V. Pascal, et al. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 2003, 3(6):1137-1155.
[4] B. Yoshua, B. Samy. Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks. Advances in Neural Information Processing Systems 12, 2001:400-406.
[5] T. Mikolov. Language Modeling for Speech Recognition in Czech. Master's thesis, Brno University of Technology, 2007.
[6] Hongyan Pan. Research and Application of Hybrid Recommendation Algorithm Based on SNS. Dissertation for the Master Degree, 2015 (in Chinese).
[7] http://ictclas.nlpir.org/nlpir/html/readme.htm
[8] T. Mikolov, K. Martin, C. Jan, et al. Recurrent Neural Network Based Language Model. In Proceedings of Interspeech, 2010:1045-1048.
[9] Zhang Huaping, Gao Kai, Huang Heyan, et al. Big Data Search and Mining. Science Press, 2014 (in Chinese).
[10] http://ictclas.nlpir.org/
