2014 IEEE International Conference on Data Mining Workshop

Towards Summarizing Popular Information from Massive Tourism Blogs

Hua Yuan, Hualin Xu, Yu Qian, Kai Ye
School of Management & Economics, University of Electronic Science and Technology of China, Chengdu 610054, China
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—In this work, we propose a research method to summarize popular information from massive tourism blog data. First, we crawl blog contents from websites and segment each of them into a semantic word vector. Then, we select the geographical terms in each word vector into a corresponding geographical term vector, and present a new method to explore the hot tourism locations and, especially, their frequent sequential relations from a set of geographical term vectors. Third, we propose a novel word vector subdividing method to collect the local features for each hot location, and present a new method based on the measurement of max-confidence to identify the ToIs for each hot location from its local features.

Keywords—blog mining; hot tourism locations; things of interest; max-confidence

I. INTRODUCTION

In recent years, tourism has been ranked as the foremost industry in terms of volume of online transactions [1], and most tourism sites (such as blog.tripadvisor.com and www.travelblog.org) enable consumers to post blogs to exchange information, opinions and recommendations about tourism destinations, products and services within web-based communities. Meanwhile, some readers are likely to enjoy a high-quality travel experience through others' blogs. By obtaining reference knowledge from these blogs, individuals are able to visualize and manage their own travel plans. For instance, a person can find places that attract him in other people's travel routes, and schedule an efficient and convenient (even economical) path to reach these places.

In this work, by deeming each tourism location in one's targeted destination as a travel "topic", and the ToI as some specially interesting local features associated with that location, we propose a research framework to summarize the popular tourism information from blogs as a whole. First, we crawl blog contents from websites and divide each of them into a semantic word vector. Then, we select the geographical terms from each blog vector to form a geographical data set. This data set has two characteristics: all the elements in the i-th record are the geographical terms mentioned in the i-th blog, and each record is transactional. More importantly, the elements in each record keep their positional order as in the original blog, so that we can mine the frequent sequential relationships of some hot geographical terms. In reality, such a sequential relation can be seen as a travel route. Third, we propose a vector subdividing method to collect the data set of local features for each hot location. The significant result of this method is that it shields the impact of irrelevant word co-occurrences which have very high frequency. Further, we introduce the metric of max-confidence to identify the Things of Interest (ToI) associated with each location from the collected data. We illustrate the benefits of this approach by applying it to a Chinese online tourism blog data set. The experimental results show that the proposed method can be used to explore the hot locations, as well as their sequential relations and corresponding ToI, efficiently.

II. RELATED WORK

A. Blog summarization and text mining

In the blog summarization literature, the basic technology used in online text processing is text mining [2], [3], which is used to derive insights from user-generated contents and primarily originated in the computer science literature [4], [5]. Some previous research has therefore focused on automatically extracting the opinions in online contents [6] and the hot topics [7]. The methods used in blog mining not only involve reducing a larger corpus of multiple documents into a short paragraph conveying the meaning of the text, but are also interested in the features or objects on which customers have opinions. In particular, some research has focused on mining tourism blogs for better successors' decision-making [8].

One important application of blog mining is sentiment analysis, which judges whether an online content expresses a positive, neutral or negative opinion [9]. Recently, sentic computing [10] has brought together lessons from both affective computing and common-sense computing to grasp both the cognitive and affective information (termed semantics and sentics) associated with natural language opinions and sentiments [11], [12], which involves a deep understanding of natural language text by machine.

B. Topic model and feature selection

A topic model is a type of statistical model for discovering the "topics" that occur in a collection of documents, and topic modeling is a way of identifying patterns in a corpus. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann [13]; Latent Dirichlet Allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by Blei et al. [14].

LDA is an unsupervised learning model based on the intuition that documents are represented as mixtures over latent topics, where topics are associated with a distribution over the words of the vocabulary. Therefore, it is good at finding word-level topics [15], and many latent variable models have been proposed on top of it [16].

Another type of method looks through a corpus for clusters of words and groups them together by a process of similarity or relevance, for example, frequent pattern analysis [17] and co-expression analysis [18], [19]. In these methods, a text or document is always represented as a bag of words, which raises two severe problems: the high dimensionality of the word space and the inherent data sparsity. In the literature, feature selection is an important technology used to deal with these problems [20].

However, there is little impressive research on providing blog readers the valuable knowledge that they are personally interested in, i.e., the common topics in massive contents, as well as the special local features of these topics.

III. THE METHODOLOGY

A. Problem statement

Given a set of tourism blogs D, assume that each blog in it can be represented by a word vector; thus D can be represented by a set of vectors B = {b_1, ..., b_i, ..., b_|B|}, and the set of all items in B is Σ_B. Two metrics are used:

∙ supp(X) (with a predefined threshold min_supp), which measures the frequency of itemset X;
∙ conf_{x,y} (with a threshold θ_0), which measures the dependency of item x on y (or vice versa).

The research problem can be specified as two subtasks:

∙ for the data set B, find a set of frequent geographical terms {l^f} ⊆ Σ_B (i.e., hot locations in D) such that supp({l^f}) ≥ min_supp; the position relations of these frequent geographical terms are studied as well;
∙ for each term l^f, find an appropriate set of terms {t} ⊆ Σ_B such that conf_{l^f, t} ≥ θ_0.

B. Research framework

The research framework presented in this work has three parts: blog extraction and word segmentation (BEWS), frequent travel routes mining (FTRM) and interesting things detection (ITD). In BEWS, each piece of blog is segmented into a set of semantic words so that it can be transformed into a word vector, in which the elements are only semantic words and necessary punctuation marks after data cleaning. The FTRM subsystem is introduced to mine the travel routes from the blog-generated word vectors. The ITD subsystem is used to mine the ToIs for each hot location.

IV. BLOG CONTENTS EXTRACTION

In the BEWS subsystem, three subtasks of blog extraction, word segmentation and data cleaning are involved to transform a piece of blog into a word vector.

Web crawling technology can help people extract information from websites. In this work, it is used to obtain large-scale user-generated blogs from a tourism website. All the blogs are crawled into an initial data set D.

Word segmentation usually involves the tokenization of the input text into words at the initial stage of text analysis for NLP tasks [21]. In this work, it is the problem of dividing a string of written language into component units.

We treat data cleaning as the equivalent task of judging the usefulness of a component unit generated by the segmentation process. Here, we simply keep the following as the useful word segments:

∙ Semantic words (phrases) [22], [23]: Our goal is to find the hot tourism locations and the ToI associated with them. In tourism blogs, almost all of these two things are presented in the form of nouns.
∙ Punctuation marks [24]: In any text-based document, a period, a question mark, or an exclamation mark is a real sentence ending. Therefore, people can take them as the sentence boundaries. We use the symbol "∥" to represent all types of these reserved punctuation marks.

After data processing with the BEWS subsystem, a set of word vectors can finally be generated:

    B = {b_1, b_2, ..., b_|B|}.    (1)

All the items in B form Σ_B = ∪_{i=1}^{|B|} {w_ij}, where w_ij is the j-th term in b_i.

V. FREQUENT TRAVEL ROUTES MINING

A. Non-geographical term elimination

To find the frequent geographical terms in the word vectors, a geographical name table is needed. This table can be extracted temporarily from an official travel guide provided by the local government, or provided by a creditable third party, for example, www.geonames.usgs.gov or maps.google.com.

Given a geographical name table denoted by GNT, we first use it to filter out the non-geographical terms from b_i to form a geographical term vector:

    b_{g_i} = b_i ∩ GNT,    (2)

then a geographical data set of B is generated:

    B_g = ∪_{i=1}^{|B|} {b_{g_i}}.    (3)

Different from the traditional method, in B_g all the elements in the i-th record are the geographical terms mentioned in the i-th blog and, more importantly, the elements in each record keep their positional order as in the original blog.
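To make the BEWS step concrete, the following is a minimal sketch of how a blog post could be turned into a word vector of relation (1). The jieba toolkit is an assumed, illustrative choice of Chinese segmenter (the paper does not specify its tool), and "||" stands for the boundary symbol ∥.

```python
# Minimal BEWS sketch: segment a blog post and keep only nouns (candidate
# locations and ToI) plus sentence-ending punctuation, mapped to "||".
import jieba.posseg as pseg

SENTENCE_ENDINGS = set("。！？.!?")

def blog_to_word_vector(text):
    vector = []
    for word, flag in pseg.cut(text):
        if word in SENTENCE_ENDINGS:
            vector.append("||")        # keep the sentence boundary mark
        elif flag.startswith("n"):     # jieba noun tags: n, nr, ns, nt, ...
            vector.append(word)        # keep semantic words (nouns)
    return vector
```

Running a function like this over all crawled blogs would yield the word vector set B of relation (1).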

All the items in B_g form Σ_{B_g} = ∪_{i=1}^{|B_g|} {l_ij}, where l_ij is the j-th term in b_{g_i}.

Example 1: Given GNT = {A, B, C, D, E, F} and a data set B composed of five word vectors as in Table I, we can obtain the geographical data set shown in Table II with relation (2).

Table I
DATA SET B.
b_1: {a1 A a2 ∥ B b1 b2 ∥ C c1 ∥ D}
b_2: {A a1 a2 ∥ B b1 ∥ D ∥ E}
b_3: {A a1 ∥ b1 B b2 ∥ C c1 A c2}
b_4: {B b1 ∥ D d1 E}
b_5: {a1 a2 A a3 ∥ B b1 b2 ∥ C c1 ∥ F}

Table II
DATA SET B_g.
b_{g_1}: {A B C D}
b_{g_2}: {A B D E}
b_{g_3}: {A B C A}
b_{g_4}: {B D E}
b_{g_5}: {A B C F}
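A small sketch reproducing Example 1: applying relation (2) to the vectors of Table I (with "||" for the boundary mark ∥) yields exactly the geographical term vectors of Table II.

```python
# Relation (2): b_gi = b_i ∩ GNT, keeping the original positional order.
GNT = {"A", "B", "C", "D", "E", "F"}

B = [
    ["a1", "A", "a2", "||", "B", "b1", "b2", "||", "C", "c1", "||", "D"],
    ["A", "a1", "a2", "||", "B", "b1", "||", "D", "||", "E"],
    ["A", "a1", "||", "b1", "B", "b2", "||", "C", "c1", "A", "c2"],
    ["B", "b1", "||", "D", "d1", "E"],
    ["a1", "a2", "A", "a3", "||", "B", "b1", "b2", "||", "C", "c1", "||", "F"],
]

B_g = [[w for w in b if w in GNT] for b in B]   # relation (2), per blog
# B_g == [['A','B','C','D'], ['A','B','D','E'], ['A','B','C','A'],
#         ['B','D','E'],     ['A','B','C','F']]  -> Table II
```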

B. Hot tourism location mining

From the perspective of database technology, a word vector b_i ∈ B is a transactional data record, and so is the geographical term vector b_{g_i} ∈ B_g. Therefore, we can mine the itemsets that appear frequently in B_g as the hot tourism locations for common people.

The frequent k-itemset X in a data set B is denoted as

    FI^(k)(B) = {X | X ⊆ Σ_B, |X| = k, supp(X) ≥ min_supp},    (4)

where supp(X) is the support of X in data set B and min_supp is a predefined threshold. FI^(k) can be defined similarly on any transactional data set. In particular, the frequent 1-itemsets in B_g, i.e., FI^(1)(B_g), can be seen as the hot tourism locations. In Example 1, we obtain FI^(1)(B_g) = {A, B, C, D} with min_supp = 60%.

Note that frequent pattern mining technology is not the main concern of this work; any feasible algorithm can be used depending on the contents of B.
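As a sketch, the frequent 1-itemsets of relation (4) can be counted directly: each record contributes at most one occurrence per term, and a term is a hot location when its relative support reaches min_supp.

```python
# Frequent 1-itemsets (hot locations) over the geographical data set B_g.
from collections import Counter

def hot_locations(B_g, min_supp):
    n = len(B_g)
    counts = Counter()
    for record in B_g:
        counts.update(set(record))   # presence per record, not multiplicity
    return {t for t, c in counts.items() if c / n >= min_supp}

# For Example 1: hot_locations(B_g, 0.6) == {'A', 'B', 'C', 'D'}
```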

C. Travel routes generation

From a semantic perspective, the position relationship of all the elements in vector b_{g_i} shows the real travel sequence (location correlations) of blogger i. Thus, the common relations of all these geographical terms should lie in B_g. Generally, the basic correlation between two locations is the co-occurrence and the adjacent position relationship of their representative terms in B_g. Given two locations l_ip ∈ b_{g_i} and l_iq ∈ b_{g_i}, the adjacent position relationship means that l_ip and l_iq are mentioned frequently in D and |pos(l_ip) - pos(l_iq)| = 1. In the following, we propose a new method to reveal these location correlations.

First, we filter out all the unpopular locations (infrequent geographical terms) from b_{g_i} to get a simplified frequent geographical term vector:

    b_{f_i} = b_{g_i} ∩ FI^(1)(B_g).    (5)

Here, the elements in FI^(1)(B_g) indicate the common and frequent concerns of the bloggers (travellers). Thus, relation (5) tells us the hot locations mentioned in the i-th blog.

Lemma 1: b_{f_i} ⊆ b_{g_i} ⊆ b_i.
Proof: According to (2) and (5), Lemma 1 holds true.

Further, we can calculate

    B_f = ∪_{i=1}^{|B|} {b_{f_i}}.    (6)

All the items in B_f form Σ_{B_f}.

Next, assume that n_i hot locations appear sequentially in the i-th blog: b_{f_i} = {l^f_i1 l^f_i2 ... l^f_in_i}. To keep the position information of these hot locations in b_{f_i}, we transform b_{f_i} into

    b̃_{f_i} = {l^f_i1 l^f_i2, l^f_i2 l^f_i3, ..., l^f_i(j-1) l^f_ij, l^f_ij l^f_i(j+1), ...},    (7)

where "l^f_ij l^f_i(j+1)" in b̃_{f_i} means that the two hot locations l^f_ij and l^f_i(j+1) are mentioned sequentially in blog b_i. Similarly,

    B̃_f = ∪_{i=1}^{|B|} {b̃_{f_i}}.    (8)

For Example 1, if FI^(1)(B_g) = {A, B, C, D}, the calculation results of B_f and B̃_f are shown in Tables III and IV respectively. In Table IV, if we set min_supp = 40%, then FI^(1)(B̃_f) = {AB, BC, BD}. Any element in FI^(1)(B̃_f) indicates a correlation between two hot locations which are mentioned sequentially and frequently by bloggers.

Table III
DATA SET B_f.
b_{f_1}: {A B C D}
b_{f_2}: {A B D}
b_{f_3}: {A B C A}
b_{f_4}: {B D}
b_{f_5}: {A B C}

Table IV
DATA SET B̃_f.
b̃_{f_1}: {AB, BC, CD}
b̃_{f_2}: {AB, BD}
b̃_{f_3}: {AB, BC, CA}
b̃_{f_4}: {BD}
b̃_{f_5}: {AB, BC}

From a network perspective, the hot locations in FI^(1)(B_g) and their correlations in FI^(1)(B̃_f) can form a route network G = (V, E) by setting the vertex set V = FI^(1)(B_g) and the edge set E = FI^(1)(B̃_f). To facilitate the calculation, the weight of edge (l^f_ij, l^f_i(j+1)) can be set as w = supp({l^f_ij l^f_i(j+1)}).
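The following sketch chains relations (5), (7) and (8) and weights each edge by the support of its sequential pair; on the Example 1 data it returns exactly the three edges of the weighted route network shown next.

```python
# Relations (5)-(8): simplify each geographical vector to its hot locations,
# form the adjacent sequential pairs, and keep the frequent pairs as the
# weighted edges of the route network G = (V, E).
from collections import Counter

def route_edges(B_g, hot, min_supp):
    n = len(B_g)
    pair_counts = Counter()
    for b_g in B_g:
        b_f = [l for l in b_g if l in hot]                          # relation (5)
        pair_counts.update({(a, b) for a, b in zip(b_f, b_f[1:])})  # relation (7)
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_supp}

# Example 1, hot = {'A','B','C','D'}, min_supp = 0.4:
# {('A','B'): 0.8, ('B','C'): 0.6, ('B','D'): 0.4}
```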

For the FI^(1)(B̃_f) generated from the data in Table IV (min_supp = 40%), a weighted route network is formed as follows:

    A --0.8--> B,  B --0.6--> C,  B --0.4--> D.

VI. TOI EXTRACTION

A. Hot-location-based blog word vector subdividing

As is well known, if there are too many hot tourism locations and popular interesting things in the same word vector, the impact of noisy correlations between elements is inevitable when mining the frequent patterns directly from B. Fortunately, bloggers tend to present interesting things around (or near) the appearances of a hot location in their blogs. Moreover, if the contents of some adjacent sentences are about the same hot location (topic), then these sentences may constitute a topic domain for the location. Such a topic domain is usually bounded by a set of punctuation marks. We can therefore expect that the closest local features (ToI) for each hot location exist in such a topic domain and are bounded by a pair of adjacent punctuation marks.

According to Lemma 1, the elements in b_i can be classified into two types (overlap is allowed): hot locations (i.e., elements also existing in b_{f_i}) and the others. Therefore, b_i can be subdivided into sub-itemsets according to the topic boundaries of its n_i hot locations (see the subdivision below). For the j-th hot location l^f_ij, we just need to cut out the adjacent elements which belong to the same topic domain as l^f_ij. Since blog sentences usually end with punctuation marks, the divided sub-itemset for l^f_ij can start at the first punctuation mark before l^f_ij (the first term denoted by w_ij_s) and end at the first punctuation mark just before the next hot location l^f_i(j+1) (the last term denoted by w_ij_e):

    b_i = {..., ∥ w_ij_s, ..., l^f_ij, ..., w_ij_e ∥, ..., l^f_i(j+1), ...},

where the span from w_ij_s to w_ij_e is the j-th sub-itemset, l^f_ij ∈ FI^(1)(B_g), j = 1, ..., n_i, and the symbol "∥" denotes the punctuation marks. According to the usual writing style of bloggers, we can expect that the ToI for location l^f_ij lie in the itemset {w_ij_s, ..., l^f_ij, ..., w_ij_e} with high possibility. More generally, given a hot location l^f ∈ FI^(1)(B_g), we define its cut vector in b_i as:

    Cut(l^f | b_i) = {w_ij_s, ..., l^f, ..., w_ij_e}.    (9)

All the cut vectors for l^f from B form:

    Cut(l^f) = ∪_i {Cut(l^f | b_i)}.    (10)

All the items in Cut(l^f) form Σ_{Cut(l^f)} = ∪ {w_ij}, where w_ij is the j-th term in Cut(l^f | b_i).

Obviously, there are two key elements in a cut vector: a hot location term as the semantic key word, and two punctuation marks as the semantic topic boundaries for that key word. In fact, Cut(l^f) can be deemed a set of local contexts for the term l^f, which refers to either an ordered sequence or an unordered set of other useful words in the same sentence (or in a set of adjacent sentences), such that they co-occur, have syntactic dependencies, or both [25], [26]. Consequently, we can expect that all the potential ToI about location l^f lie in Cut(l^f).

In Example 1, the hot locations are FI^(1)(B_g) = {A, B, C, D} with min_supp = 60%; the data sets of local features for all the hot locations are shown in Tables V-VIII.

Table V
CUT(A).
b_1: {a1 A a2}
b_2: {A a1 a2}
b_3: {A a1}, {C c1 A c2}
b_4: {}
b_5: {a1 a2 A a3}

Table VI
CUT(B).
b_1: {B b1 b2}
b_2: {B b1}
b_3: {b1 B b2}
b_4: {B b1}
b_5: {B b1 b2}

Table VII
CUT(C).
b_1: {C c1}
b_2: {}
b_3: {C c1 A c2}
b_4: {}
b_5: {C c1}

Table VIII
CUT(D).
b_1: {D}
b_2: {D}
b_3: {}
b_4: {D d1 E}
b_5: {}
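A sketch of the cut operation of relations (9)-(10). It simplifies the topic domain to a single punctuation-bounded segment per occurrence, which already reproduces Tables V-VIII for Example 1; the paper's fuller definition lets a domain run up to the boundary just before the next hot location.

```python
# Relations (9)-(10), simplified: the cut vectors of a hot location are the
# punctuation-bounded segments of a word vector that contain it.
def cut_vectors(b, hot_loc):
    segments, seg = [], []
    for w in b:
        if w == "||":                 # topic/sentence boundary
            segments.append(seg)
            seg = []
        else:
            seg.append(w)
    segments.append(seg)
    return [s for s in segments if hot_loc in s]

# Example 1: cut_vectors(B[2], 'A') -> [['A','a1'], ['C','c1','A','c2']]
```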
B. Dependency between two frequently co-occurring terms

When bloggers record a tourist location l_1 in a city, some of them also tend to review a scenic spot l_2 which is related to l_1. In such a case, the appearances of the scenic spot (supp({l_2})) are dependent on the blogs in which both the location and the scenic spot were reviewed (supp({l_1, l_2})). That is to say, the scenic spot has a strong relation (dependency) with the tourism location: if somebody inquires about the location l_1, then the information of l_2 should be provided, since it is a necessary ToI of l_1. Here, we identify the dependency between l_1 and l_2 in the 2-itemset X = {l_1, l_2} with the metric of max-confidence, which is defined as [27]:

    mc = supp(X) / min{supp({l_1}), supp({l_2})}.    (11)

In this work, max-confidence is used to measure the dependency of a candidate ToI on a hot location: given a predefined value θ_0 ∈ [0, 1], we call t ∈ Σ_{Cut(l^f)} a ToI associated to hot location l^f if

    conf_{l^f, t} ≥ θ_0.    (12)
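A sketch of relations (11)-(12), with supports counted inside Cut(l^f) (one count per cut vector), as in the algorithm of the next subsection.

```python
# Max-confidence filtering of candidate ToI within the cut set of lf.
from collections import Counter

def extract_toi(cuts, lf, theta0):
    term_supp, pair_supp = Counter(), Counter()
    for cut in cuts:
        terms = set(cut)
        term_supp.update(terms)
        pair_supp.update((lf, t) for t in terms if t != lf)
    toi = set()
    for (_, t), co in pair_supp.items():
        mc = co / min(term_supp[lf], term_supp[t])   # relation (11)
        if mc >= theta0:                             # relation (12)
            toi.add(t)
    return toi

# Cut(D) in Table VIII: supp({D}) = 3, supp({d1}) = 1, supp({D, d1}) = 1,
# so mc = 1 / min(3, 1) = 1.0 and d1 is reported as a ToI of D.
```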

C. ToI Extraction Algorithm

Given a hot location l^f ∈ FI^(1)(B_g) and a potential ToI term t, the ToI extraction process mainly focuses on calculating the values of supp({t} | Cut(l^f)) and min{supp({l^f}), supp({t})}, and on analyzing the dependency of t on l^f in the data set Cut(l^f) with max-confidence.

Algorithm 1 ToI Extraction Algorithm
1: Input: word vector set B; geographical word vector set B_g; hot tourism location set Σ_{B_f}; θ_0;
2: Output: ToI set;
3: CUT = ∅;
4: for i = 1 to |B| do
5:   b_{f_i} = b_{g_i} ∩ FI^(1)(B_g);
6:   for j = 1 to |b_{f_i}| do
7:     Calculate Cut(l^f_ij) from b_i;
8:     Cut(l^f) ← Cut(l^f_ij), where l^f = l^f_ij;
9:   end for
10: end for
11: CUT ← Cut(l^f) for all l^f ∈ FI^(1)(B_g);
12: Calculate FI^(1)(B);
13: for i = 1 to |CUT| do
14:   Calculate FI^(1)(Cut_i) and FI^(2)(Cut_i);
15:   for each t ∈ FI^(1)(Cut_i) do
16:     if conf_{l^f_i, t} ≥ θ_0 then
17:       if conf_{t, t′} ≥ θ_0 for {t, t′} ∈ FI^(2)(Cut_i) then
18:         ToI(l^f_i) ← <l^f_i, t, t′>;
19:       else
20:         ToI(l^f_i) ← <l^f_i, t>;
21:       end if
22:     end if
23:   end for
24: end for
25: return ∪ ToI(l^f_i).

Algorithm 1 goes through three phases:

∙ finding the hot locations in blog i (Line 5);
∙ obtaining the cut vector for l^f_ij in blog i and putting it into the appropriate data set Cut(l^f) (Lines 6-10); all the Cut(l^f) are put into CUT (Line 11);
∙ analyzing the dependency of all the elements in Σ_{Cut(l^f)} on the hot location l^f (Lines 12-24).

To obtain the complete interdependent relationships in Cut(l^f), we set the max-confidence computation as a progressive process (Lines 16-22).

The algorithm shows that (1) analyzing the dependency of a potential ToI term t on l^f in the data set Cut(l^f) needs to scan all the items in B; and (2) the computation complexity is approximately #hot_words × |FI^(1)(Cut(l^f))| × |FI^(2)(Cut(l^f))|. That is, the efficiency of the algorithm is affected by the minimum support threshold used in mining the frequent 1- and 2-itemsets from Cut(l^f).
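Putting the pieces together, the following is a compact Python rendering of Algorithm 1 under the simplifications used in the sketches above: single-segment topic domains, supports counted within each Cut(l^f), and without the progressive <l^f, t, t′> chaining of Lines 17-18. It is an illustrative sketch, not the authors' implementation.

```python
# End-to-end sketch of Algorithm 1: hot locations -> cut vectors -> ToI.
from collections import Counter, defaultdict

def toi_extraction(B, GNT, min_supp, theta0):
    n = len(B)
    B_g = [[w for w in b if w in GNT] for b in B]              # relation (2)
    loc_supp = Counter()
    for record in B_g:
        loc_supp.update(set(record))
    hot = {l for l, c in loc_supp.items() if c / n >= min_supp}  # FI1(B_g)

    cut = defaultdict(list)                                    # Lines 3-11
    for b in B:
        segments, seg = [], []
        for w in b:
            if w == "||":
                segments.append(seg)
                seg = []
            else:
                seg.append(w)
        segments.append(seg)
        for s in segments:
            for lf in hot.intersection(s):
                cut[lf].append(s)

    toi = defaultdict(set)                                     # Lines 12-24
    for lf, cuts in cut.items():
        term_supp, pair_supp = Counter(), Counter()
        for c in cuts:
            terms = set(c)
            term_supp.update(terms)
            pair_supp.update((lf, t) for t in terms if t != lf)
        for (_, t), co in pair_supp.items():
            if co / min(term_supp[lf], term_supp[t]) >= theta0:
                toi[lf].add(t)
    return toi                                                 # Line 25
```

On the Example 1 data, toi_extraction(B, GNT, 0.6, 1.0) associates, for instance, d1 and E with the hot location D.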

413 J(3")(%:"#)*",% curve: “Central” and “”. In order to reserve as J":<()/%J(/)+6*+#)% 25(%H."#%&.#$% O"#(/% 2"3"**"BN/%D"*,1% '*(1.)%

98(#%:"6#)*0% '+38% 2.:<()/% C6MM%H.$5)0(+*% C.,,/% much valuable information, we take the top 15 geographical 2+*M+#N/%2*((5"6/(% 48+:(%A"6#)+.#% '(,(@*+)."#% 2+./5+#%A"6#)+.#%

71=(#)6*(,+#1% J+>)% D""10% A+#0%71=(#)6*(/% 7/)*"%C,+/)(*/% O6#$,(%J.=(*%'*6./(% 71=(#)6*(/% ?"6:5(*/% K+#)+/0% A6/.:+,% I+*1(#% ;+*+1./(% terms whose frequencies are bigger (and equal) than that of J.=(*% D.##.(%)5(%;""5% 7B+*1/% A+M(% K,0.#$% 4#"B%D5.)(% '+/),(% '+*)""#% 4,((8.#$%C(+6)0% ;*"L(:)% 4).):5% ;""5% A.:<(0% C+@0% &.#$1"3% K+#)+/0%I+*1(#% RK9%73(*.:+#%)"B#% K+.*0%)+,(% “Kowloon” as the hot locations in Hongkong. '68% S-% K+#)+/.+% ;*.#:(//% T- A.##.(% 736/(3(#)%;+*<%-./#(0,+#1% 2.:<()% A"#/)(*/% P#/8(:)."#% K+#)+/0,+#1Q I",1%A6/.:+,% I*.MM,0%?+,,(0% !",,0B""1%!")(,% 4,((8.#$%C(+6)0%'+/),(% C")),(% H(+8% 7#.3+)."#% '5+*+:)(*% -"#+,1%-6:<% ;+*<% '+*"6/(,% 43+,,%D"*,1% -63@"% ;"8:"*#% J")+).#$%)5(%D"*,1% K,"+)/% 2*((%!"6/(% 2"B#% H+.%&.#$% A+#"*%K.*(B"*

H+*$(%>"*)% 300 A+:+"%A6/(63% C((%K*.(#1/%'6,)6*(% -.#.#$%

J+.#:"+)%-",85.#% 250 2"6*% D"#1(*/% 2*"8.:+,% 456(%D+#% 2"**(#)% K./5% &+,(.1"/:"8(%

I",1>./5% 200 ":(+#+*.63U+)."#+,%)*(+/6*(% 9:(+#%GE8*(//% 7F6+*.63% K(**./%D5((,% I.+#)%;+#1+%

4(+%H."#%9:(+#%;+*<%'5.#+ Frequency

;+#1+% 150 45+*<% O(,,0>./5%

;(#$6.#% J+.#>"*(/)% C+65.#.+%4F6+*(% 100 9:(+#%

50

0 (a) Threshold of Max-confidence=0.6. Lo Wu Macao Airport Lantau Jordan Central Nathan Kowloon Paradise Admiralty The Peak Wan Chai Hung Hom Tao Heung Tao Disneyland Ocean Park Ocean 46##0%C+0% Repulse Bay Sheung Wan The Venetian Wax Museum Temple Street Lamma Island '+38% Causeway Bay Lan Kwai Fong The Peak Tram Peak The Langham Place Avenue of Stars Tsuen Wan Line Victoria Harbour

Hong Kong Island 43+,,%D"*,1% 48+:(%A"6#)+.#% Chungking Mansions

2"3"**"BN/%D"* (a) Frequent geographical terms. O"#(/%

((8.#$%C(+6)0% -./#(0,+#1%!",,0B""1%!")

-./#(0,+#1%

!+*@"6*%'.)0% '+/),(% 73(*.:+#%)"B# 9:(+#%;+*<% K+#)+/.+% '(,(@*+)."#% 4).):5% A.:<(0%A"6/(%

'(#)*+,% D+E%A6/(63% (b) Threshold of Max-confidence=0.95. 2/.3%45+%2/6.% 25(%;(+<% !"#$%&"#$% 7=(#6(%">%4)+*/% Figure 3. Extracted ToI for “Disney park”. A+:+"%

?.:)"*.+%!+*@"6*% '+6/(B+0%C+0%

7.*8"*)% &"B,""#% A"#$%&"<% the public transport (“Tung Chung Line”) are also presented, (b) Popular travel routes for Hongkong. which are ToI highly relevant to the Disney tourism and may provide richer information for people’s travel planning. Figure 2. Backbone-nodes-based travel routes for Hongkong. However, a relative loose threshold of max-confidence would Using the 15 hot locations as network nodes and analyzing cause the correlations between ToI to become more complex. their position relations in the blogs, we can sketch out a To obtain the clear relationship between a hot location and popular travel routes for blogers’ travelling in Hongkong its most close ToI, thus, a bigger threshold of max-confidence (Figure 2(b)). In which, each node represents a hot location, is needed (Fig 3(b)). As we can see, these ToI are the most and the node size shows the frequency of the location in all popular tourist projects for “Disney land”. the blogs. Obviously, the larger the node, the tour location

250 it represented is more frequent. The lines in Figure 2(b) Central Disneyland Tsim Sha Tsui Airport show that there exists some travel sequential relationships 200 Ocean Park The Peak Avenue of Stars between the connected nodes. For example, “Ocean park- Victoria Harbour Mong Kok 150 Causeway Bay Macao Disneyland”, “Central-Harbour City”, and so on. Harbour City Wax Museum Kowloon 100 D. ToI extraction # of extracted TOI We use the “Disney land”asanillustrationfortheToI 50 extraction. Firstly, the local features located nearby the 0 0.6 0.7 0.8 0.9 Threshold value of Max−Confidence ( θ ) term of “Disney land” and bounded by a pair of adjacent 0 punctuation marks are cut from each blog generated word Figure 4. Number of extracted ToI for the top-14 hot locations. vector to form the data set of 4<1(“5!$)D+ FG)H”).Then, the frequent 1- and 2-itemsets in 4<1(“5!$)D+ FG)H”) At different values of *0,thenumberofextractedToIfor are mined. Finally, the extracted ToI for the hot location of the top-14 hot locations in Hongkong (term “Hongkong” “Disney land” are shown in Figure 3. is ignored) are shown in Figure 4, in which, we can see Setting the threshold of max-confidence=0.6, people can that “Disneyland”, “Macao” and “Ocean park” are three hot obtain more ToIs around Disneyland (Fig 3(a)). Some nearby locations having more ToI than the others. Which means, tourist attractions, such as “Ocean Park”, and their primary when people tour in these locations, they may spend more ToI has also been extracted partially because they are de- time and money. This is in accordance with the common scribed frequently by blogers in the same context of a com- sense in nature. For the rest hot locations with few ToI, parative or associated manner. Interestingly, other matters people can bind them into several travel packages according related to Disney tourism, such as the ticket (“Tickets”) and to their geographic relationship so that these locations in the

E. Performance

In this section we discuss the measures used in evaluating the performance of the experiments.

1) Evaluation measure: For the task of ToI extraction, there are four possible outcomes for the extracted and inherent ToI, as shown in Table IX.

Table IX
CLASSIFICATION OF THE RESULTS OF A TOI EXTRACTION TASK.
                   Extracted             Not extracted
Relevant ToI       True-Positive (TP)    False-Negative (FN)
Irrelevant words   False-Positive (FP)   True-Negative (TN)

In the information retrieval literature, precision and recall are used to measure the extraction results:

    precision = #TP / (#TP + #FP),  recall = #TP / (#TP + #FN).    (13)

In the following, we compare the efficiency of our method, namely Term Vector Subdividing (TVS), with some classic methods, and show the comparison results for (1) the average precision and (2) the average number of ToI (#TP) in the extracted top-k terms.
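The evaluation of relation (13) is straightforward to compute; a sketch, assuming a hand-labelled set of relevant ToI per hot location:

```python
# Precision and recall over the outcomes of Table IX.
def precision_recall(extracted, relevant):
    tp = len(extracted & relevant)   # true positives
    fp = len(extracted - relevant)   # false positives
    fn = len(relevant - extracted)   # false negatives
    return tp / (tp + fp), tp / (tp + fn)
```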

2) Experiment results: Firstly, we conduct the comparison experiment between TVS and LDA on the data set of nouns. The results are shown in Figure 5.

[Figure 5: comparison of TVS and LDA at TOP5, TOP10, TOP20, TOP50 and TOP80 extracted terms: (a) average precision; (b) average #TP.]
Figure 5. Comparison results between TVS and LDA.

Notably, in extracting a small number of ToI (e.g., fewer than 50), the precision of TVS is superior to that of LDA (Fig 5(a)). This shows that the TVS method is good at extracting terms that are linked very closely with the key term (hot location). However, the accuracy of TVS begins to decrease as the top-k threshold increases (providing more ToI), whereas LDA performs better. One possible explanation is that TVS must run with a relatively low threshold of max-confidence when it is required to provide more extraction contents; obviously, this introduces an increasing amount of noise into the ToI candidates.

In addition, TVS is a feature selection method based on the metric of max-confidence. So we conduct some experiments to see the differences between the classic TF-IDF, DF and max-confidence metrics in ToI extraction. The data sets used here are Cut(l^f_i), where l^f_i (i = 1, ..., 14) is one of the top-14 hot locations in Hongkong (see Fig. 2(b)). The averaged results are shown in Figure 6.

[Figure 6: comparison of DF, TF-IDF and max-confidence at TOP5 to TOP80 extracted terms: (a) average precision; (b) average #TP.]
Figure 6. Comparison results between metrics.

∙ Max-confidence dominated all the other metrics, and the DF metric shows the worst performance. Max-confidence is better than TF-IDF because TF-IDF does not take into account interesting word co-occurrences containing terms with low IDF [28].
∙ As the number of requested features becomes bigger, the numbers of ToI extracted by the different metrics all keep increasing (Fig 6(b)). This result makes sense in that a bigger top-k threshold results in a larger set of candidates for real ToI.

VIII. CONCLUSION

In this work, we propose a research methodology to summarize the popular information in massive tourism blog data. To this end, we first crawl blog contents from a website and divide them into semantic word vectors as the data source. Second, we collect the geographical data from all the blog vectors, and mine the hot tourism locations and their frequent sequential relations in it. The results of this part can be used to summarize the popular information about "where to go" (trip routes) in a set of tourism blogs. Then, we propose a vector subdividing method to collect the local features for each hot location, and introduce the max-confidence metric to identify the ToI for the corresponding hot location. The captured ToI for each hot location answer the question of "what to play" at a specific tourism location. Notably, the significant result of this method is that the disturbances from highly frequent irrelevant words (noise) are shielded. Finally, we illustrate the benefits of this approach by applying it to a Chinese online tourism blog data set. The experimental results show that the proposed method can be used to explore the hot tourism locations (and their sequences) and their corresponding ToI from massive blogs efficiently. Future work will focus on reducing the algorithm complexity and presenting an optimization method to extract more precise correlations for a hot term.

ACKNOWLEDGMENTS

The work was partly supported by the National Natural Science Foundation of China (71271044/U1233118/71102055).

REFERENCES

[1] H. Werthner and F. Ricci, "E-commerce and tourism," Communications of the ACM, vol. 47, no. 12, pp. 101–105, 2004.
[2] Q. Cao, W. Duan, and Q. Gan, "Exploring determinants of voting for the "helpfulness" of online user reviews: A text mining approach," Decision Support Systems, vol. 50, no. 2, pp. 511–521, 2011.
[3] A. Ghose and P. G. Ipeirotis, "Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 10, pp. 1498–1512, 2011.
[4] M. Hu and B. Liu, "Mining and summarizing customer reviews," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168–177.
[5] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1–135, 2008.
[6] K. Dave, S. Lawrence, and D. M. Pennock, "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews," in Proceedings of the 12th International Conference on World Wide Web, 2003, pp. 519–528.
[7] N. Li and D. D. Wu, "Using text mining and sentiment analysis for online forums hotspot detection and forecast," Decision Support Systems, vol. 48, no. 2, pp. 354–368, 2010.
[8] N. Sharda and M. Ponnada, "Tourism blog visualizer for better tour planning," Journal of Vacation Marketing, vol. 14, no. 2, pp. 157–167, 2008.
[9] B. Liu, "Sentiment analysis and subjectivity," Handbook of Natural Language Processing, 2nd ed., 2010.
[10] E. Cambria and A. Hussain, Sentic Computing: Techniques, Tools, and Applications. Netherlands: Springer, 2012.
[11] E. Cambria, B. Schuller, Y. Xia, and C. Havasi, "New avenues in opinion mining and sentiment analysis," IEEE Intelligent Systems, vol. 28, no. 2, pp. 15–21, 2013.
[12] S. Poria, E. Cambria, G. Winterstein, and G. Huang, "Sentic patterns: Dependency-based rules for concept-level sentiment analysis," Knowledge-Based Systems, vol. 69, pp. 45–63, 2014.
[13] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50–57.
[14] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[15] A. Banerjee and S. Basu, "Topic models over text streams: A study of batch and online unsupervised learning," in SDM. SIAM, 2007, pp. 431–436.
[16] S. Moghaddam and M. Ester, "On the design of LDA models for aspect-based opinion mining," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012, pp. 803–812.
[17] H. D. Kim, D. H. Park, Y. Lu, and C. Zhai, "Enriching text representation with frequent pattern mining for probabilistic topic modeling," Proceedings of the American Society for Information Science and Technology, vol. 49, no. 1, pp. 1–10, 2012.
[18] M. Rokaya, E. Atlam, M. Fuketa, T. C. Dorji, and J.-i. Aoe, "Ranking of field association terms using co-word analysis," Information Processing and Management, vol. 44, no. 2, pp. 738–755, 2008.
[19] F. Figueiredo, L. Rocha, T. Couto, T. Salles, M. A. Gonçalves, and W. Meira Jr., "Word co-occurrence features for text classification," Information Systems, vol. 36, no. 5, pp. 843–858, 2011.
[20] T. Liu, S. Liu, Z. Chen, and W.-Y. Ma, "An evaluation on feature selection for text clustering," in ICML. AAAI, 2003, pp. 488–495.
[21] J. Gao, M. Li, C.-N. Huang, and A. Wu, "Chinese word segmentation and named entity recognition: A pragmatic approach," Computational Linguistics, vol. 31, pp. 531–574, 2005.
[22] R. Sproat, C. Shih, W. A. Gale, and N. Chang, "A stochastic finite-state word-segmentation algorithm for Chinese," Computational Linguistics, vol. 22, pp. 377–404, 1996.
[23] A. Stavrianou, P. Andritsos, and N. Nicoloyannis, "Overview and semantic issues of text mining," SIGMOD Record, vol. 36, no. 3, pp. 23–34, 2007.
[24] J. Tang, H. Li, Y. Cao, and Z. Tang, "Email data cleaning," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 489–498.
[25] H. T. Ng and H. B. Lee, "Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach," in Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, 1996, pp. 40–47.
[26] D. Lin, "Using syntactic dependency as local context to resolve word sense ambiguity," in Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, 1997, pp. 64–71.
[27] T. Wu, Y. Chen, and J. Han, "Re-examination of interestingness measures in pattern mining: A unified framework," Data Mining and Knowledge Discovery, vol. 21, no. 3, pp. 371–397, 2010.
[28] A. Pons-Porrata, R. Berlanga-Llavori, and J. Ruiz-Shulcloper, "Topic discovery based on text mining techniques," Information Processing & Management, vol. 43, no. 3, pp. 752–768, 2007.