Towards Summarizing Popular Information from Massive Tourism Blogs
2014 IEEE International Conference on Data Mining Workshop

Hua Yuan, Hualin Xu, Yu Qian, Kai Ye
School of Management & Economics, University of Electronic Science and Technology of China, Chengdu 610054, China
Email: [email protected], [email protected], [email protected], [email protected]
978-1-4799-4274-9/14 $31.00 © 2014 IEEE. DOI 10.1109/ICDMW.2014.29

Abstract—In this work, we propose a research method to summarize popular information from massive tourism blog data. First, we crawl blog contents from websites and segment each of them into a semantic word vector. Then, we select the geographical terms in each word vector into a corresponding geographical term vector and present a new method to explore the hot tourism locations and, especially, their frequent sequential relations from a set of geographical term vectors. Third, we propose a novel word vector subdividing method to collect the local features for each hot location, and present a new method based on the measurement of max-confidence to identify the ToIs for each hot location from its local features. We illustrate the benefits of this approach by applying it to a Chinese online tourism blog data set. The experimental results show that the proposed method can be used to explore the hot locations, as well as their sequential relations and corresponding ToIs, efficiently.

Keywords-blog mining; hot tourism locations; things of interest; max-confidence

I. INTRODUCTION

In recent years, tourism has been ranked as the foremost industry in terms of volume of online transactions [1], and most tourism sites (such as blog.tripadvisor.com and www.travelblog.org) enable consumers to post blogs to exchange information, opinions and recommendations about tourism destinations, products and services within web-based communities. Meanwhile, readers are likely to enjoy a higher-quality travel experience by learning from others' blogs. By obtaining reference knowledge from these blogs, individuals are able to visualize and manage their own travel plans. For instance, a person can find places that attract him in other people's travel routes, and schedule an efficient and convenient (even economical) path to reach these places.

In this work, by treating each tourism location in one's targeted destination as a travel "topic", and the ToI as the special local features of interest associated with that location, we propose a research framework to summarize the popular tourism information from blogs as a whole. First, we crawl blog contents from websites and divide each of them into a semantic word vector. Then, we select the geographical terms from each blog vector to form a geographical data set. This data set has two characteristics: all the elements in the i-th record are the geographical terms mentioned in the i-th blog, and each record is transactional. More importantly, the elements in each record keep the positional order they have in the original blog, so that we can mine the frequent sequential relationships among hot geographical terms. In practice, such a sequential relation can be seen as a travel route. Third, we propose a vector subdividing method to collect the data set of local features for each hot location. The significant result of this method is that it shields the impact of irrelevant word co-occurrences that have very high frequency. Further, we introduce the metric of max-confidence to identify the Things of Interest (ToI) associated with each location from the collected data.

II. RELATED WORK

A. Blog summarization and text mining

In the blog summarization literature, the basic technology used in online text processing is text mining [2], [3], which is used to derive insights from user-generated content and originated primarily in the computer science literature [4], [5]. Accordingly, some previous research focused on automatically extracting the opinions [6] and the hot topics [7] of online content. The methods used in blog mining not only reduce a large corpus of multiple documents into a short paragraph conveying the meaning of the text, but are also concerned with the features or objects on which customers have opinions. In particular, some research has focused on mining tourism blogs to support successors' decision-making [8].

One important application of blog mining is sentiment analysis, which judges whether an online content expresses a positive, neutral or negative opinion [9]. Recently, sentic computing [10] has brought together lessons from both affective computing and common-sense computing to grasp the cognitive and affective information (termed semantics and sentics) associated with natural language opinions and sentiments [11], [12], which involves a deep understanding of natural language text by machines.

B. Topic model and feature selection

A topic model is a type of statistical model for discovering the "topics" that occur in a collection of documents, and topic modeling is a way of identifying patterns in a corpus. An early topic model was probabilistic latent semantic indexing (PLSI), created by Thomas Hofmann [13]; Latent Dirichlet Allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed by David Blei et al. in 2002 [14]. LDA is an unsupervised learning model based on the intuition that documents are represented as mixtures over latent topics, where each topic is associated with a distribution over the words of the vocabulary. Therefore, it is good at finding word-level topics [15], and many latent variable models have been proposed on top of it [16].

Another type of method looks through a corpus for clusters of words and groups them together by a process of similarity or relevance, for example frequent pattern analysis [17] and co-expression analysis [18], [19]. In these methods, a text or document is represented as a bag of words, which raises two severe problems: the high dimensionality of the word space and the inherent data sparsity. In the literature, feature selection is an important technology for dealing with these problems [20].

However, there is little research on providing blog readers with the knowledge they are personally interested in, i.e., the common topics in massive contents as well as the special local features of these topics.

III. THE METHODOLOGY

A. Problem statement

Given a set of tourism blogs B, assume that each blog in it can be represented by a word vector; thus B can be represented by a set of vectors:

B = {b_1, b_2, ..., b_|B|}.  (1)

All the items in B form Σ_B = ∪_{i=1}^{|B|} {w_{ij}}, where w_{ij} is the j-th term in b_i. Two metrics are used:

∙ supp(X) (with a predefined threshold min_supp), which measures the frequency of the itemset X;
∙ mc(x, y) (with a threshold mc_0), which measures the dependency of item x on item y (or vice versa).

The research problem can be specified as two subtasks:

∙ for the data set B, find a set of frequent geographical terms {w^g} ⊆ Σ_B (i.e., hot locations in B) such that supp(w^g) ≥ min_supp; the positional relations of these frequent geographical terms are studied as well;
∙ for each term w^g, find an appropriate set of terms {w^l} ⊆ Σ_B such that mc(w^l, w^g) ≥ mc_0.

B. Research framework

The research framework presented in this work consists of three parts: blog extraction and word segmentation (BEWS), frequent travel routes mining (FTRM) and interesting things mining.

IV. BLOG CONTENTS EXTRACTION

In the BEWS subsystem, three subtasks, blog extraction, word segmentation and data cleaning, are involved in transforming a piece of blog into a word vector.

Web crawling technology helps people extract information from websites. In this work, it is used to obtain large-scale user-generated blogs from a tourism website. All the blogs are crawled into an initial data set B.

Word segmentation usually involves the tokenization of the input text into words at the initial stage of text analysis for NLP tasks [21]. In this work, it is the problem of dividing a string of written language into component units.

For data cleaning, we recast it as an equivalent task of judging the usefulness of each component unit generated by the segmentation process. Here, we simply keep the following as useful word segments:

∙ Semantic words (phrases) [22], [23]: Our goal is to find the hot tourism locations and the ToI associated with them. In tourism blogs, almost all of these are presented in the form of nouns.
∙ Punctuation marks [24]: In any text-based document, a period, a question mark, or an exclamation mark is a real sentence ending; therefore, they can be taken as sentence boundaries. We use the symbol "∥" to represent all types of these reserved punctuation marks.

After data processing with the BEWS subsystem, a set of word vectors can finally be generated: B = {b_1, ..., b_i, ..., b_|B|}, and the total set of items in B is Σ_B.

V. FREQUENT TRAVEL ROUTES MINING

A. Non-geographical terms elimination

To find the frequent geographical terms in the word vectors, a geographical name table is needed. This table can be extracted temporarily from an official travel guide provided by the local government, or obtained from a creditable third party, for example www.geonames.usgs.gov or maps.google.com.

Given a geographical name table denoted by GNT, we first use it to filter out the non-geographical terms from b_i to form a geographical term vector:

b_i^g = b_i ∩ GNT,  (2)

then a geographical data set of B is generated:
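For illustration, the filtering step of Eq. (2) can be sketched as follows; this is not the paper's implementation, and the place names, the toy GNT set and the function name are invented for the example. Because a Python list comprehension preserves order, the output vector keeps the positional order required for route mining:

```python
# Toy geographical name table (GNT); real tables would come from an
# official travel guide or a gazetteer service.
GNT = {"Chengdu", "Jiuzhaigou", "Leshan", "Emeishan"}

def geo_vector(word_vector, gnt):
    """Eq. (2): b_i^g = b_i ∩ GNT, with positional order preserved."""
    return [w for w in word_vector if w in gnt]

blog = ["hotel", "Chengdu", "panda", "Leshan", "buddha", "Emeishan"]
print(geo_vector(blog, GNT))  # ['Chengdu', 'Leshan', 'Emeishan']
```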
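The first subtask of Section III-A, finding hot locations whose support reaches min_supp, can be sketched with simple relative-support counting; the data set, the threshold value and the function name below are toy values invented for the example:

```python
from collections import Counter

def hot_locations(geo_dataset, min_supp):
    """Return locations whose relative support (the fraction of records
    that mention them at least once) is at least min_supp."""
    counts = Counter()
    for record in geo_dataset:
        counts.update(set(record))          # one count per record
    n = len(geo_dataset)
    return {loc: c / n for loc, c in counts.items() if c / n >= min_supp}

D = [["Chengdu", "Leshan"], ["Chengdu", "Emeishan"], ["Leshan", "Chengdu"]]
hot = hot_locations(D, min_supp=0.6)        # keeps Chengdu and Leshan
```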
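The second subtask screens candidate ToI terms against a max-confidence threshold mc_0. A minimal sketch, assuming the standard definition of max-confidence as the larger of the two directional confidences P(y|x) and P(x|y) (the helper names and the toy data are invented for illustration):

```python
def supp(dataset, itemset):
    """Relative support: fraction of records containing every item."""
    items = set(itemset)
    return sum(1 for r in dataset if items <= set(r)) / len(dataset)

def max_confidence(dataset, x, y):
    """mc(x, y) = supp({x, y}) / min(supp(x), supp(y)),
    i.e. the larger of the confidences P(y | x) and P(x | y)."""
    return supp(dataset, {x, y}) / min(supp(dataset, {x}), supp(dataset, {y}))

# Toy records: local-feature terms collected around one hot location.
D = [["temple", "incense"], ["temple", "incense"],
     ["temple", "tea"], ["garden", "tea"]]
print(max_confidence(D, "temple", "incense"))  # 1.0
```

Unlike raw co-occurrence counts, this measure is bounded by the rarer item's support, which is consistent with the paper's goal of shielding the effect of high-frequency but irrelevant co-occurrences.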