Segmentation of Review Texts by using Thesaurus and Corpus-based Word Similarity

Yoshimi Suzuki and Fumiyo Fukumoto
Interdisciplinary Graduate School of Medicine and Engineering, University of Yamanashi, Kofu, Japan

Keywords: Reviews, Thesaurus, Wikipedia.

Abstract: User reviews are now widely available on shopping and hotel reservation sites. However, with the exponential growth of information on the Internet, it is becoming increasingly difficult for a user to read and understand all of the potentially interesting material in a large collection of reviews. In this paper, we propose a method for segmenting review texts by guests' criteria, such as service, location and facilities. Our system first extracts words which represent criteria from hotel review texts. We focus on topic markers such as "ha" in Japanese to extract guests' criteria. The extracted words are then grouped into classes of similar words by using Japanese WordNet. For each hotel, the text consisting of all of the guest reviews is segmented into word sequences by using these criteria classes. Segmenting review texts is difficult because each review is short. We therefore used Japanese WordNet, extracted similar word pairs, and used the index words of Wikipedia. We performed text segmentation on hotel reviews. The results showed the effectiveness of our method and indicated that it can be used for summarizing reviews by guests' criteria.

1 INTRODUCTION

Recently, we can refer to user reviews on shopping and hotel reservation sites. Since users evaluate products or hotels against their own criteria rather than against the information which a contractor (seller or hotel owner) offers, reviews may contain much information that is not included in the contractor's explanation. These customer/guest reviews often include various pieces of information about products/hotels which differ from the commercial information provided by sellers/hotel owners, because customers/guests point them out according to their own criteria; e.g., service may be very important to one guest, such as a business traveler, whereas another guest is more interested in good value when selecting a hotel for his/her vacation. However, there are at least seven problems, as follows:

• There are a large number of reviews for each product/hotel.
• Each review text is short.
• There are overlapping contents.
• Wrong information may be described.
• The reviews are not always grammatical.
• The terms are not unified.
• There are many misspelled words.

Moreover, it is difficult to find a boundary for every item, because reviews may explain one item using two or more sentences, or explain two or more items in a single sentence.

In this paper, we propose a method for segmenting review texts by using guests' criteria, such as service, location and facilities.

2 RELATED WORK

Text segmentation is one of the challenging tasks of Natural Language Processing. It has been widely studied and many techniques (Kozima, 1993; Hearst, 1997; Allan et al., 1998) have been proposed. Hearst presented TextTiling (Hearst, 1997), which is based on lexical cohesion concerning the repetition of the same words in a document. Utiyama and Isahara proposed a statistical method for domain-independent text segmentation (Utiyama and Isahara, 2001). Hirao et al. proposed a method based on lexical cohesion and word importance (Hirao et al., 2000). They employed two different methods for text segmentation: one is based on lexical cohesion considering co-occurrences of words, and the other is based on changes in the importance of each sentence in a document.


3 SYSTEM OVERVIEW

Figure 1 illustrates an overview of our system. The system consists of two modules, namely "Extraction of criteria" and "Text segmentation".

Figure 1: System overview.

4 EXTRACTION OF CRITERIA

Firstly, we define criteria as the words which reviewers pay attention to in their reviews. In Japanese text, a criterion word is typically followed by the topic marker "ha". To extract criteria from reviews, we first locate the postpositional particle "ha" in the whole set of review texts. Next, we extract the words followed by "ha", and finally we collect, as criteria words, those extracted words which are index words of Japanese WordNet. Table 1 shows the extracted criteria words with the highest frequencies. A sketch of this extraction step is given after Table 1.

Table 1: Candidate words of criteria (top 5).

    No.  word        frequency
    1    room         56,888
    2    breakfast    25,068
    3    meal         17,107
    4    support      16,677
    5    location     14,866
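The paper gives no code for this step; the following is a minimal sketch under two assumptions: the reviews have already been tokenized by a Japanese morphological analyzer into (surface form, part of speech) pairs, and a set of Japanese WordNet index words is available. The names extract_criteria, tokenized_reviews and wordnet_index are illustrative, not from the original system.

    from collections import Counter

    def extract_criteria(tokenized_reviews, wordnet_index):
        # Count candidate criterion words: words immediately followed by the
        # topic particle "ha" that are also Japanese WordNet index words.
        counts = Counter()
        for tokens in tokenized_reviews:
            for (surface, _pos), (next_surface, _next_pos) in zip(tokens, tokens[1:]):
                if next_surface == "は" and surface in wordnet_index:
                    counts[surface] += 1
        # Most frequent candidates first, as in Table 1.
        return counts.most_common()

For example, a review fragment tokenized as [("部屋", "noun"), ("は", "particle"), ("広い", "adjective")] ("the room is spacious") would yield the candidate 部屋 (room), the most frequent criterion word in Table 1.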

5 SIMILAR WORD PAIR EXTRACTION

Review texts are written by many different people, and people may express the same thing with different expressions. For example, "heya", "oheya" and "ruumu" all have the same sense, i.e., room. Moreover, two words such as "kyakushitsu" (guest room) and "heya" (room), which have different senses in general, are often used in the same sense in the hotel review domain. We therefore collected similar words from hotel reviews by using Lin's method (Lin, 1998).

Some technical terms do not appear frequently in reviews even if we use large corpora. Therefore, we applied a smoothing technique to the hierarchical structure of semantic features. Firstly, we extracted similar word pairs using dependency relationships: the dependency relationship between two words is used for extracting semantically similar word pairs. Lin proposed the "dependency triple" (Lin, 1998). A dependency triple consists of two words, w and w', and the grammatical relationship r between them in the input sentence. ||w, r, w'|| denotes the frequency count of the dependency triple (w, r, w'), and ||w, r, *|| denotes the total occurrences of (w, r) relationships in the corpus, where "*" indicates a wild card.

We used three sets of Japanese case particles as r. Set A consists of 2 case particles, "ga" and "wo", which correspond to subject and object, respectively. Set B consists of 6 case particles, and Set C consists of 17 case particles. We selected the word pairs which were extracted by using two or three of these sets.

In order to extract the corresponding semantic feature of a new word, we extracted the dependency triples of the new word and the extracted words. Using the extracted words, many types of dependency triples are obtained. For extracting the similar words from the core thesaurus, I(w, r, w') is calculated by Formula (1):

    I(w, r, w') = -\log\big(P_{MLE}(B)\, P_{MLE}(A|B)\, P_{MLE}(C|B)\big) - \big(-\log P_{MLE}(A, B, C)\big)
                = \log \frac{||w, r, w'|| \times ||*, r, *||}{||w, r, *|| \times ||*, r, w'||}    (1)

where P_MLE refers to the maximum likelihood estimation of a probability distribution.

Let T(w) be the set of pairs (r, w') such that log( (||w, r, w'|| × ||*, r, *||) / (||w, r, *|| × ||*, r, w'||) ) is positive. The similarity Sim(w1, w2) between two words w1 and w2 is defined by Formula (2):

    Sim(w_1, w_2) = \frac{\sum_{(r,w) \in T(w_1) \cap T(w_2)} \big(I(w_1, r, w) + I(w_2, r, w)\big)}{\sum_{(r,w) \in T(w_1)} I(w_1, r, w) + \sum_{(r,w) \in T(w_2)} I(w_2, r, w)}    (2)

Table 2 shows the extracted pairs as similar word pairs. In Table 2, there are some notational variants. In general, the pair of "morning newspaper" and
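The paper does not show how Formulas (1) and (2) are computed; the following is a minimal sketch, assuming the dependency triples (w, r, w') have already been extracted from the parsed reviews and are given as a list of occurrences. The class name TripleCounts and its methods are illustrative assumptions, not the authors' implementation.

    import math
    from collections import defaultdict

    class TripleCounts:
        """Counts of dependency triples (w, r, w') and Lin-style similarity."""

        def __init__(self, triples):
            # triples: iterable of (w, r, w2) occurrences extracted from the corpus
            self.wrw = defaultdict(int)   # ||w, r, w'||
            self.wr = defaultdict(int)    # ||w, r, *||
            self.rw = defaultdict(int)    # ||*, r, w'||
            self.r = defaultdict(int)     # ||*, r, *||
            for w, r, w2 in triples:
                self.wrw[(w, r, w2)] += 1
                self.wr[(w, r)] += 1
                self.rw[(r, w2)] += 1
                self.r[r] += 1

        def info(self, w, r, w2):
            # I(w, r, w') of Formula (1)
            num = self.wrw.get((w, r, w2), 0) * self.r.get(r, 0)
            den = self.wr.get((w, r), 0) * self.rw.get((r, w2), 0)
            return math.log(num / den) if num and den else 0.0

        def features(self, w):
            # T(w): pairs (r, w') whose information value is positive
            return {(r, w2) for (x, r, w2) in self.wrw if x == w and self.info(w, r, w2) > 0}

        def sim(self, w1, w2):
            # Sim(w1, w2) of Formula (2)
            t1, t2 = self.features(w1), self.features(w2)
            shared = sum(self.info(w1, r, w) + self.info(w2, r, w) for (r, w) in t1 & t2)
            total = (sum(self.info(w1, r, w) for (r, w) in t1)
                     + sum(self.info(w2, r, w) for (r, w) in t2))
            return shared / total if total else 0.0

Since the shared features are a subset of each word's features, sim() returns a value between 0 and 1; word pairs scoring above a threshold would then be kept as similar word pairs, with the relation r restricted, as in the paper, to the particle sets A, B and C.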


Table 2: Results of extracting similar pairs using particle sets A, B and C.

    No.  word1  word2

We decide the border of criteria between two different criterion words. The following algorithm is how the border of criteria is decided:

    for(i=0; i
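The original loop above is cut off in the available text, so the following Python sketch is only an assumed reconstruction of this kind of boundary decision: it scans the word sequence of one hotel's concatenated reviews and places a border wherever the criteria class changes relative to the previous criterion word. The mapping criterion_class (from a word to its criteria class, built from WordNet, the similar word pairs and Wikipedia index words) is a hypothetical input, not the authors' code.

    def segment_by_criteria(words, criterion_class):
        # Split a word sequence into segments, starting a new segment whenever a
        # criterion word belongs to a different criteria class than the previous
        # criterion word seen.
        segments = []
        current = []
        prev_class = None
        for word in words:
            cls = criterion_class.get(word)   # None for non-criterion words
            if cls is not None and prev_class is not None and cls != prev_class:
                segments.append(current)      # criteria class changed: place a border here
                current = []
            if cls is not None:
                prev_class = cls
            current.append(word)
        if current:
            segments.append(current)
        return segments

In the paper, the criteria classes themselves are built with Japanese WordNet, the extracted similar word pairs and Wikipedia index words, which is what makes such a lookup usable even for short review texts.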

Figure 2: Text segmentation method.

1 http://travel.rakuten.co.jp/ We used the Rakuten travel review data provided by Rakuten Institute of Technology.
2 http://ja.wikipedia.org/ We used the Wikipedia data of 2012-03-26.


For words such as named entities which are not index words of Japanese WordNet, we used the index words of Wikipedia. We employed Lin's method (Lin, 1998) for extracting similar word pairs in hotel review texts.

We carried out experiments on dividing reviews by criterion. We used the review texts of 5 hotels; the average number of review texts per hotel was 51.2, and the number of criteria was 256. Table 4 shows the results of text segmentation.

Table 4: Results of text segmentation.

                 Our method           TextTiling
    Precision    863/1024 = 0.842     742/1024 = 0.725
    Recall       863/902  = 0.925     742/1119 = 0.663
    F-measure    0.882                0.692

We compared our method with TextTiling (Hearst, 1997), which is a well-known text segmentation technique. TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. TextTiling shows high performance for documents with rather long topics, such as magazine articles. However, TextTiling could not obtain good results here, as the review text for each criterion is very short. As can be seen clearly from Table 4, the results obtained by our method were much better than those of TextTiling. This demonstrates that lexical information such as the similarity between words, hypernyms of words and named entities is effective for text segmentation.
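As a quick check of the arithmetic in Table 4, the F-measure is the harmonic mean of precision and recall:

    F = \frac{2PR}{P + R} = \frac{2 \times 0.842 \times 0.925}{0.842 + 0.925} \approx 0.882 \quad \text{(our method)}, \qquad
    F = \frac{2 \times 0.725 \times 0.663}{0.725 + 0.663} \approx 0.692 \quad \text{(TextTiling)}.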

8 CONCLUSIONS

In this paper, we proposed a method for segmenting review texts by guests' criteria, such as service, location and facilities. The results showed the effectiveness of our method: it attained 0.84 precision, 0.92 recall and 0.88 F-measure, and it outperformed TextTiling, which is widely used for text segmentation. Future work will include: (i) applying the method to a large number of guest reviews for quantitative evaluation, and (ii) applying the method to other data, such as the grocery store data LeShop3 and TaFeng4 and the movie data MovieLens5, to evaluate the robustness of the method.

ACKNOWLEDGEMENTS

The authors would like to thank the referees for their comments on the earlier version of this paper. This work was partially supported by The Telecommunications Advancement Foundation.

REFERENCES

Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y. (1998). Topic detection and tracking pilot study final report. In the DARPA Broadcast News Transcription and Understanding Workshop.

Bond, F., Isahara, H., Uchimoto, K., Kuribayashi, T., and Kanzaki, K. (2009). Enhancing the Japanese WordNet. In The 7th Workshop on Asian Language Resources, in conjunction with ACL-IJCNLP.

Fukushima, T., Ehara, T., and Shirai, K. (1999). Partitioning long sentences for text summarization. Journal of Natural Language Processing (in Japanese), 6(6):131–147.

Hearst, M. A. (1997). TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Hirao, T., Kitauchi, A., and Kitani, T. (2000). Text segmentation based on lexical cohesion and word importance. Information Processing Society of Japan, 41(SIG3(TOD6)):24–36.

Kozima, H. (1993). Text segmentation based on similarity between words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 286–288.

Kudo, T. and Matsumoto, Y. (2002). Japanese dependency analysis using cascaded chunking. In CoNLL 2002: Proceedings of the 6th Conference on Natural Language Learning, pages 63–69.

Lin, D. (1998). Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 768–774.

Utiyama, M. and Isahara, H. (2001). A statistical model for domain-independent text segmentation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 499–506.

3 www.beshop.ch
4 aiia.iis.sinica.edu.tw/index.php?option=com_docman&task=cat_view&gid=34&Itemid=41
5 http://www.grouplens.org/node/73
