Using Flickr Geotags to Find Similar Tourism Destinations

POLITECNICO DI MILANO
Master of Science in Computer Engineering for the Communication
Department of Computer Engineering

Supervisor: Prof. Lorenzo Cantoni
Co-Supervisor: Dr. Davide Eynard

Master Thesis of: Leonardo Gentile, matricola 744177
Academic Year: 2010 - 2011

Thanks to Professor Cantoni, who gave me the great chance to undertake this thesis on a very interesting topic.

Thanks to Alessandro Inversini and Elena Marchiori for their advice in the communication field and for spreading my survey around the world.

Thanks to Eng. Giuseppe Moscato, aka PeppeSka, who kindly shared his homemade server, letting me gather one million tags from Flickr.

Thanks to Stefano Celentano, who kindly shared his internet connection.

Thanks to my family for supporting me during my long student life (yes, don’t worry, it’s over...).

A very special thanks to Davide Eynard, who patiently and always very kindly guided and advised me during the creation of this work.

Abstract

The amount of geo-referenced information available on the Web is constantly increasing due to the wide availability of location-aware mobile devices and map interfaces. This is enabling new search paradigms (e.g. “What is here”), but it is also generating a large number of unexplored geo-referenced collections. In particular, in photo collections like Flickr, the co-existence of geographical metadata and text-based annotations (tags) generates interesting location-driven trends and patterns in textual data. When enough information is available, analysis systems can identify these patterns and extract aggregate knowledge. This inspired me to create a novel method to extract representative place descriptions from users’ text annotations on Flickr geo-referenced photos. In this way, I attempt to predict similar locations based on the similarity of their respective descriptions.
The prototype has been implemented as a web-based tool and has been positively evaluated, through a survey, by more than a hundred users.

Contents

Abstract
1 Introduction
  1.1 The Web2rism project
  1.2 Motivations
  1.3 Objective
  1.4 Thesis Outline
2 Background
  2.1 Folksonomies
    2.1.1 Broad Folksonomy
    2.1.2 Narrow Folksonomy
    2.1.3 Folksonomies Conclusions
  2.2 The Geo World & Geo Web
    2.2.1 The GeoTags
    2.2.2 Geotagging Photos
    2.2.3 Yahoo! GeoPlanet & WoeId
    2.2.4 Flickr
  2.3 Weighting and Scoring Methods
    2.3.1 TF-IDF
    2.3.2 Vector Space Model
    2.3.3 VSM Related Issues
  2.4 Related Works
    2.4.1 Wormholes
    2.4.2 World Explorer
3 My Approach
  3.1 Approach Introduction
  3.2 Assumptions
  3.3 Datasets
  3.4 Definitions
  3.5 Flickr: from Narrow to Broad Folksonomy
  3.6 Extract a representative description
    3.6.1 Vector Space Model Representation
    3.6.2 Problem Decomposition
    3.6.3 TF-IDF Weights
    3.6.4 Weighting Systems
    3.6.5 VSM Limitations
  3.7 Scoring and Retrieving Similar Places
    3.7.1 Scoring
    3.7.2 Retrieving Similar Places
4 Implementation
  4.1 Development Platform
    4.1.1 PHP
    4.1.2 MySQL
    4.1.3 Python
  4.2 API & Online Services
    4.2.1 Yahoo Geoplanet
    4.2.2 Flickr
    4.2.3 Yahoo Query Language
  4.3 System Architecture
    4.3.1 Data Storage
    4.3.2 Data Analysis
  4.4 User Interface Design & Data Presentation
5 Tests and Evaluations
  5.1 About Similarity
    5.1.1 Critical Factors
  5.2 Test
    5.2.1 Survey
  5.3 Evaluation
6 Conclusion
  6.1 Current Status of the Work
  6.2 Application Fields
  6.3 Future Work
A Terms and Abbreviations
B Blacklist

List of Figures

2.1 Generic Tag Distribution in the broad folksonomies
2.2 Mobile devices positioning systems and accuracies
2.3 Geoplanet Hierarchy and Relationships
2.4 Flickr Map
2.5 A three-dimension example of the Vector Space Model
2.6 Geotagged World photos distribution collected for the “Wormholes” research
2.7 Wormholes detection from Mount Everest with σ = 50 km
2.8 The World Explorer Map for a large scale of details
2.9 The World Explorer Map for a narrow scale of details for the City of Rome in Italy
3.1 Tag distribution for the city of New York using the “Top 100” dataset
3.2 Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag
3.3 Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag using a log-log scale
4.1 Data Flow Representation
4.2 Database Representation for the “Random” Dataset
4.3 Table Representation for the generic “score” table
4.4 Search Field
4.5 Disambiguation
4.6 Similarities for the city of “Rome”
5.1 Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag
5.2 Tag Weights distribution for the city of New York using the “Random” dataset truncated to the 150th tag using the W-rnd1 and W-rnd3 weights
5.3 Survey Introduction and Instruction
5.4 Screenshot of the survey for the city of Marseille
5.5 Users’ survey answers for the first day of evaluation
5.6 Users’ survey answers for the last day of evaluation
5.7 Shared tags between “Seville” and “Cordoba” according to the System A
5.8 Shared tags between “Seville” and “Cordoba” according to the System D (truncated)
5.9 Five cities similar to “Rome” according to System A and System D (in ranked order)
5.10 Shared tags between “Rome” and “Tarragona” according to System A
5.11 Five cities similar to “Rome” according to System B and System C (in ranked order)
B.1 Stop-words or Blacklist composed by the most common Flickr tags

Chapter 1
Introduction

This chapter gives an introductory overview of the whole report. First, general information about the related “Web2rism” project is presented, and the motivation behind my decision to undertake this research project is explained. Next, the objective that this research attempts to satisfy, and its connection with my motivation, is described. Lastly, the structure of the report is given so that the reader has a clear idea of what each chapter is about.

1.1 The Web2rism project

The Web2Rism project was carried out by the webatelier (1) lab at Università della Svizzera Italiana (USI - Lugano, Switzerland) and funded by the CTI, the Swiss Confederation’s Commission for Technology and Innovation (2), and a private company called PromAx Communication (3).

The webatelier lab, directed by Professor Lorenzo Cantoni (4), is a research and development laboratory that deals with a broad range of topics related to new media communication, especially in the eTourism field, combining a strong academic background with relevant business experience.

Research projects of Webatelier deal with online communication strategies for destinations and tourism companies: eWord-of-Mouth and destinations’ online reputation, eLearning and gaming in tourism, argumentation in user-generated content, usability and usage studies, websites’ information architecture, and booking engine design.

(1) http://www.webatelier.net
(2) http://www.kti.admin.ch
(3) http://www.promax.ch/index.html
(4) http://newmine.blogspot.com/

The aim of the Web2Rism (web 2.0 and tourism) project was to build business intelligence software for the tourism field that analyzes the online reputation of a given destination based on User Generated Content (UGC) published on different services in the so-called web 2.0, e.g. blogs, wikis, social networks, and so on.
The project was divided into a research phase followed by a development phase, for a total duration of two years (2008-2010). The software, also named Web2rism, was designed, developed and released in December 2010 by Webatelier researchers, who also published several research papers on the topic of web reputation for tourism destinations [11][12].

1.2 Motivations

The main aim of the “Web2rism” project was to analyze the online reputation of tourism destinations. When I joined the “Web2rism” team, I started my research not on reputation analysis itself but on a related topic: finding similar tourism destinations, with the aim of creating a tourism suggestion system.

Several approaches and online tools already address this need, most of them based on data generated by users’ travel behavior or on users’ reviews of tourism destinations. These methods are, for example, widely used by big online travel and tourism portals such as Expedia (5) or Venere (6).

However, I wanted to develop a system not based on users’ reviews or tourism analyses. In particular, I started to wonder whether it might be possible to extract knowledge from users’ online photos. In other words, when a person, for example a tourist, takes a photo in a particular place, he is in a way expressing his appreciation for that place. This alone is valuable information that lets analysts identify usage patterns (e.g. a large number of photos taken in a place on a particular day may indicate an event, concert, parade, etc.).
