
POLITECNICO DI MILANO
Master of Science in Computer Engineering for the Communication

Department of Computer Engineering

USING GEOTAGS TO FIND SIMILAR TOURISM DESTINATIONS

Supervisor: Prof. Lorenzo Cantoni
Co-Supervisor: Dr. Davide Eynard

Master Thesis of: Leonardo Gentile, matricola 744177

Academic Year: 2010 - 2011

Thanks to Professor Cantoni, who gave me the great chance to undertake this thesis on a very interesting topic.

Thanks to Alessandro Inversini and Elena Marchiori for their advice in the communication field and for spreading my survey around the world.

Thanks to Eng. Giuseppe Moscato, aka PeppeSka, who kindly shared his homemade server, letting me gather 1 million tags from Flickr.

Thanks to Stefano Celentano who kindly shared his connection.

Thanks to my family for supporting me during my long student life (yes, don’t worry, it’s over...).

A very special thanks to Davide Eynard, who patiently and always very kindly guided and advised me during the creation of this work.

Abstract

The amount of geo-referenced information available on the Web is constantly increasing due to the wide availability of location-aware mobile devices and map interfaces. This is enabling new search paradigms (e.g. “What is here?”), but it is also generating a large number of unexplored geo-referenced collections. In particular, in photo collections like Flickr the co-existence of geographical metadata and text-based annotations (tags) generates interesting location-driven trends and patterns in textual data. When enough information is available, analysis systems can identify these patterns and extract aggregate knowledge. This inspired me to create a novel method for extracting representative place descriptions from the text annotations that users attach to geo-referenced Flickr photos. On this basis I propose an attempt to predict similar locations from the similarity of their respective descriptions. The prototype has been implemented as a web-based tool and has been positively evaluated, through a survey, by more than a hundred users.

Contents

Abstract

1 Introduction
1.1 The Web2rism project
1.2 Motivations
1.3 Objective
1.4 Thesis Outline

2 Background
2.1 Folksonomies
2.1.1 Broad Folksonomy
2.1.2 Narrow Folksonomy
2.1.3 Folksonomies Conclusions
2.2 The Geo World & Geo Web
2.2.1 The GeoTags
2.2.2 Geotagging Photos
2.2.3 Yahoo! GeoPlanet & WoeId
2.2.4 Flickr
2.3 Weighting and Scoring Methods
2.3.1 TF-IDF
2.3.2 Vector Space Model
2.3.3 VSM Related Issues
2.4 Related Works
2.4.1 Wormholes
2.4.2 World Explorer

3 My Approach
3.1 Approach Introduction
3.2 Assumptions
3.3 Datasets
3.4 Definitions
3.5 Flickr: from Narrow to Broad Folksonomy
3.6 Extract a representative description
3.6.1 Vector Space Model Representation
3.6.2 Problem Decomposition
3.6.3 TF-IDF Weights
3.6.4 Weighting Systems
3.6.5 VSM Limitations
3.7 Scoring and Retrieving Similar Places
3.7.1 Scoring
3.7.2 Retrieving Similar Places

4 Implementation
4.1 Development Platform
4.1.1 PHP
4.1.2 MySQL
4.1.3 Python
4.2 API & Online Services
4.2.1 Yahoo Geoplanet
4.2.2 Flickr
4.2.3 Yahoo Query Language
4.3 System Architecture
4.3.1 Data Storage
4.3.2 Data Analysis
4.4 User Interface Design & Data Presentation

5 Tests and Evaluations
5.1 About Similarity
5.1.1 Critical Factors
5.2 Test
5.2.1 Survey
5.3 Evaluation

6 Conclusion
6.1 Current Status of the Work
6.2 Application Fields
6.3 Future Work

A Terms and Abbreviations

B Blacklist

List of Figures

2.1 Generic Tag Distribution in the broad folksonomies
2.2 Mobile devices positioning systems and accuracies
2.3 Geoplanet Hierarchy and Relationships
2.4 Flickr Map
2.5 A three-dimension example of the Vector Space Model
2.6 Geotagged World photos distribution collected for the “Wormholes” research
2.7 Wormholes detection from Mount Everest with σ = 50 km
2.8 The World Explorer Map for a large scale of details
2.9 The World Explorer Map for a narrow scale of details for the City of Rome in Italy

3.1 Tag distribution for the city of New York using the “Top 100” dataset
3.2 Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag
3.3 Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag using a log-log scale

4.1 Data Flow Representation
4.2 Database Representation for the “Random” Dataset
4.3 Table Representation for the generic “score” table
4.4 Search Field
4.5 Disambiguation
4.6 Similarities for the city of “Rome”

5.1 Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag
5.2 Tag Weights distribution for the city of New York using the “Random” dataset truncated to the 150th tag using the W-rnd1 and W-rnd3 weights
5.3 Survey Introduction and Instruction
5.4 Screenshot of the survey for the city of Marseille
5.5 Users’ survey answers for the first day of evaluation
5.6 Users’ survey answers for the last day of evaluation
5.7 Shared tags between “Seville” and “Cordoba” according to System A
5.8 Shared tags between “Seville” and “Cordoba” according to System D (truncated)
5.9 Five cities similar to “Rome” according to System A and System D (in ranked order)
5.10 Shared tags between “Rome” and “Tarragona” according to System A
5.11 Five cities similar to “Rome” according to System B and System C (in ranked order)

B.1 Stop-words or Blacklist composed by the most common Flickr tags

Chapter 1

Introduction

This chapter gives an introductory overview of the whole report. First, general information about the related “Web2rism” project is presented, and the motivation behind my decision to undertake this research is explained. Next, the objective that this research attempts to satisfy, and its connection with my motivation, is described. Lastly, the structure of the report is given so that the reader has a clear idea of what each chapter is about.

1.1 The Web2rism project

The Web2Rism project was carried out by the webatelier¹ lab at Università della Svizzera Italiana (USI, Lugano, Switzerland) and funded by the CTI, the Swiss Confederation’s Commission for Technology and Innovation², and by a private company called PromAx Communication³.

The webatelier lab, directed by Professor Lorenzo Cantoni⁴, is a research and development laboratory that deals with a broad range of topics related to new media communication, especially in the eTourism field, combining a strong academic background with relevant business experience. Its research projects deal with online communication strategies for destinations and tourism companies: eWord-of-Mouth and destinations’ online reputation, eLearning and gaming in tourism, argumentation in user-generated content, usability and usage studies, websites’ information architecture, and booking engine design.

¹ http://www.webatelier.net
² http://www.kti.admin.ch
³ http://www.promax.ch/index.html
⁴ http://newmine.blogspot.com/

The goal of the Web2Rism (web 2.0 and tourism) project was to build business intelligence software for the tourism field that analyzes the online reputation of a given destination based on User Generated Content (UGC) published on different services of the so-called web 2.0, e.g. blogs, wikis, social networks, and so on. The project was divided into a research phase followed by a development phase, for a total duration of two years (2008-2010). The software, also named Web2rism, was designed, developed and released in December 2010 by researchers of Webatelier, who also published several research papers on the topic of web reputation for tourism destinations[11][12].

1.2 Motivations

The main aim of the “Web2rism” project was to analyze the online reputation of tourism destinations. When I joined the “Web2rism” team I started my research not exactly on reputation analysis but on a related topic: finding similar tourism destinations, with the aim of creating a tourism suggestion system. There are different approaches, and online tools already available, that satisfy this need; most of them are based on data generated by users’ travel behavior or exploit users’ reviews of tourism destinations. These methods are, for example, widely used by big online travel and tourism portals such as Expedia⁵ or Venere⁶.

However, I wanted to develop a system not based on users’ reviews or tourism analyses. In particular, I started to wonder whether it would be possible to extract knowledge from users’ online photos. When a person, for example a tourist, takes a photo in a particular place, they are in a way expressing their appreciation for that place. This alone is valuable information that lets analysts identify usage patterns (e.g. a large number of photos taken in a place during a particular day may identify an event, concert, parade, etc.). In online photo collections, of which Flickr is the most representative example, users can usually geotag their photos, meaning that they can annotate the geographical place where the photos were taken; they can also annotate them with textual keywords, with the aim of creating a short description of each photo. This is very interesting information, because we can analyze not only geographical photo trends but also trends among the textual descriptions aggregated by geographical location (e.g. how many users are using the tag “Trevi” in the city center of “Rome”?).

I began by studying the existing research papers on the topic, formalizing the hypotheses and setting the objectives.

⁵ http://www.expedia.com/
⁶ http://www.venere.com/

1.3 Objective

In the past, the information retrieval community has studied methods for efficiently indexing, retrieving, ranking and browsing documents in geographic data collections[17, 30]. However, in collections like Flickr the co-existence of location metadata and unstructured text-based annotations allows the generation of interesting location-driven aggregate knowledge: when enough information is available, analysis systems can identify useful location-driven trends and patterns in the text data.

The exploitation of Flickr’s geotags in conjunction with users’ text annotations has proven effective for various tasks: global event detection[25], mapping of popular tags and photos to geographical locations[1, 15], and finding important landmarks and representative photos[4]. Furthermore, several methods have been proposed to predict the geotags of a photo based on its textual tags[29], visual information[4] and individual user travel patterns[14].

I propose an attempt to predict similar geographic locations using Flickr geo-referenced photos and their tags. The similarity is based on location descriptions that are, in turn, extracted by aggregating the textual annotations of the photos. As far as I know, only one other study[3] pursuing this aim has been published.
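As a rough illustration of the objective stated above, the following minimal sketch aggregates the tags of each place into one description, weights the tags, and ranks the other places by the similarity of their descriptions. It assumes plain TF-IDF weights and cosine similarity, the standard Information Retrieval tools introduced in Chapter 2; the weighting systems actually adopted in this work are defined in Chapter 3. The function names and the input format (a dictionary mapping a place name to its list of tags) are hypothetical and serve only as an example.

```python
import math
from collections import Counter


def tfidf_vectors(place_tags):
    """Map each place to a {tag: weight} vector using plain TF-IDF.

    place_tags: dict mapping a place name to the list of tags
    aggregated from its photos (hypothetical input format).
    """
    n_places = len(place_tags)
    # Document frequency: number of places in which each tag occurs.
    df = Counter(tag for tags in place_tags.values() for tag in set(tags))
    vectors = {}
    for place, tags in place_tags.items():
        tf = Counter(tags)
        vectors[place] = {
            tag: (count / len(tags)) * math.log(n_places / df[tag])
            for tag, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {tag: weight} vectors."""
    dot = sum(weight * v[tag] for tag, weight in u.items() if tag in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(target, vectors):
    """Return (place, score) pairs for all other places, best match first."""
    scores = [
        (place, cosine(vectors[target], vec))
        for place, vec in vectors.items()
        if place != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

With this weighting, a tag that occurs in every place gets a weight of zero, so only tags that distinguish a place contribute to its similarity with other places.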

1.4 Thesis Outline

The thesis begins with an introduction to the background topics necessary to understand the whole research. The Background chapter opens with an excursus on the concept of folksonomy, which is particularly important in the scope of this research. The discussion then focuses on what the “Geoweb” is and why it is becoming an important emerging trend, considering also the related topics analyzed in this research. The chapter ends with an overview of the Information Retrieval methods used in the research and with a discussion of the related work that most influenced this thesis.

In Chapter 3, the theoretical approach of this work is explained, starting from the assumptions that form the foundations of the whole research and ending with a step-by-step illustration of all the choices I made.

The system behind the project makes heavy use of online data services and APIs to obtain the information on which the implemented tool performs analysis, scoring and retrieval; these are described in Chapter 4 along with the whole system architecture.

Chapter 5 presents the tests conducted through a survey on a focus group of users, followed by the evaluation based on the collected opinions. Finally, in Chapter 6 the conclusions are drawn, highlighting the goals reached together with the main limitations of the current prototype, and suggesting possible future improvements.

Chapter 2

Background

This chapter provides the background needed to better understand the characteristics of the project I worked on. It begins with an excursus on the concept of folksonomy and its subdivision into narrow and broad. Then the discussion focuses on what the “Geoweb” is and why it is becoming an important emerging trend, considering also the related topics analyzed in this research (tags, geotags). Next, the two main online data sources used during the work, Flickr and Yahoo! GeoPlanet, are presented. The chapter continues with an overview of the Information Retrieval methods used in the research, namely the Vector Space Model and the TF-IDF weighting method. Finally, it ends with a discussion of the related work that most influenced this thesis.

2.1 Folksonomies

Collaborative tagging is a phenomenon where users assign free-form keywords or short sentences (called tags) to annotate, describe and categorize shared content, typically over the web[16]. This practice is also known as social classification, social indexing, and social tagging. Collaborative tagging became popular on the Web around 2004¹ as part of highly popular web and social applications that enable it in some form, such as the music service Last.fm², the social bookmarking application Delicious³, the social networking site Facebook⁴ and the photo management and sharing tool Flickr⁵. The fact that large systems like these use collaborative tagging shows that it has become a common, and likely effective, way to describe various forms of digital content on the web.

Together, the tags in their respective contexts form a vocabulary often referred to as a folksonomy[22], which can be used to organize and retrieve the digital content it describes. Folksonomies can also be defined as large-scale bodies of lightweight annotations provided by humans, and they are becoming more and more interesting for research communities that focus on extracting machine-processable semantic structures from them[2].

Folksonomy, a term coined by Thomas Vander Wal⁶, is a blend of the terms folks (multiple people with no particular designation) and taxonomy (a hierarchical structure of classification), meaning that by adding metadata to objects or resources a community builds a personalized taxonomy. Since tags are usually free-form and unconstrained (although some tagging sites do not allow spaces or other non-alphanumeric characters in tags) and no pre-built vocabulary is imposed, folksonomies represent an alternative to the semantic web approach, where experts build ontologies⁷ with predetermined relationships among keywords. This latter approach usually requires domain experts and a community that agrees with most of the experts’ choices, whereas in collaborative tagging keyword indexing grows as a natural process[5].

In other terms, we can refer to folksonomy as a means for people to tag objects (web pages, photos, videos, etc.) using their own vocabulary, so that it is easy for them to find that information again. It is important to notice that folksonomies, which often live in a social network context, also let others who use the same vocabulary find the object.

Folksonomies work best when the tags used to describe an object come from the tagger’s own common vocabulary, not from a guess at what others would call it. It is possible to derive two different concepts from the general term folksonomy: broad and narrow folksonomies, which go beyond a simple understanding of tagging.

¹ http://vanderwal.net/folksonomy.html
² http://www.last.fm/
³ http://www.delicious.com/
⁴ http://www.facebook.com/
⁵ http://www.flickr.com
⁶ http://www.vanderwal.net/about.php
⁷ http://www.shirky.com/writings/ontology_overrated.html

2.1.1 Broad Folksonomy

An example of broad folksonomy is represented by a tool like Delicious. In a broad folksonomy many people tag the same object, and every person can tag it with their own tags in their own vocabulary. We can describe this process by dividing it into three main actions:

• A person creates the object (content) and makes it accessible to others.

• Other people (groups of people with the same vocabulary) tag the object with their own terms.

• People also find the information based on the tags.

By analyzing broad folksonomies we can usually identify emerging trends, often with a distribution shape following the power law curve⁸ (Figure 2.1).

Broad Folksonomy & Power Law Curve

The tagging approach has both benefits and drawbacks. The main positive aspect of tagging systems is that they allow much greater malleability and adaptability in organizing information than formal classification systems do, because “groups of users do not have to agree on a hierarchy of tags or detailed taxonomy, they only need to agree, in a general sense, on the ‘meaning’ of a tag enough to label similar material with terms for there to be cooperation and shared value”⁹. However, a number of problems arise from organizing information through folksonomies, including ambiguity in the meaning of tags and the use of synonyms, which creates informational redundancy[7].

The main concern about using collaborative tagging to organize metadata is whether or not the system becomes relatively “stable” with time and use. By “stable” we mean that users have developed some consensus about which tags best describe an object, and that those tags are the ones used most often. The most problematic claim against tagging systems would be that, because users are not bound by a centralized controlled vocabulary, no coherent categorization scheme can emerge at all. In this case tagging systems would be essentially unstable: the tags used and their frequency of use would be in a constant state of flux.

A tag distribution for an object or resource is defined as the collection of all tags and their frequencies ordered by rank frequency for a given resource.

⁸ http://www.shirky.com/writings/powerlaw_weblog.html
⁹ http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html

Figure 2.1: Generic Tag Distribution in the broad folksonomies

It has been empirically shown that tag distributions for broad folksonomies actually stabilize over time, producing a distribution known as a power law[7], like the one shown in Figure 2.1. This means that from a broad folksonomy we can gather trends describing how a wide range of people are tagging one object. We can use Figure 2.1 to describe the trend that emerges when different users tag a “bookmark” on Delicious. The tags spike with tag “2” getting the largest portion of the tags, with 13 entries, and tag “1” receiving 10 identical tags. From this point the trends for popular tags are easy to see: the spikes on the left (power terms) identify trends that could be used to extract a controlled vocabulary, or at least to let a broad spectrum of people know what to call the object and so find it (people similar to those who tagged the object, bearing in mind that those who tag may not be representative of the whole).

We also see the tags at the right end of the curve, known as the long tail. Here a small minority of people call the object by a given term, but their tags allow others with a similar vocabulary mindset (or maybe the same language) to find the object, even if they do not use the terms used by the masses at the left end of the curve. If we take this example and spread it over 400 or 1,000 people tagging the same object, we will see a similar distribution with even more pronounced spikes and drop-off and a longer tail, because one important feature of power laws is that they are often “scale-free”: regardless of how large the system grows, the shape of the distribution remains the same, and is thus “stable”. The long tail and the power curve are benefits of the broad folksonomy, coming from the richness provided by many people openly tagging the same object.
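The tag distribution defined above can be computed with a few lines of code. The sketch below is only illustrative: it assumes that the taggings of one resource are available as a list with one tag list per user (a hypothetical input format), builds the rank-frequency distribution, and estimates the slope of the distribution on a log-log scale. A roughly straight line with negative slope on that scale is what a power law looks like, although a least-squares fit of this kind is a crude indicator and not a rigorous test.

```python
import math
from collections import Counter


def tag_distribution(taggings):
    """Return (tag, frequency) pairs ordered by decreasing frequency.

    taggings: one list of tags per user for a single resource
    (hypothetical input format). A tag repeated by the same user
    is counted once.
    """
    counts = Counter(tag for user_tags in taggings for tag in set(user_tags))
    # Ties are broken alphabetically to keep the ranking deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    frequencies must be positive and already ordered by rank.
    """
    if len(frequencies) < 2:
        raise ValueError("at least two ranked frequencies are needed")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Passing the frequencies returned by `tag_distribution` to `loglog_slope` gives a single number that can be compared across resources; the "scale-free" property described above corresponds to this slope staying roughly the same as more users tag the object.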

As described next, the narrow folksonomy does not have the same properties, but it has other benefits. These benefits do not exist for those who simply tag items, which is most often done by content creators for their own content.

2.1.2 Narrow Folksonomy

The narrow folksonomy, which a tool like Flickr represents, provides benefit in tagging objects that are not easily searchable or have no other means of using text to describe or find them (in this case images and photographs). The narrow folksonomy is built by one or a few people providing tags that they use to get back to that information. The tags, unlike in the broad folksonomy, are singular in nature for each object: each term is attached only once, as compared to 13 people using the same tag in the broad folksonomy example. Often in the narrow folksonomy the person creating the object provides one or more of the tags to get things started.

The goals and uses of the narrow folksonomy are different from those of the broad one, but it is still very helpful, as more than one person can describe the same object. With the narrow folksonomy, however, there is little chance of really knowing how the tags are consumed or what portion of the people using the object would call it what; therefore it is not helpful for finding emerging vocabulary or trends. We do find that tags used to describe are also used for grouping, which is particularly visible and relevant in Flickr.

The narrow folksonomy does not have the richness of the broad folksonomy, but it still adds value. The value, as in the case of Flickr, lies in text tags being applied to objects that were not findable using traditional search engines or the other text-related tools that make up much of how we find things on the Internet today. The narrow folksonomy gives various audiences the means to add tags in their own vocabulary that will help them, and those like them, to find the objects at a later time.

2.1.3 Folksonomies Conclusions

We benefit from folksonomies because both the personal vocabulary and the social aspects help people to find, and retain a “chain” to, objects on the web that are of interest to them. Who is doing the tagging and how the tags are consumed are important factors to understand. This also helps to see that not all tagging is a folksonomy; some is just tagging. Folksonomy tagging can provide connections across cultures, languages and disciplines (a photograph can be tagged in two different languages and alphabets, and we can still find valuable information because one object is tagged by both communities using their own differing terms of practice). In conclusion, we can draw different advantages from folksonomies, even from a narrow one.

2.2 The Geo World & Geo Web

The amount of geographically annotated data on the Web is increasing drastically, generating a major new trend in which geographic data (and metadata) are available in several applications and online services[1].

The Geospatial Web, or Geoweb, is a relatively new term that implies the merging of geographical (location-based) information with the abstract information that currently dominates the Internet. This would lead to an environment where one could search for things based on location instead of by keyword only, e.g. “What is Here?”. The concept of a Geospatial Web may have first been introduced in 1994 by Dr. Charles Herring in his US Department of Defence research paper[9].

Interest in the Geoweb has been driven by new technologies, concepts and products. Virtual globes such as Google Earth¹⁰ and NASA World Wind¹¹, as well as mapping websites such as Google Maps¹², Bing Maps¹³ and Yahoo Maps¹⁴, have been major factors in raising awareness of the importance of geography and location as a means to index information. The rise of advanced web development methods such as Ajax and the availability of geographical Application Programming Interfaces (APIs), such as the Yahoo! GeoPlanet API (see Section 2.2.3) and the Google Maps API Family¹⁵, are providing inspiration to move Geographical Information Systems (GIS) onto the web. This is also due to the availability of map interfaces on several kinds of location-aware mobile devices that are becoming accessible to the mainstream market.

Location-Aware Mobile Devices

With the increasing popularity of mobile communications and mobile computing, the demand for location-aware and adaptive applications grows. Location-aware devices and applications exploit knowledge about the physical location of real-world objects, such as mobile persons and devices, to adapt their functional behaviour and their appearance towards the user[19]. The user can be located with different positioning systems. Due to the mass production of affordable GPS-enabled cameras and mobile phones[21, 24], location metadata such as latitude and longitude are automatically associated with the content generated by users.

¹⁰ http://www.google.com/earth/
¹¹ http://worldwind.arc.nasa.gov/
¹² http://maps.google.com/
¹³ http://www.bing.com/maps/
¹⁴ http://maps.yahoo.com/
¹⁵ http://code.google.com/apis/maps/index.html

If the device is equipped with a GPS (Global Positioning System) module, the location is calculated on the user’s device and can be determined very accurately, within a range of 2–20 meters[13].

A mobile phone can also be located by the telecom operator in the network[13]. The positioning is based on identifying the mobile network cell in which the phone is located, or on measuring distances to overlapping cells. In urban areas the accuracy can be down to 50 meters, whereas in rural areas it may be several kilometres. The advantage of the cell-based positioning method is that no extra equipment is needed: an ordinary mobile phone is already capable of it.

Figure 2.2: Mobile devices positioning systems and accuracies

Finally, the user can also be located at a service point, using e.g. WLAN (Wireless Local Area Network), Bluetooth or infrared technologies. These kinds of proximity positioning systems require a dense network of access points[13]. The density of the network depends both on the required location accuracy and on the range of the access points. The accuracy can be down to 2 meters, although a practical test using an iPhone device¹⁶ showed an average accuracy of 30 meters¹⁷. The user needs a mobile device able to connect to WLAN and Bluetooth services, but these are becoming very common in current smartphones. Because of the required infrastructure, such localization methods can only be used in a predefined area, e.g. a shopping centre, an exhibition area or an office building. The location of the user is available only when the user is in the service area.

2.2.1 The GeoTags

Geotagging refers to the process of assigning geospatial context information, ranging from specific point locations to arbitrarily shaped regions, to objects and online resources. The concept is similar to the tagging of online resources explained in Section 2.1, but in this case the objects and resources are annotated with geographical metadata instead of free-form textual keywords. Different sources of geospatial context information for annotating Web resources often co-occur in real-world applications[20]:

• Annotation provided by the user, manually or through location-aware devices such as car navigation systems, RFID-tagged products and GPS-enabled cellular phones. These devices geotag information automatically when it is being created.

• Determining the location of the user by analyzing their connection point to the Internet, e.g. by querying the Whois¹⁸ database for domain registrations or by using the W3C Geolocation API¹⁹ for higher accuracy. In the case of mobile devices, one of the methods mentioned in the paragraph above can be used.

• Automated annotation of existing documents. The processes of recognizing geographic context and assigning spatial coordinates are commonly referred to as geoparsing and geocoding, respectively.

The geospatial metadata (geotags) usually consist of latitude and longitude coordinates, even though they can also include annotations such as altitude, bearing, distance, accuracy data, and place names.
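A geotag of this kind can be represented as a small record. The following sketch is an assumption made for illustration and does not mirror the schema of any particular service: the class name `GeoTag` and its field names are hypothetical. Latitude and longitude are mandatory decimal degrees, the remaining annotations are optional, and the constructor rejects values outside the valid coordinate ranges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GeoTag:
    """Hypothetical record for the geospatial metadata of one resource."""

    latitude: float                      # decimal degrees, positive north
    longitude: float                     # decimal degrees, positive east
    altitude_m: Optional[float] = None   # metres above sea level
    bearing_deg: Optional[float] = None  # compass direction, 0 <= value < 360
    accuracy_m: Optional[float] = None   # estimated position error in metres
    place_name: Optional[str] = None

    def __post_init__(self):
        # Reject coordinates that cannot exist on the globe.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude must be between -90 and 90 degrees")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude must be between -180 and 180 degrees")
        if self.bearing_deg is not None and not 0.0 <= self.bearing_deg < 360.0:
            raise ValueError("bearing must be in the range [0, 360)")
        if self.accuracy_m is not None and self.accuracy_m < 0.0:
            raise ValueError("accuracy cannot be negative")
```

Keeping the accuracy next to the coordinates is useful because, as discussed above, the positioning method determines whether a geotag is reliable to within a few metres or only to within several kilometres.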

¹⁶ http://www.apple.com/iphone/specs.html
¹⁷ http://www.wired.com/gadgets/wireless/magazine/17-02/lp_guineapig
¹⁸ http://www.dnsstuff.com/
¹⁹ http://dev.w3.org/geo/api/specsource.html

Geotagging can help users find a wide variety of location-specific information. For instance, one can find images taken near a given location by entering latitude and longitude coordinates into an appropriate image search engine. Geotagging-enabled information services can also potentially be used to find location-based news, websites, or other resources.

The related term geocoding refers to the process of taking non-coordinate-based geographical identifiers, such as a textual street address, and finding the associated geographic coordinates (or vice versa for reverse geocoding). Such techniques can be used together with geotagging to provide alternative search techniques. Geocoding usually analyzes unambiguous, structured location references, such as postal addresses and formatted numerical coordinates, while geoparsing handles ambiguous references in unstructured discourse, such as “Venice”, which is the name of several places, including towns in both Italy and the USA.
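The search for images taken near a given location can be sketched as a distance filter over geotagged photos. The example below is a minimal illustration, not the query mechanism of any real image search engine: it assumes that each photo is a dictionary with hypothetical `lat` and `lon` keys in decimal degrees, and it uses the haversine formula on a spherical Earth, which is an approximation that is adequate at city scale.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding errors near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def photos_near(photos, lat, lon, radius_km):
    """Return the photos within radius_km of (lat, lon), nearest first.

    photos: iterable of dicts with "lat" and "lon" keys
    (hypothetical record format).
    """
    hits = [(haversine_km(lat, lon, p["lat"], p["lon"]), p) for p in photos]
    hits.sort(key=lambda hit: hit[0])
    return [photo for distance, photo in hits if distance <= radius_km]
```

A linear scan like this is enough for small collections; a large photo collection would normally rely on a spatial index or on a bounding-box query before computing exact distances.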

2.2.2 Geotagging Photos

There are several circumstances in which the location where a picture was taken is important: tourists shoot photos of family while traveling on vacation, botanists record images of plant species, and real-estate firms post shots of houses and neighborhoods[32]. These are only a few examples in which the geographic location where the photographs were taken provides critical context. Social sharing is another factor: geotagging our pictures lets the highest possible number of users reach them²⁰. Geotagging can, for example, tell users where a given picture was taken, and conversely some media platforms can show pictures relevant to a given location.

Users have the opportunity to spatially organise and browse their personal media, and photo sharing services (see Flickr in Section 2.2.4) are leading the growing enthusiasm for personal location-awareness[31]. Geo-referenced photos can be organised in a browsable taxonomy of major locations or pinpointed on a map to identify very small regions. Some of the most popular examples are Flickr Places and Google Panoramio²¹.

²⁰ http://www.msnbc.msn.com/id/22732770/ns/technology_and_science-internet
²¹ http://www.panoramio.com/

The base resource for geotagging digital objects is the position, which in almost every case is derived from GPS and is based on the latitude/longitude coordinate system, which locates each point on the Earth from 180° west through 180° east along the Equator and from 90° north through 90° south along the prime meridian. There are two main options for geotagging photos: capturing positioning information (usually GPS) at the time the photo is taken, or “attaching” the photograph to a map after the picture is taken. To capture GPS data at the time the photograph is taken, the user must have a camera with built-in GPS or a standalone GPS along with a digital camera. Because of the requirement for wireless service providers in the United States to supply more precise location information for 911 calls by September 11, 2012²², more and more cell phones, sold all around the world, have built-in GPS chips. Some cell phones, like the iPhone and various devices using the Android²³ operating system, already use a GPS chip along with built-in cameras to let users automatically geotag photos. Others may have the GPS chip and camera but lack the internal software needed to embed the GPS information within the picture. A few digital cameras, from manufacturers such as Nikon, Sony and Ricoh, also have a built-in GPS that allows automatic geotagging. Almost any digital camera can be coupled with a standalone GPS and post-processed with photo mapping software (such as GPS-Photo Link²⁴, Alta4²⁵, or EveryTrail²⁶) to write the location information to the image’s header. An alternative way to record the location of the pictures is to use a camera with an SD or SDHC memory card²⁷ with wireless connection enabled and geotagging capabilities.
The most common SD memory card with these characteristics is the Eye-Fi Geo²⁸, which provides a distinctive method for determining the location. The Eye-Fi card geotags pictures through Wi-Fi Positioning System (WPS)[19] technology. Using the built-in Wi-Fi module, the Eye-Fi card senses the surrounding Wi-Fi networks while the user is taking pictures. The location is not recorded locally in conventional Exif coordinate form; instead, the geotags are inserted into the Exif data when the photos are uploaded using the Eye-Fi Service. Geographic coordinates can also be added after the photograph is taken, by “attaching” the photograph to a map[29] using online services such as Flickr and Panoramio. Once the location has been selected on an online map, these tools can write the latitude and longitude into the photo’s Exif header.

²² http://www.fcc.gov/cgb/consumerfacts/wireless911srvc.html
²³ http://www.android.com/
²⁴ http://www.geospatialexperts.com/productfeatures.php
²⁵ http://www.alta4.com/eng/geoimaging/camera/index.php
²⁶ http://www.everytrail.com/garmin_import.php
²⁷ http://www.sdcard.org/
²⁸ http://uk.eye.fi/products/geox2

Tag vs. Geotag

In online photo sharing communities, the users' text-based annotations (tags) and the location metadata (geotags) often co-exist. In these contexts it is not rare for “geotag” to mean a textual tag that also carries the georeference information of the photo it describes. In this report the term “geotag” refers to the geographic annotation (e.g. where a photo was taken), while “tag” always means the textual annotation related to a photo (e.g. “cat”).

EXIF Metadata

Geotag information is typically embedded in the metadata (stored in EXIF format). These data are not visible in the picture itself but are read and written by almost all digital imaging programs, most digital cameras and modern scanners. EXIF stands for “Exchangeable image file format”, and is a specification²⁹ for the image file format used by digital cameras (including smartphones) and scanners. The specification uses the existing JPEG, TIFF Revision 6.0, and RIFF WAV file formats, with the addition of specific metadata tags. It is not supported in JPEG 2000, PNG, or GIF. Version 2.1 of the specification is dated June 12, 1998; the latest version, 2.3, dated April 2010³⁰, was jointly formulated by JEITA³¹ and CIPA³². Though the specification is not currently maintained by any industry or standards organization, its use by camera manufacturers is nearly universal. The metadata tags defined in the EXIF standard cover a broad spectrum:

• Date and time information. Digital cameras will record the current date and time and save this in the metadata.

• Camera settings. This includes static information such as the camera model and make, and information that varies with each image such as orientation (rotation), aperture, shutter speed, focal length, metering mode, and ISO speed information.

• A thumbnail for previewing the picture on the camera’s LCD screen, in file managers, or in photo manipulation software.

• Descriptions and copyright information.

• Geotags

²⁹ http://www.exif.org/specifications.html
³⁰ http://www.cipa.jp/english/hyoujunka/kikaku/pdf/DC-008-2010_E.pdf
³¹ http://www.jeita.or.jp/english/
³² http://www.cipa.jp/english/index.html

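Latitude and longitude are commonly expressed in degrees with decimals; in this format, a positively signed coordinate indicates the Northern or Eastern hemisphere, while a negative sign indicates the Southern or Western hemisphere. EXIF itself stores each coordinate as degrees, minutes and seconds together with a hemisphere reference (N/S or E/W), which is how most readers display it. An example readout for a photo might look like: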

GPS Latitude  : 57 deg 38’ 56.83” N
GPS Longitude : 10 deg 24’ 26.79” E
GPS Position  : 57 deg 38’ 56.83” N, 10 deg 24’ 26.79” E
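The readout above can be converted to signed decimal degrees with a few lines of code. The following is a minimal sketch; the function name `dms_to_decimal` is chosen here for illustration and does not belong to any EXIF library.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter
    to signed decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


# The example readout: 57 deg 38' 56.83" N, 10 deg 24' 26.79" E
lat = dms_to_decimal(57, 38, 56.83, "N")
lon = dms_to_decimal(10, 24, 26.79, "E")
print(round(lat, 6), round(lon, 6))
```

The sign convention matches the one described above: Northern and Eastern coordinates stay positive, Southern and Western ones become negative.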

2.2.3 Yahoo! GeoPlanet & WoeId

When dealing with geo-referenced data and metadata we always face ambiguity. It arises from at least three critical factors that affected geographic representation systems long before the birth of the GeoWeb: location names, coordinate precision and boundaries. The first factor is due to the fact that different people call the same place by different names, depending on their language, alphabet or cultural background. Every location on the Earth can have hundreds of different names referring to the same geographical object. For each location there can be:

• different names in English

• different names in other languages (including the local one)

• well-known (but unofficial) variants for the place (e.g. “New York City” for New York)

• colloquial names for the place (e.g. “Big Apple” for New York)

• version of the names stripped of accent characters

• abbreviations or codes for the place (e.g. “NYC” for New York)

The second factor deals with the accuracy of the coordinates identifying a place. A geographical place can be located using several distinct sources, each one providing its own version and representation of the coordinates identifying that place, and with high probability they will never coincide. Accuracy is the main reason for this problem: we can, for example, deal with accuracies of centimeters, meters or kilometers supplied by three different sources, all of them identifying the same geographical object. One could argue that the solution to this problem is to adopt the system supplying the coordinates with the highest accuracy. This leads to the third source of ambiguity, namely boundary limits. Even adopting the system with the best accuracy, we have to accept that the three systems in the example can identify the boundaries of the same geographic place in different ways. Since geographic places may have very different real-world boundary shapes, it is a challenge to identify:

• the center of the object

• the shape of the object

• the accuracy representing the shape

The Geoplanet Service

Yahoo! provides an online service and public API that attempts to solve some of the problems mentioned above: the Yahoo! GeoPlanet³³ service (Geoplanet in short). GeoPlanet is designed to bridge the gap between the real and virtual worlds by providing an open, comprehensive, and intelligent infrastructure for geo-referencing data on the Earth’s surface. In practical terms, GeoPlanet is a resource for managing all geo-permanent named places on Earth. It provides a vocabulary and grammar to describe the world’s geography in an unequivocal, permanent, and language-neutral manner, and is designed to facilitate spatial interoperability and geographic discovery.

Where On Earth Identifier

GeoPlanet provides information for about six million named places globally. Spatial entities provided by GeoPlanet are referenced by a 32-bit identifier: the “Where On Earth ID” (WOEID). WOEIDs are unique and non-repetitive, and are assigned to all entities within the system. A WOEID, once assigned, is never changed or recycled. If a WOEID is deprecated, it is mapped to its successor or parent WOEID, so that requests to the service using a deprecated WOEID are served transparently.

The Hierarchy

The service uses a hierarchical model for places that provides both vertical and horizontal consistency of place geography. The model ensures that places in each layer of the hierarchy overlay the correct and corresponding places in other layers, and that geographical relationships are preserved. The hierarchy makes it possible to query the geographic context of every named place represented by a WOEID. Every place belongs to a number of containing, superior (larger) geographic entities, and in turn may contain a number of inferior (smaller) geographic entities. The smallest fully containing official geographical entity for a place is called its parent. The list of containing official geographic entities for a place is called its ancestors. The fully contained geographic entities for a place are called its children. The hierarchy recognizes a distinction between “official” administrative places, such as country, state, county, and city, and “informal” places, such as colloquial places and historical administrative places. These informal places are included in a separate collection called belongtos.

³³ http://developer.yahoo.com/geo/geoplanet/

Figure 2.3: Geoplanet Hierarchy and Relationships

Relationship

Places have relationships with other places; Yahoo! GeoPlanet allows users to identify places that have specific relationships to others, such as the parent, children, and neighbors. For example, a list of states (or first-level administrative areas) in a particular country can be obtained by requesting the children of that country; in a similar manner, the postal codes surrounding a particular postal code can be obtained via a call for its neighbors. The following relationships are provided by GeoPlanet:

• Parent

• Children

• Neighbors

• Siblings

• Belongtos

• Ancestors
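To make the parent, children, ancestors and siblings relationships concrete, the following is a toy in-memory model. It is only a sketch: it does not call the GeoPlanet API, the class and method names are invented for illustration, the integer identifiers are made up (they are not real WOEIDs), and the hierarchy is deliberately simplified. Neighbors and belongtos are not modelled because they cannot be derived from parent links alone.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Place:
    woeid: int                      # made-up identifier, not a real WOEID
    name: str
    parent: Optional[int] = None    # smallest containing place, if any
    children: List[int] = field(default_factory=list)


class PlaceTree:
    def __init__(self) -> None:
        self.places: Dict[int, Place] = {}

    def add(self, woeid: int, name: str, parent: Optional[int] = None) -> None:
        self.places[woeid] = Place(woeid, name, parent)
        if parent is not None:
            self.places[parent].children.append(woeid)

    def ancestors(self, woeid: int) -> List[int]:
        """Follow parent links from the place up to the root."""
        chain = []
        current = self.places[woeid].parent
        while current is not None:
            chain.append(current)
            current = self.places[current].parent
        return chain

    def siblings(self, woeid: int) -> List[int]:
        """Places sharing the same parent, excluding the place itself."""
        parent = self.places[woeid].parent
        if parent is None:
            return []
        return [c for c in self.places[parent].children if c != woeid]


tree = PlaceTree()
tree.add(1, "Europe")
tree.add(2, "Italy", parent=1)
tree.add(3, "Milan", parent=2)
tree.add(4, "Rome", parent=2)
print(tree.ancestors(3), tree.siblings(3))
```

Referring to places by integer identifier rather than by name is what removes the naming ambiguity discussed earlier: "Milan" and "Milano" would both resolve to the same key.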

Place Type

Places are categorized to help identify the geographic entity. These Place Types have unique codes that may be used to filter results for some resources. They also have localized names, so they can be displayed along with the localized place name. The following list shows a small subset of the supported place types.

• Continent (code:29)

• Country (code:12)

• Town (code:7)

Positional Consistency

Places in GeoPlanet are roughly represented in longitude/latitude coordinates using the WGS84³⁴ datum. All places are represented within a single positional context to ensure that content is organized in a consistent way globally. GeoPlanet also recognizes that a place has a center and an area of influence, and represents these respectively by its centroid and its bounding box. Thus every place within each theme has a geometric description. Different areas within different themes overlap to enable the most granular location for an address to be found. The coordinates provided are illustrative, not normative; the service does not aim to be the authority on the exact bounds of any particular place. Its main purpose is instead to provide a common naming convention, and to ensure that places are correctly represented in relation to each other in a global, consistent framework. In practice this means that the service does not claim that a particular neighborhood stops at one block and starts at the next, only that the concept of that neighborhood can be identified consistently. The primary concerns are geography and the semantics of place.

The role of Geoplanet inside my research

The Geoplanet Service and the related API (Section 4.2.1) have been used during the project mainly to clearly identify different places on Earth (the dataset used is composed of 233 cities). One important feature used was the WOEID: since it is a unique integer identifier, once each city had been identified I could refer to it without caring about names, ambiguity, geographic coordinates and bounding boxes. The WOEID was a really important aspect of the project that simplified several development stages (see Chapter 4).

³⁴ http://earth-info.nga.mil/GandG/publications/tr8350.2/tr8350_2.html

2.2.4 Flickr

Flickr³⁵ is a popular image and video hosting website, created by Ludicorp and later acquired by Yahoo!. In September 2010, it reported that it was hosting more than 5 billion images³⁶. Flickr lets photo submitters organize images using tags, which enable searchers to find images related to particular topics, such as place names or subject matter. Flickr was also an early website to implement tag clouds, which provide access to images tagged with the most popular keywords. Because of its support for tags, Flickr has been cited as a prime example of effective use of folksonomy, although Thomas Vander Wal suggested that Flickr is not the best example³⁷, defining it as a narrow folksonomy (Chapter 2.1).

Since 2006³⁸, Flickr also lets users geotag their uploaded pictures by dragging them over a map[29]³⁹ or by importing photos that have already been geotagged using other tools or services, including the automatic mobile geotagging methods explained in Section 2.2.1.

Figure 2.4: Flickr Map

³⁵ http://www.flickr.com/
³⁶ http://blog.flickr.net/en/2010/09/19/5000000000/
³⁷ http://www.vanderwal.net/random/entrysel.php?blog=1781
³⁸ http://blog.flickr.net/en/2006/08/28/great-shot-whered-you-take-that/
³⁹ http://www.flickr.com/map

The import system is able to extract the geographic metadata from the EXIF information, with strong support for the Where On Earth identifier provided by Yahoo! Geoplanet (Section 2.2.3).

For mobile users, Flickr has official apps for iPhone⁴⁰ and BlackBerry⁴¹, as well as several third-party apps. All these mobile apps let users, who are often geolocated, interact with the system in a location-aware way (depending on the device used) and automatically upload geotagged photos from the device. Finally, Flickr offers a fairly comprehensive web-service API that makes it possible to create applications that can perform almost any function a user on the Flickr site can do (Section 4.1.1).

Online photo sharing systems, of which Flickr is the most popular, are strongly contributing to the growing enthusiasm for personal location-awareness[29]. It is worth mentioning that for the launch of the geotagging service, on August 28, 2006, the Flickr developers made projections guessing how many photos Flickr members would geotag; they thought that they could hit a million in the first month or, in the best scenario, in two weeks. Instead, 24 hours after the launch of the geotag service, there were 1,234,384 geotagged photos⁴². More recently, on January 8th, 2011, it was announced that there were 190 million geotagged photos available, with a constantly increasing trend⁴³.

The role of Flickr inside my research

Flickr aggregates a huge collection of images, most of them annotated with a wide variety of textual tags but also with other forms of information, including the “owner” of a tag, geolocation, time and photographer. This information is very valuable for the aim of this research, which is to extract patterns and generate “knowledge”.

⁴⁰ http://itunes.apple.com/us/app/flickr/id328407587?mt=8
⁴¹ http://us.blackberry.com/smartphones/features/social/flickr.jsp
⁴² http://blog.flickr.net/en/2006/08/29/geotagging-one-day-later/
⁴³ http://code.flickr.com/blog/2011/01/08/flickr-shapefiles-public-dataset-2-0/

2.3 Weighting and Scoring Methods

As anticipated earlier, the research I carried out can be summarized in the following steps:

1. Extract a representative description of each city

2. Find similarities between these representations and store the ranked list of place similarities

3. Given a city as a query term, find the most similar cities by looking up the previously computed ranked list

In particular, the TF-IDF approach has been used for step 1 (the motivations will become clear from the assumptions in Chapter 3.2). For steps 2 and 3, the Vector Space Model has been adopted to calculate and retrieve the similarities.

2.3.1 TF-IDF

The Term Frequency-Inverse Document Frequency is a weight often used in information retrieval and text mining. It was first introduced by Gerard Salton⁴⁴ in 1975[27] in conjunction with the Vector Space Model (see next section). The weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query.

Definitions:

The term count (or frequency) for a term t_i in a document d_j is the number of times the term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document), giving a measure of the importance of the term t_i within the particular document d_j.

⁴⁴ http://en.wikipedia.org/wiki/Gerard_Salton

Thus the term frequency (normalized) is defined as follows:

$$ tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad (2.1) $$

where n_{i,j} is the number of occurrences of the considered term (t_i) in document d_j, and the denominator is the sum of the number of occurrences of all terms in document d_j, that is, the size of the document |d_j|.

The inverse document frequency is a measure of the general importance of the term across the corpus, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

$$ idf_i = \log_2 \frac{|D|}{|\{d : t_i \in d\}|} $$

where:

• |D| represents the total number of documents in the corpus

• |{d : t_i ∈ d}| is the number of documents in which the term t_i appears (that is, n_{i,j} ≠ 0)

The reason why the logarithm of the ratio is needed for calculating idf_i is well explained by Dr. E. Garcia in one of the most influential blogs about Information Retrieval topics⁴⁵.

The weight that determines the importance of a term t_i for a document d_j is computed as:

$$ W_{i,j} = tf_{i,j} \cdot idf_i $$

A high tf-idf weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Various (mathematical) forms of the tf-idf term weight can be derived depending on the probabilistic distributions of the terms and documents under analysis. I personally used different variations of the idf term in order to compute different weights based on different assumptions (Chapter 3.6.3).
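The definitions above translate almost directly into code. The following is a minimal sketch of the basic scheme only (normalized term frequency multiplied by the base-2 logarithm of the inverse document frequency); it is not the implementation used in this research, and the function name, the toy corpus and its tags are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: dict mapping a document id to its list of terms.
    Returns a dict: document id -> {term: weight}."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for terms in corpus.values():
        doc_freq.update(set(terms))

    weights = {}
    for doc_id, terms in corpus.items():
        counts = Counter(terms)
        total = sum(counts.values())          # document size |d_j|
        weights[doc_id] = {
            term: (count / total) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Toy corpus: each "document" is the bag of tags of one city (made-up data).
corpus = {
    "rome":   ["colosseum", "pizza", "pizza", "church"],
    "naples": ["pizza", "sea"],
    "venice": ["canal", "sea", "church", "gondola"],
    "milan":  ["duomo", "fashion"],
}
w = tf_idf(corpus)
print(w["rome"])
```

In this toy corpus "colosseum" appears once in a single document while "pizza" appears twice but in two documents, so both end up with the same weight for "rome": the higher term frequency of "pizza" is offset by its higher document frequency. A term that appears in every document gets weight zero.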

The tf-idf weighting scheme is often used in the vector space model together with cosine similarity to determine the similarity between two documents.

⁴⁵ http://irthoughts.wordpress.com/2009/04/15/why-idf-is-expressed-using-logs/

2.3.2 Vector Space Model

The Vector Space Model (VSM) is an algebraic model for representing text documents (and objects in general) as vectors in a multidimensional space of index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. A first version was introduced in 1960 in Gerard Salton’s SMART⁴⁶ Information Retrieval System. Each dimension corresponds to a separate term, and the model is based on the assumption of independence between terms. In the Vector Space Model a query is treated just as another document, and both are represented as vectors in the term space. Each document (or query) is represented by a vector in an M-dimensional space, where M is the number of index terms:

d_j = (w_{1,j}, w_{2,j}, ..., w_{M,j})
q = (w_{1,q}, w_{2,q}, ..., w_{M,q})

Each term is identified by a unit vector t_i = (0, 0, ..., 1, ..., 0) pointing in the direction of the i-th axis (orthogonality assumption). The set of vectors t_i, i = 1, ..., M forms a canonical basis for the Euclidean space R^M. Any document vector d_j can be represented by its canonical basis expansion:

$$ d_j = \sum_{i=1}^{M} w_{i,j} \cdot t_i \qquad (2.2) $$

If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see previous section). The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus). In my research the distinct tags x in the corpus are adopted as the basis of the VSM, each city (which can be seen as a document) is represented as a vector in this space (see Chapter 3.7.1), and the weights are computed using the tf-idf approach (see 3.6.3). Vector operations can be used to compare documents with queries; in particular, it is often assumed that documents that are close to each other in the vector space are similar to each other. Using this assumption in a keyword search, the relevance rankings of documents to a given query can be calculated.

⁴⁶ http://www.tcnj.edu/~mmmartin/EThul/SMART/smart-pres.pdf

Cosine Similarity

The Vector Space Model computes the similarity score SC(q, d_j) between the query and each document, and produces a ranked list of documents. Various measures can be used to assess the similarity between documents (Euclidean distance, inner product, Jaccard and Dice similarity, etc.), but the most used is the cosine similarity[28]. Using the assumption that documents that are close to each other in the vector space are similar to each other, it measures the similarity (not the distance) between two vectors by measuring the cosine of the angle between them. The cosine is equal to 1 when the angle is 0, and less than 1 for any other angle. Calculating the cosine of the angle between two vectors thus determines whether the two vectors are pointing in roughly the same direction. The cosine measure normalizes the results by considering the length of the document vector; for normalized vectors, the cosine similarity is equal to the inner product:

$$ SC(q, d_j) = \cos(\alpha) = \frac{q^T d_j}{|q| \, |d_j|} $$

The ranking is thus obtained by comparing the deviation of angles between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents.
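The cosine score and the resulting ranking can be sketched as follows. This is only an illustration under simple assumptions: vectors are stored as dictionaries mapping a term to its weight (absent terms count as zero), empty vectors are given a score of 0 by convention, and the function names and sample vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """a, b: sparse vectors as dicts mapping term -> weight."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    scores = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": {"beach": 1.0, "sea": 1.0},
    "d2": {"beach": 2.0, "sea": 2.0},   # same direction as d1, twice as long
    "d3": {"museum": 3.0},              # no term in common with the query
    "d4": {"beach": 1.0, "museum": 1.0},
}
query = {"beach": 1.0, "sea": 1.0}
ranking = rank(query, docs)
print(ranking)
```

Documents d1 and d2 point in the same direction and therefore receive (up to rounding) the same score, which illustrates the length normalization mentioned above; d3 shares no term with the query and scores 0.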

Figure 2.5: A three-dimension example of the Vector Space Model

2.3.3 VSM Related Issues

The Vector Space Model was first introduced in the 60s but it is still widely used nowadays (with all the improvements) in several Information Retrieval applications. Nevertheless, there are some issues to take into consideration when adopting this model to compute a ranked list of document similarities to a given query (the aim of this research):

1. Search keywords must precisely match document terms, because word substrings might result in a false positive match (e.g. “car” vs “cartoon”).

2. Semantic sensitivity; documents with similar context but different term vocabulary will not be associated, resulting in a false negative match.

3. The order in which the terms appear in the document is lost in the vector space representation.

4. The assumption of term independence rarely reflects real-world documents.

5. Impossibility of formulating “structured” queries: it is not possible to use operators such as OR, AND, NOT, etc.

6. Terms represent axes, which means the number of dimensions can become very high.

7. Each document is represented by the weights of its terms with respect to the dictionary (or corpus) used. Each document contains only a fraction of the terms in the corpus, resulting in a sparse term-document matrix (i.e., inefficient for the calculation).

To mitigate point 7, different storage schemes are usually used. For example, an inverted index is usually adopted so that the zero entries of the sparse term-document matrix are not stored. In addition, different text preprocessing methods are performed before computing the term weights to reduce the dimensionality (point 6) and the semantic sensitivity (point 2), by reducing inflectional forms and derivationally related forms of a word to a common base form. Usually in the VSM this is done using:

• Stemming: heuristic process that chops off the ends of words[18, 23].

• Lemmatization: accurate process with the use of a dictionary and morphological analysis of words.
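The inverted index mentioned for point 7 can be sketched as a mapping from each term to the documents that contain it. The example below is only illustrative (function names and sample weights are made up, and it is not the storage scheme of any particular retrieval system); it shows how query-document inner products can be accumulated while touching only the non-zero entries.

```python
from collections import defaultdict


def build_inverted_index(weights):
    """weights: dict doc_id -> {term: weight}.
    Returns a dict: term -> {doc_id: weight}."""
    index = defaultdict(dict)
    for doc_id, vector in weights.items():
        for term, weight in vector.items():
            if weight != 0.0:        # zero entries are simply never stored
                index[term][doc_id] = weight
    return index


def dot_products(query, index):
    """Accumulate query-document inner products, visiting only
    the documents that share at least one term with the query."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return dict(scores)


doc_weights = {
    "d1": {"sea": 0.5, "beach": 0.25},
    "d2": {"sea": 1.0},
    "d3": {"museum": 2.0},
}
index = build_inverted_index(doc_weights)
print(dot_products({"sea": 2.0}, index))
```

Document d3 never appears in the result because it shares no term with the query, so no work is spent on it.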

2.4 Related Works

Several studies have been conducted on extracting knowledge from the georeferenced metadata of Flickr photo collections. The most relevant and representative works related to my research are briefly described below to better frame my approach.

2.4.1 Wormholes

In “Finding Wormholes with Flickr Geotags”[3] Maarten Clements⁴⁷ et al. propose a kernel convolution method to predict similar locations (wormholes) based on human travel behaviour. Wormholes are defined as similar, but not necessarily spatially close, locations on the planet. Their hypotheses can be summarized as follows:

• users have a specific travel preference and therefore visit locations that are to some extent similar

• taking a photo at a visited location is an indication that the user likes that location

Based on these hypotheses, the aggregated travel data of many users should be able to reveal which locations are most similar to a given query location. In photo sharing websites like Flickr, users can upload their pictures, indicate their geographical location and annotate them with text-based tags. The method combines these two kinds of information into a prediction of similar locations.

Dataset

Using the public API of Flickr, the research group collected the top-100 most popular localities (cities, parks, etc.) for each day in 2008. The aggregated data contains 8,643 places. To retrieve the geotagged data, they repeatedly followed this procedure:

• Select a location l from the full distribution, with probability relative to its global popularity in 2008

• Get a photo il from this location

• Get all the photos from the user who made il

Following this strategy, they collected the tags and the related coordinates of 36,264 users. Together these users have uploaded 52,425,279 photos, of which almost 23 million have been geotagged (see Figure 2.6).

⁴⁷ http://homepage.tudelft.nl/5q88p/

Figure 2.6: Geotagged World photos distribution collected for the “Wormholes” research

Wormhole Detection

Given a target location L, the algorithm aims to find the most similar locations around the world. For each user u, a weight W_{L,u} is computed based on the distance of the nearest geotagged photo of the user to the target location, weighted by a normal distribution:

where the standard deviation σ of the normal distribution determines how many users are considered relevant for the target location, and can therefore be used as a scaling parameter, and d(L, G_{u,i}) computes the Euclidean distance between the i-th geotag of a user, G_{u,i}, and L.

The weighting function slowly decays when the nearest geotag⁴⁸ is found further from the target location L:
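The formula itself is not reproduced in this text, so the following is only a sketch of the behaviour described: a per-user weight that depends on the user's nearest geotag and decays with distance according to a normal distribution. It assumes an unnormalized Gaussian, points given as (x, y) pairs in a planar projection with distances in the same unit as σ, and a weight of 0 for users without geotags; the exact form used by Clements et al. may differ, and the function name is invented.

```python
import math


def user_weight(target, user_geotags, sigma):
    """Unnormalized Gaussian of the distance between the target location
    and the user's nearest geotag (assumed form, see the note above)."""
    if not user_geotags:
        return 0.0
    # Euclidean distance to the nearest geotag of this user.
    nearest = min(math.dist(target, g) for g in user_geotags)
    return math.exp(-(nearest ** 2) / (2.0 * sigma ** 2))


target = (0.0, 0.0)
print(user_weight(target, [(10.0, 0.0), (300.0, 40.0)], sigma=50.0))
print(user_weight(target, [(200.0, 0.0)], sigma=50.0))
```

With σ = 50, a user whose nearest photo is 10 units away keeps a weight close to 1, while a user whose nearest photo is 200 units away contributes almost nothing; a larger σ makes more users relevant for the target location.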

The wormholes are found by aggregating the geotags of all users with W_{L,u} as the weight per user. The aggregated user information is convolved with a Gaussian kernel to create a smooth prediction profile. The difference between the resulting profile and a distribution based on all users gives a score that indicates the relevance of each position on Earth with respect to the target location L.

Method Evaluation

Maarten Clements et al. showed that geotags can effectively be used to predict similar locations with high precision. Predicting the wormholes from Mount Everest clearly shows the similar locations: ’The Rocky Mountains’, ’Mnt. Kilimanjaro’, ’The Scottish Highlands’ and some other mountain ranges in Indonesia, Japan and New Zealand. Many of the urban areas in the USA and Europe are predicted to have a negative relation with Mount Everest. Figure 2.7 shows the wormhole prediction for Mount Everest; positive predictions are blue, negative ones red. Further information can be found by visiting the project web page⁴⁹.

Figure 2.7: Wormholes detection from Mount Everest with σ = 50 km.

⁴⁸ The term is used here to mean the textual tag that also carries the georeference information of the photo it describes.
⁴⁹ http://homepage.tudelft.nl/5q88p/wormholes/

2.4.2 World Explorer

In “World Explorer: Visualizing Aggregate Data from Unstructured Text in Geo-Referenced Collections”[1] researchers from Yahoo! Research Berkeley⁵⁰ show how to analyze the tags associated with geo-referenced Flickr images to generate aggregate knowledge in the form of “representative tags” for arbitrary areas in the world. They used these tags to create a visualization tool, “World Explorer”⁵¹, which aims to help reveal the content of the analyzed data, using a map interface to display the derived tags and the original photo items. The data analysis of the system is based on multi-level clustering and TF-IDF-based (term frequency, inverse document frequency) scoring of tags. The visualization exposes, for each map region and zoom level, the high-scoring tags for the generated clusters; these tags are shown as text over the map area where each cluster occurs. There are challenges in analyzing and visualizing such unstructured user-contributed data. Issues of noise (e.g. photos with tags that are not relevant to the location) and errors (e.g. photos that are geotagged incorrectly) abound in the Flickr data. The algorithm tries to handle these in a graceful manner. There are also considerations of scale, especially as the amount of data increases. To summarize, the contributions of the work are:

• An approach for deriving meaningful data from unstructured text associated with geo-referenced collections.

• A sample application that derives such information from Flickr geotagged images.

• A visualization technique for large-scale geo-referenced photo collections, that allows automatically-derived and effective world exploration via photos and maps.

Only the first two points will be explained, because they are the ones relevant to my work.

⁵⁰ http://research.yahoo.com/Yahoo_Research_Berkeley
⁵¹ http://tagmaps.research.yahoo.com/worldexplorer.php

Figure 2.8: The World Explorer Map for a large scale of details

Figure 2.9: The World Explorer Map for a narrow scale of details for the City of Rome in Italy

Dataset

The dataset used in the research consisted of 6 million public geotagged photographs on Flickr (data collected in October 2006). Almost 90% of them were associated with user-entered tags. While the heaviest concentration of geotagged photos was found in the United States and Western Europe, the dataset had very wide coverage and included photos from almost every country in the world. The researchers excluded from their analysis the photographs with an accuracy level lower than 10 (approximately city level). A further method was used to reduce the noise in the data: the weighting of tags, described below.

Data Model and Objectives

The dataset consists of three basic elements: photos, tags and users. Using such data, the algorithms find the tags that are most “representative” for each given geographical area G. It is important to note that these representative tags are often not the most commonly used tags within the area under consideration. Instead, the aim is to extract tags that uniquely define sub-areas within the area in question. For example, if the user is examining a portion of the city of Rome, there is very little advantage in showing the “Roma” or “Rome” tags, even if they are the most frequent ones. Instead, it is useful to show tags such as “Pantheon”, “San Pietro” or “Villa Borghese”, which uniquely represent specific locations, landmarks and attractions within the city.

The first step in determining the “representativeness” of a tag is to have an intuition of what the term implies. While there are no formal models to define how representative a tag is for an area, Mor Naaman et al. followed some simple heuristics that guided the design of the algorithms. The heuristics attempt to capture human attention and behavior within the photo and tag dataset, and include the notions that:

• The number of photographs taken in a location is an indication of the relative importance of that location.

• The importance of a location increases with the number of individual photographers that have taken photos there.

• Users are likely to use a common set of tags to identify the objects/events/locations that occur in photographs of a specific location.

• Tags that occur in a concentrated area (and do not occur often outside that area) are more representative than tags that occur diffusely over a large region.

• The more users that used a tag in an area, the more representative the tag is for that area.

Based on the data, the requirements from the analysis are therefore:

• Identify important regions for every map region and zoom level.

• Select representative tags for the identified regions.

Computing Tags for a Geographic Area

The analysis starts by assuming that the system considers a single given geographic area G and the set of photos taken in this area, $P_G$. The system attempts to extract the representative tags for G. This computation is done in two main steps:

• cluster the set of photos PG using the photos’ geographic locations.

• assign scores to the tags in each cluster.

For the first step a K-means approach was used to cluster the photos within an area G based on their latitude and longitude. Once the clusters have been determined, the system scores each cluster's tags to see whether representative tags can be extracted for each cluster C. Each cluster is associated with a set of tags. The system ranks each tag x in the set so that the top tags are, according to the defined heuristics, the most representative ones. The main factor used for scoring is a TF-IDF approach that assigns a higher score to tags that have a larger frequency within a cluster compared to the rest of the area under consideration (based on the assumption that the more unique a tag is for a specific cluster, the more representative the tag is for that cluster).
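The clustering step can be illustrated with a minimal sketch. It is not the World Explorer implementation: the function name `kmeans`, the caller-supplied initial centroids, the fixed number of iterations and the plain Euclidean distance on raw latitude/longitude pairs are all assumptions made for illustration.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Cluster (lat, lon) points around the given initial centroids.

    Illustrative only: plain Euclidean distance on raw coordinates,
    a fixed number of iterations, and caller-supplied initial centroids.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each photo to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for k, members in enumerate(clusters):
            if members:
                centroids[k] = (
                    sum(lat for lat, _ in members) / len(members),
                    sum(lon for _, lon in members) / len(members),
                )
    return centroids, clusters

# Hypothetical photo coordinates forming two small groups in Rome.
photos = [(41.8986, 12.4769), (41.8990, 12.4765), (41.9022, 12.4539), (41.9029, 12.4534)]
centroids, clusters = kmeans(photos, centroids=[photos[0], photos[2]])
```

With these toy coordinates the first two photos end up in one cluster and the last two in the other; the tags of the photos in each cluster would then be scored as described in the next paragraphs.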

TF-IDF Weighting

The TF-IDF is computed with a slight deviation from its regular use in Information Retrieval. The term frequency $tf(x)$ for a given tag x within a cluster C is the number of times x was used within the cluster. The inverse document frequency for a tag x is usually computed from the ratio between the number of documents (i.e. clusters) in the entire area G under consideration and the number of documents that contain photos with this tag. In this particular context, the authors modified the measure to consider the overall ratio of the tag x amongst all photos in the region G under consideration: $idf(x) = |P_G| / |P_{G,x}|$, where $P_{G,x}$ is the set of photos in G tagged with x. They modified the standard idf because of the small number of clusters obtained for each area: by checking only the presence of a tag within a cluster, as the standard idf does, they could face large changes in the TF-IDF weights if even a single photograph in the cluster contained the tag. Moreover, they only consider a limited set of photos ($P_G$) for the IDF computation, instead of using the statistics of the entire dataset. This restriction to the current area G makes it possible to identify local trends for individual tags, regardless of their global patterns.

As mentioned in the section above, the Flickr data is likely to contain noise. One scenario is a single photographer who takes a large number of photographs using the same tag. To guard against this case, the researchers included a user element in the scoring, which also reflects the heuristic that a tag is more valuable if a number of different photographers use it (as a further guard, they assign a score of 0 to any tag that was used by fewer than 3 photographers in a given cluster). In particular, the factor is the fraction of photographers in the cluster C that use the tag x: $uf(x) = U_{C,x} / U_C$. Finally, the score for tag x in cluster C is computed as $Score(C, x) = tf(x) \cdot idf(x) \cdot uf(x)$.
For each cluster only the tags that score above a certain threshold are retained and considered representative.
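The scoring described above can be sketched as follows. This is a reading of the formulas in this section, not code from the original paper: the function name `score_cluster_tags`, the representation of a photo as a `(user, tags)` pair and the toy data are assumptions.

```python
from collections import Counter, defaultdict

MIN_PHOTOGRAPHERS = 3  # tags used by fewer photographers in a cluster score 0

def score_cluster_tags(cluster_photos, area_photos):
    """Score each tag of a cluster as tf * idf * uf.

    Each photo is a (user, tags) pair; area_photos holds every photo
    in the area G, including those of the cluster.
    """
    # tf(x): how many times the tag was used inside the cluster.
    tf = Counter(tag for _, tags in cluster_photos for tag in tags)
    # |P_G,x|: number of photos in the whole area that carry the tag.
    area_count = Counter(tag for _, tags in area_photos for tag in set(tags))
    # Distinct photographers using each tag in the cluster.
    users_per_tag = defaultdict(set)
    for user, tags in cluster_photos:
        for tag in tags:
            users_per_tag[tag].add(user)
    cluster_users = {user for user, _ in cluster_photos}

    scores = {}
    for tag, count in tf.items():
        if len(users_per_tag[tag]) < MIN_PHOTOGRAPHERS:
            scores[tag] = 0.0
            continue
        idf = len(area_photos) / area_count[tag]
        uf = len(users_per_tag[tag]) / len(cluster_users)
        scores[tag] = count * idf * uf
    return scores

# Hypothetical photos: one cluster around the Pantheon inside a larger area.
cluster = [
    ("u1", ["pantheon", "rome"]),
    ("u2", ["pantheon", "rome"]),
    ("u3", ["pantheon", "rome"]),
    ("u3", ["cat"]),
]
others = [("u4", ["rome"]), ("u5", ["rome"]), ("u6", ["rome", "colosseo"]), ("u7", ["rome"])]
scores = score_cluster_tags(cluster, cluster + others)
```

In this toy example “pantheon” scores higher than “rome”, although both appear three times in the cluster, because “rome” is spread over the whole area; “cat” scores 0 because only one photographer used it.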

The other parts of the algorithm (indexing, storage and retrieval) are not relevant for my work and will not be explained. For further information, the original paper provides more details and explanations[1].

Chapter 3

My Approach

In this chapter the theoretical approach of this work is explained. The chapter begins with the formalization of the hypotheses, through the assumptions on which the entire work is based. In particular, the concept of place similarity is clarified as intended in the scope of this research. Then the two datasets used are presented, explaining why they are needed and why two different analyses were performed. Next, the term definitions are listed to give the reader a clear overview of the symbols used in this work. The chapter continues by discussing the core of the research, that is, the TF-IDF weighting methods and the Vector Space Model representation. Finally, the approaches to score and retrieve similar locations are presented.


3.1 Approach Introduction

The objective of the research is to find place similarity by analyzing the aggregated georeferenced tags from a subset of Flickr photos. The objectives are similar to those presented in the description of the “Wormholes” research (2.4.1), but they are pursued using a TF-IDF approach similar to the one described in the “World Explorer” research (2.4.2) in order to extract representative tags describing each place. Once a representative set of tags describing each place was obtained, I used an Information Retrieval method (the Vector Space Model) to compute a ranked list of places according to their “similarity”. Finally, I could retrieve a list of places similar to a location given as a query term by looking up the pre-computed ranked list.

I can summarize my research in the following steps:

1. Extract a representative description for each city

2. Find similarities between these representations and store the ranked list of place similarities

3. Given a city as a query term, find the most similar one by analyzing the previously calculated ranked list

I analyzed the distribution of the textual tags of Flickr geotagged photos from the union of the top 150 tourism city destinations for the years 2007 and 2008¹, for a total of 233 cities.

¹ According to “Euromonitor International”: http://www.euromonitor.com/Top_150_City_Destinations_London_Leads_the_Way

3.2 Assumptions

I based my research on assumptions derived from the two main related works described in Chapters 2.4.1 and 2.4.2. For both datasets and their two related analysis methods (see Chapter 3.3) the following assumptions hold:

1. Users have a specific travel preference and therefore visit locations that are to some extent similar.

2. Taking a photo at a visited location is an indication that the user likes that location.

3. Users are likely to use a common set of tags to identify the objects/events/locations that occur in photographs of a specific location.

4. The extracted representative set of tags for a place provides a representative description of that place.

5. A place is defined as “similar” to another (not necessarily spatially close locations on the planet) if their representative descriptions are similar.

6. The more users used a tag in an area, the more representative the tag is for that area (tf term).

7. Tags that occur in a place (and do not occur often outside that place) are more representative than tags that occur diffusely over all the places (idf term).

8. The tags are assumed to be mutually independent (VSM condition).

3.3 Datasets

I used the public Flickr API to retrieve the data I worked on. Flickr offers an API that returns the top 100 most frequent tags for a given place (see 4.2.2), which is an easy way to retrieve the data. Unfortunately, this API does not give any extra information about the users who took the photos or tagged them, nor about the photos and their exact coordinates. Also, in my opinion, a collection of only 100 tags cannot properly describe a wide area such as a city, because of the high probability of obtaining trivial results. For this reason I decided to carry out two different experimental analyses: one using this dataset (with all its limitations), and another one using a subset of the actual geotagged photo distribution, obtained by randomly sampling the geotagged photos of each place. The second analysis is based on the hypothesis that a random sample of the original distribution yields another distribution that resembles some of its characteristics.

Dataset “Top 100”

For the reasons mentioned above this dataset, which I will call “Top 100”, consists of the list of the 100 most common tags for each of the 233 cities analyzed, together with their relative frequencies (23,300 tags in total).

Dataset “Random”

I obtained the “Random” dataset by performing a random sampling to retrieve a collection of photos (and related tags), trying to overcome the apparent triviality of the “Top 100” dataset. I performed a random sampling (over the time parameter) mainly for two reasons. First, I did not have the time or the resources to retrieve all the geotagged photos for each place. Flickr does not support aggregate query functions like “count” in SQL-like languages, so there is no way to know in advance the number of geotagged photos for each place. However, to give an idea of the magnitude, we can observe that for the city of London the tag “london” alone appears with a frequency of 1.2 million (observed from the “Top 100” dataset), which means that there are, with a huge underestimation, at least 1.2 million geotagged photos for the city of London alone. Extending these rough estimates to 233 different cities clearly shows that retrieving such an amount of data was impossible in the time available for this research.

The second reason to perform a random sampling comes from experimental tests in which I retrieved the geotagged photos of a fixed period of time for all the places. From these tests I recognized that the data were biased by particular events occurring in that period in one or more of the 233 places (e.g. a parade, elections, a rock concert), so that such a dataset would not represent the characteristics of the actual geotagged photo distributions.

Using the “Random” dataset I could obtain much more useful information that is not available in the “Top 100” dataset, such as the user who tagged a particular photo. This information was essential in order to introduce a further assumption:

• The importance of a tag for a place increases with the number of individual photographers that use it in that place² [1] (Uf term).

In other terms, a tag is more valuable if a number of different photographers (the users who tag) use it.

NOTE: the details and the conditions of the “Random” sampling will be explained in Chapter 4.2.2.

² Modified from the assumptions of the “World Explorer” research: Chapter 2.4.2

3.4 Definitions

• A tag is represented by $t_i$

• The union of all the retrieved tags $t_i$ for all the cities represents a corpus C

• The corpus for the “Top 100” distribution is defined as $C_{100}$

• The corpus for the “Random” distribution is defined as $C_{Rnd}$

• The set of the 233 places (cities) under analysis is defined as G

• Each place (a city) in G is represented by P

• A retrieved photo for the “Random” dataset is defined as f

• The set of all the retrieved photos for the “Random” dataset is defined as F

• A user who tagged a photo in the “Random” dataset is defined as u

3.5 Flickr: from Narrow to Broad Folksonomy

A particular aspect of this research captured my attention. As mentioned in 2.1, Flickr represents a narrow folksonomy, so we can hardly extract trends and usage patterns from it. This is true if we consider the distribution of tags across the photos, but in this case we are considering the distribution of tags belonging to photos that are aggregated by the place where they were taken. This means that, by tagging a geotagged photo with text-based terms, the user is not only describing the photo but also the place where the photo was taken (Assumption 3). In other words, if we take the place (city) as the object being tagged, by aggregating the tags of the photos taken in that place, we can infer that under this hypothesis Flickr is a broad folksonomy, because more than one user can re-use the same terms to describe the object (the city). This can be shown empirically by plotting the tag distribution for one of the places under study, for example New York:

Figure 3.1: Tag distribution for the city of New York using the “Top 100” dataset

The distribution clearly follows the power law curve (see Chapter 2.1) typical of broad folksonomies[7].

We can now consider the same distribution obtained from the “Random” dataset:

Figure 3.2: Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag

We again obtained a power law curve, the only differences being a scale factor³ and a wider tag vocabulary. This also shows that the “Random” sampling distribution can be considered as valid as the “Top 100” distribution.

³ The scale factor is not relevant because of the scale-invariance property of the power law curve (Chapter 2.1.1).

We can now observe the distribution on a log-log scale:

Figure 3.3: Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag using a log-log scale

In Figure 3.3 we can identify an approximately linear trend, which indicates (with all the due approximations) that the curve in Figure 3.2 is a power law.
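The reasoning behind the log-log plot is that a power law $f(r) = c \cdot r^{-k}$ becomes the straight line $\log f = \log c - k \log r$ once both axes are logarithmic. As a sketch of how this could be checked numerically, the snippet below fits a least-squares line to log(frequency) against log(rank). The function name `loglog_slope` and the synthetic frequencies are assumptions for illustration; no real Flickr data is used.

```python
from math import log

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    frequencies must be sorted in decreasing order; rank starts at 1.
    """
    xs = [log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic frequencies following f(rank) = 1000 * rank^(-1.2); not real Flickr data.
synthetic = [1000 * rank ** -1.2 for rank in range(1, 151)]
slope = loglog_slope(synthetic)  # close to the exponent -1.2
```

On real tag counts the points would only approximately lie on a line, so the fitted slope should be read as a rough estimate of the exponent.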

NOTE: The hypothesis that the “Random” distribution presents some of the characteristics of the real one will not be proven, because this is not the aim of this research. The “Random” distribution has been used only for lack of resources to obtain the real one, and it will be considered valid for this scope.

3.6 Extract a representative description

Assumption 5 states that “a place is defined as “similar” to another (not necessarily spatially close locations on the planet) if their representative descriptions are similar”. So the first step in retrieving similar places is to extract their representative descriptions, which are then compared in a further step. The union of all the retrieved tags for all the cities represents a corpus C, and each city is described using all the terms (tags) in the corpus C. Extracting the representative description of a city means giving a weight to each tag in the corpus, identifying how “important” it is in describing that place. The “importance” score has been computed using a TF-IDF weighting.

3.6.1 Vector Space Model Representation

In the Vector Space Model (see Chapter 2.3.2) text documents are represented as vectors in a multidimensional space of index terms. Each dimension corresponds to a separate term, and the model is based on the assumption of independence between terms (Assumption 8). The objective of this research is to find similarities between geographical places, but if we consider Assumption 5 (Chapter 3.2) we can reduce the problem to finding similarities between their descriptions.

As a further step we can recall that each place (city) is described using all the terms (text-based tags) in the corpus C, taken with different weights depending on the “importance” of a tag in describing each place (see TF-IDF in Chapter 3.6.3). Under these considerations the description of a place is nothing more than a text document, and since we want to find similarities between the descriptions of the places, the overall objective is reduced to finding similarities between text documents. Hence, standard text-based Information Retrieval models can be applied, and I decided to adopt the Vector Space Model to represent each place in G.

3.6.2 Problem Decomposition

Since the objective of the research may not be clear once the VSM is adopted, I will explain a step-by-step problem decomposition:

• each place P is described by its representative description (Assumption 4, Chapter 3.2);

• a place description is, in turn, represented in the VSM by a vector V in an M-dimensional space;

• the basis of the VSM consists of the M distinct tags occurring in the corpus C;

• the components of the generic vector V are the weights obtained by the TF-IDF approach (Chapter 3.6.3);

• the places similarity problem is reduced to the similarity of their descriptions (Assumption 5, Chapter 3.2);

• the descriptions similarity problem is reduced to the vectors similarity problem;

• the places similarity is calculated computing the vectors similarity.

From now on we can refer to the generic place P as its description (see next section). The Vector Space Model is based on the assumption that documents (places) that are close to each other in the vector space are similar to each other. Since the query and the documents are treated in the same way and represented in the same space, a similarity score between a query and a document is computed by applying a distance or similarity function to the vectors representing them.

In my case I want to compute the similarity scores between a place P in the set G (considered as the query) and all the other places in G (considered as the documents). As a consequence, we can find the place similarity of each place in G with respect to all the others by adopting a similarity measure (not a distance) between the vectors. The cosine similarity has been adopted to compute a ranked list of similarities for each place P with respect to all the other places in G.

3.6.3 TF-IDF Weights

Each place P in G is described by all the tags t in the corpus C, taken with different weights. The TF-IDF approach was used to score each tag $t_i$ in C according to the “importance” that the tag has in describing the place P (based on Assumptions 6 and 7, Chapter 3.2). The TF-IDF assigns a higher score to tags that have a larger frequency within a city compared to the rest of the places (term frequency factor). The term frequency factor is the same for both datasets and analyses, while the inverse document frequency is calculated with some heuristic deviations from its standard usage for the “Random” dataset (and related analysis) in order to find the best weighting scheme to represent places in the VSM.

TF

The term frequency $tf_{i,j}$ for a given tag $t_i$ within a place $P_j$ is the number of times $t_i$ was used within the place $P_j$. Depending on the dataset under analysis, $t_i$ is a tag in the $C_{100}$ or in the $C_{Rnd}$ corpus.

IDF for the “Top 100” Dataset

The inverse document frequency for a tag $t_i$ in the corpus $C_{100}$ is the logarithm of the ratio between the number of documents (i.e. cities) in the set and the number of documents that contain this tag. This represents the heuristic that “tags that occur in a place (and do not occur often outside that place) are more representative than tags that occur diffusely over all the places”.

In other terms, the standard $idf_i$ is:

$$idf_i = \log_2 \frac{|G|}{|P_i|}$$

where $|G|$ is the number of places used in the analysis (233) and $|P_i|$ is the number of places containing the term $t_i$.
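As a minimal sketch of the standard inverse document frequency defined above, the following function computes $\log_2(|G| / |P_i|)$ for every tag. The function name `standard_idf`, the dictionary layout and the four-place toy corpus (instead of the 233 cities) are assumptions for illustration.

```python
from math import log2

def standard_idf(place_tags):
    """idf_i = log2(|G| / |P_i|) for every tag in the corpus.

    place_tags maps each place name to the set of tags observed there.
    """
    total_places = len(place_tags)            # |G|
    places_with_tag = {}                      # |P_i| for each tag
    for tags in place_tags.values():
        for tag in tags:
            places_with_tag[tag] = places_with_tag.get(tag, 0) + 1
    return {tag: log2(total_places / n) for tag, n in places_with_tag.items()}

# Hypothetical toy corpus with four places instead of 233.
toy = {
    "Rome": {"colosseum", "church", "italy"},
    "Milan": {"duomo", "church", "italy"},
    "Paris": {"eiffel", "church"},
    "Cairo": {"pyramid"},
}
idf = standard_idf(toy)
```

In the toy corpus “church” occurs in three places out of four and gets the lowest idf, “italy” (two places) gets 1.0 and “colosseum” (one place) gets 2.0, which matches the heuristic that diffuse tags are less representative.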

IDF for the “Random” Dataset

The “Random” dataset carries extra information (photos and users) that was used to compute two deviations of the inverse document frequency from its standard Information Retrieval usage, in order to heuristically find the best weighting scheme to represent places in the VSM.

In the “Random” analysis the generic tags $t_i$ belong to the $C_{Rnd}$ corpus and places are described using the terms in $C_{Rnd}$.

1) Firstly, the standard $idf_i$ was calculated:

$$\text{idf-rnd1}_i = \log_2 \frac{|G|}{|P_i|}$$

where $|G|$ is the number of places used in the analysis (233) and $|P_i|$ is the number of places containing the term $t_i$ ($t_i$ in $C_{Rnd}$).

2) For the first idf deviation I adopted one of the heuristics introduced in the “World Explorer” research (Chapter 2.4.2), exploiting the extra information coming from the photos. In this case I estimate the overall importance of a tag $t_i$ across the photo collection F and not across the set of places G:

$$\text{idf-rnd2}_i = \frac{|F|}{|f_i|}$$

where $|F|$ is the total number of photos in the “Random” dataset and $|f_i|$ is the number of photos containing the tag $t_i$. This measure has been introduced in an attempt to avoid large changes in the TF-IDF weights if even a single photo in the whole dataset contained the tag $t_i$.

3) For the second modified idf formula I followed the extra assumption that “the importance of a tag for a place increases with the number of individual photographers that use it in that place” for this dataset (Chapter 3.3). We refer to photographers as the users who tagged the pictures, hence those who contributed to the “descriptions” of the places. To compute the “importance” of a tag in describing each place under this assumption, I introduce a further measure, the user factor:

$$Uf_{i,j} = \frac{|u_{i,j}|}{|u_j|}$$

where $|u_{i,j}|$ is the number of users tagging with the tag $t_i$ in the place $P_j$, while $|u_j|$ is the number of distinct users who tagged at least one photo in $P_j$.

In this case the inverse document frequency will be:

$$\text{idf-rnd3}_{i,j} = \frac{|f_j|}{|f_{i,j}|}$$

where $|f_j|$ is the number of photos for the place $P_j$ and $|f_{i,j}|$ is the number of photos for the place $P_j$ containing the tag $t_i$. As for idf-rnd2, this measure has been introduced to prevent small groups of photos containing $t_i$ (within the place) from widely changing the tf-idf weights. The idf-rnd3 is not actually an inverse document frequency, because it does not only consider the “importance” of the tag $t_i$ across all the places in G but also introduces a place factor. So it gives the (relative) importance of a tag $t_i$ for the place $P_j$. It looks similar to the term frequency factor (meaning that it depends on both document and term) but has a very different meaning.
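The three photo- and user-based factors of the “Random” analysis can be computed in a single pass over the photo records. The sketch below is only an illustration of the formulas above: the function name `random_dataset_factors`, the `(place, user, tags)` record layout and the toy records are assumptions, and every record is assumed to carry at least one tag.

```python
from collections import defaultdict

def random_dataset_factors(photos):
    """Compute idf-rnd2, Uf and idf-rnd3 from (place, user, tags) records."""
    photos_with_tag = defaultdict(int)            # |f_i|
    photos_in_place = defaultdict(int)            # |f_j|
    photos_with_tag_in_place = defaultdict(int)   # |f_{i,j}|
    users_in_place = defaultdict(set)             # users counted in |u_j|
    users_with_tag_in_place = defaultdict(set)    # users counted in |u_{i,j}|

    for place, user, tags in photos:
        photos_in_place[place] += 1
        users_in_place[place].add(user)
        for tag in set(tags):
            photos_with_tag[tag] += 1
            photos_with_tag_in_place[tag, place] += 1
            users_with_tag_in_place[tag, place].add(user)

    # Keys of uf and idf_rnd3 are (tag, place) pairs.
    idf_rnd2 = {tag: len(photos) / n for tag, n in photos_with_tag.items()}
    uf = {key: len(users) / len(users_in_place[key[1]])
          for key, users in users_with_tag_in_place.items()}
    idf_rnd3 = {key: photos_in_place[key[1]] / n
                for key, n in photos_with_tag_in_place.items()}
    return idf_rnd2, uf, idf_rnd3

# Hypothetical records: (place, user, tags).
records = [
    ("London", "anna", {"bigben", "london"}),
    ("London", "anna", {"bigben"}),
    ("London", "bob", {"london"}),
    ("Paris", "carl", {"eiffel"}),
]
idf_rnd2, uf, idf_rnd3 = random_dataset_factors(records)
```

In the toy records “bigben” appears in two of the three London photos but is used by only one of the two London users, so its user factor is 0.5 while that of “london” is 1.0.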

3.6.4 Weighting Systems

Finally, I obtained all the measures and data needed to calculate the weights of the tf-idf approach; these weights represent the components of the vectors (the places) in the vector spaces. This means that they have to be calculated for all the unique terms occurring in a place (with respect to the corpus used). For each place I computed four different weight measures, organized as follows:

System A) For the “Top 100” dataset only one weight was calculated, following the standard tf-idf (Chapter 2.3.1):

$$\text{W-100}_{i,j} = tf_{i,j} \cdot idf_i, \quad t_i \in C_{100}$$

For the “Random” dataset three different weights were calculated, depending on the idf definitions explained above.

System B) Weight for the “Random” dataset based on the standard tf-idf:

$$\text{W-rnd1}_{i,j} = tf_{i,j} \cdot \text{idf-rnd1}_i, \quad t_i \in C_{Rnd}$$

System C) Weight for the “Random” dataset based on the photo factor (see point 2 in Chapter 3.6.3):

$$\text{W-rnd2}_{i,j} = tf_{i,j} \cdot \text{idf-rnd2}_i, \quad t_i \in C_{Rnd}$$

System D) Weight for the “Random” dataset based on the user factor (see point 3 in Chapter 3.6.3):

$$\text{W-rnd3}_{i,j} = tf_{i,j} \cdot \text{idf-rnd3}_{i,j} \cdot Uf_{i,j}, \quad t_i \in C_{Rnd}$$

3.6.5 VSM Limitations

The Vector Space Model presents some limitations, as already mentioned in Chapter 2.3.3, which affect the performance and the expectations of this research. The limitations that most affected the system will be analyzed here, also in order to understand the overall system evaluation explained in Chapter 5.3. Some countermeasures were adopted in an attempt to overcome some of the disadvantages typical of the VSM.

Blacklist

In a first analysis of the two distributions (“Random” and “Top 100”) I noticed a certain number of tags recurring across G (the set of the 233 places). They were mainly the most common tags used across the whole Flickr collection, for example very frequent terms like “day”, “dog” and “friends”, or photography-related terms such as “canon”, “nikon” and “black&white”. From Assumption 7⁴ we know that these tags would not give any extra information for the aim of extracting a representative description for a place P.

The TF-IDF weighting could easily detect that these tags are not important for any of the places (idf term), so I could have left them in and let the system do the work. But since every single distinct term occurring in the corpuses defines a dimension in the VSM representation, without pruning these tags the corpuses $C_{100}$ and $C_{Rnd}$ presented a dimensionality problem. Even if these terms are not relevant, they would still be included in the corpuses and used in the weight and similarity computations, introducing only noise, system complexity and hence performance issues. For all these reasons I decided to prune them. I heuristically created a blacklist (see Appendix B) based on the most used tags in Flickr⁵ and on personal choices (see also Section 4.3.2.1). I also pruned all the tags occurring in only one city of G, because in the VSM they represent terms that will never match when computing the similarity scores.
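The two pruning rules (blacklisted tags and tags occurring in a single city) can be sketched as follows. The function name `prune_tags`, the data layout and the three-word blacklist are hypothetical; the actual blacklist is the one listed in Appendix B.

```python
from collections import Counter

def prune_tags(place_tags, blacklist):
    """Drop blacklisted tags and tags that occur in only one place.

    place_tags maps each place to a {tag: frequency} dictionary.
    """
    # Count in how many places each tag appears.
    places_per_tag = Counter(tag for tags in place_tags.values() for tag in tags)
    return {
        place: {
            tag: freq
            for tag, freq in tags.items()
            if tag not in blacklist and places_per_tag[tag] > 1
        }
        for place, tags in place_tags.items()
    }

# Hypothetical blacklist and tag counts.
blacklist = {"canon", "nikon", "friends"}
places = {
    "Rome": {"church": 40, "canon": 90, "pantheon": 25},
    "Milan": {"church": 30, "canon": 70, "duomo": 35},
}
pruned = prune_tags(places, blacklist)
```

In this two-city toy case only “church” survives: “canon” is blacklisted, while “pantheon” and “duomo” each occur in a single city and could never contribute to a similarity score.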

⁴ Chapter 3.2
⁵ http://www.flickr.com/photos/tags/

Inverted Index

Considering how the weights are calculated using the general TF-IDF weighting (previous section), we can easily observe that if a term $t_i$ in the corpus does not occur in a document $d_j$, its term frequency $tf_{i,j}$ is zero, hence its weight $w_{i,j}$ is zero. This leads to a large number of zeros inside the “term-document” representation matrix; in other terms, the matrix is sparse. Considering for example the “Random” dataset, I obtained a corpus of 55,000 unique terms, determining as many dimensions. This means that each of the 233 cities analyzed is represented by 55,000 weights, most of them being zeros. This leads to two main problems: computational inefficiency and non-optimal resource allocation. To prevent these problems I adopted an inverted index storage scheme.

In this way, for each place $P_j$ in G only the weights of the terms having a non-zero term frequency $tf_{i,j}$ are computed; that is, the weights are computed only for the terms (of the respective corpus) occurring at least once in the place description. All the other values are zero. The inverted index improves both computational and storage efficiency.
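A minimal sketch of such a sparse storage scheme is given below. The exact structure used in this work is not specified here, so the layout (each tag mapped to the places where its weight is non-zero), the function name `build_inverted_index` and the toy weights are assumptions.

```python
from collections import defaultdict

def build_inverted_index(weights):
    """Map each tag to the places where its weight is non-zero.

    weights maps each place to a {tag: weight} dictionary.
    """
    index = defaultdict(dict)
    for place, tag_weights in weights.items():
        for tag, weight in tag_weights.items():
            if weight != 0:
                # Zero weights are never stored.
                index[tag][place] = weight
    return dict(index)

# Hypothetical weights for two places.
toy_weights = {
    "Rome": {"church": 16.6, "pantheon": 50.0, "beach": 0.0},
    "Milan": {"church": 12.4, "duomo": 70.0},
}
index = build_inverted_index(toy_weights)
```

Only the non-zero entries are kept, so a tag such as “beach” with weight 0 takes no space, and the places sharing a tag can be found with a single lookup.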

Stemming Issue

Usually, in order to obtain a further reduction of the dimensionality and of the semantic sensitivity (a reduction, not a solution), different text preprocessing methods are applied before computing the term weights (see Chapter 2.3.3). These methods reduce inflectional forms and derivationally related forms of a word to a common base form. The most common way to accomplish this is stemming[18, 23]. The standard stemming algorithm for the English language is Porter's stemmer[23]. Stemming algorithms are language dependent, and Flickr represents a folksonomy (Chapter 2.1), which means that users are allowed to use free-text, unconstrained tags in different languages and alphabets. Even if the most common language used in Flickr tags is English[10], it is not possible to use a stemming algorithm in such a multi-language context. However, different tests have been carried out to detect the English-language tags in the corpus, attempting to apply Porter's stemmer at least to the English terms. The details will be explained in Chapter 4.3.2.2.

3.7 Scoring and Retrieving Similar Places

3.7.1 Scoring

Having two different datasets and corpuses, $C_{Rnd}$ and $C_{100}$, their respective distinct tags represent two different bases for two different VSM representations. This means that the places in G are represented in two distinct VSM representations: one uses the “Top 100” dataset (and analysis), considering the generic tags in $C_{100}$, and the other uses the “Random” dataset (and analysis), considering the generic tags in the $C_{Rnd}$ corpus. These two VSM models represent two different and separate analyses that we can call “Top 100 VSM” and “Random VSM”. The VSM representations are actually four, because in the “Random” analysis I introduced three interpretations of the idf; anyway, these three representations share the same analysis method and dataset.

In Chapter 3.6.1 the problem of similarity between places has been reduced to finding vectors close to each other in the VSM representations, so a measure of closeness between vectors is needed. I decided to adopt the cosine similarity. This is not properly a distance measure (see Chapter 2.3.2) but a “similarity” measure: it is the cosine of the angle between two vectors. Calculating the cosine of the angle between two vectors determines whether they are pointing in roughly the same direction, so the similarity of two vectors is determined by their directions.

The cosine similarity is the inner product of two vectors normalized by the product of their Euclidean norms. The Euclidean inner product is defined as:

$$a \cdot b = |a||b|\cos(\alpha)$$

from which we derive:

$$similarity(a, b) = \cos(\alpha) = \frac{a \cdot b}{|a||b|}$$

The cosine measure normalizes the results by considering the length of the (place) vectors. The angle between two vectors cannot be greater than 90° because the tf-idf weights cannot be negative, hence the cosine similarity of two places ranges from 0 to 1. The cosine similarity is equal to 1 when the angle is 0 (exact match, e.g. a vector with itself), it is less than 1 when the angle has any other value (partial match) and it is 0 when the angle is 90° (no matching terms). This is a convenient way of ranking places by measuring how close their vectors are to a query vector.

The similarity scoring between places has been performed calculating the cosine similarity of each vector (representing a place) with respect to all the other vectors (the other places). The computations have been performed separately for the “Top 100” and “Random” Vector Space Model representations.

NOTE: the similarities are intended between all the places in G (233 cities) with respect to all the others within this set.
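As an illustration of the formula above, the following is a minimal Python sketch of the cosine similarity between two places represented as sparse tag vectors. The function name, the dictionary representation and the toy weights are assumptions made for this example; the computation in this work was implemented in PHP.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {tag: weight} dicts."""
    # Only tags present in both vectors contribute to the inner product.
    dot = sum(weight * b[tag] for tag, weight in a.items() if tag in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description matches nothing
    return dot / (norm_a * norm_b)


# Toy tf-idf weights (hypothetical values, not taken from the datasets).
rome = {"colosseum": 2.0, "church": 1.0, "pizza": 0.5}
florence = {"church": 1.5, "duomo": 2.0, "pizza": 0.5}
oslo = {"fjord": 3.0, "snow": 1.0}

print(cosine_similarity(rome, rome))      # exact match
print(cosine_similarity(rome, florence))  # partial match
print(cosine_similarity(rome, oslo))      # no matching terms
```

Because all weights are non-negative, every value returned by this sketch falls between 0 and 1, as discussed above.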

3.7.2 Retrieving Similar Places

All the cosine similarities were computed in advance and stored in a database. In particular, I calculated four different measures of similarity between places. The four measures come from the different VSM representations obtained with the four tf-idf weighting schemes (see Chapter 3.6.3): one measure for the “Top 100” dataset and three for the “Random” dataset, derived from the three idf interpretations:

• System A: from the “Top 100” dataset, using a standard weight W-100_{i,j}

• System B: from the “Random” dataset, using a standard weight W-rnd1_{i,j}

• System C: from the “Random” dataset, using a modified weight W-rnd2_{i,j} (more importance to photos)

• System D: from the “Random” dataset, using a modified weight W-rnd3_{i,j} (more importance to users)

The retrieval was implemented as a simple database query. Since all the similarity scores between a city and all the others were calculated in advance (pre-computed), retrieving the places similar to a place P amounts to a query that selects all the places in G scored with respect to P, ordering the results from the most similar (cosine similarity close to 1) to the least similar (cosine similarity close to 0).

Chapter 4

Implementation

This chapter presents the tools, APIs and languages used throughout the software development process of the “Place Similarity” project, together with the results obtained with these tools and some code examples. It then gives an overall view of the infrastructure used to build the system and the web application.

4.1 Development Platform

The implemented tool made extensive use of online APIs, and I started its development, from the first experiments and tests, as an online application. The later steps required heavy computation phases, but I decided to continue the development with languages and platforms typically used for web development, both for the sake of compatibility between the components and because the tools allowed me to carry out the work without any extra components or special effort. The “Web2rism” project was implemented using a LAMP platform, an acronym standing for Linux, Apache, PHP and MySQL. I adopted this platform both on my local machine and as a remote system on the “Web2rism” server hosted at the “Università della Svizzera Italiana” in Lugano (Switzerland).


4.1.1 PHP

As the main programming language I used PHP (version 5.2.4), an open-source scripting language. PHP was originally designed for web development, to produce dynamic web pages, but it has become a general-purpose scripting language. Used in this way, PHP code is processed by an interpreter in command-line mode, which performs the desired operations and writes the program output to its standard output channel. The interpreter is integrated in the LAMP platform. In this way I could create scripts for different purposes (web interface, data gathering, data analysis and computation) using a single scripting language.

4.1.2 MySQL

PHP offers good integration with the Apache web server and several open-source Database Management Systems. MySQL is a popular (maybe the most popular) open-source relational DBMS, widely used in web development for its close integration with PHP and the Apache server in LAMP systems. It was used for several tasks in the “Place Similarity” development. All of the tool's data analyses and computations were performed offline, in the sense of pre-computation. MySQL offered a solid structure for storing all the data gathered online and, subsequently, the results of the analyses and scoring computations.

4.1.3 Python

PHP integrates well with hundreds of modules and libraries, but although it is considered a general-purpose scripting language it is still closely tied to the web development community. As a consequence, PHP integrates poorly with complex mathematical and scientific modules. This is certainly not the case for Python. Python is an interpreted, general-purpose, high-level programming language offering a wide variety of third-party extensions. Python is a good choice for the scientific community thanks to extensions like scipy¹ and numpy², as well as thousands of other scientific-oriented extensions.

¹ http://www.scipy.org/
² http://numpy.scipy.org/

Gensim

During the development phase I used Python for a couple of tests in which it seemed more useful than PHP. In particular, an extension called gensim [26]³ provides a framework for Vector Space Modeling, offering automation and functions for all the typical VSM-related tasks as well as semantic support (Latent Semantic Analysis, Latent Dirichlet Allocation, Random Projections). Unfortunately, compatibility with the “Web2rism” project was fundamental, so I decided to keep working on the LAMP platform for compatibility and integration, giving up this approach. I then started to develop my own version of the VSM models using only PHP.

Another task where Python seemed to be a valid alternative to PHP was the language detection of the retrieved tags (See Chapter 4.3.2.2).

Perl is a high-level, general-purpose, interpreted scripting language. It provides powerful text processing facilities, and it was used during the development exactly for this feature. For example, it was used for extracting tags from the Flickr web page⁴ that lists the most popular Flickr tags. This was done because Flickr does not expose this information through the public API.

³ http://nlp.fi.muni.cz/projekty/gensim/
⁴ http://www.flickr.com/photos/tags/

4.2 API & Online Services

Since this research is based on the analysis of user-generated content, it made extensive use of online APIs and services, mainly web services provided by Yahoo. All the online APIs were accessed from PHP through the Client URL Library⁵ (cURL).

4.2.1 Yahoo Geoplanet

The Yahoo GeoPlanet API⁶ (see Chapter 2.2.3) was mainly used to identify places on Earth without any form of ambiguity. The implemented tool lets the users find a specific place through a form, returning a list of places that match the search string to some degree (disambiguation, see Chapter 4.4). The user can choose the specific place they are looking for by reading some extra information provided with each result⁷ (country, region, etc.). If the chosen place is within the set of analyzed cities (233 cities), the tool presents a page with four lists of cities similar to the chosen one. Ambiguity was avoided by indexing each place in the database with its WOEID. The WOEID is a unique 32-bit integer identifier assigned to millions of geographical entities. Once each city had been identified, I could refer to it without caring about names, ambiguity, geographic coordinates or bounding boxes. The WOEID was a really important aspect of the project because it is well supported by the Flickr API (next section), which allows photos to be retrieved by place just by specifying the WOEID. The main resources used were:

• /places: returns a collection of places that match a specified place name; used for the disambiguation;

• /place/woeid: returns a resource containing the long representation of a place; used for taking information about each place.

⁵ http://php.net/manual/en/book.curl.php
⁶ http://developer.yahoo.com/geo/geoplanet/guide/
⁷ See Figure 4.5

4.2.2 Flickr

Flickr was the main data source on which all this research is based.

“Top 100” Dataset

For the first task of this work, the API flickr.places.tagsForPlace was an easy way to obtain results for the initial experiments. This API returns a list of the top 100 unique tags for a Where On Earth ID. Since the WOEID was already known for all the 233 cities thanks to the GeoPlanet service (previous section), with this API it was possible to build the “Top 100” dataset (see 3.3). Unfortunately, this method does not give any information about the single photos containing the tags, nor about their authors. This was a major limitation, and it was the reason for performing a second kind of data gathering from Flickr (“Random”).

“Random” Dataset

Due to the limitations of the previous API, I performed a random sampling of photo metadata using the following APIs:

• flickr.photos.search used to retrieve lists of geotagged photos given their geographical location (through WOEID) for a specified period of time (explained later);

• flickr.photos.getInfo used to get tags and owner from the above-mentioned retrieved photo lists.

The reasons for a random approach are explained in Chapter 3.3. I personally decided the criteria for the randomization; in particular, I chose to retrieve information about 10 photos for each of 300 different days. Each of the 300 days was chosen randomly (by a custom function), together with a random hour factor: for each of these days a random hour (between 00:00 and 23:00) was chosen in order to prevent bias in the tag collections, introduced for example by photos taken during the night (e.g. “night”, “nightphotography”, “long exposure”, “lights”).

I coded a PHP script with robustness as its first requirement. This was important for three reasons. First, three instances of the script were executed on different machines for more than a week, which made constant monitoring of their execution status impractical. Second, during the execution period the machines running the script often faced temporary network connection problems. Finally, the API service returned various kinds of errors several times. Given all these eventualities, I put some effort into creating a script able to keep running without particular problems even when these events occurred.
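The randomization criteria described above (300 random days, each with a random hour) can be sketched in Python as follows. This is only an illustration: the function name, the overall date range and the one-hour window width are assumptions, since the original custom PHP function is not reproduced here.

```python
import random
from datetime import datetime, timedelta


def random_time_windows(start, end, n_days=300, window_hours=1, seed=None):
    """Pick n_days distinct random days in [start, end) and, for each,
    a random starting hour; return (min_taken, max_taken) pairs."""
    rng = random.Random(seed)
    total_days = (end - start).days
    # sample() draws without replacement, so no day is chosen twice.
    day_offsets = rng.sample(range(total_days), n_days)
    windows = []
    for offset in sorted(day_offsets):
        hour = rng.randint(0, 23)  # random hour between 00:00 and 23:00
        lower = start + timedelta(days=offset, hours=hour)
        windows.append((lower, lower + timedelta(hours=window_hours)))
    return windows


# Hypothetical sampling period; the period actually used is not assumed here.
windows = random_time_windows(datetime(2008, 1, 1), datetime(2011, 1, 1), seed=42)
print(len(windows), windows[0])
```

Each pair could then serve as the minimum and maximum taken date of one photo search limited to 10 results.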

After this period I collected and aggregated the data gathered from the three machines, obtaining a total of over 3 million tags.

4.2.3 Yahoo Query Language

The Flickr gathering method described above was not efficient, because it required first retrieving the whole photo list with flickr.photos.search according to the specified parameters and then retrieving the actual information about the photos (tags) with the second API, flickr.photos.getInfo. I decided instead to perform the task more efficiently using the Yahoo Query Language, or YQL⁸. YQL is an expressive SQL-like language that lets developers query, filter and join data across web services. Using YQL, the previous task becomes:

SELECT tags, id, owner.nsid
from flickr.photos.info
where photo_id in (
    SELECT id
    from flickr.photos.search($NUM_PHOTO_TO_TAKE)
    where has_geo='true'
      and woe_id='$woeid'
      and min_taken_date='$random1'
      and max_taken_date='$random2'
      and safe_search='true'
);

YQL example using flickr.photos.search and flickr.photos.info API in the same statement

It is worth noting that YQL may slightly change the name of the underlying API even though the two are considered the same (e.g. the Flickr method flickr.photos.getInfo becomes flickr.photos.info in YQL). In this way I avoided making two separate API calls, and the resulting code is also more human-readable (very similar to SQL). YQL provides default mappings to several online services (e.g. Flickr, Geoplanet, YMaps and most of the Yahoo services) through YQL tables. It also gives developers the possibility to create their own data tables (i.e. a mapping between a generic online service and YQL). Since the API flickr.places.tagsForPlace, used for creating the “Top 100” dataset, was not available by default in the YQL service, I decided to create my own Open Data Table⁹ ¹⁰, as shown below. It is an XML file that has to be accessible online in order to be used.

⁸ http://developer.yahoo.com/yql/
⁹ http://developer.yahoo.com/yql/guide/yql-opentables-chapter.html
¹⁰ http://www.datatables.org/

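[Open Data Table XML listing; the markup was lost in extraction. Surviving fields: author: Leonardo Gentile; documentation URL: http://www.flickr.com/services/api/flickr.places.tagsForPlace.html; sample query: select * from table where woe_id='44418']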

Open Data Table for flickr.places.tagsForPlace API

4.3 System Architecture

The system architecture was not particularly complex because the implemented tool was a prototype. In particular, individual PHP scripts were executed manually through the command line interface at each step of the data gathering and analysis, storing the result of each operation in the database. Figure 4.1 shows the data flow of the operations.

Figure 4.1: Data Flow Representation

4.3.1 Data Storage

The DBMS used was MySQL, in line with the LAMP platform, as already illustrated in Chapter 4.1.2. As in many real-life search engines, all the data were computed offline: they were pre-computed, and the user is presented only with the results of previous computations (ranked lists of place similarity).

Figure 4.2: Database Representation for the “Random” Dataset

Figure 4.2 represents the database schema for the “Random” dataset (the “Top 100” dataset representation is similar but has fewer fields, since it carries less information). The rectangular shapes are the tables, the dark rounded shapes represent the key fields of each table, and the white rounded shapes are all the other fields. The blue lines represent the relations between tables (foreign keys). The table Place contains the list of the 233 cities under analysis, while the table Tags_cleaned contains all the gathered tags (and other photo metadata) after the removal of the stop-words (see next section). The table Dictionary includes all the unique tags obtained from Tags_cleaned. Finally, Weights is the table where all the computed weights are stored.

4.3.2 Data Analysis

4.3.2.1 Dictionary Cleaning

The data gathered from Flickr were stored in a table (not shown in Figure 4.2) called Tags_raw. This table contained more than 3 million tags for the “Random” dataset, a huge amount of data on which to perform the various computations. Because of the dimensionality problem explained in Chapter 3.6.5, I decided to prune not only the tags in the Blacklist (see Appendix B) but also others, following personal decisions and heuristics. These tags were pruned in an additional step (dictionary cleaning) using regular expressions within SQL statements (MySQL). I decided to use regular expressions because I did not want to be limited to pruning only the stop-words in the blacklist by exact match. For example, instead of removing only the tags matching exactly “Canon”, I wanted to remove all the tags containing the term, such as “CanonEos350D” or “Canon1Dmarkii”. A small set of the regular expressions used in this phase is shown below.

delete from tesi.tags_raw WHERE tag_name REGEXP
'(canon)+|(nikon)+|(nokia)+|(fuji)+|(iso)+|(coolpix)+|(pentax)+|(zoom)+|(finepix)+|(lumix)+|(lens)+|(olympus)+|(panasonic)+|(sigma)+|(sony)+|(tamron)+|(tokina)+|(ilford)+|([0-9]mm)+|([0-9]d)$|(d[0-9]+)$|(nikkor)+|(konica)+|(minolta)+|(www)+|(flickr)+|(http)+|(:)+|(geotag)+|(©)+|(eos)+|(2010)+|(hdr)+|(biancoenero)+|(bianconero)+|(film)+|(iphone)+|^photo$|^photos$|^(photography)$|^(bn)$|(foto)+|(città)+|^city$|(architecture)+|(architettura)+|(allrightsreserved)+';

Dictionary Cleaning with SQL Regular Expressions
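The difference between exact matching and the substring matching obtained with regular expressions can be illustrated with a small Python sketch. Only a handful of the alternatives from the statement above are included, the sample tags are invented for the example, and case-insensitive matching is an assumption about how the MySQL column was compared.

```python
import re

# A small subset of the alternatives used in the SQL statement.
# Unanchored terms match anywhere in the tag; ^...$ terms match whole tags only.
PRUNE_PATTERN = re.compile(r"canon|nikon|[0-9]mm|^photos?$|^city$", re.IGNORECASE)


def prune(tags):
    """Return the tags that survive the dictionary cleaning."""
    return [tag for tag in tags if not PRUNE_PATTERN.search(tag)]


sample_tags = [
    "CanonEos350D", "Canon1Dmarkii", "nikond90", "colosseum", "50mm",
    "photo", "photos", "photogenic", "city", "cityscape",
]

print(prune(sample_tags))  # ['colosseum', 'photogenic', 'cityscape']
```

An exact-match blacklist containing only "canon" would have kept "CanonEos350D", while the anchored alternatives remove "city" without touching "cityscape".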

After the dictionary cleaning, the size of the tags_raw table shrank from over 3 million elements to 2.3 million. The dictionary of unique tags created from this cleaned version of the tag set (the Dictionary table in Figure 4.2) was composed of 55,000 terms, which in the vector space means as many dimensions.

4.3.2.2 Language Detection with PHP & Python

As mentioned in Chapter 3.6.5, it was not possible to perform stemming in a multi-language context. Porter's stemming algorithm [23] is perhaps the most widely used for the English language, but several other stemming algorithms exist for other languages. I therefore decided to try to guess the language of each tag, in order to later apply the appropriate stemming algorithm to each of them.

PHP Approach

Language detection was first attempted using PHP and an n-gram approach. An n-gram is simply a sequence of n letters extracted from a document; for example, the word “constable” breaks down into the following trigrams (3-letter sequences): “con”, “ons”, “nst”, “sta”, “tab”, “abl”, “ble”. There are many ways of extracting n-grams, but I adopted an algorithm that tokenizes any string passed to it into 3-grams¹¹. I tested the algorithm in conjunction with a vector-space-style cosine similarity, using the dictionaries pre-installed on Mac OS X as language samples. The algorithm stores the trigram frequencies of each language; to detect the language of a document, it decomposes the document into trigrams in the same way and, for each trigram present, compares its frequency with the frequencies stored for the candidate languages.

Because in the algorithm the vectors are divided by their lengths this is a normalized dot product between the two sets of weights, which gives a score between 0 and 1 (cosine similarity, see Chapter 2.3.2).
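A minimal Python sketch of this trigram approach is given below. The two tiny language samples and all function names are assumptions made for illustration; the tests described here used PHP and the much larger system dictionaries.

```python
import math
from collections import Counter


def trigrams(text):
    """Split a string into overlapping 3-letter sequences."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def profile(sample):
    """Trigram frequencies of a language sample, computed word by word."""
    return Counter(t for word in sample.lower().split() for t in trigrams(word))


def cosine(a, b):
    """Normalized dot product between two trigram frequency vectors."""
    dot = sum(count * b[t] for t, count in a.items() if t in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def detect(text, profiles):
    """Rank the candidate languages by similarity to the text, best first."""
    doc = Counter(trigrams(text.lower()))
    scores = [(language, cosine(doc, prof)) for language, prof in profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy samples standing in for real dictionaries (hypothetical).
PROFILES = {
    "english": profile("the church and the bridge over the river"),
    "italian": profile("la chiesa e il ponte sul fiume della città"),
}

print(trigrams("constable"))
print(detect("riverside", PROFILES)[0][0])
```

With single short tags, few trigrams are available for comparison, which is consistent with the difficulties reported for this approach.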

Approaches of this kind detect the language of a document by collecting term usage statistics and finding matches (in practice, similarities in the VSM sense), producing a ranked list of languages ordered by score. Since the documents under analysis were composed of tags in a multitude of different languages (UTF-8 encoded), such algorithms fail to detect the language in this context.

¹¹ http://phpir.com/language-detection-with-n-grams

Python Approach

After the failed first attempt with the trigram approach, I decided to use Python and a spell-checking approach. PyEnchant¹² is a spellchecking library for Python, based on the Enchant¹³ library. With this method it is possible to check whether a particular word (tag) belongs to a dictionary. Since tags are unconstrained and free-form, they could be in any language, but they can also be proper nouns or anything else not included in any dictionary. Since it was impossible to obtain all the necessary dictionaries, I decided to detect only the English terms, in order to later apply the stemming algorithm to them.

It turns out that this approach fails because a word in a particular language (e.g. Italian) can also be included in the English dictionary while meaning different things in the two languages; e.g. “male” means evil in Italian and a person or animal of the male gender in English.
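The ambiguity can be reproduced with a small Python sketch in which two toy word sets stand in for the real dictionaries. The word lists and the function name are assumptions; with PyEnchant, the membership test would be a call such as enchant.Dict("en_US").check(word) instead of a set lookup.

```python
# Toy word lists standing in for full spell-checking dictionaries (hypothetical).
DICTIONARIES = {
    "english": {"male", "night", "bridge", "sole"},
    "italian": {"male", "notte", "ponte", "sole"},
}


def languages_of(tag, dictionaries=DICTIONARIES):
    """Return every language whose dictionary contains the tag."""
    return sorted(name for name, words in dictionaries.items() if tag.lower() in words)


print(languages_of("bridge"))      # found in one dictionary only
print(languages_of("male"))        # found in both: the language cannot be decided
print(languages_of("trastevere"))  # proper noun: found in no dictionary
```

A tag found in more than one dictionary, or in none, gives no basis for choosing a stemmer.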

As a result of these tests, language detection was not performed on the corpora, due to the problems described above. This means that it was not possible to apply any stemming or other language-dependent text pre-processing algorithm.

¹² http://www.rfk.id.au/software/pyenchant/index.html
¹³ http://www.abisource.com/projects/enchant/

4.3.2.3 Similarity Computation

Once the metadata had been obtained and cleaned, and their weights computed for each of the weighting systems (see Chapter 3.6.3), the last step was to calculate the similarity score of each place with respect to all the others. This final step was also pre-computed, and the results were stored in the database.

Figure 4.3: Table Representation for the generic “score” table

Figure 4.3 illustrates one of the four scoring tables that were created for the four scoring Systems (see Chapter 3.7.1). Since we are comparing each place with all the others in the set, the similarity scores could be represented by a matrix whose rows and columns are identified by the places in the set. Each element of the matrix is the similarity score between a place Pi, identified by row i, and a place Pj, identified by column j. The table was instead built as a list of correspondences between pairs of places. This choice was made for computational efficiency: since the similarity between Pi and Pj is the same as the similarity between Pj and Pi, the matrix is symmetric, and I wrote the PHP script so that it computes only its upper triangular part. On a MacBook Pro with an Intel Core 2 Duo processor (2.53 GHz) and 4 gigabytes of RAM, computing the scores for the “Top 100” dataset required only a few minutes, while each of the weighting systems derived from the “Random” dataset required approximately 3 hours on average.
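The idea of storing only one score per unordered pair can be sketched as follows. This is an illustration in Python with SQLite standing in for PHP and MySQL; the table layout, the column names and the toy vectors are assumptions, not the actual schema of Figure 4.3. For n places only n(n-1)/2 scores are needed, i.e. 27,028 pairs for the 233 cities.

```python
import math
import sqlite3
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse {tag: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


# Toy place vectors (hypothetical weights).
vectors = {
    "rome": {"church": 1.0, "ruins": 2.0},
    "florence": {"church": 2.0, "museum": 1.0},
    "oslo": {"fjord": 2.0, "museum": 1.0},
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE score (place_a TEXT, place_b TEXT, score REAL, "
    "PRIMARY KEY (place_a, place_b))"
)
# combinations() yields each unordered pair once: the upper triangle only.
for a, b in combinations(sorted(vectors), 2):
    conn.execute("INSERT INTO score VALUES (?, ?, ?)", (a, b, cosine(vectors[a], vectors[b])))


def most_similar(conn, place, limit=5):
    """Ranked places similar to `place`, looking at both columns of each pair."""
    return conn.execute(
        "SELECT CASE WHEN place_a = ? THEN place_b ELSE place_a END AS other, score "
        "FROM score WHERE place_a = ? OR place_b = ? "
        "ORDER BY score DESC LIMIT ?",
        (place, place, place, limit),
    ).fetchall()


print(most_similar(conn, "rome"))
```

Because a place may appear in either column of a stored pair, the retrieval query in this sketch has to check both columns; this is the price of storing half of the matrix.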

4.4 User Interface Design & Data Presentation

The tool was designed as a web application from the beginning. Even though it was a prototype, I put considerable effort into making the system as usable as possible. The interface design was based on an existing open source project, Geoplanet Explorer¹⁴, created by the developer evangelist Christian Heilmann¹⁵. As a first step the user (or tester) enters the name of a geographic location, as shown in Figure 4.4.

Figure 4.4: Search Field

The system then contacts GeoPlanet through its API in order to disambiguate the entered place, as shown below.

Figure 4.5: Disambiguation

¹⁴ http://isithackday.com/geoplanet-explorer/
¹⁵ http://www.wait-till-i.com/

As shown in Figure 4.5, basic information about each place is presented in order to let the user identify the requested place. Once the user clicks on the link of the intended place, a page with a map of the place and the most recent photos from Flickr is presented. If the city is in the similarity database, each of the four similarity systems shows the five cities most similar to that place (e.g. “Rome”) according to its scoring scheme (Figure 4.6).

Figure 4.6: Similarities for the city of “Rome”

In this phase the similar cities are retrieved simply by querying the “Score” table (Figure 4.3), ordering the results by score and limiting them to the first 5 matches. The similar cities are presented using a tool that communicates to the user the “degree” of similarity.

Chapter 5

Tests and Evaluations

This chapter discusses the concept of similarity as perceived by the users. Understanding this perception is necessary to interpret the results of the survey that was delivered to the users in order to evaluate the tool. The survey results and the tool evaluation are then presented.

5.1 About Similarity

Once again, the purpose of this research and of the related tool is to find similarities between places. Before getting into the evaluation, it is a good idea to clarify the meaning of “place similarity” within the scope of this work. Assumption 5 (Chapter 3.2) states that “a place is defined as similar to another (not necessarily spatially close locations on the planet) if their representative descriptions are similar”. Place similarity is therefore interpreted as similarity between place descriptions. Each place description is derived by aggregating Flickr metadata, which represent a narrow folksonomy. Once these tags have been aggregated with respect to a place, we are creating a broad folksonomy (see Chapter 3.5), and we have to interpret it for what it actually represents.


Figure 5.1: Tag distribution for the city of New York using the “Random” dataset truncated to the 150th tag

Figure 5.1 represents the tag distribution for the city of New York, which can be interpreted as the way users describe the city of New York using their own vocabulary, language and cultural background. As mentioned in Chapter 2.1.1, in a broad folksonomy the tag distribution, once it stabilizes over time, follows the well-known power law [7]. This means that from this distribution we can gather trends on how a wide range of people are calling or “describing” the city (high term frequency). We should not forget the right end of the curve, that is, the long tail. This is where a small minority of people describe the city by a term, allowing others with a similar vocabulary mindset (or maybe the same language) to agree on the city description, even if they do not use the terms used by the masses at the left end of the curve.

Under these circumstances we can analyze some critical factors before interpreting the survey results.

5.1.1 Critical Factors

A) The original distributions in the unmodified corpora (like the one shown in Figure 5.1) have been substantially modified by extracting a description based not only on the term frequency but also on the inverse document frequency and other factors (tf-idf, Chapter 3.6.3). Since this is only one of several possible descriptions of the place, the user may not agree with the criteria chosen for creating this representative description.

Figure 5.2: Tag weights distribution for the city of New York using the “Random” dataset, truncated to the 150th tag, using the W-rnd1 and W-rnd3 weights

In Figure 5.2 we can observe the distribution of the weights for the same 150 tags of Figure 5.1, taken in the same order. This represents a part (truncated to 150 tags) of the New York description using the “Random” dataset and two different weighting schemes (W-rnd1 in red and W-rnd3 in blue, see Chapter 3.6.3). We can observe that some tags no longer exist (due to the blacklist, Chapter 3.6.5) and that the “importance” of the tags has changed considerably. These are only two of the possible place descriptions, which the user may or may not find suitable.

B) Some of the factors that contribute to the description of a city depend strongly on the set of places under analysis (e.g. changing the set of places G will change the idf), and the users are not aware of this. For example, a user may consider Berlin the city most similar to Oslo because of their alternative music scenes, but Berlin may not be in the analyzed set G. In this case the user may be disappointed to find Dusseldorf in the position where they expected Berlin.

C) Assumption 5 defines as similar those places that share a similar description but are not necessarily geographically close. The assumption does not exclude the usual geographic or physical concept of similarity; it only states that geographic closeness is not necessary. Similar places can therefore also be geographically close. This condition can be explained with an example. If we consider the city of “Rome”, scoring system D gives as its first three (ranked) results “Siena”, “Florence” and “Venice”. These four cities are actually close (all in Italy), but their descriptions are also similar (e.g. a considerable number of matching terms), maybe because the citizens of those places share a common cultural background and hence a common way of describing places. So this is a valid result, not a false positive.

5.2 Test

Because of all the “human” factors explained in the previous section, we cannot apply objective evaluation measures such as precision or recall: there is no unique way to establish which city is actually the most similar to another one and to evaluate the results of the research on that basis. The results depend only on the weighting schemes adopted (see Chapter 3.7.2) and on the initially retrieved data, that is, the Flickr folksonomy, intended as the “original” description of a place created by the user tags.

We asked the users to evaluate the four scoring systems. Since these differ only in the weighting scheme used (System A also differs in the underlying dataset), they differ in how they create the description of a place. As a consequence, we are actually asking the users which of the four place descriptions is the best according to their mindset.

These descriptions are obtained from a folksonomy, and the four methods used to create them may enhance or reduce the “importance” of the different tags in contributing to each place description (Figure 5.2). Since the tool extracts knowledge and patterns from user-generated content, in my opinion a good way to estimate how good a similarity system is, compared to the others, is to analyze the users' interpretation.

5.2.1 Survey

In order to evaluate this research I created an online survey asking the users which of the four similarity systems is the best in their opinion. Before the survey starts, clear instructions explain that the similarities between places in this tool are based on place descriptions, not on geographic or physical similarities.

Figure 5.3: Survey Introduction and Instruction

For the survey a small group of cities was selected, with a high percentage of European tourism destinations. This was done because most of the users in the focus group may have no cultural background to judge cities elsewhere. Nonetheless, a small group of cities from Asia, South America and the USA was included in the test. The focus group was composed of 113 people, in particular master students from “Università della Svizzera Italiana” and “Politecnico di Milano”, and a share from a group about European tourism. Each user was asked to give a relative judgment on which similarity system produces the best similarity scores with respect to the others. Each user judged one city at a time, for a total of five cities. For each city, the user could observe the five most similar cities according to the four scoring systems. The web page presented four lists, each composed of five cities in ranked order. Figure 5.4 shows the survey for Marseille.

Figure 5.4: Screenshot of the survey for the city of Marseille

At the top of the page some basic information about the city was provided (country, administrative regions, etc.), as well as a map, in order to give the users a basic background for identifying the city under test. The users could understand the order of similarity between the city under test and each of the five cities (in each list) by observing the size of the font: the bigger the font, the more similar the city was to the one under test (e.g. Marseille) according to each of the four scoring systems.

Since place similarity is based on similarity of descriptions, the users were given an extra clue: they could click on each city in the lists and check the tags it has in common with Marseille. This is clearly a simplification, because the similarity tool is not based on simple term matching (as in the binary information retrieval model). Nonetheless, it was considered a good way to let the users judge whether the tags shared by two cities were trivial or not. The important point of this aspect of the survey is that I did not suggest in any way what to consider trivial and what to consider an appropriate match: it is up to the users, with their own mindset, to judge. As a recap:

• we are extracting knowledge from a folksonomy;

• a folksonomy is “created” by users;

• we are asking the users which one of the knowledge extraction (and comparison) methods is the best.

This evaluation appeared to be the one most coherent with the choices I made throughout this research.

5.3 Evaluation

The survey was filled in by 113 users and monitored for about a week, producing 516 answers. A trend was visible from the beginning, as shown in Figure 5.5.

Figure 5.5: Users’ survey answers for the first day of evaluation.

The letters from A to D represent the respective scoring systems (e.g. System A, System D; see Chapter 3.6.4). The letter N denotes that the users did not know how to answer.

After a week the proportions between the answers remained more or less the same, as shown in Figure 5.6.

Figure 5.6: Users’ survey answers for the last day of evaluation

The final result provides several useful pieces of information. The most noticeable is that the users did not know what (or how) to answer a high number of times (84), which can be interpreted in three ways. One interpretation is that the users did not find, according to their mindset, any of the city representations appropriate (see Chapter 5.1.1). Another is that the proposed survey may have been perceived as too complex to understand by part of the users. I tried to avoid this problem by selecting a focus group with a high education level (master students), but it seems that more effort could have been made to present the evaluation test in an easier way. Finally, there is a good chance that the users simply did not know the place under consideration. On the other hand, we can discern a trend in which System D remains the favorite, followed by System A. System B and System C are considered less valuable, and they swapped positions with each other several times during the observation week.

The best system according to the users is System D (139 preferences). System D is based on the weighting scheme in which great importance is given to the user factor (Chapter 3.6.4). In this similarity system the weighting scheme is based on the extra assumption that was introduced for the “Random” dataset: “The importance of a tag for a place increases with the number of individual photographers that use it in that place” (see Chapter 3.3). This assumption was adapted from one borrowed from the related work “World Explorer” [1]. This result suggests not only that the assumption is valid for the scope and objective of this research, but also that it is the most effective among those evaluated.
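The assumption quoted above can be illustrated with a minimal sketch. The exact weighting of System D is defined in Chapter 3.6.4 and is not reproduced here; the function below is only one possible reading, in which the term-frequency part is replaced by the number of distinct photographers and multiplied by a standard idf. The function name, the input layout and the sample data are hypothetical.

```python
import math
from collections import defaultdict

def user_factor_weights(photos_by_city):
    """Weight each (city, tag) by distinct photographers times a standard idf.

    photos_by_city: {city: [(user_id, [tag, ...]), ...]}
    NOTE: illustrative assumption, not the formula of Chapter 3.6.4.
    """
    # Collect, for every city and tag, the set of users who used that tag there.
    users = {city: defaultdict(set) for city in photos_by_city}
    for city, photos in photos_by_city.items():
        for user_id, tags in photos:
            for tag in tags:
                users[city][tag].add(user_id)

    # Document frequency: in how many cities does each tag appear?
    n_cities = len(photos_by_city)
    df = defaultdict(int)
    for city in users:
        for tag in users[city]:
            df[tag] += 1

    return {
        city: {tag: len(who) * math.log(n_cities / df[tag])
               for tag, who in tag_users.items()}
        for city, tag_users in users.items()
    }

# Hypothetical data: user u1 uploads two "giralda" photos but counts once.
data = {
    "Seville": [("u1", ["giralda", "spain"]), ("u2", ["giralda"]),
                ("u1", ["giralda"]), ("u3", ["spain"])],
    "Cordoba": [("u4", ["mezquita", "spain"]), ("u4", ["mezquita"])],
}
weights = user_factor_weights(data)
```

In this toy input a tag used by two different photographers outweighs a tag used twice by the same photographer, while a tag present in every city receives weight zero through the idf.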

System A is the second most popular answer (118 preferences), which shows that we can extract valuable knowledge from the “Top 100” dataset. In other words, using only the 100 most popular tags of a city we can build a place similarity system. Of course such a system would not be as comprehensive as System D, and this can be explained with an example. If we analyze the city of “Seville” (Spain) using System A and System D, in both cases “Cordoba” (Spain) is in the set of the five most similar cities. So in this case different systems give similar results, but if we check the tags shared by these two cities according to the two systems, we can see a huge difference. The tags in common between “Seville” and “Cordoba” according to System A are no more than fifty, as shown in Figure 5.7.

Figure 5.7: Shared tags between “Seville” and “Cordoba” according to the System A

On the other hand, according to System D the same two cities share thousands of tags, as illustrated in Figure 5.8.

Figure 5.8: Shared tags between “Seville” and “Cordoba” according to the System D (truncated)

As we know, the VSM does not rely on a simple term match (as in this example) but takes other factors into consideration. However, according to the cosine similarity (Chapter 3.7.1), two cities with a similarity score different from zero share at least one tag, so term matching is used here for illustrative purposes. Based on the last example, it is clear that the similarity according to System A may not be effective, because its analysis is based on a small subset of tags. Of course this subset contains tags that are “important” for most of the users (the top 100 most common tags), but since the VSM substantially changes their weights, these tags may no longer be “important” for the representative description of a city once the Vector Space Model is adopted (see Chapter 5.1 and Figure 5.2).
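The link between a non-zero cosine score and shared tags can be checked with a small sketch. It assumes each city is a sparse vector stored as a dictionary from tag to a positive weight; the tags, the weights and the placeholder city are invented for illustration and are not taken from the datasets.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse tag-weight vectors (dicts)."""
    # Only tags present in both vectors contribute to the dot product.
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical weights, for illustration only.
seville = {"giralda": 2.0, "andalusia": 1.5, "flamenco": 1.0}
cordoba = {"mezquita": 2.0, "andalusia": 1.2}
other_city = {"fjord": 1.8, "opera": 0.9}  # placeholder with no tag in common

sim_shared = cosine_similarity(seville, cordoba)
sim_disjoint = cosine_similarity(seville, other_city)
```

Because the dot product runs only over the shared tags, two descriptions without any common tag always score zero, whereas a single shared tag with positive weights is enough to produce a score above zero.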

Under these considerations I claim that System A is not an effective way to find place similarities: it is based on a small corpus C100 of tags, and in certain conditions it may fail to find similar cities because of the small size of the corpus. This can be clarified with another example, this time for the city of “Rome” (Italy). According to System A its most similar city is “Tarragona” (Spain), as shown in Figure 5.9. Analyzing the tags in common between these two cities, we can observe a very small set of (trivial) tags (Figure 5.10).

Figure 5.9: Five cities similar to “Rome” according to System A and System D (in ranked order)

Under the same conditions, System D scored “Siena”, “Florence”, “Venice”, “Verona” and “Naples” (Figure 5.9) as the five most similar cities to “Rome” (in ranked order). According to System D, Rome shares thousands of tags with each of these cities (i.e. the analysis is not trivial) and, in my personal opinion (being an Italian citizen, I have the cultural background to judge this particular example), this ranked list is the more correct one compared with the one created by System A.

Figure 5.10: Shared tags between “Rome” and “Tarragona” according to System A

System B and System C did not have great success among the survey users, but they were not considered completely wrong either (81 and 94 preferences, respectively). System B makes use of a standard inverse document frequency for the “Random” dataset, while System C exploits the extra information coming from the photos factor for the “Random” dataset, calculating a non-standard idf (see Chapter 3.6.3). In both cases these weighting systems were not judged as good as System D and System A.
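The preference counts reported in this section can be summarized with a short calculation. The counts are those given in the text; the percentages are derived from them and rounded to one decimal place.

```python
# Preference counts as reported in the text; "N" = user did not know how to answer.
answers = {"D": 139, "A": 118, "C": 94, "B": 81, "N": 84}

total = sum(answers.values())
shares = {k: round(100 * v / total, 1) for k, v in answers.items()}

# Rank the four scoring systems by number of preferences (N is not a system).
ranking = sorted((k for k in answers if k != "N"), key=answers.get, reverse=True)

print(total)
for system in ranking + ["N"]:
    print(system, answers[system], f"{shares[system]}%")
```

The counts add up to the 516 collected answers, and System D gathers roughly 27% of them, against about 23% for System A and about 16% for the "did not know" option.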

Figure 5.11: Five cities similar to “Rome” according to System B and System C (in ranked order)

Chapter 6

Conclusion

6.1 Current Status of the Work

The research started with the aim of finding similar geographical locations using the geo-referenced metadata extracted from Flickr photos. Geotagged photos and their metadata were used for a limited set of 233 worldwide tourism destinations (cities); this narrowed the scope of the research to finding similar destinations, with the objective of realizing a tourism suggestion system. Before starting the experimentation I formalized the hypothesis, introducing a series of assumptions on which I based the whole research. Following the assumptions, I described each geographical location by aggregating the tags from Flickr photos taken in that location. Once these textual descriptions were obtained, I represented each location using the Vector Space Model, introducing four TF-IDF weighting modifications in order to find the best way to describe each geographical place. The Vector Space Model was also used to calculate the similarities between the descriptions. Finally, a survey was presented to a focus group of users, who judged which of the four TF-IDF modification schemes produces the best place similarity. From a total of 516 answers (survey proposed to 113 users) a trend emerged, with System D evaluated as the most suitable similarity system (139/516). From this result we can deduce that the choice of considering a user factor in the TF-IDF scheme was successful.


6.2 Application Fields

This research and the related tool represent an innovative method to find place similarities, especially in the field of tourism destination suggestion. Several studies have addressed knowledge extraction from geo-referenced user-generated content and metadata. The innovative aspect of this research is that it goes a step further: after obtaining how the users name or describe a place, it attempts to find a similar one. This may be a very useful way to suggest similar tourism destinations, and it differs from current suggestion systems. In particular, it is not based on explicit user suggestions. When someone tags a photo taken in a given place, they contribute to describing that place, creating an implicit link with other geographical locations described in a similar way. This is a “genuine” way of obtaining a place description (and then a suggestion), avoiding the “human biases” that depend on people’s feelings and are always present when writing a review (e.g. tourists who enjoyed beautiful weather in a particular week, or who experienced very uncomfortable accommodation). Although the raw descriptions extracted with this system may not be useful or readable (thousands of keywords), they turn out to be very helpful in suggesting similar places.

6.3 Future Work

The system was positively evaluated by the users, showing its potential even in this early development phase. Nonetheless, it suffers from some issues typical of the most common information retrieval models. In detail, the system is based on the Vector Space Model, hence it inherits its typical issues, which may be addressed as explained in the following paragraphs.

Dimensionality

The main concern was a dimensionality problem due to the high number of tags to process for each city. I mitigated this problem by introducing a blacklist and some heuristics in order to suppress those terms that give no extra information and only add noise and computational inefficiency. Nevertheless, after the stop-word suppression the corpus CRnd (from the “Random” dataset) contained 55,000 unique terms, meaning 55,000 dimensions in the VSM. As a result, the system required several hours of computation to obtain the similarities between each place and all the others.
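To get a feel for how the all-pairs comparison grows with the number of places, the number of unordered pairs can be computed directly. The figure of 233 destinations comes from this work; the collection of 1000 cities is a hypothetical larger scenario, and the estimate assumes that the cost of a single comparison stays the same, which is optimistic because the vocabulary is likely to grow as well.

```python
def pair_count(n_places):
    """Number of unordered place pairs that need a similarity score."""
    return n_places * (n_places - 1) // 2

current = pair_count(233)    # destinations used in this work
larger = pair_count(1000)    # hypothetical larger collection
growth = larger / current    # factor by which the number of comparisons grows

print(current, larger, round(growth, 1))
```

Under these assumptions the number of comparisons grows from 27,028 to 499,500, i.e. by a factor of about 18, since the pair count is quadratic in the number of places.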

If the number of analyzed places increases (say to 1000 cities), the required computation time could increase drastically. The solution to this problem is to reduce the dimensions of the VSM. One way is to improve the detection of blacklist terms. For example, instead of manually adding exact terms to the blacklist, we may filter tags that are not in the stop-list but have a similar meaning, using the distributional hypothesis [8][6], which states that words found in similar contexts tend to be semantically similar (e.g. New York, NY, NYC, New York City). Another way to reduce the dimension of the vocabulary is to adopt another information retrieval model that is less affected by this issue, for example the Topic-based Vector Space Model or Latent Semantic Analysis.
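The idea of detecting tag variants through the distributional hypothesis can be sketched as follows. This is not part of the implemented tool: it assumes that the "context" of a tag is the set of tags it co-occurs with on the same photo, and that two tags are candidates for merging when their context vectors have a high cosine similarity. The photo tag sets and the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def context_vectors(photos):
    """Map each tag to a Counter of the tags it co-occurs with on a photo."""
    contexts = defaultdict(Counter)
    for tags in photos:
        unique = set(tags)
        for tag in unique:
            for other in unique - {tag}:
                contexts[tag][other] += 1
    return contexts

def cosine(a, b):
    """Cosine similarity between two context Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical tag sets of five photos.
photos = [
    ["nyc", "manhattan", "skyline"],
    ["newyork", "manhattan", "skyline"],
    ["nyc", "brooklyn", "bridge"],
    ["newyork", "brooklyn", "bridge"],
    ["rome", "colosseum", "ruins"],
]

ctx = context_vectors(photos)
sim_variants = cosine(ctx["nyc"], ctx["newyork"])   # same contexts
sim_unrelated = cosine(ctx["nyc"], ctx["rome"])     # no context in common
```

In this toy input "nyc" and "newyork" never appear together but share all their contexts, so they score as near-synonyms and could be collapsed into one dimension, while "rome" stays separate. A real threshold would have to be tuned on the actual corpus.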

Multi-Language Context

A crucial point that could not be solved in this research was the language detection of the various tags in the corpora, hence it was not possible to apply any form of text pre-processing. Of course, the fact that the system works (with no particular problem) in a multi-language context is an advantage, so I would not suppress the non-English words in any way. The language detection of each tag in a multi-language context is a real challenge (see Chapter 4.3.2.2); in my opinion it may have a solution (e.g. analyzing co-occurring tags in a particular language), but it would require a research effort of its own.

Bibliography

[1] Shane Ahern, Mor Naaman, Rahul Nair, and Jeannie Hui-I Yang. World explorer. Proceedings of the 2007 conference on Digital libraries - JCDL ’07, page 1, 2007.

[2] Ciro Cattuto, Dominik Benz, and Andreas Hotho. Semantic analysis of tag similarity measures in collaborative tagging systems. arXiv, pages 3–8.

[3] Maarten Clements, Pavel Serdyukov, and A de Vries. Finding Wormholes with Flickr Geotags. Advances in Information, pages 658–661, 2010.

[4] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. Mapping the world’s photos. Proceedings of the 18th international conference on - WWW ’09, page 761, 2009.

[5] Alan Dix, Stefano Levialdi, and Alessio Malizia. Semantic halo for collaboration tagging systems. In the and Community-Based Adaptation Technologies Workshop-June 20th, 2006.

[6] J. R. Firth. A synopsis of linguistic theory 1930-55. 1952-59:1–32, 1957.

[7] Harry Halpin, Valentin Robu, and Hana Shepherd. The complex dynamics of collaborative tagging. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 211–220, New York, NY, USA, 2007. ACM.

[8] Z. S. Harris. Mathematical Structures of Language. Wiley, New York, NY, USA, 1968.

[9] Charles (U.S. Army Construction Engineering Research Laboratory) Herring. An Architecture for Cyberspace: Spatialization of the Internet.


[10] Livia Hollenstein and Ross Purves. Exploring place through user-generated content: Using Flickr to describe city cores. Journal of Spatial Information Science, 1(1):21–48, July 2010.

[11] Buhalis D. Inversini A., Cantoni L. Destinations’ information competition and web reputation. itt – Journal of information technology tourism, pages 221–234, 2009.

[12] Dedekind C. Cantoni L. Inversini A., Marchiori E. Applying a conceptual framework to analyze online reputation of tourism destinations. Proceedings of the International Conference in Lugano, Switzerland, February 10-12, 2010, pages 321–332, 2010.

[13] Eija Kaasinen. User needs for location-aware mobile services. Personal and Ubiquitous Computing, 7(1):70–79, May 2003.

[14] Evangelos Kalogerakis, Olga Vesselova, James Hays, Alexei A. Efros, and Aaron Hertzmann. Image sequence geolocation with human travel priors, September 2009.

[15] Lyndon Kennedy, Mor Naaman, Shane Ahern, Rahul Nair, and Tye Rattenbury. How flickr helps us make sense of the world. Proceedings of the 15th international conference on Multimedia - MULTIMEDIA ’07, page 631, 2007.

[16] R. Lambiotte and M. Ausloos. Collaborative tagging as a tripartite network. ArXiv Computer Science e-prints, December 2005.

[17] RR Larson. Geographic information retrieval and spatial browsing. Geographic information systems and libraries: Patrons, Maps, and Spatial Information. Number Dl. ACM Press, New York, New York, USA, 1996.

[18] J.B. Lovins and MASSACHUSETTS INST OF TECH CAMBRIDGE ELECTRONIC SYSTEMS LAB. Development of a stemming algorithm. 11(June):22–31, 1968.

[19] Sergio Martín, Elio S. Cristobal, Rosario Gil, Gabriel Díaz, Manuel Castro, and Juan Peire. A Context-Aware Application Based on Ubiquitous Location. 2008 The Second International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies, pages 83–88, September 2008.

[20] Kevin S. McCurley. Geospatial mapping and navigation of the web. Proceedings of the tenth international conference on World Wide Web - WWW ’01, pages 221–229, 2001.

[21] Mor Naaman, Susumu Harada, Q.Y. Wang, H. Garcia-Molina, and Andreas Paepcke. Context Data in Geo-Referenced Digital Photo Collections. In Proceedings of the 12th annual ACM international conference on Multimedia, pages 196–203. ACM, 2004.

[22] Isabella Peters and Paul Becker. Folksonomies: indexing and retrieval in Web 2.0. De Gruyter/Saur, Berlin, 2009.

[23] M. F. Porter. An algorithm for suffix stripping, pages 313–316. Morgan Kaufmann Publishers Inc., , CA, USA, 1997.

[24] Jonathan Raper, Georg Gartner, Hassan Karimi, Chris Rizos, Information Science, and Northampton Square. Applications of location–based services: a selected review. Journal of Location Based Services, (791766136), 2008.

[25] Tye Rattenbury, Nathaniel Good, and Mor Naaman. Towards automatic extraction of event and place semantics from flickr tags. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’07, page 103, 2007.

[26] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.

[27] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18:613–620, November 1975.

[28] Gerard Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.

[29] Pavel Serdyukov, Vanessa Murdock, and Roelof van Zwol. Placing flickr photos on a map. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’09, (May):484, 2009.

[30] T.R. Smith. A for geographically referenced materials. Computer, 29(5):54–60, 1996.

[31] Carlo Torniai, Steve Battle, and Steve Cayzer. Sharing, discovering and browsing geotagged pictures on the web. Citeseer, 2007.

[32] Kentaro Toyama, Ron Logan, and Asta Roseway. Geographic location tags on digital images. Proceedings of the eleventh ACM international conference on Multimedia - MULTIMEDIA ’03, page 156, 2003.

Appendix A

Terms and Abbreviations

tf: term frequency
idf: inverse document frequency
TF-IDF: term frequency-inverse document frequency
VSM: Vector Space Model
YQL: Yahoo Query Language
LAMP: Linux, Apache, MySQL and PHP

Appendix B

Blacklist

Figure B.1: Stop-words or Blacklist composed by the most common Flickr tags