Semantic Trails of City Explorations: How Do We Live a City


Diego Monti · Enrico Palumbo · Giuseppe Rizzo · Raphaël Troncy · Thibault Ehrhart · Maurizio Morisio


Abstract The knowledge of city exploration trails of people is in short supply because of the complexity in defining meaningful trails representative of individual behaviours and in the access to actionable data. Existing datasets have only recorded isolated check-ins of activities featured by opaque venue types. In this paper, we fill the gaps in defining what is a semantic trail of city exploration and how it can be generated by integrating different data sources. Furthermore, we publicly release two datasets holding millions of semantic trails each and we discuss their most salient characteristics. We finally present an application using these datasets to build a recommender system meant to guide tourists while exploring a city.

Keywords Semantic trail · Collective behavior · Location-based social networks · Tourist recommendation · Sequence recommendation

Diego Monti
Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Turin, Italy
Tel.: +39-011-0907087
Fax: +39-011-0907099

Enrico Palumbo
Istituto Superiore Mario Boella, Via Pier Carlo Boggio 61, 10138 Turin, Italy
EURECOM, Sophia Antipolis, 450 Route des Chappes, 06410 Biot, France
Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Turin, Italy

Giuseppe Rizzo
LINKS Foundation, Via Pier Carlo Boggio 61, 10138 Turin, Italy

Raphaël Troncy
EURECOM, Sophia Antipolis, 450 Route des Chappes, 06410 Biot, France

Thibault Ehrhart
EURECOM, Sophia Antipolis, 450 Route des Chappes, 06410 Biot, France

Maurizio Morisio
Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Turin, Italy

arXiv:1812.04367v2 [cs.SI] 30 Dec 2019

1 Introduction

Location-based social networks (LBSNs) allow users to share their position with friends, or even publicly, by performing a check-in when they visit a certain venue or point-of-interest (POI). A POI can be defined as an entity that has a somewhat fixed and physical extension, like a landmark, a building, or a city (as formalized by https://schema.org/Place). A check-in is typically associated with much information of potential interest for researchers specialized in different domains, from urban mobility to recommender systems. For example, many LBSNs classify their POIs in consistent taxonomies, which assign an explicit semantic meaning to each check-in. Furthermore, each venue has a physical location, which can be represented by its geographical coordinates, and each check-in is performed at a specific point in time.

The contribution of this work is threefold: (i) we formally define what is a set of temporally neighboring activities, which we call a semantic trail of check-ins, and how to generate it; (ii) we propose a mapping between the venue categories available in Foursquare and the corresponding Schema.org terms; and (iii) we introduce the Semantic Trails Datasets (STDs), two different datasets of semantically annotated trails created starting from check-ins performed on the Foursquare social network. Unlike other datasets already available, we analyzed the check-ins at our disposal in order to group them into sequences of activities. Furthermore, we enriched the datasets by adding valuable semantic information, that is, the Schema.org terms corresponding to the Foursquare category of the associated venues as well as the GeoNames and Wikidata entities representing the city in which the check-in was performed.

The remainder of this paper is structured as follows. In Section 2, we review related works, while, in Section 3, we introduce the procedure used to generate the STDs. Then, we analyze the main characteristics of our datasets in Section 4 and we present a possible use case in Section 5. Finally, we conclude and outline future works in Section 6.

2 Related Work

Different authors have analyzed user-created geographical data obtained from LBSNs. In the following, we distinguish among works related to data-driven studies (Section 2.1), next POI recommendation (Section 2.2), and check-in datasets (Section 2.3).

2.1 Data-driven Studies

Several works exploited LBSNs for a data-driven understanding of cities and for characterizing the social behaviors related to urban mobility. Noulas et al. [7] relied on a spectral clustering algorithm to create a semantic representation of city neighborhoods and to identify user communities that visit similar categories of places. Li et al. [4] performed a statistical study with the aim of unraveling the correlations among venue categories and their popularity using a large check-in dataset with 2.4 million venues collected from different geographical regions. Preoţiuc-Pietro et al. [10] proposed to create a semantic representation of an urban area by relying on a bag of venue categories: they used such a representation to define a similarity measure between cities. More recently, Rizzo et al. [11] exploited density-based clustering techniques on a dataset containing venue categories to create high-level summaries of the neighborhoods.

2.2 Next POI Recommendation

Other studies relied on LBSN data to create algorithms capable of providing personalized recommendations of venues. A typical task addressed in the literature is the prediction of the next POI that a user is likely to visit during the exploration of a city.

Different approaches have been considered to address this problem: for example, Cheng et al. [1] proposed an extension of the matrix factorization method capable of considering the temporal relations in the check-in sequence, as well as the spatial constraints from the user. Ye et al. [18] introduced a framework based on a mixed hidden Markov model capable of first suggesting the most relevant venue categories and then selecting the actual suggested POIs given the estimated category distribution. Feng et al. [3] addressed the next new POI recommendation problem using a personalized ranking metric embedding method. More recently, Palumbo et al. [8] proposed to recommend venue categories using recurrent neural networks, while Sánchez et al. [12] exploited cross-domain techniques.

2.3 Check-in Datasets

Some check-in datasets collected from LBSNs are already publicly available. The NYC Restaurant Rich Dataset [16] includes check-ins of restaurant venues in New York City only, as well as tip and tag data collected from Foursquare from October 2011 to February 2012. The NYC and Tokyo Check-in Dataset [17] contains check-ins performed in New York City and Tokyo collected from April 2012 to February 2013, together with their timestamp, GPS coordinates and venue category. The Global-Scale Check-in Dataset (GSCD) [15] includes long-term check-in data collected from April 2012 to September 2013, only considering the 415 most popular cities on Foursquare. All the previous datasets were created by Yang et al. [17,15] and they are publicly available on the Web (https://sites.google.com/site/yangdingqi/home/foursquare-dataset).

However, none of them is focused on the analysis of temporal sequences of check-ins. In contrast, our approach for constructing semantic trails is similar to the one proposed by Parent et al. [9], but instead of relying on raw GPS data, we consider check-ins from LBSNs and we semantically annotate them.

3 Semantic Trails Datasets

In this section, we detail the process that we followed for building the Semantic Trails Datasets (STDs) from the collections of check-ins at our disposal, which will be described in Section 4. The exploited algorithm is publicly available in our GitHub repository (https://github.com/D2KLab/semantic-trails), while the resulting datasets have been published on figshare [6].

3.1 Dataset Generation

We based our generation strategy on two initial collections of check-ins obtained from Foursquare, namely the GSCD and a second one created by the authors. Each of these sources consists of two different files serialized using a tabular format. The first one contains the check-ins collected from the platform, while the second one lists the venues involved and their details. More formally, each check-in associates a specific user with a certain venue and a timestamp, which represents the point in time when the check-in was performed.

Definition 1 Given the space of venues V, the space of users U, and the space of timestamps T, a check-in c ∈ C is a tuple c = (ν, υ, τ), where ν ∈ V is the venue in which the user υ ∈ U was located at the timestamp τ ∈ T.

In contrast, a POI is characterized by a unique identifier, its geographical coordinates, and a category selected from the Foursquare taxonomy (https://developer.foursquare.com/docs/resources/categories).

Definition 2 Given the space of categories K, a venue or point-of-interest (POI) ν ∈ V is a tuple ν = (ϕ, λ, κ), where ϕ is the latitude, λ is the longitude, and κ ∈ K is the associated category.

In the following, we define a semantic trail as a list of consecutive check-ins created by the same user within a certain amount of time. This definition is similar to the one of semantic trajectories proposed by Parent et al. [9], but it considers LBSNs instead of GPS data.

Definition 3 A semantic trail s ∈ S is a temporally ordered list of check-ins ⟨c_1, c_2, ..., c_n⟩ created by a particular user υ ∈ U, i.e., for each i, c_i = (ν_i, υ, τ_i) where τ_i < τ_{i+1} ∧ ν_i ≠ ν_{i+1}.

In order to construct the semantic trails from the initial datasets, we processed the check-ins and we analyzed their timestamps to obtain an unambiguous time representation that also includes the time zone. To this end, we exploited the ciso8601 Python library (https://github.com/closeio/ciso8601).

Then, we grouped the check-ins by user and we sorted them according to their timestamp. From such ordered lists of check-ins we constructed the semantic trails by assuming that two check-ins that are not more than eight hours apart belong to the same trail, similarly to what has been done in [2].

In Algorithm 1, we list the procedure for creating the set S, given the set of users U, the set of check-ins C, and the time interval δτ = 8 hours. Please note that some check-ins will not be included in any trail because they are too distant in time and, therefore, they will not be part of the STDs.

Algorithm 1 Generation of the set S.
Require: U ≠ {∅} ∧ C ≠ {∅} ∧ δτ ≐ τ_i − τ_j
 1: S ← {∅}
 2: for all υ ∈ U do
 3:     s ← ∅
 4:     for all c_i ∈ C_υ : τ_{i−1} < τ_i ∧ i > 1 do
 5:         if τ_i < τ_{i−1} + δτ then
 6:             if s is ∅ then
 7:                 s ← ⟨c_{i−1}⟩
 8:             end if
 9:             s ← s + ⟨c_i⟩
10:         else
11:             if not s is ∅ then
12:                 S ← S ∪ {s}
13:                 s ← ∅
14:             end if
15:         end if
16:     end for
17: end for
18: return S

In addition to this algorithm, we applied three different filters before constructing the trails in order to remove suspicious check-ins, which may have been spoofed with the help of automated software.

We first ignored the check-ins performed by a certain user in the same POI multiple times in a row and we only considered the last one, because such repetitions cannot result in meaningful semantic trails. Then, we discarded the check-ins performed by the same user in less than one minute, as it is unreasonable to visit a venue in such a short amount of time.

Finally, we filtered out the check-ins that require an unrealistic speed for moving from a certain venue to the next one. In particular, we removed consecutive check-ins that are associated with a speed greater than Mach 1 (∼ 343 m/s), as this value is higher than the normal cruise speed of an airplane. We computed the distance between two venues by applying the haversine formula to their geographical coordinates [13]. This approach is similar to the one followed in [15].
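The grouping step of Algorithm 1 and the speed check can be sketched in a few lines of Python. This is a minimal illustration and not the code of the authors' repository: the dictionary keys (`time`, `lat`, `lon`), the function names, and the Earth radius value are assumptions made for the example. The sketch also appends the trail that is still open when a user's check-ins end, a step that Algorithm 1 as printed leaves implicit.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000    # mean Earth radius in metres (assumed value)
MAX_GAP = timedelta(hours=8)  # the interval δτ used in Algorithm 1
MAX_SPEED_MS = 343.0          # Mach 1, the speed threshold given in the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def too_fast(prev, curr):
    """True if going from prev to curr would require more than MAX_SPEED_MS."""
    seconds = (curr["time"] - prev["time"]).total_seconds()
    metres = haversine_m(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    return seconds > 0 and metres / seconds > MAX_SPEED_MS


def build_trails(checkins):
    """Group the check-ins of one user into trails of two or more check-ins."""
    ordered = sorted(checkins, key=lambda c: c["time"])
    trails, current = [], []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr["time"] - prev["time"] < MAX_GAP:
            if not current:
                current = [prev]  # open a new trail with the previous check-in
            current.append(curr)
        elif current:
            trails.append(current)  # gap too large: close the open trail
            current = []
    if current:  # flush the trail still open at the end of the list
        trails.append(current)
    return trails
```

Here `build_trails` would be called once per user on check-ins that have already passed the three filters, and `too_fast` shows one way the speed filter could be expressed on a pair of consecutive check-ins.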

3.2 Semantic Enrichment

In order to enrich the available datasets, we identified the city where each venue is probably located by performing the reverse geocoding of its coordinates. For this purpose, we used the reverse_geocoder Python library (https://github.com/thampiman/reverse-geocoder) and the geographical coordinates of all the cities with a population greater than 500 people or that are the seat of a fourth-order administrative division, as reported by GeoNames. We also obtained the corresponding entities from Wikidata and we included their URIs, if available, in the STDs by matching the English city names and the geographical coordinates, when their distance was less than 10 km. We were able to find a correspondence for 84% of the cities available in GeoNames.

Furthermore, we manually mapped the categories listed in the Foursquare taxonomy to the Schema.org vocabulary. If a Foursquare category could not be mapped to a leaf, then we mapped it to an ancestor. The mapping involved three domain experts who performed a two-stage process: in the first stage, two experts elicited mappings and doubts; in the second stage, the expert excluded from the first stage acted as meta-reviewer, validating the mappings and resolving inconsistencies and doubts. The resulting mapping is available in our GitHub repository (https://github.com/D2KLab/semantic-trails/blob/master/mapping.csv). In the STDs, we included both the original Foursquare category and the associated Schema.org entity for each venue.

3.3 Output Formats

The final result of the aforementioned process is available in two different file formats. The first one is a comma-separated values file containing the fields detailed in Table 1. The second one is an equivalent RDF Turtle version of the dataset.

Table 1: The fields available in the STDs.

    Field            Description
    trail_id         The numeric identifier of the trail
    user_id          The numeric identifier of the user
    venue_id         The Foursquare identifier of the venue
    venue_category   The Foursquare identifier of the category
    venue_schema     The Schema identifier of the category
    venue_geonames   The GeoNames identifier of the city
    venue_wikidata   The Wikidata identifier of the city
    venue_city_name  The name of the city
    venue_country    The code of the country
    timestamp        The timestamp of the check-in

The Foursquare user identifier has been anonymized by replacing it with a number. On the other hand, the identifier of the venue corresponds to its Foursquare URI and, therefore, it can be used to retrieve additional information. For each check-in we also provide the category of the venue as available in the Foursquare taxonomy and the corresponding Schema.org term. The GeoNames identifier corresponds to the city in which the venue is located, while the country code refers to the country associated with that city. Finally, the timestamp is expressed in the ISO 8601 format and it has been approximated, for privacy reasons, to the minute.

As an example of the CSV format, we report two semantic trails in Listing 1.

Listing 1: The first two trails obtained from the GSCD.

    trail_id,user_id,venue_id,venue_category,venue_schema,venue_geonames,venue_wikidata,venue_city_name,venue_country,timestamp
    1,1,4ec656207ee537da7d220f91,4bf58dd8d48988d162941735,schema:Place,geonames:5125734,wd:Q3449083,Malverne,US,2012-04-03T18:19:00-04:00
    1,1,4e753db3c65bb91db4493d78,4bf58dd8d48988d116941735,schema:BarOrPub,geonames:5117891,wd:Q3452120,Franklin Square,US,2012-04-04T00:15:00-04:00
    2,1,4cc36d0ad43ba143071c60f8,4bf58dd8d48988d101951735,schema:Store,geonames:5125734,wd:Q3449083,Malverne,US,2012-04-07T12:40:00-04:00
    2,1,4e418ddb887740a51b5572d6,4bf58dd8d48988d134941735,schema:PerformingArtsTheater,geonames:5125734,wd:Q3449083,Malverne,US,2012-04-07T12:46:00-04:00

4 Statistical Analysis

We generated the STDs starting from two different collections of check-ins obtained from the Foursquare platform. The first one is the Global-Scale Check-in Dataset (GSCD), created by the authors of [15] and publicly available on the Web. The second one is a similar but more recent set of check-ins collected by the authors of this work, originally in the context of [8].

In more detail, we used the Twitter API to retrieve the check-ins performed by the users of the Foursquare Swarm mobile application (https://www.swarmapp.com) and publicly shared on Twitter. Then, we collected additional information associated with each check-in, like the venue in which it was performed and its geographical coordinates, thanks to the Foursquare API.
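As an illustration of the CSV format described in Section 3.3, the following sketch loads rows with the Python standard library and groups them by trail. It is only an example under stated assumptions: the function name `read_trails` and the choice of keeping just the timestamp and the Schema.org term are hypothetical, and the two sample rows are the first trail from Listing 1.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Header and first trail copied from Listing 1.
SAMPLE = """\
trail_id,user_id,venue_id,venue_category,venue_schema,venue_geonames,venue_wikidata,venue_city_name,venue_country,timestamp
1,1,4ec656207ee537da7d220f91,4bf58dd8d48988d162941735,schema:Place,geonames:5125734,wd:Q3449083,Malverne,US,2012-04-03T18:19:00-04:00
1,1,4e753db3c65bb91db4493d78,4bf58dd8d48988d116941735,schema:BarOrPub,geonames:5117891,wd:Q3452120,Franklin Square,US,2012-04-04T00:15:00-04:00
"""


def read_trails(handle):
    """Return {trail_id: [(timestamp, venue_schema), ...]} from a CSV handle."""
    trails = defaultdict(list)
    for row in csv.DictReader(handle):
        # ISO 8601 timestamps with a UTC offset parse into aware datetimes.
        when = datetime.fromisoformat(row["timestamp"])
        trails[int(row["trail_id"])].append((when, row["venue_schema"]))
    return dict(trails)


trails = read_trails(io.StringIO(SAMPLE))
first, last = trails[1][0][0], trails[1][-1][0]
duration_s = (last - first).total_seconds()  # time span of trail 1 in seconds
```

For a real release file, the same function could be given an open file handle instead of the in-memory sample.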

Table 2: The number of check-ins, venues, and users, the time interval and the period of collection for the two initial sets of check-ins.

                GSCD         Ours
    Check-ins   33,263,631   12,473,360
    Venues      3,680,126    1,930,452
    Users       266,909      424,730
    Time        532 days     382 days
    Start       2012-04      2017-10
    End         2013-09      2018-10

Table 3: The number of check-ins, trails, venues, users, and cities included in the two STD releases.

                STD 2013     STD 2018
    Check-ins   18,587,049   11,910,007
    Trails      6,103,727    4,038,150
    Venues      2,847,281    1,887,799
    Users       256,339      399,292
    Cities      10,152       52,011

Table 4: The number of check-ins removed because of the different filters that we applied for creating the STDs. The total number of invalid check-ins is not equal to the sum of the different categories because the sets are not disjoint.

                STD 2013     STD 2018
    Venue       2,381,182    275,359
    Time        1,627,688    96,879
    Speed       66,796       13,021
    Total       3,963,133    366,491

We report some statistics regarding these initial collections of check-ins in Table 2. The GSCD contains more check-ins, as it was collected for a higher number of days in a period of great popularity of LBSNs. On the other hand, our dataset is continuously being enriched with new check-ins; therefore, we envision future releases of the STDs based on a future snapshot of our collection of check-ins.

We constructed two different versions of the STDs by applying the procedure described in Section 3 to these initial datasets. The two STD versions are named after the year in which the collection phase ended, that is, 2013 for the GSCD and 2018 for the snapshot of our collection of check-ins.

We list several statistics regarding the STDs in Table 3. It is possible to observe that the number of initial check-ins available in the GSCD has been greatly reduced in STD 2013, while it has only slightly decreased in STD 2018. This result is associated with the different collection protocols of the GSCD and our initial dataset. In fact, we decided to start removing misbehaving users directly during the collection phase, in order to limit the number of calls to the Foursquare API. In detail, we discarded users that twice performed two check-ins in less than a minute, because we identified this as a typical non-human behaviour [8].

The radically different number of cities involved in the semantic trails can also be explained by analyzing the collection protocols. The authors of the GSCD only considered densely populated areas, while we looked for check-ins without applying any geographical filter. The differences in the number of trails and venues are consistent with the size of the initial dataset.

In Table 4, we detail the number of check-ins removed because of the different filters during the creation of the STDs. We observe a similar effect of the filters on the two datasets: for instance, the constraint on the repetition of a venue is always the most selective one. However, the number of invalid check-ins is extremely different, because of the various approaches exploited during the collection of the initial check-ins.

Furthermore, we analyzed the lengths of the semantic trails that we built: the truncated histograms of their distributions are available in Figure 1. We observe that the distributions of the two datasets are similar, even if STD 2013 includes a higher number of trails. The average trail lengths are 3.05 in STD 2013 and 2.95 in STD 2018, while their standard deviations are 2.16 and 1.99 respectively.

We also depicted, in Figure 2, the histograms representing the distributions of time durations, that is, the number of time units between the first and the last check-in of a trail. It is interesting to notice that STD 2013 has a higher number of short trails, while STD 2018 contains more trails that have a relatively longer time duration with respect to very short ones. This difference may be explained by the fact that the platform and the behaviour of its users evolved over the years: longtime users may be more willing to share check-ins in a constant way.

In order to analyze the check-ins of the two datasets from a spatial point of view, we considered the distributions of the number of check-ins for each city. As can be deduced from Figure 3, STD 2013 includes a lower number of cities with fewer than a hundred check-ins, while STD 2018 contains many cities with a relatively low number of check-ins. This result is also related to the different number of cities available in the datasets as a consequence of the initial collection protocol. For these reasons, STD 2018 may be more useful to characterize globally widespread behaviours, while the focus of STD 2013 is only on densely populated areas.
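The trail length statistics discussed above can be reproduced from the `trail_id` column alone. The sketch below uses a small hypothetical list of identifiers, one per check-in row, rather than real STD data; it also assumes the population standard deviation, since the text does not state which estimator was used.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical trail_id column: one entry per check-in row of the CSV file.
trail_ids = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3]

# Number of check-ins in each trail, in order of first appearance.
lengths = list(Counter(trail_ids).values())

avg_length = mean(lengths)    # average trail length
std_length = pstdev(lengths)  # population standard deviation (assumption)
```

On the released files, `trail_ids` would be read from the first column of each row instead of being written by hand.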

Fig. 1: Histograms representing the distribution of trail lengths for (a) STD 2013 and (b) STD 2018 (x axis: number of check-ins; y axis: number of trails). We only considered trails with fewer than 10 venues due to graphical constraints. The scale of the vertical axis is in millions.

Fig. 2: Histograms representing the distribution of trail time duration for (a) STD 2013 and (b) STD 2018 (x axis: trail duration; y axis: number of trails). We only considered trails lasting less than 24 hours due to graphical constraints. Both axes are represented in a logarithmic scale. The unit of the x axis is seconds. The discontinuity of the curves is caused by the time limit used to build the trails.

We also computed the number of check-ins for each country in the two STDs, which are reported in Table 5. Some interesting differences emerge from these results: for example, Japan moved from the fifth to the first place in STD 2018, while Brazil was superseded by Malaysia. These observations can be easily explained by considering the different collection protocols and the possible changes in the usage patterns of the Foursquare platform over the years.

Furthermore, we investigated the number of check-ins from STD 2018 in the two most popular countries, namely Japan and Turkey, grouped by the Schema.org category of their venue. The purpose of this analysis, whose results are listed in Table 6, is to propose a simple but effective way of characterizing the different human behaviours that are typically associated with a certain culture. From these figures it is possible to observe that check-ins performed in train stations are very common in Japan, while in Turkey the most widespread category of venues is coffee shop.

In order to demonstrate the usefulness of a semantically annotated dataset, we computed additional statistics by also relying on external information obtained from GeoNames. In detail, we downloaded the number of inhabitants of the cities in which the check-ins were performed, if available, and we considered the check-ins of small cities separately from the ones of big cities. We define a big city as a city with more than a hundred thousand inhabitants.

In Table 7, we list the number of trails and check-ins in the STDs performed in small and big cities, while in Table 8 we report the most frequent venue categories
in STD 2018 grouped by the size of the cities. It is interesting to notice that airports are associated with small cities, as they are usually located outside densely populated areas.

Fig. 3: Histograms representing the distribution of cities per number of check-ins for (a) STD 2013 and (b) STD 2018 (x axis: number of check-ins; y axis: number of cities). We only considered cities with fewer than a hundred check-ins due to graphical constraints. The second dataset is more geographically widespread than the first one, as it contains a higher number of cities with a lower number of check-ins.

Table 5: The five countries with the highest number of check-ins in the two versions of the dataset.

    (a) STD 2013
    Country     Check-ins
    Turkey      3,282,073
    Brazil      1,994,148
    USA         1,859,310
    Malaysia    1,584,552
    Japan       1,553,603

    (b) STD 2018
    Country     Check-ins
    Japan       5,075,916
    Turkey      2,030,934
    Kuwait      801,867
    Malaysia    764,733
    USA         583,445

Table 6: The five Schema.org venue categories with the highest number of check-ins from STD 2018 in Japan and in Turkey.

    (a) Japan
    Entity                      Check-ins
    schema:TrainStation         1,198,732
    schema:Restaurant           704,782
    schema:CivicStructure       428,321
    schema:ConvenienceStore     152,014
    schema:SubwayStation        147,659

    (b) Turkey
    Entity                      Check-ins
    schema:CafeOrCoffeeShop     371,118
    schema:CivicStructure       179,997
    schema:Restaurant           167,832
    schema:AdministrativeArea   152,237
    schema:FoodEstablishment    132,183

Table 7: The number of trails and check-ins performed in small and big cities.

    (a) Small cities
                Trails      Check-ins
    STD 2013    2,045,440   4,444,930
    STD 2018    1,705,937   3,584,304

    (b) Big cities
                Trails      Check-ins
    STD 2013    4,693,791   12,350,172
    STD 2018    2,855,417   6,971,715

5 Tourist Sequence Recommender

The rich set of metadata collected in the STDs provides an explicit semantic meaning to users' activities. In fact, venue categories play an important role in POI recommender systems, as they make it possible to model user interests and personalize the recommendations [5]. The concept of trail, as defined previously, exploits the concept of temporal correlation, which is a cornerstone for generating sequences of activities. In the past years,
little attention has been dedicated to the temporal correlations among venue categories in the exploration of a city, which is nonetheless a crucial factor in recommending POIs.

Take the example of a check-in in an Irish Pub at 8 PM: is the user more likely to continue the evening in a Karaoke Bar or in an Opera House? Is a Chinese Restaurant or an Italian Restaurant better for dinner after a City Park in the morning and a History Museum in the afternoon? Note that generating these sequences requires an implicit modeling of at least two dimensions: temporal, as certain types of venues are more temporally related than others (e.g. after an Irish Pub, people are more likely to go to Karaoke than to a History Museum), and personal, as venue categories implicitly define a user profile, independently from their order (e.g. Steakhouse and Vegetarian Restaurant do not go frequently together). Most existing studies attempt to directly model sequences of POIs rather than their categories to recommend the next POI to a user.

In [8], we presented an approach based on a neural learning model, and more precisely, Recurrent Neural Networks (RNNs), to generate sequences of tourist activities. The RNNs are trained with the sequential data available in the STDs and the output is expected to be a sequence of categories. The space of possible categories is defined by the Foursquare taxonomy, which classifies venues hierarchically. In order to initiate the generation process, the neural learning model takes as input a seed, i.e. a category from which the tourist wishes to start their city exploration. Figure 4 illustrates the process of semantic city trail generation. As can be observed in the figure, the instantiation of places or events (entities) was considered an integral part of the process and it was performed by querying the 3cixty knowledge base [14].

The impact of this use case was assessed through a controlled online experiment with real users, which showed the value of the STDs both as a meaningful resource for learning a model that generates tourist activity sequences and in terms of the quality of the metadata used to train our neural learning models.

Table 8: The five Schema.org venue categories associated with the highest number of check-ins performed in small and big cities in STD 2018.

    (a) Small cities
    Entity                     Check-ins
    schema:CivicStructure      433,467
    schema:Restaurant          372,225
    schema:CafeOrCoffeeShop    303,209
    schema:FoodEstablishment   184,563
    schema:Airport             171,464

    (b) Big cities
    Entity                     Check-ins
    schema:TrainStation        1,005,890
    schema:Restaurant          942,711
    schema:CivicStructure      497,117
    schema:CafeOrCoffeeShop    443,195
    schema:ShoppingCenter      319,543

6 Conclusion and Future Work

In this work, we introduced the STDs, two datasets containing millions of check-ins performed on Foursquare and grouped into semantic trails, that is, sequences of temporally neighboring activities. We described the algorithm used to generate such trails and we detailed the process followed to enrich the available data. We associated each check-in with the Schema.org term representing the venue category in which it was performed and we also identified the Wikidata entities corresponding to the city and country of the venue. We characterized the two datasets by analyzing them along different dimensions and we demonstrated the usefulness of semantically annotated data by relying on external information to compute additional statistics. Finally, we briefly described a possible use case of such datasets, in which we proposed a tourist recommender system trained using the trails available in STD 2018. However, we envision different possible scenarios that could benefit from such datasets, for example human behaviour analysis and urban mobility studies.

The generation phase brought three points to further attention, namely the complexity of the mapping between Foursquare categories and Schema.org, the difficulties in obtaining a comprehensive list of cities, and the possible issues caused by inconsistencies present on Wikidata. We observed that several venue categories are not available on Schema.org: for this reason, they have been associated with the most similar term or with a common ancestor. Furthermore, even if some categories are available, they are not considered as a more specific type of schema:Place or schema:Event, and, therefore, they have been mapped to a general term. For example, schema:CollegeOrUniversity is considered an organization and not a physical place, so universities have been mapped to schema:CivicStructure. However, the mapping available in our GitHub repository is not meant to be final, and other researchers are invited to submit pull requests to improve it for future releases of the STDs. We also observed that there is no widespread entity that represents the concept of “city”
on Wikidata. For this reason, we decided to rely on the definition provided by GeoNames, even if it considers some districts and neighborhoods as cities. We initially tried to rely on the DBpedia type dbo:City, but we empirically observed a high number of wrong or missing entities. Finally, we are aware of the fact that some URIs representing a city may be erroneous, due to duplicates or incorrect mappings between Wikidata and GeoNames. However, these problems can be fixed by future releases of our datasets if they are first resolved in the exploited knowledge base. These points will be part of future research activities.

Fig. 4: The illustration of the Tourist Sequence Recommender in action: it takes as input a seed and it generates a sequence composed of places and events contextualized according to the city where the tourist is located. In the depicted example, Matt is going to visit the Picasso Museum in Antibes this afternoon; the generated sequence of venues and events that he may be interested in afterwards is: first a beer in an Irish Pub (Hop Store), then a dinner in a French Restaurant (Le Jardin), and lastly the Jazz à Juan event in a Jazz Club.

References

1. Cheng, C., Yang, H., Lyu, M.R., King, I.: Where you like to go next: Successive point-of-interest recommendation. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 2605–2611. AAAI (2013). URL https://www.ijcai.org/Proceedings/13/Papers/384.pdf
2. Choudhury, M.D., Feldman, M., Amer-Yahia, S., Golbandi, N., Lempel, R., Yu, C.: Automatic construction of travel itineraries using social breadcrumbs. In: Proceedings of the 21st ACM conference on hypertext and hypermedia. ACM (2010). DOI 10.1145/1810617.1810626
3. Feng, S., Li, X., Zeng, Y., Cong, G., Chee, Y.M., Yuan, Q.: Personalized ranking metric embedding for next new POI recommendation. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pp. 2069–2075. AAAI (2015). URL https://www.ijcai.org/Proceedings/15/Papers/293.pdf
4. Li, Y., Steiner, M., Wang, L., Zhang, Z.L., Bao, J.: Exploring venue popularity in Foursquare. In: 2013 Proceedings IEEE INFOCOM. IEEE (2013). DOI 10.1109/infcom.2013.6567164
5. Liu, B., Xiong, H.: Point-of-interest recommendation in location based social networks with topic and location awareness. In: Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 396–404. Society for Industrial and Applied Mathematics (2013). DOI 10.1137/1.9781611972832.44
6. Monti, D., Palumbo, E., Rizzo, G., Troncy, R., Ehrhart, T., Morisio, M.: Semantic trails datasets (2019). DOI 10.6084/m9.figshare.7429076.v2. URL https://figshare.com/articles/Semantic_Trails_Datasets/7429076/2
7. Noulas, A., Scellato, S., Mascolo, C., Pontil, M.: Exploiting semantic annotations for clustering geographic areas and users in location-based social networks. In: Fifth International AAAI Conference on Weblogs and Social Media (2011). URL https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/3845/4388
8. Palumbo, E., Rizzo, G., Troncy, R., Baralis, E.: Predicting your next stop-over from location-based social network data with recurrent neural networks. In: Proceedings of the 2nd Workshop on Recommenders in Tourism co-located with 11th ACM Conference on Recommender Systems, no. 1906 in CEUR Workshop Proceedings, pp. 1–8. CEUR-WS.org (2017). URL http://ceur-ws.org/Vol-1906/paper1.pdf
9. Parent, C., Spaccapietra, S., Renso, C., Andrienko, G., Andrienko, N., Bogorny, V., Damiani, M.L., Gkoulalas-Divanis, A., Macedo, J., Pelekis, N., Theodoridis, Y., Yan, Z.: Semantic trajectories modeling and analysis. ACM Computing Surveys 45(4), 42:1–42:32 (2013). DOI 10.1145/2501654.2501656
10. Preoţiuc-Pietro, D., Cranshaw, J., Yano, T.: Exploring venue-based city-to-city similarity measures. In: Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Computing. ACM (2013). DOI 10.1145/2505821.2505832
11. Rizzo, G., Meo, R., Pensa, R.G., Falcone, G., Troncy, R.: Shaping city neighborhoods leveraging crowd sensors. Information Systems 64, 368–378 (2017). DOI 10.1016/j.is.2016.06.009
12. Sánchez, P., Bellogín, A.: A novel approach for venue recommendation using cross-domain techniques. In: Workshop on Intelligent Recommender Systems by Knowledge Transfer and Learning co-located with the 12th ACM Conference on Recommender Systems (2018). URL https://arxiv.org/abs/1809.09864
13. Sinnott, R.W.: Virtues of the haversine. Sky and Telescope 68, 159 (1984)

14. Troncy, R., Rizzo, G., Jameson, A., Corcho, O., Plu, J., 2013 ACM international joint conference on pervasive Palumbo, E., Hermida, J.C.B., Spirescu, A., Kuhn, K.D., and ubiquitous computing. ACM (2013). DOI 10.1145/ Barbu, C., Rossi, M., Celino, I., Agarwal, R., Scanu, C., 2493432.2493464 Valla, M., Haaker, T.: 3cixty: Building comprehensive 17. Yang, D., Zhang, D., Zheng, V.W., Yu, Z.: Modeling user knowledge bases for city exploration. Journal of Web activity preference by leveraging user spatial temporal Semantics 46-47, 2–13 (2017). DOI 10.1016/j.websem. characteristics in LBSNs. IEEE Transactions on Systems, 2017.07.002 Man, and Cybernetics: Systems 45(1), 129–142 (2015). 15. Yang, D., Zhang, D., Qu, B.: Participatory cultural DOI 10.1109/tsmc.2014.2327053 mapping based on collective behavior data in location- 18. Ye, J., Zhu, Z., Cheng, H.: What’s your next move: based social networks. ACM Transactions on Intelli- User activity prediction in location-based social net- gent Systems and Technology 7(3), 1–23 (2016). DOI works. In: Proceedings of the 2013 SIAM Interna- 10.1145/2814575 tional Conference on Data Mining, pp. 171–179. Society 16. Yang, D., Zhang, D., Yu, Z., Yu, Z.: Fine-grained for Industrial and Applied Mathematics (2013). DOI preference-aware location search leveraging crowdsourced 10.1137/1.9781611972832.19 digital footprints from LBSNs. In: Proceedings of the
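As an illustration of how duplicate or conflicting city mappings between Wikidata and GeoNames might be detected, the following is a minimal sketch and not part of the pipeline used to build the datasets. It assumes the mappings are available as (Wikidata URI, GeoNames ID) pairs; the function name `find_ambiguous_mappings` and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def find_ambiguous_mappings(pairs):
    """Return (duplicates, conflicts) for (wikidata_uri, geonames_id) pairs.

    duplicates: GeoNames IDs linked to more than one Wikidata URI.
    conflicts: Wikidata URIs linked to more than one GeoNames ID.
    """
    by_geonames = defaultdict(set)
    by_wikidata = defaultdict(set)
    for wikidata_uri, geonames_id in pairs:
        by_geonames[geonames_id].add(wikidata_uri)
        by_wikidata[wikidata_uri].add(geonames_id)
    duplicates = {g: sorted(w) for g, w in by_geonames.items() if len(w) > 1}
    conflicts = {w: sorted(g) for w, g in by_wikidata.items() if len(g) > 1}
    return duplicates, conflicts


# Hypothetical sample data, not taken from the released datasets.
sample_pairs = [
    ("wd:Q_A", "1001"),
    ("wd:Q_B", "1001"),  # two Wikidata entities share one GeoNames ID
    ("wd:Q_C", "1002"),
    ("wd:Q_C", "1003"),  # one Wikidata entity maps to two GeoNames IDs
    ("wd:Q_D", "1004"),  # unambiguous mapping
]
duplicates, conflicts = find_ambiguous_mappings(sample_pairs)
```

Entries flagged in this way would still need manual review, since a shared identifier can reflect either a genuine duplicate or an incorrect link in one of the knowledge bases.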