A Robust Noise Resistant Algorithm for POI Identification from Flickr Data
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Yiyang Yang⋆, Zhiguo Gong§†, Qing Li¶, Leong Hou U§, Ruichu Cai⋆, Zhifeng Hao‡
⋆ Faculty of Computer, Guangdong University of Technology, Guangzhou, China
§ Department of Computer and Information Science, University of Macau, Macau SAR
¶ Department of Computer Science, City University of Hong Kong, Hong Kong SAR
‡ School of Mathematics and Big Data, Foshan University, Foshan, China

Abstract

Point of Interest (POI) identification using social media data (e.g. Flickr, Microblog) is one of the most popular research topics in recent years. However, there exist large amounts of noise (POI-irrelevant data) in such crowd-contributed collections. The traditional solution to this problem is to set a global density threshold and remove a data point as noise if its density is lower than the threshold. However, the density values vary significantly among POIs. As a result, some POIs with relatively lower density cannot be identified. To solve the problem, we propose a technique based on the local drastic changes of the data density. First, we define the local maxima of the density function as the urban POIs, and the gradient ascent algorithm is exploited to assign data points to different clusters. To remove noise, we incorporate the Laplacian Zero-Crossing points along the gradient ascent process as the boundaries of the POI. Points located outside the POI region are regarded as noise. Then the technique is extended into the geographical and textual joint space so that it can make use of the heterogeneous features of social media. The experimental results show the significance of the proposed approach in removing noise.

1 Introduction

Flickr contains more than 8 billion photos from 8.7 million users; in addition, 3.5 million new photos are uploaded to Flickr daily, and a substantial number of these photos come from mobile devices¹. Such massive amounts of photos with heterogeneous meta-data (geographical and textual tags) are valuable resources to support various mining tasks and have resulted in many research problems, such as Point of Interest (POI) identification [Yang et al., 2014], POI-based applications [Crandall et al., 2009], photo tag analysis [Zhang et al., 2012] and travel pattern recognition [Zheng et al., 2012].

According to our survey, plenty of travel recommendation works [Cheng et al., 2013; Ying et al., 2013; Popescu and Shabou, 2013; Ying et al., 2014] require a predefined POI database as the input to their recommendation algorithms. There is no doubt that the quality of the POI database is critical to the success of their subsequent processes. In this paper, we propose a robust noise-resistant approach for POI identification using geo- and textual-tagged social media data.

In general, the POI database can be automatically constructed by running a clustering algorithm over geo-tagged datasets (i.e., Flickr photos). However, crowd-contributed data like Flickr often contain large amounts of noise, which may harm the quality of the identified POIs. A robust algorithm for POI identification is therefore highly desirable for POI-based applications.

∗ We thank the anonymous reviewers for their many insightful comments and suggestions. This work was supported in part by: NSFC-Guangdong JF (U1501254), NSFC (61603101, 61472089, 61472337), NSF-Guangdong (2014A030306004, 2014A030308008), STPP of Guangdong (2015B010108006, 2015B010131015), FDCT of Macau Government (FDCT/116/2013/A3, FDCT/007/2016/AFJ), UMAC RC (MYRG2015-00070-FST, MYRG2017-00212-FST, MYRG2016-00182-FST), GuangdongHPSSP (2015TQ01X140), TPP of Guangzhou (201610010101, 201604016075)
† Corresponding Author: [email protected]
¹ https://en.wikipedia.org/wiki/Flickr

1.1 Motivations & Contributions

When visiting a city, people may take photos anywhere, and a large percentage of these photos are POI-irrelevant. In this paper, such POI-irrelevant geo-tagged photos are regarded as noise, because they can cause the identified POIs to deviate significantly from the actual ones. To solve the issue, some techniques have been proposed [Ester et al., 1996; Hinneburg and Keim, 1998; Zhao et al., 2009; Purwar and Singh, 2016], where a density threshold is used to filter out the lower-density points as noise. However, such a global setting cannot handle the noise problem well because the density values vary among POIs. Figure 1 shows the results of the DBSCAN algorithm [Ester et al., 1996] with two extreme settings of the density threshold. Figure 1(a) is the result of a lower filtering threshold, where two urban POIs in the downtown area, (A) Arc De Triomphe and (C) Louvre Museum, are merged together. At the same time, a large number of irrelevant data points (geo-tagged photos) are included in the result as well. In contrast, if we overly increase the threshold (Figure 1(b)), one POI region is partitioned into many small clusters, and some important POIs (e.g. (D) Montparnasse Cemetery) may be omitted just because of their relatively lower density. In this paper, we aim at tackling the problem.

Figure 1: Clustering results on the Paris dataset: (a) Low Noise Filtering, (b) High Noise Filtering. The POIs, in alphabetical order, are: (A) Arc De Triomphe, (B) Eiffel Tower, (C) Louvre Museum, (D) Montparnasse Cemetery.

To handle the noise issue, we propose a novel approach based on the Laplacian Zero-Crossing technique [Canny, 1986]. It is constructed based on the relative change of the geographical density. This idea is motivated by the observation that the density of geo-tagged photos undergoes a drastic change when crossing the POI boundaries. The Laplacian of the density function indicates the speedup of the gradient ascent, and it changes from positive to negative when crossing the boundaries of the POI region (the Zero-Crossing of the Laplacian). We attempt to capture such "changes" automatically in the identification, and thus overcome the problem caused by the global density threshold setting.

Flickr photos are tagged not only with geographical locations, but also with texts. The quality of the identified POI can be further improved if the textual features are taken into account. For this purpose, we apply the Locality-Sensitive Hashing (LSH) algorithm [Charikar, 2002] to transform a term vector into a hash value and exploit the Hamming distance [Manku et al., 2007] to measure the difference between two photos in the textual space. This technique benefits our algorithm in two respects: (1) it significantly reduces the computational cost of the distance evaluation between textual tags; (2) with a defined distance metric, it enables us to extend the Laplacian Zero-Crossing algorithm into the Geographical × Textual joint space.

The contributions of this work are summarized as follows:

1. We propose a novel technique, Laplacian Zero-Crossing Detecting, to remove noisy (POI-irrelevant) data along the gradient ascent process. The technique is based on the drastic change of the local density increase, and thus overcomes the global setting problem of the density threshold.

1.2 Organization of the Paper

The remainder of this paper is organized as follows. Section 2 and Section 3 respectively introduce the Hill-Climbing algorithm and the Laplacian technique. In Section 4, we describe how to integrate the textual feature into the framework. We thoroughly evaluate our proposed techniques on the photo collections of Flickr in Section 5. We discuss the related work in Section 6 and summarize our work in Section 7.

2 POI Identification Using Gradient Ascent Algorithm

Intuitively, a location is attractive if visitors take more photos around it. In other words, the location with the highest density of geo-tagged photos among its neighbors is probably a POI. Formally, given a set of geo-tagged photos $x_1, \ldots, x_N$ in the 2-dimensional Euclidean space $\mathbb{R}^2$, the density function of a location $x$ can be estimated through a kernel $K$ as follows [Cheng, 1995]:

\[
T_g(x) = \sum_{i=1}^{N} K\left(\frac{x_i - x}{g}\right) \tag{1}
\]

where $g$ is the geographical bandwidth parameter (e.g. 100 meters). In this paper, the Gaussian kernel $G$ is exploited for its excellent features (i.e. it is simple and infinitely differentiable, and its gradient is easy to estimate). By replacing $K$ by $G$ in Equation 1, the normalized gradient of $T_g$ can be computed as:

\[
\nabla \ln T_g(x) = \frac{\nabla T_g(x)}{T_g(x)} = \frac{2}{g^2}\left(\frac{\sum_{i=1}^{N} G\left(\frac{x_i - x}{g}\right) x_i}{\sum_{i=1}^{N} G\left(\frac{x_i - x}{g}\right)} - x\right) \tag{2}
\]

Instead of $\nabla T_g(x)$, we use $\nabla \ln T_g(x)$ in this work, for its fast convergence and easy estimation [Comaniciu and Meer, 2002].

In our scenario, the city POIs are located at places $x$ with $\nabla \ln T_g(x) = 0$. To look for all POIs in a city, we start the gradient ascent algorithm with any photo $x^0$, then iteratively conduct the following Hill-Climbing Algorithm:

\[
x^{j+1} = x^j + \alpha \nabla \ln T_g(x^j), \quad j = 0, 1, 2, \ldots, m \tag{3}
\]

where $\alpha$ is a parameter for controlling the size of the movement, $x^j$ is the $j$th state of the gradient ascent movement starting from $x^0$, and $x^j$ will converge to $x^m$ with $T_g(x^j)$ monotonically increasing until $\nabla \ln T_g(x^m) = 0$. Hence, $x^m$ is a stationary point and is called the location of a POI, while the sequence of successive states $x^j$, $j = 0, \ldots, m$ is called the trajectory of $x$.

According to the Capture Theorem, the trajectory is attract-
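As an illustration of Equations 1 to 3, the following is a minimal Python sketch of the density estimate, its normalized gradient, and the hill-climbing iteration. It is not the authors' implementation. The function names, the stopping tolerance `tol`, and the iteration cap `max_iter` are hypothetical. The sketch also assumes the unnormalized Gaussian form G(u) = exp(-||u||^2), chosen because it reproduces the 2/g^2 factor in Equation 2, and it treats photo locations as planar coordinates rather than latitude/longitude.

```python
import math

def gaussian_weight(xi, x, g):
    # G((xi - x) / g), assuming G(u) = exp(-||u||^2) (an assumption of this sketch)
    d2 = (xi[0] - x[0]) ** 2 + (xi[1] - x[1]) ** 2
    return math.exp(-d2 / (g * g))

def density(points, x, g):
    # Equation 1 with K = G: T_g(x) = sum_i G((x_i - x) / g)
    return sum(gaussian_weight(p, x, g) for p in points)

def grad_log_density(points, x, g):
    # Equation 2: (2 / g^2) * (kernel-weighted mean of the x_i, minus x)
    weights = [gaussian_weight(p, x, g) for p in points]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed (x is far from every photo); no direction to move in
        return (0.0, 0.0)
    mean_x = sum(w * p[0] for w, p in zip(weights, points)) / total
    mean_y = sum(w * p[1] for w, p in zip(weights, points)) / total
    c = 2.0 / (g * g)
    return (c * (mean_x - x[0]), c * (mean_y - x[1]))

def hill_climb(points, x0, g, alpha, tol=1e-9, max_iter=1000):
    # Equation 3: x^{j+1} = x^j + alpha * grad ln T_g(x^j)
    # Returns the whole trajectory; its last state approximates a stationary point.
    trajectory = [tuple(x0)]
    for _ in range(max_iter):
        x = trajectory[-1]
        gx, gy = grad_log_density(points, x, g)
        if math.hypot(gx, gy) < tol:
            break
        trajectory.append((x[0] + alpha * gx, x[1] + alpha * gy))
    return trajectory

# Illustrative data: a dense group near (0, 0) and a sparser one near (5, 5)
photos = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1),
          (5.0, 5.0), (5.1, 5.0)]
trajectory = hill_climb(photos, (0.1, 0.0), g=1.0, alpha=0.5)
print(len(trajectory), trajectory[-1])
```

Under the assumed kernel, choosing alpha = g^2 / 2 makes each step move x directly to the kernel-weighted mean of the photos, since the factors 2/g^2 and g^2/2 cancel in Equation 3. Photos whose trajectories end at the same stationary point would then be grouped into the same cluster.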