A Descriptive Model for Predicting Popular Areas in a Web Map
Total Page:16
File Type:pdf, Size:1020Kb
Recent Researches in Artificial Intelligence, Knowledge Engineering and Data Bases A Descriptive Model for Predicting Popular Areas in a Web Map RICARDO GARCIA,´ JUAN PABLO DE CASTRO, MARIA´ JESUS´ VERDU´ ELENA VERDU,´ LUISA MARIA´ REGUERAS and PABLO LOPEZ´ University of Valladolid Higher Technical School of Telecommunications Engineering Department of Signal Theory, Communications and Telematics Engineering Campus Miguel Delibes, Paseo Belen´ 15, 47011 Valladolid SPAIN fricgar, juacas, marver, elever,[email protected], [email protected] Abstract: The increasing popularity of web map services has motivated the development of more scalable services in the Spatial Data Infrastructures (SDI). Tiled map services have emerged as an scalable alternative to traditional map services. Instead of rendering image maps on the fly, a collection of pre-generated image tiles can be retrieved very fast from a server-side cache. However, storage requirements and start-up time for generating all tiles are often prohibitive for many potentially providers when the cartography covers large areas for multiple rendering scales, which forces to use partial caches containing a subset of the total tiles. This work proposes a descriptive model based on the mining of real-world logs from several nationwide public web map services in Spain. The proposed model is able to determine in advance which areas are likely to be requested in the future based exclusively on past accesses. Tiles that are anticipated to be requested soon can be pre-generated and cached for faster retrieval. As the number of tiles grows exponentially with the rendering resolution level, it is rarely feasible to work with statistics of individual tiles. To overcome this issue, a simplified model is proposed which combines statistics from multiple tiles to reduce the dimension of the tiling space. Simulations demonstrate that significant savings of storage requirements can be achieved by using a partial cache with the proposed model, while maintaining a high cache hit ratio. Key–Words: Web mapping, Map tile, WMTS, SDI, WMS, Descriptive model, Logs, Proxy cache 1 Introduction This situation has also motivated the development of tile-based recommendations like the Web Map Ser- The diverse Web Map Service (WMS) specifications vice - Cached (WMS-C) specification from OSGeo [1] of the Open Geospatial Consortium (OGC) fo- [4] or the new Web Map Tile Service (WMTS) of the cuses on flexibility, enabling clients to obtain exactly OGC[5]. the final image they want. However, spatial parame- When an OGC service is used in a demanding ters in map requests are not restricted, which forces environment, with stationary and bit configurable pa- map images to be generated on the fly through an ex- rameters, the proxy web cache pattern can be used to pensive process that implies access to the data store, improve the perceived quality of service. A proxy is a style application, layer composition and compression device placed seamlessly anywhere between the client of the final image. and the final service, intercepting user’s requests [6]. This process has been proved to be ineffective to Using this approach, providers have to cope with satisfy the requirements of some massive applications, serious decisions about the design and maintenance as explained in [2] after NASA’s experience with the of their tile caches. The immediate option is to pre- OnEarth map server. For this reason, most popu- generate all tiles from all available scales. However, lar commercial services, like Google Maps or Mi- only big corporations have currently enough storage crosoft Bing Maps, have adopted their own specifica- resources to store all the tiles. This is not a problem tions where the geographic space is tiled according to for them, but smaller companies must decide carefully a predefined grid, and content is usually pre-generated which part of the content should be pre-generated. [3]. Anyway, there are cartographic layers that should ISBN: 978-960-474-273-8 397 Recent Researches in Artificial Intelligence, Knowledge Engineering and Data Bases be updated frequently and every of them start from an empty cache. In general, to pre-generate all the ob- y coordinates indexes jects is not a good approach when serving frequently Ymax updated maps, such as weather or traffic information 0, jmax … i, jmax i+1, jmax … imax, jmax maps. The rest of the paper is organized as follows. Sec- ΔY tion 2 introduces a formal characterization of the tiling … … … … … … space. Section 3 defines a descriptive model to predict which tiles should be cached from logs of past server 0, j+1 … i,j+1 i+1,j+1 … imax, j+1 accesses. This model is analyzed and the experiment yi+1 results are presented later. Finally, Section 4 includes 0, j … i,j i+1,j … imax, j the main conclusions of this work. yi … … … … … … 2 Tiling space 0,0 … i,0 i+1,0 … imax,0 Y In order to offer a tiled web map service, the web map min xi xi+1 x server renders the map across a fixed set of scales Xmin Xmax through progressive generalization. Rendered map ΔX images are then divided into tiles, describing a tile pyramid At any moment a cache status can be defined Figure 1: Uniform tile space for a certain level of the where its managed objects can be available with a cer- scale pyramid tain probability. Spatial cache objects can be identified by their coordinates. Each tile is defined as T (i; j; n), hav- The latency to serve a request for a given tile with ing i; j; n 2 N (where n is the resolution level in coordinates (i; j; n) at time t can be determined by the scale pyramid and i; j are the spatial indexes for (3): this level). Therefore, Ph fT (i; j; n)g is defined as τ (i; j; n; t) = Ph fT (i; j; n)g (t)τh + the probability of getting a cache hit for the requested + (1 − Ph fT (i; j; n)g (t)τm) (3) tile T (i; j; n), at time t. Denominate τh to the cost in seconds to delivery an object directly from the cache Combining (3) and (1) a probabilistic expression can and τm to the cost in seconds to build a tile through be obtained to estimate the average latency of the ser- the original services. vice: Being f (x; y; n) the spatial probability density X req τ (t) = (τ − P fT (i; j; n)g (t)(τ − τ )) that characterizes the spatial distribution of the cen- m h m h troids of the requested map tiles for the scale n at time hijni t, in general, for any request distribution, tiles are re- Preq fT (i; j; n)g (4) quested with probability (1). Some components can be identified in (4), which yn;j+1 xn;i+1 should be parametrized at least locally. Z Z Some important characteristics of this model can Preq fT (i; j; n)g = freq (x; y; n) dxdy be: y=yn;j x=xn;i ◦ There is no independence between (1) Ph fT (i; j; n)g (t) and Preq fT (i; j; n) ; tg be- Although this result can be useful to analyze non tiled cause the cache status is intimately connected to the requests to a proxy-cache, it can be simplified assum- service requests history. ing that requests are constrained to a reference grid cell (Figure 2). In this case, the probability of receiv- ◦ There is no temporal invariance during start-up of ing a request of the tile T (i; j; n) with size ∆x∆y is transient states. (2): ◦ The probability density function is not uniform and it probably represents a direct relationship with the Preq fT (i; j; n) ; tg = freq (x; y; n; t) ∆x∆y (2) spatial structure of the underlying information. ISBN: 978-960-474-273-8 398 Recent Researches in Artificial Intelligence, Knowledge Engineering and Data Bases Another factor that must be addressed is the prob- tional Geographic Institute (IGN)3 of Spain. lem of managing exponential data structures. In a Tiled versions of these services use caches imple- cache with pyramidal scales, the number of objects in- mented by Metacarta Tilecache [11]. This cache sys- creases exponentially according to the n value. There- tem follows the OSGeo WMS-C specification. fore, the use of analytic or predictive algorithms can be impractical, even with the support of heuristic al- 3.2 Retrieving request data gorithms, if they aim to be used with all the pyramid levels. Tile map requests are extracted from the Apache’s For this reason, it is very useful to obtain a statis- standard access log configured using the Common tical relationship model between different scale lev- Log Format [12]. Information retrieved from these els. The statistics of a level can be extrapolated to records is as follows: other near levels. Assuming that the geographic lo- ◦ Request date, with precision in seconds. cation is a relevant information for all the scale lev- ◦ IP address or hostname of the remote client that els, and that requests have significant spatial correla- made the request to the server. tion, heuristic algorithms like Locality Principle [7] can be used. These algorithms should manage the ◦ Server status code returned to the the client. This whole cache through statistical probes within levels information is very valuable, because it reveals containing a manageable number of objects (tiles). whether the request was successfully returned or not. ◦ Size of the returned object. This value does not in- P {i −1, j +1,n} req,l clude the response headers, and it is expressed in Preq,l {i, j +1,n} Level l P {i −1, j,n} req,l bytes.