International Journal of Geo-Information

Article

GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources

Chih-Yuan Huang 1,* and Hao Chang 2

1 Center for Space and Remote Sensing Research, National Central University, Taoyuan 320, Taiwan
2 Department of Civil Engineering, National Central University, Taoyuan 320, Taiwan; [email protected]
* Correspondence: [email protected]; Tel.: +886-3-422-7151 (ext. 57692)

Academic Editor: Wolfgang Kainz Received: 8 June 2016; Accepted: 29 July 2016; Published: 5 August 2016

Abstract: With the advance of World-Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. Thus, issues of "big geospatial data management" have started to attract attention. Among these issues, this research focuses on discovering distributed geospatial resources. As resources are scattered on the WWW, users cannot find resources of interest efficiently. While the WWW has Web search engines addressing web resource discovery issues, we envision that the geospatial Web (i.e., GeoWeb) also requires GeoWeb search engines. To realize a GeoWeb search engine, one of the first steps is to proactively discover GeoWeb resources on the WWW. Hence, in this study, we propose the GeoWeb Crawler, an extensible Web crawling framework that can find various types of GeoWeb resources, such as Open Geospatial Consortium (OGC) web services, Keyhole Markup Language (KML) files, and Environmental Systems Research Institute, Inc. (ESRI) Shapefiles. In addition, we apply the distributed computing concept to improve the performance of the GeoWeb Crawler. The result shows that for the 10 targeted resource types, the GeoWeb Crawler discovered 7351 geospatial services and 194,003 datasets. As a result, the proposed GeoWeb Crawler framework is proven to be extensible and scalable for providing a comprehensive index of the GeoWeb.

Keywords: Geospatial Web; resource discovery; Web crawler; Open Geospatial Consortium

1. Introduction

1.1. Big Geospatial Data

Geospatial data are used to support decision making, design, and statistics in various fields, such as administration, financial analysis, and scientific research [1–4]. With the advance of the World-Wide Web (WWW), Web 2.0 represents a concept that allows everyone to easily set up websites or post content on the Web [5]. Users can share their data or services on different Web 2.0 platforms, such as Facebook, Wikipedia, and YouTube. Among all the resources (including data and services) available online, the resources that contain geospatial information constitute the "GeoWeb" [6]. The main objective of the GeoWeb is to provide access to geospatial data or services through the worldwide coverage of the WWW. For example, one of the most famous kinds of GeoWeb resource is the online map service [7]. According to [6], the GeoWeb is based on a framework of open standards and standards-based technologies, such as geospatial services and Spatial Data Infrastructures (SDIs) [8]. Users can then connect to these resources to retrieve geospatial data/information for further analyses.

While more and more data on the Web are generated faster than at any time in the past, the concept of "big data" has started attracting attention. Laney [9] presented the 3V characteristics of big data, which are volume, velocity, and variety. Specifically, data on the Web are generated at a high rate (velocity), require large storage (volume), and have various types (variety), such as text, image, and video. Data in the GeoWeb also fit the 3V characteristics of big data [10]. For instance, satellite images raise the volume issue since a single image may be up to several gigabytes. For the velocity, due to the advance of sensor and communication technology, the number of sensors is rapidly increasing, and sensors are generating observations at high frequency. In addition, the sensor observations in the GeoWeb [11] are produced by a large number of sensor nodes that monitor various kinds of phenomena, such as temperature, humidity, and air pressure, and have a large variety regarding data formats, web protocols, and semantic frameworks.

1.2. Problems in Geospatial Resource Discovery

While big data provide opportunities to discover phenomena that were previously undiscoverable, various big data management issues should be addressed [10]. Among these issues, this study mainly focuses on GeoWeb resource discovery. Currently, if scientists want to search for geospatial data on the WWW, they usually start by finding a data portal or an SDI. Data portals are the web sites/services that data providers host, such as NASA's data portal (https://data.nasa.gov/) and the National Climatic Data Center (http://www.ncdc.noaa.gov/). SDIs are mainly the catalog and registry services that provide entries of data services. While providers can manage services on their own, users can find the service entries from SDIs and access the services directly. INSPIRE (http://inspire-geoportal.ec.europa.eu/), Data.gov (http://www.data.gov/), and GEOSS (http://www.geoportal.org/web/guest/geo_home_stp) are examples of SDIs.

However, a critical issue of the existing approaches is that a user may not know where to find the data portals or SDIs that provide the data or services the user needs. Besides, while data in portals are hosted by different organizations and data in SDIs are registered by different providers, no single data portal or SDI can provide a complete data index of the GeoWeb. In addition, heterogeneity problems exist in the user interfaces and data between portals and SDIs, so users need to spend extra time learning the procedures to access data and understanding the data. In terms of SDIs, the complex registry procedures and the lack of incentive reduce the willingness of providers to register their data. In general, these problems (further discussed in Section 2.1) make GeoWeb resource discovery inefficient.

Furthermore, resources outside geospatial data portals/SDIs are important as well. Based on [12], the GeoWeb resources can be modeled as a long tail (Figure 1). While most geospatial data portals and SDIs are managed by a small number of government or research institutes, these portals and SDIs provide a substantial amount of geospatial data. These data are called the "head" of the long tail and are usually accessible to users. On the other hand, there are small data portals and GeoWeb resources hosted by a large number of individual scientists or small institutes, which comprise the "tail". According to the long tail theory [13], the amount of data in the tail equals the amount in the head. However, the resources in the tail are not indexed by the data portals or SDIs in the head, and are usually undiscoverable to most users. Besides, one of the important features of GIS (Geographic Information System) is overlaying interdisciplinary layers and discovering the relations between them. Data scientists especially may not always know where to find the geospatial data they need [14]. In this case, it is important to have a repository that organizes various types of geospatial data. Therefore, in order to provide users a complete index of the GeoWeb and a one-stop entry to find every GeoWeb resource, we need a solution that can automatically discover every geospatial resource shared online.

In order to make a comprehensive discovery on the WWW, one of the most effective solutions is the web search engine, such as Google, Bing, and Yahoo!. These search engines apply the web crawler approach to find every web page by traversing through every hyperlink [15,16].
However, searching for geospatial resources is not easy since most of the search engines are designed to search plain-text web pages. Users still need to identify geospatial resources from web pages by themselves. In addition, the GeoWeb has various resource types, such as Open Geospatial Consortium (OGC) web services, geospatial data like GeoTIFF, GeoJSON, and Shapefile, and catalogue services. There is no general solution to identify every type of resource. For these reasons, the GeoWeb needs a specific search engine with an extensible crawling mechanism to identify different types of GeoWeb resources.

Figure 1. The GeoWeb Long Tail.

In addition, following open standards is the key to easily identifying GeoWeb resources. Since open standards regulate the client-server communication protocols and data models/formats, GeoWeb resources can be interoperable as long as clients and servers follow open standards, such as the standards of the OGC and the International Organization for Standardization (ISO). On the other hand, for resources published following proprietary protocols, although these proprietary resources could also be identified, there is an additional cost to customize specific connectors.

1.3. Research Objective

For building a GeoWeb search engine, the first step is to collect a comprehensive index of GeoWeb resources. Hence, the main focus of this study is to identify and index every geospatial resource on the WWW. To be specific, we propose a large-scale GeoWeb crawling framework, the GeoWeb Crawler, to proactively discover GeoWeb resources. Although web crawlers have been widely used in web search engines to discover web pages, the GeoWeb Crawler needs to be extensible to identify various types of geospatial resources. Therefore, one of the keys in this study is to define the identification rules for different GeoWeb resource types and apply these rules in a crawling framework. In this case, the GeoWeb Crawler can be easily extended to find other resource types. In addition, as web crawling is a time-consuming process, in order to discover geospatial resources efficiently, we applied the distributed computing concept [17] in the GeoWeb Crawler to execute the crawling process in parallel. Overall, this research proposes an extensible and scalable GeoWeb crawling framework that has the potential to proactively identify every geospatial resource shared on the Web. We believe that the GeoWeb Crawler could be the foundation for a GeoWeb search engine.

The remainder of this paper is organized as follows. Section 2 discusses existing approaches for GeoWeb resource discovery. Section 3 defines the GeoWeb resource distribution and the research scope. In Section 4, we present the details of the GeoWeb Crawler, including extracting Uniform Resource Locators (URLs) from plain-text web pages, identification of different types of resources, the distributed computing workflow, and filtering of redundant URLs. Section 5 explains the crawling results and evaluations from different perspectives. In Section 6, we demonstrate the prototype of a GeoWeb search engine, called GeoHub, based on the resources discovered by the GeoWeb Crawler. Finally, the conclusions and future work of this study are presented in Section 7.


2. Related Work

2.1. Existing Approaches for GeoWeb Resource Discovery

As aforementioned, we categorize the existing approaches for finding GeoWeb resources into the data portal/service and catalog approaches. These two approaches can be further separated into open-standard and non-open-standard solutions, as shown in Table 1. Web services such as the OGC web services and the Open Source Geospatial Foundation (OSGeo) Tile Map Service are examples of data portals/services following open standards; these geospatial resources are published according to the specifications. On the other hand, most data portals, such as NASA's data portal and the National Climatic Data Center, are not based on any open standard, and the data models and communication protocols of these non-open-standard portals/services differ from each other. Catalogs, in turn, are the entrances to find resources. The OGC Catalogue Service for the Web (CSW) is an open standard regulating the catalog service interface, while SDIs such as Data.gov and GEOSS are developed independently. Here we discuss the problems of these four categories.

Table 1. Categories of existing GeoWeb resource discovery approaches.

| | Open-Standard | Non-Open-Standard |
|---|---|---|
| Data portal/service | Web services, e.g., OGC web services and OSGeo Tile Map Service (TMS) | Data portals, e.g., National Aeronautics and Space Administration (NASA)'s data portal, National Climatic Data Center, and National Oceanic and Atmospheric Administration (NOAA)'s tides and currents |
| Catalog | OGC Catalogue Service for the Web (CSW) | SDIs, e.g., Data.gov and Global Earth Observation System of Systems (GEOSS) |

In terms of data portals and web services, one of the most critical problems is that users may not know every data portal and web service that contains the resources they need. For example, if a user wants to collect satellite images for land cover change research, to retrieve all the satellite images in the area of interest, the user needs to know every data portal and service that provides satellite images. Similarly, for the catalog approaches, users cannot be sure that they have looked for the data they want in every catalog service, as no user knows every catalog service. Another critical problem of the catalog approach is that data providers usually have no incentive to spend time and effort registering their data or services to one or more catalog services. As a result, no single catalog service can provide a complete GeoWeb index.

Besides the problems mentioned above, for the data portals/services and catalogs that do not follow open standards, there are also heterogeneity problems. For example, the user interfaces and procedures of each portal/service or catalog are largely different, as shown in Figure 2, so users need to learn the procedures to access data from different portals. Heterogeneities also exist in data models and communication protocols. Data provided by non-open-standard portals/services and catalogs use proprietary data models and protocols, so users need to spend extra time and effort customizing connectors to retrieve and understand the data.

In general, these problems of existing GeoWeb resource discovery approaches make geospatial data discovery inefficient and limit the growth of the GeoWeb. To address this issue, this study learns from the WWW resource discovery solutions, as the WWW also has widely-distributed resources. To be specific, we aim at building a GeoWeb search engine to provide users a one-stop service to discover geospatial data while avoiding the requirement of asking data providers to register resources. However, to achieve this GeoWeb search engine vision, one of the first steps is to collect a complete index of the GeoWeb. As general web search engines apply web crawlers to collect indexes of web pages, a GeoWeb search engine also requires a GeoWeb resource crawler. Hence, we review the existing GeoWeb resource crawler solutions in the next sub-section.



Figure 2. Examples of data portals/SDIs: (a) Global Earth Observation System of Systems (GEOSS); (b) Data.gov; (c) National Aeronautics and Space Administration (NASA)'s data portal; and (d) National Climatic Data Center.

2.2. Existing GeoWeb Resource Crawler Solutions

Early web crawlers can be traced back to 1993, including the World Wide Web Wanderer, JumpStation, World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider, which were proposed to collect information and statistics about the Web [18]. Nowadays, one of the main applications of web crawlers is exploring web pages and indexing those pages automatically for web search engines, such as Google and Yahoo!. A web crawler uses a list of URLs as seeds to initiate the crawling process. The crawler connects to every seed (i.e., web page), identifies all the hyperlinks in these seeds, and then links to other web pages via those hyperlinks. This process iterates until the crawler crawls all linked web pages on the WWW. With this web crawling approach, web search engines can provide a complete index of web pages to users.

Our research is based on the same web crawling concept, and aims at developing a GeoWeb resource crawler to discover geospatial resources on the Web, such as the OGC web services (OWSs), Keyhole Markup Language (KML) files, and Environmental Systems Research Institute, Inc. (ESRI) Shapefiles. According to [19], the first proposed OWS crawler was the Spatial Information Search Engine (SISE) developed by [20]. The NSF-funded PolarHub project (http://polar.geodacenter.org/polarhub2/) also developed an OWS crawler and focused on semantic analysis [21]. Bone et al. [22] built the Geospatial Search Engine (GSE) (http://www.wwetac.us/GSE/GSE.aspx), which provides the OGC Web Map Service (WMS), ArcIMS, ArcGIS Server, and Shapefile resources, and also discussed the functionalities of the search engine such as searching and ranking. Other works include [23], which created a WMS crawler to enhance the US Navy's geospatial information database; [24], which designed a module for finding OGC Web Feature Service (WFS), Web Coverage Service (WCS), and CSW; [25], which focused on efficient crawling for WMS; [26], which developed an OGC-focused crawler and emphasized the statistics of OWSs; [27], which crawled for WFS and focused on retrieving semantic annotations of data sources by ontology; and [28], which built a specialized search engine to discover ESRI Shapefiles.

These existing works developed GeoWeb crawlers for geospatial resource discovery and had different focuses, such as crawling efficiency, developing a GeoWeb search engine, semantic analysis, etc. While each of these existing GeoWeb resource crawling approaches mainly focused on a subset of GeoWeb resources, there is still a large portion of GeoWeb resources that has not been crawled, such as KML or non-open-standard resources. Therefore, in this study, we try to develop an extensible and scalable crawling framework to find various types of GeoWeb resources. To further explain and define the scope of this study, the next section introduces the GeoWeb resource distribution on the WWW.

3. GeoWeb Resource Distribution and Research Scope

To have a clearer definition of our search targets, we classify the WWW into four parts: (1) geospatial web services on the Surface Web; (2) geospatial data on the Surface Web; (3) geospatial web services on the Deep Web; and (4) geospatial data on the Deep Web. The concept is shown in Figure 3. Li [21] defines the "Surface Web" and the "Deep Web" on the WWW. The Surface Web is the Web that users can access through hyperlinks and that is visible to general web crawlers. However, some resources are hidden on the WWW, and their addresses cannot be reached through the Surface Web via hyperlinks. These resources are said to be in the Deep Web, such as datasets in web services or data portals/SDIs. In order to find resources in the Deep Web, we need to traverse through the services on the Surface Web by following their specific communication protocols.

Figure 3. The concept of the Surface Web and the Deep Web.

As aforementioned, to identify and access geospatial resources on the WWW, open standards are necessary since they regulate the communications between clients and servers. If providers share resources without following any open standard, there will be an interoperability problem: users cannot identify or access these resources in a general way. Therefore, with open standards, we can design different identification mechanisms to find resources following these standards, and further parse and index the resources.

While there are open geospatial web service standards such as the OSGeo TMS and the OGC WMS, WFS, and WCS, there are also standards for data formats such as the ESRI Shapefile, TIFF/GeoTIFF, GeoJSON, OGC Geography Markup Language (GML), and KML. On the other hand, as aforementioned, geospatial resources may be indexed in catalog services, such as the OGC CSW, which follows an open standard, or SDIs, which use proprietary communication protocols.

As web crawlers can discover geospatial resources with identification mechanisms designed according to standards, the resources that do not follow standards could also be discovered, but with extra cost to develop customized connectors. While there are various types of geospatial resources, a GeoWeb crawling framework should be extensible to identify different resource types. To prove the concept, the search targets in this study are shown in Table 2. For the data portals and services following open standards, we focused on the OGC WMS, WFS, WCS, Web Map Tile Service (WMTS), Web Processing Service (WPS), and Sensor Observation Service (SOS). These OWSs provide maps (e.g., satellite imagery), vector data (e.g., country boundaries), raster data (e.g., digital elevation models), map tiles (e.g., image tiles of an aerial photo), geospatial processing (e.g., buffering processes), and sensor observations (e.g., time series of temperature data), respectively. Users need to use specific operations defined in each OWS to retrieve geospatial data from services.
In terms of non-open-standard resources, we select the National Space Organization (NSPO) Archive Image Query System as the target, which provides search functionalities for FORMOSAT-2 satellite images. For catalogs, the OGC CSW and the US Data.gov are our open-standard and non-open-standard targets, respectively. In addition, this research also targets geospatial data formats, including KML files and ESRI Shapefiles. KML is an OGC standard as well; it is a file format for storing geographic features such as points, polylines, and polygons together with their presentation styles. KMZ, the compressed KML file, is also one of our search targets. ESRI Shapefile is a popular geospatial vector data format. Although this paper did not target every possible GeoWeb resource type, the proposed solution still provides a larger variety of resources than any single existing solution.

Table 2. GeoWeb resource searching targets.

| | Open-standard | Non-open-standard |
|---|---|---|
| Data portal/service | 1. OGC WMS; 2. OGC WFS; 3. OGC WCS; 4. OGC SOS; 5. OGC WMTS; 6. OGC WPS | NSPO Archive Image Query System |
| Catalog | OGC CSW | Data.gov |
| Files | OGC KML | ESRI Shapefile |

4. Methodology

In this section, we present the proposed web crawling framework, the GeoWeb Crawler, for discovering geospatial resources on the Web. In general, we considered four challenges in the GeoWeb Crawler. Firstly, as geospatial resources may not be directly accessible through hyperlinks, resources are sometimes presented in plain-text form on web pages. To discover these plain-text URLs, we use regular expression string matching instead of only following the HTML links. Secondly, as geospatial resources come in various types, specific identification mechanisms for each targeted resource type should be designed. Thirdly, crawling resources on the Web is a time-consuming process; in order to improve the efficiency of web crawling, the proposed framework is designed to be scalable following the distributed computing concept. Finally, while the web crawling process traverses through every URL on web pages, identical URLs could be found on different web pages. In this research, we apply the Bloom filter [29] to filter out redundant URLs. In the following sub-sections, we first introduce the overall workflow of the GeoWeb Crawler, and then present the solutions for the four challenges.

4.1. Workflow

As mentioned earlier, crawling seeds are required to initiate a web crawler. However, instead of selecting seeds manually and subjectively, we utilize the power of general web search engines; in this research, we use the Google search engine. As Google has already indexed the web pages on the Surface Web, we can directly retrieve relevant web pages by sending designed keywords to the search engine. In addition, as the Google search engine ranks search results based on their relevance to the requested keywords [30], Google provides us a list of candidate URLs that have a high possibility of containing the targeted geospatial resources. Therefore, by utilizing the indexing and ranking mechanisms of the Google search engine, we can find geospatial resources with fewer crawling requests.

As our goal is to discover geospatial resources that follow OGC service standards or specific data formats, we can design different search keywords, such as "OGC SOS", "web map service", "WFS", and "KML download", to enrich our crawling seeds. In addition, by utilizing the power of existing web search engines, the GeoWeb Crawler can easily extend its search targets by simply defining the identification mechanisms and search keywords. In this study, for each keyword search, we used the URLs in the top 10 Google search pages as our crawling seeds.

Figure 4 shows the high-level architecture of the GeoWeb Crawler. After retrieving the crawling seeds, the GeoWeb Crawler first sends HTTP GET requests to download HTML documents from the seed web pages. While many hyperlinks are presented in HTML anchor elements, the crawler extracts the URLs from the "href" attribute in the anchor elements. Then the crawler filters the URLs through some simple constraints, for instance, removing the URLs that end with ".doc", ".pdf", ".ppt", etc. However, URLs may also be presented as plain text on web pages; in order to discover these URLs, we use regular expression string matching (discussed in Section 4.2). Finally, each valid URL is checked against the targeted catalog services, which is the OGC CSW in this research. If the URL is a catalog service, the crawler will link to the catalog service to find geospatial resources and save the resources into the database. Otherwise, the crawler will examine whether the URL belongs to a targeted geospatial resource according to the identification mechanisms (discussed in Section 4.3). For those URLs that do not belong to any geospatial resource, the crawler will link to the URLs and download their HTML documents. The process is iterated until all the web pages within a crawling level are crawled.

In this study, we set the crawling level to 5, which means that the GeoWeb Crawler can find all the targeted resources within five link-hops from the seed pages. By "link-hop" we mean linking from one web page to another web page. This crawling level was set according to our preferred testing crawling time with our available machines. However, the crawler could find more resources with more link-hops, and the proposed solution certainly can be scaled out to crawl a larger coverage with more machines and Internet bandwidth. Since we only crawled five levels from the seeds, the crawler stops the crawling process when it reaches the fifth level. While the five link-hops were set only because of the hardware limitation, the flowchart delineates the complete crawling process, which should contain a loop to continuously crawl new web pages.

Figure 4. The architecture of the GeoWeb Crawler.

The above workflow is for crawling the resources following open standards. In terms of non-open-standard resources, the GeoWeb Crawler needs customized crawlers to a certain degree, which largely depends on each resource. For example, to crawl the resources from Data.gov, we directly set the crawling seeds by keyword-searching in Data.gov. For the NSPO Archive Image Query System (http://fsimage.narlabs.org.tw/AIQS/imageSearchE), since the system was designed to be used by humans and is not crawler-friendly, we used the Application Programming Interface (API) of the data service behind the system to create a customized connector and retrieve FORMOSAT-2 satellite image metadata periodically.
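To make the open-standard workflow concrete, the following is a minimal single-process sketch of the crawling loop of this sub-section. The function and variable names are ours, and `is_geoweb_resource()` stands in for the identification mechanisms of Section 4.3; the actual framework distributes this loop across machines (Section 4.4) and deduplicates URLs with a Bloom filter (Section 4.5).

```python
import requests
from html.parser import HTMLParser
from urllib.parse import urljoin

SKIP_SUFFIXES = (".doc", ".pdf", ".ppt")  # simple URL filtering constraints
MAX_LEVEL = 5                             # five link-hops from the seed pages

class HrefParser(HTMLParser):
    """Collects the href attribute of every HTML anchor element."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

def is_geoweb_resource(url):
    """Placeholder for the identification mechanisms of Section 4.3."""
    return False

def crawl(seeds):
    frontier = [(url, 0) for url in seeds]  # (URL, link-hop level)
    visited, resources = set(), []          # the real system uses a Bloom filter
    while frontier:
        url, level = frontier.pop(0)
        if url in visited or url.lower().endswith(SKIP_SUFFIXES):
            continue
        visited.add(url)
        if is_geoweb_resource(url):
            resources.append(url)           # saved to the database in practice
            continue
        if level >= MAX_LEVEL:
            continue                        # stop expanding beyond five hops
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        parser = HrefParser()
        parser.feed(html)
        # hyperlinks from anchor elements; plain-text URLs are appended in the
        # same way using the regular expression of Section 4.2
        frontier.extend((urljoin(url, href), level + 1) for href in parser.hrefs)
    return resources
```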

4.2. Discovering Plain-Text URLs with Regular Expressions

As mentioned earlier, some URLs are not shown in the HTML hyperlink form but as plain text, especially for the geospatial resources that can only be accessed via specific communication protocols, such as the OWSs. Therefore, we need to parse the plain text in web pages to retrieve potential URLs. In order to discover any potential URLs, we apply regular expressions [31]. Since we want to find plain-text URLs on web pages, we design the following regular expression pattern, "\b(https?)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", which is able to discover URLs starting with "http" or "https" in an HTML document.
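As a minimal demonstration, the pattern can be applied in Python as follows. We use a non-capturing group `(?:https?)` instead of the capturing group shown above so that `findall()` returns whole URLs rather than only the scheme; the sample HTML string is our own.

```python
import re

# Our Python rendering of the plain-text URL pattern from Section 4.2.
URL_PATTERN = re.compile(
    r"\b(?:https?)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
)

html = ('See <a href="http://a.example/page">map</a> or the plain-text '
        'endpoint http://b.example/geoserver/wms?service=WMS for data.')
print(URL_PATTERN.findall(html))
# ['http://a.example/page', 'http://b.example/geoserver/wms?service=WMS']
```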

4.3. Identification Mechanisms for Different Geospatial Resources

With standards, data or services can be interoperable with anyone following the same standard. While geospatial resources have a large variety in terms of data models, communication protocols, and file formats, we need to design different identification mechanisms according to different standards. According to the search targets specified in Table 2, here we discuss the proposed identification mechanisms.

While OGC has defined many open standards for geospatial resources, we choose some of them as the search targets. We first discuss the OGC web services. Most of the OWSs support a mandatory operation called "GetCapabilities" for advertising the metadata and data content of these services. This design provides us an easy way to identify OWSs: by sending a GetCapabilities request to every candidate URL, if the URL returns a valid response, the URL belongs to an OWS. To send a GetCapabilities request, HTTP GET with the key-value pair (KVP) encoding is used during the identification. There are some common parameters in the GetCapabilities operation that are useful for identifying resources, including "service", which identifies the type of OWS, and "request", which specifies the GetCapabilities operation. An example of a GetCapabilities request for a WMS is: http://hostname:port/path?service=WMS&request=GetCapabilities. To identify other types of OWS, we can simply replace the value of the "service" parameter. In general, for every candidate URL, the GeoWeb Crawler appends the necessary query options to send the GetCapabilities request. If the URL is an OWS, the crawler will receive a valid Capabilities document from the service. Therefore, we need to design rules to further identify whether a response is a valid Capabilities document for the targeted resource. While Capabilities documents are in the XML format, the document shall contain a specific element name (e.g., "Capabilities") if the URL is an OWS. However, different versions of different types of OWSs may have different element names. In this case, we need to define rules to verify the Capabilities documents for different OWSs and different versions, as listed below (a verification sketch in code follows the list):

- OGC WMS: According to the WMS specification [32,33], the Capabilities document shall contain a root element named "WMT_MS_Capabilities" or "WMS_Capabilities" for the versions 1.0.0 and 1.3.0 of WMS, respectively.
- OGC SOS: According to the SOS specification [34,35], the Capabilities document shall contain a root element named "Capabilities" for the versions 1.0.0 and 2.0 of SOS.
- OGC WFS: According to the WFS specification [36,37], the Capabilities document shall contain a root element named "WFS_Capabilities" for the versions 1.0.0 and 2.0 of WFS.
- OGC WCS: According to the WCS specification [38–40], the Capabilities document shall contain a root element named "WCS_Capabilities" for the version 1.0.0 and "Capabilities" for the versions 1.1 and 2.0 of WCS.

- OGC WMTS: According to the WMTS specification [41], the Capabilities document shall contain a root element named "Capabilities" for the version 1.0.0 of WMTS.
- OGC WPS: According to the WPS specification [42], the Capabilities document shall contain a root element named "Capabilities" for the versions 1.0.0 and 2.0 of WPS.
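The rules above reduce to a lookup table of expected root element names. The following sketch, with names of our own choosing, appends the KVP GetCapabilities query to a candidate URL and checks the local name of the XML root; the real framework would additionally handle version negotiation and other service-specific details.

```python
import requests
import xml.etree.ElementTree as ET

# Expected Capabilities root element names per service type (from the list above).
EXPECTED_ROOTS = {
    "WMS":  {"WMT_MS_Capabilities", "WMS_Capabilities"},
    "SOS":  {"Capabilities"},
    "WFS":  {"WFS_Capabilities"},
    "WCS":  {"WCS_Capabilities", "Capabilities"},
    "WMTS": {"Capabilities"},
    "WPS":  {"Capabilities"},
}

def identify_ows(url, service):
    """Returns True if the URL answers GetCapabilities like a valid OWS."""
    sep = "&" if "?" in url else "?"
    try:
        resp = requests.get(
            f"{url}{sep}service={service}&request=GetCapabilities", timeout=10)
        root = ET.fromstring(resp.content)
    except (requests.RequestException, ET.ParseError):
        return False
    local_name = root.tag.rsplit("}", 1)[-1]  # strip the XML namespace
    return local_name in EXPECTED_ROOTS.get(service, set())

# usage: identify_ows("http://hostname:port/path", "WMS")
```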

In addition, regarding geospatial data files such as KML and ESRI Shapefile, since these files usually exist as download links on web pages, the identification rules are different from the rules for web services:

- OGC KML: Since KML is a data format, the file extensions would be ".kml" or ".kmz" (i.e., a zipped .kml file). So if a URL string ends with ".kml" or ".kmz", we regard it as a KML file.
- ESRI Shapefile: According to the ESRI Shapefile standard [43], a single Shapefile should at least contain a main file (i.e., ".shp"), an index file (i.e., ".shx"), and a database file (i.e., ".dbf"). Therefore, Shapefiles are often compressed as ZIP files when shared on web pages. The only way to identify whether a ZIP file contains a Shapefile is to download and extract the compressed file; if the extracted files/folders contain ".shp" files, the URL is regarded as a Shapefile resource. Due to our hardware limitation, we only download and examine ZIP files smaller than 10 megabytes to prove the concept. A sketch of both file checks follows this list.
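Below is a minimal sketch of both file checks, assuming the 10-megabyte cap mentioned above. Reading Content-Length with a HEAD request before downloading is our addition; servers that omit the header are skipped conservatively.

```python
import io
import zipfile
import requests

MAX_ZIP_BYTES = 10 * 1024 * 1024  # only examine ZIP files up to 10 MB

def is_kml(url):
    """KML/KMZ resources are recognized from the URL suffix alone."""
    return url.lower().endswith((".kml", ".kmz"))

def zip_contains_shapefile(url):
    """Downloads a small ZIP file and looks for a .shp member inside it."""
    try:
        head = requests.head(url, timeout=10, allow_redirects=True)
        size = int(head.headers.get("Content-Length", 0))
        if size == 0 or size > MAX_ZIP_BYTES:
            return False
        data = requests.get(url, timeout=30).content
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return any(n.lower().endswith(".shp") for n in zf.namelist())
    except (requests.RequestException, zipfile.BadZipFile, ValueError):
        return False
```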

Furthermore, for catalog services, as the OGC CSW is also an OWS, a CSW can be identified by sending a GetCapabilities request.

- OGC CSW: According to the CSW specification [44], the XML document shall contain a root element named "Capabilities" for the versions 1.0.0 and 2.0 of CSW.

As a CSW serves as a catalog service hosting the indexes of geospatial resources, we can only retrieve the resources by following the communication protocols of CSW. In this case, the GeoWeb Crawler uses the "GetRecords" operation to retrieve the resources hidden in the catalog service (sketched below). Besides the CSW, the proposed GeoWeb Crawler can also crawl the geospatial resources in non-open-standard catalogs. However, as most of the non-open-standard catalogs are not crawler-friendly, Data.gov is the one that is crawlable. We assign keywords of different resources to search in Data.gov, and then obtain a list of candidate web pages as the crawling seeds to initiate the crawling process. Finally, the resource crawling of the NSPO Archive Image Query System (http://fsimage.narlabs.org.tw/AIQS/imageSearchE) is different from other sources. As the system is not crawler-friendly, we construct a customized connector according to its backend data service interface to periodically retrieve metadata of new FORMOSAT-2 satellite images. In general, by linking to the Deep Web, we can discover the hidden resources and provide a more comprehensive geospatial resource index.
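For illustration, a paged KVP GetRecords request against a discovered CSW endpoint could be issued as sketched below. The parameters follow common CSW 2.0.2 KVP conventions; the paging logic, page size, and namespace-agnostic record matching are our assumptions.

```python
import requests
import xml.etree.ElementTree as ET

def harvest_csw(endpoint, page_size=50):
    """Pages through GetRecords responses and collects the record elements."""
    start, records = 1, []
    while True:
        resp = requests.get(endpoint, params={
            "service": "CSW", "version": "2.0.2", "request": "GetRecords",
            "typeNames": "csw:Record", "elementSetName": "full",
            "resultType": "results",
            "startPosition": start, "maxRecords": page_size,
        }, timeout=30)
        root = ET.fromstring(resp.content)
        # match csw:Record/BriefRecord/SummaryRecord regardless of namespace
        batch = [el for el in root.iter() if el.tag.endswith("Record")]
        records.extend(batch)
        if len(batch) < page_size:   # last page reached
            return records
        start += page_size
```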

4.4. Distributed Crawling Process for Scalability

To improve the performance of the GeoWeb Crawler, we apply the MapReduce concept [17] to execute the crawling process in parallel. The MapReduce concept was proposed by Google to address scalability issues such as calculating indexes in the Google search engine. In general, MapReduce is a processing framework containing two phases, i.e., the map phase and the reduce phase. In the map phase, input data are split into several independent parts; all the parts can be processed in parallel and produce partial results. In the reduce phase, all the partial results produced in the map phase are merged as the final output. MapReduce processes can be executed consecutively until all the tasks are finished.

In general, we design the GeoWeb Crawler with the MapReduce concept. As shown in Figure 5, one machine is assigned as the master and connects to a number of worker machines. Each worker sends a request to the master to retrieve candidate URLs as tasks. Workers then identify whether those URLs belong to geospatial resources with the rules presented in Section 4.3. If a URL is not a geospatial resource, the worker will retrieve and parse the HTML file for new URLs. Identified geospatial resources and new URLs are then returned to the master. The master stores geospatial resources in a database and assigns new tasks (i.e., URLs) to workers. As the crawling processes are independent, the workers can run in parallel. With the designed processing workflow, the GeoWeb Crawler can be scaled horizontally to enhance the performance.

Figure 5. Architecture of distributed computing implementation.
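The master/worker split of Figure 5 can be emulated at small scale with a shared task queue. The sketch below uses Python's multiprocessing in place of separate machines, and `process_url()` is a placeholder for the identification and parsing steps; the actual framework communicates over the network, stores resources in a database, and keeps assigning newly found URLs as tasks.

```python
from multiprocessing import Process, Queue

def process_url(url):
    """Placeholder: identify the URL (Section 4.3) or fetch and parse it."""
    return [], []  # (geospatial resources found, newly discovered URLs)

def worker(tasks, results):
    """Each worker repeatedly pulls a URL task and returns its results."""
    while True:
        url = tasks.get()
        if url is None:              # sentinel from the master: shut down
            break
        results.put(process_url(url))

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for w in workers:
        w.start()
    for seed in ["http://example.org/a", "http://example.org/b"]:
        tasks.put(seed)
    for _ in workers:                # one sentinel per worker
        tasks.put(None)
    for w in workers:
        w.join()
```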

4.5. Filtering Redundant URLs

During the crawling process, a lot of redundant URLs are found. For example, by using "OGC SOS" as the search keyword, the GeoWeb Crawler crawled 9,911,218 URLs in total within five link-hops from the seed pages. However, 7,813,921 of them are redundant; in other words, only 2,097,297 URLs are unique. This huge amount of redundancy would seriously affect the crawling performance, as the same URLs could be crawled repeatedly. However, recording all the URLs to determine whether they had been crawled would be inefficient.

Therefore, in order to deal with this problem, we apply the Bloom filter [29] to identify redundant URLs. The concept of the Bloom filter is to transform strings into a binary array, which keeps the storage small. As shown in Figure 6, we first create a BitSet of size m and set all bits to 0. Then we choose k different hash functions for managing the n URL strings. A hash function is any function that can be used to map digital data of arbitrary size to digital data of fixed size, called the hash value. In this case, we derive k integers, ranging from 0 to m − 1, from the k hash functions.

In this study, we transform all characters in a URL string into ASCII codes (integers), and our hash functions then calculate hash values based on these ASCII codes. Hence, each URL string corresponds to a set of bits determined by the hash values, and these bits are set from 0 to 1, as shown in Figure 6. According to the BitSet, we can determine whether two URLs are the same or not. If any bit corresponding to a URL is not set to 1, we can conclude that the URL has not been crawled. Otherwise, the URL is regarded as redundant.

However, false positive matches are possible, while false negatives are not. Since the size of the BitSet is limited, the more elements that are added to the set, the larger the probability of false positives. There is a tradeoff between storage and matching accuracy. Thus, we need to set appropriate values of m and k according to n. According to [45], assuming that kn < m, when all the n elements are mapped to the m-bit BitSet by the k hash functions, the probability that one bit is still 0 (i.e., p′) and the probability of a false positive (i.e., f′) are shown in Equations (1) and (2).

$$p' = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \qquad (1)$$

$$f' = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - p'\right)^{k} \qquad (2)$$

For example, if we set 10 hash functions and the BitSet size is 20 times larger than the number of URLs, the probability of false positives would be 0.00889%. In other words, about nine errors would happen when handling 100,000 URLs. Even so, such matching quality is still acceptable in this study. In general, with the Bloom filter, we can remove redundant URLs with a smaller storage footprint.
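A compact Bloom filter along these lines is sketched below. An index-salted SHA-256 over the URL bytes stands in for the paper's k hash functions over ASCII codes, and the final lines reproduce the false-positive estimate quoted above.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter over URL strings, following Section 4.5."""
    def __init__(self, n_expected, k=10, ratio=20):
        self.m = n_expected * ratio          # BitSet 20x the number of URLs
        self.k = k                           # number of hash functions
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, url):
        # k hash values derived from the bytes of the URL string
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

# Reproducing the estimate in the text: k = 10 hash functions, m = 20n.
k, ratio = 10, 20
p = math.exp(-k / ratio)             # p' = e^(-kn/m)
print(f"{(1 - p) ** k:.5%}")         # -> 0.00889%, i.e., ~9 per 100,000 URLs
```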

Figure 6. The concept of the Bloom filter.

5. Results and Discussion

In this section, we present the crawling results and evaluations from different perspectives. Firstly, Section 5.1 shows the latest number of resources discovered by the GeoWeb Crawler. Secondly, we compare the numbers of geospatial resources found by the GeoWeb Crawler and by other existing approaches. Thirdly, to better understand the crawling efficiency, we analyze the distribution of discovered resources based on different crawling levels (i.e., link-hops). Finally, since we apply the distributed computing concept to execute the crawling process in parallel, we evaluate the crawling performance using different numbers of machines.

5.1. Geospatial Resources Crawling Results

In this paper, we design and implement a GeoWeb crawling framework, the GeoWeb Crawler, to collect geospatial resources. While the GeoWeb Crawler is designed to be extensible, this paper targeted various resource types, and the crawling result is shown in Table 3. In total, we discovered 7351 geospatial services and 194,003 datasets within five hyperlink-hops from the seeds. Among these, 1088 services and 77,874 datasets are derived from Data.gov and the NSPO service, which represent the non-open-standard data services and catalogs in this study.

As web services may have a large amount of geospatial data, we define the idea of a dataset for each service type and show the number of datasets in Table 3. To be specific, for OWSs, we parse the Capabilities document and count the number of datasets.
Among theseBasically, datasets, we 1088 define service that and a map 77,874 image datasets is represented are derived as from the “ data.govLayer” in and WMS, NSPO a vector service, dataset which is representthe “FeatureType the non-open-standard” in WFS, a raster data dataset services is the and “ catalogsCoverageSummary in this study.” or “CoverageOfferingBrief” in WCS,As web a map services tile image may is have the “ aTileMatrix large amount” in ofWMTS, geospatial geospatial data, weprocessing define the is the idea “Process of the dataset” in WPS, for eachand sensor service observation type and show dataset the is number the “ObservationOffering of datasets in Table3”. in To SOS. be specific, Although for these OWSs, definitions we parse may the Capabilitiesnot be most appropriate document and (e.g., count “ObservationOffering the number of datasets.” as the number of datasets which actually consists of a Basically, group of we observations), define that a extracting map image the is datasets represented information as the “Layer” from thein WMS, Capabilities a vector documents dataset is theprevents “FeatureType” us from retrieving in WFS, a rasteractual datasetdata from is the services. “CoverageSummary” Since WMS is a or well “CoverageOfferingBrief”-known service in OGC, in WCS,both the a map number tile image of discovered is the “TileMatrix” WMS and the in WMTS, number geospatial of datasets processing are the highest is the “Process”among the in OWSs. WPS, andRegarding sensor the observation average crawling dataset is time, the “ObservationOffering” the time mainly depends in SOS.on the Although number theseof crawled definitions URLs mayand not be most appropriate (e.g., “ObservationOffering” as the number of datasets which actually consists of a group of observations), extracting the datasets information from the Capabilities documents prevents us from retrieving actual data from services. Since WMS is a well-known service in OGC, both the number of discovered WMS and the number of datasets are the highest among the OWSs. Regarding the average crawling time, the time mainly depends on the number of crawled URLs and ISPRS Int. J. Geo-Inf. 2016, 5, 136 13 of 21 the Internet connection between client and server. As the identifications for files are relatively simple, the crawling time for KML and Shapefile are much lower than that for GeoWeb services.

Table 3. GeoWeb crawling result.

Type              Number of Services   Number of Datasets   Definition of Dataset                        Avg Crawling Time (hr/Keyword)
WMS               4315                 45,922               Layer                                        39.1
SOS               1227                 14,875               Observation Offering                         32.3
WFS               663                  18,930               Feature Type                                 15.5
WCS               639                  12,379               Coverage Summary/Coverage Offering Brief     32.5
WMTS              365                  10,060               Tile Matrix                                  21.4
CSW               117                  1121                 Record                                       39.2
WPS               25                   2771                 Process                                      14.3
KML               -                    4002                 kml/kmz file                                 5.7
Shapefile         -                    7878                 shp file                                     2.7
Satellite Image   -                    76,065               Image index                                  -

5.2. Comparison between GeoWeb Crawler and Existing Approaches

There are some existing catalog services that provide geospatial resources. As mentioned in the Introduction, data providers need to register their resources to a catalog service to share them. However, there is no incentive for providers to register their resources into every catalog. As a result, different catalogs provide different resources, and none of them is comprehensive. To address this issue, we use the concept of the web crawler to discover various geospatial resources. Moreover, data providers do not need to register resources as long as they publish resources following standards and share them on the Surface Web.

Figure 7. Number of datasets comparison between GeoWeb Crawler and existing approaches.

In general, Figure 7 shows the comparison of dataset numbers between the GeoWeb Crawler and existing solutions, where the 76,065 satellite images found in the NSPO service are not included in the figure, as none of the compared approaches provides this resource. As expected, both the number of datasets and the variety of resources that the GeoWeb Crawler provides are much higher than those of the other approaches. One thing to note is that the dataset counts of the other compared solutions could be over-estimated. For example, when searching GEOSS with the keyword "WMS", we directly use the total number of results even though not every candidate is actually a WMS, as shown in Figure 8.


Figure 8. An example of searching for WMS in GEOSS.

In addition, there are existing approaches crawling OWS, such as PolarHub [21] in 2014, Sample et al. [23] in 2006, Li and Yang [25] in 2010, and Lopez-Pellicer et al. [26] in 2010. In Figure 9, we compare the numbers of OWS between these works and the GeoWeb Crawler. We can see that the GeoWeb Crawler can collect more types of OWS resources than the other works did. With the large-scale crawling framework, the GeoWeb Crawler can discover significantly more OWSs than the other approaches. Overall, the GeoWeb Crawler collects 3.8 to 47.5 times more resources (including the satellite images) than any existing geospatial resource discovery solution.

Figure 9. Number of OWSs comparison between GeoWeb Crawler and existing approaches.

5.3. Crawling Level Distribution of Discovered Resources

Since the proposed web crawler cannot crawl the whole WWW because of limited hardware resources, we design the GeoWeb Crawler to build on the Google search engine. With Google's indexing and ranking mechanisms, keyword searches on the Google search engine can provide us with candidates that have a high possibility of containing geospatial resources. In this case, the GeoWeb Crawler can find geospatial resources within fewer link-hops. Currently, we set the crawling level to 5; that is, the crawling scope is all the URLs within five link-hops from the Google search result pages. In this study, we adopted the breadth-first visiting strategy to crawl every complete level one by one.
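A minimal sketch of this level-bounded breadth-first crawl is given below; the `fetch_links` callback and the plain Python set (standing in for the Bloom filter described earlier) are our simplifications:

```python
from collections import deque
from urllib.parse import urljoin

def bfs_crawl(seeds, fetch_links, max_level=5):
    """Visit every URL of level i before any URL of level i + 1.
    `fetch_links(url)` must return the hyperlinks found on that page."""
    seen = set()                          # stands in for the Bloom filter
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, level = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        yield url, level                  # hand off to resource identifiers
        if level < max_level:             # crawling scope: five link-hops
            for link in fetch_links(url):
                queue.append((urljoin(url, link), level + 1))
```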


To understand the distribution of geospatial resources, we display the numbers of discovered resources (services and files) at each crawling level in Figure 10. The figure indicates that, for most of the resources, more resources can be found at larger crawling levels. However, in Figure 11, we show the discovery ratio of each resource at each crawling level, which is the number of discovered resources divided by the total number of URLs in that crawling level. As the discovery ratio represents the necessary effort of crawling, we can see that the discovery ratios decrease along the crawling levels because of the rapid growth of URL numbers.

Figure 10. The number of discovered resources in different crawling levels.

Figure 11. The resources discovery ratio in different crawling levels.

In general, these two figures first indicate that the Google search engine provides fairly good seeds, so that we can discover resources within small crawling levels. The more levels we crawl, the more resources we can discover. However, the discovery ratios also decrease significantly at larger crawling levels, which indicates that more unnecessary effort is spent. Furthermore, regarding the irregular trend when crawling Shapefiles, we believe the reason is that, as a proof of concept, we only examine ZIP files smaller than 10 megabytes, which unavoidably affects the trend. The peak of KML at level 2 is caused by a single website that provides a tremendous number of KMZ files. Basically, because the number of URLs grows rapidly with the crawling level, the number of discovered resources increases correspondingly.




5.4. Crawling Latency Comparison between Standalone and Parallel Processing

In this study, we apply the distributed computing concept to improve the web crawler performance and make the crawling framework scalable. By scalable, we mean that the crawling performance can be improved by simply increasing the number of computers. As the main bottleneck is not the computation resources but the Internet connection, we argue that the machine specification and machine heterogeneity do not have a large influence on the crawling performance.

To evaluate the performance of the parallel processing, Figure 12 shows the crawling latencies of using a single machine (i.e., standalone), four machines, and eight machines as workers in the proposed crawling framework. We chose to crawl KML files in this evaluation because the identification procedure is the easiest, which shortens the latency; crawling other resources would take much more time (e.g., more than one week) in the standalone setting. As shown in Figure 12, even for crawling KML files, the crawling process still requires about one and a half days in the standalone setting. However, with an increasing number of workers, the crawling latency can be significantly reduced. This evaluation indicates that the crawling performance is scalable and that the framework could be applied to crawl a larger coverage of the Web to build a comprehensive GeoWeb resource index.
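The following sketch suggests why adding workers helps: candidate URLs are identified concurrently, so connection latency overlaps instead of accumulating. The `identify_kml` check is a deliberately simplified, hypothetical identifier, and the actual framework spreads workers across machines rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def identify_kml(url: str) -> bool:
    """Hypothetical identifier: accept URLs that look like KML/KMZ
    files and are actually reachable."""
    if not url.lower().endswith((".kml", ".kmz")):
        return False
    try:
        with urllib.request.urlopen(url, timeout=10):
            return True
    except OSError:
        return False

def identify_in_parallel(urls, n_workers=8):
    # The bottleneck is network I/O, not CPU, so extra workers mostly
    # hide connection latency, which matches the observed speed-up.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        flags = pool.map(identify_kml, urls)
        return [u for u, ok in zip(urls, flags) if ok]
```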

Figure 12. Performance comparison between standalone and parallel processing.

6. GeoHub—A GeoWeb Search Engine Prototype

As the WWW has Web search engines addressing web resource discovery issues, we envision that the GeoWeb also requires GeoWeb search engines for users to efficiently find GeoWeb resources. In this section, we demonstrate the prototype of the proposed GeoWeb search engine, GeoHub, which currently allows users to search for OWSs. With the GeoHub user interface, as shown in Figure 13, users can use the keyword search to find the OWSs of their interests. In order to achieve keyword search, the GeoWeb search engine needs a database to manage the metadata that are used as the information for the query operation.


Figure 13. GeoHub client interface.

Figure 14. An example of a WMS capabilities document.

After identifying GeoWeb resources, we store the necessary information into a database. While a search engine could use various information to rank search results, this prototype simply matches user queries with the following information: the resource URL, resource type, keyword used in the Google search engine, content (e.g., Capabilities document), and the source web page. Capabilities documents provide metadata about the OWS; an example of a WMS capabilities document is shown in Figure 14. Currently, we also extract specific information from the Capabilities documents, such as the service name, title, abstract, keywords, and content metadata, and then store them into a database. Hence, whenever users' search keywords exist in the extracted information, the corresponding services will be returned to users (Figure 15).
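A minimal sketch of such a metadata store and keyword match is shown below; the SQLite schema, column names, and sample record are our own illustrative choices rather than GeoHub's actual design:

```python
import sqlite3

conn = sqlite3.connect("geohub.db")
conn.execute("""CREATE TABLE IF NOT EXISTS services (
    url TEXT PRIMARY KEY,        -- resource URL
    type TEXT,                   -- e.g., WMS, WFS, SOS
    search_keyword TEXT,         -- keyword used in the Google search
    title TEXT,                  -- extracted from the Capabilities document
    abstract TEXT,
    keywords TEXT,
    source_page TEXT             -- web page where the URL was found
)""")

conn.execute(
    "INSERT OR REPLACE INTO services VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("http://example.org/wms", "WMS", "WMS", "Example Weather WMS",
     "Demo map service", "weather;temperature", "http://example.org/data.html"))

def search(term: str):
    """Return services whose extracted metadata mentions the term."""
    like = f"%{term}%"
    return conn.execute(
        """SELECT url, type, title FROM services
           WHERE title LIKE ? OR abstract LIKE ? OR keywords LIKE ?""",
        (like, like, like)).fetchall()

print(search("weather"))  # [('http://example.org/wms', 'WMS', 'Example Weather WMS')]
```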



Figure 15. An example of a GeoHub search result.

7. Conclusions and Future Work

By using the Web as a platform, the GeoWeb is continuously collecting a tremendous amount of geospatial data. This big geospatial data phenomenon causes difficulties when searching for specific data within such a large volume of data with frequent updates and large variety. In order to address this geospatial data discovery issue, our ultimate goal is to build a GeoWeb search engine similar to the existing web search engines. To realize such a GeoWeb search engine, the first step is to discover and index every geospatial data resource on the Web. Thus, in this paper, we design and develop a GeoWeb crawling framework called GeoWeb Crawler. According to different geospatial resource standards, the GeoWeb Crawler can apply different identification mechanisms to discover resources on the Surface Web. For resources on the Deep Web, such as OGC CSW and portals/SDIs, the GeoWeb Crawler can support connectors to retrieve the resources. In addition, to improve the performance of web crawlers, we apply the Bloom filter to remove redundant URLs and the distributed computing concept to execute the crawling process in parallel.

According to the experiment results, our proposed approach is able to discover various types of online geospatial resources. Compared with existing solutions, the GeoWeb Crawler provides more comprehensive resources in terms of both variety and amount. Furthermore, instead of requiring data providers to register their data, the GeoWeb Crawler can discover resources proactively as long as providers follow standards and share resource links on the Web.
To sum up, as WWW search engines significantly improve the efficiency of web resource discovery, the proposed crawling framework has the potential to become the base of a GeoWeb search engine. GeoWeb search engines could be the next generation of geospatial data discovery framework that can promote the visibility and usage of geospatial resources to various fields and achieve comprehensive data science analysis.


In terms of other future directions, to provide a complete GeoWeb index, we will extend the crawlers to identify other geospatial resource types, such as GeoTIFF and GeoJSON. We will also seek opportunities to expand our infrastructure to crawl the deeper and wider Web.

Acknowledgments: We sincerely thank the Ministry of Science and Technology in Taiwan for supporting and funding this research.

Author Contributions: Chih-Yuan Huang conceived the research direction; Chih-Yuan Huang and Hao Chang designed the system; Hao Chang developed and evaluated the system; Chih-Yuan Huang and Hao Chang analyzed the results and wrote the paper.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Densham, P.J. Spatial decision support systems. Geogr. Inf. Syst. Princ. Appl. 1991, 1, 403–412.
2. Crossland, M.D.; Wynne, B.E.; Perkins, W.C. Spatial decision support systems: An overview of technology and a test of efficacy. Decis. Support Syst. 1995, 14, 219–235.
3. Burrough, P.A. Principles of geographical information systems for land resources assessment. Geocarto Int. 1986, 1. [CrossRef]
4. Rao, M.; Fan, G.; Thomas, J.; Cherian, G.; Chudiwale, V.; Awawdeh, M. A web-based GIS decision support system for managing and planning USDA's Conservation Reserve Program (CRP). Environ. Model. Softw. 2007, 22, 1270–1280. [CrossRef]
5. Haklay, M.; Singleton, A.; Parker, C. Web mapping 2.0: The neogeography of the GeoWeb. Geogr. Compass 2008, 2, 2011–2039. [CrossRef]
6. Lake, R.; Farley, J. Infrastructure for the geospatial web. In The Geospatial Web; Springer: London, UK, 2009; pp. 15–26.
7. Taylor, B. The World is Your JavaScript-Enabled Oyster. Official Google Blog. Available online: https://googleblog.blogspot.com/2005/06/world-is-your-javascript-enabled_29.html (accessed on 29 June 2005).
8. Kuhn, W. Introduction to Spatial Data Infrastructures. Presentation held on March 2005. Available online: https://collaboration.worldbank.org/docs/DOC-3031 (accessed on 1 August 2016).
9. Laney, D. 3D Data Management: Controlling Data Volume, Velocity and Variety; META Group: Terni, Italy, 6 February 2001.
10. Liang, S.H.; Huang, C.-Y. GeoCENS: A geospatial cyberinfrastructure for the world-wide sensor web. Sensors 2013, 13, 13402–13424. [CrossRef] [PubMed]
11. Botts, M.; Percivall, G.; Reed, C.; Davidson, J. OGC® sensor web enablement: Overview and high level architecture. In Geosensor Networks; Springer: Berlin, Germany, 2006; pp. 175–190.
12. Liang, S.; Chen, S.; Huang, C.; Li, R.; Chang, Y.; Badger, J.; Rezel, R. Capturing the long tail of the sensor web. In Proceedings of the International Workshop on the Role of Volunteered Geographic Information in Advancing Science, Zurich, Switzerland, 14 September 2010.
13. Anderson, C. The Long Tail: Why the Future of Business is Selling Less of More; Hyperion: New York, NY, USA, 2006.
14. Morais, C.D. Where is the Phrase "80% of Data is Geographic" from? Available online: https://www.gislounge.com/80-percent-data-is-geographic/ (accessed on 28 December 2014).
15. Sullivan, D. Major Search Engines and Directories. Available online: http://www.leepublicschools.net/Technology/Search-Engines_Directories.pdf (accessed on 1 August 2016).
16. Hicks, C.; Scheffer, M.; Ngu, A.H.; Sheng, Q.Z. Discovery and cataloging of deep web sources. In Proceedings of the 2012 IEEE 13th International Conference on Information Reuse and Integration (IRI), Las Vegas, NV, USA, 8–10 August 2012; pp. 224–230.
17. Dean, J.; Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 2008, 51, 107–113. [CrossRef]
18. Mirtaheri, S.M.; Dinçtürk, M.E.; Hooshmand, S.; Bochmann, G.V.; Jourdan, G.-V.; Onut, I.V. A brief history of web crawlers. In Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research, Riverton, NJ, USA, 18 November 2013; pp. 40–54.

19. Lopez-Pellicer, F.J.; Rentería-Agualimpia, W.; Nogueras-Iso, J.; Zarazaga-Soria, F.J.; Muro-Medrano, P.R. Towards an active directory of geospatial web services. In Bridging the Geographic Information Sciences; Springer: Berlin, Germany, 2012; pp. 63–79.
20. Bai, Y.; Yang, C.; Guo, L.; Cheng, Q. OpenGIS WMS-based prototype system of spatial information search engine. In Proceedings of the 2003 IEEE International Geoscience and Remote Sensing Symposium (IGARSS'03), Toulouse, France, 21–25 July 2003; pp. 3558–3560.
21. Li, W. PolarHub: A global hub for polar data discovery. In Proceedings of the AGU Fall Meeting, San Francisco, CA, USA, 3–7 December 2014.
22. Bone, C.; Ager, A.; Bunzel, K.; Tierney, L. A geospatial search engine for discovering multi-format geospatial data across the web. Int. J. Digit. Earth 2014, 9, 1–16. [CrossRef]
23. Sample, J.T.; Ladner, R.; Shulman, L.; Ioup, E.; Petry, F.; Warner, E.; Shaw, K.B.; McCreedy, F.P. Enhancing the US Navy's GIDB portal with web services. IEEE Internet Comput. 2006, 10, 53–60. [CrossRef]
24. Chen, N.; Di, L.; Yu, G.; Chen, Z. Geospatial Sensor Web Data Discovery and Retrieval Service Based on Middleware. 2008. Available online: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.184.4226 (accessed on 1 August 2016).
25. Li, W.; Yang, C.; Yang, C. An active crawler for discovering geospatial web services and their distribution pattern—A case study of OGC web map service. Int. J. Geogr. Inf. Sci. 2010, 24, 1127–1147. [CrossRef]
26. Lopez-Pellicer, F.J.; Béjar, R.; Florczyk, A.J.; Muro-Medrano, P.R.; Zarazaga-Soria, F.J. State of play of OGC web services across the web. In Proceedings of the INSPIRE Conference, Krakow, Poland, 23–25 June 2010.
27. Patil, S.; Bhattacharjee, S.; Ghosh, S.K. A spatial web crawler for discovering geo-servers and semantic referencing with spatial features. In Proceedings of the International Conference on Distributed Computing and Internet Technology, Bhubaneswar, India, 6–9 February 2014; pp. 68–78.
28. Gibotti, F.R.; Câmara, G.; Nogueira, R.A. Geodiscover—A specialized search engine to discover geospatial data in the Web. In Proceedings of GeoInfo 2005, Campos do Jordão, Brazil, 2005; pp. 3–14.
29. Bloom, B.H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 1970, 13, 422–426. [CrossRef]
30. Page, L.; Brin, S.; Motwani, R.; Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web. Available online: http://ilpubs.stanford.edu:8090/422/ (accessed on 1 August 2016).
31. Kleene, S.C. Representation of Events in Nerve Nets and Finite Automata. Available online: www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA596138 (accessed on 1 August 2016).
32. De La Beaujardiere, J. OpenGIS® Web Map Server Interface Implementation Specification, Revision 1.0.0. Available online: http://portal.opengeospatial.org/files/?artifact_id=7196 (accessed on 4 August 2016).
33. De La Beaujardiere, J. OpenGIS® Web Map Server Implementation Specification, Version 1.3.0. Available online: http://portal.opengeospatial.org/files/?artifact_id=14416 (accessed on 4 August 2016).
34. Na, A.; Priest, M. Sensor Observation Service. Available online: http://portal.opengeospatial.org/files/?artifact_id=26667 (accessed on 4 August 2016).
35. Bröring, A.; Stasch, C.; Echterhoff, J. OGC Sensor Observation Service Interface Standard. Available online: https://portal.opengeospatial.org/files/?artifact_id=47599 (accessed on 4 August 2016).
36. Vretanos, P.A. Web Feature Service Implementation Specification. Available online: http://portal.opengeospatial.org/files/?artifact_id=7176 (accessed on 4 August 2016).
37. Vretanos, P.A. OpenGIS Web Feature Service 2.0 Interface Standard. Available online: http://portal.opengeospatial.org/files/?artifact_id=39967 (accessed on 4 August 2016).
38. Evans, J.D. Web Coverage Service (WCS), Version 1.0.0 (Corrigendum). Available online: https://portal.opengeospatial.org/files/05-076 (accessed on 4 August 2016).
39. Whiteside, A.; Evans, J.D. Web Coverage Service (WCS) Implementation Standard. Available online: https://portal.opengeospatial.org/files/07-067r5 (accessed on 4 August 2016).
40. Baumann, P. OGC WCS 2.0 Interface Standard—Core; Open Geospatial Consortium. Available online: https://portal.opengeospatial.org/files/09-110r4 (accessed on 4 August 2016).
41. Maso, J.; Pomakis, K.; Julia, N. OpenGIS Web Map Tile Service Implementation Standard. Available online: http://portal.opengeospatial.org/files/?artifact_id=35326 (accessed on 4 August 2016).
42. Schut, P.; Whiteside, A. OpenGIS Web Processing Service; OGC Project Document. Available online: http://portal.opengeospatial.org/files/?artifact_id=24151 (accessed on 4 August 2016).
43. ESRI. ESRI Shapefile Technical Description; An ESRI White Paper; ESRI: Redlands, CA, USA, July 1998.

44. Nebert, D.; Whiteside, A.; Vretanos, P. OpenGIS Catalogue Services Specification. Available online: http://portal.opengeospatial.org/files/?artifact_id=20555 (accessed on 4 August 2016).
45. Broder, A.; Mitzenmacher, M. Network applications of Bloom filters: A survey. Internet Math. 2004, 1, 485–509. [CrossRef]

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).