Geospatial Artificial Intelligence
An introduction to pipelining for automated geospatial analysis, modelling and AI
Simon D. Wenkel
March 30, 2019

6 Selecting file formats

One of the biggest questions we have to ask ourselves when we start a geospatial project is which file formats we want to use. There are many out there, and some have specific advantages for certain niches. GDAL/OGR [1] lists 96 vector formats and 155 raster formats. That is a lot to choose from.

Comment 6.1 Utilizing normal files and databases for geospatial data
In theory we could use any kind of file or database to store geospatial data as long as we know how we stored it and which projection the coordinates refer to. Since the same coordinates lead to different positions under different projections, this can be dangerous and is therefore not recommended. However, we see this often, especially with text files, and if they are not documented properly we end up with a big mess.

6.1 Databases vs. single files

First, we have to decide whether we want single files or a real database to store our data. The main challenge with geospatial data stored in single files is that we often end up with multiple files that together make up a “single” file. If an essential one of them is lost or not copied correctly, we may be doomed. The big advantage of using single files to store geospatial data is that, if we manage to copy them correctly, everyone can work with them without deep knowledge of setting up databases. Moreover, we can do risky things such as editing them manually with non-geospatial tools. This is not recommended, but there are some special applications for it. Another disadvantage of single files is that their size is limited by the file system of the device they are stored on. There are still many devices in use that are formatted with FAT32 (with LFS), which limits file sizes to 4 GB.

We could use real databases to store our geospatial data.
They have the advantage that they are accessed via centralized APIs and, if set up correctly, make it much easier to collaborate with other team members. Another big advantage is that they usually offer basic data operations at the database level, which are usually better optimized for performance than our geospatial software tools. If we are working with really big datasets and/or enjoying slow network connections, we can save ourselves a lot of time because we do not have to transfer the data to our workstation, perform our calculations and transfer the results back to the server. However, setting them up correctly can be challenging, especially for complex cases and in certain corporate environments. Furthermore, it is difficult to exchange them with external people, since they either have to be dumped and converted, or shipped as a snapshot, or third parties require access to the API, which can be challenging depending on company culture and security.

Preview - ©Simon D. Wenkel (https://www.simonwenkel.com)

Comment 6.2 Databases vs. single files
Shipping functioning databases can be challenging, especially if data sizes are really big. Databases such as PostGIS offer the big advantage of performing calculations at the database level and therefore save a lot of time. Therefore, we should:
• ship single files, if file sizes are reasonable
• ship databases, otherwise.
If we are working on smaller projects, especially if we are the only ones involved, we will save time with single files in the short term, especially if we have never used databases before, and save a lot of time with databases in the long term.

6.2 Single files

If we decide to use single files as the basis for our analyses, then the most common formats we encounter are:
• Shapefile (Chapter 6.2.1)
• Keyhole Markup Language [KML] (Chapter 6.2.2)
• GeoTIFF (Chapter 6.2.3)
• x,y,z text files (Chapter 6.2.4)
• GeoPackage (Chapter 6.2.5)
Comment 6.3 Metadata security flaws
In the field of GIS, and in data science more generally, cybersecurity is often neglected. In our case it is less about securing workstations and servers themselves and more about the amount of information about our systems that we publish or ship accidentally. Automatically generated metadata may contain not only useful data, such as how analyses were done, but also absolute file paths and therefore a lot of information on our systems and workflow. This seems to be more common with desktop GIS and less so with the way we are working here. Nevertheless, we should clean our metadata before we ship it to increase computer security.

6.2.1 Shapefile

The shapefile format was developed by ESRI with the main purpose of exchanging vector features between ArcGIS and non-ArcGIS users. Therefore, it is compatible with a lot of software, even CAD software packages. It can store only one feature type and one layer. As defined by the shapefile whitepaper [2], it requires the following files to work:

.shp # stores feature geometry
.shx # index of feature geometries (record offsets into the .shp file)
.dbf # dBase table containing all attributes

We will often find seven or more files instead of three. The additional files are commonly:

.shp.xml # metadata in XML form
.prj # text file containing the projection
.qpj # additional projection information by QGIS
.sbn/.sbx # spatial index of features to speed up spatial queries

Depending on what software we are using, we may end up with more files:

.ain/.aih # attribute index of active fields
.atx # attribute index for dBase file
.cpg # specifies character encoding of the dBase file
.fbn/.fbx # similar to .sbn/.sbx but for read-only features
.ixs # geocoding index for shapefiles (read/write)
.mxs # similar to .ixs but for ODB format
.qix # quadtree spatial index
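
Because a shapefile is spread over several files, it can help to check that the required ones are present before a dataset is shipped or loaded. The following is a minimal sketch using only the Python standard library. The function name missing_sidecars and the decision to report .prj as "recommended" are assumptions of this example and not part of the shapefile whitepaper.

```python
from pathlib import Path

# Files required by the shapefile whitepaper.
REQUIRED = (".shp", ".shx", ".dbf")
# Assumption of this sketch: the projection file is not formally required,
# but we report it separately because data without it is hard to use.
RECOMMENDED = (".prj",)


def missing_sidecars(shp_path):
    """Return (missing_required, missing_recommended) for a shapefile path."""
    base = Path(shp_path)

    def absent(extensions):
        # with_suffix swaps ".shp" for each sidecar extension in turn.
        return [ext for ext in extensions if not base.with_suffix(ext).exists()]

    return absent(REQUIRED), absent(RECOMMENDED)
```

Calling missing_sidecars("roads.shp") (a hypothetical file name) returns two lists of extensions. The check compares extensions literally, so upper-case names such as ROADS.SHX on a case-sensitive file system would be reported as missing.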
The main disadvantage of shapefiles is that we may end up with unusable data if one of the essential files is not included or if the projection is missing. This happens far more often than it should.

6.2.2 Keyhole Markup Language (KML)

When dealing with single-file geospatial data we have to mention KML, the Keyhole Markup Language [3]. KML is famous because it is one of the few file types that are readable and writable by Google Earth. Therefore, it is sometimes used to store GPS tracks instead of the GPS Exchange Format (.gpx). There are two kinds of KML files:

.kml # Keyhole Markup Language document
.kmz # zipped Keyhole Markup Language document

6.2.3 GeoTIFF

The shapefile equivalent for raster data is GeoTIFF. A GeoTIFF file [4] is similar to a standard TIFF image file; however, it is extended to store georeferencing information. Hence, we only need one file:

.tif # georeferenced TIFF file containing the raster

Nevertheless, there might be more files for a single raster. We may encounter MapInfo or ESRI world files that store georeferencing information instead of embedding it directly. Further, we may have our XML metadata files again:

.aux.xml # PAM (Persistent Auxiliary Metadata)
.ovr # stores pyramid layers (overviews) of the raster
.tab # MapInfo file
.tif.xml # contains metadata
.tfw/.tifw/.tiffw/.wld # ESRI world file

6.2.4 x,y,z text files

We can think of x,y,z text files as CSV (comma-separated values) files that look a bit like this:

0,0,0,'Amsterdam',123,'foo'
...

Without special (manual) processing they can only be used for points, and therefore they are often used to ship DEMs (Digital Elevation Models). The main problem is that they have to be accompanied by metadata so that we know which projection has to be assigned after importing.
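
Reading such a file needs nothing beyond the Python standard library. The sketch below parses the first three columns as coordinates and keeps the remaining columns as untyped attributes. The function name read_xyz, the single-quote quoting and the second sample row are assumptions made for illustration; a real file has to document its column order and delimiter.

```python
import csv
import io

# Hypothetical sample in the style of the line shown above:
# x, y, z followed by free-form attribute columns.
SAMPLE = "0,0,0,'Amsterdam',123,'foo'\n1.5,2.5,10,'Utrecht',456,'bar'\n"


def read_xyz(text_stream, quotechar="'"):
    """Yield (x, y, z, extra_columns) tuples from an x,y,z text stream."""
    reader = csv.reader(text_stream, quotechar=quotechar)
    for row in reader:
        if not row:
            continue  # skip blank lines
        x, y, z = (float(value) for value in row[:3])
        yield x, y, z, row[3:]


points = list(read_xyz(io.StringIO(SAMPLE)))
```

Note that nothing in the parsed result says which projection the coordinates belong to. That information has to come from the accompanying metadata.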
6.2.5 GeoPackage (GPKG)

GeoPackage is a single file that can store multiple vector features and rasters. According to the current specification [5], it is a container built on an SQLite database with some degree of similarity to SpatiaLite. However, unlike SpatiaLite, it is a pure storage format and not a database that allows certain (optimized) operations at the database level. It is designed to store data and leaves the processing to other software.

Comment 6.4 GeoPackage is the default file format for QGIS 3
GeoPackage is used as the default file format for QGIS 3 [6]. Let us hope that this will lead to widespread usage of GeoPackage in times of “open data policies” in Europe. This could end the battle with incomplete shapefiles and GeoTIFFs.

6.2.6 GeoJSON

GeoJSON [7] is the abbreviation of Geographic JavaScript Object Notation. It is a geospatial extension of the JSON format and therefore aims at web applications. It is a rather simple format that uses WGS 84 coordinates written as decimal degrees in longitude, latitude order. It supports the following geometry types:
• Point, MultiPoint
• LineString, MultiLineString
• Polygon, MultiPolygon
• GeometryCollection

Code 6.1 GeoJSON example
This is the example from the GeoJSON homepage (http://geojson.org/).

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "name": "Dinagat Islands"
  }
}

6.3 Databases

In terms of geospatial databases, we mainly run into SQL (Structured Query Language) databases, meaning that we are dealing with relational databases.
Relational databases were introduced by Codd [8] and have standardized terminology describing a “table”:
• Row: tuple/record
• Column: attribute/field
• Table: relation
One of the biggest advantages of real databases is that we can use native database tools to perform queries and some analyses directly on the database, without transferring data from the database to our workstations and doing the calculations there.
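
The terminology and the idea of computing at the database level can be illustrated with SQLite, which ships with the Python standard library. This is only a sketch under stated assumptions: the table name, column names and values are made up, and plain SQLite has no spatial functions, so the query uses an ordinary filter and aggregate where a geospatial database such as PostGIS would offer spatial operators.

```python
import sqlite3

# In-memory SQLite database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")

# The table is the relation, its columns are the attributes/fields.
conn.execute(
    "CREATE TABLE stations (name TEXT, x REAL, y REAL, elevation REAL)"
)

# Each inserted row is one tuple/record (made-up values).
conn.executemany(
    "INSERT INTO stations VALUES (?, ?, ?, ?)",
    [
        ("A", 0.0, 0.0, 12.0),
        ("B", 1.0, 0.0, 30.0),
        ("C", 5.0, 5.0, 48.0),
    ],
)

# Filtering and aggregation run inside the database engine; only the
# single result row travels back to the caller.
count, mean_elevation = conn.execute(
    "SELECT COUNT(*), AVG(elevation) FROM stations WHERE x <= ?", (1.0,)
).fetchone()
conn.close()
```

With a database server instead of an in-memory file, the same pattern means that only the small query result crosses the network, not the full table.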