DATA MINING II - 1DL460

Spring 2017

A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt17

Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala University, Uppsala, Sweden

Kjell Orsborn - UDBL - IT - UU 17-03-09

Introduction to Spatial Databases


• A spatial database is a database that is optimized to store and query data that represents objects defined in a geometric space.

– Most spatial databases allow representing simple geometric objects such as points, lines and polygons.

– Some spatial databases handle more complex structures such as 3D objects, topological coverages, linear networks, and triangulated irregular networks (TINs).

– Conventional databases were developed to manage various numeric and character types of data.

– Spatial databases require additional functionality to process spatial data types efficiently, and developers have often added geometry or feature data types.

• The Open Geospatial Consortium developed the Simple Features specification (first released in 1997) and sets standards for adding spatial functionality to database systems.

• The SQL/MM Spatial ISO/IEC standard is a part of the SQL/MM multimedia standard and extends the Simple Features standard with data types that support circular interpolations.

The SQL/MM ISO/IEC standard: SQL spatial type hierarchy

[...] contributors to SQL/MM did not want to move forward with spatio-temporal support until SQL/Temporal was developed (footnote 2). In the meantime, the focus of the spatial standard lay on keeping it aligned with the OGC specification and the standards developed by the technical committee ISO/TC 211, for example [ISO02a, ISO02b]. The prefix ST for the spatial tables, types, and methods was not changed during the organizational changes of the standards, however. Today, one might want to interpret it as Spatial Type.

2.2 Geometry Type Hierarchy

The OGC geometry class hierarchy is adapted for the corresponding SQL type hierarchy that is defined in the SQL/MM Spatial standard. Figure 2 shows the standardized type hierarchy. The shaded types are the not-instantiable types (footnote 3). All types are used to represent geometric features in the 2-dimensional space (R²).

[Figure 2: SQL Type Hierarchy. Types shown, row by row from the root: ST_Geometry; ST_Surface, ST_Curve, ST_Point, ST_GeomCollection; ST_CurvePolygon, ST_MultiSurface, ST_MultiCurve, ST_MultiPoint; ST_Polygon, ST_MultiPolygon, ST_MultiLineString; ST_LineString, ST_CircularString, ST_CompoundCurve, ST_MultiCircString.]

The major differences between the SQL type hierarchy and the OGC geometry class hierarchy are the omission of the derived types Line and LinearRing, and the addition [...]

Footnote 2: SQL/Temporal was not any further developed and, like SQL/MM Part, subsequently withdrawn completely.
Footnote 3: It is implementation-defined whether ST_MultiCurve and ST_MultiSurface are instantiable or not, even though they are shown as not-instantiable in Figure 2.

Spatial Database

• Features of spatial databases:
• Spatial databases use a spatial index to speed up database operations.

• Spatial databases can perform a wide variety of spatial operations. The following operations and many more are specified by the Open Geospatial Consortium standard:
– Spatial Measurements: compute line length, polygon area, the distance between geometries, etc.
– Spatial Functions: modify existing features to create new ones, for example by providing a buffer around them, intersecting features, etc.
– Spatial Predicates: allow true/false queries about spatial relationships between geometries. Examples include "do two polygons overlap?" or "is there a residence located within a mile of the area where we are planning to build the landfill?" (see DE-9IM).
– Geometry Constructors: create new geometries, usually by specifying the vertices (points or nodes) which define the shape.
– Observer Functions: queries which return specific information about a feature, such as the location of the center of a circle.

• Some databases support only simplified or modified sets of these operations, especially in cases of NoSQL systems like MongoDB and CouchDB.
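The operation categories listed above can be illustrated without any database at all. The following is a minimal pure-Python sketch, not the OGC API: the function names and the representation of geometries as lists of (x, y) tuples are assumptions made only for this example.

```python
import math

def line_length(points):
    """Spatial measurement: total length of a polyline given as (x, y) vertices."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def polygon_area(ring):
    """Spatial measurement: area of a simple polygon (shoelace formula).
    The ring is a list of vertices; the closing edge is added implicitly."""
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]))
    return abs(twice_area) / 2.0

def point_in_polygon(point, ring):
    """Spatial predicate: ray-casting test for a point against a simple polygon.
    Points exactly on the boundary are not treated specially in this sketch."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def bounding_box(ring):
    """Observer-style function: (min_x, min_y, max_x, max_y) of the vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))

square = [(0, 0), (4, 0), (4, 3), (0, 3)]
print(polygon_area(square), line_length([(0, 0), (3, 4)]))
print(point_in_polygon((1, 1), square), bounding_box(square))
```

A spatial database exposes the same kinds of operations as SQL functions and evaluates them with the help of a spatial index rather than by scanning every geometry.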

Spatial Database

• Spatial indices are used by spatial databases (databases which store information related to objects in space) to optimize spatial queries.
• Conventional index types do not efficiently handle spatial queries such as how far apart two points are, or whether points fall within a spatial area of interest.
• Common spatial index methods include:
– R-tree (also R+ tree, R* tree, Hilbert R-tree): typically the preferred method for indexing spatial data. Objects (shapes, lines and points) are grouped using the minimum bounding rectangle (MBR). An object is added to the MBR within the index whose size will increase the least.
– X-tree
– k-d tree
– M-tree: can be used for the efficient resolution of similarity queries on complex objects that are compared using an arbitrary metric.
– Quadtree and octree
– UB-tree: a B+ tree (information only in the leaves) with records stored according to Z-order
– Space-filling curves: Hilbert curve, Z-order curve
– Others: HHCode, grid (spatial index), point access method, binary space partitioning (BSP tree, subdividing space by hyperplanes)
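Two of the ideas in the list above are small enough to sketch directly: the R-tree rule of inserting into the MBR that grows the least, and the Z-order curve that UB-trees use to linearize two-dimensional keys. The code below is an illustrative sketch only; the tuple layout (min_x, min_y, max_x, max_y) and all function names are assumptions, and a real R-tree also handles node splits and multiple levels.

```python
def area(mbr):
    """Area of a minimum bounding rectangle (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = mbr
    return (max_x - min_x) * (max_y - min_y)

def enlarge(mbr, other):
    """Smallest rectangle that covers both input rectangles."""
    return (min(mbr[0], other[0]), min(mbr[1], other[1]),
            max(mbr[2], other[2]), max(mbr[3], other[3]))

def choose_mbr(node_mbrs, new_mbr):
    """Index of the MBR whose area grows least when new_mbr is added.
    Ties are broken in favour of the MBR with the smaller current area."""
    def cost(i):
        growth = area(enlarge(node_mbrs[i], new_mbr)) - area(node_mbrs[i])
        return (growth, area(node_mbrs[i]))
    return min(range(len(node_mbrs)), key=cost)

def z_order(x, y, bits=16):
    """Z-order (Morton) code: interleave the bits of two non-negative integers,
    with x in the even bit positions and y in the odd ones."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

nodes = [(0, 0, 2, 2), (5, 5, 8, 8)]
print(choose_mbr(nodes, (1, 1, 3, 3)))          # which node receives the new object
print([z_order(x, y) for y in range(2) for x in range(2)])
```

Sorting records by their Z-order code keeps points that are close in space reasonably close in the one-dimensional key order, which is what allows an ordinary B+ tree to serve as a spatial index.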

Spatial Database Systems

• Caliper extends the Raima Data Manager with spatial data types, functions, and utilities.
• Boeing's Spatial Query Server spatially enables Sybase ASE.
• Smallworld VMDS, the native GE Smallworld GIS database.
• SpatiaLite extends SQLite with spatial data types, functions, and utilities.
• IBM DB2 Spatial Extender can spatially enable any edition of DB2.
• ClusterPoint offers native indexed support for distances, range matching and polygon matching, as well as aggregation.
• Oracle Spatial
• Oracle Locator
• Vertica Place, the geospatial extension for HP Vertica, adds OGC-compliant spatial features to the relational column-store database.
• Microsoft SQL Server has supported spatial types since version 2008.
• PostgreSQL uses the spatial extension PostGIS to implement the standardized data type geometry and corresponding functions.
• Teradata Geospatial includes 2D spatial functionality (OGC-compliant) in its data warehouse system.

Spatial Database Systems

• The MonetDB/GIS extension for MonetDB adds OGC Simple Features to the relational column-store database.
• Linter SQL Server supports spatial types and spatial functions according to the OpenGIS specifications.
• MySQL implements the data type geometry, plus some spatial functions implemented according to the OpenGIS specifications. As of MySQL 5.0.16, MyISAM, InnoDB, NDB, BDB, and ARCHIVE support spatial features.
• Neo4j, a graph database that can build 1D and 2D indexes as B-tree, Quadtree and Hilbert curve directly in the graph.
• AllegroGraph, a graph database which provides a novel mechanism for efficient storage and retrieval of two-dimensional geospatial coordinates for RDF data. It includes an extension syntax for SPARQL queries.
• MarkLogic, MongoDB, RavenDB, and RethinkDB support geospatial indexes in 2D.
• Esri has a number of both single-user and multiuser geodatabases.
• SpaceBase, a real-time spatial database.
• CouchDB, a document-based database system that can be spatially enabled by a plugin called GeoCouch.

Spatial Database Systems

• CartoDB, a cloud-based geospatial database on top of PostgreSQL with PostGIS.
• StormDB, an upcoming cloud-based database on top of PostgreSQL with geospatial capabilities.
• AsterixDB, an open-source big data management system with native geospatial capabilities.
• Kinetica, a GPU-accelerated analytics database optimized for geospatial analytics on large datasets.
• SpatialDB by MineRP, an OGC spatial database with spatial type extensions for the mining industry.
• H2 supports geometry types and spatial indices. An extension called H2GIS, available on Maven Central, gives full OGC Simple Features support.
• GeoMesa is a cloud-based spatio-temporal database built on top of Apache Accumulo and Apache Hadoop. GeoMesa provides full OGC Simple Features support and a GeoServer plugin.
• Ingres 10S and 10.2 include native comprehensive spatial support. Ingres includes the Geospatial Data Abstraction Library cross-platform spatial data translator.
• Tarantool supports geospatial queries with an R-tree index.
• SAP HANA supports geospatial data with SPS08.
• Redis with the Geo API.

Spatial Database Systems

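• GeoMesa – distributed: yes; spatial objects: yes (Simple Features); spatial functions: yes (JTS); PostgreSQL interface: no (can be built with GeoTools); UMN MapServer interface: no; documentation: parts of the functions, a few examples; modifiable: all kinds of tasks are solvable with Simple Feature Access in the Java Virtual Machine and Apache Spark; HDFS: yes

• ESRI GIS Tools for Hadoop – distributed: yes; spatial objects: yes (own specific API); spatial functions: yes (union, difference, intersect, clip, cut, buffer, equals, within, contains, crosses, and touches); PostgreSQL interface: no; UMN MapServer interface: no; documentation: just briefly; modifiable: forking; HDFS: yes

• Rasdaman – distributed: yes; spatial objects: just raster; spatial functions: raster manipulation with rasql; PostgreSQL interface: yes; UMN MapServer interface: with [...] or [...]; documentation: detailed wiki; modifiable: user-defined functions in the enterprise edition; HDFS: no

• PostgreSQL with PostGIS – distributed: no; spatial objects: yes (Simple Features and raster); spatial functions: yes (Simple Feature Access and raster functions); PostgreSQL interface: yes; UMN MapServer interface: yes; documentation: detailed; modifiable: SQL, in connection with R; HDFS: no

• Neo4J-spatial – distributed: no; spatial objects: yes (Simple Features); spatial functions: yes (contain, cover, covered by, cross, disjoint, intersect, intersect window, overlap, touch, within and within distance); PostgreSQL interface: no; UMN MapServer interface: no; documentation: just briefly; modifiable: fork of JTS; HDFS: no

• Postgres-XL with PostGIS – distributed: yes; spatial objects: yes (Simple Features and raster); spatial functions: yes (Simple Feature Access and raster functions); PostgreSQL interface: yes; UMN MapServer interface: yes; documentation: PostGIS yes, Postgres-XL briefly; modifiable: SQL, in connection with R or Tcl or Python; HDFS: no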

Spatial Database Systems

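• AsterixDB – distributed: yes; spatial objects: yes (custom); spatial functions: center, radius, distance, area, intersect and cell; PostgreSQL interface: no; UMN MapServer interface: no; documentation: good in Google Code; modifiable: own datatypes, functions and indexes; HDFS: possible
• HadoopGIS – distributed: yes; spatial objects: yes (custom, no raster); spatial functions: yes (contain, cover, covered by, cross, disjoint, intersect, overlap, within and nearest neighbor); PostgreSQL interface: no; UMN MapServer interface: no; documentation: just briefly; modifiable: forking; HDFS: yes
• H2GIS (license: GPL 3) – distributed: no; spatial objects: yes (custom, no raster); spatial functions: Simple Feature Access and custom functions for H2Network; PostgreSQL interface: yes; UMN MapServer interface: no; documentation: yes (homepage); modifiable: SQL; HDFS: no
• Ingres – distributed: yes (if extension is installed); spatial objects: yes (custom, no raster); spatial functions: Geometry Engine, Open Source; PostgreSQL interface: no; UMN MapServer interface: with MapScript; documentation: just briefly; modifiable: with C and OME; HDFS: no
• RethinkDB – distributed: yes; spatial objects: yes; spatial functions: distance, getIntersecting, getNearest, includes, intersects; PostgreSQL interface: no; UMN MapServer interface: no; documentation: official docs; modifiable: forking; HDFS: no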

PostGIS

• PostGIS is an open source software program that adds support for geographic objects to the PostgreSQL object-relational database.
• PostGIS follows the Simple Features for SQL specification from the Open Geospatial Consortium (OGC).
• Features:
– Geometry types for Points, LineStrings, Polygons, MultiPoints, MultiLineStrings, MultiPolygons and GeometryCollections.
– Spatial predicates for determining the interactions of geometries using the 3x3 DE-9IM (provided by the GEOS software library).
– Spatial operators for determining geospatial measurements like area, distance, length and perimeter.
– Spatial operators for determining geospatial set operations, like union, difference, symmetric difference and buffers (provided by GEOS).
– R-tree-over-GiST (Generalized Search Tree) spatial indexes for high-speed spatial querying.
– Index selectivity support, to provide high-performance query plans for mixed spatial/non-spatial queries.

PostGIS

• Raster data is supported through PostGIS WKT Raster, which is now integrated into PostGIS 2.0+ and renamed PostGIS Raster.

• The PostGIS implementation is based on "light-weight" geometries and indexes optimized to reduce disk and memory footprint. Using light-weight geometries helps servers increase the amount of data migrated up from physical disk storage into RAM, improving query performance substantially.

• PostGIS is registered as "implements the specified standard" for "Simple Features for SQL" by the OGC, but it has not been certified as compliant by the OGC.

Simple Feature Access and SQL/MM Spatial

• Simple Features (officially Simple Feature Access) is both an Open Geospatial Consortium (OGC) and ISO standard (ISO 19125) that specifies a common storage and access model of mostly two-dimensional geometries (point, line, polygon, multi-point, multi-line, etc.) used by geographic information systems.

• The ISO 19125 standard comes in two parts.
– Part 1, ISO 19125-1 (SFA-CA for "common architecture"), defines a model for two-dimensional simple features, with linear interpolation between vertices. The data model defined in SFA-CA is a hierarchy of classes. This part also defines representation using Well-Known Text (and Binary).
– Part 2, ISO 19125-2 (SFA-SQL), defines an implementation using SQL. The OpenGIS standard(s) cover implementations in CORBA and OLE/COM as well, although these have lagged behind the SQL one and are not standardized by ISO.

• The ISO/IEC 13249-3 SQL/MM Spatial extends the Simple Features data model mainly with circular interpolations (e.g. circular arcs) and adds other features like coordinate transformations and methods for validating geometries as well as Geography Markup Language support.
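To make the Well-Known Text representation mentioned above concrete, here is a minimal sketch that writes WKT for three of the simple feature types and reads back a point. The helper names `to_wkt` and `parse_point` and the coordinate layout are hypothetical, chosen for this example; real systems use a full WKT parser that also covers the remaining types and optional Z/M coordinates.

```python
def _coords(points):
    """Format a sequence of (x, y) pairs as 'x y, x y, ...'."""
    return ", ".join(f"{x} {y}" for x, y in points)

def to_wkt(kind, coords):
    """Serialize a geometry to Well-Known Text.
    POINT takes one (x, y) pair, LINESTRING a list of pairs,
    POLYGON a list of rings (each ring a closed list of pairs)."""
    kind = kind.upper()
    if kind == "POINT":
        return f"POINT ({_coords([coords])})"
    if kind == "LINESTRING":
        return f"LINESTRING ({_coords(coords)})"
    if kind == "POLYGON":
        rings = ", ".join(f"({_coords(ring)})" for ring in coords)
        return f"POLYGON ({rings})"
    raise ValueError(f"unsupported geometry type: {kind}")

def parse_point(wkt):
    """Parse a 2D 'POINT (x y)' string into a tuple of floats."""
    tag, _, rest = wkt.strip().partition("(")
    if tag.strip().upper() != "POINT":
        raise ValueError("not a POINT")
    x, y = rest.rstrip(")").split()
    return float(x), float(y)

print(to_wkt("POINT", (1, 2)))
print(to_wkt("LINESTRING", [(0, 0), (1, 1), (2, 0)]))
print(to_wkt("POLYGON", [[(0, 0), (4, 0), (4, 3), (0, 0)]]))
print(parse_point("POINT (1.5 -2)"))
```

Linear interpolation between vertices means that a LINESTRING or POLYGON ring is fully described by its vertex list, which is why the text form is just a sequence of coordinate pairs.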

Spatial relations

• A spatial relation specifies how some object is located in space in relation to some reference object. When the reference object is much bigger than the object to locate, the latter is often represented by a point. The reference object is often represented by a bounding box.
• In spatial databases and geospatial topology, spatial relations are used for spatial analysis and constraint specifications.
• Commonly used types of spatial relations are: topological, directional and distance relations.

Spatial relations

• Topological relations:
– The DE-9IM model expresses important spatial relations which are invariant to rotation, translation and scaling transformations.
– For any two spatial objects a and b, which can be points, lines and/or polygonal areas, the relations derived from DE-9IM include the following, where I(x) denotes the interior of x:
– Equals: a = b
• Topologically equal. Also (a ∩ b = a) ∧ (a ∩ b = b).
– Disjoint: a ∩ b = ∅
• a and b are disjoint and have no point in common. They form a set of disconnected geometries.
– Intersects: a ∩ b ≠ ∅
– Touches: (a ∩ b ≠ ∅) ∧ (I(a) ∩ I(b) = ∅)
• a touches b: they have at least one boundary point in common, but no interior points.
– Contains: a ∩ b = b
– Covers: I(a) ∩ b = b
• b lies in the interior of a (extends Contains). Other definitions: "no points of b lie in the exterior of a", or "every point of b is a point of (the interior of) a".
– CoveredBy: Covers(b, a)
– Within: a ∩ b = a
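The set-based definitions above can be tried out on the simplest kind of area geometry. The sketch below evaluates several of them for closed, axis-aligned rectangles; the tuple layout (min_x, min_y, max_x, max_y), the function names, and the restriction to non-degenerate rectangles are assumptions of this example, and Covers is left out. A real implementation such as GEOS computes the full DE-9IM matrix for arbitrary geometries.

```python
def intersects(a, b):
    """a ∩ b ≠ ∅: the closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def interiors_intersect(a, b):
    """I(a) ∩ I(b) ≠ ∅: the open interiors overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def disjoint(a, b):
    """a ∩ b = ∅."""
    return not intersects(a, b)

def touches(a, b):
    """Common boundary points, but no common interior points."""
    return intersects(a, b) and not interiors_intersect(a, b)

def contains(a, b):
    """a ∩ b = b: every point of b is also a point of a."""
    return a[0] <= b[0] and a[1] <= b[1] and b[2] <= a[2] and b[3] <= a[3]

def within(a, b):
    """a ∩ b = a: the converse of contains."""
    return contains(b, a)

def equals(a, b):
    """(a ∩ b = a) ∧ (a ∩ b = b)."""
    return contains(a, b) and contains(b, a)

a = (0, 0, 2, 2)
neighbour = (2, 0, 4, 2)     # shares only the edge x = 2 with a
inner = (0.5, 0.5, 1, 1)     # lies inside a
far = (5, 5, 6, 6)
print(touches(a, neighbour), contains(a, inner), within(inner, a), disjoint(a, far))
```

Because every predicate is expressed through coordinate comparisons only, the results do not change when both rectangles are translated or uniformly scaled, which mirrors the invariance property stated for DE-9IM.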

Spatial relations

• Directional relations
– Directional relations can again be differentiated into external directional relations and internal directional relations. An internal directional relation specifies where an object is located inside the reference object, while an external relation specifies where the object is located outside of the reference object.
– Examples of internal directional relations: left; on the back; athwart; abaft
– Examples of external directional relations: on the right of; behind; in front of; abeam; astern
• Distance relations
– Distance relations specify how far the object is from the reference object.
– Examples are: at; nearby; in the vicinity; far away


Spatial query

• A spatial query is a special type of query supported by spatial databases. The two most important differences in comparison to non-spatial SQL queries are that they allow for the use of geometry data types such as points, lines and polygons, and that these queries consider the spatial relationship between these geometries.
• Examples of types of queries (functions) used in PostGIS:
– Distance(geometry, geometry) : number
– Equals(geometry, geometry) : boolean
– Disjoint(geometry, geometry) : boolean
– Intersects(geometry, geometry) : boolean
– Touches(geometry, geometry) : boolean
– Crosses(geometry, geometry) : boolean
– Overlaps(geometry, geometry) : boolean
– Contains(geometry, geometry) : boolean
– Length(geometry) : number
– Area(geometry) : number
– Centroid(geometry) : geometry
• The term 'geometry' refers to a point, line, box or other two- or three-dimensional shape.
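The shape of such a query can be sketched with Python's built-in sqlite3 module. This is not PostGIS: the `Distance` function registered here is a hypothetical stand-in that works on plain x/y columns instead of geometry values, and the table `places` with its contents is invented for the example.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for a spatial Distance function: a user-defined SQL
# function over coordinate columns (a spatial database would take geometries).
conn.create_function(
    "Distance", 4,
    lambda x1, y1, x2, y2: math.hypot(x2 - x1, y2 - y1))

conn.execute("CREATE TABLE places (name TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?, ?)",
    [("cafe", 1.0, 1.0), ("library", 4.0, 5.0), ("harbour", 10.0, 0.0)])

def places_within(conn, x, y, radius):
    """Names of all places at most `radius` away from the point (x, y)."""
    rows = conn.execute(
        "SELECT name FROM places "
        "WHERE Distance(x, y, ?, ?) <= ? ORDER BY name",
        (x, y, radius))
    return [name for (name,) in rows]

print(places_within(conn, 0, 0, 7))
```

The query reads like ordinary SQL with a spatial function in the WHERE clause. What a spatial database adds is a geometry column type and a spatial index, so that the predicate does not have to be evaluated for every row as it is in this sketch.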
