Big Data: Using ArcGIS with Apache Hadoop

Erik Hoel and Mike Park

Outline

• Overview of Hadoop
• Adding GIS capabilities to Hadoop
• Integrating Hadoop with ArcGIS

Apache Hadoop
What is Hadoop?

• Hadoop is a scalable open source framework for the distributed processing of extremely large data sets on clusters of commodity hardware
  - Maintained by the Apache Software Foundation
  - Assumes that hardware failures are common

• Hadoop is primarily used for:
  - Distributed storage
  - Distributed computation

http://hadoop.apache.org/

Apache Hadoop
What is Hadoop?

• Historically, development of Hadoop began in 2005 as an open source implementation of a MapReduce framework
  - Inspired by Google's MapReduce framework, as published in a 2004 paper by Jeffrey Dean and Sanjay Ghemawat (Google Lab)
  - Doug Cutting (Yahoo!) did the initial implementation

• Hadoop consists of a distributed file system (HDFS), a scheduler and resource manager, and a MapReduce engine
  - MapReduce is a programming model for processing large data sets in parallel on a distributed cluster
  - Map() – a procedure that performs filtering and sorting
  - Reduce() – a procedure that performs a summary operation

http://hadoop.apache.org/

Apache Hadoop
What is Hadoop?

• A number of frameworks that extend Hadoop have been built, and are also part of Apache
  - Cassandra – a scalable multi-master database with no single points of failure
  - HBase – a scalable, distributed database that supports structured data storage for large tables
  - Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying
  - Pig – a high-level data-flow language and execution framework for parallel computation
  - ZooKeeper – a high-performance coordination service for distributed applications

http://hadoop.apache.org/

MapReduce
High level overview

[Diagram: input at hdfs://path/input is split across map() tasks; their output is combined, partitioned, sorted, and shuffled to reduce() tasks, which write part 1 and part 2 of hdfs://path/output]

Apache Hadoop
MapReduce – The Word Count Example

Map
1. Each line is split into words
2. Each word is written to the map output with the word as the key and a value of '1'

Partition/Sort/Shuffle
1. The output of the mapper is sorted and grouped based on the key
2. Each key and its associated values are given to a reducer

Reduce
1. For each key (word) given, sum up the values (counts)
2. Emit the word and its count

[Diagram: example lines of red, green, and blue words flowing through map, partition/sort/shuffle, and reduce, yielding the counts red 4, green 3, blue 5]

Apache Hadoop
Hadoop Clusters
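The three phases above can be sketched as a single-process Java program. This is only an illustration of the flow — no Hadoop dependency is used, and the class and method names here are made up for the sketch, not Hadoop's actual API:

```java
import java.util.*;

public class WordCountSketch {

    // Map: split one line into words and emit (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Partition/Sort/Shuffle: group emitted values by key (a TreeMap also sorts the keys)
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the values (counts) for one key (word)
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) emitted.addAll(map(line));
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("red red blue", "green red blue")));
        // prints {blue=2, green=1, red=3}
    }
}
```

In the real framework the map tasks run on the nodes holding each input split, and the shuffle moves data across the network — the logic per phase is the same as above.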

Traditional Hadoop Clusters | The Dredd Cluster

Adding GIS capabilities to Hadoop

Hadoop Cluster
[Diagram: a .jar job submitted to the Hadoop cluster]

Adding GIS Capabilities to Hadoop
General approach

• Need to reduce large volumes of data into manageable datasets that can be processed in the ArcGIS Platform
  - Clipping
  - Filtering
  - Grouping

Adding GIS Capabilities to Hadoop
Spatial data in Hadoop

• Spatial data in Hadoop can show up in a number of different formats

Comma delimited, with the location defined in multiple fields:
ONTARIO,34.0544,-117.6058
RANCHO CUCAMONGA,34.1238,-117.5702
REDLANDS,34.0579,-117.1709
RIALTO,34.1136,-117.387
RUNNING SPRINGS,34.2097,-117.1135

Tab delimited, with the location defined in well-known text (WKT):
ONTARIO            POINT(34.0544,-117.6058)
RANCHO CUCAMONGA   POINT(34.1238,-117.5702)
REDLANDS           POINT(34.0579,-117.1709)
RIALTO             POINT(34.1136,-117.387)
RUNNING SPRINGS    POINT(34.2097,-117.1135)

JSON, with Esri's JSON defining the location:
{'attr':{'name':'ONTARIO'},'geometry':{'x':34.05,'y':-117.60}}
{'attr':{'name':'RANCHO…'},'geometry':{'x':34.12,'y':-117.57}}
{'attr':{'name':'REDLANDS'},'geometry':{'x':34.05,'y':-117.17}}
{'attr':{'name':'RIALTO'},'geometry':{'x':34.11,'y':-117.38}}
{'attr':{'name':'RUNNING…'},'geometry':{'x':34.20,'y':-117.11}}

GIS Tools for Hadoop
Esri on GitHub

GIS Tools for Hadoop (tools, samples)
• Tools and samples using the open source resources that solve specific problems

Spatial Framework for Hadoop (hive, json)
• Hive user-defined functions for spatial processing – spatial-sdk-hive.jar
• JSON helper utilities – spatial-sdk-json.jar

Geoprocessing Tools for Hadoop (HadoopTools.pyt)
• Geoprocessing tools that copy to/from Hadoop, convert to/from JSON, and invoke Hadoop jobs

Geometry API Java
• Java geometry library for spatial data processing – esri-geometry-api.jar

GIS Tools for Hadoop
Java geometry API

• Topological operations
  - Buffer
  - Union
  - Convex Hull
  - Contains
  - ...
• In-memory indexing
• Accelerated geometries for relationship tests
  - Intersects, Contains, …
• Still being maintained on GitHub

https://github.com/Esri/geometry-api-java

GIS Tools for Hadoop
Java geometry API

OperatorContains opContains = OperatorContains.local();
for (Geometry geometry : someGeometryList) {
    // accelerate each polygon before testing many points against it
    opContains.accelerateGeometry(geometry, sref, GeometryAccelerationDegree.enumMedium);
    for (Point point : somePointList) {
        boolean contains = opContains.execute(geometry, point, sref, null);
    }
    // release the acceleration structure when done with this geometry
    OperatorContains.deaccelerateGeometry(geometry);
}

GIS Tools for Hadoop
Hive spatial functions

• Apache Hive supports analysis of large datasets in HDFS using a SQL-like language (HiveQL) while also maintaining full support for MapReduce
  - Maintains additional metadata for data stored in Hadoop – specifically, a schema definition that maps the original data to rows and columns
  - Allows SQL-like interaction with the data using the Hive Query Language (HiveQL)
• Hive user-defined functions (UDFs) that wrap the geometry API operators
• Modeled on the OGC-compliant ST_Geometry type

https://github.com/Esri/spatial-framework-for-hadoop

GIS Tools for Hadoop
Hive spatial functions

• Defining a table on CSV data with a spatial component

CREATE TABLE IF NOT EXISTS earthquakes (
    earthquake_date STRING,
    latitude DOUBLE,
    longitude DOUBLE,
    magnitude DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

• Spatial query using the Hive UDFs

SELECT counties.name, count(*) cnt
FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape,          -- check if polygon contains point
                  ST_Point(earthquakes.longitude,  -- construct a point from
                           earthquakes.latitude))  --   longitude and latitude
GROUP BY counties.name
ORDER BY cnt DESC;

https://github.com/Esri/spatial-framework-for-hadoop

GIS Tools for Hadoop
Geoprocessing tools

• Geoprocessing tools that allow ArcGIS to interact with large data stored in Hadoop
  - Copy to HDFS – uploads files to HDFS
  - Copy from HDFS – downloads files from HDFS
  - Features to JSON – converts a feature class to a JSON file
  - JSON to Features – converts a JSON file to a feature class
  - Execute Workflow – executes Oozie workflows in Hadoop

[Screenshot: the Hadoop Tools toolbox listing the five tools above]

https://github.com/Esri/geoprocessing-tools-for-hadoop

Hadoop Cluster

[Diagram: Features to JSON → Copy to HDFS → filter job on the cluster → Copy from HDFS → JSON to Features → result]

DEMO

Point in Polygon Demo
Mike Park

Aggregate Hotspots

• Traditional hotspots and big data
  - Each feature is weighted, in part, by the values of its neighbors
  - Neighborhood searches in very large datasets can be extremely costly without a spatial index
  - The result of such an analysis would have as many features as the original data
• Aggregate hotspots
  - Features are aggregated and summarized into bins defined by a regular integer grid
  - The size of the summarized data is not affected by the size of the original data, only by the number of bins
  - Hotspots can then be calculated on the summary data

Step 1. Map/Reduce to aggregate points into bins (each bin keeps summaries such as a count, min, and max)
Step 2. Map/Reduce to calculate global values for the bin aggregates
Step 3. Map/Reduce to calculate hotspots using the bins
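Step 1, the binning, can be sketched in plain Java. The bin-key packing and the BinStats summary type below are illustrative choices for the sketch, not the actual tool's implementation:

```java
import java.util.*;

public class BinAggregation {

    // Per-bin summary carried through the combine/reduce phases
    static final class BinStats {
        int count = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
    }

    // Key a point to its grid cell: integer (col, row) packed into one long
    static long binKey(double x, double y, double binSize) {
        long col = (long) Math.floor(x / binSize);
        long row = (long) Math.floor(y / binSize);
        return (col << 32) | (row & 0xffffffffL);
    }

    // points[i] = {x, y, value}; returns count/min/max summaries per bin
    static Map<Long, BinStats> aggregate(double[][] points, double binSize) {
        Map<Long, BinStats> bins = new HashMap<>();
        for (double[] p : points) {
            BinStats s = bins.computeIfAbsent(binKey(p[0], p[1], binSize), k -> new BinStats());
            s.count++;
            s.min = Math.min(s.min, p[2]);
            s.max = Math.max(s.max, p[2]);
        }
        return bins;
    }
}
```

Because a bin key is just an integer pair, the map tasks can emit (binKey, partial summary) pairs and the reducers merge them, which is what keeps the output size proportional to the number of bins rather than the number of points.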

DEMO

Aggregate Hotspot Analysis
Mike Park

Integrating Hadoop with ArcGIS
Moving forward

• Optimizing data storage
  - What's wrong with the current data storage
  - Sorting and sharding
• Spatial indexing
• Data source
• Geoprocessing
  - Native implementations of key spatial statistical functions

Optimizing Data Storage
Distribution of spatial data across nodes in a cluster

hdfs:///path/to/dataset

[Diagram: part-1.csv, part-2.csv, and part-3.csv distributed across nodes dredd0, dredd1, and dredd2; neighboring features end up processed on different nodes]

Point in Polygon – in More Detail
Using GIS Tools for Hadoop

1. The entire set of polygons is sent to every node
2. Each node builds an in-memory spatial index for quick lookups
3. Every point assigned to that node is bounced off the index to see which polygon contains the point
4. The nodes output their partial counts, which are then combined into a single result
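The test at the heart of step 3 can be sketched with a standard ray-casting routine. The real tools use the Esri geometry API's OperatorContains (optionally accelerated, as shown earlier); this standalone version is only illustrative:

```java
public class PointInPolygon {

    // Ray casting: count how many polygon edges a ray from (px, py) crosses;
    // an odd number of crossings means the point is inside.
    // The polygon is given as parallel vertex arrays (xs[i], ys[i]).
    static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
            if ((ys[i] > py) != (ys[j] > py)
                    && px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i]) + xs[i]) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```

Each such test is cheap, but step 1's "send every polygon everywhere" is what makes the memory cost grow with the polygon count — which is the issue the next slide calls out.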

Issues:
• Every record in the dataset had to be processed, but only a subset of the records contribute to the answer
• The memory requirements for the spatial index can be large as the number of polygons increases

Optimizing Data Storage
Ordering and sharding

• Raw data in Hadoop is not optimized for spatial queries and analysis
• Techniques for optimized data storage:
  1. Sort the data in linearized space (e.g., along a space-filling curve)
  2. Split the ordered data into equal-density regions, known as shards
• Shards ensure that the majority of features are co-located on the same machine as their neighbors
  - This reduces network utilization when doing neighborhood searches

Hadoop and GIS
Distribution of ordered spatial data across nodes in a cluster
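The ordering-and-sharding idea can be sketched minimally, assuming a Z-order (Morton) curve as the linearization and equal-count splits as the shards — both are assumptions for this sketch, not necessarily the presenters' exact choices:

```java
import java.util.*;

public class ZOrderSharding {

    // Interleave the low 16 bits of col and row into a Morton (Z-order) code,
    // so cells that are near each other in 2D tend to be near in sort order
    static long morton(int col, int row) {
        long z = 0;
        for (int b = 0; b < 16; b++) {
            z |= ((long) (col >> b & 1)) << (2 * b);
            z |= ((long) (row >> b & 1)) << (2 * b + 1);
        }
        return z;
    }

    // Sort points (as {col, row} grid cells) by Morton key, then cut the
    // sorted list into shardCount equal-density runs
    static List<List<int[]>> shard(List<int[]> points, int shardCount) {
        points.sort(Comparator.comparingLong(p -> morton(p[0], p[1])));
        List<List<int[]>> shards = new ArrayList<>();
        int per = (points.size() + shardCount - 1) / shardCount;
        for (int i = 0; i < points.size(); i += per) {
            shards.add(points.subList(i, Math.min(i + per, points.size())));
        }
        return shards;
    }
}
```

Because each shard is a contiguous run of the space-filling curve, a feature's 2D neighbors usually land in the same shard, which is what cuts the network traffic for neighborhood searches.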

hdfs:///path/to/dataset

[Diagram: ordered parts part-1, part-2, and part-3 distributed across nodes dredd0–dredd4, with neighboring features co-located]

Spatial Indexing
Distributed quadtree

• The quadtree index of a dataset is composed of sub-indexes that are distributed across the cluster
• Each of these sub-indexes points to a shard with a 1-1 cardinality
• Each sub-index is stored on the same computer as the shard that it indexes
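One way to picture a sub-index cell: identify each quadtree cell by its path of quadrant choices from the root, so a leaf path can serve as the key in the 1-1 cell-to-shard mapping. The square extent and depth parameters here are made up for the sketch:

```java
public class QuadCell {

    // Return the quadtree path (one digit '0'-'3' per level) of point (x, y)
    // inside the square extent [minX, minX+size) x [minY, minY+size)
    static String path(double x, double y, double minX, double minY, double size, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int level = 0; level < depth; level++) {
            size /= 2;
            int q = 0;
            if (x >= minX + size) { q |= 1; minX += size; }  // east half
            if (y >= minY + size) { q |= 2; minY += size; }  // north half
            sb.append(q);
        }
        return sb.toString();
    }
}
```

A node can then answer "which points fall in this polygon's cells" by consulting only the sub-indexes whose paths intersect the polygon's extent, instead of scanning every shard.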

[Diagram: quadtree cells 0-4; each index entry points to its shard's data]

Point in Polygon – Indexed Points

Counting points in polygons using a spatially indexed dataset
• Rather than send every polygon to each node, we only send a subset of the polygons
• Each node queries the index for points that are contained in its polygon subset
• The partial counts from each node are then combined to produce the final result

DEMO

Filtering Areas of Interest with Features
Mike Park

Conclusion
Miscellaneous clever and insightful statements

• Overview of Hadoop
• Adding GIS capabilities to Hadoop
• Integrating Hadoop with ArcGIS

Caching I/O Reads

[Diagram: cached reads across nodes dredd0–dredd4]