Technical white paper

Location Intelligence

Location Intelligence for Big Data

Maximizing the distributed nature of big-data clusters to achieve breakthrough performance

Add agility to big data analysis.

Companies struggle to generate positive returns on big data implementations. Many find it difficult to generate actionable insight from their big data assets. Geospatial processing changes the dynamics. Location Intelligence for Big Data makes vast quantities of data consumable using GeoEnrichment and location analytics. Run spatial operations within a native environment. Then visualize relationships in a spatial context to improve analysis and decision making.

Pitney Bowes offers a unique approach, embedding location technology within big data solutions. When you embed rather than connect, you can interpret transactional data faster and resolve critical business issues with the clarity you need.

Discover the technology that delivers richer insights and a faster ROI:

• High-scalability, high-speed data processing
• GeoEnrichment
• Cluster-level data partitioning
• Node-level data processing


The big data challenge

Big data technologies increasingly allow companies to store and process incredibly large datasets of customer calls, financial transactions and social media feeds. Yet, many companies struggle to generate meaningful, actionable insights. Key performance indicators remain elusive as data volume and velocity continue to grow.

The challenge is to connect data within and across datasets in a way that:

• Ensures accuracy and precision

• Enables enrichment

• Keeps pace with the extraordinary speed and scale required

The Location Intelligence advantage

Location Intelligence brings critical perspective to data analysis. Big data typically comes with a locational component. This could be a customer address, a mobile phone GPS signal, the location of an ATM, a store transaction or a social media check-in.

Through GeoEnrichment, a process for appending location-based ancillary data, organizations can augment records with latitude/longitude coordinates. Then, that coordinate data can be used to integrate additional context into every record.

With this enriched, embedded insight:

• Rules-based workflows can utilize this appended data to automate business decisions (see the sketch after this list).

• Spatial aggregation can condense data volumes, making them more manageable.

• Data can be generalized in a spatial context for results that are easier to visualize and analyze.

• Organizations can gain new perspectives into business drivers and subsequent company responses.
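As a minimal Java sketch of this enrich-then-decide pattern (all class and method names here are hypothetical stand-ins, not the Pitney Bowes API), a record is first augmented with coordinates and a rule then acts on them:

// Hypothetical illustration of GeoEnrichment followed by a rules-based check.
// The Geocoder interface below is a stand-in, not the actual Pitney Bowes API.
public class GeoEnrichmentExample {

    interface Geocoder {
        double[] geocode(String address); // returns {latitude, longitude}
    }

    static class Transaction {
        String cardId;
        String merchantAddress;
        double latitude;   // appended by GeoEnrichment
        double longitude;  // appended by GeoEnrichment
    }

    // Append coordinates to the record so later rules can use them.
    static void enrich(Transaction t, Geocoder geocoder) {
        double[] coord = geocoder.geocode(t.merchantAddress);
        t.latitude = coord[0];
        t.longitude = coord[1];
    }

    // Simple rule: flag the transaction if it occurs far from a home location.
    static boolean isSuspicious(Transaction t, double homeLat, double homeLon) {
        double dLat = t.latitude - homeLat;
        double dLon = t.longitude - homeLon;
        // Rough planar distance in degrees; a real rule would use great-circle distance.
        return Math.sqrt(dLat * dLat + dLon * dLon) > 5.0;
    }
}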

Valuable applications

An embedded approach enables companies to formulate business questions and solve them within a big data environment. For example:

Telecommunications companies continually process a huge number of call records. These can be condensed and presented via highly accurate, near real-time coverage maps. This type of visual analysis helps firms to improve customer service, reduce churn, market more effectively and gain market share.

Financial services firms continually process an incredibly vast number of transactions. Each can be appended with a latitude/longitude coordinate pair, and operational rules can help determine when to flag records as potentially fraudulent. This process enables financial services firms to provide better safeguards to consumers and their privacy, while helping to reduce losses due to fraud.

The benefits of a native optimized approach

Many geospatial technology providers supply solutions that connect to big data platforms, then transfer data from these distributed platforms into their own GIS server-based technology. The disadvantage of this “connector” approach is that it doesn’t leverage the processing power of the distributed platform (Hadoop, Spark, etc.). Instead, the actual geospatial operations occur in a single server or a small server cluster, which limits your ability to process large data sets.

Pitney Bowes takes a different, big-data-ready approach. To maximize the capabilities of distributed processing environments, we enable geospatial operations to run natively within a variety of distributed platforms.

Let’s look first at the technology we provide; then at the process steps that enable it to work natively in a big data environment.

Innovative, modular technology

The Pitney Bowes Location Intelligence for Big Data solution comprises location technology software development kits (SDKs) that allow companies to GeoEnrich datasets and spatially aggregate results, condensing big data into a consumable output. It also includes APIs and data.

• Our Java-based SDKs can be transferred into any big data environment, such as Hadoop or Spark, so companies are not limited by their technology choices in a transient and evolving field.

• We offer 350+ datasets that can be used to add spatial context, serve as a container for aggregation, and be analyzed and visualized using our web-based mapping technologies.

Examples of performance achieved via the Pitney Bowes native and optimized technology strategy

These are actual numbers achieved in our clients’ use cases. We are continually improving the technology and can achieve much better performance using newer technologies like Spark.

Workload | Operation | Time | Infrastructure
US Parcel Centroid geocoding of 106 million addresses | Geocoding | 30 minutes | 5-node Hadoop cluster
1 billion mobile points spatially joined to 12 million points of interest | “Find the Nearest” spatial join | 36 minutes | 20-node EMR cluster on AWS
19 billion mobile call records aggregated to 950 million polygons | Point-in-polygon processing | 30 minutes | 56-node Hadoop cluster

Pitney Bowes Location Intelligence for Big Data capabilities

Location Intelligence SDK

• Takes spatial primitives (points, lines and polygons) and applies a geometric function (“contains”, “combine”, “intersects”, etc.) using additional spatial and aspatial data

• Enables creation of spatial queries such as “Aggregate point data within this polygon” or “Find the nearest point to this line”

• Can be used to GeoEnrich a dataset by appending additional attributes using customer data, third-party data or any of the 350+ datasets in the Pitney Bowes Global Data Catalog

• U.S. customers may also choose to utilize the pre-enriched Pitney Bowes Master Location Database (MLD) assets for U.S. postal addresses to further improve processing speeds and location accuracy for operational workflows

Global Geocoding API

• Geocoding turns a street address, place or point of interest into a latitude, longitude coordinate pair.

• Reverse geocoding takes the coordinate pair and gives a street address or administrative boundary.

Routing SDK

• Takes a known location (e.g. a retail store) and uses the road network to derive information such as equal drive times (isochrones) around that point, or the shortest path to that point.
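To make the “Aggregate point data within this polygon” style of query concrete, here is a minimal sketch using the open-source JTS topology suite for illustration (the Location Intelligence SDK exposes its own interfaces, not shown here):

import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import java.util.List;

public class ContainsAggregation {
    // Count how many points fall inside the polygon: the core of a query like
    // "Aggregate point data within this polygon".
    static long countPointsInPolygon(Geometry polygon, List<Coordinate> coords) {
        GeometryFactory factory = new GeometryFactory();
        return coords.stream()
                .map(factory::createPoint)   // Coordinate -> Point geometry
                .filter(polygon::contains)   // exact containment test
                .count();
    }
}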


Putting our technology to work

The diagram below uses the Global Geocoding API to illustrate this architecture, as well as how it can be used in various big-data-related processes. It shows how Pitney Bowes integrates geocoding capabilities natively into Hadoop.

The key components of the solution are the Global Geocoding API (GGA) and geocoding dictionary files.

• GGA is a collection of Jar files that can be used in writing Java-based MapReduce, Yarn or Spark applications.

• The geocoding dictionary data files can be pre-installed on all data nodes of the Hadoop cluster or distributed into the cluster dynamically before use.

[Figure: A MapReduce/Yarn application on the Hadoop NameNode calls the Geocoding SDK Jar; a geocoding dictionary resides on each data node, which processes input data into output data.]

Running geocoding within Hadoop

Listed below is an example of how Pitney Bowes can run geocoding as a Hadoop MapReduce batch job from the command line. Note that the user can set up different geocoding parameters in the config.xml, such as the dictionary to use and the fields to return. Both forward geocoding and reverse geocoding are supported.

[jun@osboxes ~]$: hadoop jar Geocoding_Hadoop.jar com.pb.mr.GeocodingDriver -input /addressdatafolder -output /geocoderesult -appConfig Geocode_config.xml

Making it more accessible

While a MapReduce batch job works well for users with a data-engineering background, it is not user-friendly for other data analysts. To make geocoding in Hadoop more accessible, we’ve developed a HIVE geocoding UDF so any user with a SQL background can use it in Hadoop (illustrated below with geocode as a placeholder for the UDF name). Most Pitney Bowes Location Intelligence capabilities can be deployed in Hadoop or Spark using an approach similar to the geocoding example above.

HIVE> select geocode(street, city, state, zip, ‘USA’) from customersAddTable;
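As a rough sketch of what such a Java MapReduce geocoding job looks like inside (the Geocoder interface and the geocoding.dictionary.path property below are hypothetical stand-ins, not actual GGA classes), a map-only job can geocode one address per input line:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a map-only geocoding job: each input line is an address,
// each output record is the address plus its latitude/longitude.
public class GeocodingMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Hypothetical stand-in for the Global Geocoding API classes.
    interface Geocoder {
        double[] geocode(String address); // {latitude, longitude}
    }

    private Geocoder geocoder;

    @Override
    protected void setup(Context context) {
        // Initialize the geocoder once per task; the property name is illustrative
        // (in the example above, parameters come from Geocode_config.xml).
        geocoder = createGeocoder(context.getConfiguration().get("geocoding.dictionary.path"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        double[] coord = geocoder.geocode(value.toString());
        context.write(value, new Text(coord[0] + "," + coord[1]));
    }

    private Geocoder createGeocoder(String dictionaryPath) {
        // Placeholder: a real job would load the GGA geocoder backed by
        // the dictionary files pre-installed on each data node.
        throw new UnsupportedOperationException("illustrative sketch only");
    }
}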

Optimizing geospatial data processing in Hadoop and Spark

Geospatial data processing is a fundamental step in almost all location-related big data applications. For example, to analyze users’ mobile records with GPS locations, ancillary data is added to provide context, such as the individual’s address or a nearby point of interest. To enable these GeoEnrichment processes at scale, a set of highly efficient geospatial processes, such as point-in-polygon or “find the nearest” site searches, is needed. These processes need to be optimized for big data technologies like Hadoop or Spark in order to leverage large-scale parallel computing power.

Breaking down the Pitney Bowes approach

Data preparation is critical to a highly performant spatial process. This requires enhancements to both spatial data partitioning at the cluster level and spatial data processing at the node level.

Cluster-level data partitioning dictates how large datasets are divided so they can be efficiently processed on a single node.

Node-level data processing optimizes spatial indexing and processing of small pieces of the data subset in a local node to expedite join query processing.

Using point-in-polygon analysis in Hadoop as an example, there are different types of strategies. These depend on the use cases and data to be analyzed. We will explore each of these below using the following large-scale point-in-polygon use-case example.

In many use cases the number of polygons to be evaluated is small. When this is true, it can be sufficient to use a broadcaster for evaluation, for example, to evaluate whether point records fall within polygons representing administrative boundaries. These types of use cases may account for the majority of traditional spatial aggregation and analysis.

In the context of the Internet of Things (IoT) there are many more use cases for which the simplistic broadcaster approach breaks down. It becomes overwhelmed by the volume of polygons that would need to be broadcast to every node and held in memory. Its spatial processes become prohibitively slow. This is where a different approach becomes essential. Pitney Bowes brings agility to these big-data queries, stepping up to market needs to expedite and optimize results.

Use case: Point-in-polygon

Objective
Join mobile log points with GPS information to store boundary polygons for the purpose of determining the store visit patterns of mobile users.

Challenge
Both data sets are too big to import to a single machine (terabytes of points; gigabytes of polygons).

Solution
A partition strategy and corresponding algorithm are needed.
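For the small-polygon case, the broadcaster approach can be sketched with Spark’s Java API and JTS geometries (a generic pattern, not the Pitney Bowes implementation): the polygon list is shipped once to every node and each point is tested locally:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;

public class BroadcastPointInPolygon {
    // Ship the (small) polygon list to every node once, then test each point locally.
    static JavaRDD<String> tagPoints(JavaSparkContext sc,
                                     JavaRDD<double[]> points,     // {lat, lon} pairs
                                     List<Geometry> boundaries) {  // small polygon set
        Broadcast<List<Geometry>> broadcast = sc.broadcast(boundaries);
        return points.map(p -> {
            // Coordinate takes (x, y) = (longitude, latitude).
            Point point = new GeometryFactory().createPoint(new Coordinate(p[1], p[0]));
            for (Geometry polygon : broadcast.value()) {
                if (polygon.contains(point)) {
                    return p[0] + "," + p[1] + ",inside";
                }
            }
            return p[0] + "," + p[1] + ",outside";
        });
    }
}

The same sketch makes the failure mode visible: broadcast.value() must fit in each executor’s memory, which is exactly what breaks down at IoT-scale polygon volumes.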


Cluster-level data partitioning

Cluster-level data partitioning comprises two main process steps:

01. Pre-partitioning
02. The matching process

01. Pre-partitioning

Pre-partitioning uses spatial attributes within the data to organize datasets for a big data file system (e.g. HDFS) prior to running the application. It allows the data to be queried or processed quickly as the application runs.

First, the nature of the data is examined to decide the best pre-partition approach.

Use case

In our use case, user mobile data logs are constantly streamed into HDFS daily. The store-boundary data is provided by data vendors like Pitney Bowes and updated quarterly. It is more efficient to pre-partition the store-boundary data than the mobile user data, updating this preparation once per quarter when the boundary data is updated.

There are multiple algorithms to partition boundary data; these range from space-oriented algorithms like Grid to data-oriented algorithms like R-tree. However, space-oriented algorithms are usually more parallel-friendly than data-oriented algorithms, so they are the preferred algorithm family at the cluster level.

Regular grid is the most commonly used algorithm in the industry. Figure 1 shows an example of how regular grid is used to partition a large polygon dataset.
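As a minimal sketch of this idea (a generic illustration with assumed world bounds and cell size, not the production partitioner), each polygon’s bounding box is mapped to the set of grid cells it overlaps, and each cell ID serves as a partition key:

import java.util.ArrayList;
import java.util.List;

public class RegularGridPartitioner {
    // World bounds in degrees and a fixed cell size define the grid.
    static final double MIN_LON = -180.0, MIN_LAT = -90.0;
    static final double CELL_SIZE = 1.0; // degrees per cell; tune per dataset
    static final int COLS = (int) Math.ceil(360.0 / CELL_SIZE);

    // A polygon is assigned to every cell its bounding box overlaps,
    // so a point only needs to be matched against one cell's polygons.
    static List<Integer> cellKeys(double minLon, double minLat,
                                  double maxLon, double maxLat) {
        int c0 = (int) ((minLon - MIN_LON) / CELL_SIZE);
        int c1 = (int) ((maxLon - MIN_LON) / CELL_SIZE);
        int r0 = (int) ((minLat - MIN_LAT) / CELL_SIZE);
        int r1 = (int) ((maxLat - MIN_LAT) / CELL_SIZE);
        List<Integer> keys = new ArrayList<>();
        for (int r = r0; r <= r1; r++) {
            for (int c = c0; c <= c1; c++) {
                keys.add(r * COLS + c); // single integer partition key per cell
            }
        }
        return keys;
    }
}

A point then maps to exactly one cell key, so matching points to candidate polygons becomes a simple key-equality join.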

Balancing the data load

However, there is one key drawback of the grid-based method: the spatial data distribution is often highly skewed.

Use case

In a single day there may be thousands of mobile user data records generated at Grand Central in New York City and zero records generated in the Arizona desert.

The grid method is likely to create partitions with high-density data tiles which, in turn, will cause load-balance issues in a Hadoop cluster-like environment.

To address this issue, Pitney Bowes has developed two algorithms:

• The bisect grid algorithm

• The adaptive tile-based algorithm

Figure 2 shows the results of a data load-balance comparison between the regular grid approach and the two new algorithms. The flatter the data distribution, the fewer the load-balancing issues, and the better the performance in the Hadoop cluster. You can see how much flatter the distribution is for the adaptive-tile algorithm. In point-in-polygon tests using large numbers of points and polygons, the adaptive-tile algorithm outperforms the regular grid method by more than twenty times.

[Figure 2: Data load distributions for the regular grid, bisect grid, and adaptive tile approaches.]
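The internals of these algorithms are not given here, but the general idea behind adaptive tiling can be sketched with a quadtree-style subdivision (a generic illustration, not the Pitney Bowes implementation): any tile holding more features than a threshold is split into quadrants, which flattens the per-partition load:

import java.util.ArrayList;
import java.util.List;

// Generic quadtree-style illustration of adaptive tiling: tiles holding
// too many features are recursively split, flattening per-partition load.
public class AdaptiveTileSketch {
    static final int MAX_PER_TILE = 10_000; // illustrative threshold
    static final int MAX_DEPTH = 20;        // guard against degenerate data

    static class Tile {
        final double minX, minY, maxX, maxY;
        Tile(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        boolean contains(double x, double y) {
            return x >= minX && x < maxX && y >= minY && y < maxY;
        }
    }

    // Split dense tiles into four quadrants until each tile is small enough.
    static void split(Tile tile, List<double[]> points, int depth, List<Tile> out) {
        if (points.size() <= MAX_PER_TILE || depth >= MAX_DEPTH) {
            out.add(tile); // sparse enough (or deep enough): keep as one partition
            return;
        }
        double midX = (tile.minX + tile.maxX) / 2;
        double midY = (tile.minY + tile.maxY) / 2;
        Tile[] quadrants = {
            new Tile(tile.minX, tile.minY, midX, midY),
            new Tile(midX, tile.minY, tile.maxX, midY),
            new Tile(tile.minX, midY, midX, tile.maxY),
            new Tile(midX, midY, tile.maxX, tile.maxY)
        };
        for (Tile q : quadrants) {
            List<double[]> subset = new ArrayList<>();
            for (double[] p : points) {
                if (q.contains(p[0], p[1])) subset.add(p);
            }
            split(q, subset, depth + 1, out);
        }
    }
}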


02. The matching process

After pre-partitioning the store boundary data, the spatial joining or query processes can be designed. This can be done, for example, using a MapReduce application as illustrated in Figure 3:

• All pre-partitioned store boundaries will be loaded first within each partition, including a partition key.

• Then mobile point data are loaded and matched to a corresponding partition key.

• The matching process is similar to a simple geohashing process and can be accomplished quickly.

• Then, matched pairs of data are imported into the reducer for spatial joining at the local level.

[Figure 3: Input Dataset A is partitioned using samples; Input Dataset B receives partition assignments; the partitioned data (Part 1: A1/B1, Part 2: A2/B2, Part 3: A3/B3) then undergoes a local join on each partition.]

In the example above, the mobile point data records were not pre-processed. However, they could be. For example, if the process had required repeated spatial querying or spatial joining, an additional pre-partitioning step could be added to pre-partition this point dataset using the partition results of the store-boundary dataset.

A spatial encoding step could also be executed. This would apply a geohashing-like algorithm to the latitude/longitude fields in each incoming point-data record during the streaming or data importing process, generate a variable-gridding-based key and append it to the point-data records. This key could then be used in an HDFS or NoSQL database for data storage indexing or partitioning, enabling fast spatial querying or joining of these point data in later uses.
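As a rough sketch of how the map phase of this matching step might look in Java (a generic pattern with an assumed input format and grid-key scheme, not the production code), each point is tagged with the key of the grid cell that contains it, so the shuffle delivers it to the reducer holding that cell’s store boundaries:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the matching step: each point record is tagged with the
// partition key of the grid cell containing it, so the shuffle routes it
// to the reducer holding that cell's pre-partitioned store boundaries.
public class PointMatchingMapper extends Mapper<Object, Text, IntWritable, Text> {
    static final double MIN_LON = -180.0, MIN_LAT = -90.0;
    static final double CELL_SIZE = 1.0; // must match the boundary partitioning
    static final int COLS = (int) Math.ceil(360.0 / CELL_SIZE);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line format assumed here: "lat,lon,deviceId"
        String[] fields = value.toString().split(",");
        double lat = Double.parseDouble(fields[0]);
        double lon = Double.parseDouble(fields[1]);
        int cell = (int) ((lat - MIN_LAT) / CELL_SIZE) * COLS
                 + (int) ((lon - MIN_LON) / CELL_SIZE);
        context.write(new IntWritable(cell), value);
    }
}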

Geospatial processing at the node level

Geospatial processing at the node level also comprises two main process steps: building the local spatial index, and applying detailed geometry operations.

Use case

After both the mobile point data records and store-boundary data are partitioned, matched, and sent to different slave nodes, the operation at the node level is very similar to single-machine geospatial processing.

Building the local spatial index

Build a local spatial index for data in the memory of the node dynamically when the application is started. Typically, a data-oriented spatial index algorithm like R-Tree is used at this level, rather than the space-oriented index algorithms that are preferred at the cluster level.

Applying detailed geometry operations

Apply the appropriate detailed geometry operations between point data and polygon data. Different types of point-in-polygon analysis can be accommodated, from simply identifying whether a polygon contains a particular point, to advanced point-in-polygon analysis that can also return the distance from a point within a polygon to the polygon edges.

With this type of partition-based point-in-polygon analysis, users are able to join dynamic mobile logs with store boundaries within 30 minutes. They can quickly run multiple analyses daily to gain timely insights about customer-store visiting patterns.

This type of high-precision, high-speed, high-scalability spatial analysis on mobile data was previously difficult to accomplish at all. With the Pitney Bowes solution, execution is quick and insights are easy to assimilate.
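The node-level pattern described above can be illustrated with the open-source JTS topology suite, whose STRtree is an R-tree-style index (the Pitney Bowes SDK provides its own spatial indexing; this is just a sketch of the approach):

import java.util.List;
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.index.strtree.STRtree;

public class NodeLevelJoin {
    // Build an in-memory R-tree over this node's boundary polygons once,
    // then answer point-in-polygon queries against it.
    static STRtree buildIndex(List<Geometry> boundaries) {
        STRtree index = new STRtree();
        for (Geometry polygon : boundaries) {
            index.insert(polygon.getEnvelopeInternal(), polygon);
        }
        return index;
    }

    static Geometry findContainingPolygon(STRtree index, double lon, double lat) {
        Point point = new GeometryFactory().createPoint(new Coordinate(lon, lat));
        // The index narrows candidates by bounding box; an exact test finishes the job.
        for (Object candidate : index.query(point.getEnvelopeInternal())) {
            Geometry polygon = (Geometry) candidate;
            if (polygon.contains(point)) {
                return polygon;
            }
        }
        return null; // point is not inside any store boundary on this node
    }
}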

Built for today and tomorrow

Big data technology is advancing rapidly and will continue to evolve. Today we are starting to see Spark replacing Hadoop as the latest big data technology, and the pace of innovation continues.

Learn more

To learn more about Location Intelligence for Big Data, visit us at pitneybowes.com

Businesses also have diverse use cases. These require different technology options such as batch in Hadoop, real-time streaming in Storm, or interactive spatial querying in NoSQL databases like HBase.

Pitney Bowes takes an agile approach to this diverse and rapidly changing environment so you can:

• Reflect a high degree of understanding of both spatial processing and big data technology in each individual use case.

• Plug industry-leading capabilities into most big data components and platforms.

• Gain the flexibility to address myriad user requirements.

• Ensure highly efficient application of capabilities against any given spatial use case.

• Maximize the distributed nature of big data clusters to optimize today’s high-data-volume applications.

With the right technology and capabilities, you can capitalize on extraordinary big-data insights.

United States 3001 Summer Street Stamford, CT 06926-0700 800 327 8627 [email protected]

Europe/United Kingdom The Smith Centre The Fairmile Henley-on-Thames Oxfordshire RG9 6AB 0800 840 0001 [email protected]

Canada 5500 Explorer Drive Mississauga, ON L4W5C7 800 268 3282 [email protected]

Australia/Asia Pacific Level 1, 68 Waterloo Road Macquarie Park NSW 2113 +61 2 9475 3500 [email protected]

For more information, visit us online: pitneybowes.com

Pitney Bowes and the Corporate logo are trademarks of Pitney Bowes Inc. or a subsidiary. All other trademarks are the property of their respective owners. © 2016 Pitney Bowes Inc. All rights reserved. 16DC03768_US