Data Microservices in Spark Using Apache Arrow Flight

The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight Apache Arrow: Primer Arrow has become the industry standard for in-memory data Arrow powers dozens of open source & >12M monthly downloads & growing commercial technologies exponentially 10+ programming languages supported Arrow’s adoption provides numerous benefits: • 300+ developers contributing Java, C, C++, Python, • Broad architecture (CPU/GPU/FPGA), OS and language support R, JavaScript, C#, • Awareness & OSS thought leadership Ruby, Rust, Go, … What is Arrow? What is it? What isn’t it? ● A specification that outlines ● It’s not an installable system in-memory binary layout of data ● It’s not a memory grid or in-memory ● A set of libraries and tools cache ● A set of standards to make analytical ● It’s not designed for streaming or data transportable other single record operations (e.g. transactions) ● Representation for efficient analytical processing on CPUs and GPUs Arrow In Memory Columnar Format ● Shredded Nested Data Structures ● Randomly Accessible ● Maximize CPU throughput Traditional Arrow ○ Pipelining Memory Memory ○ SIMD ○ cache locality ● Scatter/gather I/O Example Arrow Building Blocks Gandiva Arrow Flight ● LLVM-based JIT compilation for ● RPC/IPC interchange library for execution of arbitrary expressions efficient interchange of data against Arrow data structures between processes Feather Parquet ● Fast ephemeral format for ● Read and write Arrow quickly movement of data between to/from Parquet. C++ library R/Python builds directly on Arrow. Apache Arrow Flight Arrow Flight ● High performance wire protocol ● Focused on bulk transfer for analytics ● Full delivery of Arrow interoperability promise ● Cross-platform FLIGHT ● Built for parallel ● Designed for Security Arrow Data Paradigm: Streams of Batches ● Primary Communication: ● Specific Methods: ○ A stream of Arrow record batches ○ Put Stream: Client sends a stream to server ○ Bulk transfer targeting efficient movement ○ Get Stream: Server sends a stream to client ○ Effectively peer-to-peer ○ Both initiated by Client end Data Data Data Put Header Thanks Client Server Get Descriptor Header Data Data Data end Big Datasets Require Parallel Streams ● Parallel consumption and locality awareness Flight ○ A flight is composed of streams Stream ○ Each stream has a FlightEndpoint: A opaque stream Host 1 ticket along with a consumption location Stream ○ Systems can take advantage of location information to improve data locality Stream ● Flights have two reference systems: Host 2 Stream ○ Dotted path namespace for simple services (e.g. marketing.yesterday.sales) ○ Arbitrary binary command descriptor: (e.g. “select a,b from foo where c > 10”) ● Support for Stream Listing Endpoint: Retrieved with Ticket ○ ListFlights (Criteria) ○ GetFlightInfo (FlightDescriptor) Flight Spark Source Spark DataSource V2 ● Columnar support ● Transactions ● Partitions ● Better support for pushdowns Flight Spark Source ● Uses Columnar Batch to leverage Spark’s Arrow support ● Supports pushdown of filters and projects ● Partitioned by Arrow flight ticket Benchmarks ● 4x node EMR querying 4x node Data Size JDBC Serial Parallel Parallel Flight - Dremio AWS Edition (m5d.8xlarge) Flight Flight 8 nodes ● Return n rows to spark executors then 100,000 3.84 (1) 1 (1) 2.9 (2.21) 3.78 (3.02) perform a non-trivial calculation 1,000,000 6.5 (2.8) 1.41 (1) 3.07 (2.76) 4.38 (2.98) ● Table shows t1 (t2) where t1 is total 10,000,000 25.88 (22.9) 8.05 (4.3) 6.25 (3.43) 8.19 (4) time and t2 is only transport time 100,000,000 223 (220) 109 (105) 18.72 (11) 8.53 (10) ● All units are seconds 1,000,000,000 n/a n/a 36.6 (16) 18.9 (15) Demo Time! Thanks! Let me know your thoughts Benchmarks ○ [email protected] ● Flight: https://bit.ly/32IWvCB ○ https://github.com/rymurr ● Spark Connector: https://bit.ly/3bpR0Ni Join the Arrow Community ○ @apachearrow ○ [email protected] Code Examples ○ arrow.apache.org ● Arrow Flight Example Code: Try out Dremio https://bit.ly/2XgjmUE ○ bit.ly/dremiodeploy ○ community.dremio.com.

Data Microservices in Spark Using Apache Arrow Flight

Quick Install for AWS EMR

Delft University of Technology Arrowsam In-Memory Genomics

The Platform Inside and out Release 0.8

Arrow: Integration to 'Apache' 'Arrow'

In Reference to RPC: It's Time to Add Distributed Memory

Ray: a Distributed Framework for Emerging AI Applications

Zero-Cost, Arrow-Enabled Data Interface for Apache Spark

ECSEL2017-1-737451 Fitoptivis from the Cloud to the Edge

Open Source Software Packages

Enabling Geospatial in Big Data Lakes and Databases with Locationtech Geomesa

Release Notes Date Published: 2020-10-13 Date Modified

Processingではじめるプログラミング楽しく、簡単に、Javaベースの言語を学ぶ (I/O)/加藤直樹/ 著 IO編集部/編集/NEOB