The Data Lake Engine

Data Microservices in Spark

using Apache Arrow Flight

Apache Arrow: Primer

Arrow has become the industry standard for in-memory data

Arrow powers dozens of open source & commercial technologies

>12M monthly downloads & growing exponentially

10+ programming languages supported: Java, C++, Python, JavaScript, C#, Ruby, Rust, Go, …

Arrow’s adoption provides numerous benefits:
• 300+ developers contributing
• Broad architecture (CPU/GPU/FPGA), OS and language support
• Awareness & OSS thought leadership

What is Arrow?

What is it?

● A specification that outlines the in-memory binary layout of data
● A set of libraries and tools
● A set of standards to make analytical data transportable
● Representation for efficient analytical processing on CPUs and GPUs

What isn’t it?

● It’s not an installable system
● It’s not a memory grid or in-memory cache
● It’s not designed for streaming or other single record operations (e.g. transactions)

Arrow In Memory Columnar Format
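
To make the row-versus-column distinction concrete, here is a toy sketch in plain Python. It uses the standard-library `array` module as a stand-in for contiguous typed buffers; it is not the Arrow library or its actual memory layout, and the names `rows`, `ids`, `prices` and the helper functions are made up for illustration.

```python
from array import array

# Row-oriented: one tuple per record, so the fields of a record sit together
# and the values of any one field are scattered across records.
rows = [(1, 10.5), (2, 20.0), (3, 7.25)]

# Column-oriented: one contiguous, typed buffer per field.
# (Illustrative stand-in only; real Arrow buffers also carry validity bitmaps etc.)
ids = array("q", (record[0] for record in rows))     # signed 64-bit integers
prices = array("d", (record[1] for record in rows))  # 64-bit floats


def total_price_rows(records):
    """Aggregate by walking every record and picking out one field."""
    return sum(record[1] for record in records)


def total_price_columnar(price_column):
    """Aggregate by scanning a single contiguous column."""
    return sum(price_column)


def get_record(index):
    """Random access: rebuild one record by indexing each column."""
    return ids[index], prices[index]
```

Both aggregation functions return the same total; the columnar version only touches the one buffer it needs, which is the property that pipelining, SIMD and cache locality build on.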

● Shredded Nested Data Structures
● Randomly Accessible
● Maximize CPU throughput
  ○ Pipelining
  ○ SIMD
  ○ Cache locality
● Scatter/gather I/O

[Figure: traditional memory layout vs. Arrow memory layout]

Example Arrow Building Blocks

Gandiva
● LLVM-based JIT compilation for execution of arbitrary expressions against Arrow data structures

Arrow Flight
● RPC/IPC interchange library for efficient interchange of data between processes

Feather
● Fast ephemeral format for movement of data between R/Python

Parquet
● Read and write Arrow quickly to/from Parquet. C++ library builds directly on Arrow.

Apache Arrow Flight

Arrow Flight

● High performance wire protocol
● Focused on bulk transfer for analytics
● Full delivery of Arrow interoperability promise
● Cross-platform
● Built for parallel
● Designed for Security

Arrow Data Paradigm: Streams of Batches

● Primary Communication:
  ○ A stream of Arrow record batches
  ○ Bulk transfer targeting efficient movement
  ○ Effectively peer-to-peer
● Specific Methods:
  ○ Put Stream: Client sends a stream to server
  ○ Get Stream: Server sends a stream to client
  ○ Both initiated by Client
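
The put/get pattern can be sketched with a toy in-process model. This is plain Python, not the Arrow Flight API: `ToyServer`, `put_stream`, `get_stream` and `to_batches` are hypothetical names, and a "batch" here is just a dict of column name to values standing in for an Arrow record batch.

```python
from typing import Dict, Iterable, Iterator, List

# Toy stand-in for an Arrow record batch: column name -> list of values.
Batch = Dict[str, list]


def to_batches(column: str, values: List[int], batch_size: int) -> Iterator[Batch]:
    """Split a column of values into a stream of fixed-size batches."""
    for start in range(0, len(values), batch_size):
        yield {column: values[start:start + batch_size]}


class ToyServer:
    """Hypothetical in-process 'server' that stores named streams of batches."""

    def __init__(self) -> None:
        self._store: Dict[str, List[Batch]] = {}

    def put_stream(self, name: str, batches: Iterable[Batch]) -> int:
        """Put: the client sends a stream; return how many batches arrived."""
        received = list(batches)
        self._store[name] = received
        return len(received)

    def get_stream(self, name: str) -> Iterator[Batch]:
        """Get: the server sends the stored stream back, one batch at a time."""
        yield from self._store[name]


# Both calls are initiated by the client side.
server = ToyServer()
acked = server.put_stream("numbers", to_batches("n", list(range(10)), batch_size=4))
total = sum(sum(batch["n"]) for batch in server.get_stream("numbers"))
```

The point of the sketch is that the unit of exchange is a whole batch rather than a single record, so per-message overhead is paid once per batch.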

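[Diagram: Put: the client sends a Header, a series of Data messages and an end marker to the server, which replies "Thanks". Get: the client sends a Descriptor, and the server returns a Header, a series of Data messages and an end marker.]

Big Datasets Require Parallel Streams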

● Parallel consumption and locality awareness
  ○ A flight is composed of streams
  ○ Each stream has a FlightEndpoint: an opaque stream ticket along with a consumption location
  ○ Systems can take advantage of location information to improve data locality
● Flights have two reference systems:
  ○ Dotted path namespace for simple services (e.g. marketing.yesterday.sales)
  ○ Arbitrary binary command descriptor (e.g. “select a,b from foo where c > 10”)
● Support for Stream Listing
  ○ ListFlights (Criteria)
  ○ GetFlightInfo (FlightDescriptor)

[Figure: a Flight made up of Streams spread across Host 1 and Host 2; each Endpoint is retrieved with a Ticket]
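
The endpoint idea above can be sketched as a toy model in plain Python. This is not the Arrow Flight client API: the `Endpoint` and `FlightInfo` classes, the `HOST_DATA` table, the ticket values and the `get_flight_info` and `fetch` helpers are all hypothetical, chosen only to show a consumer fetching every stream of a flight in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Endpoint:
    ticket: bytes   # opaque to the client; only the server interprets it
    location: str   # where this stream can be consumed


@dataclass(frozen=True)
class FlightInfo:
    descriptor: str
    endpoints: List[Endpoint]


# Hypothetical data placement: each host serves the streams it holds locally.
HOST_DATA: Dict[str, Dict[bytes, List[int]]] = {
    "host1": {b"t0": [1, 2, 3], b"t1": [4, 5]},
    "host2": {b"t2": [6, 7], b"t3": [8, 9, 10]},
}


def get_flight_info(descriptor: str) -> FlightInfo:
    """Resolve a descriptor to one endpoint (ticket + location) per stream."""
    endpoints = [
        Endpoint(ticket, host)
        for host, tickets in HOST_DATA.items()
        for ticket in tickets
    ]
    return FlightInfo(descriptor, endpoints)


def fetch(endpoint: Endpoint) -> List[int]:
    """Redeem a ticket at its location (a dict lookup in this toy model)."""
    return HOST_DATA[endpoint.location][endpoint.ticket]


info = get_flight_info("marketing.yesterday.sales")
with ThreadPoolExecutor(max_workers=len(info.endpoints)) as pool:
    parts = list(pool.map(fetch, info.endpoints))
rows = sorted(value for part in parts for value in part)
```

Because each endpoint carries its own location, a scheduler could equally assign each ticket to a worker running near that host instead of a local thread.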

Flight Spark Source

Spark DataSource V2

● Columnar support

● Transactions

● Partitions

● Better support for pushdowns

Flight Spark Source

● Uses Columnar Batch to leverage Spark’s Arrow support

● Supports pushdown of filters and projects

● Partitioned by Arrow Flight ticket

Benchmarks

● 4x node EMR querying 4x node Dremio AWS Edition (m5d.8xlarge)
● Return n rows to Spark executors then perform a non-trivial calculation
● Table shows t1 (t2) where t1 is total time and t2 is only transport time
● All units are seconds

Data Size        JDBC            Serial Flight   Parallel Flight   Parallel Flight - 8 nodes
100,000          3.84 (1)        1 (1)           2.9 (2.21)        3.78 (3.02)
1,000,000        6.5 (2.8)       1.41 (1)        3.07 (2.76)       4.38 (2.98)
10,000,000       25.88 (22.9)    8.05 (4.3)      6.25 (3.43)       8.19 (4)
100,000,000      223 (220)       109 (105)       18.72 (11)        8.53 (10)
1,000,000,000    n/a             n/a             36.6 (16)         18.9 (15)

Demo Time!

Thanks!

Let me know your thoughts
○ [email protected]
○ https://github.com/rymurr

Benchmarks
● Flight: https://bit.ly/32IWvCB
● Spark Connector: https://bit.ly/3bpR0Ni

Join the Arrow Community
○ @apachearrow
○ [email protected]
○ arrow.apache.org

Code Examples
● Arrow Flight Example Code: https://bit.ly/2XgjmUE

Try out Dremio
○ bit.ly/dremiodeploy
○ community.dremio.com