Accelerating Big Data with RDMA solutions

HPC Advisory Council, June 2013
© 2012 Mellanox Technologies - Mellanox Confidential

Leading Supplier of End-to-End Interconnect Solutions

[Diagram: Virtual Protocol Interconnect (VPI) connecting server/compute, switch/gateway, and storage front/back-end over 56G InfiniBand and FCoIB, 10/40/56GbE and FCoE, and Fibre Channel.]

Comprehensive End-to-End InfiniBand and Ethernet Portfolio

ICs, Adapter Cards, Switches/Gateways, Host/Fabric Software, Cables

Three Areas for Acceleration

. Data Analytics
  • Explore inefficiencies in existing analytics frameworks and systems
  • Accelerate data processing to deliver faster results

. Storage
  • Explore ways to refine the dominant file system
  • Take advantage of direct-attached disks to accelerate data access

. Distributed Storage
  • Leverage popular distributed storage systems with Big Data applications
  • Use existing systems with Big Data frameworks

Motivation to Accelerate Data Analytics

. Data Analysis Requires a Faster Network
  • The Hadoop Map Reduce framework is a network-intensive workload
    - Mapped data is shuffled between nodes in the cluster
  • Data replication
    - A high-availability event triggers multiple terabytes of data movement

. Provide Higher Data Value
  • Expose SSDs' low-latency capabilities
  • Better server/CPU utilization

Big Data Applications Require a High-Bandwidth, Low-Latency Interconnect
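To put the data-movement point in numbers, here is a back-of-envelope sketch in Python. It assumes ideal line rate with no protocol overhead, and the 1 TB payload is only an illustrative figure. The link speeds are the nominal ones named in this deck. Real shuffle and re-replication traffic will be slower.

```python
# Back-of-envelope transfer times at ideal line rate (no protocol overhead).
# Link speeds are the nominal figures from the deck; the 1 TB payload is an
# assumed example, not a measured workload.

TERABYTE = 1e12  # decimal terabyte, in bytes


def transfer_seconds(num_bytes: float, link_gbps: float) -> float:
    """Seconds needed to move num_bytes over a link of link_gbps gigabits/s."""
    bits = num_bytes * 8
    return bits / (link_gbps * 1e9)


if __name__ == "__main__":
    for gbps in (1, 10, 40, 56):
        secs = transfer_seconds(TERABYTE, gbps)
        print(f"{gbps:>2} Gb/s: {secs / 60:7.1f} minutes per TB")
```

Under these assumptions the time per terabyte falls in direct proportion to link speed, which is why multi-terabyte re-replication events are sensitive to the interconnect.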

* Data source: Intersect360 Research, 2012, IT and data scientists survey

Hadoop Framework

. A scalable, fault-tolerant distributed system for data storage and processing

. Hadoop has two main systems
  • Hadoop Distributed File System: self-healing, high-bandwidth clustered storage.
  • MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

. Key values
  • Flexibility – Store any data, run any analysis.
  • Scalability – Start at 1 TB/3 nodes, grow to petabytes/1000s of nodes.
  • Economics – Cost per TB at a fraction of traditional options.

[Diagram: Hadoop stack showing Hive, Pig, Map Reduce, and HBase on top of HDFS™ (Hadoop Distributed File System), backed by local disks.]

Unstructured Data Accelerator - UDA

. Plug-in architecture
  • Open-source, latest GA version 3.1 (6/10/2013)
  • Code repository at: https://code.google.com/p/uda-plugin/

. Accelerates Map Reduce Jobs
  • Accelerated merge sort

. Efficient Shuffle Provider
  • Data transfer over RDMA
  • Supports InfiniBand and Ethernet

. Supported Hadoop Distributions
  • Apache 3.0 – In the main trunk!
  • Apache 2.0.3 – In the main trunk*!
  • 1.0.x; 1.1.x
  • Cloudera Distribution Hadoop 3 update 4 (CDH3u4)
  • Cloudera Distribution Hadoop 4 (CDH4)
  • Hortonworks HDP 1.1

. Supported Hardware
  • ConnectX®-3 VPI
  • SwitchX-2 based systems

[Diagram: Hadoop stack showing Hive, Pig, Map Reduce, and HBase on top of HDFS™ (Hadoop Distributed File System), backed by local disks.]

Map Reduce Serialization

New Pipelined Data Flow

[Figure: timeline of the map stage followed by shuffle, merge, and reduce, shown for the existing flow and for the new shuffle/merge algorithm with header fetch; horizontal axis is time from job start.]
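The merge step in this flow combines key-sorted map outputs into one sorted stream for the reducer. The sketch below illustrates that idea with Python's standard `heapq.merge`. It is not UDA's implementation, which is a C++ plugin that moves data over RDMA. The `Record` shape and the sample data are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, int]  # (key, value); a hypothetical record shape


def merge_map_outputs(segments: Iterable[Iterable[Record]]) -> Iterator[Record]:
    """Lazily merge key-sorted segments into a single key-sorted stream."""
    return heapq.merge(*segments, key=lambda rec: rec[0])


def reduce_sum(merged: Iterable[Record]) -> List[Record]:
    """Sum values per key; equal keys are adjacent after the merge."""
    out: List[Record] = []
    for key, value in merged:
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + value)
        else:
            out.append((key, value))
    return out


if __name__ == "__main__":
    # Each inner list stands in for one map task's sorted output.
    map_outputs = [
        [("apple", 1), ("cherry", 2)],
        [("apple", 3), ("banana", 1)],
        [("banana", 4), ("cherry", 1)],
    ]
    print(reduce_sum(merge_map_outputs(map_outputs)))
```

Because the merge is lazy, the reducer can start consuming records as soon as the head of every segment is available, which is the property a pipelined shuffle/merge relies on.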

UDA - Software Architecture

[Diagram: software stack.
 Hadoop (Java): JobTracker; TaskTrackers running a MapTask and a ReduceTask.
 UDA Plugin (C++): MOFSupplier, NetMerger, Data Engine, RDMA Server, RDMA Client, Merging Threads.
 RDMA NIC / HCA.]

Double Map Reduce Performance with UDA

[Chart: TeraSort* results** over FDR InfiniBand. Callouts: Disk Access ~50%; CPU Efficiency 2.5X.]

~2X Faster Job Completion! Increase the Value of Data!

*TeraSort is a popular benchmark used to measure the performance of a Hadoop cluster.
**1 TB data set; 16x dual X5670 (Westmere) machines; 10x HDD. Base: vanilla GPHD 1.2; UDA: GPHD 1.2 + UDA.

HiBench Benchmark Results

. HiBench is a combined test suite from Intel
  • Tests: I/O, Map Reduce, machine learning, clustering, and search applications
. A faster network provides between 15% and 100% performance improvement!
  • Some applications are more I/O-bound than others

Cassandra, Initial Results

. Linearly scalable, column index database
. Enable 30% more queries
. Cut latency gaps by 50%

Three Areas for Acceleration

. Data Analytics
  • Explore inefficiencies in existing analytics frameworks and systems
  • Accelerate data processing to deliver faster results

. Storage
  • Explore ways to refine the dominant file system
  • Take advantage of direct-attached disks to accelerate data access

. Distributed Storage
  • Leverage popular distributed storage systems with Big Data applications
  • Use existing systems with Big Data frameworks

The Great Things in Hadoop Distributed File System

• HDFS is a block storage solution
• Block size can be modified to provide efficient solutions for very large files
• Inherent reliability: no need for a high-end storage solution to make sure the data is there!
• Tuned for Hadoop workloads: write once, read many
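A small Python sketch can show how block size and replication interact for a very large file. The block sizes below are assumed example values and the 1 TiB file is hypothetical. The replication factor of 3 is the HDFS default.

```python
# Block count and raw footprint for one file under assumed HDFS settings.
# Block sizes and the 1 TiB file size are illustrative assumptions.

MIB = 1024 ** 2
TIB = 1024 ** 4


def block_count(file_bytes: int, block_bytes: int) -> int:
    """Number of blocks needed; the last block may be only partly filled."""
    return -(-file_bytes // block_bytes)  # ceiling division on integers


def raw_bytes(file_bytes: int, replication: int = 3) -> int:
    """Bytes consumed across the cluster once every block is replicated."""
    return file_bytes * replication


if __name__ == "__main__":
    for block_mib in (64, 128, 256):
        blocks = block_count(TIB, block_mib * MIB)
        print(f"{block_mib:>3} MiB blocks: {blocks} blocks for a 1 TiB file")
    print(f"raw footprint at 3x replication: {raw_bytes(TIB) // TIB} TiB")
```

Larger blocks mean fewer blocks to track for the same file, while the raw footprint depends only on the replication factor.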

The Less Great Things in HDFS

• Metadata server failure
• Default 3x replication
• Small files or latency-sensitive
• It's hard to manage the different settings to get the right nodes into the right capabilities.
• Ingress and extraction of data require additional tools.

Considerations When Planning Capacity

• Growth Rate
• Cost of Storage
• Data Retention
• Do you need real-time analytics?
• Value per byte? If it is not high, is it worth storing on high-performance storage?
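The considerations above (growth rate, retention, cost of storage) can be combined into a rough estimate. The Python sketch below uses a deliberately simple model with compounding monthly ingest and a fixed retention window. The model and every number in it are assumptions for illustration, not figures from this deck.

```python
# Rough capacity model: compounding monthly ingest, fixed retention window,
# then replication and cost. All inputs below are hypothetical.


def retained_tb(first_month_ingest_tb: float, monthly_growth: float,
                retention_months: int, month: int) -> float:
    """Data in TB (before replication) held at the end of `month` (0-based).

    Ingest in month m is first_month_ingest_tb * (1 + monthly_growth) ** m,
    and only the most recent `retention_months` months are kept.
    """
    oldest = max(0, month - retention_months + 1)
    return sum(first_month_ingest_tb * (1 + monthly_growth) ** m
               for m in range(oldest, month + 1))


def raw_tb(data_tb: float, replication: int = 3) -> float:
    """Raw capacity needed once each block is stored `replication` times."""
    return data_tb * replication


def storage_cost(raw_capacity_tb: float, cost_per_tb: float) -> float:
    """Total cost for the raw capacity at a flat price per TB."""
    return raw_capacity_tb * cost_per_tb


if __name__ == "__main__":
    data = retained_tb(first_month_ingest_tb=10.0, monthly_growth=0.05,
                       retention_months=12, month=23)
    raw = raw_tb(data)
    print(f"retained: {data:.1f} TB, raw at 3x: {raw:.1f} TB, "
          f"cost at 100 per TB: {storage_cost(raw, 100.0):.0f}")
```

Shortening the retention window or lowering replication for low-value data reduces the raw figure directly, which is the trade-off the value-per-byte question points at.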

HDFS Acceleration; Joint Project With Ohio State University

. HDFS is the Hadoop file system
  • The underlying file system for HBase and other NoSQL databases
. More drives: higher throughput is needed
. SSD solutions must use higher throughput
  • Bounded by 1GbE and 10GbE

[Diagram: Hadoop stack showing Hive, Pig, Map Reduce, and HBase on top of HDFS™ (Hadoop Distributed File System), backed by local disks.]

Unlocking the Power of SSDs in a Hadoop Environment

. SSDs Become the De Facto Standard in HDFS Deployments
  • Read capability is a critical factor for application performance

. E-DFSIO, part of Intel's HiBench test suite, profiles aggregated throughput on the cluster
  • A 1GbE network impedes any performance benefit from SSD deployment

[Chart: E-DFSIO, Showing the Power of SSD @ HDFS]
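Why a slow link hides the SSD benefit can be sketched as a simple bottleneck calculation in Python. For data read by a remote client, a node's useful throughput is the smaller of what its drives can deliver and what its network link can carry. The drive count and per-drive rate are assumed example values, and the link rates are ideal line rates.

```python
# Bottleneck model for remote reads: throughput is capped by the slower of
# storage and network. Drive figures are hypothetical; links are ideal rates.


def effective_read_mb_s(drives: int, per_drive_mb_s: float,
                        link_gbps: float) -> float:
    """Per-node read throughput in MB/s as seen by a remote client."""
    storage_mb_s = drives * per_drive_mb_s
    network_mb_s = link_gbps * 1000 / 8  # Gb/s to MB/s, no protocol overhead
    return min(storage_mb_s, network_mb_s)


if __name__ == "__main__":
    for gbps in (1, 10, 40):
        rate = effective_read_mb_s(drives=4, per_drive_mb_s=500.0,
                                   link_gbps=gbps)
        print(f"{gbps:>2} GbE: {rate:7.1f} MB/s per node")
```

With these assumed drives, the 1GbE case is limited entirely by the network, so faster storage brings no gain until the link is upgraded.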

Three Areas for Acceleration

. Data Analytics
  • Explore inefficiencies in existing analytics frameworks and systems
  • Accelerate data processing to deliver faster results

. Storage
  • Explore ways to refine the dominant file system
  • Take advantage of direct-attached disks to accelerate data access

. Distributed Storage
  • Leverage popular distributed storage systems with Big Data applications
  • Use existing systems with Big Data frameworks

OrangeFS as Hadoop Storage Solution

Lustre as Hadoop Storage Solution

Source: Map/Reduce on Lustre, Hadoop Performance in HPC Environments, Nathan Rutman, Senior Architect, Networked Storage Solutions, Xyratex

Ceph as Hadoop Storage Solution

. Generating a lot of interest since the Ceph kernel client was pulled into Linux kernel 2.6.34
  • Object-based parallel file system
  • Scalable metadata server
  • Each file can specify its own striping strategy and object size
  • Automatic rebalancing of data with minimal data movement
  • A Hadoop module for integrating Ceph has been in development since the 0.12 release

. Benchmarks on Ceph are still a work in progress
  • We are currently working on running benchmarks on Ceph. Stay tuned!

Cloudera Certified – CDH3 and CDH4

. Mellanox VPI Card
  • MCX354A-FCBT

. Mellanox Edge Switches
  • MSX10xx; MSX60xx

Simple Building Block for Big Data Solution

. E5-26x0 (Sandy Bridge) Machines
  • Dual socket
  • 4+ cores per socket
  • 32GB+ of DRAM

. Disk Drives
  • At least 5 x 1TB, SAS, 10K RPM

. Hadoop Configuration
  • At least one Name Node + Job Tracker
  • At least 4 Data Nodes

. Installation
  • Your selection of Hadoop distribution or other Big Data solution (such as Cassandra)

. Networking
  • ConnectX-3 VPI card, FDR, 40GbE and 10GbE
  • SwitchX based systems: MSX6036F, MSX1036B and MSX1016
  • Mellanox's FDR, 40GbE and 10GbE cable solutions

. http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Hadoop.pdf

Test Drive Your Big Data

. EMC 1000-Node Analytic Platform
. Accelerates the industry's Hadoop development
. 24 petabytes of physical storage
  • Half of every written word since the inception of mankind
. Mellanox VPI Solutions

Hadoop Acceleration: 2X Faster Hadoop Job Run-Time

High Throughput, Low Latency, RDMA Critical for ROI

Thank You