Hadoop on OpenStack: Scaling Hadoop-SwiftFS for Big Data

October 29th, 2015
Andrew Leamon – Director, Engineering Analysis
Christopher Power – Principal Engineer, Engineering Analysis

About Comcast

Comcast brings together the best in media and technology. We drive innovation to create the world's best entertainment and online experiences.

• High Speed Internet
• Video
• IP Telephony
• Home Security / Automation
• Universal Parks
• Media Properties

About Our Team: Engineering Analysis

[Diagram: business lines – High Speed Data, Video, Business, IP Telephony, Home Security / Automation – mapped to team functions: Feature Engineering, Exploratory Analysis, Reporting / Ad-hoc, Data Simulation, Visualization / Machine Learning – on an OpenStack-based analysis platform]

Hadoop Overview

Hadoop / Cloud Evolution – Why does this make sense?

Network bandwidth is growing faster than disk I/O
• Doubling every 18 months vs. 24 months
• Network is faster than disk
• Location of the disk is not as important
• IOPS are the key metric

Courtesy of the Ethernet Alliance: https://www.nanog.org/meetings/nanog56/presentations/Tuesday/tues.general.kipp.23.pdf

Hadoop / Cloud Evolution – Memory Growth

• 2003 – MapReduce paper published
• 2005 – Hadoop born
• 2012 – Spark released: leverage main memory and avoid disk I/O
• 2014 – Apache Tez released to avoid disk I/O

Available main memory per server has increased greatly.

Courtesy of Centiq: https://centiq.co.uk/sap-sizing

Hadoop / Cloud Evolution – Performance Increasing

• Performance of everything has been increasing, with the exception of HDDs

Courtesy of Cisco: http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs--series-rack-servers/whitepaper-C11-734798.html

Hadoop / Cloud Evolution – Disk is the long pole!

Factors that make Hadoop on the cloud possible
• Disk is the long pole; network is additive but proportional
• Many workloads are CPU-bound anyway
• Compression and columnar formats reduce I/O and leverage CPU
• Servers have more memory: keep data in memory whenever possible
• Avoid I/O at all costs: only read once, only write once
• Locality is less important
• MPP frameworks like Spark & Tez make this possible

Hadoop Scaling – Coupled Storage & Compute

• On bare metal, storage and compute are coupled together.
• Scaling one means you have to scale the other proportionally.

Hadoop Scaling – Decoupled Storage & Compute

• On OpenStack, compute and storage can be decoupled using Swift as the object store. This allows you to:
• Scale compute and storage independently
• Run multiple clusters simultaneously
• Provide greater access to data

Big Data Platform

[Diagram: Big Data Platform backed by Cinder block storage and Swift object storage]

OpenStack @ Comcast

• Vanilla distribution of OpenStack

• Multiple data centers

• Multi-tenant, multi-region

• Nova, Neutron (with IPV6 support), Glance

• Cinder block, Swift object provided by CEPH

• Ceilometer metrics

• Heat orchestration

Anatomy of Hadoop on the Cloud

Design for the cloud
• Assume things will fail
• Distribute load for performance and fault tolerance
• Use persistent storage where appropriate

Think elastically, scale horizontally
• Scale intelligently to meet demand
• Return resources when not in use

Leverage automation
• Automate, automate, automate
• Increase efficiency and repeatability

Performance and Fault Tolerance

Affinity and Anti-affinity

• OpenStack allows the user to explicitly specify whether a group of VMs should or should not share the same physical hosts

• Create ServerGroup with Anti-Affinity and provide scheduler hint during boot

• Improves performance in a multi-tenant environment by spreading CPU and Network load across physical hosts

• Provides a mechanism to increase fault tolerance by scheduling critical services on mutually-exclusive physical hosts

Courtesy of Cloudwatt dev: https://dev.cloudwatt.com/en/blog/affinity-and-anti-affinity-in- .html
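
As a rough sketch of the ServerGroup step, the example below creates an anti-affinity group through the Nova v2 REST API. The endpoint, tenant ID, and token are placeholders, error handling is omitted, and a real client would parse the JSON response for the group UUID:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch: create an anti-affinity ServerGroup via the Nova v2 REST API.
    // NOVA_ENDPOINT and AUTH_TOKEN are placeholders for values obtained
    // from Keystone.
    public class AntiAffinityGroupSketch {
        static final String NOVA_ENDPOINT = "http://nova.example.com:8774/v2/TENANT_ID";
        static final String AUTH_TOKEN = "KEYSTONE_TOKEN";

        public static void main(String[] args) throws Exception {
            String body = "{\"server_group\": {\"name\": \"hadoop-masters\","
                        + " \"policies\": [\"anti-affinity\"]}}";
            HttpURLConnection conn = (HttpURLConnection)
                new URL(NOVA_ENDPOINT + "/os-server-groups").openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("X-Auth-Token", AUTH_TOKEN);
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream os = conn.getOutputStream()) {
                os.write(body.getBytes("UTF-8"));
            }
            // The JSON response carries the group's UUID; pass it as the
            // scheduler hint when booting each VM:
            //   "os:scheduler_hints": {"group": "<uuid>"}
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }

The returned group UUID is then supplied under os:scheduler_hints when booting each VM, which is the scheduler hint referred to above.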

Cluster Node Storage Architecture

Cinder Block Storage (CEPH RBD)
• Persistent storage for all cluster nodes (root volume, DataNode HDFS)
• Can act as NodeManager local disk

Ephemeral Block Device(s) (Local Disk)
• High performance direct-attached storage
• Root volume, local disk
• Best for NodeManager local disk

[Diagram: cluster node VM with Cinder volume (HDFS) and ephemeral local disk attached via libvirt/CEPH, plus OpenStack Swift]

Swift (CEPH RadosGW)

• Data lake, unified central storage

• Source, destination for job data

Local Storage – Cinder vs Ephemeral

How important is ephemeral storage for big data workloads on the cloud?

• Traditional Hadoop jobs are read/write intensive during their intermediate stages
• Performant local disk is useful for transient data like shuffle/sort spills and logs
• Local disk location is configured using: yarn.nodemanager.local-dirs
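
For illustration, the same setting applied through the Hadoop Configuration API; the ephemeral mount paths below are hypothetical, and in practice this normally lives in yarn-site.xml:

    import org.apache.hadoop.conf.Configuration;

    // Sketch: point NodeManager scratch space at ephemeral-disk mounts.
    // The mount paths below are hypothetical examples.
    public class LocalDirsSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("yarn.nodemanager.local-dirs",
                     "/mnt/ephemeral0/yarn/local,/mnt/ephemeral1/yarn/local");
            System.out.println(conf.get("yarn.nodemanager.local-dirs"));
        }
    }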

Benchmarks
• TeraSort at 1TB
• DFSIO at 10x1GB, 100x10GB

Configurations
• a) Cinder volume – network attached
• b) Local ephemeral disk – direct attached

Local Storage – Cinder vs Ephemeral Results

• TeraSort – ephemeral showed a 29% wall clock improvement over Cinder
• DFSIO – negligible performance difference

[Chart: Local Storage Comparison – Cinder vs Ephemeral; relative job runtime for TeraSort (1TB), DFSIO (10x1GB), and DFSIO (100x1GB), ephemeral vs. Cinder]

Hadoop + SwiftFS 101

How does Hadoop interact with Swift?

• Hadoop-SwiftFS implements the Hadoop FileSystem interface on top of the OpenStack Swift REST API

[Diagram: Hadoop-SwiftFS running in each VM, talking over the network to OpenStack Swift]

Hadoop-SwiftFS
• Part of the Sahara project (Sahara-Extra)

• https://github.com/openstack/sahara-extra

Hadoop-OpenStack
• Part of Apache Hadoop, derived from Sahara-Extra

• https://github.com/apache/hadoop/tree/trunk/hadoop-tools/hadoop-openstack
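
To illustrate the FileSystem abstraction, a minimal sketch that lists a Swift container through the standard Hadoop API. The container and service names are placeholders, and the fs.swift.service.* credentials are assumed to be configured in core-site.xml:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: address a Swift container as swift://<container>.<service>/<path>
    // just like any other Hadoop FileSystem URI.
    public class SwiftListSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            URI uri = URI.create("swift://mycontainer.sahara/");
            FileSystem fs = FileSystem.get(uri, conf);
            for (FileStatus status : fs.listStatus(new Path(uri))) {
                System.out.println(status.getPath());
            }
        }
    }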

Challenges with Hadoop at scale on Swift

When we attempted to run jobs at scale, we noticed a few things:

• Large number of input splits

• Hadoop clients took a long time to launch jobs

• Swift only returned the first 10,000 objects

• Job output is written and then renamed

• CEPH cluster needed some tuning

Large number of input splits

Challenge

• Noticed a multiple of the typical number of input splits

Issues
• Hadoop uses blocksize to compute input splits

• Default Swift “blocksize” set to 32MB

Solution

• Set blocksize appropriate to your environment

• Example: fs.swift.blocksize=131072 (value in KB; 131072 KB = 128MB blocks)
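
A minimal sketch of setting the value programmatically; in practice this normally goes in core-site.xml:

    import org.apache.hadoop.conf.Configuration;

    // Sketch: raise the Swift "blocksize" from the 32MB default so split
    // computation matches a 128MB HDFS-style block (value is in KB).
    public class SwiftBlocksizeSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setInt("fs.swift.blocksize", 131072); // 131072 KB = 128 MB
            System.out.println(conf.getInt("fs.swift.blocksize", 0));
        }
    }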

Slow launching jobs

Challenge

• Hadoop clients took a long time to launch jobs

Issues

• Hadoop does not know it is talking to an object store

• Asks for metadata and block locations of every object

• Results in O(n) performance in the number of objects

Possible Approaches
• Multi-threading – only works at directory/partition level
• Override getSplits – tool-specific implementations

[Diagram: Hadoop client calling FileInputFormat.getSplits, which drives Hadoop-SwiftFS to fetch the container's object list]
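
To make the O(n) behavior concrete, an illustrative sketch of the calls split computation ends up making: one listing, then a block-location lookup per file, each of which Hadoop-SwiftFS turns into a REST request. The helper method is hypothetical; the real FileInputFormat.getSplits logic is more involved:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: why launch time grows linearly with the number of objects.
    public class SplitCostSketch {
        static List<BlockLocation[]> locateAll(FileSystem fs, Path input)
                throws Exception {
            List<BlockLocation[]> locations = new ArrayList<>();
            for (FileStatus file : fs.listStatus(input)) {    // one listing call
                locations.add(fs.getFileBlockLocations(       // one call per object
                    file, 0, file.getLen()));
            }
            return locations;
        }
    }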

Slow launching jobs – solution

Solution

• Extend support for the location-awareness flag to the get-block-locations methods

• Reduce unnecessary calls to get object metadata

• Localize changes to the Hadoop-SwiftFS layer

Benefits
• Jobs launch faster
• Reduces load on object store
• Works across tool ecosystem
• Improves interactive query experience

[Chart: Performance Improvement – job launch time (seconds) vs. number of objects in container (100 to 25,000), hadoop-swift-latest vs. with optimizations]

Swift only returns first 10,000 objects

Challenge

• Swift only returns the first 10,000 objects in a container or partition

• http://developer.openstack.org/api-ref-objectstorage-v1.html#showContainerDetails

Solution

• Page through list of objects using marker and limit query string parameters

• Continue until the number of items returned is less than the requested limit value

• Default set to 10000

• Configurable by setting fs.swift.container.list.limit
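
A minimal sketch of that paging loop against the Swift listing API directly; the endpoint and token are placeholders, and Swift's default plain-text listing returns one object name per line:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: page through a Swift container with marker/limit parameters.
    public class SwiftPagingSketch {
        static final String ENDPOINT = "http://swift.example.com/v1/AUTH_tenant/container";
        static final String TOKEN = "KEYSTONE_TOKEN";
        static final int LIMIT = 10000; // mirrors fs.swift.container.list.limit

        public static void main(String[] args) throws Exception {
            List<String> objects = new ArrayList<>();
            String marker = "";
            while (true) {
                URL url = new URL(ENDPOINT + "?limit=" + LIMIT
                        + "&marker=" + URLEncoder.encode(marker, "UTF-8"));
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestProperty("X-Auth-Token", TOKEN);
                int count = 0;
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                    String name;
                    while ((name = in.readLine()) != null) {
                        objects.add(name);
                        marker = name; // next page starts after the last name seen
                        count++;
                    }
                }
                if (count < LIMIT) break; // short page means we have everything
            }
            System.out.println(objects.size() + " objects");
        }
    }

The Hadoop-SwiftFS fix applies the same loop inside the FileSystem layer, with the page size taken from fs.swift.container.list.limit.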


Job output write and rename

• Hadoop's OutputCommitter writes task output to a temporary directory
• When the job completes, the temporary directory is renamed to the final output directory

Object stores are not file systems
• Rename results in a copy and delete, which is expensive

• Consequence of using the object's path as the hash to its storage location

Object store compatible OutputCommitter

• Basic approach skips the temporary write and outputs directly to the final destination
• Enhanced approach uses local ephemeral storage for temporary writes
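
A minimal sketch of what the basic approach can look like as a Hadoop OutputCommitter. This is hypothetical; a production committer must also handle task retries and speculative execution, which are exactly the cases the temporary-directory protocol exists for:

    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Sketch: tasks write straight to their final destination, so there
    // is nothing to rename (i.e., copy and delete) at commit time.
    public class DirectOutputCommitter extends OutputCommitter {
        @Override public void setupJob(JobContext context) { /* no temp dir to create */ }
        @Override public void setupTask(TaskAttemptContext context) { }
        @Override public boolean needsTaskCommit(TaskAttemptContext context) {
            return false; // output is already in place; skip the expensive rename
        }
        @Override public void commitTask(TaskAttemptContext context) { }
        @Override public void abortTask(TaskAttemptContext context) { }
    }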

CEPH Architecture and Tuning

Tuning for Hadoop Workloads
• Scale RadosGWs and CEPH OSD nodes
• Enable container index sharding
• Increase placement groups for the index pool
• Increase filestore merge.threshold and split.multiple configurations
• Turn off RadosGW logs

[Diagram: load balancer distributing requests across multiple RadosGW instances backed by CEPH OSD nodes]

Lessons Learned

• Get to know your OpenStack architecture

• Understand the impacts of your cluster design

• Use ephemeral local disk for NodeManager if possible

• Ensure consistent pseudo-directory representation

• Think about your container data organization

• Choose file formats that reduce I/O (ORC/Parquet)

Next Steps and Future Enhancements

Next Steps
• Upstream enhancements back to the community

Future Enhancements
• Keystone authentication token optimization

• Handle large numbers of partitions
• Streamline map task object retrieval
