Hadoop on OpenStack: Scaling Hadoop-SwiftFS for Big Data
October 29th, 2015
Andrew Leamon – Director, Engineering Analysis
Christopher Power – Principal Engineer, Engineering Analysis

About Comcast
Comcast brings together the best in media and technology. We drive innovation to create the world’s best entertainment and online experiences.
• High Speed Internet
• Video
• IP Telephony
• Home Security / Automation
• Universal Parks
• Media Properties
About Our Team: Engineering Analysis
• High Speed Data
• Video
• Business
• IP Telephony
• Home Security / Automation

[Diagram: analysis workloads – feature engineering, exploratory data analysis, reporting / ad-hoc analysis, simulation, visualization / analysis, machine learning – running on the Big Data Platform, on OpenStack]
Hadoop Overview

Hadoop / Cloud Evolution – Why does this make sense?
Network bandwidth is growing faster than disk I/O
• Doubling every 18 months vs. every 24 months
• Network is faster than disk
• Location of disk is not as important
• IOPS are the key metric
Courtesy of the Ethernet Alliance: https://www.nanog.org/meetings/nanog56/presentations/Tuesday/tues.general.kipp.23.pdf
Hadoop / Cloud Evolution – Memory Growth
Available main memory per server has increased greatly.
• 2003 – MapReduce paper was published
• 2005 – Hadoop born
• 2012 – Apache Spark released: leverage main memory and avoid disk I/O
• 2014 – Apache Tez released to avoid disk I/O
Courtesy of Centiq: https://centiq.co.uk/sap-sizing
Hadoop / Cloud Evolution – Performance Increasing
• The performance of everything has been increasing, with the exception of HDDs
Courtesy of Cisco: http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/whitepaper-C11-734798.html
Hadoop / Cloud Evolution – Disk is the long pole!

Factors that make Hadoop on the cloud possible
• Disk is the long pole; network is additive but proportional
• Many workloads are CPU-bound anyway
• Compression and columnar formats reduce I/O and leverage CPU
• Servers have more memory – keep data in memory whenever possible
• Avoid I/O at all costs: only read once, only write once
• Locality is less important
• MPP frameworks like Spark & Tez make this possible
Hadoop Scaling – Coupled Storage & Compute
• On bare metal, storage and compute are coupled together.
• Scaling one means that you have to scale the other proportionally.
Hadoop Scaling – Decoupled Storage & Compute
• On OpenStack, compute and storage can be decoupled by using Swift as the object store. This allows you to:
  • Scale compute and storage independently
  • Run multiple clusters simultaneously
  • Provide greater access to data
Big Data Platform

[Diagram: Big Data Platform layered on Cinder block storage and Swift object storage]
OpenStack @ Comcast
• Vanilla distribution of OpenStack
• Multiple data centers
• Multi-tenant, multi-region
• Nova, Neutron (with IPv6 support), Glance
• Cinder block, Swift object provided by CEPH
• Ceilometer metrics
• Heat orchestration
Anatomy of Hadoop on the Cloud

Design for the cloud
• Assume things will fail
• Distribute load for performance and fault tolerance
• Use persistent storage where appropriate

Think elastically, scale horizontally
• Scale intelligently to meet demand
• Return resources when not in use

Leverage automation
• Automate, automate, automate
• Increase efficiency and repeatability
Performance and Fault Tolerance
Affinity and Anti-affinity
• OpenStack allows the user to explicitly specify whether a group of VMs should or should not share the same physical hosts
• Create ServerGroup with Anti-Affinity and provide scheduler hint during nova boot
• Improves performance in a multi-tenant environment by spreading CPU and Network load across physical hosts
• Provides a mechanism to increase fault tolerance by scheduling critical services on mutually-exclusive physical hosts
Courtesy of Cloudwatt dev: https://dev.cloudwatt.com/en/blog/affinity-and-anti-affinity-in-openstack.html
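The ServerGroup flow above boils down to two Nova API request bodies. A minimal sketch follows; the group name, server names, and all UUIDs are illustrative, not taken from the deck:

```python
# Sketch of the two Nova request bodies used for anti-affinity
# scheduling (all names and UUIDs below are illustrative).

# 1) Create a server group with the anti-affinity policy.
server_group_request = {
    "server_group": {
        "name": "hadoop-masters",       # illustrative group name
        "policies": ["anti-affinity"],  # never co-locate members
    }
}

# Nova responds with the new group's UUID, e.g.:
group_id = "616fb98f-46ca-475e-917e-2563e5a8cd19"  # illustrative

# 2) Boot each critical VM with a scheduler hint referencing the
#    group, so the scheduler places members on distinct hosts.
boot_request = {
    "server": {
        "name": "hadoop-master-1",
        "imageRef": "IMAGE_UUID",    # placeholder
        "flavorRef": "FLAVOR_UUID",  # placeholder
    },
    "os:scheduler_hints": {"group": group_id},
}

print(boot_request["os:scheduler_hints"])
```

The same hint can be passed on the command line at `nova boot` time, which is what the slide refers to.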
Cluster Node Storage Architecture
Cinder Block Storage (CEPH RBD)
• Persistent storage for all cluster nodes
• DataNode HDFS volume
• Can act as NodeManager local disk

Ephemeral Block Device(s)
• High performance direct-attached storage
• Root volume, local disk
• Best for NodeManager local disk

Swift Object Storage (CEPH RadosGW)
• Data lake, unified central storage
• Source and destination for job data

[Diagram: cluster node VM with root volume, HDFS Cinder volume, and local ephemeral disk (via libvirt/CEPH), alongside OpenStack Swift as the data lake]
Local Storage – Cinder vs Ephemeral
How important is ephemeral storage for big data workloads on the cloud?
• Traditional Hadoop jobs are read/write intensive during their intermediate stages
• Performant local disk is useful for transient data like shuffle/sort, spilling to disk, and logs
• Local disk setting configured using: yarn.nodemanager.local-dirs
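As a sketch, the corresponding yarn-site.xml entry might look like the following; the mount path is illustrative:

```xml
<!-- yarn-site.xml sketch: point NodeManager scratch space at the
     ephemeral disk (the mount path here is illustrative) -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/mnt/ephemeral0/yarn/local</value>
</property>
```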
Benchmarks
• TeraSort at 1TB
• DFSIO at 10x1GB, 100x10GB

Configurations
• a) Cinder volume – network attached
• b) Local ephemeral disk – direct attached
Local Storage – Cinder vs Ephemeral Results
TeraSort – ephemeral showed 29% wall clock improvement over cinder
DFSIO – negligible performance difference
[Chart: relative job runtime, Ephemeral vs. Cinder, for TeraSort (1TB), DFSIO (10x1GB), and DFSIO (100x1GB)]
Hadoop + SwiftFS 101
How does Hadoop interact with Swift?
• Hadoop SwiftFS implements the Hadoop FileSystem interface on top of the OpenStack Swift REST API

Hadoop-SwiftFS
• Part of the Sahara project (Sahara-Extra)
• https://github.com/openstack/sahara-extra

Hadoop-OpenStack
• Part of Apache Hadoop, a fork of Sahara-Extra
• https://github.com/apache/hadoop/tree/trunk/hadoop-tools/hadoop-openstack

[Diagram: VMs running Hadoop-SwiftFS, connected over the network to OpenStack Swift]
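As a sketch, wiring Hadoop to Swift happens in core-site.xml; the service name "myswift" and all endpoint/credential values below are illustrative, not the deck's actual configuration:

```xml
<!-- core-site.xml sketch for hadoop-openstack. The service name
     "myswift" and all values below are illustrative. -->
<property>
  <name>fs.swift.impl</name>
  <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
</property>
<property>
  <name>fs.swift.service.myswift.auth.url</name>
  <value>https://keystone.example.com:5000/v2.0/tokens</value>
</property>
<property>
  <name>fs.swift.service.myswift.tenant</name>
  <value>analytics</value>
</property>
<property>
  <name>fs.swift.service.myswift.username</name>
  <value>hadoop</value>
</property>
<property>
  <name>fs.swift.service.myswift.password</name>
  <value>secret</value>
</property>
```

Job paths then take the form swift://container.myswift/path/to/object.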
Challenges with Hadoop at Scale on Swift
When we attempted to run jobs at scale we noticed a few things
• Large number of input splits
• Hadoop clients took a long time to launch jobs
• Swift only returned the first 10,000 objects
• Job output is written and then renamed
• CEPH cluster needed some tuning
Large Number of Input Splits
Challenge
• Noticed a multiple of the typical number of input splits
Issues
• Hadoop uses blocksize to compute input splits
• Default Swift “blocksize” set to 32MB
Solution
• Set blocksize appropriate to your environment
• Example: fs.swift.blocksize=131072 (value in KB, i.e. 128MB blocks)
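To see why the default matters, a quick back-of-the-envelope sketch (pure arithmetic, a simplification of Hadoop's actual split logic):

```python
import math

def num_splits(file_size_bytes: int, blocksize_kb: int) -> int:
    """Rough count of input splits for one file, assuming one split
    per reported block (a simplification of Hadoop's logic)."""
    block_bytes = blocksize_kb * 1024
    return math.ceil(file_size_bytes / block_bytes)

one_tb = 1024 ** 4
print(num_splits(one_tb, 32 * 1024))  # default 32MB "blocks" -> 32768 splits
print(num_splits(one_tb, 131072))     # 128MB blocks -> 8192 splits
```

Four times fewer splits means four times fewer map tasks to schedule for the same data.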
Slow Launching Jobs
Challenge
• Hadoop clients took a long time to launch jobs

Issues
• Hadoop does not know it is talking to an object store
• Asks for metadata and block locations of every object
• Results in O(n) performance in the number of objects

Possible Approaches
• Multi-threading – only works at the directory/partition level
• Override getSplits – tool-specific implementations

[Diagram: Hadoop client's FileInputFormat.getSplits calling through Hadoop-SwiftFS to list the objects in a container]
Slow Launching Jobs – Solution
Solution
• Extend support for the location-awareness flag to the get-block-locations methods
• Reduce unnecessary calls to get object metadata
• Localize changes to the Hadoop SwiftFS layer

Benefits
• Jobs launch faster
• Reduces load on the object store
• Works across the tool ecosystem
• Improves interactive query experience

[Chart: job launch time in seconds vs. number of objects in container (100 to 25,000), hadoop-swift-latest.jar with and without optimizations]
Swift Only Returns First 10,000 Objects
Challenge
• Swift only returns the first 10,000 objects in a container or partition
• http://developer.openstack.org/api-ref-objectstorage-v1.html#showContainerDetails
Solution
• Page through list of objects using marker and limit query string parameters
• Continue until the number of items returned is less than the requested limit value
• Default set to 10000
• Configurable by setting fs.swift.container.list.limit
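The paging loop described above can be sketched as follows; `list_page` is a stand-in for the actual GET on the container with `?marker=...&limit=...` query parameters, not a real client API:

```python
def list_all_objects(list_page, limit=10000):
    """Page through a Swift container listing using marker/limit.

    `list_page(marker, limit)` stands in for a GET on the container
    with `?marker=...&limit=...`; it returns up to `limit` object
    names that sort after `marker`.
    """
    marker = ""
    while True:
        page = list_page(marker, limit)
        yield from page
        if len(page) < limit:  # short page: listing is complete
            return
        marker = page[-1]      # resume after the last name seen

# Illustrative in-memory stand-in for a 25,000-object container:
objects = [f"part-{i:05d}" for i in range(25000)]

def fake_page(marker, limit):
    names = [o for o in objects if o > marker]
    return names[:limit]

print(len(list(list_all_objects(fake_page))))  # 25000
```

With the default limit of 10,000 this takes three round trips for a 25,000-object container; the last page comes back short, which ends the loop.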
Job Output Write and Rename
• Hadoop’s OutputCommitter writes task output to a temporary directory
• When the job completes, the temporary directory is renamed to the final output directory

Object stores are not file systems
• Rename results in a copy and delete, which is expensive
• Consequence of using path of the object as hash to the storage location
Object store compatible OutputCommitter
• Basic approach skips the temporary write and outputs directly to the final destination
• Enhanced approach uses local ephemeral storage for temporary writes
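Why rename is expensive on an object store can be shown with a toy dict-backed "store" (illustrative only, not Ceph/Swift code): there are no directories, so renaming a prefix means rewriting every object under it.

```python
# Toy object store: a flat mapping of key -> bytes. "Renaming" a
# directory means copying every object to a new key and deleting
# the old one -- O(data) instead of a cheap metadata update.
store = {
    "_temporary/attempt_0/part-00000": b"a" * 10,
    "_temporary/attempt_0/part-00001": b"b" * 10,
}

def rename_prefix(store, src, dst):
    """Simulate a directory rename as copy+delete, the way an
    object-store backend has to implement it."""
    bytes_moved = 0
    for key in [k for k in store if k.startswith(src)]:
        data = store.pop(key)               # delete the old object
        store[dst + key[len(src):]] = data  # copy to the new key
        bytes_moved += len(data)
    return bytes_moved

moved = rename_prefix(store, "_temporary/attempt_0/", "output/")
print(moved, sorted(store))  # 20 bytes rewritten just to "rename"
```

A committer that writes directly to the final destination avoids this rewrite entirely, at the cost of weaker isolation for in-flight tasks.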
CEPH Architecture and Tuning
Tuning for Hadoop Workloads
• Scale RadosGWs and CEPH OSD nodes
• Enable container index sharding
• Increase placement groups for the index pool
• Increase filestore merge.threshold and split.multiple configurations
• Turn off RadosGW logs

[Diagram: load balancing across multiple RadosGW instances in front of CEPH OSD nodes]
Lessons Learned
• Get to know your OpenStack architecture
• Understand the impacts of your cluster design
• Use ephemeral local disk for NodeManager if possible
• Ensure consistent pseudo-directory representation
• Think about your container data organization
• Choose file formats that reduce I/O (ORC/Parquet)
Next Steps and Future Enhancements
Next Steps
• Upstream enhancements back to the community

Future Enhancements
• Keystone authentication token optimization
• Handle large numbers of partitions
• Streamline map task object retrieval