Quantcast Petabyte Storage at Half Price with QFS

Presented by Silvius Rus, Director, Big Data Platforms

September 2013

Quantcast File System (QFS)

A high performance alternative to the Hadoop Distributed File System (HDFS). Manages multi-petabyte Hadoop workloads with significantly faster I/O than HDFS and uses only half the disk space. Offers massive cost savings to large scale Hadoop users (fewer disks = fewer machines).

Production hardened at Quantcast under massive processing loads (multi exabyte).

Fully compatible with Apache Hadoop.

100% Open Source.

Quantcast Technology Innovation Timeline

Timeline, 2006 to 2013: Quantcast launch, Quantcast Measurement launched, Quantcast Advertising launched, QFS launched. Data received grew from 1 TB/day to 10, 20, and then 40 TB/day; data processed grew from 1 PB/day to 10 and then 20 PB/day. Along the way Quantcast started using Hadoop, began using and sponsoring KFS, and turned off HDFS.

Architecture

Client
· Implements high level file interface (read/write/delete)
· On write, RS encodes chunks and distributes stripes to nine chunk servers (a rough write-path sketch follows below)
· On read, collects RS stripes from six chunk servers and recomposes the chunk

Chunk Server
· Handles I/O to locally stored 64 MB chunks
· Monitors host file system health
· Replicates and recovers chunks as the metaserver directs

Metaserver
· Maps /file/paths to chunk ids
· Manages chunk locations and rebalancing
· Directs clients to chunk servers

(Diagram: clients read/write RS encoded data from/to chunk servers across racks, e.g. Rack 1 and Rack 2; clients locate or allocate chunks through the metaserver; the metaserver sends chunk replication and rebalancing instructions to chunk servers, which copy and recover chunks among themselves.)
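To make the write path concrete, here is a minimal sketch (Python, hypothetical names, not the actual QFS client API) of how a client could split a buffer into 64 KB stripes, group six data stripes with three parity stripes, and fan stripe i of each group out to chunk server i. The parity stripes are zero-filled placeholders so the sketch runs on its own; the real client fills them with Reed-Solomon parity computed over the six data stripes.

from collections import Counter

STRIPE_SIZE = 64 * 1024   # 64 KB stripes, per the slides
DATA_STRIPES = 6          # six data stripes ...
PARITY_STRIPES = 3        # ... plus three parity stripes per group

def plan_write(buf):
    """Yield (chunk_server_index, stripe_bytes) pairs for one client write.

    Stripe i of every 6+3 group goes to chunk server i, so each write fans
    out across nine servers. Parity stripes are zero-filled placeholders
    here; real QFS computes Reed-Solomon parity over the six data stripes.
    """
    # Split the buffer into fixed-size data stripes, padding the tail.
    stripes = [buf[i:i + STRIPE_SIZE].ljust(STRIPE_SIZE, b"\0")
               for i in range(0, len(buf), STRIPE_SIZE)]
    for g in range(0, len(stripes), DATA_STRIPES):
        group = stripes[g:g + DATA_STRIPES]
        while len(group) < DATA_STRIPES:              # pad a short final group
            group.append(b"\0" * STRIPE_SIZE)
        parity = [b"\0" * STRIPE_SIZE] * PARITY_STRIPES   # placeholder parity
        for server_index, stripe in enumerate(group + parity):
            yield server_index, stripe                # stripe i -> chunk server i

# A 1 MB buffer yields 16 data stripes = 3 groups (the last one padded);
# every one of the nine chunk servers receives exactly 3 stripes.
print(Counter(i for i, _ in plan_write(b"x" * 1024 * 1024)))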

QFS vs. HDFS

Broadly comparable feature set, with significant storage efficiency advantages.

Feature | QFS | HDFS
Scalable, distributed storage designed for efficient batch processing | Yes | Yes
Open source | Yes | Yes
Hadoop compatible | Yes | Yes
Unix style file permissions | Yes | Yes
Error recovery mechanism | Reed-Solomon encoding | Multiple data copies
Disk space required (as a multiple of raw data) | 1.5x | 3x

Reed-Solomon Error Correction: Leveraging high-speed modern networks

HDFS optimizes toward data locality for older networks. 10 Gbps networks are now common, making disk I/O the more critical bottleneck. QFS leverages faster networks to achieve better parallelism and encoding efficiency. Result: higher error tolerance and faster performance, with half the disk space.

Reed-Solomon parallel data I/O:
1. Break original data into 64K stripes.
2. Reed-Solomon generates three parity stripes for every six data stripes.
3. Write those to nine different drives.
4. Up to three stripes can become unreadable...
5. ...yet the original data can still be recovered.

Every write is parallelized across 9 drives, every read across 6 (the storage arithmetic is spelled out below).
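The arithmetic behind the slide's numbers, spelled out as a quick check (plain Python, nothing QFS-specific):

# Reed-Solomon 6+3 striping vs. 3-way replication (the slide's numbers).
data_stripes, parity_stripes = 6, 3

rs_overhead = (data_stripes + parity_stripes) / data_stripes  # 9/6 = 1.5x raw data
replication_overhead = 3.0                                    # three full copies = 3x

rs_tolerated = parity_stripes      # any 3 of the 9 stripes may be lost
replication_tolerated = 3 - 1      # any 2 of the 3 copies may be lost

print(rs_overhead, replication_overhead)    # 1.5 3.0
print(rs_tolerated, replication_tolerated)  # 3 2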

MapReduce on 6+3 Erasure Coded Files versus 3x Replicated Files

Positives:
· Writing is ½ off, both in terms of space and time
· Any 3 broken or slow devices will be tolerated, vs. any 2 with 3-way replication
· Re-executed stragglers run faster due to reading from multiple devices (striping)

Negatives:
· There is no locality; reading will require the network
· On read failure, recovery is needed; however, it is lightning fast on modern CPUs (2 GB/s per core; see the rough arithmetic below)
· Writes don't achieve network line rate, as original + parity data is written by a single client
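As a rough sense of scale for the "lightning fast" recovery claim (our back-of-the-envelope, assuming the 2 GB/s figure is per-core Reed-Solomon decode throughput, which the slide does not state explicitly):

DECODE_RATE = 2 * 1024**3      # bytes/s of Reed-Solomon decode per core (per the slide)
CHUNK = 64 * 1024**2           # one 64 MB chunk

# Re-deriving a whole missing 64 MB chunk costs on the order of 30 ms of CPU time,
# ignoring the time to read the six surviving stripes from other servers.
print(CHUNK / DECODE_RATE)     # 0.03125 seconds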

Read/Write Benchmarks

[Chart: end-to-end time in minutes (0 to 18) for the write and read tests, comparing HDFS with 64 MB blocks, HDFS with 2.5 GB blocks, and QFS with 64 MB blocks.]

Host network behavior during tests:
· QFS write = ½ the disk I/O of HDFS write
· QFS write → network/disk = 8/9
· HDFS write → network/disk = 6/9
· QFS read → network/disk = 1
· HDFS read → network/disk = very small
(One way to derive the write-side ratios is sketched below.)

End-to-end 20 TB write test and 20 TB read test, 8,000 workers * 2.5 GB each. Tests ran as Hadoop MapReduce jobs.
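One way to arrive at the write-side ratios above, under an assumption of ours that the slide does not state: the writing client is co-located with one chunk server (or HDFS data node), so one stripe or one replica stays local.

DATA = 6                 # units of user data written

# QFS, Reed-Solomon 6+3: nine stripe-units reach disk, eight cross the network.
qfs_disk = DATA + 3      # 6 data + 3 parity stripes
qfs_net = qfs_disk - 1   # one stripe written locally
print(qfs_net, "/", qfs_disk)      # 8 / 9

# HDFS, 3-way replication: eighteen replica-units reach disk, twelve cross the network.
hdfs_disk = 3 * DATA
hdfs_net = hdfs_disk - DATA        # the first replica is local
print(hdfs_net, "/", hdfs_disk)    # 12 / 18 = 6/9

# The same accounting gives "QFS write = 1/2 the disk I/O of HDFS write": 9 vs. 18.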

Metaserver Performance

[Chart: metadata operations per second (thousands, 0 to 300) for stat, rmdir, mkdir, and ls, QFS vs. HDFS. Test host: Intel E5-2670, 64 GB RAM, 70 million directories.]

Production Hardening for Petascale

Continuous I/O Balancing
• Full feedback loop
• Metaserver knows the I/O queue size of every device
• Activity biased towards under-loaded chunkservers
• Direct I/O = short loop

Optimization
• Direct I/O and fixed buffer space = predictable RAM and storage device usage
• C++, own memory allocation and layout
• Vector instructions for Reed-Solomon coding

Operations
• Hibernation
• Evacuation through recovery
• Continuous space/consistency rebalancing
• Monitoring and alerts

Use Case: Quantsort, All I/O over QFS (http://qc.st/QCQuantsort)

Concurrent append: 10,000 writers append to the same file at once.

Largest sort = 1 PB. Daily = 1 to 2 PB, max = 3 PB.

Use Case: Fast Broadcast through Wide Striping

Broadcast time by configuration:

Configuration | Broadcast Time (s)
HDFS Default | 94.5
HDFS Small Blocks | 16.7
QFS on Disk | 8.5
QFS in RAM | 4.8

Refreshingly Fast Command Line Tool: hadoop fs -ls / versus the QFS command line tool's -ls /

Time (msec): HDFS 700, QFS 7.

How Well Does It Work

Reliable at Scale

Hundreds of days of metaserver uptime common

Quantcast MapReduce sorter uses QFS as distributed virtualized store instead of local disk

8 petabytes of compressed data

Close to 1 billion chunks

7,500 I/O devices

Fast and Large

Ran petabyte sort last weekend.

Direct I/O not hurting fast scans: Sawzall query performance similar to Presto:

Metric | Presto/HDFS | Turbo/QFS
Seconds | 16 | 16
Rows | 920 M | 970 M
Bytes | 31 G | 294 G
Rows/sec | 57.5 M | 60.6 M
Bytes/sec | 2.0 G | 18.4 G

Easy to Use

1 Ops Engineer for QFS and MapReduce on 1,000+ node cluster

Neustar set up multi petabyte instance without help from Quantcast

Migrate from HDFS using hadoop distcp!

Hadoop MapReduce “just works” on QFS

Metaserver Statistics in Production

QFS metaserver statistics over Quantcast production file systems in July 2013.

• High Availability is nice to have but not a must-have for MapReduce. There are certainly other use cases where High Availability is a must.
• Federation may be needed to support file systems beyond 10 PB, depending on file size.

Who will find QFS valuable?

Likely to benefit from QFS:
· Existing Hadoop users with large-scale data clusters.
· Data heavy, tech savvy organizations for whom performance and efficient use of hardware are high priorities.

May find HDFS a better fit:
· Small or new Hadoop deployments, as HDFS has been deployed in a broader variety of production environments.
· Clusters with slow or unpredictable network connectivity.
· Environments needing specific HDFS features such as head node federation or hot standby.

Summary

Key Benefits of QFS

Delivers a stable, high performance alternative to HDFS in a production-hardened 1.0 release

Offers high performance management of multi-petabyte workloads

Faster I/O than HDFS with half the disk space.

Fully Compatible with Apache Hadoop

100% Open Source

Future Work

What QFS Doesn’t Have Just Yet

Kerberos Security – under development

HA – No strong case at Quantcast, but nice to have

Federation – No strong case at Quantcast either

Contributions welcome!

Thank You. Questions?

Download QFS for free at: github.com/quantcast/qfs

San Francisco: 201 Third Street, San Francisco, CA 94103
New York: 432 Park Avenue South, New York, NY 10016
London: 48 Charlotte Street, London, W1T 2NS
