Crossing the Chasm: Sneaking a parallel file system into Hadoop
Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson
Parallel Data Laboratory, Carnegie Mellon University

In this work …
• Compare and contrast large storage system architectures
  • Internet services
  • High performance computing
• Can we use a parallel file system for Internet service applications?
  • Hadoop, an Internet service software stack
  • HDFS, an Internet service file system for Hadoop
  • PVFS, a parallel file system
http://www.pdl.cmu.edu/ 2 Wittawat Tantisiriroj © November 08

Today’s Internet services
• Applications are becoming data-intensive
  • Large input data sets (e.g., the entire web)
  • Distributed, parallel application execution
• A distributed file system is a key component
  • Defines new semantics for anticipated workloads
    – Atomic append in Google FS
    – Write-once semantics in HDFS
  • Commodity hardware and network
    – Handles failures through replication
The HPC world
• Equally large applications
  • Large input data sets (e.g., astronomy data)
  • Parallel execution on large clusters
• Use parallel file systems for scalable I/O
  • e.g., IBM’s GPFS, Sun’s Lustre FS, PanFS, and the Parallel Virtual File System (PVFS)
Why use parallel file systems?
• Handle a wide variety of workloads
  • Highly concurrent reads and writes
  • Small-file support, scalable metadata
• Offer a performance vs. reliability tradeoff
  • RAID-5 (e.g., PanFS)
  • Mirroring
  • Failover (e.g., LustreFS)
• Standard Unix FS interface & POSIX semantics
  • pNFS standard (NFS v4.1)
Outline
► A basic shim layer & preliminary evaluation
• Three add-on features in a shim layer
• Evaluation
HDFS & PVFS: high-level design
• Metadata servers
  • Store all file system metadata
  • Handle all metadata operations
• Data servers
  • Store actual file system data
  • Handle all read and write operations
• Files are divided into chunks
  • Chunks of a file are distributed across servers
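The chunk-to-server mapping can be sketched as follows. This is an illustrative sketch rather than either system’s actual code: the 64 MB chunk size and the round-robin placement are assumptions (PVFS stripes chunks round-robin across its data servers, while HDFS places chunks more randomly).

```python
# Illustrative sketch: map a byte offset in a file to a chunk index and
# a data server, assuming a fixed chunk size and round-robin placement
# (as in PVFS striping; HDFS instead places chunks pseudo-randomly).
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, an assumed chunk size

def locate(offset, num_servers):
    """Return (chunk_index, server_index) for a byte offset."""
    chunk = offset // CHUNK_SIZE
    server = chunk % num_servers  # round-robin across data servers
    return chunk, server
```

Metadata servers answer this kind of lookup; the actual reads and writes then go directly to the data servers.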
PVFS shim layer under Hadoop
[Architecture diagram: Hadoop applications run on the Hadoop framework, which reaches storage through an extensible file system API. Under that API sits either the HDFS client library or the PVFS shim layer, which forwards requests to, and returns responses from, the unmodified PVFS client library (C) via the Java Native Interface (JNI). On the server side, unmodified HDFS or PVFS servers store the data.]
Preliminary evaluation
• Text search (“grep”), a common workload in Internet service applications
• Search for a rare pattern in 100-byte records
  • 64 GB data set
• 32 nodes
  • Each node serves as both a storage and a compute node
Vanilla PVFS is disappointing …
[Benchmark chart: vanilla PVFS runs the grep job 2.5 times slower than HDFS]
Outline
• A basic shim layer & preliminary evaluation
► Three add-on features in a shim layer
  ► Readahead buffer
  • File layout information
  • Replication
• Evaluation
Read operation in Hadoop
• Typical read workload:
  • Small (less than 128 KB)
  • Sequential through an entire chunk
• HDFS prefetches an entire chunk
  • No cache coherence issue, thanks to its write-once semantics
Readahead buffer
• PVFS has no client buffer cache
  • Avoids cache coherence issues with concurrent writes
• A readahead buffer can be added to the PVFS shim layer
  • In Hadoop, a file becomes immutable after it is closed
  • No cache coherence mechanism is needed
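A minimal sketch of the readahead idea, with hypothetical names (the real shim is Java code wrapping the PVFS client library via JNI): small sequential reads are served from one large prefetched buffer, which is safe only because a Hadoop file is immutable once closed.

```python
# Sketch of a readahead buffer: wrap a raw reader and serve small
# sequential reads from a large prefetched buffer. Names are
# hypothetical; the actual shim buffers PVFS reads in the same spirit.
class ReadaheadReader:
    def __init__(self, raw_read, buffer_size=4 * 1024 * 1024):
        self.raw_read = raw_read        # raw_read(offset, length) -> bytes
        self.buffer_size = buffer_size  # e.g., 4 MB, as in the evaluation
        self.buf = b""
        self.buf_start = 0

    def read(self, offset, length):
        end = offset + length
        # On a miss, refill the buffer with one large backing read;
        # subsequent small sequential reads then hit the buffer.
        if not (self.buf_start <= offset and end <= self.buf_start + len(self.buf)):
            self.buf = self.raw_read(offset, self.buffer_size)
            self.buf_start = offset
        lo = offset - self.buf_start
        return self.buf[lo:lo + length]
```

With a 4 MB buffer, a sequential scan of small (under 128 KB) reads issues one backing request per 4 MB instead of one per read.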
PVFS with a 4 MB buffer
[Benchmark chart: PVFS with a 4 MB readahead buffer is still quite slow compared to HDFS]
Outline
• A basic shim layer & preliminary evaluation
► Three add-on features in a shim layer
  • Readahead buffer
  ► File layout information
  • Replication
• Evaluation
Collocation in Hadoop
• File layout information
  • Describes where the chunks of a file are located
• Collocating computation and data
  • Ships computation to where the data is located
  • Reduces network traffic
Hadoop without collocation
[Diagram: chunks 1–3 are stored on storage nodes A–C, but the computation for each chunk is assigned without regard to chunk placement, so all 3 chunks are transferred over the network.]
Hadoop with collocation
[Diagram: the computation for each chunk is assigned to the storage node that holds it, so no data is transferred over the network.]
Expose file layout information
• File layout information in PVFS
  • Stored as extended attributes
  • Uses a format different from Hadoop’s
• The shim layer converts file layout information from the PVFS format to the Hadoop format
  • Enables Hadoop to collocate computation and data
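The conversion can be sketched roughly as follows. This is a hypothetical sketch, not the shim’s code: field names and the round-robin striping assumption are illustrative. The target shape mirrors what Hadoop’s FileSystem.getFileBlockLocations() returns, namely per-block (offset, length, hosts) tuples.

```python
# Hypothetical sketch: turn PVFS-style striping parameters (stored as
# extended attributes) into Hadoop-style per-block host lists, like the
# BlockLocation objects Hadoop's getFileBlockLocations() returns.
def pvfs_to_hadoop_layout(file_size, stripe_size, servers):
    """Return a list of (offset, length, [hosts]) block locations."""
    blocks = []
    offset = 0
    i = 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)
        host = servers[i % len(servers)]  # round-robin striping in PVFS
        blocks.append((offset, length, [host]))
        offset += length
        i += 1
    return blocks
```

Given these locations, Hadoop’s scheduler can assign each map task to the node that already stores its input stripe.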
PVFS with file layout information
[Benchmark chart: with file layout information exposed, PVFS performance is comparable to HDFS]
Outline
• A basic shim layer & preliminary evaluation
► Three add-on features in a shim layer
  • Readahead buffer
  • File layout information
  ► Replication
• Evaluation
Replication in HDFS
• Rack-aware replication
• By default, 3 copies of each file (triplication):
  1. Write to the local storage node
  2. Write to a storage node in the local rack
  3. Write to a storage node in another rack
Replication in PVFS
• No replication in the public release of PVFS
  • Relies on hardware-based reliability solutions
  • Per-server RAID inside logical storage devices
• Replication can be added in the shim layer
  • Write each file to three servers
  • No reconstruction/recovery in the prototype
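The shim-level triplication can be sketched like this. It is a minimal sketch under stated assumptions, not the prototype’s code: function names are hypothetical, and, as on this slide, there is no reconstruction or recovery, only writing three copies and falling back across them on reads.

```python
# Sketch of shim-level triplication: each write goes to three replicas
# (three PVFS files on different servers); reads fall back across
# replicas. No reconstruction/recovery, matching the prototype.
def replicated_write(writers, data):
    """Write the same data to every replica (e.g., three file handles)."""
    for w in writers:
        w(data)

def replicated_read(readers):
    """Return data from the first replica that responds."""
    for r in readers:
        try:
            return r()
        except IOError:
            continue  # this replica failed; try the next one
    raise IOError("all replicas failed")
```

Unlike HDFS’s pipeline (local node, local rack, remote rack), this sketch treats all three replicas alike; rack awareness would be an additional placement policy.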
PVFS with replication
[Diagram: Hadoop applications run on the Hadoop framework and its extensible file system API; the PVFS shim layer writes each replica through the unmodified PVFS client library (C) to three unmodified PVFS servers.]
PVFS shim layer under Hadoop
[Diagram: the PVFS shim layer (~1,700 lines of code) plugs into Hadoop’s extensible file system API alongside the HDFS client library. It adds the readahead buffer, file layout information, and replication on top of the unmodified PVFS client library (C), which talks to unmodified PVFS servers.]
Outline
• A basic shim layer & preliminary evaluation
• Three add-on features in a shim layer
► Evaluation
  ► Micro-benchmark (non-MapReduce)
  • MapReduce benchmark
Micro-benchmark
• Cluster configuration:
  • 16 nodes
  • Pentium D dual-core 3.0 GHz
  • 4 GB memory
  • One 7,200 RPM SATA 160 GB disk (8 MB buffer)
  • Gigabit Ethernet
• Uses the file system API directly, without Hadoop involvement
N clients, each reading 1/N of a single file
[Benchmark chart]
• The round-robin file layout in PVFS helps avoid contention
Why is PVFS better in this case?
• Without scheduling, clients read in a uniform pattern:
  • Client1 reads A1 then A4
  • Client2 reads A2 then A5
  • Client3 reads A3 then A6
• PVFS: round-robin placement puts A1, A2, A3 on different servers, so in each round the three clients read from three different servers
• HDFS: random placement can put two of A1, A2, A3 on the same server, so two clients contend for one server while another sits idle

HDFS with Hadoop’s scheduling
• Example 1 (placement: A1/A3 on Node A, A2/A5 on Node B, A4/A6 on Node C):
  • Client1 reads A1 then A4
  • Client2 reads A2 then A5
  • Client3 reads A6 then A3
• Example 2 (same placement):
  • Client1 reads A1 then A3
  • Client2 reads A2 then A5
  • Client3 reads A4 then A6
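The contention argument can be made concrete with a toy load counter. This is a sketch, not the paper’s methodology; the two placements below are illustrative stand-ins for round-robin striping and one unlucky random layout.

```python
# Toy illustration of read contention: count how many concurrent
# requests land on the most-loaded server under a given placement.
def max_load(placement, requests):
    """placement: chunk -> server; requests: chunks read concurrently."""
    load = {}
    for chunk in requests:
        s = placement[chunk]
        load[s] = load.get(s, 0) + 1
    return max(load.values())

# Round-robin striping (PVFS): consecutive chunks on consecutive servers.
round_robin = {"A1": "A", "A2": "B", "A3": "C",
               "A4": "A", "A5": "B", "A6": "C"}
# One unlucky random layout (HDFS-like): A1 and A3 share a server.
random_like = {"A1": "A", "A2": "B", "A4": "C",
               "A3": "A", "A5": "B", "A6": "C"}
```

In the uniform first round the clients request A1, A2, A3: round-robin spreads them over three servers (max load 1), while the random layout sends two of them to the same server (max load 2), and the stragglers determine the finish time.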
Read with Hadoop’s scheduling
[Benchmark chart]
• Hadoop’s scheduling can mask the problem of a non-uniform file layout in HDFS
N clients write to N distinct files
[Benchmark chart]
• By writing one of the three copies locally, HDFS write throughput grows linearly
Concurrent writes to a single file
[Benchmark chart]
• Because PVFS allows concurrent writes, “copy” completes faster by using multiple writers
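The multi-writer “copy” can be sketched as follows. This is an illustrative sketch, not the benchmark’s code: N writers each write a disjoint 1/N range of one output file in parallel, which HDFS’s single-writer, write-once model forbids but PVFS permits. A shared bytearray stands in for non-overlapping regions of one PVFS file.

```python
# Sketch of a multi-writer copy: each writer fills a disjoint range of
# the output, so no coordination between writers is needed. A shared
# bytearray stands in for disjoint regions of a single PVFS file.
import threading

def parallel_copy(src, num_writers):
    dst = bytearray(len(src))
    step = (len(src) + num_writers - 1) // num_writers
    def write_range(i):
        lo = i * step
        hi = min(lo + step, len(src))
        dst[lo:hi] = src[lo:hi]  # disjoint ranges never overlap
    threads = [threading.Thread(target=write_range, args=(i,))
               for i in range(num_writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(dst)
```

Under HDFS, the same copy must funnel through one writer; with concurrent writers on a parallel file system, the aggregate write bandwidth of several clients can be applied to a single file.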
Outline
• A basic shim layer & preliminary evaluation
• Three add-on features in a shim layer
► Evaluation
  • Micro-benchmark (non-MapReduce)
  ► MapReduce benchmark
MapReduce benchmark setting
• Yahoo! M45 cluster
  • 50–100 nodes
  • Xeon quad-core 1.86 GHz with 6 GB memory
  • One 7,200 RPM SATA 750 GB disk (8 MB buffer)
  • Gigabit Ethernet
• Uses the Hadoop framework for MapReduce processing
MapReduce benchmark
• Grep: search for a rare pattern in a hundred million 100-byte records (100GB)
• Sort: sort a hundred million 100-byte records (100GB)
• Never-Ending Language Learning (NELL), from J. Betteridge (CMU): count occurrences of selected phrases in a 37 GB data set
Read-intensive benchmark
[Benchmark chart]
• PVFS’s performance is similar to HDFS’s
Write-intensive benchmark
[Benchmark chart]
• By writing one of the three copies locally, HDFS does better than PVFS
Summary
• PVFS can be tuned to deliver promising performance for Hadoop applications
  • A simple shim layer in Hadoop
  • No modification to PVFS
• PVFS can expose file layout information
  • Enables Hadoop to collocate computation and data
• Hadoop applications can benefit from the concurrent writes supported by parallel file systems
Acknowledgements
• Sam Lang and Rob Ross for help with PVFS internals
• Yahoo! for the M45 cluster
• Julio Lopez for help with M45 and Hadoop
• Justin Betteridge, Le Zhao, Jamie Callan, Shay Cohen, Noah Smith, U Kang, and Christos Faloutsos for their scientific applications