
Linux Clusters Institute: High Performance Storage

University of Oklahoma, 05/19/2015

Mehmet Belgin, Georgia Tech [email protected] (in collaboration with Wesley Emeneker)

The Fundamental Question

• How do we meet *all* user needs for storage?

• Is it even possible?

• Confounding factors
  • User expectations (in their own words)
  • Budget constraints
  • Application needs and use cases
  • Expertise in team
  • Existing infrastructure

Examples of Common Storage Systems

• Network File System (NFS) – a distributed file system protocol for accessing files over a network
• Lustre – a parallel, distributed file system
  • OSS – object storage server. This server stores and manages pieces of files (aka objects)
  • OST – object storage target. This disk is managed by the OSS and stores data
  • MDS – metadata server. This server stores file metadata
  • MDT – metadata target. This disk is managed by the MDS and stores file metadata
• General Parallel File System (GPFS) – a parallel, distributed file system
  • Metadata is not owned by any particular server or set of servers
  • All clients participate in filesystem management
  • NSD – network shared disk
• Panasas/PanFS – a parallel, distributed file system
  • Metadata is owned by director blades
  • File data is owned by storage blades

Nomenclature

• Object store – a place where chunks of data (aka objects) are stored. Objects are not files, though they can store individual files or different pieces of files.
• Raw space – what the disk label shows. Typically given in base 10, i.e. 10 TB (terabytes) == 10*10^12 bytes
• Usable space – what “df” shows once the storage is mounted. Typically given in base 2, i.e. 10 TiB (tebibytes) == 10*2^40 bytes
• Usable space is often about 30% smaller (sometimes more, sometimes less) than raw space (a quick calculation follows below)
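As a quick sanity check of the base-10 vs. base-2 gap alone (before any RAID or filesystem overhead is counted), a one-line bc calculation, assuming bc is available:

$ echo 'scale=2; 10*10^12 / 2^40' | bc
9.09

So a drive labeled 10 TB holds at most about 9.09 TiB; parity, filesystem metadata, and reserved blocks shrink the usable number further.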

Which one is right for me?

Lustre

The End.

Thanks for participating!

Before we start… What is a File System?

What is a filesystem?

• A system for files (Duh!)

• A source of constant frustration

• A filesystem is used to control how data is stored and retrieved –Wikipedia

• It’s a container (that contains files)

• It’s the set of disks, servers (computational components), networking, and software

• All of the above

Disclaimer

• There are no right answers

• There are wrong answers • No, seriously.

• It comes down to balancing tradeoffs of preferences, expertise, costs, and case-by-case analysis

Know Your Stakeholders

… and keep all of them happy! (at the same time)

1. Users
2. Managers and University Leadership
3. University support staff
4. System administrators
5. Vendor

[Diagram: Users, Managers, Sysadmins]

What do you need to support?

Common Storage Requirements (which most users can’t articulate)
• Temporary storage for intermediate results from jobs (a.k.a. scratch)
• Long-term storage for runtime use
• Backups
• Archive
• Exporting said filesystem to other machines (like a user's Windows XP laptop)
• Virtual Machine hosting
• Database hosting
• Map/Reduce (a.k.a. Hadoop)
• Data ingest and outgest (DMZ?)
• System Administrator storage

Tradeoffs

First, try to define ‘use purpose’ and ‘operational lifetime’…

• Speed (… is a relative term!)
• Space
• Cost
• Scalability
• Administrative burden
• Monitoring
• Reliability/Redundancy
• Features
• Support from vendor

Parallel/Distributed vs. Serial Filesystems*

Serial
• It doesn’t scale beyond a single server
• It often isn't easy to make it reliable or redundant beyond a single server
• A single server controls everything

Parallel
• Speed increases as more components are added to it
• Built for distributed redundancy and reliability
• Multiple servers contribute to the management of the filesystem

*None of these things are 100% true

The Most Common Solutions for HPC

Want to access your data from everywhere?

You need “Network Attached Storage (NAS)”!

• NFS (serial-ish)
• GPFS (parallel)
• Lustre (parallel)
• Panasas (parallel)

• What about others like OrangeFS, XtreemFS, CIFS, HDFS, Swift, etc.?

Prepare for a Challenge

Administrative burden & needed expertise (anecdotal), from low to high:
• NFS (low)
• Panasas
• GPFS
• Lustre (high)

• Your mileage may vary!

Network File System (NFS)

• Can be built from commodity parts or purchased as an appliance

• A single server typically controls everything
• Where does it fall for our tradeoffs?
  *Speed *Space *Cost *Scalability *Administrative Burden *Monitoring *Reliability/Redundancy *Features *Vendor Support
• No software cost
• Compatible (not 100% POSIX)
• Underlying filesystem does not matter much (ZFS, …)
• True redundancy is harder (single point of failure)
• Mostly for low-volume, low-throughput workloads
• Strong client-side caching, works well for small files
• Requires minimal expertise and (relatively) easy to manage (a minimal export/mount sketch follows)
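To make the “one server exports, many clients mount” model concrete, here is a minimal sketch of a classic Linux NFS setup; the export path, subnet, and hostname are hypothetical, and options differ across NFS versions and distributions:

# On the server: one line in /etc/exports, then re-export
/export/home  10.0.0.0/24(rw,sync,root_squash)
$ exportfs -ra

# On a client: mount the share
$ mount -t nfs nfs-server.example.edu:/export/home /home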

General Parallel File System (GPFS)

• Can be built from commodity parts or purchased as an appliance (a few common GPFS commands are sketched below)
• All nodes in the GPFS cluster participate in filesystem management
• Metadata is managed by every node in the cluster
• Where does it fall in our tradeoffs?
  *Speed *Space *Cost *Scalability *Administrative Burden *Monitoring *Reliability/Redundancy *Features *Vendor Support

[Diagram: NSD Servers and Clients connected by a network]
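For a feel of day-to-day GPFS administration, a hedged sketch of a few standard mm* commands; the filesystem device name gpfs0 is hypothetical and output formats vary by release:

$ mmlscluster      # cluster members and quorum nodes
$ mmlsnsd          # NSDs and the servers that serve them
$ mmlsfs gpfs0     # filesystem attributes (block size, replication, ...)
$ mmdf gpfs0       # free/used space per NSD and storage pool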

Lustre

• Can be built from commodity parts, or purchased as an appliance
• Separate servers for data and metadata
• Where does it fall in our tradeoffs? (a brief lfs example follows the image credit)

*Speed *Space *Cost *Scalability *Administrative Burden *Monitoring *Reliability/Redundancy *Features *Vendor Support

* Image credit: nor-tech.com
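From the client side, the data/metadata split shows up in the lfs tool; a minimal sketch, where the stripe settings and file name are only illustrative:

$ lfs df -h                           # space per MDT and per OST, not just one total
$ lfs setstripe -c 4 -S 1M bigfile    # create a new file striped across 4 OSTs in 1 MiB stripes
$ lfs getstripe bigfile               # show which OSTs hold the file's objects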

Panasas

• Is an appliance
• Separate servers for metadata and data
• Where does it fall in our tradeoffs?

*Speed *Space *Cost *Scalability *Administrative Burden *Monitoring *Reliability/Redundancy *Features *Vendor Support

* Image credit: panasas.com

Appliances

[Screenshot of Panasas management tool]

• Appliances generally come with vendor tools for monitoring and management
• Do these tools increase or decrease management complexity?
• How important is vendor support for your team?

Good idea? Bad idea? Let’s discuss!

• NFS for everything

• Panasas for everything

• Lustre for everything

• GPFS for everything

How about…

• Lustre for work (files stored here are temporary)
• NFS for home
• Tape for backup and archival

• Lustre available everywhere
• Tape available on data movers
• NFS only available on login machines

Designing your storage solution

• Who are the stakeholders?
• How quickly should we be able to read any one file?
• How will people want to use it?
• How much training will you need?
• How much training will your users need to effectively use your storage?
  • Do you have the knowledge necessary to do the training?
  • How often do they need the training?
• Do you need different tiers or types of storage?
  • Long-term
  • Temporary
  • Archive
• From what science/usage domains are the users?
  • aka what applications will they be using?
• What features are necessary?

Application Driven Tradeoffs

• Domain Science
  • Chemistry
  • Aerospace
  • Bio* (biology, bioinformatics, biomedical)
  • Physics
  • Business
  • Economics
  • etc.
• Data and Application Restrictions
  • HIPAA and PHI
  • ITAR
  • PCI DSS
  • And many more (SOX, GLBA, CJIS, FERPA, SOC, …)

What you need to know

• What is the distribution of files?
  • sizes, count (see the survey one-liner below)
• What is the expected workload?
  • How many bytes are written for every byte read?
  • How many bytes are read for each file opened?
  • How many bytes are written for each file opened?
• Are there any system-based restrictions?
  • POSIX conformance. Do you need a POSIX filesystem?
  • Limitations on number of files or files per directory
  • Network compatibility (IB vs. …)
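One hedged way to get a first look at file counts and sizes on an existing system is a find/awk one-liner; /data is a hypothetical path and -printf assumes GNU find:

$ find /data -type f -printf '%s\n' \
    | awk '{n++; sum+=$1; if ($1 < 2^20) small++}
           END {printf "files: %d  <1MiB: %d  total GiB: %.1f\n", n, small, sum/2^30}'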

Use Case: Data Movement

• Scenario: User needs to import a lot of data
• Where is the data coming from?
  • Campus LAN?
  • Campus WAN?
  • WAN?
• How often will the data be ingested?
• Does it need to be outgested?
• What kind of data is it?
• Is it a one-time ingest or regular? (a transfer sketch follows)
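For a recurring ingest over a network link, a minimal sketch using rsync (hostname and paths are hypothetical); incremental, resumable copies matter more than raw speed on long WAN transfers:

$ rsync -avP user@instrument.example.edu:/export/run42/ /project/ingest/run42/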

Designing your storage solution

• What technologies do you need to satisfy the requirements that you now have?
• Can you put a number on the following?
  • Minimum disk throughput from a single compute node
  • Minimum aggregate throughput for the entire filesystem for a benchmark (like iozone or IOR)
  • I/O load for representative workloads from your site
  • How much data and metadata is read/written per job?
  • Temporary space requirements
  • Archive and backup space requirements
  • How much churn is there in data that needs to be backed up?

Storage Devices

Device hierarchy (speed & cost: high at top, low at bottom; capacity: the reverse):
• Solid State
  • RAM
  • PCIe SSD
  • SATA/SAS SSD
• Spinning Disk
  • SAS
  • NL-SAS
  • SATA
• Tape

• Serial ATA (SATA): $/byte, large capacity, less reliable, slower (7.2k RPM)
• Serial Attached SCSI (SAS): $$/byte, small capacity, reliable, fast (15k RPM)
• Nearline-SAS (NL-SAS): SATA drives with a SAS interface: more reliable than SATA, cheaper than SAS, ~SATA speeds but with lower overhead
• Solid State Disk (SSD): no spinning disks, $$$/byte, blazing fast, reliable

What is an IOP?

• IOP == Input/Output Operation
• IOPS == Input/Output Operations per Second
• We care about two IOPS reports
  • The number we tell people when we say “Our Veridian Dynamics Frobulator 2021 gets 300PiB/s bandwidth!”
  • The number that affects users: “Our Veridian Dynamics Frobulator 2021 only gets 5KiB/s for …”
• Why the difference? (see the illustration below)
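One hedged way to see why the two numbers diverge is to drive the same storage with very different I/O sizes; the file names are illustrative, and oflag=direct bypasses the page cache (see the dd slide later):

$ dd if=/dev/zero of=./big.dat bs=1M count=1024 oflag=direct      # few large sequential I/Os (~1 GiB)
$ dd if=/dev/zero of=./small.dat bs=4k count=262144 oflag=direct  # many tiny I/Os, same total bytes

The first run is limited mostly by bandwidth, the second mostly by per-operation overhead, so the reported MB/s can differ by orders of magnitude.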

More tradeoffs …

Space vs. Speed
• Do you need 10GiB/s and 10TiB of space?
• Do you need 1PiB of usable storage and 1GiB/s?
• How do you meet your requirements?

Large vs. Small Files
• What is a small file?
  • No hard rule. It depends on how you define it.
  • At GT, small is < 1MiB
• Why do you care?
  • Metadata operations are deadly. A metadata lookup on a 1TiB file takes the same time as a lookup on a 1KiB file (see the comparison below).
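A rough, hedged way to feel the per-file overhead on a given filesystem is to compare many small writes against one equivalent large write; the counts are illustrative, and the loop also pays process-startup costs, but the per-file create/open/close is what hurts on shared filesystems:

$ time bash -c 'for i in $(seq 1 10000); do dd if=/dev/zero of=f.$i bs=1k count=1 2>/dev/null; done'  # ~10 MiB as 10,000 files
$ time dd if=/dev/zero of=big.dat bs=1M count=10                                                      # ~10 MiB as one file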

Example Storage Solution (Georgia Tech) – informational purposes only, not a recommendation

Experienced catastrophic failure(s) with all of them at least once

• Panasas appliance for scratch (shared by all)
• GPFS appliance on SATA/NL-SAS/SAS for long-term (and some home)
• NFS on SATA for long-term (many servers)
• NFS on SATA for home (a few servers)
• NFS for administrative storage
• NFS for daily backups
• Coraid system (NFS) for application repository and VM images

• Building a GPFS from commodity components for scratch!

Storage Policies (Georgia Tech)

• 5GB home space
  • backed up daily
  • provided by GT
  • NFS
• ∞ project space
  • backed up daily
  • faculty-purchased, but GT buys the backup space
  • Mix of NFS and GPFS (transitioning to GPFS)
• 5TB/7TB Scratch/Temporary
  • not backed up
  • purchased by GT
  • PanFS (soon to be something else)

Storage Policies (Georgia Tech)

• Scratch
  • files older than 60 days are marked for removal (see the find example below)
  • Users are given one week to save their data (or make a plea for more time)
  • Marked files are removed after 1 week
  • Not backed up
• Quotas
  • Quota increases must be requested by owner or designated manager
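A policy like “older than 60 days” often starts with something as simple as find; a minimal sketch, assuming a hypothetical /scratch tree and modification time as the criterion (production setups may use a dedicated purge tool or filesystem policy engine):

$ find /scratch -type f -mtime +60 -print > /tmp/purge-candidates.txt   # list first, delete later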

Best Practices

• Benchmark the system whenever you can
  • Especially when you first get it (this is the baseline)
  • Then, every time you take the system down (so that you can tell if something has changed)
  • Run the EXACT SAME test! (a baseline-logging sketch follows this list)
• Test the redundancy and reliability
  • Does it survive a drive or server failure? Power something off or rip it out while you are putting a load on it
• Don’t rely solely on generic benchmarks
  • Run the applications your stakeholders care about
• Regularly get data about your data
• Monitor the status of your filesystem, proactively fix problems
• Constantly ask users (and other stakeholders) how they feel about performance
  • It doesn’t matter if benchmarks are good if they feel it is bad
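One hedged way to keep the “exact same test” honest is to script it once and keep dated logs; the flags mirror the iozone examples later in the deck, and the target path is hypothetical:

$ iozone -i 0 -i 1 -+n -I -r 1M -s 4G -f /scratch/bench/testfile | tee iozone-$(date +%F).log

Comparing today’s log against the first (baseline) log makes regressions after maintenance easy to spot.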

How About Cloud and Big Data?

Design/Standards:
• POSIX: Portable Operating System Interface (NFS, GPFS, Panasas, Lustre)
• REST: Representational State Transfer, designed for scalable web services

Case-specific solutions:
• Software defined, hardware independent storage (e.g. Swift)
• Proprietary object storage (e.g. S3 for AWS, which is RESTful)
• Geo-replication: DDN WOS, Azure; object storage: Ceph vs Gluster vs …
• Big data (map/reduce): Hadoop Distributed File System (HDFS), QFS, …
(A short POSIX vs. REST access example follows.)
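To make the POSIX-versus-REST distinction concrete: the same bytes are reached through a file path and the kernel in one case, and through an HTTP request in the other; a minimal sketch with hypothetical paths and endpoint:

$ cp /gpfs/project/run42/output.h5 .                              # POSIX: open/read via a mounted filesystem
$ curl -O https://objects.example.edu/project-bucket/output.h5    # REST: HTTP GET against an object endpoint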

Future

• Hybridization of storage
  • Connecting different storage systems
  • Seamless migration between storage solutions (object store <-> object store, POSIX <-> object)

• Ethernet connected drives
  • Seagate’s Kinetic interface
  • HGST’s open Ethernet drive

• YAC (Yet Another Cache)
  • Cache Acceleration Software
  • DDN Infinite Memory Engine
  • IBM FlashCache

BONUS material: a little bit of Benchmarking

• Use real user applicaons when possible!

• “dd” … quick & easy.

• “iozone” great for single/multi-node read/write performance

• “Bonnie++” simple to run, but a comprehensive suite of tests

• “zcav” good test for spinning hard disks, where speed is a function of distance from the first sector

dd

• Never run as root (destructive if used incorrectly)!
• Reads from an input file (“if”) and writes to an output file (“of”)
• You don’t need to use real files…
  • can read from devices, e.g. /dev/zero, /dev/random, etc.
  • can write to /dev/null
• Caching can be misleading… prefer direct I/O (oflag=direct)

Example:
$ dd if=/dev/zero of=./test.dd bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 37.8403 s, 28.4 MB/s
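For the read direction, a similar hedged one-liner reads the file just written back to /dev/null; iflag=direct again bypasses the page cache so the number reflects the storage, not RAM:

$ dd if=./test.dd of=/dev/null bs=1G count=1 iflag=direct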

iozone

• Common test utility for read/write performance
• Great for both single-node and multi-node testing (i.e. aggr. perf.)
• Sensitive to caching; use “-I” for direct I/O
• Can run multithreaded (-t)

• Simple, ‘auto’ mode:
  iozone -a
• Or pick tests using ‘-i’:
  -i 0: write/rewrite
  -i 1: read/re-read

Throughput test with 16 processes:
# iozone -i 0 -i 1 -+n -r 1M -s 1G -t 16 -I
…
Each process writes a 1048576 kByte file in 1024 kByte records

Children see throughput for 16 initial writers = 1582953.29 kB/sec
Parent sees throughput for 16 initial writers = 1542978.62 kB/sec
Min throughput per process = 97130.07 kB/sec
Max throughput per process = 100058.16 kB/sec
Avg throughput per process = 98934.58 kB/sec
Min xfer = 1019904.00 kB

Children see throughput for 16 readers = 1393510.91 kB/sec
Parent sees throughput for 16 readers = 1392664.11 kB/sec
Min throughput per process = 84657.09 kB/sec
Max throughput per process = 88483.99 kB/sec
Avg throughput per process = 87094.43 kB/sec
Min xfer = 1003520.00 kB

iozone multi-node testing

• Great for testing HPC storage “peak” aggregate performance
• Network becomes a significant contributor
• Requires a “hostfile” with: hosts, test_dir, iozone_path. E.g.:
  iw-h34-17 /gpfs/pace1/ddn /usr/bin/iozone
  iw-h34-18 /gpfs/pace1/ddn /usr/bin/iozone
  iw-h34-19 /gpfs/pace1/ddn /usr/bin/iozone
• Fire away!
  iozone -i 0 -i 1 -+n -e -r 128k -s <file_size> -t <num_threads> -+m <hostfile>
  -i  : tests (0: write/re-write, 1: read/re-read)
  -+n : no retests selected
  -e  : include flush/fflush in timing calculations
  -r  : record (block) size in kB
  -+m : hostfile listing client machines

Bonnie++

• Comprehensive set of tests (per Wikipedia):
  • Create files in sequential order
  • Stat files in sequential order
  • Delete files in sequential order
  • Create files in random order
  • Stat files in random order
  • Delete files in random order
• Just ‘cd’ to the directory on the filesystem, then run ‘bonnie++’
• Uses 2x client memory (by default) to avoid caching effects
• Reports performance (K/sec, higher is better) and the CPU used to perform operations (%CP, lower is better)
• Highly configurable, check its man page!

Bonnie++

$ bonnie++
Writing with putc()...done
…
Delete files in random order...done.
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
atlas-6.pace. 7648M 39329  74 235328  34  2599   2 37794  67 37943   4  46.9   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   740   6   930   2    89   0   282   2   340   1   212   1
atlas-6.pace.gatech.edu,7648M,39329,74,235328,34,2599,2,37794,67,37943,4,46.9,0,16,740,6,930,2,89,0,282,2,340,1,212,1

zcav

• Part of the Bonnie++ suite
• “Constant Angular Velocity (CAV)” tests for spinning media
• I/O performance will differ depending on the distance of the heads from the center of the circular spinning media (first sector)
• Not meaningful for network attached storage
• SSD runs can be interesting (you expect to see a flat line, but…)

[Plots: SATA disk example (http://www.coker.com.au/bonnie++/zcav/results.html) and SSD example (GT machine)]

zcav

Example: What’s going on here??

First run:
$ zcav -f /dev/sda
#loops: 1, version: 1.03e
#block offset (GiB), MiB/s, time
0.00 115.75 2.212
0.25 95.93 2.669
0.50 114.63 2.233
0.75 119.14 2.149
…

Second run (immediately after):
$ zcav -f /dev/sda
#loops: 1, version: 1.03e
#block offset (GiB), MiB/s, time
#0.00 ++++ 0.092
#0.25 ++++ 0.094
#0.50 ++++ 0.091
…

When you run the same example twice, you see super fast “cached” results! Here’s how you flush I/O cache:

sync && echo 3 > /proc/sys/vm/drop_caches

The End. (for real this time)

Thanks for participating!
