9 shades of Lustre
Lustre features review, 04/2019, [email protected]

► Metadata operations & optimizations
  • DNE
  • Lazy Size on MDT
► File striping extensions
  • PFL
  • FLR use case
  • DoM use case
► Lustre network aspects
  • Multi-Rail
  • Dynamic Peer Discovery
  • UDSP
  • LNet Health

Goals addressed by these features:
  • Lustre metadata performance scale out
  • 'ls -l' speed improvement
  • simplifying network configuration
  • choosing the correct striping
  • increasing network bandwidth
  • strengthening data availability
  • small file performance improvement

Lustre metadata scale out

► Lustre initial design: a single MDS/MDT serves the whole namespace
► DNE, aka Distributed Namespace Environment
  • phase 1 introduced with Lustre 2.4 (May 2013)

► DNE phase 1 benefits:
  • supports up to 256 MDTs
  • additional MDTs on additional MDSes
► DNE phase 1 limitations:
  • a remote directory is assigned to a single MDT
  • only remote directory creation/unlink are allowed
  • no migration tool to move between MDTs (needs 'mv')
  • synchronous cross-MDT operations

► DNE phase 2
  • introduced with Lustre 2.8 (March 2016)
  • spreads a single directory across multiple MDTs
  ⇒ allows striped directories in addition to remote directories
  ⇒ much more flexible than phase 1
[Diagram: a striped directory split into dir shards 0 to 3, with fileA, fileB, fileC, fileD distributed across the shards]

► DNE phase 2 addresses phase 1 limitations:
  • rename and link operations supported
  • tool to migrate directories from one MDT to another
  • asynchronous cross-MDT operations
  ⇒ more user-friendly
► DNE phase 2 benefits:
  • scales the size and performance of large directories
  • simple load balancing across MDTs
► How to use (a worked sketch follows at the end of this section):
  lfs mkdir -c mdt_count /mount_point/new_directory
  lfs mkdir -i mdt_index /mount_point/new_directory

Improving DNE usability

► DNE phase 3
  • introduction began with Lustre 2.12 (December 2018)
  • directory restriping from single-MDT to striped directories
► How to migrate the contents of a large directory from its current location to MDT0001 and MDT0003:
  lfs migrate -m 1,3 /mount_point/largedir
► How to pick the least full MDT:
  lfs mkdir -i -1 /mount_point/new_dir

► DNE phase 3 continuation in 2.13+
  • automatically create a new remote directory on the "best" MDT with mkdir()
    ◦ simplifies use of multiple MDTs without striping all directories
    ◦ similar to OST usage
  • automatic directory restriping
    ◦ avoids explicit striping at create time
    ◦ starts with a one-stripe directory
    ◦ adds extra stripes as the directory grows
[Diagram: a master one-stripe directory restriped to +4, then +12 directory stripes as it grows]
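To make the DNE commands concrete, a minimal sketch (the mount point, directory names, and stripe count are illustrative assumptions, not from the slides):

  # create a directory striped across 4 MDTs
  lfs mkdir -c 4 /mnt/lustre/shared_dir
  # inspect the resulting directory layout
  lfs getdirstripe /mnt/lustre/shared_dir
  # let Lustre pick the least full MDT for a new single-MDT directory
  lfs mkdir -i -1 /mnt/lustre/new_dir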
Addressing 'ls -l' "slowness"

► Retrieving a file's size is not that simple in the initial design:
  • the MDT holds some file metadata: ctime, mtime, owner, etc.
  • BUT file I/O is managed directly between clients and OSTs
  • computing a file's size requires sending requests to all OSTs holding the file's objects
  ⇒ "slow" 'ls -l'

► Lazy Size on MDT (LSOM)
  • introduction began with Lustre 2.12 (December 2018)
  • lazy means "not real-time"
    ◦ the lazy size is saved as an extended attribute on the MDT
    ◦ the lazy size is updated on file close/truncate
  • useful for policy engines that can read this extended attribute
    ◦ RobinHood
    ◦ Stratagem
    ◦ 'lfs find' in 2.13 is able to use LSOM to avoid OST RPCs
  • at this stage, the Lustre client is not directly LSOM-aware
► 'ls -l' performance with an experimental LSOM-aware client
[Chart: 'ls -l' timing with the experimental LSOM-aware client versus the regular client]

Choosing the right striping

► Striping is a convenient way to parallelize bandwidth across OSTs
► ... but involving more components adds overhead
  • large stripe count to increase performance?
  • or small stripe count to limit overhead?
► ... and users do not really understand striping
  • the vast majority of files use the default striping
  • no correlation between file size and striping

► Progressive File Layout (PFL) for improved performance and usability
  • introduced with Lustre 2.10 (July 2017)
  • the file layout is described by a series of components
  • each component has its own stripe count, stripe size, OST pool, etc.
[Diagram: a PFL file with component 0 covering [0, 32M), component 1 covering [32M, 1G), and component 2 covering [1G, EOF), each striped over progressively more OST objects]

► Progressive File Layout (PFL) benefits
  • simplifies Lustre usage for novice users
  • reasonable performance for a variety of I/O patterns
    ◦ low overhead for small files
    ◦ high bandwidth for large files
  • a stepping stone to more features
► How to use PFL (a concrete sketch follows below):
  lfs setstripe [--component-end|-E end1] [STRIPE_OPTIONS]
                [--component-end|-E end2] [STRIPE_OPTIONS] ... filename
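For illustration, one possible three-component layout, loosely modeled on the diagram above (the boundaries, stripe counts, and paths are assumptions, not a recommendation from the slides):

  # up to 32M: 1 stripe; 32M to 1G: 4 stripes; beyond 1G: all OSTs (-c -1)
  lfs setstripe -E 32M -c 1 -E 1G -c 4 -E -1 -c -1 /mnt/lustre/pfl_file
  # the same layout can be set on a directory as the default for new files
  lfs setstripe -E 32M -c 1 -E 1G -c 4 -E -1 -c -1 /mnt/lustre/pfl_dir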
Performance benefit from PFL, as seen by end users

► Multiple clients accessing the same shared file
  • IOR, 32 clients, 512 threads
Source: ORNL presentation at LUG 2016,
http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D3_Evaluating-Progressive-File-Layouts_Mohr.pdf

OST space balance with PFL

PFL layout used in the ORNL evaluation:
  [0, 1MB)       stripe_count=1
  [1MB, 64MB)    stripe_count=4
  [64MB, 128GB)  stripe_count=16
  [128GB, EOF)   stripe_count=48
Source: ORNL presentation at LUG 2016,
http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D3_Evaluating-Progressive-File-Layouts_Mohr.pdf

Strengthening data availability

► Software, network, and hardware all contribute to unavailability of Lustre data
  • Lustre sits at the top of a deep software/hardware stack and depends on all components working
  • it needs availability better than that of the individual hardware and software components
  • it needs more robustness against data loss/corruption

► File Level Redundancy (FLR) provides significant value and functionality
  • introduction began with Lustre 2.11 (April 2018)
  • based on the PFL feature
  • a file's data is no longer confined to a single location
    ◦ replicas can be created on multiple OSTs
[Diagram: replica 1 (PREFERRED) backed by object j, replica 2 backed by object k]

► Multiple benefits from FLR, e.g.:
  • higher availability across server/network failures
    ◦ finally better than HA failover
  • robustness against data loss/corruption
    ◦ mirroring, or M+N erasure coding for stripes (Lustre 2.14)
  • increased read speed for widely shared files
    ◦ mirror input data across many OSTs
► How to use FLR (a short sketch follows below):
  lfs mirror create <--mirror-count|-N[mirror_count] [setstripe_options]> ... <filename|directory>
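A minimal FLR sketch (the path and mirror count are illustrative assumptions; note that once a mirrored file is written, stale replicas have to be resynchronized):

  # create a file with two mirrors
  lfs mirror create -N2 /mnt/lustre/mirrored_file
  # after writes, bring stale mirrors back in sync
  lfs mirror resync /mnt/lustre/mirrored_file
  # check the mirror components and their state
  lfs getstripe /mnt/lustre/mirrored_file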
Improving small file performance

► Small file I/O concerns
  • small file data sits on a single OST
    ◦ no benefit from multiple OSTs in parallel
  • random I/O patterns
    ◦ more latency sensitive
    ◦ slow down concurrent streaming I/O
  • data is small
    ◦ no read-ahead possible
    ◦ more RPCs for the same amount of data

► Data-on-MDT (DoM) small file performance
  • introduced with Lustre 2.11 (April 2018)
  • based on the PFL feature
  • stores small file data directly on the MDT
  • DoM files grow onto OSTs after the MDT size limit is reached
[Diagram: RPC flows for open(O_RDWR|O_TRUNC), stat(), truncate(); without DoM the client contacts both the MDS (layout, lock enqueue, attributes) and the OSSes (lock, read, truncate, write); with DoM the MDS alone serves the request]

► Data-on-MDT (DoM) benefits
  • separates large and small I/O data streams
  • file size is immediately available
► Please keep in mind
  • increases RPC pressure on the MDS
  • needs more storage space on MDTs

► DoM performance for sub-DoM-size I/O
[Chart: 8k reads, bandwidth in MB/s (0 to 900) versus number of clients (1, 2, 4, 8, 16), comparing DoM, stripe=-1 (8 OSTs), and stripe=1]

► DoM is completely configurable
  ◦ decide which files to store on the MDT
  ◦ decide what size to store on the MDT
► How to use DoM:
  lfs setstripe --component-end|-E end1 --layout|-L mdt
                [--component-end|-E end2 [STRIPE_OPTIONS] ...] <filename>

Increasing network bandwidth

► Need for high network bandwidth
  • big Lustre clients with lots of memory and NUMA architectures
  • big Lustre servers with lots of storage (many OSTs, very large OSTs with DCR, ...)
► Possible solutions?
  • adding faster interfaces
    ⇒ implies replacing much or all of the network
  • adding more interfaces to the nodes
    ⇒ requires a redesign of LNet: this is what Multi-Rail brings (see the sketch below)
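Since the deck's agenda points to Multi-Rail as the answer here, a minimal configuration sketch (the o2ib network and the interface names ib0/ib1 are assumptions for this example):

  # declare one LNet network backed by two interfaces (Multi-Rail aggregation)
  lnetctl net add --net o2ib0 --if ib0,ib1
  # verify the configured network interfaces
  lnetctl net show
  # list peers; on Lustre 2.11+, Dynamic Peer Discovery fills this in automatically
  lnetctl peer show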