Adam Roe – HPC Solutions Architect, Intel® High Performance Data Division

Legal Information

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for--software.html.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

3D XPoint, Intel, the Intel logo, Intel Core, Intel Xeon Phi, Optane and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. © 2016 Intel Corporation

2 What is Lustre*?

Lustre* is an object-based, open source, distributed, parallel, clustered file system (GPLv2)
§ Runs externally from the compute cluster
§ Accessed by clients over the network (Ethernet, InfiniBand*)
§ Up to 512 PB size, 32 PB per file with LDISKFS
§ Production filesystems have exceeded 2 TB/sec

Designed for maximum performance at massive scale
POSIX compliant
Global, shared name space – all clients can access all data
Very resource efficient and cost effective

Timeline: 2003 Lustre released; 2007 Sun acquires CFS; 2009 Oracle* acquires Sun; 2010 Whamcloud* founded; 2012 Whamcloud joins Intel
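For illustration, a minimal sketch of the client-side view, assuming a hypothetical MGS node "mgs01" and file system name "lsfs":

# Mount the Lustre file system served by the MGS over TCP; the client only needs
# the Lustre client modules and network connectivity to the servers.
mount -t lustre mgs01@tcp0:/lsfs /mnt/lsfs

# Show the MDTs and OSTs backing the mounted file system.
lfs df -h /mnt/lsfs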

* Other names and brands may be claimed as the property of others.

3 Intel® and Lustre*

Starting today (17/04/2017), Intel will contribute all Lustre features and enhancements to the open source community. This means we will no longer provide Intel-branded releases of Lustre, and will instead align our efforts and support around the community release.

These changes are designed to increase Intel's involvement in the community and to accelerate technical innovation in Lustre. For the community as a whole, this will mean easier access to the latest stable Lustre releases and an acceleration of the technical roadmap. See the FAQ at the end of this deck for more details.

4 History of Community Lustre* – A journey into innovation and freedom

* Other names and brands may be claimed as the property of others.

5 Community Release Roadmap

2.9
§ UID/GID Mapping
§ Shared Key Crypto
§ Large Block IO
§ Subdirectory Mounts

2.10*
§ ZFS Snapshots
§ Multi-rail LNET
§ Progressive File Layouts
§ Project Quotas

2.11
§ Data on MDT
§ LNET Network Health
§ FLR – delayed resync

2.12
§ FLR – immediate resync

[Timeline axis: Q1–Q4 2017, Q1–Q4 2018]

*LTS release with maintenance releases provided. Estimates are not commitments and are provided for informational purposes only. Fuller details of features in development are available at http://wiki.lustre.org/Projects. Last updated: April 5th 2017.

6 Lustre*: Market Share

§ Most Adopted PFS; Most Scalable PFS
§ Open Source (GPL v2), with commercial packaging available
§ Vibrant community
§ 9 of Top10 sites; 71% of Top100
[Pie chart of parallel file systems across the Top100: Lustre 73%, GPFS 16%, other 4%, unknown 7%]

June 2016: Intel’s Analysis of Top 100 Systems (top100.org)

7 Lustre* 2.9 – Intel Contributions

[Charts: contributions to Lustre 2.9 by organization – lines of code changed and number of commits. Contributing organizations include ANU*, Atos*, CEA*, Clogeny*, Cray*, DDN*, Fujitsu*, Intel, IU*, LLNL*, ORNL*, Purdue*, Seagate*, SGI*, GSI* and others, with Intel the largest contributor by a wide margin.]

Source: http://git.whamcloud.com/fs/lustre-release.git. Statistics courtesy of Dustin Leverman (ORNL). Data aggregated by organization between the 2.8.50 and 2.8.59 tags.

8 A Vibrant Ecosystem of 140+ Partners

9 Lustre* Software – What’s Inside?

[Diagram: Lustre* software architecture]
§ Metadata Servers (1-10s) hosting the Management Target (MGT) and Metadata Target (MDT)
§ Object Storage Servers (10s-1000s) hosting the Object Storage Targets (OSTs)
§ Management Network and management for Lustre
§ High Performance Data Network (InfiniBand*, 10GbE)
§ Lustre Clients (1 – 100,000+)

Ecosystem capabilities: Native Lustre* client for Intel® Xeon Phi™ processor, Intel® Omni-Path support, Robinhood, OpenZFS and RAIDz, Hadoop* adapters, HSM

* Other names and brands may be claimed as the property of others.

10

Motivation

Usage Models: Machine Learning, Genomics, Autonomous Cars, Precision Medicine, Exascale Computing

Technical Needs: Performance, Rapid Scalability, Security & Compliance, Manageability, Reliability / Availability

Lustre with OpenZFS: Extreme Performance at Scale, Integrated Security, SW Management Stack, Data Integrity and Recovery, Deep Integration, Open Source and Extensible

12 ZFS – Unique Features

§ Incredible reliability – Data is always consistent on disk; silent data corruption is detected and corrected; smart rebuild strategy
§ Compression – Maximize usable capacity for increased ROI
§ Snapshot support built into Lustre – Consistent snapshot across all the storage targets without stopping the file system
§ Hybrid storage pool – Data is tiered automatically across DRAM, SSD/NVMe and HDD, accelerating random & small file read performance
§ Manageability – Powerful storage pool management makes it easy to assemble and maintain Lustre storage targets from individual devices
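A minimal sketch of what this pool management and snapshot support looks like in practice, assuming hypothetical pool, device and file system names (exact options vary by ZFS and Lustre version):

# Assemble a RAIDZ2 pool from individual devices and format it as a Lustre OST;
# with the ZFS backend, mkfs.lustre can create the zpool itself.
mkfs.lustre --ost --backfstype=zfs --fsname=lsfs --index=0 \
    --mgsnode=mgs01@tcp0 ostpool/ost0 raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Enable compression on the backing dataset to maximize usable capacity.
zfs set compression=lz4 ostpool/ost0

# Take a point-in-time snapshot of this one target at the ZFS level.
zfs snapshot ostpool/ost0@before-upgrade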

13 Industry Adoption

§ Path to Exascale – CORAL and future follow-on architectures are scoped with ZFS
§ LLNL Sequoia1 (55 PB file system) – Cheaper, less complex, higher performance file system for Sequoia
§ With Intel, Lustre and ZFS continue to advance – Collaborate with the OpenZFS community on new features; improve metadata performance (LAD'16 talk)

1 http://computation.llnl.gov/projects/zfs-lustre

14 Intel’s Commitment to Lustre w/ZFS

§ Performance enhancements – ZFS improvements for increased metadata performance
§ Native encryption – Built-in encryption for data at rest to provide enhanced storage security
§ Fault management – Enhanced fault monitoring and management architecture for ZFS
§ Persistent read cache – Update of the existing L2ARC read cache to persist data across reboots
§ D-RAID – De-clustered RAIDZ provides massively improved rebuild performance after a drive failure
§ Parity acceleration – Using AVX instructions to accelerate parity calculation

(Delivered in collaboration with the OpenZFS community and IPCC)

15

16 ZFS Snapshots

The OpenZFS file system provides integrated support for snapshots, a data protection feature that enables an operator to checkpoint a file system volume. Common use cases are:
§ Quick undo / undelete / roll-back in case of user/administrator error
§ Prepare a consistent, read-only view of data for backup
§ Prepare for software upgrade

In Intel® EE for Lustre* software, version 3.0, Intel has developed a mechanism in Lustre that is capable of leveraging ZFS to take a coordinated snapshot of an entire Lustre file system, provided that all of the storage targets in the file system are formatted using ZFS. Also available in 2.10 LTS.

In this release of the Intel® EE for Lustre* software, the snapshot is taken across the whole file system, which can then be mounted as a separate name space on a Lustre client. The snapshot appears to be a separate Lustre instance. Also available in 2.10 LTS.
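A minimal sketch of the coordinated snapshot workflow as exposed in Lustre 2.10, run on the MGS node (file system and snapshot names are hypothetical):

# Create a consistent snapshot across all ZFS-backed targets of file system "lsfs".
lctl snapshot_create -F lsfs -n before_upgrade

# List snapshots, then mount one as a separate, read-only Lustre namespace.
lctl snapshot_list -F lsfs
lctl snapshot_mount -F lsfs -n before_upgrade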

* Other names and brands may be claimed as the property of others.

17 Security1 – Access and Network Encryption

Kerberos* provides a means for authentication and authorization of participants on a computer network, as well as providing secure communications through authentication.

This functionality has been applied to Intel® EE for Lustre* software for the purpose of establishing trust between Lustre servers and clients, and optionally, supporting encrypted network communications. Also available in 2.10 LTS.

1 No computer system can be absolutely secure. * Other names and brands may be claimed as the property of others.

18 Security1 – Authorization and Access Control

SELinux* is a mature access-control platform for Linux* systems that was originally developed by the NSA and is available in RHEL* to enforce access control policies, including Multi-Level Security (MLS).

The implementation is currently restricted to enforcing SELinux policies on the Lustre client. No enforcement exists on the servers.

[Diagram: MLS levels – Top Secret, Secret, Confidential, Unclassified – with "no write down" and "no read up" rules]

1 No computer system can be absolutely secure. * Other names and brands may be claimed as the property of others.

19 Dynamic LNet

Dynamic LNet configuration (DLC) is a powerful extension of the LNet software to simplify system administration tasks for Lustre* networking. DLC allows an operator to make changes to LNet (for example, network interfaces can be added and removed, or parameters changed) without requiring that the kernel modules be removed and reloaded.

New commands to configure networks and routes (like "route"):

lnetctl net add --net {} --if {} [--peer_{credits,timeout} {}]
lnetctl net del --net {}
lnetctl net show [--net {}] [--verbose]
lnetctl route add --net {} --gateway {} [--hop {}]
lnetctl route del --net {}
lnetctl route show [--net {}] [--verbose]
lnetctl set {tiny,small,large}_buffers 8192
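For example, a hypothetical InfiniBand network can be added, inspected and removed on a running node without reloading modules (network and interface names are illustrative):

# Add an o2ib LNet network on interface ib0.
lnetctl net add --net o2ib0 --if ib0

# Show the resulting configuration, then remove the network again.
lnetctl net show --net o2ib0 --verbose
lnetctl net del --net o2ib0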

20 Lustre* for Big Data (Now Open Source)

Storage Connector ("Hadoop Adapter for Lustre")
§ Allows Hadoop* clients to use Lustre as primary storage for data analytics (replaces HDFS storage)
§ Enables Hadoop nodes and Lustre storage to scale independently
§ Consolidate HPC and data analytics storage to remove data silos

Job Connector ("HPC Adapter for MapReduce")
§ Integrates YARN with HPC schedulers for HPC and data analytics integration
§ PBS Professional (NEW) and SLURM job schedulers

Hadoop Adapter and HPC Adapter for Lustre are compatible with Apache* Hadoop 2.5 and 2.6.

[Diagram: Administrator UI & CLI, Hadoop* Management Dashboard, REST API and Intelligent Monitoring Layer on top of the connectors and storage plug-in; Hadoop* applications (MR2) run against the Lustre file system on HW RAID storage or software-defined storage (ZFS)]

* Other names and brands may be claimed as the property of others.

21 ZFS Enhancements (2.9+ - Ongoing work)

Changes for using ZFS better
§ 1MB+ ZFS blocksize (IO performance, LLNL)
§ Improved file create performance (Intel)
§ Snapshots of whole filesystem (Intel)

Changes to core ZFS code
§ Inode quota accounting (Intel)
§ Large dnodes to improve xattr performance (LLNL)
§ Declustered parity & distributed hot spares to improve resilvering (Intel)
§ Metadata allocation class to store all metadata on SSD/NVRAM (Intel)
§ Reduce CPU with hardware-assisted checksums, compression (Intel)

[Diagram: ZFS storage pool with a declustered parity (dRAID) VDEV, mirrored metadata class, mirrored log class, and L2ARC on SSD]
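As an illustration of the metadata allocation class item above, a sketch of how the feature is configured in OpenZFS releases that ship it, with hypothetical pool and device names:

# Add a mirrored "special" vdev so metadata is allocated on NVMe rather than
# on the HDD data vdevs.
zpool add ostpool special mirror /dev/nvme0n1 /dev/nvme1n1

# Optionally route small file blocks to the special class as well.
zfs set special_small_blocks=32K ostpool/ost0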

22 Data Security for All Environments (2.9 & 2.10 LTS)

UID/GID Mapping and Shared Secret Key Crypto (IU*, OpenSFS*, 2.9)
§ Data encryption for networks including RDMA (IB, OPA)
§ Strong client node authentication into administrative node groups
§ UID/GID mapping for WAN clients
§ Block unauthorized clients by network

Data isolation via filesystem containers (DDN*, 2.9)
§ Subdirectory mounts with client authentication
§ Usable with hosted, isolated environments
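A minimal sketch of a subdirectory mount used for such isolation (server, file system and directory names are hypothetical):

# Mount only the "project1" subtree of file system "lsfs"; this client never
# sees the rest of the namespace.
mount -t lustre mgs01@tcp0:/lsfs/project1 /mnt/project1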

* Other names and brands may be claimed as the property of others.

23 Networking Improvements (2.10 LTS)

Improved networking capabilities
§ Support for EDR and FDR InfiniBand*, MLX5
§ Intel® Omni-Path network support
§ RPC crypto for RDMA networks like IB and OPA

§ Multi-Rail support for all network types

[Diagram: Multi-rail by o2iblnd – clients A and B each reach the MDS/OSS through two IB switches; when one IB switch fails, traffic continues on the remaining rail and no MDS/OSS failover is required]
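A minimal sketch of enabling multi-rail with the lnetctl interface (interface names are hypothetical):

# Attach two InfiniBand ports to the same LNet network so traffic is spread
# across both rails and survives the loss of one switch.
lnetctl net add --net o2ib0 --if ib0,ib1

# Inspect the resulting network interface configuration.
lnetctl net show --verbose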

* Other names and brands may be claimed as the property of others.

24 Improved Small File Performance (2.11)

Data-on-MDT optimizes small file IO
§ Avoid OST overhead (data, lock RPCs)
§ High-IOPS MDTs (mirrored SSD vs. RAID-6 HDD)
§ Avoid contention with streaming IO to OSTs
§ Prefetch file data with metadata
§ Size on MDT for files
§ Manage MDT usage by quota

[Diagram: small file IO goes directly between client and MDS – open, write data, read, attr; layout, lock, size, read data]

Complementary with DNE 2 striped directories
§ Scale small file IOPS with multiple MDTs
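A minimal sketch of enabling Data-on-MDT for a directory via a composite layout (path and sizes are hypothetical):

# Keep the first 1MB of each new file in this directory on the MDT; anything
# larger spills over into a normal 4-stripe OST component.
lfs setstripe -E 1M -L mdt -E -1 -c 4 /mnt/lsfs/smallfiles

# Confirm the default layout new files will inherit.
lfs getstripe -d /mnt/lsfs/smallfiles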

25 Feature Optimisation: Data-on-MDT

[Diagram: RPC flows between client, MDS and OST comparing DoM files with traditional files]
§ Glimpse-ahead: DoM file – 1 RPC (2 with GLIMPSE); traditional file – 2 RPCs (3 with GLIMPSE)
§ Lock on open: DoM file – 1 RPC; traditional file – 2 RPCs
§ Read on open: DoM file – 1 RPC + BULK if size >128k; traditional file – 3 RPCs + BULK

26 Feature Optimisation: Data-on-MDT (Cont.) – Small file creates directly on the Lustre MDT

§ Architecturally very different, both from a hardware and a software perspective
§ Space used and load on the MDT is considerably higher
§ 4x speed-up when using DoM for small files on an NVMe Lustre MDT (~4-32KiB tested)
§ 1.9x of that comes purely from efficiency improvements in the network, i.e. less/better use of RPCs

[Chart: File creates per second (4KiB files) – 1 NVMe MDT + 1 HDD ZFS OST: 20,039; 1 NVMe MDT + 1 NVMe OST: 44,403; 1 NVMe MDT (DoM): 79,690]

27 Composite File Layouts (2.10 LTS) – Innovation in Storage Usage

Progressive File Layouts simplify usage and provide new options
§ Optimize performance for diverse users/applications
§ Low overhead for small files, high bandwidth for large files
§ Lower new user usage barrier and administrative burden
§ Multiple storage classes within a single file
§ HDD or SSD, mirror or RAID

Example progressive file layout with 3 components: 1 stripe for [0, 32MB), 4 stripes for [32MB, 1GB), 128 stripes for [1GB, ∞)
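The three-component example above could be expressed as follows (the mount point is hypothetical):

# 1 stripe for [0, 32MB), 4 stripes for [32MB, 1GB), 128 stripes beyond 1GB.
lfs setstripe -E 32M -c 1 -E 1G -c 4 -E -1 -c 128 /mnt/lsfs/results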

28 Improved Data Availability (Intel 2.11+)

File Level Redundancy provides significant value and functionality for HPC
§ Configure on a per-file/dir basis (e.g., mirror input files and one daily checkpoint)
§ Higher availability for server/network failure - finally better than HA failover
§ Robustness against data loss/corruption - mirror or M+N erasure coding for stripes
§ Increased read speed for widely shared files - mirror input data across many OSTs
§ Replicate/migrate files between storage classes - NVRAM->SSD->HDD
§ Local vs. remote replicas
§ Partial HSM file restore
§ File versioning, ...

[Diagram: Replica 0 – Object j (primary, preferred); Replica 1 – Object k (stale), delayed resync]
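A minimal sketch of creating and resyncing a mirrored file with FLR (paths are hypothetical):

# Create a file with two replicas; reads can be served from either copy.
lfs mirror create -N2 /mnt/lsfs/input.dat

# After a replica has gone stale (delayed resync), bring it back in sync.
lfs mirror resync /mnt/lsfs/input.dat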

29

FAQ Intel® and Lustre* (1/3)

Why is Intel making these changes?
Intel is seeking to improve the rate of innovation by making the team more efficient in producing the Lustre technology.

Is Intel still offering Lustre support services?
Yes. Intel will continue to provide its core Lustre L3 Support offering to OEM partners and direct customers.

What is happening to the proprietary components from the Intel-branded releases?
HAL and HAM have already been open sourced and are available on github:
· https://github.com/intel-hpdd/lustre-connector-for-hadoop
· https://github.com/intel-hpdd/scheduling-connector-for-hadoop
IML will be available in the near future.

32 Intel® and Lustre* (2/3)

Where can I access the community Lustre releases?
These are available on lustre.org - https://downloads.hpdd.intel.com/public/lustre/

Will the community Lustre releases provide the quality provided by Intel-branded Lustre releases?
We do not expect there to be any drop-off in quality. Intel will be heavily involved in producing the Lustre community releases and expects there to be the same high level of quality that customers have come to expect from the Intel-branded releases. Furthermore, Intel will be publishing community maintenance releases for a designated Long Term Stable (LTS) release so that users can easily access an up-to-date "latest and greatest".

Which release will be the first LTS Lustre community release?
The upcoming community Lustre 2.10 release will be the first LTS release. This release is targeted for the end of June 2017. The latest status is available here - http://lustre.org/download/

33 Intel® and Lustre* (3/3)

Will sites running existing Intel-branded releases be able to upgrade to the community LTS release?
Yes. Engage through your support channels to get details on how to upgrade.

What will happen with support for my existing deployments using Intel-branded releases?
You will be able to get support through your existing support channels, as today.

Who should be the contact point for any further questions that I have?
Please contact the Lustre team at [email protected] for any other information required.

34