
Introduction to Lustre* Architecture
Lustre* systems and network administration – October 2017

* Other names and brands may be claimed as the property of others.

Lustre – Fast, Scalable Storage for HPC

Lustre* is an open-source, object-based, distributed, parallel, clustered file system
§ Designed for maximum performance at massive scale
§ Capable of Exascale capacities
§ Highest IO performance available for the world's largest supercomputers
§ POSIX compliant
§ Efficient and cost effective


Lustre* is an open-source, distributed, parallel platform designed for massive scalability, high performance, and high availability. Popular with the HPC community, Lustre is used to support the most demanding data-intensive applications. Lustre provides horizontally scalable IO for data centres of all sizes, and is deployed alongside some of the very largest supercomputers. The majority of the top 100 fastest supercomputers, as measured by top500.org, use Lustre for their high-performance, scalable storage.

Lustre file systems can scale from very small platforms of a few hundred terabytes up to large scale platforms with hundreds of petabytes, in a single, POSIX-compliant, name space. Capacity and throughput performance scale easily.

Lustre runs on Linux-based operating systems and employs a client-server network architecture. Lustre software services are implemented entirely within the Linux kernel, as loadable modules. Storage is provided by a set of servers that can scale to populations measuring up to several hundred hosts. Lustre servers for a single file system instance can, in aggregate, present up to tens of petabytes of storage to thousands of compute clients simultaneously, and deliver more than a terabyte-per-second of combined throughput.

Lustre Scalable Storage

[Diagram: Lustre scalable storage architecture – Management Target (MGT), Metadata Target (MDT0), DNE Metadata Targets (MDTi – MDTj) and Object Storage Targets (OSTs), served by storage servers grouped into failover pairs (Management & Metadata Servers, Additional Metadata Servers, Object Storage Servers), connected over a High Performance Data Network (Omni-Path, InfiniBand, Ethernet) to Lustre Clients (1 – 100,000+).]


Lustre’s architecture uses distributed, object-based storage managed by servers and accessed by client computers using an efficient network protocol. There are metadata servers, responsible for storage allocation, and managing the file system name space, and object storage servers, responsible for the data content itself. A file in Lustre is comprised of a metadata object and one or more data objects.

Lustre is a client-server, parallel, distributed file system. Servers manage the presentation of storage to a network of clients, and write data sent from clients to persistent storage targets.

There are three different classes of server:
• Management server: provides configuration information and file system registries.
• Metadata servers: record the file system namespace (the files and directories) and maintain the file system index.
• Object storage servers: record file content in distributed binary objects. A single file is comprised of 1 or more objects, and the data for that file is organized in stripes across the objects. Objects are distributed across the available storage targets.

Lustre separates metadata (inode) storage from block data storage (file content). All file metadata operations (creating and deleting files, allocating data objects, managing permissions) are managed by the metadata servers. Metadata servers provide the index to the file system. Metadata is stored in key-value index objects that store file system inodes: file and directory names, permissions, block data locations, extended attributes, etc. Lustre object storage servers write the data content out to persistent storage. Object servers can be written to concurrently, and individual files can be distributed across multiple objects on multiple servers. This allows very large files to be created and accessed in parallel by processes distributed across a network of compute nodes. Block data is stored in binary byte array data objects on a set of storage targets. A single Lustre "file" can be written to multiple objects across many storage targets.

In addition to the metadata and object data, there is the management service, which is used to keep track of servers, clients, storage targets and file system configuration parameters.

Clients aggregate the metadata name space and object data to present a coherent POSIX file system to applications. Clients do not access storage directly: all I/O is sent over a network. The Lustre client software splits IO operations into metadata and block data, communicating with the appropriate services to service IO transactions. This is the key concept of Lustre’s design – separate small, random, IOPS-intensive metadata traffic from the large, throughput-intensive, streaming block IO.

Lustre Building Blocks

Building block for scalability:
• Metadata is stored separately from file object data. With DNE, multiple metadata servers can be added to increase namespace capacity and performance.
• Add more OSS building blocks to increase capacity and throughput bandwidth.
• Uniformity in building block configuration promotes consistency in performance and behavior as the file system grows.
• Balance server IO with storage IO capability for best utilisation.

[Diagram: Management Target (MGT), Metadata Target (MDT) and Object Storage Targets (OSTs) on storage servers grouped into failover pairs; Management, Metadata and Object Storage Servers with Manager for Lustre; High Performance Data Network (InfiniBand, 10GbE); Lustre Clients (1 – 100,000+).]


The major components of a Lustre file system cluster are:

• MGS + MGT: Management service, provides a registry of all active Lustre servers and clients, and stores Lustre configuration information. The MGT is the management service storage target used to store configuration data.
• MDS + MDTs: Metadata service, provides the file system namespace (the file system index), storing the inodes for a file system. The MDT is the metadata storage target, the storage device used to hold metadata information persistently. Multiple MDSs and MDTs can be added to provide metadata scaling.
• OSS + OSTs: Object Storage service, provides bulk storage of data. Files can be written in stripes across multiple object storage targets (OSTs). Striping delivers scalable performance and capacity for files. The OSS is the primary scalable service unit that determines the overall aggregate throughput and capacity of the file system.
• Clients: Lustre clients mount each Lustre file system instance using the Lustre Network protocol (LNet), presenting a POSIX-compliant file system to the OS. Applications use standard POSIX system calls for Lustre IO, and do not need to be written specifically for Lustre.
• Network: Lustre is a network-based file system; all IO transactions are sent using network RPCs. Clients have no local persistent storage and are often diskless. Lustre supports many different network technologies, including OPA, IB and Ethernet.
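To make this concrete, the following is a minimal sketch of mounting a Lustre client over LNet. The MGS NID (10.0.0.1@o2ib), file system name (lfs01) and mount point are hypothetical examples, not values taken from this document.

    # Mount the file system "lfs01" served by the MGS at 10.0.0.1@o2ib
    mkdir -p /mnt/lfs01
    mount -t lustre 10.0.0.1@o2ib:/lfs01 /mnt/lfs01

    # Confirm that the client can see all MDTs and OSTs
    lfs df -h /mnt/lfs01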

Lustre Networking

Applications do not run directly on storage servers: all application I/O is transacted over a network. Lustre network I/O is transmitted using a protocol called LNet, derived from the Portals network programming interface. LNet has native support for TCP/IP networks as well as RDMA networks such as Intel Omni-Path Architecture (OPA) and InfiniBand. LNet supports heterogeneous network environments. LNet can aggregate IO across independent interfaces, enabling network multipathing. Servers and clients can be multi-homed, and traffic can be routed using dedicated machines called LNet routers. Lustre network (LNet) routers provide a gateway between different LNet networks. Multiple routers can be grouped into pools to provide performance scalability and to provide multiple routes for availability.
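As an illustration, a minimal LNet configuration sketch for a multi-homed node follows; the interface names, network labels and peer NID are assumptions for the example rather than values from this document.

    # /etc/modprobe.d/lustre.conf – declare one RDMA and one TCP LNet network
    options lnet networks="o2ib0(ib0),tcp0(eth0)"

    # After the lnet module is loaded, list local NIDs and ping a peer NID
    lctl list_nids
    lctl ping 10.0.0.1@o2ib0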

The Lustre network protocol is connection-based: end-points maintain shared, coordinated state. Servers maintain exports for each active connection from a client to a server storage target, and clients maintain imports as an inverse of server exports. Connections are per- target and per-client: if a server exports N targets to P clients, there will be (N * P) exports on the server, and each client will have N imports from that server. Clients will have imports for every Lustre storage target from every server that represents the file system.

Most Lustre protocol actions are initiated by clients. The most common activity in the Lustre protocol is for a client to initiate an RPC to a specific target. A server may also initiate an RPC to the target on another server, e.g. an MDS RPC to the MGS for configuration data; or an RPC from MDS to an OST to update the MDS’s state with available space data. Object storage servers never communicate with other object storage servers: all coordination is managed via the MDS or MGS. OSS do not initiate connections to clients or to an MDS. An OSS is relatively passive: it waits for incoming requests from either an MDS or Lustre clients.

Lustre and Linux

The core of Lustre runs in the Linux kernel on both servers and clients. Lustre servers have a choice of backend storage target formats, either LDISKFS (derived from EXT4), or ZFS (ported from OpenSolaris to Linux). Lustre servers using LDISKFS storage require patches to the Linux kernel. These patches are to improve performance, or to enable instrumentation useful during the automated test processes in Lustre’s software development lifecycle. The list of patches continues to reduce as kernel development advances, and there are initiatives underway to completely remove customized patching of the Linux kernel for Lustre servers.

Lustre servers using ZFS OSD storage and Lustre clients do not require patched kernels. The Lustre client software is being merged into the mainstream Linux kernel, and is available in the kernel staging tree.

Lustre and High Availability

Service availability / continuity is sustained using a High Availability failover resource management model, where multiple servers are connected to shared storage subsystems and services are distributed across the server nodes. Individual storage targets are managed as active-passive failover resources, and multiple resources can run in the same HA configuration for optimal utilisation. If a server develops a fault, then any Lustre storage target managed by the failed server can be transferred to a surviving server that is connected to the same storage array. Failover is completely application-transparent: system calls are guaranteed to complete across failover events.

In order to ensure that failover is handled seamlessly, data modifications in Lustre are asynchronous and transactional. The client software maintains a transaction log. If there is a server failure, the client will automatically re-connect to the failover server and replay transactions that were not committed prior to the failure. Transaction log entries are removed once the client receives confirmation that the IO has been committed to disk.

All Lustre server types (MGS, MDS and OSS) support failover. A single Lustre file system installation will usually be comprised of several HA clusters, each providing a discrete set of metadata or object services that is a subset of the whole file system. These discrete HA clusters are the building blocks for a high-availability, Lustre parallel distributed file system that can scale to tens of petabytes in capacity and to more than one terabyte-per-second in aggregate throughput performance.

Building block patterns can vary, which is a reflection of the flexibility that Lustre affords integrators and administrators when designing their high performance storage infrastructure. The most common blueprint employs two servers joined to shared storage in an HA clustered pair topology. While HA clusters can vary in the number of servers, a two-node configuration provides the greatest overall flexibility as it represents the smallest storage building block that also provides high availability. Each building block has a well-defined capacity and measured throughput, so Lustre file systems can be designed in terms of the number of building blocks that are required to meet capacity and performance objectives.

An alternative to the two-node HA building block, described in Scalable high availability for Lustre with Pacemaker (video), was presented at LUG 2017 by Christopher Morrone. This is a very interesting design and merits consideration.

A single Lustre file system can scale linearly based on the number of building blocks. The minimum HA configuration for Lustre is a metadata and management building block that provides the MDS and MGS services, plus a single object storage building block for the OSS services. Using these basic units, one can create file systems with hundreds of OSSs as well as several MDSs, using HA building blocks to provide a reliable, high-performance platform.

Lustre Storage Scalability

Parameter | Value using LDISKFS backend | Value using ZFS backend | Notes
Maximum stripe count | 2000 | 2000 | Limit is 160 for ldiskfs if the "ea_inode" feature is not enabled on the MDT
Maximum stripe size | < 4GB | < 4GB |
Minimum stripe size | 64KB | 64KB |
Maximum object size | 16TB | 256TB |
Maximum file size | 31.25PB | 512PB* |
Maximum file system size | 512PB | 8EB* |
Maximum number of files or subdirectories per directory | 10M for 48-byte filenames; 5M for 128-byte filenames | 2^48 |
Maximum number of files in the file system | 4 billion per MDT | 256 trillion per MDT |
Maximum filename length | 255 bytes | 255 bytes |
Maximum pathname length | 4096 bytes | 4096 bytes | Limited by the Linux VFS


The values here are intended to reflect the potential scale of Lustre installations. The largest Lustre installations used in production today are measured in tens of petabytes, with projections for hundreds of petabytes within 2 years. Typical Lustre deployments range in capacity from 1 – 60PB, in configurations that can vary widely depending on workload requirements.

The values depicted for the ZFS maximum file size and file system size are theoretical, and in fact represent a conservative projection of the scalability of ZFS with Lustre. If one were to make a projection of ZFS scalability using the ZFS file system’s own theoretical limits, the numbers are so large as to be effectively meaningless in any useful assessment of capability.

Traditional Network File Systems vs Lustre

Distributed client, central server (e.g. NFS, SMB):
• Performance and scalability are limited by the bandwidth of a single server
• All clients use a single network path to one server, which becomes a bottleneck
• Adding a 2nd server can provide HA but does not increase performance
• Adding a new storage server creates a separate file system
• Storage becomes fragmented into silos

Distributed client, distributed servers (Lustre):
• Building block for scalability: add storage servers to increase capacity and performance
• Scalable metadata and object storage services
• Performance and capacity scale linearly as the client population grows
• Client network scalability is not limited by the file system
• All clients can access all storage servers simultaneously and in parallel
• Single coherent name space across all servers; a single file system can have hundreds of network storage servers

[Diagram: SMB / NFS clients accessing a single server, compared with Lustre clients (1 – 100,000+) accessing distributed servers.]


From its inception, Lustre has been designed to enable data storage to move beyond the bottlenecks imposed by limitations in hardware.

Lustre is a distributed network file system and shares some of the characteristics common to other network storage technology, namely that clients transact IO over a network and do not write data locally, the servers support concurrency, and the data is presented as a single coherent namespace.

Where Lustre differentiates itself from other network file systems, such as NFS or SMB, is in its ability to seamlessly scale both capacity and performance linearly to meet the demands of data-intensive applications with minimal additional administrative overhead. To increase capacity and throughput, add more servers with the required storage. Lustre will automatically incorporate the new servers and storage into the pool of available capacity, and the clients will leverage the new capacity automatically.

Contrast this to traditional NFS deployments where capacity is often carved up into vertical silos based on project or department, and presented to computers using complex automount maps that need to be maintained. Available capacity is often isolated and it is difficult to balance utilisation, while performance is constrained to the capability of a single machine.

Lustre HSM System Architecture

[Diagram: Management Target (MGT), Metadata Target (MDT) and Object Storage Targets (OSTs) on storage servers grouped into failover pairs; Management, Metadata and Object Storage Servers with Intel Manager for Lustre; a data network connecting Lustre Clients (1 – 100,000+), the Policy Engine and the HSM Agents, which move data to and from the Archive tier.]


Hierarchical Storage Management (HSM) is a collection of technologies and processes designed to provide a cost-effective storage platform that balances performance, capacity and long term retention (archival). Storage systems are organized into tiers, where the highest-performance tier is on the shortest path to the systems where applications are running; this is where the most active, or hottest, data is generated and consumed. As the high-performance tier fills, data that is no longer being actively used (cooler data) will be migrated to higher-capacity and generally lower-cost-per-terabyte storage platforms for long-term retention. Data migration should ideally be managed automatically and transparently to the end user.

Lustre provides a framework for incorporating an HSM tiered storage implementation. Lustre fulfills the high performance storage tier requirement. When a file is created, a replica can be made on the associated HSM archive tier, so that two copies of the file exist. As changes are made to the file, these are replicated onto the archive copy as well. The process of replicating data between Lustre and the archive is asynchronous, so there will be a delay in data generated on Lustre being reflected in the archive tier. As the available capacity is gradually consumed on the Lustre tier, the older, least frequently used files are "released" from Lustre, meaning that the local copy is deleted from Lustre and replaced with a reference that points to the archive copy. Applications are not aware of the locality of a file: from the application’s perspective, files are accessed using the same system calls. Crucially, applications do not need to be re-written in order to work with data stored on an HSM system. If a system call is made to open a file that has been released (i.e. a file whose contents are located on the archive tier only), the HSM software automatically dispatches a request to retrieve the file from the archive and restore it to the Lustre file system. This may be noticeable in the form of a delay, but is otherwise transparent to the application.

The diagram provides an overview of the hardware architecture for a typical Lustre + HSM file system. The metadata servers have an additional process called the HSM Coordinator that accepts, queues and dispatches HSM requests (this may also be referred to this as the MDT Coordinator since the process runs on the metadata server – MDS). HSM commands are submitted from Lustre clients, either through the command line or through a special-purpose third-party application known as a Policy Engine. The Policy Engine software makes use of Lustre's HSM API to interact with the HSM Coordinator. The HSM platform also requires an interface between the Lustre file system tier and the archive tier. Servers called HSM Agents (also known as Copytool servers) provide this interface.
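As a sketch of how HSM requests are driven from a client, the commands below use the lfs HSM interface; the file path is hypothetical, and in production these actions are normally issued by the Policy Engine rather than by hand.

    # Archive a copy of a file, then release the Lustre copy to free capacity
    lfs hsm_archive /mnt/lustre/projects/results.dat
    lfs hsm_release /mnt/lustre/projects/results.dat

    # Inspect HSM state and explicitly restore a released file from the archive
    lfs hsm_state   /mnt/lustre/projects/results.dat
    lfs hsm_restore /mnt/lustre/projects/results.dat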

Lustre provides reference implementations of the HSM Copytool. The copytool supplied with Lustre uses POSIX interfaces to copy data to and from the archive. While this implementation of copytool is completely functional, it has been created principally as an example of how to develop copytools for other archive types, which may use different interfaces. The copytool included with the Lustre software is intended as a reference implementation, and while it is stable, it is not a high-performance tool and for that reason may not be suitable for a production environment. There are also several 3rd party copytools available, supporting a range of archive storage systems.

The Policy Engine most commonly associated with Lustre is called Robinhood. Robinhood is an open-source application with support for HSM. Robinhood tracks changes to the file system, and records this information persistently in a relational database. Robinhood also analyses the database and takes action based on a set of rules (called policies) defined by the file system's maintainers. These rules govern how files and directories will be managed by the HSM system, and the conditions under which to take an action. Among other things, policies in Robinhood determine how often to copy files to the archive tier, and the criteria for purging files from the Lustre tier in order to release capacity for new data.

Lustre SMB Gateway System Architecture

[Diagram: Management Target (MGT), Metadata Target (MDT) and Object Storage Targets (OSTs) on storage servers grouped into failover pairs; Management, Metadata and Object Storage Servers with Intel Manager for Lustre; a CTDB cluster of Samba servers, with its own private CTDB network, bridging the Lustre data network to a public network serving SMB clients alongside Lustre Clients (1 – 100,000+).]


Lustre was originally designed for HPC applications running on large clusters of Linux-based servers and can present extremely high-capacity file systems with high performance. As it has matured, Lustre has found wider roles in IT system infrastructure, and has become increasingly relevant to commercial enterprises as they diversify their workloads and invest in massively-parallel software applications.

This broadening of scope for high-performance computing, coupled with increasing data management expectations, is placing new demands on the technologies that underpin high-performance storage systems, including Lustre file systems. As a result, storage administrators are being asked to provide access to data held on Lustre file systems from a wide range of client devices and operating systems that are not normally supported by Lustre.

To address this need, a bridge is required between the Linux-based Lustre platform and other operating systems. The Samba project was originally created to fulfill this function for standalone file servers. Samba is a project that implements the Server Message Block (SMB) protocol, originally developed by IBM and popularized by the Windows operating systems. SMB is common throughout computer networks. Samba provides computers with network-based access to files held on UNIX and Linux servers.

The Samba project developers also created a clustered software framework named Clustered Trivial Database (CTDB), a network-based cluster framework and clustered database platform combined. CTDB is derived from the TDB database, which was originally developed for use by the Samba project. CTDB allows several Samba instances running on different hosts to run safely and seamlessly in parallel as part of a scalable, distributed service on top of a parallel file system such as Lustre. Each instance of Samba serves the same data on the same file system, using the CTDB framework to manage coherency and locking.

Samba, running on a CTDB cluster that is backed by Lustre, provides computer systems that don’t have native support for Lustre with network access to the data held on a Lustre file system. By running Samba in a CTDB cluster backed by Lustre, one can create a parallel SMB service running across multiple hosts that can scale to meet demand.

Server Overview

Lustre File System Architecture – Server Overview

[Diagram: Lustre server overview – Management Target (MGT), Metadata Target (MDT0), DNE Metadata Targets (MDTi – MDTj) and Object Storage Targets (OSTs), served by storage servers grouped into failover pairs (Management & Metadata Servers, Additional Metadata Servers, Object Storage Servers), connected over a High Performance Data Network (Omni-Path, InfiniBand, Ethernet) to Lustre Clients (1 – 100,000+).]


Each Lustre file system comprises, at a minimum:

• 1 Management service (MGS), with corresponding Management Target (MGT) storage
• 1 or more Metadata services (MDS), with Metadata Target (MDT) storage
• 1 or more Object Storage services (OSS), with Object Storage Target (OST) storage

For High Availability, the minimum working configuration is:

• 2 Metadata servers, running MGS and MDS in failover configuration
• MGS service on one node, MDS service on the other node
• Shared storage for the MGT and MDT volumes
• 2 Object storage servers, running multiple OSTs in failover configuration
• Shared storage for the OST volumes
• All OSTs evenly balanced across the OSS servers
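The following formatting sketch shows how such an HA configuration might be expressed with mkfs.lustre; the file system name, NIDs and device paths are hypothetical, and --servicenode lists every server that is permitted to mount each target.

    # MGT, mountable by either metadata server
    mkfs.lustre --mgs \
      --servicenode=192.168.1.11@o2ib --servicenode=192.168.1.12@o2ib /dev/mapper/mgt

    # MDT0 for file system "lfs01"
    mkfs.lustre --mdt --fsname=lfs01 --index=0 \
      --mgsnode=192.168.1.11@o2ib --mgsnode=192.168.1.12@o2ib \
      --servicenode=192.168.1.11@o2ib --servicenode=192.168.1.12@o2ib /dev/mapper/mdt0

    # First OST, shared between the two object storage servers
    mkfs.lustre --ost --fsname=lfs01 --index=0 \
      --mgsnode=192.168.1.11@o2ib --mgsnode=192.168.1.12@o2ib \
      --servicenode=192.168.1.21@o2ib --servicenode=192.168.1.22@o2ib /dev/mapper/ost0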

Management Service and Target

The MGS is a global resource that can be associated with one or more Lustre file systems. It acts as a global registry for configuration information and service state. It does not participate in file system operations, other than to coordinate the distribution of configuration information.
• Provides configuration information for Lustre file systems
• All Lustre components register with the MGS on startup
• Clients retrieve information on mount
• Configuration information is stored on a storage target called the MGT
• There is usually only one MGT for a given network, and a Lustre file system will register with exactly one MGS

Metadata Service and Targets

The MDS serves the metadata for a file system. Metadata is stored on a storage target called the MDT. The MDS is responsible for the file system name space (the files and directories), and file layout allocation. It is the MDS that determines where files will be stored in the pool of object storage. Each Lustre file system must have at least one MDT served by an MDS, but there can be many MDTs on a server. The MDS is a scalable resource: using a feature called Distributed Namespace, a single file system can have metadata stored across many MDTs. The MGS and MDS are usually paired into a high availability server configuration.
• Records and presents the file system name space
• Responsible for defining the file layout (location of data objects)
• Scalable resource (DNE)

Object Storage Service (OSS) and Targets

• File content, stored as objects
• Files may be striped across multiple targets
• Massively scalable

Object Storage Devices (OSDs)

Lustre targets run on a local file system on Lustre servers, the generic term for which is an “object storage device”, or OSD. Lustre supports two kinds of OSD file systems for back-end storage:
• LDISKFS, a modified version of EXT4. This is the original persistent storage device layer for Lustre. Several features originally developed for Lustre LDISKFS have been incorporated into the EXT file system.
• ZFS, based on the OpenZFS implementation from the ZFS on Linux project. ZFS combines a volume manager and a file system into a single, coherent storage platform. ZFS is an extremely scalable file system and is well-suited to high-density storage systems. ZFS provides integrated volume management capabilities, RAID protection, data integrity protection, and snapshots.

Lustre* targets can be different types across servers in a single file system. Hybrid LDISKFS/ZFS file systems are possible, for example combining LDISKFS MDTs and ZFS OSTs, but generally it is recommended that for simplicity of administration, a file system uses a single OSD type. Lustre Clients are unaffected by the choice of OSD file system.

For the most part, a Lustre storage target is just a regular file system. The on-disk format comprises a few special directories for configuration files, etc. The Lustre ZFS OSD actually bypasses the ZPL (ZFS POSIX Layer) to improve performance, but the on-disk format still maintains compatibility with the ZPL.

A storage target can be created on almost any Linux block device. Using a whole block device or ZFS pool is strongly preferred, but OSDs can be created on disk partitions. LDISKFS OSDs will run on LVM. While not commonly used for the OSTs, LVM is useful for taking snapshots and making backups of the MDT: the MDT is the file system index, without which none of the data in Lustre is accessible, so it makes sense to regularly make a copy in case of a catastrophic hardware failure. While in principle a device backup of the OSTs might be useful, the sheer scale of the task makes it impractical. Moreover, backups from the user-space Lustre clients are typically more effective and can target operationally critical data.
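A minimal sketch of an LVM snapshot used as an MDT backup window follows; the volume group, logical volume names and sizes are hypothetical, and the MDT should be unmounted or quiesced according to whichever backup procedure is in use.

    # Create a snapshot of the MDT logical volume
    lvcreate --size 20G --snapshot --name mdt0_snap /dev/vg_mdt/mdt0

    # ... back up the snapshot device (e.g. with dd or a file-level tool) ...

    # Remove the snapshot once the backup is complete
    lvremove -f /dev/vg_mdt/mdt0_snap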

An alternative way to look at the problem is as follows: catastrophic loss of a single OST means that some of the data in the file system may be lost; whereas catastrophic loss of the root MDT means that all of the data is lost.

Metadata Server Reference Design

The diagram above represents a typical design for a high-availability metadata server cluster. It comprises two server machines, each with three network interfaces, and a storage array that is connected to each machine.

The high performance data network is used for Lustre and other application traffic. Common transports for the data network are Intel Omni-path, InfiniBand or 10/40/100Gb Ethernet. The management network is used for systems management, monitoring and alerts traffic and to provide access to baseband management controllers (BMCs) for servers and storage. The dedicated secondary cluster network is a point-to-point, or cross-over cable, used exclusively for cluster communications, usually used in conjunction with the management network to ensure redundancy in the cluster communications framework.

Storage can be either a JBOD or an array with an integrated RAID controller.

The MDS has a high performance shared storage target, called the MDT (Metadata Target). Metadata performance is dependent upon fast storage LUNs with high IOPS characteristics. Metadata IO comprises very high rates of small, random transactions, and storage should be designed accordingly. Flash-based storage is an ideal medium to complement the metadata workload. Striped mirrors or RAID 1+0 arrangements are typical, and will yield the best overall performance while ensuring protection against hardware failure. Storage is sized for metadata IO performance and a projection of the expected maximum number of inodes.

The MGS has a shared storage target, called the MGT (Management Target). MGT storage requirements are very small, typically 100MB. When a JBOD configuration is used, e.g. for ZFS-based storage, the MGT should be comprised of a single mirror of two drives. Avoid the temptation to create a single ZFS pool that contains both an MGT and MDT dataset: this severely compromises the flexibility of managing resources in an HA cluster and creates an unnecessary colocation dependency. Moreover, when a failure event occurs, both MGT and MDT resources will need to fail over at the same time, which will disable the imperative recovery feature in Lustre and increase the time it takes services to resume operation. Some intelligent storage arrays provide more sophisticated ways to carve out storage, allowing more fine-grained control and enabling a larger proportion of the available storage to be allocated to the MDT. The core requirement is to create a storage target that comprises a mirror to provide redundancy.
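A sketch of this layout for a ZFS-based metadata building block is shown below; the pool names, device paths, file system name and MGS NID are hypothetical. The point is that the MGT and MDT live in separate pools so that the HA framework can fail them over independently.

    # Small mirrored pool dedicated to the MGT
    zpool create mgtpool mirror /dev/mapper/d0 /dev/mapper/d1
    mkfs.lustre --mgs --backfstype=zfs mgtpool/mgt

    # Separate striped-mirror pool for the MDT
    zpool create mdtpool mirror /dev/mapper/d2 /dev/mapper/d3 \
                         mirror /dev/mapper/d4 /dev/mapper/d5
    mkfs.lustre --mdt --fsname=lfs01 --index=0 --backfstype=zfs \
      --mgsnode=192.168.1.11@o2ib mdtpool/mdt0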

The use of spares is entirely at the discretion of the system designer. Note that when designing for ZFS, the use of the hot-spare feature is not recommended for shared storage failover configurations. Instead, if spares are required, use them as warm-standby devices (i.e. present in the chassis but not incorporated into the ZFS configuration directly). Warm standbys will require operator intervention to activate as a replacement when a drive fails, but will avoid issues with the hot spare capability in ZFS when used in HA. For example, any pool that is sharing a global spare cannot be exported while the spare is being used; and there is no arbitration or coordination of spare allocation across nodes – if two nodes make a simultaneous attempt to allocate the spare, there is a race to acquire the device, which makes the process unpredictable.

One pair of MDS servers can host multiple MDTs but only one MGT.

MGS

• Manages filesystem configuration and tunable changes, for clients, servers and targets.
• Registration point: new server and client components register with the MGS during startup.
• Servers and clients obtain Lustre configuration from the MGS during mount, and configuration updates from the MGS after mount. Think of it as a global registry.
• One per site / per file system.
• Each Lustre file system needs to register with an MGS.
• An MGS can serve one or more Lustre file systems.
• The MGT stores global configuration information, provided upon request.
• The MGT should not be co-located on the same volume as the metadata target (MDT), as the imperative recovery capability (a critical, high speed recovery mechanism) of Lustre will be disabled if the MDT and MGT are co-located.
• Only one MGS service can run on a host: it is not possible to mount multiple MGT volumes on a single host.
• Storage requirements are minimal (approximately 100-200 MB). A single mirrored disk configuration is sufficient.
• Can run stand-alone, but it is usually grouped with the root MDS (MDT0) of a file system in an HA failover cluster.

MDS

• One or more MDS per file system
• Significant multi-threaded CPU use
• Simultaneous access from many clients
• Maintains the metadata, which includes the information seen via stat() – owner, group, filename, links, ctime, mtime, etc. – and extended attributes, such as the File Identifier (FID)†, the mapping of FID to OSTs and object IDs, pool membership, ACLs, etc.
• Metadata is stored on one or more MDTs
• An MDT stores metadata for exactly one Lustre file system
• Many MDTs can serve a single file system, by using the Distributed Namespace (DNE) feature
• Maximum of 4096 MDTs per Lustre file system
• The file system is unavailable if the MDT is unavailable
• MDT size is based on the number of files expected in the file system
• Maximum of 4 billion inodes per MDT for ldiskfs MDT storage
• For ZFS MDTs, inode capacity is limited only by the capacity of the storage volume

† FIDs are described in a later section.

OSS Reference Design

[Diagram: OSS reference design based on a high-density storage chassis with 60 disks per tray and no spares. Dual expanders (A and B) and controllers/HBAs present OST0 – OST11 to two object storage servers (OSS1 and OSS2); each OSS has Lustre data network, management network and cluster communication interfaces, with a RAID 1 pair for the operating system.]


The diagram above represents a typical design for a high-availability object storage server cluster. It comprises two server machines, each with three network interfaces, and a storage array that is connected to each machine.

The high performance data network is used for Lustre and other application traffic. Common transports for the data network are Intel Omni-path, InfiniBand or 10/40/100Gb Ethernet. The management network is used for systems management, monitoring and alerts traffic and to provide access to baseband management controllers (BMCs) for servers and storage. The dedicated secondary cluster network is a point-to-point, or cross-over cable, used exclusively for cluster communications, usually used in conjunction with the management network to ensure redundancy in the cluster communications framework.

Storage can be either a JBOD or an array with an integrated RAID controller.

The Object storage servers provide the scalable bulk data storage for Lustre. IO transactions are typically large, with correspondingly large RPC payloads, and require high throughput bandwidth. Object storage servers move data between block storage devices and the network. The OSS is used to transfer data held on devices called Object Storage Targets (OSTs) to the Lustre clients, and manages read / write calls. Typically there are many object storage servers for a single Lustre file system, and they represent Lustre’s core scalable unit: add more OSS servers to increase capacity and throughput performance. A minimum of 2 servers is required for high availability. The aggregate throughput of the file system is the sum of the throughput of each individual OSS, and the total storage capacity of the Lustre file system is the sum of the capacities of all OSTs.

Each OSS may have several shared Object Storage Targets (OSTs), normally configured in a RAID 6 or equivalent parity-based redundant storage configuration. This provides the best balance of capacity, performance and resilience. The width, or number of disks, in each RAID group should be determined based on the specific capacity or performance requirements for a given installation. The historical precedent for RAID is to always create RAID groups that have N+P disks per group, where N is a power of 2 and P is the amount of parity. For example, for RAID 6, one might have ten disks: N=8 and P=2 (8+2). In many modern arrays, this rule may in fact be redundant, and in the case of ZFS, the rule has been demonstrated to be largely irrelevant. This is especially true when compression is enabled on ZFS, since the core premise of using a power of 2 for the layout was to ensure aligned writes for small block sizes that are also a power of 2, and also to support the idea of a full stripe write on RAID 6 of 1MB without any read-modify-write overheads. These considerations do not generally apply to ZFS. Furthermore, when designing solutions for ZFS, most workloads benefit from enabling compression. When compression is enabled, the block sizes will not be a power of two, regardless of the layout, and workloads will benefit more from enabling compression in ZFS than from RAIDZ sizing. For more detailed information: https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz

For hardware RAID solutions, RAID 6 (8+2) is still one of the most common configurations, but again, it is advised not to follow this rule blindly. Experiment with different layouts and try to maximise the overall capacity utilisation of the storage system.

An OSS pair may have more than one storage array. Two or four arrays or enclosures per OSS pair is common. Some high-density configurations may have more. To ensure that performance is consistent across server nodes, OST LUNs should be evenly distributed across the OSS servers.

File data is stored on OSTs in byte-array data objects. A single file system will comprise many OST volumes, to a maximum of 8150. The on-disk OST structure contains a few special directories for configuration files, etc., and the file system data, which is accessed via object IDs. Each object stores either a complete file (when stripe_count == 1) or part of a file (when stripe_count > 1). The capacity limit of an individual LDISKFS OST is 128TB. There is no theoretical limitation with ZFS, but volumes of up to 256TB have been tested.

Client Overview

Lustre File System – Clients

Lustre client combines the metadata and object storage into a single, coherent POSIX file system
§ Presented to the client OS as a file system mount point
§ Applications access data as for any POSIX file system
§ Applications do not therefore need to be re-written to run with Lustre

All Lustre client I/O is sent via RPC over a network connection
§ Clients do not make use of any node-local storage, and can be diskless


The Lustre client is kernel-based software that presents an aggregate view of the Lustre services to the host as a POSIX-compliant file system. To applications, a Lustre client mount looks just like any other file system, with files organised in a directory hierarchy. A Lustre client mount will be familiar to anyone with experience of UNIX or Linux-based operating systems, and supports the features expected of a modern POSIX file system.

Lustre Client Services

Lustre employs a client-server model for communication. Each connection has a sender (the client end of the connection) and a receiver (the server process). The main client processes are as follows:

• Management Client (MGC): The MGC process handles RPCs with the MGS. All servers (even the MGS) run one MGC, and every Lustre client runs one MGC for every MGS on the network.
• Metadata Client (MDC): The MDC handles RPCs with the MDS. Only Lustre clients initiate RPCs with the MDS. Each client runs an MDC process for each MDT.
• Object Storage Client (OSC): The OSC manages RPCs with a single OST. Both the MDS and Lustre clients initiate RPCs to OSTs, so each of these machines runs one OSC per OST.

Lustre Network Routers

There is another Lustre service that is often seen in data centres, called a Lustre Network Router, or more commonly, an “LNet router”. The LNet router is used to direct Lustre IO between different networks, and can be used to bridge different network technologies as well as routing between independent subnets. These routers are dedicated servers that do not participate as clients of a Lustre file system, but provide a way to efficiently connect different networks. For example, a router might be used to bridge a TCP/IP Ethernet fabric and an RDMA OmniPath (OPA) fabric, or provide a route between OPA and InfiniBand, or between two independent RDMA InfiniBand fabrics.
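As a sketch of how routing is expressed, the module options below assume a router bridging a TCP fabric and an o2ib (InfiniBand/OPA-style) fabric; the interface names, network labels and gateway NID are illustrative assumptions only.

    # On the LNet router: attach to both fabrics and enable forwarding
    # /etc/modprobe.d/lustre.conf
    options lnet networks="o2ib0(ib0),tcp0(eth0)" forwarding="enabled"

    # On clients/servers on tcp0 that reach o2ib0 via the router at 10.10.0.1@tcp0
    options lnet networks="tcp0(eth0)" routes="o2ib0 10.10.0.1@tcp0"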

Routers are most commonly used to enable centralisation of Lustre server resources such that the file systems can be made available to multiple administrative domains within a data centre (e.g. to connect Lustre storage to multiple HPC clusters) or even between campuses over a wide-area network. Multiple routers can be deployed at the edge of a fabric to provide load-balancing and fault tolerance for a given route or set of routes.

Protocols Overview – Client IO, File IDs, Layouts

Overview of Lustre I/O Operations

[Diagram: Lustre client I/O path. On the client, the LMV and MDCs handle directory operations, open/close, metadata and recovery with the MDS, while the LOV and OSCs handle file I/O with the OSSs, where objects (A, B, ...) hold the file’s data stripes in round-robin order. (1) The client sends a file open request to the MDS; (2) the MDS returns the FID and the layout EA listing the objects (Obj. A, Obj. B, Obj. C); (3) the client reads or writes the objects on the OSSs in parallel.]


The Lustre client software provides an interface between the Linux operating system and the Lustre servers. The client software is composed of several different services, each one corresponding to the type of Lustre service it interfaces to. A Lustre client instance will include a management client (MGC, not pictured), one or more metadata clients (MDC), and multiple object storage clients (OSCs), one corresponding to each OST in the file system.

The MGC manages configuration information. Each MDC transacts file system metadata requests to the corresponding MDT, including file and directory operations, and management of metadata (e.g. assignment of permissions), and each OSC transacts read and write operations to files hosted on its corresponding OST.

The logical metadata volume (LMV) aggregates the MDCs and presents a single logical metadata namespace to clients, providing transparent access across all the MDTs. This allows the client to see the directory tree stored on multiple MDTs as a single coherent namespace, and striped directories are merged on the clients to form a single visible directory to users and applications.

The logical object volume (LOV) aggregates the OSCs to provide transparent access across all the OSTs. Thus, a client with the Lustre file system mounted sees a single, coherent, synchronized namespace, and files are presented within that namespace as a single addressable data object, even when striped across multiple OSTs. Several clients can write to different parts of the same file simultaneously, while, at the same time, other clients can read from the file. When a client looks up a file name, an RPC is sent to the MDS to get a lock, which will be one of the following:
• Read lock with look-up intent
• Write lock with create intent

The MDS returns a lock plus all the metadata attributes and file layout extended attribute (EA) to the client. The file layout information returned by the MDS contains the list of OST objects containing the file’s data, and the layout access pattern describing how the data has been distributed across the set of objects. The layout information allows the client to access data directly from the OSTs. Each file in the file system has a unique layout: objects are not shared between files.

If the file is new, the MDS will also allocate OST objects for the file based on the requested layout when the file is opened for the first time. The MDS allocates OST objects by issuing RPCs to the object storage servers, which then create the objects and return object identifiers. By structuring the metadata operations in this way, Lustre avoids the need for further MDS communication once the file has been opened. After the file is opened, all transactions are directed to the OSSs, until the file is subsequently closed.

All files and objects in a Lustre file system are referenced by a unique, 128-bit, device- independent File Identifier, or FID. The FID is the reference used by a Lustre client to identify the objects for a file. Note that there is an FID for each data object on the OSTs as well as each metadata inode object on the MDTs. FIDs provide a replacement to UNIX inode numbers, which were used in Lustre releases prior to version 2.0.
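For illustration, the lfs utility can translate between pathnames and FIDs on a client; the mount point, file name and FID shown here are hypothetical.

    # Print the FID for a file
    lfs path2fid /mnt/lustre/projects/results.dat
    # (prints a FID of the form [0x200000401:0x1:0x0])

    # Resolve a FID back to one or more pathnames
    lfs fid2path /mnt/lustre '[0x200000401:0x1:0x0]'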

When the mount command is issued on a Lustre client, the client will first connect to the MGS in order to retrieve the configuration information for the file system. This will include the location of the root of the file system, stored on MDT0. The client will then connect to the MDS that is running MDT0 and will mount the file system root directory.

Lustre inodes

Lustre inodes are MDT inodes
§ Default inode size is 2K (actually an MDT inode is 512 bytes, plus up to 2K for EAs)
  – Metadata inode on-disk size is larger for ZFS
§ The maximum number of inodes per MDT in LDISKFS is 4 billion, but there is no practical limit for ZFS

Lustre inodes hold all the metadata for Lustre files. Lustre inodes contain:
§ Typical metadata from stat() (UID, GID, permissions, etc.)
§ Extended Attributes (EA)

Extended Attributes contain:
§ References to Lustre files (OSTs, Object ID, etc.)
§ OST Pool membership, POSIX ACLs, etc.


Metadata storage allocation for inodes differs depending on the choice of back-end filesystem (LDISKFS or ZFS).

For LDISKFS metadata targets, the number of inodes for the MDT is calculated when the device is formatted. By default, Lustre will format the MDT using the ratio of 2KB-per-inode. The inode itself consumes 512 bytes, but additional blocks can be allocated to store extended attributes when the initial space is consumed. This happens for example, when a file has been striped across a lot of objects. Using this 2KB-per-inode ratio has proven to be a reliable way to allocate capacity for the MDT.

The LDISKFS filesystem imposes an upper limit of 4 billion inodes per MDT. By default, the MDT filesystem is formatted with one inode per 2KB of space, meaning 512 million inodes per TB of MDT space. In a Lustre LDISKFS file system, all the MDT inodes and OST objects are allocated when the file system is first formatted. When the file system is in use and a file is created, metadata associated with that file is stored in one of the pre-allocated inodes and does not consume any of the free space used to store file data. The total number of inodes on a formatted LDISKFS MDT or OST cannot be easily changed. Thus, the number of inodes created at format time should be generous enough to anticipate near term expected usage, with some room for growth without the effort of additional storage.

The ZFS filesystem dynamically allocates space for inodes as needed, which means that ZFS does not have a fixed ratio of inodes per unit of MDT space. A minimum of 4kB of usable space is needed for each inode, exclusive of other overhead such as directories, internal log files, extended attributes, ACLs, etc. With older ZFS on Linux releases (prior to 0.7.0), Lustre xattrs would exceed the allocated dnode space (512 bytes), and if 4KB sectors were used (ashift=12) then each MDT dnode would need (512+4096)*2 bytes of space (multiplied by 2 for the ditto copy, over and above the mirrored VDEV). With dynamic dnodes in the newer releases (or with 512-byte sectors, which is increasingly rare) the space required is only (512+512)*2 bytes per dnode.

ZFS also reserves approximately 3% of the total storage space for internal and redundant metadata, which is not usable by Lustre. Since the size of extended attributes and ACLs is highly dependent on kernel versions and site-specific policies, it is best to over-estimate the amount of space needed for the desired number of inodes, and any excess space will be utilized to store more inodes. Note that the number of total and free inodes reported by lfs df -i for ZFS MDTs and OSTs is estimated based on the current average space used per inode. When a ZFS filesystem is first formatted, this free inode estimate will be very conservative (low) due to the high ratio of directories to regular files created for internal Lustre metadata storage, but this estimate will improve as more files are created by regular users and the average file size will better reflect actual site usage.
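As a quick illustration of the reporting mentioned above, the commands below query per-target capacity and inode usage from a client; the mount point is hypothetical.

    lfs df -h /mnt/lustre     # space used/available per MDT and OST
    lfs df -i /mnt/lustre     # total and free inode estimates per target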

A Lustre file system uses extended attributes (EAs) contained in the metadata inode to store additional information about a file. Examples of extended attributes are ACLs, layout (e.g. striping) information, and the unique Lustre file identifier (FID) of the file. The layout EA contains a list of all object IDs and their locations (that is, the OSTs that contain the objects).

For files with very wide stripes, the layout EA may be too large to store in the inode and will be stored in separate blocks. Be careful when defining the layout for files, and try to make sure that the layout is only as large / wide as needed: storing the EA in the inode whenever possible avoids an extra, potentially expensive, disk seek.

Lustre internally uses a 128-bit file identifier (FID) for all files. To interface with user applications, 64-bit inode numbers are returned by the stat(), fstat(), and readdir() system calls to 64-bit applications, and 32-bit inode numbers to 32-bit applications.

Layout EA

The layout EA is an extended attribute stored as part of a file’s metadata on the MDT:
§ A list of FIDs, used to locate the file data objects on the OST(s)
§ The layout EA points to 1-to-N OST object(s) on the OST(s) that contain the file data
§ If the layout EA points to one object, all the file data is stored entirely in that object
§ If the layout EA points to more than one object, the file data is striped across the objects using RAID 0, and each object is stored on a different OST

[Diagram: the MDT inode identified by the file’s FID holds the layout EA, which references Objects A, B and C on the OSTs, each object holding one of the file’s data stripes (0, 1, 2).]


When a client wants to read from or write to a file, it first fetches the list of object FIDs containing the file’s data from the MDT inode for the file. The client then uses this information to connect directly to the object storage servers where the data objects are stored and transact I/O on the file.

Information about where file data is located on the OST(s) is stored as an extended attribute called the layout EA. The layout EA is stored in an MDT inode identified by the FID for the file. If the file is a regular file (not a directory or symbolic link), the MDT inode points to 1-to-N OST object(s) on the OST(s) that contain the file data. If the MDT layout EA points to one object, all the file data is stored in that object. If the layout EA points to more than one object, the file data is striped across the objects using RAID 0, and each object is stored on a different OST.

File Layout: Striping

Each file in Lustre has its own unique file layout, comprised of 1 or more objects in a stripe pattern equivalent to RAID 0
§ File layout is allocated by the MDS
§ Object layout is selected by the client, either by policy (inherited from the parent directory), or by the user or application
§ The layout of a file is fixed once created

[Diagram: Files A, B and C striped across OST00, OST01 and OST02, with successive stripes allocated round-robin across the objects.]


One of the main factors leading to the high performance of Lustre file systems is the ability to stripe data across multiple OSTs in a round-robin fashion. Users can optionally configure the number of stripes, stripe size, and OSTs that are used for each file. Striping can be used to improve performance, enabling the aggregate bandwidth to a single file to exceed the bandwidth of a single OST. The ability to stripe is also useful when a single OST does not have enough free space to hold an entire file.

Striping allows segments or 'chunks' of data in a file to be stored on different OSTs. In the Lustre file system, a RAID 0 pattern is used in which data is "striped" across a certain number of objects. The number of objects in a single file is called the stripe_count. Each object contains chunks of data from the file, and chunks are written to the file in a circular round-robin manner. When the chunk of data being written to a particular object exceeds the stripe size, the next chunk of data in the file is stored on the next object.

When reading the data back from a file, the client needs to know the location of the OST objects, the starting object, and the stripe width (or chunk size) in order to correctly retrieve the file’s data with the same pattern that was used to write the file.
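A short sketch of working with layouts from a client follows; the directory, file name, stripe count and stripe size are illustrative values only.

    # Set a default layout on a directory: 4 objects, 4 MiB stripe size
    lfs setstripe -c 4 -S 4M /mnt/lustre/projects

    # Show the stripe count, stripe size and OST objects used by a file
    lfs getstripe /mnt/lustre/projects/results.dat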

The maximum number of objects that can be used to store a single file is 2000, i.e. this is the maximum number of OSTs that a single file can be striped across. File systems can be comprised of more than 2000 OSTs, but a single file within that file system is restricted to no more than 2000 OSTs, which puts an effective limit on the aggregate bandwidth for a single file. It is, of course, a rather large limit. The size of a single object is governed by the underlying storage file system, and the physical constraints of the hardware.

Lustre File Identifier (FID)

Lustre file identifiers (FIDs) provide a device-independent replacement for UNIX inode numbers to uniquely identify files or objects

A File Identifier (FID) is a unique 128-bit identifier for Lustre files and objects, comprising:
§ 64-bit sequence number – used to locate the storage target
  – Unique across all OSTs and MDTs in a file system
§ 32-bit object identifier (OID) – reference to the object within the sequence
§ 32-bit version number – currently unused; reserved for future work

The FID-in-dirent feature stores the FID as part of the name of the file in the parent directory
§ Significantly improves performance for “ls” command executions by reducing disk I/O
§ The FID-in-dirent is generated at the time the file is created
§ Introduced in Lustre 2.0, FID-in-dirent is not compatible with the Lustre version 1.8 format


Introduced in Lustre software release 2.0, Lustre file identifiers (FIDs) replace UNIX inode numbers for identifying files or objects. FIDs are independent of the underlying file system OSD, and enabled support for multiple MDTs (introduced in Lustre software release 2.4) and ZFS (introduced in Lustre software release 2.4). Also introduced in release 2.0 is an LDISKFS feature named FID-in-dirent (also known as dirdata) in which the FID is stored as part of the name of the file in the parent directory. This feature significantly improves performance when executing commands like ls, by reducing disk I/O. The FID-in-dirent is generated at the time the file is created.

An FID is a 128-bit identifier that contains a unique 64-bit sequence number, a 32-bit object ID (OID), and a 32-bit version number. The sequence number is unique across all Lustre targets in a file system (OSTs and MDTs). FIDs are not bound to a specific target, they are never re-used and FIDs can be generated by Lustre clients.

Sequences are granted to clients by servers. A sequence number is unique across all Lustre targets (OSTs and MDTs) in a file system. When a client connects to a Lustre file system, a new FID sequence is allocated. The sequence is discarded when the client disconnects, and it is not re-used. When the client reconnects with the file system, a new sequence will be allocated. Each sequence has a limited number of FIDs (128,000) which may be created within its range. When the sequence is exhausted, a new sequence is started.

Sequence controller (MDT0) allocates super-sequence ranges to sequence managers. A super-sequence is a large contiguous range of sequence numbers. Sequence managers control distribution of sequences to clients, preventing FID collisions. The MDS and OSS servers for a file system are all sequence managers. Ranges of sequence IDs are granted by managers to Lustre clients as reservations, which allows the client to create the FID for new files using a reserved sequence ID. When the existing allocation is exhausted, a new set of sequence numbers is provided. A given sequence ID always maps to the same storage target, and objects created within same sequence will be located on the same storage target.

An FID does not contain any location information. To determine the location of an object from its FID, Lustre has the FID location database (FLDB). The FLDB is a database mapping a sequence of FIDs to the specific target (MDT or OST) that manages the objects within the sequence. The complete FLDB for a file system is held on MDT0. When DNE is enabled, every MDT also has its own local FLD, a subset of the full FLDB. The FLDB is cached by all clients and servers in the file system, but is typically only modified when new servers are added to the file system.

The underlying filesystem still operates on inodes. An object index is stored on disk to handle FID to on-disk inode mapping.

Locking

Distributed lock manager in the manner of OpenVMS, cache-coherent across all clients

The metadata server uses inode bit locks for file lookup, state (modification, open r/w/x), EAs and layout
§ Clients can fetch multiple bit locks for an inode in a single RPC
§ The MDS manages all inode modifications to avoid lock resource contention

Object storage servers provide extent-based locks for OST objects
§ File data locks are managed for each OST
§ Clients can be granted read extent locks for part or all of the file, allowing multiple concurrent readers of the same file
§ Clients can be granted non-overlapping write extent locks for regions of the file
§ Multiple Lustre clients may access a single file concurrently for both read and write, avoiding bottlenecks during file I/O


Lustre implements byte-granular file locking and fine-grained metadata locking. Multiple clients can read and modify the same file or directory concurrently. The Lustre Distributed Lock Manager (LDLM) ensures that files are coherent between all clients and servers in the file system. The MDT LDLM manages locks on inode permissions and pathnames. Each OST has its own LDLM for locks on the file stripes stored thereon, which scales locking performance as the file system grows.

The LDLM also plays a part in resolving client failures. Recovery from client failure in a Lustre file system is based on lock revocation and other resources, so surviving clients can continue their work uninterrupted. If a client fails to respond to a blocking lock callback from the Distributed Lock Manager (DLM), or fails to communicate with the server for a long period of time (i.e., no pings), the client is forcibly removed from the cluster (evicted). This enables other clients to acquire locks blocked by the dead client's locks, and also frees resources (file handles, export data) associated with that client.

Glossary – Just so there's no confusion…

DNE - Distributed Namespace Environment - feature to aggregate multiple MDTs (possibly on many MDS's) into a single filesystem namespace
FID - File IDentifier - unique 128-bit identifier for every object within a single filesystem
IDIF - OST object ID In FID - specific FID range reserved for compatibility with pre-DNE OST objects
IGIF - Inode and Generation In FID - specific FID range reserved for compatibility with Lustre 1.x MDT inode objects
LMV - Logical Metadata Volume - client software layer that handles client (llite) access to multiple MDTs
LOD - Logical Object Device - MDS software layer that handles access to multiple MDTs and multiple OSTs
LOV - Logical Object Volume - client software layer that handles client (llite) access to multiple OSTs
MDC - MetaData Client - client software layer that interfaces to the MDS
MDD - Metadata Device Driver - MDS software layer that understands POSIX semantics for file access
MDS - MetaData Server - software service that manages access to filesystem namespace (inodes, paths, permissions) requests from the client
MDT - MetaData Target - storage device that holds the filesystem metadata (attributes, inodes, directories, xattrs, etc)
MGS - Management Server - service that helps clients and servers with configuration
MGT - Management Target - storage device that holds the configuration logs
OFD - Object Filter Device - OSS software layer that handles file IO
OSC - Object Storage Client - client software layer that interfaces to the OST
OSD - Object Storage Device - server software layer that abstracts MDD and OFD access to underlying disk filesystems like ldiskfs and ZFS
OSP - Object Storage Proxy - server software layer that interfaces from one MDS to the OSD on another MDS or another OSS
OSS - Object Storage Server - software service that manages access to filesystem data (read, write, truncate, etc)
OST - Object Storage Target - storage device that holds the filesystem data (regular data files, not directories, xattrs, or other metadata)
