OrangeFS Overview Tutorial
Walt Ligon, Clemson University
Tutorial Goals
• Brief History
• Architecture Overview
• User Interfaces
• Important Features
• Installation/Configuration
Brief History
• Brief History ← Starting here!
• Architecture Overview
• User Interfaces
• Important Features
• Installation/Configuration
A Brief History
• First there was PVFS
– V0: Aric Blumer (Clemson), 1993, PVM User's Meeting at Oak Ridge
– V1: Rob Ross (Clemson), 1994–2000
– V2: Rob Ross (Argonne), Phil Carns (Clemson), et al., 2000–present
PVFS Funding
• NASA
• NSF
– PACI
– HECURA
• DOE
– Office of Science
– SciDAC
• Government agencies
PVFS Partnerships
• Clemson U
• Ohio Supercomputer Center
• Argonne NL
• Northwestern U
• Carnegie Mellon U
• Acxiom
• U of Heidelberg
• Ames NL
• Sandia NL
• U of Michigan
• U of Oregon
• Ohio St. U
Emergence of Orange
• Project started in 2007
– Develop PVFS for "non-traditional" uses
• Very large numbers of small files
• Smaller accesses
• Much more metadata
– Robust security features
– Improved resilience
• Began to see opportunities in broader areas
– Big Data management
– Clouds
– Enterprise
Today and Beyond
• Clemson reclaims primary site
– As of 2.8.4, began using the name "OrangeFS"
– Omnibond offers commercial support
– Currently on 2.8.6
– Version 2.9.0 soon to be released with new features
– 2.10 is the internal development version for …
• OrangeFS 3.0
What to Expect in 3.0
• Totally redesigned object model
– Replication
– Migration
– Dynamic configuration
– Hierarchical Storage Management (tiers)
– Metadata support for external devices (archive)
– Rule-based security model
– Rule-based policies
Architecture Overview
• Brief History
• Architecture Overview ← You are here!
• User Interfaces
• Important Features
• Installation/Configuration
Architecture Overview
• OrangeFS is designed for parallel machines
– Client/server architecture
– Expects multiple clients and servers
• Two major software components
– Server daemon
– Client library
• Two auxiliary components
– Client core daemon
– Kernel module
User Level File System
• Everything runs at the user level on clients and servers
– The exception is the kernel module, which is really just a shim (more details later)
• The server daemon runs on each server node and serves requests sent by the client library, which is linked to client code on compute nodes
Networking
• Clients and servers communicate using an interface called "BMI"
– BMI supports TCP/IP, InfiniBand, Myrinet MX, and Portals
– The modular design allows new protocols to be added
– Requests use the PVFS protocol, similar to that of NFS but expanded to support a number of high-performance features
Server-to-Server Communication
[Diagram: applications on each node call the I/O middleware and client library, which communicate over the network with the OrangeFS servers]
• Traditional metadata operation: a create request causes the client to communicate with all servers, O(p)
• Scalable metadata operation: a create request communicates with a single server, which in turn communicates with the other servers using a tree-based protocol, O(log p)
Storage
• The OrangeFS server interacts with storage on the host using an interface layer named "trove"
– All storage objects are referenced by a "handle"
– Storage objects consist of two components:
• Bytestreams – sequences of data
• Key/value pairs – data items that are accessed by key
– As currently implemented …
• Bytestreams are stored in the local file system (data)
• Key/value pairs are stored in BerkeleyDB (metadata)
File Model
• Every file consists of two or more objects
– The metadata object contains:
• Traditional file metadata (owner, permissions, etc.)
• The list of data object handles
• FS-specific items (layout, distribution, parameters)
• User-defined attributes
– Data objects contain:
• The contents of the file
• Attributes specific to that object (usually not visible)
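To make the model concrete, here is a minimal C sketch of a metadata object referencing its data objects. The type and field names (ofs_handle_t, ofs_meta_obj, and so on) are invented for this tutorial, not the actual OrangeFS source types:

    #include <stdint.h>
    #include <sys/types.h>

    typedef uint64_t ofs_handle_t;      /* every trove object is named by a handle */

    /* Illustrative metadata object: traditional attributes plus the
     * handles of the data objects holding the file contents. */
    struct ofs_meta_obj {
        uid_t        owner;
        gid_t        group;
        mode_t       permissions;
        uint32_t     dfile_count;       /* how many data objects (servers) */
        ofs_handle_t dfile_handles[];   /* one handle per data object */
    };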
Directories
• Work just like files, except:
– Data objects contain directory entries rather than file data
– A directory entry holds the entry name and the handle of that entry's metadata object (like an inode number)
– An extendable hash function selects which data object contains each entry (see the sketch after the diagram below)
– Mechanisms based on GIGA+ manage consistency as the directory grows or shrinks
Distributed Directories
[Diagram: extensible hashing distributes directory entries DirEnt1–DirEnt6 across directory data objects on Server0 through Server3]
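The hashing step can be pictured with a toy function like the one below. Real OrangeFS uses an extendable hash with GIGA+-style splitting as the directory grows; this fixed modulo version (with an invented djb2-style hash) only conveys the basic idea of mapping a name to one of the directory's data objects:

    #include <stdint.h>

    /* Toy name hash (djb2 variant), purely illustrative */
    static uint32_t name_hash(const char *name)
    {
        uint32_t h = 5381;
        while (*name)
            h = h * 33 + (uint8_t)*name++;
        return h;
    }

    /* Pick which directory data object should hold this entry */
    static unsigned pick_dirdata(const char *entry_name, unsigned n_dirdata)
    {
        return name_hash(entry_name) % n_dirdata;
    }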
Policies
• Objects of any type can reside on any server
– Random selection (the default) is used to balance load
– Users can control various things:
• Which servers hold data
• Which servers hold metadata
• How many servers a given file or directory is spread across
• How those servers are selected
• How file data is distributed across the servers
– Parameters can be set in the configuration file, many on a directory, or for a specific file (an example follows)
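On a mounted volume, per-directory policies are typically applied as extended-attribute hints. A sketch of what that can look like; the exact key names and values vary by release, so treat these as assumptions to check against your version's documentation:

    # Hint: stripe new files in this directory across 4 servers
    setfattr -n user.pvfs2.num_dfiles -v "4" /mnt/orangefs/data
    # Hint: select the distribution function for new files
    setfattr -n user.pvfs2.dist_name -v "simple_stripe" /mnt/orangefs/data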
User Interfaces
• Brief History
• Architecture Overview
• User Interfaces ← Halfway!
• Important Features
• Installation/Configuration
User Interfaces
• System Interface
• VFS Interface
• Direct Access Library
• MPI-IO
• Windows Client
• Web Services Module
• Utilities
The System Interface
• The low-level interface of the client library
– PVFS_sys_lookup()
– PVFS_sys_getattr()
– PVFS_sys_setattr()
– PVFS_sys_io()
• Based on the PVFS request protocol
• Designed to have other interfaces built on it
• Provides access to all of the features
• NOT a POSIX-like interface
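A rough sketch of system-interface code: resolve a path, then read its attributes. Argument lists are abbreviated and differ across releases (later versions add hint parameters, for example), so treat these calls as illustrations rather than exact prototypes:

    #include <stdio.h>
    #include <pvfs2.h>

    /* Sketch: look up a path and print the owner's uid.
     * Consult the pvfs2 system interface headers for exact prototypes. */
    int show_owner(PVFS_fs_id fs_id, PVFS_credentials *creds)
    {
        PVFS_sysresp_lookup  lk;
        PVFS_sysresp_getattr at;
        int ret;

        ret = PVFS_sys_lookup(fs_id, "/data/results.dat", creds,
                              &lk, PVFS2_LOOKUP_LINK_FOLLOW);
        if (ret < 0)
            return ret;

        ret = PVFS_sys_getattr(lk.ref, PVFS_ATTR_SYS_ALL, creds, &at);
        if (ret < 0)
            return ret;

        printf("owner uid: %d\n", (int)at.attr.owner);
        return 0;
    }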
The VFS Interface
• A Linux kernel module allows OrangeFS volumes to be mounted and used like any other
– Limited mmap support
– No client-side cache
– A few semantic issues
• Must run the client_core daemon
– Reads requests from the kernel, then calls the library
• The most common and convenient interface
• BSD and OS X support via FUSE
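Mounting then looks like any other file system; the device is a PVFS protocol URL naming one of the servers. The hostname, port, and file system name below are placeholders:

    mount -t pvfs2 tcp://ofs-server0:3334/orangefs /mnt/orangefs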
The Direct Library
• An interposition library for file-related stdio and Linux system calls
• Links programs directly to the client library
• Can preload the shared library and run programs without relinking
• Faster, lower latency, and more efficient than the VFS path
• Configurable user-level client cache
• The best option for serious applications
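Preloading might look like the following; the installed library path is an assumption here (the client notes later in this tutorial refer to the direct-access lib as libofs):

    LD_PRELOAD=/opt/orangefs/lib/libofs.so ./my_app /mnt/orangefs/input.dat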
Direct Access Library
[Diagram: one app using the kernel path through the client core and PVFS lib, beside an app linked against the direct lib and PVFS lib, bypassing the kernel]
• Implements:
– POSIX system calls
– Stdio library calls
• Parallel extensions:
– Noncontiguous I/O
– Non-blocking I/O
• MPI-IO library
Direct Interface Client Caching
[Diagram: client application → direct interface → client cache → file system servers]
• The Direct Interface enables multi-process coherent client caching for a single client
MPI-IO
• Most MPI libraries provide ROMIO support, which can access the client library directly
• A number of specific optimizations for MPI programs, especially for collective I/O
– Aggregators, data sieving, datatype support
• Probably the best overall interface, but only for MPI programs
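A minimal MPI-IO fragment. Prefixing the path with "pvfs2:" tells ROMIO to drive the file system directly rather than going through a kernel mount; the path itself is a placeholder:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank, val;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        val = rank;

        /* "pvfs2:" routes ROMIO straight to the client library */
        MPI_File_open(MPI_COMM_WORLD, "pvfs2:/mnt/orangefs/out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* collective write: each rank writes one int at its own offset */
        MPI_File_write_at_all(fh, rank * (MPI_Offset)sizeof(int), &val, 1,
                              MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }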
Windows Client
[Diagram: Windows apps access a mapped drive through Dokan and the OrangeFS API, speaking the PVFS protocol to the OrangeFS servers]
• Supports Windows 32/64-bit
• Server 2008, R2, Vista, 7
Web Services Module
• WebDAV
• S3
• REST
WebDAV
[Diagram: Apache server with mod_dav and a DAVOrangeFS module over the OrangeFS API, speaking the PVFS protocol to the OrangeFS server]
• Supports the DAV protocol and tested with (insert reference test run – check with mike)
• Supports DAV cooperative locking in metadata
S3
[Diagram: Apache server with a ModS3 module over the OrangeFS API, speaking the PVFS protocol to the OrangeFS server]
Admin REST Interface
[Diagram: a Dojo Toolkit UI issues REST requests to an Apache server with a mod_OrangeFS REST module over the OrangeFS API, speaking the PVFS protocol to the OrangeFS server]
Important Features
• Brief History
• Architecture Overview
• User Interfaces
• Important Features ← Over the hump!
• Installation/Configuration
Important Features
• Already touched on:
– Parallel files (1.0)
– Stateless, user-level implementation (2.0)
– Distributed metadata (2.0)
– Extended attributes (2.0)
– Configurable parameters (2.0)
– Distributed directories (2.9)
• Non-contiguous I/O (1.0)
• SSD metadata storage (2.8.5)
• Replicate on immutable (2.8.5)
• Capability-based security (2.9)
Non-Contiguous I/O
[Figure: access patterns that are noncontiguous in memory, noncontiguous in file, and noncontiguous in both memory and file]
§ Noncontiguous I/O operations are common in computational science applications
§ Most PFSs available today implement a POSIX-like interface (open, write, close)
§ POSIX noncontiguous support is poor:
• readv/writev are only good for noncontiguous in memory
• POSIX listio requires matching sizes in memory and file
§ Better interfaces allow for better scalability (see the MPI datatype sketch below)
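MPI-IO's derived datatypes are one such interface: a single call can describe a strided region of the file. A small sketch (the 4-int blocks and stride are arbitrary choices for illustration):

    #include <mpi.h>

    /* Read nblocks blocks of 4 ints, skipping 4 ints between blocks:
     * noncontiguous in the file, contiguous in memory. */
    void strided_read(const char *path, int *buf, int nblocks)
    {
        MPI_File fh;
        MPI_Datatype filetype;

        MPI_Type_vector(nblocks, 4, 8, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(MPI_COMM_SELF, (char *)path, MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
        MPI_File_read(fh, buf, nblocks * 4, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
    }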
SSD Metadata Storage
[Diagram: server writes file data to disk and metadata to SSD]
• Writing metadata to SSD
– Improves performance
– Maintains reliability
Replicate On Immutable
[Diagram: a client performs file I/O against a primary server, whose data is replicated to a second server]
• First step in the replication roadmap
• Replicate data to provide resiliency
– Initially, replication happens when a file is marked immutable
– Client reads fail over to the replicated file if the primary is unavailable
Capability Based Security
[Diagram: the client presents a certificate or credential to the metadata servers, which use OpenSSL/PKI to issue a signed capability; the client then presents the signed capability with each I/O request to the data servers]
Future Features
• Attribute-Based Search
• Hierarchical Data Management
Attribute-Based Search
[Diagram: clients issue parallel queries against key/value indexes on the metadata servers and then access the matching data files]
• Clients tag files with keys/values
• Keys/values are indexed on the metadata servers
• Clients query for files based on keys/values
• Queries return file handles, with options for filename and path
Hierarchical Data Management
[Diagram: users and HPC systems use OrangeFS as intermediate storage, with data staged to and from archive storage, NFS, and remote systems such as TeraGrid, OSG, Lustre, and GPFS]
Installation/Configuration
• Brief History
• Architecture Overview
• User Interfaces
• Important Features
• Installation/Configuration ← Last one!
Installing OrangeFS
• Distributed as a tarball
– You need:
• GNU development environment – GCC, GNU Make, Autoconf, flex, bison
• Berkeley DB v4 (4.8.32 or later)
• Kernel headers (for the kernel module)
• A few libraries – OpenSSL, etc.
Building OrangeFS
# ./configure --prefix=
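A typical sequence might look like the following; the install prefix and kernel source path are illustrative, and --with-kernel is only needed when building the kernel module:

    ./configure --prefix=/opt/orangefs \
                --with-kernel=/lib/modules/$(uname -r)/build
    make
    make install
    make kmod && make kmod_install   # kernel module targets; names may vary by release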
• Servers
– Many installations have dedicated servers
• Multi-core
• Large RAM
• RAID subsystems
• Faster network connections
– Others have every node act as both client and server
• Checkpoint scratch space
• User-level file system
• Smaller, more adaptable systems
Server Storage Modules
• AltIO
– General-purpose
– Uses Linux AIO to implement reads/writes
– Typically utilizes the Linux page cache
• DirectIO
– Designed for dedicated storage with RAID
– Uses Linux threads and direct I/O
– Bypasses the Linux page cache
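The storage module is chosen in the server configuration file. A fragment might look like this; the section placement and exact option values should be checked against the fs.conf documentation for your release:

    <Defaults>
        # directio for dedicated RAID storage; alt-aio for general use
        TroveMethod directio
    </Defaults>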
Parameters to Consider
• Default number of I/O servers
• Default strip size
• RAID parameters
– Cache behavior
– Block sizes
• Local file system layout (alignment)
Metadata Servers
• As many or as few as you need
– Every server can be a metadata server
• Metadata access load is spread across servers
– Dedicated metadata servers
• Keep metadata traffic away from data servers
• Custom-configured for metadata
– Depends on workload and data center resources
Clients
• At minimum, each client needs libpvfs2
– Direct access lib (libofs)
– MPI libraries
– Precompiled programs
– Web services
• Most clients will want the VFS interface
– pvfs2-client-core
– The pvfs2 kernel module (modpvfs2)
• Need a mount string pointing to a server (fstab) – an example follows
– With many clients, spread the load across servers
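An /etc/fstab entry might look like this; the hostname, port, and file system name are placeholders, and pointing different clients at different servers in their mount strings is one way to spread the load:

    tcp://ofs-server0:3334/orangefs  /mnt/orangefs  pvfs2  defaults,noauto  0 0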
Learning More
• www.orangefs.org web site
– Releases
– Documentation
– Wiki
• [email protected]
– Support for users
• [email protected]
– Support for developers
Commercial Support
• www.orangefs.com & www.omnibond.com
– Professional support & development team
QUESTIONS & ANSWERS