OrangeFS Overview Tutorial

Walt Ligon, Clemson University

Tutorial Goals

• Brief History
• Architecture Overview
• User Interfaces
• Important Features
• Installation/Configuration

Brief History

• Brief History ← Starting here!
• Architecture Overview
• User Interfaces
• Important Features
• Installation/Configuration

A Brief History

• First there was PVFS
  – V0: Aric Blumer (Clemson), 1993, PVM User's Meeting, Oak Ridge
  – V1: Rob Ross (Clemson), 1994 – 2000
  – V2: Rob Ross (Argonne), Phil Carns (Clemson), et al., 2000 – present

PVFS Funding

• NASA
• NSF
  – PACI
  – HECURA
• DOE
  – Office of Science
  – SciDAC
• Government agencies

PVFS Partnerships

• Clemson U
• Ohio Supercomputer Center
• Argonne NL
• Northwestern U
• Carnegie Mellon U
• Acxiom
• U of Heidelberg
• Ames NL
• Sandia NL
• U of Michigan
• U of Oregon
• Ohio St. U

Emergence of Orange

• Project started in 2007
  – Develop PVFS for "non-traditional" uses
    • Very large numbers of small files
    • Smaller accesses
    • Much more metadata
  – Robust security features
  – Improved resilience
• Began to see opportunities for a broader area
  – Big Data management
  – Clouds
  – Enterprise

Today and Beyond

• Clemson reclaims primary site
  – As of 2.8.4, began using the name "OrangeFS"
  – Omnibond offers commercial support
  – Currently on 2.8.6
  – Version 2.9.0 soon to be released with new features
  – 2.10 is the internal development version for …
    • OrangeFS 3.0

What to Expect in 3.0

• Totally redesigned object model
  – Replication
  – Migration
  – Dynamic configuration
  – Hierarchical Storage Management (tiers)
  – Metadata support for external devices (archive)
  – Rule-based security model
  – Rule-based policies

Architecture Overview

• Brief History
• Architecture Overview ← You are here!
• User Interfaces
• Important Features
• Installation/Configuration

Architecture Overview

• OrangeFS is designed for parallel machines
  – Client/server architecture
  – Expects multiple clients and servers
• Two major software components
  – Server daemon
  – Client library
• Two auxiliary components
  – Client core daemon
  – Kernel module

User Level

• Everything runs at the user level on clients and servers
  – The exception is the kernel module, which is really just a shim (more details later)
• A server daemon runs on each server node and serves requests sent by the client library, which is linked into client code on compute nodes

Networking

• Clients and servers communicate using an interface called "BMI"
  – BMI supports TCP/IP, InfiniBand, MX, and Portals
  – The modular design allows addition of new protocols
  – Requests use the PVFS protocol, similar to that of NFS but expanded to support a number of high-performance features

Server-to-Server Communication

[Diagram: applications and middleware on client nodes communicate over the network with multiple servers]

• Traditional metadata operation: a create request causes the client to communicate with all servers, O(p)
• Scalable metadata operation: a create request communicates with a single server, which in turn communicates with the other servers using a tree-based protocol, O(log p)
• For example, with roughly 1,000 servers the tree-based approach needs on the order of 10 communication rounds instead of 1,000 individual requests

Storage

• The OrangeFS server interacts with storage on the host using an interface layer named "trove"
  – All storage objects are referenced with a "handle"
  – Storage objects consist of two components
    • Bytestreams: sequences of data
    • Key/value pairs: data items that are accessed by key
  – As currently implemented …
    • Bytestreams are stored in the local file system (data)
    • Key/value pairs are stored in Berkeley DB (metadata)

File Model

• Every file consists of two or more objects
  – The metadata object contains
    • Traditional file metadata (owner, permissions, etc.)
    • List of data object handles
    • FS-specific items (layout, distribution, parameters)
    • User-defined attributes
  – Data objects contain
    • Contents of the file
    • Attributes specific to that object (usually not visible)

Directories

• Work just like files, except
  – Data objects contain directory entries rather than data
  – Directory entries hold an entry name and the handle of that entry's metadata object (like an inode number)
  – An extensible hash function selects which data object contains each entry
  – Mechanisms based on GIGA+ manage consistency as the directory grows or shrinks

Distributed Directories

[Diagram: extensible hashing distributes directory entries DirEnt1 – DirEnt6 across Server0 – Server3]

Policies

• Objects of any type can reside on any server
  – Random selection (the default) is used to balance load
  – The user can control various things
    • Which servers hold data
    • Which servers hold metadata
    • How many servers a given file or directory is spread across
    • How those servers are selected
    • How file data is distributed to the servers
  – Parameters can be set in the configuration file, many on a directory, or for a specific file (see the sketch below)
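As a rough illustration of per-directory control, a directory's striping could be adjusted from the shell. The extended-attribute names and paths below are assumptions, not taken from the slides; verify them against the documentation for your version.

# Hypothetical example: set a per-directory hint via an extended attribute
# on a mounted OrangeFS volume. Attribute name and paths are assumptions.
setfattr -n user.pvfs2.num_dfiles -v 4 /mnt/orangefs/projects   # spread new files over 4 data servers
getfattr -d /mnt/orangefs/projects                              # inspect which hints are currently set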

User Interfaces

• Brief History
• Architecture Overview
• User Interfaces ← Halfway!
• Important Features
• Installation/Configuration

User Interfaces

• System Interface
• VFS Interface
• Direct Access Library
• MPI-IO
• Windows Client
• Web Services Module
• Utilities (a few examples are sketched below)
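As a small illustration of the command-line utilities, a few common operations look like the following. The mount point and file names are placeholders.

# Illustrative use of the OrangeFS/PVFS command-line utilities;
# mount point and file names are placeholders.
pvfs2-ping -m /mnt/orangefs                      # verify the file system's servers are reachable
pvfs2-ls /mnt/orangefs                           # list a directory without going through the kernel
pvfs2-cp /tmp/input.dat /mnt/orangefs/input.dat  # copy a file into the file system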

The System Interface

• Low-level interface of the client library
  – PVFS_sys_lookup()
  – PVFS_sys_getattr()
  – PVFS_sys_setattr()
  – PVFS_sys_io()
• Based on the PVFS request protocol
• Designed to have other interfaces built on top of it
• Provides access to all of the features
• NOT a POSIX-like interface

The VFS Interface

• The kernel module allows OrangeFS volumes to be mounted and used like any other (see the mount sketch below)
  – Limited mmap support
  – No client-side cache
  – A few semantic issues
• Must run the client_core daemon
  – Reads requests from the kernel, then calls the library
• The most common and convenient interface
• BSD and OS X support via FUSE
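A minimal sketch of bringing up the VFS interface on a client. The module location, install prefix, server name, and file system name are illustrative, not values from the slides.

# Load the kernel module, start the client core daemon, and mount the
# volume. Paths, server name, and file system name are illustrative.
insmod /opt/orangefs/lib/modules/$(uname -r)/kernel/fs/pvfs2/pvfs2.ko
/opt/orangefs/sbin/pvfs2-client -p /opt/orangefs/sbin/pvfs2-client-core
mount -t pvfs2 tcp://server01:3334/orangefs /mnt/orangefs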

The Direct Library

• Interposition library for file-related stdio and Linux system calls
• Links programs directly to the client library
• Can preload the shared library and run programs without relinking (sketched below)
• Faster, lower latency, and more efficient than VFS
• Configurable user-level client cache
• Best option for serious applications
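A minimal sketch of the preload approach, assuming the direct access library is installed as libofs.so under the install prefix. The library path, and whether additional libraries must be preloaded alongside it, depend on your installation.

# Run an unmodified, dynamically linked program through the direct
# interface by preloading the client library. Path is an assumption.
LD_PRELOAD=/opt/orangefs/lib/libofs.so ./my_io_app /mnt/orangefs/results.dat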

Direct Access Library

[Diagram: an application using the kernel / client core path and an application using the direct library, both reaching the PVFS client library]

• Implements:
  – POSIX system calls
  – Stdio library calls
• Parallel extensions
  – Noncontiguous I/O
  – Non-blocking I/O
• MPI-IO library

Direct Interface Client Caching

[Diagram: client application → direct interface → client cache → file system servers]

• The Direct Interface enables multi-process coherent client caching for a single client

MPI-IO

• Most MPI libraries provide ROMIO, which has support for direct access to the client library
• A number of specific optimizations for MPI programs, especially for collective I/O
  – Aggregators, data sieving, data type support
• Probably the best overall interface, but only for MPI programs

Windows Client

[Diagram: Windows applications reach OrangeFS servers through direct access or a Dokan drive mapping, via the OrangeFS API and the PVFS protocol]

• Supports Windows 32/64-bit
• Server 2008, R2, Vista, 7

Web Services Module

• WebDAV
• S3
• REST

WebDAV

[Diagram: Apache server with mod_dav and DAVOrangeFS → OrangeFS API → PVFS protocol → OrangeFS server]

• Supports the DAV protocol and tested with (insert reference test run – check with mike)
• Supports DAV cooperative locking in metadata

S3

[Diagram: Apache server with ModS3 → OrangeFS API → PVFS protocol → OrangeFS server]

Admin REST Interface

[Diagram: Dojo Toolkit UI → REST → Apache server with Mod_OrangeFSREST → OrangeFS API → PVFS protocol → OrangeFS server]

Important Features

• Brief History
• Architecture Overview
• User Interfaces
• Important Features ← Over the hump!
• Installation/Configuration

Important Features

• Already touched on
  – Parallel Files (1.0)
  – Stateless, User-level Implementation (2.0)
  – Distributed Metadata (2.0)
  – Extended Attributes (2.0)
  – Configurable Parameters (2.0)
  – Distributed Directories (2.9)
• Non-Contiguous I/O (1.0)
• SSD Metadata Storage (2.8.5)
• Replicate On Immutable (2.8.5)
• Capability-based Security (2.9)

Non-Contiguous I/O

[Diagram: access patterns that are noncontiguous in memory, noncontiguous in the file, and noncontiguous in both memory and file]

• Noncontiguous I/O operations are common in computational science applications
• Most PFSs available today implement a POSIX-like interface (open, write, close)
• POSIX noncontiguous support is poor:
  – readv/writev are only good for data that is noncontiguous in memory
  – POSIX listio requires matching sizes in memory and file
• Better interfaces allow for better scalability

SSD Metadata Storage

[Diagram: server stores data on disk and metadata on SSD]

• Writing metadata to SSD
  – Improves performance
  – Maintains reliability

Replicate On Immutable

[Diagram: client file I/O to a primary server holding metadata and data, with data replicated to a second server]

• First step in the replication roadmap
• Replicate data to provide resiliency
  – Initially, replicate on immutable
  – Client reads fail over to the replicated file if the primary is unavailable

Capability-Based Security

[Diagram: the client presents a certificate or credential (OpenSSL PKI) to the metadata servers, receives a signed capability, and presents that signed capability with each I/O request to the I/O servers]

Future Features

• Attribute-Based Search
• Hierarchical Data Management

Attribute-Based Search

[Diagram: client issues parallel key/value queries to metadata servers and file accesses to data servers]

• Client tags files with keys/values
• Keys/values are indexed on the metadata servers
• Clients query for files based on keys/values
• Queries return file handles, with options for filename and path

Hierarchical Data Management

[Diagram: OrangeFS as intermediate storage linking users, HPC scratch, NFS, archive storage, and remote systems such as TeraGrid, OSG, and GPFS]

Installation/Configuration

• Brief History
• Architecture Overview
• User Interfaces
• Important Features
• Installation/Configuration ← Last one!

Installing OrangeFS

• Distributed as a tarball
  – You need
    • GNU development environment (GCC, GNU Make, Autoconf, flex, bison)
    • Berkeley DB V4 (4.8.32 or later)
    • Kernel headers (for the kernel module)
    • A few libraries (OpenSSL, etc.)

Building OrangeFS

# ./configure --prefix= --with-kernel= --with-db=
# make
# make install
# make kmod
# make kmod_install
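For illustration only, the configure line might be filled in as follows. The install prefix, kernel source path, and Berkeley DB location are placeholders, not values from the slides.

# Hypothetical paths; substitute your own install prefix, kernel build
# tree, and Berkeley DB installation.
./configure --prefix=/opt/orangefs \
            --with-kernel=/lib/modules/$(uname -r)/build \
            --with-db=/opt/db4
make && make install
make kmod && make kmod_install   # builds and installs the kernel module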

Configuring OrangeFS

• Servers (a setup sketch follows this list)
  – Many installations have dedicated servers
    • Multi-core
    • Large RAM
    • RAID subsystems
    • Faster network connections
  – Others have every node as both client and server
    • Checkpoint scratch space
    • User-level file system
    • Smaller, more adaptable systems
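A minimal sketch of standing up a server with the stock helper tools. The config file location and install prefix are placeholders, and flags should be checked against your version's documentation.

# Generate a server configuration file (interactive), create the storage
# space, then start the server daemon. Paths are illustrative.
/opt/orangefs/bin/pvfs2-genconfig /etc/orangefs/orangefs.conf
/opt/orangefs/sbin/pvfs2-server /etc/orangefs/orangefs.conf -f   # -f creates the storage space
/opt/orangefs/sbin/pvfs2-server /etc/orangefs/orangefs.conf      # normal startup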

Server Storage Modules

• AltIO
  – General-purpose
  – Uses Linux AIO to implement reads/writes
  – Typically utilizes the Linux page cache
• DirectIO
  – Designed for dedicated storage with RAID
  – Uses Linux threads and direct I/O
  – Bypasses the Linux page cache

Parameters to Consider

• Default number of I/O servers
• Default strip size
• RAID parameters
  – Cache behavior
  – Block sizes
• Local file system layout (alignment)

Metadata Servers

• As many or as few as you need
  – Every server can be a metadata server
    • Metadata access load is spread across servers
  – Dedicated metadata servers
    • Keep metadata traffic away from the data servers
    • Custom-configured for metadata
  – Depends on workload and data center resources

Clients

• At minimum, each client needs libpvfs2
  – Direct access lib (libofs)
  – MPI libraries
  – Precompiled programs
  – Web services
• Most clients will want the VFS interface
  – pvfs2-client-core
  – The pvfs2 kernel module
• Need a mount string pointing to a server (fstab); an example entry is sketched below
  – With many clients, spread the load across servers
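A sketch of an fstab entry for the kernel interface. The server name, port, file system name, and mount point are illustrative.

# /etc/fstab entry for an OrangeFS/PVFS2 volume (all names illustrative).
# Format: <protocol>://<server>:<port>/<fs name>  <mount point>  pvfs2  <options>
tcp://server01:3334/orangefs  /mnt/orangefs  pvfs2  defaults,noauto  0 0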

Learning More

• www.orangefs.org web site
  – Releases
  – Documentation
  – Wiki
• [email protected]
  – Support for users
• [email protected]
  – Support for developers

Commercial Support

• www.orangefs.com & www.omnibond.com
  – Professional support & development team

QUESTIONS & ANSWERS