An Overview of On-Premise File and Object Access Protocols

Dean Hildebrand, Research Staff Member, IBM Research
Bill Owen, Senior Engineer, IBM

v1.2

Outline: Introduction · File and Object Discussion · NFS · Object Storage Introduction · Swift · S3 · File to Object and Object to File Comparison · Discussion · Conclusion

Attendance Poll

SysAdmin / Storage Architect / Developer / Student / Researcher / Manager

Software Storage Market Growth

Accessing Data in On-Premise Storage Systems

Local vs Shared Storage

Local Storage

● Most common for laptops, desktops, mobile devices, server OS boot disks
● Typically formatted with a local file system
  ○ E.g., ext4, XFS, NTFS, HFS+, ZFS
● Invaluable to manage a single device (or maybe a few with LVM)
● Varying levels of availability, durability, scalability, etc., supported
● All limited to a single node
  ○ E.g., cannot support VM or container migration, support 1000s of applications, etc.
● In your research, think about the real benefits of further optimizing local storage
  ○ How many pressing problems are left to be solved? Only incremental gains?
  ○ Common to be used as a building block in higher-level storage systems

Shared Storage

(Diagram: clients connected over a network to shared storage devices: SSD, fast disk, slow disk, tape.)
● Supports any kind of client
● Independent scaling of clients and storage devices
● Supports any type of network and network/file protocol
● Independent scaling of storage bandwidth and capacity
● Supports any kind of storage device

Block Shared Storage

● Used to dominate, now mostly shrinking...except
  ○ FC continues to have very low latency, and so is finding new life with Flash storage systems
  ○ iSCSI still very popular for VMs
(Diagram: servers attached via iSCSI/FC/FCoE (and others) over FibreChannel/Ethernet to block storage, typically deployed in pairs for H/A, backed by SSD, fast disk, slow disk, and tape.)

Parallel and Scale-out File Systems

● Scalability (all dimensions)
● Performance (all dimensions)
● Support general applications and middleware
● Make managing billions of files, TB/s of bandwidth, and PBs of data *easy*
(Diagram: commercial and HPC clients using a proprietary file access protocol over Infiniband/Ethernet, scaling out as needed over SSD, fast disk, slow disk, and tape.)

Distributed Access Protocols

● Wide variety of solutions
● Vast range of performance and scalability options
● Standard and non-standard protocols
(Diagram: clients over Ethernet/Infiniband to storage tiers: SSD, fast disk, slow disk, tape.)

Distributed Access Protocols: Portability and Lock-In

● Standard APIs help
  ○ Maximize application portability
  ○ Minimize vendor lock-in
● Numerous benefits of standard protocols
  ○ Standard protocol clients ship in most OSs
  ○ Promote predictability of semantics and application behavior
  ○ Minimize changes to applications and system infrastructure when switching to a new storage system (many times due to reasons out of your control)
  ○ Applications can move between on-premise and off-premise (cloud) systems
  ○ Wider and broader user base makes it easier to find support and also hardens implementations

Distributed Access Protocols: Standards Are Not A Silver Bullet

● For File, while applications use POSIX, they are sensitive to the implementation
  ○ No common set of commands guarantees crash consistency [***]
  ○ For distributed file systems, it becomes even more complicated
    ■ Different crash consistency semantics, cache consistency semantics, locking semantics, security levels/APIs/tools, etc.
● For Object, each implementation varies w.r.t. level of eventual consistency, security, versioning, etc.
● Even what we consider standards are not actually well defined
  ○ E.g., SMB, S3

Examples:
● CIFS/SMB does sequential consistency whereas NFS has close-to-open semantics
● Versioning is quite different between object protocols

[***] All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI'14

Distributed Access Protocols: One Storage Protocol CANNOT Do It All

● There are so many vendors...each claiming they have *solved* data storage
  ○ or is it world hunger?
● Vendors sell what they have, not what you need
  ○ A storage seller takes what they have and makes it fit for practically any requirement and use case
  ○ Leads to many unsatisfied customers soon after deployment
● Many protocols have existed: DDN WOS, EMC ATMOS, CDMI, AFS, DFS, RFS, 9P, etc.

Tips
● Attend sessions like this to learn more about reality and not hype :)
● Dig into advertised feature support
  ○ How many customers use a feature, will the customer talk about it, in what context do they use it, etc.
● Validate the system on-premise using realistic workloads (do you know your workloads?)
  ○ Remember there is no guarantee for what you haven't tried (x- and y-axes have an upper bound for a reason)
● Don't buy H/W first and then expect any storage S/W vendor to support it efficiently

On-Premise Data Access Protocols: NFS and now Swift, S3

Winners

File: NFS and SMB are the clear winners
● SMB is being discussed in the SNIA tutorial this week, so we'll focus on NFS
● Note: HDFS also dominant for analytics

Object: industry appears to be centralizing around Swift and S3
● S3: Amazon + many, many apps/tools
● Swift: open source + API + 3 cloud vendors (or more)
● Easily repatriate apps due to cost

Tutorial Goals

SysAdmin / Storage Architect:
● Understand which protocols are best for which applications
● Understand tradeoffs between protocols
● Introduction to vendor landscape
● Be able to determine which file-based applications are good candidates for using an object protocol

Developers:
● Understand how to choose the best protocol for an application (and consequences of choosing the wrong protocol)
● Introduction to distributed data access research potential

Students:
● Introduction to NAS and Object history and vendor landscape
● Introduction to distributed data access research potential

Researchers / Managers:
● Understand challenges of on-premise distributed data access
● Understand on-premise research challenges

Outline: Introduction · File and Object Discussion · NFS · Object Storage Introduction · Swift · S3 · File to Object and Object to File Comparison · Discussion · Conclusion

Fantasy

File and Object

Both Can Do Anything

Reality

File vs Object: each has their strengths and weaknesses

Reality

File vs Object: confusion! Each has their strengths and weaknesses

Object vs File Summary

Object Store | File
Target 'cold' data (Backup, Read-only, Archive) | Target most workloads (except HPC)
Low to Medium Performance | Medium to High Performance
Typically Scales to Large Capacity | Typically Scales to Medium Scalability
Low Cost | Low to High Cost
Global and Ubiquitous/Mobile Access | Limited Capability for Global Distribution
Data Access through REST APIs | Standard File Data Access
Immutable Objects and Versioning | POSIX + Snapshots
Loose/Eventual Consistency | Strong(er) Consistency

Outline: Introduction · File and Object Discussion · NFS · Object Storage Introduction · Swift · S3 · File to Object and Object to File Comparison · Discussion · Conclusion

NFS: A Little History...

● NFSv2 in 1983
  ○ Synchronous, stable writes...outdated
  ○ Finally removed from Fedora
● NFSv3 in 1995
  ○ Still default on many distros...
● NFSv4 in 2003, updated 2015
  ○ Default in RHEL...possibly others
● NFSv4.1 and pNFS in 2010
  ○ Many structural changes and new features
● NFSv4.2 practically complete now…
  ○ Many new features, VM workloads specifically targeted
● Now going to try per-feature releases…

Deployment

● The beauty is that it is everywhere… (even Windows)
  ○ Well, mostly...more on that later with object
● Most NFS servers are in-kernel or proprietary
  ○ But Ganesha is the first open-source user-level NFS server daemon
● For the enterprise, scale-out NAS is now a requirement for capacity and availability
● New clients and environments emerging
  ○ VMware announces support for NFSv4.1 as a client for storing VMDKs
  ○ Amazon announces support for NFSv4.0 in AWS Elastic File System (EFS)
  ○ OpenStack Manila is a shared file service with NFS as the initial file protocol
  ○ Docker has volume plugins that support NFS

NFS Caching Semantics

● Not POSIX…
  ○ But a single client with exclusive data access should see POSIX semantics
● v2 could not cache data
  ○ Sync writes
● v3 can cache data, but...
  ○ Weak cache consistency
  ○ Revalidates on open and periodically (e.g., 30s)
  ○ Data must be kept in cache until 'committed' by server (just in case the server fails)
● v4 standardizes 'close-to-open' cache consistency
  ○ Similar to v3, but guarantees the cache is revalidated on OPEN and flushed at CLOSE
  ○ Also checked periodically and at LOCK/LOCKU
  ○ Note granularity is typically 1 second
  ○ Delegations reduce the number of revalidations required...
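To make close-to-open concrete, here is a minimal Python sketch, assuming a hypothetical NFSv4 mount at /mnt/nfs shared by two machines (paths are illustrative only):

```python
# Machine A: the CLOSE at the end of the 'with' block forces the NFS client
# to flush dirty data to the server.
with open('/mnt/nfs/shared.txt', 'w') as f:
    f.write('updated by A')

# Machine B: a fresh OPEN revalidates the client cache against the server,
# so a read issued after A's close() is guaranteed to see A's data.
# A read on a descriptor that was already open (no new OPEN) may still be
# served from a stale cache until the next periodic revalidation.
with open('/mnt/nfs/shared.txt') as f:
    print(f.read())
```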

NFSv3

● Collection of protocols (file, mount, lock, status)
  ○ Each on their own port
● Stateless (mostly)
  ○ Locks add state
  ○ Server must keep a request cache to prevent duplicate non-idempotent RPCs
● UNIX-centric, but seen in Windows too
● 32-bit numeric uids/gids
● UNIX permissions, but Kerberos also possible
● Works over UDP, TCP
● Needs a-priori agreement on character sets

NFSv4 New Features

● Security
  ○ NFSv4 ACLs (much more full featured than POSIX ACLs)
  ○ Lofty goals with the new GSS-API, but essentially the benefit is that Kerberos is officially supported and easier to configure
    ■ Kerberos V5 must always be supported (but not necessarily used)
● Delegations
  ○ Clients become the server for a file, coordinating multi-threaded access
  ○ Less communication and better caching
  ○ Also includes callbacks from server to client
  ○ Linux only implements RO delegations
● Uses a universal character set for file names
● Integrated and well defined locking
  ○ Removes need for additional ports and daemons
  ○ Share reservations for Windows
  ○ Mandatory locks supported
  ○ Much easier to support consistency across failures
● Better inter-protocol support
  ○ OPEN operation allows coordination with CIFS, etc.
● Finally standardized almost everything
  ○ Use of named strings instead of 32-bit integers
● Custom export tree with pseudo-name space
● Mandatory use of a congestion-controlled protocol (TCP)
● Compound RPCs
  ○ Dream was to reduce the number of messages, but….
  ○ Due to state operations and the POSIX API, the number of messages actually increases in some cases
● Referrals
  ○ Server can refer clients to other servers for a subtree
  ○ Migration, load balancing
● Increased create rate through asynchronous ops

NFSv4 Statefulness Implies Talkative

(Chart from "Newer Is Sometimes Better: An Evaluation of NFSv4.1", SIGMETRICS'16: OPEN & CLOSE account for much of NFSv4's additional traffic, 56% and 43% in the workloads shown.)

NFSv3: Statelessness ("the state of being immortal") vs. NFSv4: Lease-based state

But Does Statelessness Really Justify Lack of Innovation?

Are We Frozen In Time?

(Chart from "Newer Is Sometimes Better: An Evaluation of NFSv4.1", SIGMETRICS'16: Reading Small Files.)

NFSv4.1 New Features

● Introduces a session layer
  ○ Exactly-once semantics
  ○ Vastly simplifies locking
● Multipathing via trunking
  ○ Utilize more paths by using multiple IPs that can be identified as the same server
  ○ Retry failed requests over other paths
● Retention attributes for compliance
● Delegations are easier to manage
  ○ Recall-ANY semantics allow clients to decide which delegations are best to recall
  ○ Re-acquisition without re-open
● pNFS
  ○ Scalable data access to scale-out storage systems
  ○ Improved load balancing

What is pNFS?
● Scalable access to YOUR data
  ○ Direct and parallel data access
  ○ Scale with the underlying storage system
  ○ Better load balancing
  ○ If NFS can access it, then so can pNFS
● Standard file access (part of the OS)
  ○ Open client, no client licensing issues
● Layouts
  ○ Metadata
    ■ Clients always issue metadata requests to an NFSv4.1 server
    ■ Scale-out systems can support multiple metadata servers to the same data
  ○ Data
    ■ File layout part of NFSv4.1
    ■ Object and Block variants in separate drafts
● Security and Access Control
  ○ Control path uses NFSv4.1 security
  ○ Data path uses the security of the I/O protocol
(Diagram: pNFS clients accessing back-ends such as GPFS, HDFS, PanFS, ZFS, dCache, NetApp.)

What's Coming in NFSv4.2
● Sparse File Support
  ○ Hole punching to reclaim space
  ○ Avoid transferring 0s and unallocated space across the wire on reads
● Space Reservation
  ○ Ensure the application does not run out of space
● Server Side Copy
  ○ Finally stop copying data through the client machine
● Labeled NFS
  ○ Allows (partial) SELinux support for Mandatory Access Control (MAC)
● Client can inform the server of I/O patterns
  ○ Provide an fadvise-like mechanism over a network
● Application Data Holes
  ○ Allows definition of file format
  ○ E.g., initializing a 30G database takes a single over-the-wire operation instead of 30G of traffic
(Sparse file support and space reservation are great for managing virtual disk images.)

Other Notable NFS Features

● RDMA
  ○ Support possible for all versions of NFS, but best with NFSv4.1
● Federated File System
  ○ Enables file access and namespace traversal across independent file servers
  ○ Across organizations or within a single organization
  ○ Suite of standards including DNS, NSDB, ADMIN, and file-access (NFS)
● "Extended Attribute (xattr)" support on track to be the first post-NFSv4.2 feature
  ○ Existing named attributes did not work well with modern OS xattrs
  ○ New NFS xattrs will interoperate with existing OS xattr support

Ganesha: User Space NFS Server
● Ganesha History
  ○ Developed by Philippe Deniel (CEA)
● Ganesha Features
  ○ Efficient caching for data and metadata
  ○ Scalable, high performance
  ○ Per file system namespace (FSAL) modules
    ■ Abstraction that allows each file system to perform its own optimizations
    ■ Also allows for proxy support and other non-traditional back-ends
● User space makes life easier
  ○ Security managers (like Kerberos) reside in user space
    ■ Can be accessed directly via GSSAPI (no need for "rpc_pipefs")
  ○ ID mappers (NIS, LDAP) reside in user space
    ■ Accessed directly (the daemon is linked with libnfsidmap)
  ○ Fewer constraints on memory allocation than in kernel space
    ■ Managing huge pieces of memory is easy
  ○ Developing in user space is soooooooooooo much easier
    ■ Hopefully increased community support
(Great open-source community that includes IBM, Panasas, DDN, CEA, Red Hat.)

Ganesha FSAL: File System Abstraction Layer

● Namespace-independent API
  ○ Translation layer between the NFS server and the underlying file system
  ○ Allows the file system to customize how files and directories are managed
  ○ Allows for file system value-add
● Handle-based API (lookup, readdir, …)
● Implements namespace-specific authentication mechanisms
● Many FSAL modules exist
  ○ GPFS, HPSS, Proxy, Lustre, XFS, ZFS, GlusterFS, etc.

NFS Security

● NFSv3 first relied on ONC RPC
  ○ AUTH_SYS is trivial to exploit
  ○ AUTH_DES is trivial to exploit by someone with a degree in Mathematics
  ○ AUTH_KERB is better, but it isn't standard
    ■ No written specification to refer to
    ■ Like AUTH_SYS and AUTH_DES, there is no integrity or privacy protection
● All NFS versions now support RPCSEC_GSS
● NFSv4 added
  ○ Mandatory support for Kerberos V5
    ■ krb5 (authentication)
    ■ krb5i (auth+integrity)
    ■ krb5p (auth+integrity+privacy)
  ○ Removed the external mount protocol
  ○ NFSv4 ACLs

Quick Basics on ACLs (Authorization)

● Linux permissions too coarse
  ○ Single user too narrow
  ○ Group too broad
● POSIX ACLs are very basic
  ○ Allow multiple users/groups per file/directory
  ○ Files/directories inherit ACLs of their parent directory
  ○ Use standard userids

  Example 1: Give myuser read permissions to file1:
  $ setfacl -m user:myuser:r file1

● NFSv4 ACLs are richer
  ○ Close to subsuming Windows ACLs
  ○ An NFSv4 principal is a user/group (at an org) defined by text name
  ○ 4 types of Access Control Entries (ACEs)
    ■ ALLOW - Grant access
    ■ DENY - Deny access
    ■ AUDIT - Log access to any file or directory
    ■ ALARM - Generate an alarm on an attempt to access any file or directory
  ○ Can control inheritance among other things
  ○ Works well with enterprise directory services

  Example 2: Give myuser read permissions to file1:
  $ nfs4_setfacl -a "A::myuser@domain:R" file1

So Do I Just Need an NFS Server and I'm in Business?

Maybe….

...but how important are performance, scalability, availability, durability, multi-protocol access, backup, disaster recovery, encryption, compression, cost, ease of management, tiering, archiving, etc. to you?

If so, you need Scale-Out NAS

Many good options depending on requirements and budget:
● High Availability
  ○ Tricky with NFSv4, since state must be migrated
  ○ Failure requires a grace period to recover state
● Capacity and Performance Scaling
● Much, much more....

Workloads and Benchmarks

● Modern NAS systems can support 100k+ IOPS from 1000s of clients…
● So the range of workloads they are currently handling is practically everything
● SPEC SFS 2008 only represents a very specific 'metadata heavy' workload

(Figure: identical NAS appliances (GPFS, WAFL, ZFS) serving applications on physical machines directly over NFS/SMB vs. applications running inside virtual machines.)
● Meta-data ops: 72% of SPECsfs2008, but < 1% of a virtualized setup
● Current NAS Benchmarks ≠ New NAS Benchmarks

VM Workload Changes

Workload Property | Physical NAS clients | Virtual NAS clients
File and directory count | Many files and directories | Few files per VM
Directory tree depth | Deep and non-uniform | Shallow and uniform
File size | Lean towards small files | Multi-gigabyte, but sparse
Meta-data operations | Many | Almost none
I/O synchronization | Async and sync | All writes are sync
In-file randomness | Workload-dependent | Increased randomness
Cross-file randomness | Workload-dependent | Predictable
I/O sizes | Workload-dependent | Increased and decreased
Read-modify-write | Infrequent | More frequent
Think time | Workload-dependent | Increased

Workloads
● SPEC SFS 2014: 4 separate workloads that support any POSIX interface
  ■ Number of simultaneous builds
  ■ Number of video streams that can be captured
  ■ Number of simultaneous databases
  ■ Number of virtual desktops
● So SPEC SFS 2014 is a step forward, but still only represents a very, very marginal slice of possible workloads
  ○ Makes assumptions on architecture and use of features
    ■ Sparse/allocated files, file size, direct I/O, data ingest rates, etc.
  ○ The client now plays a pivotal role in results
  ○ NAS systems rarely support a single dedicated workload
  ○ Locking?
  ○ Doesn't cover day-to-day operations such as copying files, find, grep, etc.
● Won't see a big performance difference between NFSv4 and NFSv3
  ○ NFSv4 is more than just performance enhancements (pNFS an exception :)

Summary Comparison of NFSv4 over NFSv3

Benefits:
● Single protocol
● Coherent locking
● Security
  ○ NFSv4 ACLs
  ○ Enhanced Kerberos support
● Eliminate hotspots with pNFS
● Ride the wave of NFS enhancements
● Exactly-once semantics
● Asynchronous creates
● Close-to-open semantics

Drawbacks:
● More work for NFS developers

Summary

● In order for NFS to advance, need to move to NFSv4
● Let's work together to stop implementing new v3 servers
  ○ I do love the 90s though...but not everything is worth keeping
● Ask your NAS vendor if they have a path from NFSv3 to NFSv4.1
  ○ And to NFSv4.2 and beyond
● *Can* do most anything, but really good at the following use cases
  ○ Easy access to data within a LAN since laptops and servers have built-in clients
    ■ Plug-n-play for any file-based application
    ■ Very good performance without installing extra specialized clients
  ○ Small to moderate amounts of data
  ○ Interoperability with SMB
  ○ Storage for virtualization (and other emerging areas like containers)
● NFS continues to struggle with several areas
  ○ Mobile, WAN, HPC, Cost (for H/A), Scalability, Searching for Files and Data

Outline: Introduction · File and Object Discussion · NFS · Object Storage Introduction · Swift · S3 · File to Object and Object to File Comparison · Discussion · Conclusion

Why Do Clients Need Object Storage?

Significantly Reduced Complexity:
● Simplified data scaling through flat namespace
● Easy to use REST-based data access
● Storage automation
● User-defined metadata and search capabilities

Highly Scalable with Low Cost:
● Software defined storage flexibility
● Leverage low cost commodity hardware
● High density storage
● Handles ever increasing storage requirements

Global, Secure Multi-tenant Access:
● Global data access and distribution
● Multi-tenant management and data access
● Role based authentication
● Encryption of data in flight and at rest

Supports Emerging Workloads:
● Unstructured immutable data store
● Social
● Mobile
● Analytics

Sample Object Storage Use Cases

Backup and Disaster Recovery:
● Private, public, hybrid backup repository
● Recover after a data loss event (corruption, deletion)
● Leverage the copy on object storage to recover from a disaster

Archive:
● Active archive
● Compliant archive
● Cold archive
● Big data storage / analytics

Content storage and distribution:
● Content management repository
● Global collaboration and distribution

Cloud Service Provider (CSP):
● Public cloud storage
● Static web content
● Non-ephemeral data store for cloud compute

How Do Clients Access Object Storage?
● Two APIs emerging for on-premise object storage deployments
  ○ OpenStack Swift
  ○ Amazon S3
● Many products/public clouds support proprietary APIs
  ○ Azure
  ○ Google Cloud Storage
  ○ EMC Atmos
  ○ DDN WOS
● CDMI is an attempt to standardize, but support is fading
● Concepts are similar across all APIs; we will focus on Swift and S3

Object Storage Introduction

Some questions we'll answer:

1. Object APIs are built using RESTful APIs: what does that mean?
2. What are the commands Object APIs support?
   ○ Are there extensions?
3. What does an object command look like?
   ○ How do I know if my client request succeeded?
4. What is object data?
5. What is eventual consistency?
   ○ Is it the same for every kind of object storage?
6. How do I make my object store secure and protected?

Just Enough REST

REpresentational State Transfer

Defined:
● Resource Based
● Stateless
● Client Server
● Cacheable
● Layered System

In Practice:
● Simple Interfaces
● Resources uniquely identified by URIs
● Relationships are identified by URIs
● Can access from ANY HTTP client

Note: There is no REST standard. It is an “architectural style” with plenty of best practices defined. It is typically composed using standards like http, xml, json, etc.

Object Resources

Resource (Description) | Swift | S3
Your Data! | Object | Object
Collections of Objects | Container | Bucket
Collections of Containers/Buckets in an Organization Unit (Department, Company, Site) | Account | Service (implicit)
Discoverability: provides listing of configuration information | Info | n/a
Location information: provides URI to access a resource directly from the storage server | Endpoints | n/a
Bucket sub-resources | features provided with middleware | acl, policy, lifecycle, version, replication, tagging, cors, website, ...
Object sub-resources | n/a | acl, torrent

Object Namespace: Super Simple

(Diagram: Swift namespace: Account → Container → Object, with several independent accounts; Amazon S3 namespace: Service → Bucket → Object.)

Object REST Operations

Operation | Description | Idempotent? | Safe?
GET | Return the contents of the resource | Y | Y
HEAD | Return the metadata for a resource | Y | Y
PUT | Create or update the contents of the resource | Y | N
POST | Create, update, or delete metadata for the resource, or create a subresource | N | N
DELETE | Remove the resource from the system | Y | N
COPY | Copy an object to another location (Swift only) | Y | N

Example Command Format

General format: command uri[?query-string] headers [data]

Swift URI: http(s)://server:port/api_version/account/container/object

Example:
GET https://192.168.56.101:8080/v1/AUTH_acct/Demo-Container/object1 -H "X-Auth-Token: xxxxxx"

S3 URI: http(s)://server:port/bucket/object

Example:
GET https://192.168.56.101:8080/s3_test_bucket/object1 -H 'Date: Sat, 06 Feb 2016 19:25:22 +0000' -H 'Authorization: AWS s3key:xxxx'
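These commands map one-to-one onto any HTTP library; a minimal Python sketch using requests for the Swift example above (endpoint and token are the same placeholder values):

```python
import requests

# Swift: GET an object using a previously obtained token
resp = requests.get(
    'https://192.168.56.101:8080/v1/AUTH_acct/Demo-Container/object1',
    headers={'X-Auth-Token': 'xxxxxx'})
print(resp.status_code)   # e.g., 200 on success (see the response codes below)
data = resp.content       # the object's bytes
```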

And some common response codes...

Description | Code | Client Retry? | Common Examples
Success | 20x | No effect | 200: Success (GET); 201: New resource created (PUT); 202: Accepted (POST); 204: No content (HEAD)
Client Error | 4xx | No, will still fail | 400: Bad Request (incorrectly formatted request, e.g., a non-numeric quota specification); 401: Unauthorized (wrong credentials); 403: Forbidden (no access to resource); 404: Not Found (wrong URI); 405: Method Not Allowed (PUT to a resource that doesn't support it)
Server Error | 5xx | Yes, in most cases | 500: Internal Server Error (system problem; can be transient); 503: Service Not Available (often due to loading; internal timeout)
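Clients can act on these classes mechanically: retry 5xx with backoff, never retry 4xx. A minimal sketch (the helper name is hypothetical):

```python
import time
import requests

def request_with_retry(method, url, retries=3, backoff=1.0, **kwargs):
    """Retry only on 5xx; a 4xx is a client error that a retry will not fix."""
    for attempt in range(retries + 1):
        resp = requests.request(method, url, **kwargs)
        if resp.status_code < 500:
            return resp                           # 2xx or 4xx: return immediately
        if attempt < retries:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff, then retry
    return resp
```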

S3 details: http://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html
Swift details: http://developer.openstack.org/api-ref-objectstorage-v1.html

Some Simple Example Clients/Libraries

Swift:
● curl
● boto (python library)
● poster (firefox browser plugin)
● swiftclient

S3:
● curl/s3curl
● boto
● poster
● s3sh
● s3cmd

Note that client caching is not common in object libraries/clients today

Some Example Requests: Firefox Poster

Poster: when you want full control. (Screenshot: composing a container HEAD request in Poster.)

(Screenshot: results of the HEAD request.)

Some Example Requests: GET

Get a list of containers in a Swift account using the swift command line (hiding all command details): (screenshot)

Using curl: (screenshot of the curl command and response)
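The screenshots are not reproduced here; as a rough equivalent, a minimal python-swiftclient sketch (auth endpoint and credentials are placeholders):

```python
import swiftclient

# authenticate against Keystone (auth v2, "tenant:user" form) and list containers
conn = swiftclient.Connection(authurl='http://192.168.56.101:5000/v2.0',
                              user='demo:demo', key='password',
                              auth_version='2.0')
headers, containers = conn.get_account()   # an account GET returns the container list
for c in containers:
    print(c['name'], c['count'], c['bytes'])
```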

Note: We will talk about authentication details later

Some Example Requests: GET

Using curl and formatting the output as JSON or XML (Swift supports format=json and format=xml query parameters): (screenshot)

Some Example Requests: GET

Get a list of all objects in an S3 bucket using boto: (screenshot of the boto script)

Output: (screenshot of the listing output)
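A hedged reconstruction of what such a boto (v2) listing looks like against an on-premise S3-compatible endpoint (host, port, and credentials are placeholders):

```python
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='s3key', aws_secret_access_key='xxxx',
    host='192.168.56.101', port=8080, is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

bucket = conn.get_bucket('s3_test_bucket')
for key in bucket.list():          # a bucket GET returns the object listing
    print(key.name, key.size, key.last_modified)
```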

Additional Object API Features
● Access Control Lists
● Quotas
● Versioned Objects
● Expiring Objects
● Automatic Storage Tiering
● Storage Policies (placement, durability, etc.)
● Upload Multipart Large Objects
● Container Synchronization
● Notification Infrastructure
● Metadata Search

Object Storage Metadata

● Useful for flexibly organizing data in the flat namespace and enriching data
● System metadata on objects
  ○ Creation time
  ○ etag (md5sum of object contents)
● User metadata on objects (and accounts and containers in Swift)
  ○ Attribute/value pair passed as a header in a PUT or POST request
  ○ Objects: new metadata overwrites all previous metadata for that object
  ○ Accounts & containers (Swift only): new metadata is added to existing metadata
● Coming soon: metadata search

Object Storage Metadata: An Example (bill_selfie.jpg)

System metadata:
  Content-Length: 68351
  Content-Type: image/jpeg
  Etag: 1f32161a3c3baefb9a548a72daffa7ab
  X-Timestamp: 1455144452.21496
  X-Object-Meta-Mtime: 1455144440.207139

User (client) metadata:
  X-Object-Meta-Brightness: 10.5
  X-Object-Meta-Latitude: 117.2303
  X-Object-Meta-Longitude: 33.03279
  X-Object-Meta-Altitude: 2322.16
  X-Object-Meta-Aperture: 2.275
  X-Object-Meta-Camera-Model: iphone6.0.1.3

Metadata can be as valuable as the data itself!
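Setting user metadata is just a matter of headers; a minimal python-swiftclient sketch (container name, token, and values are placeholders; recall that a POST replaces all existing user metadata on an object):

```python
import swiftclient

# reuse a token obtained earlier (the preauth* arguments skip the auth round-trip)
conn = swiftclient.Connection(
    preauthurl='http://192.168.56.101:8080/v1/AUTH_acct',
    preauthtoken='xxxxxx')

conn.post_object('photos', 'bill_selfie.jpg',
                 headers={'X-Object-Meta-Brightness': '10.5',
                          'X-Object-Meta-Camera-Model': 'iphone6.0.1.3'})

headers = conn.head_object('photos', 'bill_selfie.jpg')  # HEAD returns the metadata
print(headers.get('x-object-meta-brightness'))
```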

Eventual Consistency: CAP Theorem

CAP Theorem: pick any 2
1. Consistency
2. Availability
3. Partition tolerance

Object storage systems are typically AP
● Consistency is eventual
● No standard

Note that a POSIX-based distributed file system would require CP...

Eventual Consistency: I/O Characteristics

● Typically no locking
● Object operations are atomic
  ○ The entire object must be written successfully to be committed
  ○ Reads will always return a consistent object or no object
● Range reads supported, but not range writes
  ○ This is an artifact of HTTP GET/PUT, and derives from the consistency model
● Last writer (creator) wins
  ○ For concurrent creates of the same object, the one with the latest timestamp wins
● Container/bucket listings may be updated asynchronously

Eventual Consistency

● Consistency is a characteristic of the object store implementation
  ○ No standard
  ○ Different products, different architectures = different consistency models
● When writing an object:
  ○ The container listing will not show the object until container updates are completed
● When deleting an object:
  ○ The object may continue to appear in the container listing until container updates are completed
● When replacing an object:
  ○ Reads may return the existing version until the new version is propagated across the entire system [1]

[1] If storage backend is strongly consistent (like a parallel file system), the new or updated object is available to all nodes as soon as the write is committed.

Object Storage Architectures

Community Swift:
● Object PUTs are 3x replicated
● A majority of writes must succeed for a success status
● Consistency daemons ensure that failed replicas are eventually written
● Reads try each replica sequentially until success
● Account & container listings updated asynchronously

Object Storage Architectures

Swift with clustered file system storage:
● Object storage writes a single replica
● The file system is responsible for data replication
● Account and container listings updated asynchronously
● Reads always go to the single replica
(Diagram: object nodes on top of a clustered file system.)

Object Security

● Production object storage systems typically interface with a dedicated identity service like OpenStack Keystone
● Simpler schemes can be used for proof of concept (Swift tempauth)
● Authentication: does the user in a request have a valid password or security token?
● The authentication service may integrate with an enterprise directory service using LDAP or Microsoft Active Directory
● Authorization: does the user in a request have permission to execute that request?

Authentication/Authorization Example using OpenStack Swift

A client wants to upload an object to a container in project MYACCOUNT:

1. The client sends credentials to the Keystone identity service
2. Keystone verifies the credentials, creates a new token, and returns it to the client
3. The token contains authorization information:
   a. Endpoint catalog (a list of available services)
   b. Projects the requesting user is assigned to
   c. Role for that project
   d. Token expiration time
4. The client sends the upload request (including the token) to the object storage service
5. The object storage service verifies the token with Keystone (or with a cached copy of the token)
6. If the client has a valid role for MYACCOUNT, the upload request is implemented
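A minimal Python sketch of this flow (Keystone v3 endpoint, credentials, and object names are placeholders):

```python
import json
import requests

# Steps 1-3: request a project-scoped token from Keystone
creds = {'auth': {
    'identity': {'methods': ['password'],
                 'password': {'user': {'name': 'demo',
                                       'domain': {'id': 'default'},
                                       'password': 'secret'}}},
    'scope': {'project': {'name': 'MYACCOUNT', 'domain': {'id': 'default'}}}}}
resp = requests.post('http://localhost:5000/v3/auth/tokens',
                     headers={'Content-Type': 'application/json'},
                     data=json.dumps(creds))
token = resp.headers['X-Subject-Token']   # Keystone returns the token in a header

# Steps 4-6: use the token on the upload request until it expires
with open('vacation.mp4', 'rb') as f:
    requests.put('http://util5:8080/v1/AUTH_acct/container/newobject',
                 headers={'X-Auth-Token': token}, data=f)
```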

Object Security

Secure Data In Flight
● SSL can be enabled from the client to the identity service, and to the object storage service
● A load balancer can also provide SSL termination

Secure Data at Rest
● Data encryption can be provided by the object storage software or by the storage backend (or by the client)
● Not all data needs to be encrypted: enable encryption on a bucket or container basis
● Consider the maturity of the encryption implementation
● External key manager vs. integral key encoding

Object Data Protection

Object storage data protection is typically implemented with
● 3x replication: local or geo-dispersed
● Erasure coding: local or geo-dispersed

Either approach can be implemented by the object storage software or delegated to the storage backend

How to protect against user error? Or application bugs?
● Backups and snapshots still have their place
● Snapshot and/or backup critical portions of your data
● Easy to select by container, but can also select by metadata values

Outline: Introduction · File and Object Discussion · NFS · Object Storage Introduction · Swift · S3 · File to Object and Object to File Comparison · Discussion · Conclusion

OpenStack: Open Source IaaS Platform & Global Collaboration

Mission: create a ubiquitous open source platform that is simple to implement and massively scalable

Exponential growth in ~1 year: Mar 2013: 859 contributors, 8,500 members → Oct 2014: 2,556 contributors, 16,000+ members

Design Goals
● Scalable: massive scale
  ○ 1 million physical machines, 60 million VMs
  ○ Billions of objects stored
● Controlled by the OpenStack Foundation
  ○ IBM is proud to be a Platinum Sponsor
● Open: all code is Apache 2 licensed
● Simple: architecture is modular
● Composed of multiple projects around the four capabilities
  ○ Compute
  ○ Network
  ○ Storage
  ○ Shared services

History of OpenStack Swift

Date | Release | Description
Aug 2009 | n/a | Swift development started by Rackspace
Jul 2010 | n/a | OpenStack launches with 25 member companies
Oct 2010 | Swift 1.1.0 (Austin) | First OpenStack release; includes Swift & Nova
Jun 2012 | Swift 1.6.0 (Essex) | Integration with Keystone
Jun 2014 | Swift 2.0.0 (Icehouse) | Adds Swift storage policy support
Jan 2016 | Swift 2.6.0 (Liberty) | Current release

Early OpenStack History: http://www.tiki-toki.com/timeline/entry/138134/OpenStack-History/

As of June 2015: over 300 PB of Swift storage deployed

Swift API and Semantics

OpenStack Swift is two parts:
● API specification & middleware description
● Object storage implementation

Two choices for object storage implementation:
● Native Swift
  ○ Can be extended, but the core is Swift
● API emulation
  ○ Can never be 100% compatible
  ○ Especially difficult to emulate middleware

API & middleware links:
● http://developer.openstack.org/api-ref-objectstorage-v1.html
● http://docs.openstack.org/developer/swift/middleware.html

High-Level View of OpenStack Swift

● Load balancer (e.g., HAProxy) balances requests
  ○ Each request is stateless
● Proxy nodes (the public face) authorize and forward requests to the appropriate storage server(s) using the ring
● Storage nodes (account, container, and object) store, serve, and manage data and metadata, partitioned based upon the ring
● Keystone authentication service (the public face for identity): authenticates credentials and provides an access token for future requests
  ○ Users can be defined locally or in an external LDAP or AD system
  ○ Also defines user roles for accounts / projects
● Object mapping and layout
  ○ Objects mapped to partitions by a hash on the fully qualified object name
  ○ Partitions mapped to virtual devices using a consistent hashing ring
● Additional Swift services maintain eventual consistency in the distributed object storage environment
  ○ Account, container & object updaters, replicators, auditors, reaper

Proxy Server Architecture

Proxy server (wsgi pipeline + controllers); all user requests:
● Requests & responses pass through the wsgi pipeline
● Community and custom middleware
● Requests delegated to a controller module (account, container, or object)
● Controller forwards requests to the account, container, or object server
● Responses are received by the controller & passed to the client

Storage Server Architecture

Object server (wsgi pipeline, pluggable backend, diskfile):
● Reads and writes object files onto storage
● Pipeline for community or custom middleware
● Pluggable backend interface
● diskfile controls object layout on the filesystem
● The SwiftOnFile diskfile provides file access to object data

Account and container servers (wsgi pipeline, pluggable backend):
● Manage the listing db for each account and container
● Pipeline for community or custom middleware
● Pluggable backend interface
  ○ Specified, but no community implementations
  ○ Could allow the use of directory listings instead of account and container dbs for the SwiftOnFile layout

Anatomy of an Object Write: Client Gets a Token

1. Client sends a token request to Keystone with credentials
2. Keystone authenticates the credentials using a local or external identity server
3. If the credentials are OK, Keystone returns a token to the client

Example:
curl -i \
  -H "Content-Type: application/json" \
  -d @mycreds.json \
  http://localhost:5000/v3/auth/tokens

(Diagram: client → Keystone; proxy + object nodes on a clustered file system.)

Anatomy of an Object Write: Client Issues PUT Request

1. Client sends a PUT request to the proxy-server with the token, object URI, and object data
2. Client saves the token for use until the token expires

Example:
curl -i -X PUT -H "X-Auth-Token: $TOKEN" \
  http://util5:8080/v1/acct/container/newobject \
  -T vacation.mp4

Anatomy of an Object Write: Proxy Processes PUT Request

1. The proxy-server receives the request, and each middleware in the pipeline looks at and optionally acts on the request
2. The authtoken and keystoneauth middleware authenticate and authorize the request (against data in memcached if possible)

Anatomy of an Object Write: Proxy Processes PUT Request (continued)

1. The proxy-server adds an X-Timestamp header to the request with the current system time
2. Use the ring to determine where each replica of the object is to be placed
   • object-server IP
   • virtual device
   • partition
   • object URI hash
3. Pass the PUT request to the designated object-server(s)
4. The embedded wsgi server manages reading data a chunk at a time from the client and passing it on to the object-server

Example: http://util5:8080/v1/acct/container/newobject is placed here:
192.167.12.22:$mount/object_fileset/o/z1device42/objects/13540/3bd/d39381ea07419cec19ae196149a943bd/
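How a name becomes a partition: Swift hashes the fully qualified object name with MD5 and takes the top bits as the partition number. A simplified sketch (the part power is illustrative, and real rings also salt the hash with a cluster-wide prefix/suffix):

```python
import hashlib

part_power = 14                        # ring built with 2^14 partitions (assumption)
name = '/acct/container/newobject'     # fully qualified object name
digest = hashlib.md5(name.encode('utf-8')).digest()
# the top bits of the first 4 bytes of the MD5 select the partition
part = int.from_bytes(digest[:4], 'big') >> (32 - part_power)
print('partition =', part)             # the ring maps this partition to devices
```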

Anatomy of an Object Write: Object Processes PUT Request

1. The object-server receives the PUT request and checks that it satisfies object constraints (valid timestamp, object name length within limits, etc.)
2. Create a diskfile instance for the new object
3. diskfile creates a tmp file and begins writing to it
4. Calculate length & md5sum for the new object as the object is written
5. When the object write is complete, write system metadata to the object as file xattrs
6. Move data to the location specified by the ring; filename <timestamp>.data
7. Remove any files older than the new timestamp

Example tmp file location:
$mount/object_fileset/o/z1device42/tmp/tmpVkeXj

Example object location:
$mount/object_fileset/o/z1device42/objects/13540/3bd/d39381ea07419cec19ae196149a943bd/1442395677.59514.data

Anatomy of an Object Write: Update Container and Return Status

1. Send a request to the container server to add the new object to the container listing
2. Wait a short time (2 sec) for the container server response
3. If the container update times out, write the update into the async_pending directory
   Note: the object-updater is responsible for updating container dbs with async_pending entries
4. Return status to the proxy server, and on to the client

Example async_pending location:
$mount/object_fileset/o/z1device42/async_pending/

Extending Swift: Diskfile Interface

The object server diskfile is the on-disk abstraction layer (wsgi pipeline → pluggable backend → diskfile)
● Deployers can implement their own storage interface
● Specialized classes for Manager, Reader & Writer
● Example diskfiles:
  ○ Community (default)
  ○ Swiftonfile: Red Hat, IBM
  ○ Swift-…
  ○ Seagate-kinetics
  ○ Isilon
  ○ In-memory
● Swiftonfile provides native access to object data through the filesystem interface

Extending Swift: WSGI Middleware

API? or Implementation?
● Web Services Gateway Interface (WSGI):
  ○ Python standard PEP 3333
  ○ Chain together modules to process requests
  ○ Used by all OpenStack services
● Middleware:
  ○ Pluggable modules that can be configured in the request "pipeline"
  ○ Specified in the service configuration file
  ○ Each middleware module has a chance to process (or change) a request coming in
  ○ And to process (or change) the response on the way out

Proxy Server Middleware

(Diagram: client requests flow through the proxy-server WSGI pipeline (mware-1 → mware-2 → ... → mware-n → proxy-server controllers) for the operations GET, PUT, POST, HEAD, DELETE; the pipeline is defined in proxy-server.conf.)

Extending Swift: WSGI Middleware Examples
● authentication & authorization: auth_token, keystoneauth
● multi-part upload: slo, dlo
● quotas: account-quotas, container-quotas
● protocol emulation: swift3, s3token
● bulk operations: expand archive on upload
● object versioning
● container sync
● rate limiting
● domain remapping
● static web & temporary url
● profiling & monitoring
● your custom middleware (a minimal sketch follows below)

http://docs.openstack.org/developer/swift/middleware.html
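Custom middleware is just a WSGI callable plus a paste-deploy factory. A minimal sketch, with hypothetical module and filter names:

```python
class SimpleLogging(object):
    """Print the method and path of every request passing through the pipeline."""

    def __init__(self, app, conf):
        self.app = app        # the next middleware (or the proxy-server app)
        self.conf = conf

    def __call__(self, env, start_response):
        print('%s %s' % (env['REQUEST_METHOD'], env['PATH_INFO']))
        return self.app(env, start_response)   # hand the request down the pipeline


def filter_factory(global_conf, **local_conf):
    conf = dict(global_conf, **local_conf)
    def factory(app):
        return SimpleLogging(app, conf)
    return factory

# In proxy-server.conf, register the filter and place it in the pipeline:
# [pipeline:main]
# pipeline = catch_errors authtoken keystoneauth simple_logging proxy-server
# [filter:simple_logging]
# use = egg:my_package#simple_logging
```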

Storage Policies
● Used by the object server only
● Allow you to specify:
  ○ Durability levels: 1, 2, or 3x replication
  ○ Storage backends
    ■ Cost vs. performance tradeoffs
    ■ Storage features: encryption, compression, ...
  ○ Grouping of storage nodes
    ■ Including multi-region
● Containers are permanently assigned to a policy on creation
  ○ Default or explicit
  ○ Policies can be deprecated: no new containers assigned

Geo-Distributed Object Clusters: Building an Active-Active Multi-Site Storage Cloud

● Global distribution: ingest and access from any data center
● Multi-site availability: objects replicated across 2 or more sites
● Flexible: async or sync replication

Geo-Distributed Object Clusters: Architecture Details
● Disaster recovery from data center failures: active-active storage cloud
● Binds geo-distributed sites into an extended-capacity storage cloud
● Leverages Swift replication between sites
● Objects are stored in one or more regions depending on
  ○ Required durability: data copies can be 1 to N (typically max of 3)
  ○ Required number of supported data center failures
● Objects accessible from ANY site
  ○ If an object is not local, the system retrieves it from a remote region
● Asynchronous or synchronous replication
● Research on WAN acceleration technologies
  ○ Aspera or TransferSoft are examples
(Diagram: Region A / Data Center 1, Region B / Data Center 2, Region C / Data Center 3.)

Swift Authentication: Pluggable Authentication and Authorization

Three common flavors, one choice for production environments:

1. Keystone
   ○ Production-ready identity system
   ○ Models users, roles, projects, domains (v3) & groups (v3)
   ○ Supports integration with backend LDAP and AD
   ○ authtoken (authentication) and keystoneauth (authorization) middleware
   ○ Authentication through the separate Keystone API
2. tempauth, aka "version 1"
   ○ Super simple
   ○ User credentials & project assignment stored in proxy-server.conf
3. swauth
   ○ User credentials & project assignment stored in Swift

Swift Authentication: Role Based Access Control

Two Swift authorization roles today:
1. operator
   a. Can create, update, and delete containers and objects in projects where the role is assigned
   b. Can assign ACLs to control other users' access
   c. The operator_roles config value (proxy-server.conf) specifies the Keystone roles
2. reseller_admin
   a. Can operate on any account
   b. The reseller_admin config value (proxy-server.conf) specifies the Keystone roles

Finer access control with Swift Container ACLs

Swift Additional Features

● Quotas on accounts and containers
  ○ Must have the reseller_admin role to set account quotas
● StaticWeb: serve container contents as a static web site
● Versioning
  ○ Current version in the current container
  ○ Older versions in a dedicated container
  ○ Implemented in middleware (as of Swift version 2.4)
● Static and Dynamic Large Objects: multi-part upload
● RateLimit: limit operations on accounts and containers
● Object expiration

Some OpenStack Swift Issues

● Community software hard to install & manage
● Performance
  ○ Standard Swift daemons scan directory metadata every 30s, decreasing performance of the entire system by increasing CPU and disk utilization
  ○ No data caching
  ○ Upcoming erasure coding can hurt performance for small objects
  ○ Slow to rebuild
● Inefficient to scale capacity
  ○ Swift must re-balance partitions to add additional storage, creating the potential for out-of-space conditions and requiring excessive over-provisioning and data movement
● Lack of enterprise features
  ○ Backup/snapshots/encryption
  ○ No ILM for tiering or to external storage (tape)
  ○ RAS, etc.

Get Involved!
● Core Swift community:
  ○ Weekly meetings on IRC
  ○ Fix bugs, improve tests, improve docs
  ○ Single-process optimizations
  ○ Container sharding
  ○ Improved versioning
  ○ Encryption
  ○ Erasure codes
● swiftonfile: unified file and object access
  ○ Bi-weekly meetings on IRC
● swift3: Amazon S3 emulation middleware
  ○ Bi-weekly meetings on IRC

Outline: Introduction · File and Object Discussion · NFS · Object Storage Introduction · Swift · S3 · File to Object and Object to File Comparison · Discussion · Conclusion

History of Amazon S3 Storage & API

Date | Description
June 2006 | Amazon launches Simple Storage Service
2008 | Amazon reports over 29 billion objects hosted by S3
2010 | S3 API support for versioning, bucket policies, notifications, multi-part upload
2011 | S3 API support for server side encryption, multi-object delete & object expiration
2012 | S3 API support for CORS & archiving to Glacier
2013 | Amazon reports over 2 trillion objects hosted by S3
2014 | S3 API support for lifecycle versioning policies, SigV4, event notification
2015 | S3 API support for cross region replication, infrequent access storage class

(Chart: approximate object count in S3, in billions.)

Why Use S3 for On-Premise Storage?

● Run the same apps against on-premise and cloud storage
● Repatriate S3 cloud data & applications to reduce cost
● Rich API and tool set
● Swift3 middleware provides an emulation layer in a Swift environment

But…

● Some APIs may not apply on premise: e.g., torrents, payments
● The API is controlled by Amazon with no published extension points
● On-premise implementations will not be 100% compatible

S3 Models Features Explicitly

Middleware not required… each resource/subresource is managed explicitly from the REST API (GET, PUT, DELETE)

But, how do you get changes into the API spec?

S3 Authentication

S3 requests are authenticated using credentials:
● Access Key ID (AWSAccessKeyID)
● Secret Access Key (AWSSecretKey)

Two signing algorithms today:
● AWS Signature V2: the Secret Access Key is used to sign the request string using HMAC-SHA1
● AWS Signature V4: the Secret Access Key is used to create a signing key (valid for 7 days)

Each S3 request passes an authorization header constructed using one of these algorithms.

Both are tedious to construct: let your client create the signature for you!

Swift3 middleware today only supports AWS Signature V2.
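For a sense of what the client does under the covers, a minimal Signature V2 sketch per the S3 REST authentication spec (this omits CanonicalizedAmzHeaders, which must be included whenever x-amz-* headers are present):

```python
import base64
import hashlib
import hmac

def s3_v2_auth_header(access_key, secret_key, verb, content_md5,
                      content_type, date, canonicalized_resource):
    # StringToSign = VERB \n Content-MD5 \n Content-Type \n Date \n CanonicalizedResource
    string_to_sign = '\n'.join([verb, content_md5, content_type, date,
                                canonicalized_resource])
    digest = hmac.new(secret_key.encode('utf-8'),
                      string_to_sign.encode('utf-8'), hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode('ascii')
    return 'AWS %s:%s' % (access_key, signature)

# e.g., for the earlier example request (placeholder credentials):
print(s3_v2_auth_header('s3key', 'xxxx', 'GET', '', '',
                        'Sat, 06 Feb 2016 19:25:22 +0000',
                        '/s3_test_bucket/object1'))
```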

S3 Lifecycle and Bucket Policies

Policy resources automate the management of object storage resources.

Lifecycle policies
● Expire aged objects or object versions
  ○ Example: automatically delete versions older than 90 days
● Transition objects to another storage class
  ○ Example: move objects from Standard to Glacier after 30 days
● Combining policies:
  ○ Example: move from Standard to Standard_IA to Glacier to Expired
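A hedged sketch of configuring such rules with boto (v2); the bucket name and rule details are placeholders, and exact keyword support varies with the boto release:

```python
import boto
from boto.s3.lifecycle import Expiration, Lifecycle, Transition

conn = boto.connect_s3('s3key', 'xxxx')        # credentials are placeholders
bucket = conn.get_bucket('s3_test_bucket')

lifecycle = Lifecycle()
# move objects under logs/ to Glacier after 30 days, expire them after 90
lifecycle.add_rule('archive-logs', prefix='logs/', status='Enabled',
                   expiration=Expiration(days=90),
                   transition=Transition(days=30, storage_class='GLACIER'))
bucket.configure_lifecycle(lifecycle)
```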

Bucket policies
● Another way to control access to bucket resources
  ○ Allow read-only access to an anonymous user
  ○ Require MFA for bucket resources
  ○ Restrict access to specific client IP addresses

S3 Access Control

● S3 ACLs manage access to buckets and objects
● Every bucket and object has an ACL subresource
  ○ If no ACL is specified on create, a default ACL is used giving the owner full control
● ACLs consist of Grants, each with a Grantee and a Permission
  ○ Up to 100 grants per ACL
● Grantee types:
  ○ User: user id, user email
  ○ Group: Authenticated Users, All Users, Log Delivery Group
    ■ Note that these are the only possible groups
● Permissions:
  ○ READ, WRITE
  ○ READ_ACP, WRITE_ACP
  ○ FULL_CONTROL
● Canned ACLs are predefined ACLs to simplify access control definition

S3 Access Control: Permissions

Permission | Granted on a Bucket | Granted on an Object
READ | Allows grantee to list objects in the bucket | Allows grantee to read object data and its metadata
WRITE | Allows grantee to create, overwrite, and delete any object in the bucket | Not applicable
READ_ACP | Allows grantee to read the bucket ACL | Allows grantee to read the object ACL
WRITE_ACP | Allows grantee to write an ACL for the applicable bucket | Allows grantee to write an ACL for the applicable object
FULL_CONTROL | Allows grantee READ, WRITE, READ_ACP, and WRITE_ACP permissions on the bucket | Allows grantee READ, READ_ACP, and WRITE_ACP permissions on the object

S3 Access Control: Example Default ACL

Single grant giving the owner full control:
  Owner: Owner-Canonical-User-ID (owner-display-name)
  Grant: CanonicalUser Owner-Canonical-User-ID (display-name) → FULL_CONTROL

S3 Access Control: Example ACL

ACL with 2 user grants and 1 group grant:
  Owner: Owner-canonical-user-ID (display-name)
  Grant: CanonicalUser Owner-canonical-user-ID (display-name) → FULL_CONTROL
  Grant: CanonicalUser user1-canonical-user-ID (display-name) → WRITE
  Grant: Group http://acs.amazonaws.com/groups/global/AllUsers → READ
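Rather than hand-building the grant XML, clients usually manage ACLs through their library; a hedged boto (v2) sketch (bucket/object names, credentials, and the email address are placeholders):

```python
import boto

conn = boto.connect_s3('s3key', 'xxxx')            # placeholder credentials
bucket = conn.get_bucket('s3_test_bucket')
key = bucket.get_key('object1')

key.set_acl('public-read')                          # apply a canned ACL
key.add_email_grant('READ', 'user@example.com')     # add a per-user grant

policy = key.get_acl()                              # read back the ACL subresource
for grant in policy.acl.grants:
    print(grant.permission, grant.type, grant.display_name)
```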

S3 Access Control: Canned ACLs

Canned ACL | Applies To | Permissions added to ACL
private | Bucket & Object | Owner gets FULL_CONTROL. No one else has access rights (default).
public-read | Bucket & Object | Owner gets FULL_CONTROL. The AllUsers group gets READ access.
public-read-write | Bucket & Object | Owner gets FULL_CONTROL. The AllUsers group gets READ and WRITE access. Granting this on a bucket is generally not recommended.
aws-exec-read | Bucket & Object | Owner gets FULL_CONTROL. Amazon EC2 gets READ access to GET an Amazon Machine Image (AMI) bundle from Amazon S3.
authenticated-read | Bucket & Object | Owner gets FULL_CONTROL. The AuthenticatedUsers group gets READ access.
bucket-owner-read | Object only** | Object owner gets FULL_CONTROL. Bucket owner gets READ access.
bucket-owner-full-control | Object only** | Both the object owner and the bucket owner get FULL_CONTROL over the object.
log-delivery-write | Bucket only | The LogDelivery group gets WRITE and READ_ACP permissions on the bucket.

** If you specify this canned ACL when creating a bucket, Amazon S3 ignores it.

S3 Access Control: Limitations

● Object PUTs reset the object ACL to the default (unless an ACL is specified in the PUT request)
● If you give another user WRITE access to a bucket you own, they will be the owner of any objects they create
  ○ You will not have READ access to those objects, and won't be able to see metadata like size
  ○ You still have WRITE access from the bucket ACL, so you can delete or overwrite them
● Caution: when granting WRITE access at the bucket level
  ○ There is no object-level WRITE access
  ○ With bucket WRITE access, I can create or delete objects that you created
● Caution: be especially careful giving bucket WRITE access to groups

S3 Object Versioning

● Versioning is enabled at the bucket level
● Objects in these buckets have a current object and 0 or more versions
● PUT creates a new instance that becomes the current object
● GET bucket?versions lists all object versions
● GET bucket?versions&prefix=myobject lists all versions of myobject
● DELETE inserts a "delete marker" but no objects are removed
● DELETE myobject?version=1001 permanently deletes an object version
● Undelete by deleting the marker: DELETE myobject?version=9876
● GET myobject?version=1001 to retrieve an older version
(Diagram: version stack for myobject: delete marker id=9876 on top of myobject id=1002 and myobject id=1001.)
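A hedged boto (v2) sketch of the flow above (names and credentials are placeholders):

```python
import boto

conn = boto.connect_s3('s3key', 'xxxx')              # placeholder credentials
bucket = conn.get_bucket('s3_test_bucket')
bucket.configure_versioning(True)                    # versioning is per bucket

bucket.new_key('myobject').set_contents_from_string('v1')
bucket.new_key('myobject').set_contents_from_string('v2')  # new current version

for v in bucket.list_versions(prefix='myobject'):    # GET bucket?versions&prefix=
    print(v.name, v.version_id)

bucket.delete_key('myobject')                        # inserts a delete marker only
```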

Validating the API
● ceph-s3 tests: open source compatibility tests for S3 clones
● Approximately 350 tests
● Swift3 v1.9 passes approx 75% of the tests
● https://github.com/ceph/s3-tests

Comparing Swift and S3 Features

Feature | Swift | S3

Access Control Lists | Container | Container & Object, plus policies
Quotas | Account & Container | No API support
Versioned Objects | Y (limited functionality) | Y
Expiring Objects | Y | Y (with lifecycle policies)
Automatic Storage Tiering | Y (based on storage backend) | Y (with lifecycle policies)
Storage Policies (placement, durability, etc.) | Y | No API support
Upload Multipart Large Objects | SLO & DLO | Y
Container Synchronization | Y | Y (cross region replication)
Notification Infrastructure | Future | SNS, SQS, AWS Lambda (cloud only)
Metadata Search | Future | Future?

Swift & S3 Summary

Swift | S3
100% open source with an active community that is steadily adding features | Closed source implementation (except Swift3)
Deployers and customers can influence API and features | API controlled by a single company
Documented ways that you can extend with middleware and diskfile changes | No documented extension points
Vendor extensions can address many of the management issues listed on the earlier Swift slide | No documented extension points
Large and growing support community | Limited options to support S3 on-premise deployments

Swift & S3 Summary

Swift | S3
API and middleware provide the feature set | Well defined API, with features explicitly modelled
(n/a) | More complete feature set: ACL and access control model, versioning support, notification service
On-premise deployment allows repatriating apps & data from the cloud | On-premise deployment allows repatriating apps & data from the cloud
Native Swift deployments are 100% compliant; API-only deployments may lack key features, especially middleware | On-premise vendors have different levels of compliance; each says "we support core features", but what are those?
Improving development ecosystem | Rich development ecosystem

Get Involved with S3 also!

● swift3: Amazon S3 emulation middleware
  ○ Bi-weekly meetings on IRC
  ○ S3 versioning
  ○ Lifecycle policies
  ○ Bucket policies
● ceph/s3-tests
  ○ Improve test coverage
  ○ Fix compliance bugs in Swift3

Outline: Introduction · File and Object Discussion · NFS · Object Storage Introduction · Swift · S3 · File to Object and Object to File Comparison · Discussion · Conclusion

Object Storage Challenge...

The world is not object today!

(and never will be completely…)

Multi-Protocol Access to the Same Dataset Can Provide Value (S3/Swift/NFS/SMB/POSIX/HDFS)

Using File to Access Objects: Primary Use Cases

1. Transition period
   ● Use the file API as a transition to the object API
2. Single management pane
   ● Manage file and object within a single system
3. Sharing data globally
   ● Create data via the file interface and share globally using the object interface
4. Analysis
   ● Many analysis tools are not a good match for object immutability semantics
5. Connecting NAS clients to object storage
   ● Home directories, shared storage from Linux clusters, etc.

File Access to Objects: NAS Gateways and Accessors

(Diagram: a NAS gateway with an optional disk cache, and accessor clients, talking Swift/S3 to the object store.)

GW and Accessor Caution:
● Can't control users
● How are users to know what works well and what doesn't?
● Scalability issues

Use Cases:
● Good for browsing files
● OK for migration into an object store
● OK for backup tools

File Access to Objects: Gateway and Accessor Vendors

Example File Accessor Vendors:
● Storage Made Easy
  ○ Sync-and-share
  ○ Direct integration with Windows Explorer, Mac Finder
  ○ Only Swift mobile access app
● Cloudberry
  ○ Windows-only object access
  ○ Separate application
  ○ Supports all clouds
  ○ Has backup apps as well…
● Cyberduck/Swift-explorer
  ○ Separate app for Mac, Windows, Linux with support for Swift, S3, etc.
  ○ Open-source
● Expandrive
  ○ Virtual USB drive that allows access to most cloud providers

Example NAS Gateway Vendors:
● Panzura
  ○ NAS front-end to cloud
  ○ Distributed caching and link to off-premise cloud (solution includes disks)
● Avere
  ○ NAS front-end to cloud
● Maldivica
  ○ NAS gateway
● Nasuni
  ○ NAS front-end to cloud
● Riverbed
  ○ Backup of branch offices
● Ctera
  ○ Consolidation of branch offices

File and Object Access: Integrated Solutions

● Several solutions exist that offer File and Object in a single solution
● Object solutions with an integrated NAS gateway
  ○ Object storage solution that directly integrates a NAS gateway
  ○ Same advantages and disadvantages as with NAS gateways
  ○ This is offered by almost every object storage vendor
● Full integration of File and Object support
  ○ NAS support is just as good as a native NAS storage solution
  ○ Object support is just as good as a native object storage solution
  ○ This can include separate or the same datasets
  ○ Examples include IBM Spectrum Scale (GPFS) and Red Hat GlusterFS

File and Object Access to the Same Data: What Should It Look Like?

● Research challenge: the dream of full simultaneous access
  ○ How to achieve a unified user namespace?
  ○ Possible to achieve behavior similar to NFSv4+SMB3?
● Should File see file semantics, and Object see object semantics?
  ○ For workflows, this works quite well
    ■ E.g., ingest through file, read through object
    ■ E.g., ingest through object, analyze and update, read results through object
● It's all semantics
  ○ Eventual semantics vs. file semantics
    ■ Objects are allowed to just disappear...how would File deal with that?
  ○ Buckets/containers are supposed to scale without limit...but directories typically do not
  ○ Objects do not respect locks, but how does this fit with file?
    ■ Should object protocols wait on a lock? How would Object deal with the delay?
  ○ How in sync do the namespaces need to be?
  ○ Across sites, maintaining strong file semantics is a challenge…
  ○ Separate security, e.g., ACLs, authentication servers, interpreting LDAP/AD users
● Do we need a new set of semantics?

A Way Forward: Swift-On-File

● A OpenStack Swift Per-Bucket/Container Storage Policy ● Stores objects on any cluster/parallel file system ● Objects created using Object API can be accessed as files and vice-versa ○ Newly created files immediately accessible via Swift/S3 ○ Newly created objects are immediately available for editing ● Challenges it overcomes ○ Harden object visibility semantics to ensure read after write ■ Object namespace eventually consistent ■ Object data is strongly consistent ■ Common LDAP/AD user database for both file and object ○ Maintaining both file attributes on new Object PUT ○ Currently working on further integrating ACLs, metadata and xattrs, etc ● Leverages File System data protection ○ Part of IBM Spectrum Scale 4.2 and experimental with Redhat GlusterFS

● Swift code available at https://github.com/openstack/swiftonfile

133 Co-Existence of Traditional and Swift-On-File

[Diagram: a single Swift proxy tier serving two storage policies, each with its own object ring]
● Traditional Swift storage policy (Ring 1), object storage path:
  -rwxr-xr-x 1 swift swift 29 Aug 22 09:25 /mnt/sdb1/2/node/sdb2/objects/981/f79/d7b843bf79/1401254393.89313.data
● Swift-on-File storage policy (Ring 2, Spectrum Scale file system), storage path:
  -rwxr-xr-x 1 swift swift 29 Aug 22 09:25 /mnt/fs/container/object1

134 File in Object: http://swift.example.com/v1/acct/cont/obj

Object in File: /mnt/fs/acct/cont/obj
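The two views above are related by a purely mechanical translation. A minimal sketch, assuming the /v1/account/container/object URL layout shown here and a hypothetical /mnt/fs mount point (a real deployment derives the on-disk location from its ring and policy configuration):

```python
from urllib.parse import urlparse

def object_url_to_file_path(object_url, mount_point="/mnt/fs"):
    # "/v1/acct/cont/obj" splits into ('', 'v1', 'acct', 'cont', 'obj');
    # maxsplit=4 keeps slashes inside the object name intact.
    path = urlparse(object_url).path
    _, _version, account, container, obj = path.split("/", 4)
    return f"{mount_point}/{account}/{container}/{obj}"

print(object_url_to_file_path("http://swift.example.com/v1/acct/cont/obj"))
# -> /mnt/fs/acct/cont/obj
```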

135 Analytics for File and Object

▪ Analytics on File is well established
▪ Is Object storage storing Big Data or Dead Data?
▪ If data cannot be analyzed, might as well use tape
  – Tape is still much cheaper
▪ Running directly through the Swift/S3 API limits functionality
  – Hive and HBase (among others) lack efficient support due to the file 'append' requirement
  – Plus many more...

[Diagram: obstacles to running analytics directly on object storage]
● HTTP slower than RPC
● Load imbalance due to inefficient data distribution
● Large data movement on name changes
● Multiple network hops when writing data
● Loss of data locality

136 Analytic Possibilities on Object Storage - No Single Solution

1. Use the object storage solution's HDFS API
   – Mileage will vary
   – Performance results are specific to the analytics framework
2. Spark (see the sketch below)
   – Targeted towards in-memory analytics
   – Lower demands on storage, depending on the application
3. Analytics tool + Tachyon
   – Tachyon creates an in-memory distributed storage system
     ■ Not yet ready for production...
   – Can lower demands on the storage solution
4. Use a File + Object solution
   – Realize native file performance
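To make options 1 and 2 concrete, here is a hedged PySpark sketch that reads objects from an S3-compatible store through Hadoop's s3a connector; the endpoint, credentials, and bucket name are placeholders, and real deployments need connector tuning:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("object-store-analytics")
         .config("spark.hadoop.fs.s3a.endpoint", "http://s3.example.com:8080")
         .config("spark.hadoop.fs.s3a.access.key", "AWSACCESSKEYID")
         .config("spark.hadoop.fs.s3a.secret.key", "AWSSECRETKEY")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# Objects are read whole and results are written as new objects;
# there is no append, which is exactly the Hive/HBase pain point above.
logs = spark.read.text("s3a://analytics-bucket/logs/")
errors = logs.filter(logs.value.contains("ERROR"))
errors.write.mode("overwrite").text("s3a://analytics-bucket/results/")
```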

137 Outline Introduction File and Object Discussion NFS Object Storage Introduction Swift S3 File to Object and Object to File Comparison Discussion

Conclusion

138 Between File and Object...

So are NFSv4, S3, and Swift really all needed?

139 Gross Generalization of Target Workloads

Object
● Backup (write mostly)
● Immutable object storage
  ○ Backup => write mostly
  ○ Distribution/streaming => read mostly
● Archive (write mostly)
  ○ Rarely accessed data, but when needed, it must be retrieved quickly

File
● It can do object workloads and much more...
● User data and home directories
● Applications with small to medium performance and scalability requirements
● Analytics

***Note that this is what Object is today, not necessarily where it will be tomorrow
***Note that NFS (without pNFS) is still not ideal for scientific applications that require high-throughput data access from medium to large compute clusters

140 Applications

Object
● Converse in whole objects (contrasted in the sketch below)
  ○ Simple API that doesn't have complicated concepts like hard links, crash consistency operations, etc
● Many short-lived TCP connections
  ○ Adds latency but increases parallelism
● Must tolerate eventual consistency
  ○ Must be willing to retry
  ○ Objects could temporarily disappear...
  ○ But highly available...
● Simple hierarchy makes objects hard to find
  ○ Many vendors disable even listing containers/buckets
  ○ Many apps keep a separate database
● Must tolerate low bandwidth/high latency
  ○ This is today, so could change in the future

File
● Converse in bytes, files, inodes, file descriptors
  ○ Complicated yet now familiar operations
● Single long-lived TCP connection
  ○ It's a benefit, but one TCP connection is not good in a WAN
● Stronger consistency, but that makes it confusing
● Must be aware of scaling issues
  ○ E.g., too many objects in a single directory
● Data sharing has shortcomings
  ○ Locking is typically only advisory and creates delays during failure (due to state)
● High performance, but NFS has inherent load imbalances without pNFS
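The first row of this comparison, whole objects versus bytes, is the crux for application writers. A minimal sketch of the same three-byte edit done both ways; the URL, token, and paths are hypothetical:

```python
import os
import requests

TOKEN = {"X-Auth-Token": "AUTH_tk_hypothetical"}
OBJ_URL = "http://swift.example.com/v1/acct/cont/config.json"

# Object: no byte-range writes, so read the whole object,
# modify it in memory, and PUT the whole thing back.
body = requests.get(OBJ_URL, headers=TOKEN).content
requests.put(OBJ_URL, headers=TOKEN, data=body.replace(b"old", b"new"))

# File: overwrite 3 bytes in place at offset 100 via the POSIX API.
fd = os.open("/mnt/fs/cont/config.json", os.O_WRONLY)
os.pwrite(fd, b"new", 100)
os.close(fd)
```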

141 Ease of Access

Object
▪ Access data from anywhere on the globe
▪ Very thin client with no optimizations
▪ Mobile integration
  – iPhone includes an S3 client
▪ More and more applications supporting native object access
▪ To ease user transition, several startups have file-based viewers for Mac/Windows/Linux
  – Storage Made Easy, Cloudberry, Cyberduck, etc
▪ Several S3/Swift mobile apps exist as well
  – Storage Made Easy among many others
▪ Use 'curl' and build your own HTTP request (see the sketch below)

File
▪ NFS clients available in all OSs for laptops, desktops, servers
  – But not for mobile devices
▪ Most applications today natively support POSIX
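The "very thin client" claim is easy to demonstrate: plain HTTP is the whole client. A sketch against a hypothetical Swift endpoint using the simple TempAuth scheme (production clusters more commonly use Keystone, but the request shape is similar):

```python
import requests

# One GET returns a token and the account's storage URL.
resp = requests.get("http://swift.example.com/auth/v1.0",
                    headers={"X-Auth-User": "acct:user",
                             "X-Auth-Key": "secret"})
token = resp.headers["X-Auth-Token"]
storage_url = resp.headers["X-Storage-Url"]

# Any HTTP-capable device can now list the account's containers.
print(requests.get(storage_url, headers={"X-Auth-Token": token}).text)
```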

142 Data Protection - What Can Go Wrong...

[Diagram: failure modes a storage system must survive]
● Disk failure/corruption
● Server failure
● Rack failure
● Data center failure
● Coordinated H/W failures
● Storage software bugs
● Data transfer corruption between storage client and storage device
● Accidental user error

143 Data Protection

Object
● Object vendors writing SW from scratch
  ○ Very new
  ○ Support 3-way replication and erasure coding
● Object vendors currently focused on "being" the backup, not backing up its data
  ○ Little attention to backup
  ○ More focus on DR support
● Beware the "snake oil" salesman
  ○ Triplication and erasure coding do not prevent data loss
  ○ No ability to capture the entire dataset

File
● NAS vendors support a wide variety of storage systems
  ○ Software based
  ○ Controller based with specialized H/W
  ○ Controller based with commodity H/W
● Backup and DR support widely available
● Snapshots widely available
● Versioning

144 Security

Object
● Typically provide multi-tenancy at the level of authentication of users
  ○ Few if any provide data isolation
● No client software required
● Encryption becoming more common
● Each protocol has its own ACL format and granularity
● HTTP-based token mechanisms work nicely for web and global access
● Privacy through HTTPS

File
● Variety of authentication mechanisms
  ○ Kerberos now standard, and supports multi-tenancy, but requires client-side support
● Typically used in LAN, but can work in WAN
● Rich ACL format
● Data transfer encryption supported
● True multi-tenancy (network and data isolation) available from some vendors

145 Cost and Features

Object
● Current solutions are sold as
  ○ SW-only
  ○ SW + commodity H/W
● Currently priced low (relative to what the market will bear)
  ○ OpenStack Swift is *free* (minus blood, sweat, and tears)
● Typically simply storing data at this point
  ○ Analytics support mostly in name only
● Relatively easy to manage
  ○ Only applies to the supported vendor solution
  ○ Note this correlates with fewer features

File
● Cost can vary widely
  ○ Roll your own
  ○ SW-only
  ○ SW + commodity H/W
  ○ SW + specialized H/W
● Many have tape support
● Viable analytics support
● Enterprise vendors support multi-protocol access
● Block-storage support for VMs
  ○ Can support the entire OpenStack storage ecosystem
  ○ VMware, Hyper-V

146 Each Protocol Has Purpose and Real Value

[Venn diagram: NFS, S3, and Swift as three overlapping circles]

147 Unique to File

[Venn diagram: NFS, S3, and Swift circles with the NFS-only region highlighted]

Require POSIX?

● Proprietary applications
● In-place updates
● File append
● Locking (see the sketch below)
● Strong consistency
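Two of these, append and advisory locking, are everyday POSIX idioms with no direct object-protocol equivalent. A minimal sketch (the path is hypothetical; flock coordinates only cooperating processes):

```python
import fcntl
import os

fd = os.open("/mnt/fs/shared/journal.log",
             os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
fcntl.flock(fd, fcntl.LOCK_EX)      # advisory exclusive lock
os.write(fd, b"appended record\n")  # lands after the existing bytes
fcntl.flock(fd, fcntl.LOCK_UN)
os.close(fd)
```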

148 Unique to Object

[Venn diagram: NFS, S3, and Swift circles with the object-only region highlighted]

Require Mobile or Cloud?

● Smartphone/tablet access
● Cloud-friendly security
● Cloud-friendly tools

149 The Overlap...today

[Venn diagram: NFS, S3, and Swift circles with the overlapping region highlighted]

● Chances are you have applications that fit in the middle as well
● Today, stark differences exist between vendors, so the choice is relatively easy
● Object vendors by and large have lower cost/capacity
  ○ Targeting the backup/archive market
● NAS vendors by and large have higher performance and are feature-rich

150 The Overlap...tomorrow

[Venn diagram: NFS, S3, and Swift circles with a larger overlapping region]

● Remember that NFS/Swift/S3 are simply protocols to access data
  ○ Nothing in Swift or S3 inherently limits performance or future features
  ○ Most enterprise and advanced features are independent of the protocol
● Object vendors are busy working their way up the application chain
● Even in-place updates can be mitigated to some degree
  ○ Many videos are stored frame by frame, with each frame updated in its entirety
  ○ With small files, updating the entire file isn't a big deal
    ■ E.g., IoT
● With better integration, maybe you won't have to decide :)

151 Metadata Search

▪ It is hard to find data in both File and Object
  – A key issue with Object's flat namespace is finding data
  – Even File can become difficult with billions of files
▪ Scalable search is becoming required to realize the value of data
  – Find needles in unstructured haystacks
▪ TAG IT: Goal is to dynamically index objects/files
  – Create structure from well-known system and user attributes
  – Tags and attributes automatically added to a database
▪ FIND IT: Useful for both users and administrators
  – Users search based upon their tags
  – Administrators search based upon system attributes
    • E.g., account_last_activity_time, container_read_permissions, object_content_type
▪ REST-based search API
▪ IBM has built an open-source solution with OpenStack Swift using RabbitMQ and ElasticSearch (see the sketch below)

General Use Cases
▪ Data Mining
▪ Data Warehousing
▪ Selective data retrieval, data backup, data archival, data migration
▪ Management/Reporting
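A rough sketch of the tag-and-find flow using the elasticsearch Python client; the index name, document layout, and values are invented for illustration and are not the schema of the IBM solution mentioned above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es.example.com:9200"])

# TAG IT: index system and user attributes for one object.
es.index(index="swift-metadata", body={
    "account": "acct",
    "container": "cont",
    "object": "sensor-42.json",
    "object_content_type": "application/json",
    "account_last_activity_time": "2016-02-20T09:25:00",
    "user_tags": ["iot", "temperature"],
})

# FIND IT: users search their tags; admins query system attributes.
hits = es.search(index="swift-metadata",
                 body={"query": {"term": {"user_tags": "iot"}}})
print(hits["hits"]["total"])
```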

152 File vs. Object Summary

● So it's not cut and dried
● File is very mature, but can be complicated
● Object is very immature, but all disruptive technologies are…

The real question is how much of the NAS pie will Object eat?

153 Outline Introduction File and Object Discussion NFS Object Storage Introduction Swift S3 File to Object and Object to File Comparison Discussion

Conclusion

154 struct CLOSE4args { /* CURRENT_FH: object */ seqid4 seqid; stateid4 open_stateid; };

● Whew...that was a lot of info
  ○ The 5 Ws of File and Object
  ○ NFS, Swift, S3
  ○ Industry File and Object solutions
● There are few easy decisions…
  ○ There are some now, but it's getting harder as object vendors mature
● NFS
  ○ A long history...but let's work together to advance the technology
  ○ Check out NFSv4.2 and help make it the new default!
● Swift/S3 on-premise are still emerging as standards
  ○ Object access will become an essential data access mechanism for ALL data
● Get involved!
  ○ Swift and NFS have active open-source communities

155 More Information
NFSv4 IETF working group
▪ https://datatracker.ietf.org/wg/nfsv4
NFSv4 RFC
▪ http://www.ietf.org/rfc/rfc3530.txt
NFSv4.1 RFC
▪ http://www.ietf.org/rfc/rfc5661.txt
NFSv4.2 RFC Draft
▪ https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-41
Ganesha
▪ http://nfs-ganesha.sourceforge.net
SNIA white papers & tutorials on NFS
▪ https://www.brighttalk.com/search?duration=0..&keywords[]=nfs&q=snia&rank=webcast_relevance
▪ http://www.snia.org/sites/default/files/SNIA_An_Overview_of_NFSv4-3_0.pdf
▪ http://www.snia.org/sites/default/files/Migrating_to_NFSv4_v04_-Final.pdf
▪ http://www.snia.org/sites/default/files/ChuckLever_Introducing_FedFS_On_Linux.pdf
Original pNFS paper - "Exporting Storage Systems in a Scalable Manner with pNFS", MSST'05
▪ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.3177&rep=rep1&type=pdf
NFS XATTR Draft
▪ https://tools.ietf.org/html/draft-naik-nfsv4-xattrs-02

156 More Information
NFS FAQ
▪ http://nfs.sourceforge.net/
Virtual Machine Workloads: The Case for New Benchmarks for NAS, FAST'13
▪ https://www.usenix.org/system/files/conference/fast13/fast13-final84.pdf
Newer Is Sometimes Better: An Evaluation of NFSv4.1, SIGMETRICS'15
▪ https://www.fsl.cs.sunysb.edu/docs/nfs4perf/nfs4perf-sigm15.pdf
All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI'14
▪ http://research.cs.wisc.edu/wind/Publications/alice-osdi14.pdf
Boosting the Power of Swift Using Metadata Search
▪ https://www.youtube.com/watch?v=_bODZWvIprY
From Archive to Insight: Debunking Myths of Analytics on Object Stores
▪ https://www.youtube.com/watch?v=brhEUptD3JQ
Swift 101: Technology and Architecture for Beginners
▪ https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/swift-101-technology-and-architecture-for-beginners
Building Applications with Swift: The Swift Developer On-Ramp
▪ https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/building-applications-with-swift-the-swift-developer-on-ramp

157 More Information
Building web applications using OpenStack Swift
▪ https://www.openstack.org/summit/tokyo-2015/videos/presentation/building-web-applications-using-openstack-swift

SwiftOnFile Project
▪ https://github.com/openstack/swiftonfile

Swift3 Project
▪ https://github.com/openstack/swift3
ceph/s3-tests Project
▪ https://github.com/ceph/s3-tests

158 BACKUP

159 What is Object Storage?

[Diagram: defining attributes of object storage]
● Simple APIs and semantics (Swift/S3)
● Scalable metadata access
● Ubiquitous access
● Multi-tenancy
● Whole-file access and updates
● Simpler and flatter namespace
● Simpler management
● Multi-site
● Scalable and highly-available storage
● Versioning
● Cloud

160 Data Protection In The Context of What Can Actually Go Wrong…(and not what is only likely to go wrong)

Disk Failure/Corruption
  – Object: Per-object auditing common (low coverage); erasure coding
  – File: Per-file or per-block auditing is vendor specific (typically high coverage)
Server Failure
  – Object: Erasure coding or triplication
  – File: High-end supports erasure coding; low-end has no support
Rack Failure
  – Object: Erasure coding or triplication
  – File: High-end supports erasure coding
Data Center Failure
  – Object: Erasure coding or replication (scalability can be a concern...)
  – File: High-end supports replication at the file or block level
User Error
  – Object: Per-object versioning (S3 supports undelete)
  – File: Snapshots (dataset consistent); backup
Data Transfer Corruption
  – Object: End-to-end checksums vendor specific
  – File: End-to-end checksums vendor specific; backup
Storage Software Bugs
  – Object: Typically lack scalable backup
  – File: Backup
Coordinated H/W Failures
  – Object: Typically lack scalable backup
  – File: Backup

161 File and Object Security Comparison

Authentication
  – Object: Standard APIs, both standard and custom implementations; designed for global access; userid/password or certificate; many support an enterprise directory service (LDAP/AD)
  – File: Standard (Kerberos); typically not globally accessible; userid/password or certificate; many support an enterprise directory service (LDAP/AD)
Authorization
  – Object: ACLs (of varying granularity)
  – File: NFSv4 and POSIX ACLs
Data Privacy
  – Object: HTTPS
  – File: Kerberos
Multi-Tenancy
  – Object: Typically software-based separation; shared servers and storage for everyone
  – File: Typically software-based separation; high-end vendors can provide physical separation as well

162 S3 Authentication Signing V2 (Backup)

Credentials: Access Key ID (AWSAccessKeyID), Secret Access Key (AWSSecretKey)

StringToSign = HTTP-Verb + "\n" +
               Content-MD5 + "\n" +
               Content-Type + "\n" +
               Date + "\n" +
               CanonicalizedAmzHeaders +
               CanonicalizedResource

signature = Base64( HMAC-SHA1( AWSSecretKey, UTF-8-Encoding-Of( StringToSign )))

Authorization Header

-H 'Authorization: AWS awsaccesskeyid:signature'
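For reference, a direct Python rendering of the V2 recipe above, using the placeholder credentials from the slide:

```python
import base64
import hmac
from hashlib import sha1

def sign_v2(secret_key, verb, content_md5, content_type, date,
            canonicalized_amz_headers, canonicalized_resource):
    # StringToSign assembled exactly as specified above.
    string_to_sign = (verb + "\n" + content_md5 + "\n" + content_type
                      + "\n" + date + "\n" + canonicalized_amz_headers
                      + canonicalized_resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"), sha1).digest()
    return base64.b64encode(digest).decode("ascii")

sig = sign_v2("AWSSecretKey", "GET", "", "",
              "Sat, 20 Feb 2016 12:00:00 GMT", "", "/cont/obj")
print("Authorization: AWS AWSAccessKeyID:" + sig)
```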

http://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html

163 S3 Authentication Signature V4 (backup)

Credentials: Access Key ID (AWSAccessKeyID), Secret Access Key (AWSSecretKey)

Authorization Header

-H 'Authorization: AWS4-HMAC-SHA256 Credential=awsaccesskeyid/20160220/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-date, Signature=signature'
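Unlike V2, V4 never applies the secret key directly: a signing key is derived by chaining HMAC-SHA256 over the credential scope (date/region/service) shown in the header above. A sketch of just that derivation; building the canonical request and string-to-sign is omitted here:

```python
import hmac
from hashlib import sha256

def _hmac(key, msg):
    return hmac.new(key, msg.encode("utf-8"), sha256).digest()

def v4_signing_key(secret_key, date, region="us-east-1", service="s3"):
    # Scope: <date>/<region>/<service>/aws4_request
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

# Final signature = hex(HMAC-SHA256(signing_key, string_to_sign)).
print(v4_signing_key("AWSSecretKey", "20160220").hex())
```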

164 http://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-header-based-auth.html