CSE 6350 File and Storage System Infrastructure in Data centers Supporting Internet-wide Services

Ceph: A Scalable, High-Performance Distributed File System
Presenter: Zhichao Yan

What is Ceph designed for?

• For HPC application workloads
  1. tens or hundreds of thousands of hosts concurrently reading from or writing to the same file, or creating files in the same directory
  2. workloads are inherently dynamic, with significant variation in data and metadata access as active applications and data sets change over time
• Different from the workload GFS assumes: in GFS, files grow by appending new data rather than overwriting existing data, random writes within a file are practically non-existent, and once written, files are only read, and often only sequentially

What is Ceph designed for?

• Targets: performance, scalability, reliability
• Performance: dynamic workloads
• Scalability: storage capacity, throughput, client performance
• Reliability: failures are the norm rather than the exception

What is Ceph designed for?

• Main approaches
  1. decoupled data and metadata
  2. dynamic distributed metadata management
  3. reliable autonomic distributed object storage (RADOS)

A brief history of Ceph

• Ceph was initially created by Sage Weil for his doctoral dissertation.
• After his graduation in fall 2007, Sage continued this work.
• On March 19, 2010, the Ceph client was merged into Linux kernel version 2.6.34, which was released on May 16, 2010.
• In 2012, Weil created Inktank Storage to provide professional services and support for Ceph.
• In April 2014, Red Hat purchased Inktank.
• On April 21, 2016, the Ceph development team released “Jewel”, the first Ceph release in which CephFS is considered stable.

(1) “OSDs replace the traditional block-level interface with one in which clients can read or write byte ranges to much larger (and often variably sized) named objects, distributing low-level block allocation decisions to the devices themselves.” What are the major differences between an OSD (Object Storage Device) and a conventional hard disk?

Conventional hard disks are replaced with intelligent object storage devices (OSDs), which combine a CPU, a network interface, and a local cache with an underlying disk or RAID.

OSDs have computation capability that can be leveraged for storage management.

A conventional hard disk is a dumb device that cannot reorganize the data it stores or perform filtering operations.

An object can encapsulate the data itself, some metadata, and defined storage management methods that manage the object automatically.
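To make the interface difference concrete, here is a minimal sketch (the type and function names are illustrative assumptions, not Ceph's actual OSD API) contrasting a block-level interface with an object interface in which clients read and write byte ranges of named, variably sized objects:

```c
#include <stddef.h>
#include <stdint.h>

/* Block-level interface: the host addresses fixed-size blocks by logical
 * block address; the device knows nothing about files or objects.        */
int block_read(uint64_t lba, void *buf, size_t nblocks);
int block_write(uint64_t lba, const void *buf, size_t nblocks);

/* Object interface (hypothetical names): the client names an object and
 * reads/writes an arbitrary byte range; the OSD itself decides how the
 * object's bytes are laid out on its local disk or RAID, and can attach
 * per-object metadata and management policy.                              */
int osd_read(const char *obj_name, uint64_t offset, void *buf, size_t len);
int osd_write(const char *obj_name, uint64_t offset, const void *buf, size_t len);
```

The point of the contrast is that low-level block allocation moves from the file system into the device, which is what lets an OSD take on storage-management work.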

Ceph Overview

• Client: exposes a near-POSIX file system interface to a host
• OSD cluster: collectively stores all data and metadata
• Metadata server (MDS) cluster: manages the namespace (file names and directories) while coordinating security, consistency, and coherence

(2) “Ceph decouples data and metadata operations by eliminating file allocation tables and replacing them with generating functions. This allows Ceph to leverage the intelligence present in OSDs to distribute the complexity surrounding data access, update serialization, replication and reliability, failure detection, and recovery.” Does GFS have a file allocation table? Who is responsible for managing “data access, update serialization, replication and reliability, failure detection, and recovery” in GFS?

Yes. GFS has a file allocation table, maintained by the GFS master. In GFS, both the master and the chunkservers are involved in these operations.

Main Features

• Decoupled data and metadata
  Files striped onto predictably named objects
  CRUSH maps objects to storage devices
• Dynamic distributed metadata management
  Dynamic subtree partitioning
  Distributes metadata amongst MDSs
• Reliable autonomic distributed object storage (RADOS)
  OSDs handle migration, replication, failure detection, and recovery

(3) “Ceph directly addresses the issue of scalability while simultaneously achieving high performance, reliability and availability through three fundamental design features: …” What are Ceph’s design features? Compare Figure 1 with “Figure 1: GFS Architecture” in the GFS paper, read Section 2, and indicate the fundamental differences between them.

Fundamental differences (GFS vs Ceph):

One master vs a metadata server cluster

Traditional block-based file storage vs object-based storage (OSDs)

File allocation table vs a hashing/placement function (CRUSH)

Fixed chunk size vs variably sized objects

Fixed (append-dominated) workload vs dynamic workloads

Client Operations

1. Client sends an open request to the MDS
2. MDS returns the capability, file inode, file size, and striping information (mapping the file into a list of objects)
3. Client reads/writes directly from/to the OSDs (based on object IDs)
4. MDS manages the capability
5. Client sends a close request, relinquishes the capability, and provides details to the MDS
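A minimal sketch of this flow from the client's side, assuming hypothetical helpers (mds_open, osd_read_object, and mds_close are illustrative names, not Ceph's real client API); the key point is that the MDS is consulted only for metadata and capabilities, while the data path goes straight to the OSDs:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MDS reply: capability plus the information needed to map the
 * file onto objects (inode number, current size, striping layout).         */
typedef struct {
    uint64_t ino;          /* file inode number                        */
    uint64_t size;         /* current file size                        */
    uint32_t stripe_size;  /* bytes per object in the striping layout  */
    uint32_t caps;         /* granted capability bits (read/cache/...) */
} mds_open_reply_t;

/* Illustrative stubs standing in for MDS/OSD messages. */
mds_open_reply_t mds_open(const char *path, int flags);
size_t osd_read_object(uint64_t ino, uint64_t stripe_no,
                       uint64_t obj_off, void *buf, size_t len);
void mds_close(const char *path, uint64_t new_size);

/* Read one byte range: the client computes which object holds the data and
 * talks to the OSD directly; the MDS never sits on the data path.          */
size_t ceph_like_read(const char *path, uint64_t off, void *buf, size_t len)
{
    mds_open_reply_t r = mds_open(path, 0);            /* steps 1-2 */
    uint64_t stripe_no = off / r.stripe_size;
    uint64_t obj_off   = off % r.stripe_size;
    size_t n = osd_read_object(r.ino, stripe_no, obj_off, buf, len); /* step 3 */
    mds_close(path, r.size);                            /* step 5: drop capability */
    return n;
}
```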

File Mapping Mechanism

(4) “To avoid any need for file allocation metadata, object names simply combine the file inode number and the stripe number.” Does GFS maintain file allocation metadata? Briefly explain how this can be removed.

Yes, GFS maintains file allocation metadata in the GFS master, which records where the corresponding chunks are stored.

Ceph gets the inode number, file size, and striping information from the MDS; these map the file into several objects, and CRUSH then maps each object to its OSDs.
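The following toy sketch shows why no allocation table is needed: the object name is computed from the inode and stripe numbers alone, and placement is computed by a function. Here a simple hash stands in for CRUSH (which is far more sophisticated, handling replication and cluster-map changes), and the object-name format and cluster sizes are assumptions for illustration only:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_PGS  128   /* placement groups (assumed count)  */
#define NUM_OSDS 16    /* OSDs in the toy cluster (assumed) */

/* Object name = inode number + stripe number: nothing to look up. */
static void object_name(uint64_t ino, uint64_t stripe_no, char *out, size_t len)
{
    snprintf(out, len, "%llx.%08llx",
             (unsigned long long)ino, (unsigned long long)stripe_no);
}

/* Toy stand-in for the two-step placement: object -> placement group (hash),
 * then placement group -> OSD. Real Ceph uses CRUSH for the second step so
 * that placement adapts to cluster changes without any central table.      */
static uint32_t place_object(uint64_t ino, uint64_t stripe_no)
{
    uint64_t h  = (ino * 2654435761u) ^ (stripe_no * 40503u); /* simple hash */
    uint32_t pg = (uint32_t)(h % NUM_PGS);
    return pg % NUM_OSDS;   /* CRUSH would map the PG to a list of OSDs */
}

int main(void)
{
    char name[32];
    uint64_t ino = 0x10000001234ULL;
    for (uint64_t s = 0; s < 3; s++) {            /* first three stripes */
        object_name(ino, s, name, sizeof name);
        printf("object %s -> osd.%u\n", name, place_object(ino, s));
    }
    return 0;
}
```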

Synchronization

• Adheres to POSIX
  POSIX: reads reflect any data previously written, and writes are atomic
• Includes HPC-oriented extensions
  – Consistency / correctness by default
  – Optionally relax constraints via extensions
  – Extensions for both data and metadata
• Synchronous I/O is used when there are multiple writers or a mix of readers and writers

(5) “POSIX semantics sensibly require … ” “Most notably, these include an O_LAZY flag for open …” What is the POSIX-defined consistency? How does the flag relax the consistency?

POSIX: reads reflect any data previously written, writes are atomic

A set of HPC extensions to the POSIX I/O interface has been proposed by the HPC community to relax the POSIX-defined consistency.

Examples:
1. allowing multiple concurrent write operations to different parts of the same file (data)
2. letting the client cache metadata: if a readdir is immediately followed by one or more stats, the cached readdir metadata is returned (metadata)

The O_LAZY flag is passed to open to allow the client to buffer writes or cache reads, instead of doing synchronous I/O, for a file open for shared writing. Applications can then synchronize explicitly with two calls:
1. lazyio_propagate: flush a given byte range to the object store
2. lazyio_synchronize: ensure the effects of previous propagations are reflected in any subsequent reads
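A hedged sketch of how an application might use this. O_LAZY, lazyio_propagate, and lazyio_synchronize are the names the paper mentions, but the flag value and function signatures below are assumptions, and the bodies are local stand-ins only so the sketch compiles; a file system implementing the extensions would supply the real ones:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_LAZY
#define O_LAZY 0   /* assumed flag; defined as 0 here so the sketch still builds */
#endif

/* Local stand-ins for the proposed HPC extensions (signatures assumed). */
static int lazyio_propagate(int fd, off_t off, size_t len)
{ (void)off; (void)len; return fsync(fd); }   /* placeholder: flush everything   */
static int lazyio_synchronize(int fd, off_t off, size_t len)
{ (void)fd; (void)off; (void)len; return 0; } /* placeholder: nothing to re-read */

int main(void)
{
    /* Many clients write disjoint parts of this shared file; O_LAZY lets each
     * one buffer its own writes instead of forcing synchronous I/O.           */
    int fd = open("shared.dat", O_RDWR | O_CREAT | O_LAZY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char chunk[4096] = {0};
    pwrite(fd, chunk, sizeof chunk, 0);        /* may stay in the client cache  */

    lazyio_propagate(fd, 0, sizeof chunk);     /* make my range visible to all  */
    lazyio_synchronize(fd, 0, sizeof chunk);   /* see ranges others propagated  */

    char readback[4096];
    pread(fd, readback, sizeof readback, 0);
    close(fd);
    return 0;
}
```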

Q&A

Thanks!