COSC 6397 Big Data Analytics

Distributed File Systems

Edgar Gabriel Spring 2015

What is a file system?

• A clearly defined method that the OS uses to store, catalog and retrieve files
• Manages the bits that make up the file itself, plus its metadata
• Metadata: “data about data”, e.g.
  – where data is logically placed on the hard drive
  – file name
  – organizational hierarchies (i.e. directories)
  – last modification date
  – permissions (read, write, execute, etc.)

File Model - overview

• A file is a sequence of bytes
• When a program opens a file, the file system establishes a file pointer. The file pointer is an integer indicating the position in the file where the next byte will be written/read.
• Disk drives read and write data in fixed-size units (disk sectors)
• File systems allocate space in blocks; a block is a fixed number of contiguous disk sectors.
• In UNIX-based file systems, the blocks that hold data are listed in an inode. An inode contains the information needed to find all the blocks that belong to a file.
• If a file is too large for an inode to hold the whole list of blocks, intermediate nodes (indirect blocks) are introduced.
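The mapping from a file pointer to a block can be sketched as follows. This is an illustration only: the block size, the number of direct pointers, and the pointers per indirect block are assumed values, and `locate` is a hypothetical helper rather than a real file system call.

```python
BLOCK_SIZE = 4096        # bytes per block (assumed value)
NUM_DIRECT = 12          # direct block pointers stored in the inode (assumed)
PTRS_PER_BLOCK = 1024    # pointers held by one indirect block (assumed)


def locate(offset):
    """Map a byte offset (the file pointer) to a block slot.

    Returns (kind, index, offset_in_block), where kind is "direct" or
    "indirect" and index is the slot within that pointer list.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    block_no, within = divmod(offset, BLOCK_SIZE)
    if block_no < NUM_DIRECT:
        # The inode itself lists this block.
        return ("direct", block_no, within)
    block_no -= NUM_DIRECT
    if block_no < PTRS_PER_BLOCK:
        # The block is listed in the single indirect block.
        return ("indirect", block_no, within)
    raise ValueError("offset beyond the single-indirect range of this sketch")
```

Real inodes typically add double and triple indirect blocks; the sketch stops at one level to keep the idea visible.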

Write operations

• Write:
  – the file system copies bytes from the user buffer into the system buffer
  – once the buffer fills up, the system sends the data to disk

• System buffering
  + allows the file system to collect full blocks of data before sending them to disk
  + file system can send several blocks at once to the disk (delayed write or write behind)
  - data is not really saved in the case of a system crash
  - for very large write operations, the additional copy from user to system buffer could/should be avoided
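The write path above can be sketched with a toy buffer that only sends full blocks to "disk". The class name `ToyWriteBuffer`, the tiny block size, and the in-memory list standing in for the disk are all assumptions made for illustration.

```python
class ToyWriteBuffer:
    """Toy system buffer: collects bytes and writes out only full blocks."""

    def __init__(self, block_size=8):
        self.block_size = block_size
        self.buffer = bytearray()   # system buffer (contents are lost on a crash)
        self.disk = []              # blocks that actually reached the "disk"

    def write(self, data):
        # Copy from the user buffer into the system buffer.
        self.buffer.extend(data)
        # Send every complete block to disk; keep the remainder buffered.
        while len(self.buffer) >= self.block_size:
            self.disk.append(bytes(self.buffer[:self.block_size]))
            del self.buffer[:self.block_size]

    def flush(self):
        # Force out a partial block, as an explicit sync or close would.
        if self.buffer:
            self.disk.append(bytes(self.buffer))
            self.buffer.clear()
```

Anything still in `buffer` when the machine fails never reaches `disk`, which is the reliability cost listed above.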

Read operations

• Read:
  – file system determines which blocks contain the requested data
  – read blocks from disk into the system buffer
  – copy data from the system buffer into user memory

• System buffering:
  + file system always reads a full block (file caching)
  + if the application reads data sequentially, prefetching (read ahead) can improve performance
  - prefetching hurts performance if the application has a random access pattern
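A minimal sketch of read-ahead, assuming a one-block prefetch policy and a Python list standing in for the disk. `ReadAheadCache` and its counters are hypothetical names for this illustration.

```python
class ReadAheadCache:
    """Toy block cache that prefetches one block beyond each request."""

    def __init__(self, disk_blocks):
        self.disk = disk_blocks     # list of blocks standing in for the disk
        self.cache = {}             # block number -> block contents
        self.disk_reads = 0         # number of blocks fetched from "disk"

    def _fetch(self, n):
        # Bring block n into the cache if it exists and is not cached yet.
        if n not in self.cache and 0 <= n < len(self.disk):
            self.cache[n] = self.disk[n]
            self.disk_reads += 1

    def read_block(self, n):
        hit = n in self.cache       # was the block already cached?
        self._fetch(n)
        self._fetch(n + 1)          # read ahead: prefetch the next block
        return self.cache[n], hit
```

With sequential access every request after the first is a cache hit. With random access the prefetched block is usually wasted, so `disk_reads` grows roughly twice as fast as the number of requests.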

Hiding disk latency: Caching and buffering

• Avoids repeated access to the same block
• Allows a file system to smooth out I/O behavior
• Helps to hide the latency of the hard drives
• Lowers the performance of I/O operations for irregular access

• Non-blocking I/O gives users control over prefetching and delayed writing
  – initiate read/write operations as soon as possible
  – wait for the read/write operations to finish only when absolutely necessary
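The "initiate early, wait late" pattern can be sketched with a worker thread from Python's `concurrent.futures`. This is an analogy, not a real asynchronous I/O interface: the helper names `read_chunk` and `checksum_overlapped` are assumptions, and the byte sum is just a placeholder for useful computation.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, size):
    """Blocking read of one chunk; runs in the worker thread."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def checksum_overlapped(path, chunk_size):
    """Sum all bytes of a file while the next chunk is read in the background."""
    total = 0
    file_size = os.path.getsize(path)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Initiate the first read as soon as possible.
        pending = pool.submit(read_chunk, path, 0, chunk_size)
        offset = chunk_size
        while pending is not None:
            data = pending.result()      # wait only when the data is needed
            pending = None
            if offset < file_size:
                # Start the next read before processing the current chunk.
                pending = pool.submit(read_chunk, path, offset, chunk_size)
                offset += chunk_size
            total += sum(data)           # compute while the next read runs
    return total
```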

Journaling file systems

• Updating a file typically takes multiple steps. An interruption between the steps leads to an inconsistent file system
• Example: deleting a file
  – remove the directory entry
  – mark the inode blocks as free in the space map
• A journaling file system keeps track of the changes that will be made in a journal before committing them to the main file system
  – entries to the journal are made before modifying the file system
• After a crash, the journal is replayed, and each entry is either
  – replayed successfully: the entry is complete and can be fully applied during recovery, or
  – not replayed: the journal entry had not been finished
• Journal entries often contain a per-entry checksum to detect corruption
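A minimal sketch of journal replay for the delete example, assuming JSON-encoded entries protected by a CRC32 checksum. The entry layout, the `state` dictionary, and the function names are hypothetical choices for this illustration, not the format of any real file system.

```python
import json
import zlib


def make_entry(op):
    """Serialize one journal entry together with a CRC32 checksum."""
    payload = json.dumps(op, sort_keys=True)
    return {"payload": payload, "crc": zlib.crc32(payload.encode())}


def replay(journal, state):
    """Apply every intact journal entry to state; skip damaged entries."""
    for entry in journal:
        if zlib.crc32(entry["payload"].encode()) != entry["crc"]:
            continue    # unfinished or corrupted entry: not replayed
        op = json.loads(entry["payload"])
        if op["type"] == "delete":
            # Both steps of the delete are applied together.
            state["directory"].pop(op["name"], None)
            state["free_inodes"].add(op["inode"])
    return state
```

Both steps are written so that applying them a second time changes nothing, which makes it safe to replay an entry whose effects had already partly reached the file system.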

Journaling file systems (II)

• Physical journal:
  – data and metadata are written to the journal before modifying the file system
  – large overhead -> data is written twice
• Logical journal:
  – only metadata is written to the journal
  – modifications to data are written to the file system directly
  -> worst case scenario: data is garbage, but directory structure and file structure are consistent
  -> trade-off between performance and reliability

Log structured file systems

• Conventional file systems lay out files to optimize spatial locality
  – they make in-place changes to their data structures in order to perform well on magnetic disks (seek is slow)
• Log-structured file systems treat storage as a circular buffer
  – writes always occur at the head of the log
• Writes create multiple, chronologically advancing versions of both file data and metadata
  – can be used to make old file versions nameable and accessible (snapshotting)
• Recovery from crashes is simpler: upon its next mount, the file system can reconstruct its state from the last consistent point in the log
  – no need to walk all its data structures
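The append-only idea can be sketched with a toy store in which every write is a new record at the head of the log. The class `LogStructuredStore` and its snapshot ids are assumptions for this illustration. The sketch does not model the circular buffer itself, so old records are never reclaimed.

```python
class LogStructuredStore:
    """Toy append-only log; every write creates a new version."""

    def __init__(self):
        self.log = []       # list of (name, data) records, oldest first
        self.index = {}     # name -> position of the newest record

    def write(self, name, data):
        self.log.append((name, data))           # always write at the log head
        self.index[name] = len(self.log) - 1
        return len(self.log)                    # usable as a snapshot id

    def read(self, name, snapshot=None):
        if snapshot is None:
            return self.log[self.index[name]][1]
        # Scan backwards from the snapshot point for the version visible then.
        for n, data in reversed(self.log[:snapshot]):
            if n == name:
                return data
        raise KeyError(name)

    def recover(self):
        # Rebuild the index by scanning the log, as after a crash.
        self.index = {n: i for i, (n, _) in enumerate(self.log)}
```

A real implementation would recover from the last checkpoint instead of scanning the whole log, and would need a cleaner that frees superseded records.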

Distributed File Systems

• The generic term for a client/server file system where the data is not locally attached to a host.
• Clients, servers, and storage are dispersed across machines.
• Configuration and implementation may vary.
• Clients should view a DFS the same way they would a centralized FS; the distribution is hidden at a lower level.
• Performance is concerned with throughput and response time.

Slide based on a lecture by Jerry Breecher: http://web.cs.wpi.edu/~jb/CS502/lectures/Section17-Dist_File_Sys.ppt

Distributed File Systems - Characteristics

• Naming: mapping between logical and physical objects
  – Example: a filename maps to a physical location on disk.
  – In a conventional file system, it's understood where the file actually resides; the system and disk are known.
  – In a transparent DFS, the location of a file, somewhere in the network, is hidden.
• Location transparency: the name of a file does not reveal any hint of the file's physical storage location.
• Location independence: the name of a file doesn't need to be changed when the file's physical storage location changes.

Distributed File Systems - Characteristics

• Caching
  – Reduce network traffic by retaining recently accessed disk blocks in a cache, so that repeated accesses to the same information can be handled locally.
  – If required data is not already cached, a copy of the data is brought from the server to the user.
  – Accesses are performed on the cached copy.
  – Files are identified with one master copy residing at the server machine.
  – Copies of (parts of) the file are scattered in different caches.
• Cache Consistency Problem: keeping the cached copies consistent with the master file.
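One possible way to keep a cached copy consistent is to validate it against a version number on the master copy before each use. The sketch below assumes that policy. `Server`, `CachingClient`, and the version counter are hypothetical names, and whole files are cached instead of individual blocks for brevity.

```python
class Server:
    """Holds the master copy of each file."""

    def __init__(self):
        self.files = {}     # name -> (version, data)

    def write(self, name, data):
        version = self.files.get(name, (0, b""))[0] + 1
        self.files[name] = (version, data)

    def version(self, name):
        return self.files[name][0]

    def read(self, name):
        return self.files[name]


class CachingClient:
    """Client that serves reads from its cache while the version matches."""

    def __init__(self, server):
        self.server = server
        self.cache = {}     # name -> (version, data)
        self.fetches = 0    # full transfers from the server

    def read(self, name):
        cached = self.cache.get(name)
        # Validate the cached copy against the master copy's version.
        if cached is None or cached[0] != self.server.version(name):
            self.cache[name] = self.server.read(name)
            self.fetches += 1
        return self.cache[name][1]
```

The version check is itself a round trip to the server, so this policy saves data transfer but not latency. Other designs let the server invalidate client caches instead.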

Distributed File Systems - Characteristics

• Typical steps for a read operation:
  – The client makes a request for file access.
  – The request is passed to the server in message format.
  – The server makes the file access.
  – Return messages bring the result back to the client.

• Cache location:
  – data can be kept in local memory or on the local disk
  – caching can be done on the client and the server side

Distributed File Systems - Characteristics

• Stateful: the server keeps track of information about client requests.
  – Maintains which files are opened by a client
  – Memory must be reclaimed when the client closes the file or when the client dies.
  – Good for performance: no need to parse the filename each time, or "open/close" the file on every request.
  – Bad for reliability: a stateful server loses everything on a crash

• Stateless: each client request provides the complete information needed by the server (i.e., filename, file offset).
  – The server may maintain information on behalf of the client, but is not required to
  – A stateless server remembers nothing, so it can restart easily after a crash
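The difference can be sketched with two toy servers reading from the same in-memory file table. The class names, the `FILES` dictionary, and the integer handles are assumptions for illustration only.

```python
FILES = {"notes.txt": b"hello distributed world"}   # hypothetical server data


class StatefulServer:
    """Remembers open files and the current position of each client handle."""

    def __init__(self):
        self.open_files = {}    # handle -> [name, position]
        self.next_handle = 0

    def open(self, name):
        self.next_handle += 1
        self.open_files[self.next_handle] = [name, 0]
        return self.next_handle

    def read(self, handle, size):
        name, pos = self.open_files[handle]     # fails if the state was lost
        self.open_files[handle][1] = pos + size
        return FILES[name][pos:pos + size]


class StatelessServer:
    """Keeps nothing between requests."""

    def read(self, name, offset, size):
        # Everything needed arrives with the request itself.
        return FILES[name][offset:offset + size]
```

Replacing a `StatefulServer` instance with a fresh one models a crash: old handles raise `KeyError`, while the same request sent to a restarted `StatelessServer` still works.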

Example: NFS – The Network File System

• Protocol for a remote file service
• Stateless server (v3)
• Communication based on RPC (Remote Procedure Call)
• NFS provides session semantics
  – changes to an open file are initially only visible to the process that modified the file
• File locking is not part of the NFS protocol (v3), but is often available through a separate protocol/daemon
• Client caching is not part of the NFS protocol (v3)
  – implementation dependent behavior

Image taken from a lecture by Jerry Breecher: http://web.cs.wpi.edu/~jb/CS502/lectures/Section17-Dist_File_Sys.ppt

Parallel File Systems

• Parallel File System: data blocks are striped across multiple storage devices on multiple storage servers.
• Support for parallel applications: all nodes can access the same files at the same time (concurrent read and write capabilities)
• Three relevant parameters:
  – stripe factor: number of disks
  – stripe size: size of each block
  – which disk contains the first block of the file

[Figure: blocks 1 … n of a file striped across Disk 1, Disk 2, Disk 3 and Disk 4]
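The three striping parameters are enough to compute where any byte of a file lives. The sketch below assumes simple round-robin placement, which the slides do not state explicitly; `locate_stripe` is a hypothetical helper name.

```python
def locate_stripe(offset, stripe_size, stripe_factor, first_disk=0):
    """Map a file offset to (disk, offset_on_disk) under round-robin striping.

    stripe_size   -- size of each block in bytes
    stripe_factor -- number of disks the file is striped over
    first_disk    -- disk that holds the first block of the file
    """
    block, within = divmod(offset, stripe_size)
    disk = (first_disk + block) % stripe_factor
    # Each disk holds every stripe_factor-th block of the file.
    offset_on_disk = (block // stripe_factor) * stripe_size + within
    return disk, offset_on_disk
```

With a 64 kB stripe size and four disks, consecutive 64 kB blocks land on disks 0, 1, 2, 3, 0, and so on, so a large sequential read is served by all four disks.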

Parallel File Systems: Conceptual overview

[Figure: compute nodes connected to a metadata server and to storage servers 0, 1, 2 and 3]

Parallel File Systems - Concept

• Metadata server:
  – stores namespace metadata, such as filenames, directories, access permissions, and file layout
  – metadata server is not necessarily involved in file I/O operations
• Distributed metadata server:
  – e.g. multiple metadata servers available, each hosting a part of the namespace, assigned by
    • a hashing function on the file name, or
    • subtrees of the directory hierarchy
• Write operations:
  – require locking of the entire file or file block to ensure consistency
  – distributed locking protocols can be used
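The hashing variant of a distributed metadata service can be sketched in a few lines. The choice of SHA-256 and the function name `metadata_server_for` are assumptions for illustration; any hash that all clients compute identically would do.

```python
import hashlib


def metadata_server_for(path, num_servers):
    """Pick the metadata server responsible for a file by hashing its name."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a server id.
    return int.from_bytes(digest[:8], "big") % num_servers
```

Every client computes the same server for the same name without asking anyone. The trade-off against subtree partitioning is that files in one directory are spread over all servers, so listing a directory touches several of them.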

Example: Parallel Virtual File System

• Open source project from Clemson University
• Lightweight server daemon to provide simultaneous access to storage
• Each node in the cluster can be a server, a client, or both.
• Best suited for providing large, fast temporary storage.
• The basic PVFS2 package consists of three components: a server, a client, and a kernel module.
• Default stripe size: 64 kB
  – in practice often changed to 1 MB
  – can be adjusted on a per-directory basis

Slides based on a talk by James W. Barker: http://www.slideshare.net/lystrata/survey-of-clusteredparallelfilesystems004lanlppt-10538039

Example: Parallel Virtual File System

• Stateless architecture
  – PVFS2 servers do not keep track of typical file system bookkeeping information such as which files have been opened, file positions, etc.
  – No shared lock state to manage
  – Servers can fail and resume without disturbing the system as a whole.
• Distributed metadata server
  – Relies on relaxed consistency semantics
  – Defines semantics of data access without requiring locking

Example: Parallel Virtual File System

• No client-side caching of metadata:
  – status operations (e.g. “ls”) take a long time, as the information is retrieved over the network
  – PVFS2 is better suited for I/O intensive applications than for hosting a home directory
  – PVFS2 is optimized for efficient reading and writing of large amounts of data, and thus it’s very well suited for scientific applications

Example: Parallel Virtual File System

• Two methods are provided for accessing PVFS2 file systems.
  – Mount the PVFS2 file system
    • allows the user to access the file system using regular POSIX commands/functions
    • introduces some performance overhead
  – PVFS2 library functions, e.g. used by MPI-IO
    • do not implement POSIX semantics
    • optimize access to single files by many processes on different nodes
    • provide “noncontiguous” access operations that allow for efficient access to data spread throughout the file
