DMF 7 & Tier Zero: Scalable Architecture & Roadmap
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Development of a Verified Flash File System ⋆
Development of a Verified Flash File System ? Gerhard Schellhorn, Gidon Ernst, J¨orgPf¨ahler,Dominik Haneberg, and Wolfgang Reif Institute for Software & Systems Engineering University of Augsburg, Germany fschellhorn,ernst,joerg.pfaehler,haneberg,reifg @informatik.uni-augsburg.de Abstract. This paper gives an overview over the development of a for- mally verified file system for flash memory. We describe our approach that is based on Abstract State Machines and incremental modular re- finement. Some of the important intermediate levels and the features they introduce are given. We report on the verification challenges addressed so far, and point to open problems and future work. We furthermore draw preliminary conclusions on the methodology and the required tool support. 1 Introduction Flaws in the design and implementation of file systems already lead to serious problems in mission-critical systems. A prominent example is the Mars Explo- ration Rover Spirit [34] that got stuck in a reset cycle. In 2013, the Mars Rover Curiosity also had a bug in its file system implementation, that triggered an au- tomatic switch to safe mode. The first incident prompted a proposal to formally verify a file system for flash memory [24,18] as a pilot project for Hoare's Grand Challenge [22]. We are developing a verified flash file system (FFS). This paper reports on our progress and discusses some of the aspects of the project. We describe parts of the design, the formal models, and proofs, pointing out challenges and solutions. The main characteristic of flash memory that guides the design is that data cannot be overwritten in place, instead space can only be reused by erasing whole blocks. -
Silicon Graphics, Inc. Scalable Filesystems XFS & CXFS
Silicon Graphics, Inc. Scalable Filesystems XFS & CXFS Presented by: Yingping Lu January 31, 2007 Outline • XFS Overview •XFS Architecture • XFS Fundamental Data Structure – Extent list –B+Tree – Inode • XFS Filesystem On-Disk Layout • XFS Directory Structure • CXFS: shared file system ||January 31, 2007 Page 2 XFS: A World-Class File System –Scalable • Full 64 bit support • Dynamic allocation of metadata space • Scalable structures and algorithms –Fast • Fast metadata speeds • High bandwidths • High transaction rates –Reliable • Field proven • Log/Journal ||January 31, 2007 Page 3 Scalable –Full 64 bit support • Large Filesystem – 18,446,744,073,709,551,615 = 264-1 = 18 million TB (exabytes) • Large Files – 9,223,372,036,854,775,807 = 263-1 = 9 million TB (exabytes) – Dynamic allocation of metadata space • Inode size configurable, inode space allocated dynamically • Unlimited number of files (constrained by storage space) – Scalable structures and algorithms (B-Trees) • Performance is not an issue with large numbers of files and directories ||January 31, 2007 Page 4 Fast –Fast metadata speeds • B-Trees everywhere (Nearly all lists of metadata information) – Directory contents – Metadata free lists – Extent lists within file – High bandwidths (Storage: RM6700) • 7.32 GB/s on one filesystem (32p Origin2000, 897 FC disks) • >4 GB/s in one file (same Origin, 704 FC disks) • Large extents (4 KB to 4 GB) • Request parallelism (multiple AGs) • Delayed allocation, Read ahead/Write behind – High transaction rates: 92,423 IOPS (Storage: TP9700) -
W4118: File Systems
W4118: file systems Instructor: Junfeng Yang References: Modern Operating Systems (3rd edition), Operating Systems Concepts (8th edition), previous W4118, and OS at MIT, Stanford, and UWisc Outline File system concepts . What is a file? . What operations can be performed on files? . What is a directory and how is it organized? File implementation . How to allocate disk space to files? 1 What is a file User view . Named byte array • Types defined by user . Persistent across reboots and power failures OS view . Map bytes as collection of blocks on physical storage . Stored on nonvolatile storage device • Magnetic Disks 2 Role of file system Naming . How to “name” files . Translate “name” + offset logical block # Reliability . Must not lose file data Protection . Must mediate file access from different users Disk management . Fair, efficient use of disk space . Fast access to files 3 File metadata Name – only information kept in human-readable form Identifier – unique tag (number) identifies file within file system (inode number in UNIX) Location – pointer to file location on device Size – current file size Protection – controls who can do reading, writing, executing Time, date, and user identification – data for protection, security, and usage monitoring How is metadata stored? (inode in UNIX) 4 File operations int creat(const char* pathname, mode_t mode) int unlink(const char* pathname) int rename(const char* oldpath, const char* newpath) int open(const char* pathname, int flags, mode_t mode) int read(int fd, void* buf, size_t count); int write(int fd, const void* buf, size_t count) int lseek(int fd, offset_t offset, int whence) int truncate(const char* pathname, offset_t len) .. -
Lecture 17: Files and Directories
11/1/16 CS 422/522 Design & Implementation of Operating Systems Lecture 17: Files and Directories Zhong Shao Dept. of Computer Science Yale University Acknowledgement: some slides are taken from previous versions of the CS422/522 lectures taught by Prof. Bryan Ford and Dr. David Wolinsky, and also from the official set of slides accompanying the OSPP textbook by Anderson and Dahlin. The big picture ◆ Lectures before the fall break: – Management of CPU & concurrency – Management of main memory & virtual memory ◆ Current topics --- “Management of I/O devices” – Last week: I/O devices & device drivers – Last week: storage devices – This week: file systems * File system structure * Naming and directories * Efficiency and performance * Reliability and protection 1 11/1/16 This lecture ◆ Implementing file system abstraction Physical Reality File System Abstraction block oriented byte oriented physical sector #’s named files no protection users protected from each other data might be corrupted robust to machine failures if machine crashes File system components ◆ Disk management User – Arrange collection of disk blocks into files File File ◆ Naming Naming access – User gives file name, not track or sector number, to locate data Disk management ◆ Security / protection – Keep information secure Disk drivers ◆ Reliability/durability – When system crashes, lose stuff in memory, but want files to be durable 2 11/1/16 User vs. system view of a file ◆ User’s view – Durable data structures ◆ System’s view (system call interface) – Collection of bytes (Unix) ◆ System’s view (inside OS): – Collection of blocks – A block is a logical transfer unit, while a sector is the physical transfer unit. -
CXFSTM Administration Guide for SGI® Infinitestorage
CXFSTM Administration Guide for SGI® InfiniteStorage 007–4016–025 CONTRIBUTORS Written by Lori Johnson Illustrated by Chrystie Danzer Engineering contributions to the book by Vladmir Apostolov, Rich Altmaier, Neil Bannister, François Barbou des Places, Ken Beck, Felix Blyakher, Laurie Costello, Mark Cruciani, Rupak Das, Alex Elder, Dave Ellis, Brian Gaffey, Philippe Gregoire, Gary Hagensen, Ryan Hankins, George Hyman, Dean Jansa, Erik Jacobson, John Keller, Dennis Kender, Bob Kierski, Chris Kirby, Ted Kline, Dan Knappe, Kent Koeninger, Linda Lait, Bob LaPreze, Jinglei Li, Yingping Lu, Steve Lord, Aaron Mantel, Troy McCorkell, LaNet Merrill, Terry Merth, Jim Nead, Nate Pearlstein, Bryce Petty, Dave Pulido, Alain Renaud, John Relph, Elaine Robinson, Dean Roehrich, Eric Sandeen, Yui Sakazume, Wesley Smith, Kerm Steffenhagen, Paddy Sreenivasan, Roger Strassburg, Andy Tran, Rebecca Underwood, Connie Woodward, Michelle Webster, Geoffrey Wehrman, Sammy Wilborn COPYRIGHT © 1999–2007 SGI. All rights reserved; provided portions may be copyright in third parties, as indicated elsewhere herein. No permission is granted to copy, distribute, or create derivative works from the contents of this electronic documentation in any manner, in whole or in part, without the prior written permission of SGI. LIMITED RIGHTS LEGEND The software described in this document is "commercial computer software" provided with restricted rights (except as to included open/free source) as specified in the FAR 52.227-19 and/or the DFAR 227.7202, or successive sections. Use beyond -
Recursive Updates in Copy-On-Write File Systems - Modeling and Analysis
2342 JOURNAL OF COMPUTERS, VOL. 9, NO. 10, OCTOBER 2014 Recursive Updates in Copy-on-write File Systems - Modeling and Analysis Jie Chen*, Jun Wang†, Zhihu Tan*, Changsheng Xie* *School of Computer Science and Technology Huazhong University of Science and Technology, China *Wuhan National Laboratory for Optoelectronics, Wuhan, Hubei 430074, China [email protected], {stan, cs_xie}@hust.edu.cn †Dept. of Electrical Engineering and Computer Science University of Central Florida, Orlando, Florida 32826, USA [email protected] Abstract—Copy-On-Write (COW) is a powerful technique recursive update. Recursive updates can lead to several for data protection in file systems. Unfortunately, it side effects to a storage system, such as write introduces a recursively updating problem, which leads to a amplification (also can be referred as additional writes) side effect of write amplification. Studying the behaviors of [4], I/O pattern alternation [5], and performance write amplification is important for designing, choosing and degradation [6]. This paper focuses on the side effects of optimizing the next generation file systems. However, there are many difficulties for evaluation due to the complexity of write amplification. file systems. To solve this problem, we proposed a typical Studying the behaviors of write amplification is COW file system model based on BTRFS, verified its important for designing, choosing, and optimizing the correctness through carefully designed experiments. By next generation file systems, especially when the file analyzing this model, we found that write amplification is systems uses a flash-memory-based underlying storage greatly affected by the distributions of files being accessed, system under online transaction processing (OLTP) which varies from 1.1x to 4.2x. -
F2punifycr: a Flash-Friendly Persistent Burst-Buffer File System
F2PUnifyCR: A Flash-friendly Persistent Burst-Buffer File System ThanOS Department of Computer Science Florida State University Tallahassee, United States I. ABSTRACT manifold depending on the workloads it is handling for With the increased amount of supercomputing power, applications. In order to leverage the capabilities of burst it is now possible to work with large scale data that buffers to the utmost level, it is very important to have a pose a continuous opportunity for exascale computing standardized software interface across systems. It has to that puts immense pressure on underlying persistent data deal with an immense amount of data during the runtime storage. Burst buffers, a distributed array of node-local of the applications. persistent flash storage devices deployed on most of Using node-local burst buffer can achieve scalable the leardership supercomputers, are means to efficiently write bandwidth as it lets each process write to the handling the bursty I/O invoked through cutting-edge local flash drive, but when the files are shared across scientific applications. In order to manage these burst many processes, it puts the management of metadata buffers, many ephemeral user level file system solutions, and object data of the files under huge challenge. In like UnifyCR, are present in the research and industry order to handle all the challenges posed by the bursty arena. Because of the intrinsic nature of the flash devices and random I/O requests by the Scientific Applica- due to background processing overhead, like Garbage tions running on leadership Supercomputing clusters, Collection, peak write bandwidth is hard to get. -
Operating Systems Lecture #5: File Management
Operating Systems Lecture #5: File Management Written by David Goodwin based on the lecture series of Dr. Dayou Li and the book Understanding Operating Systems 4thed. by I.M.Flynn and A.McIver McHoes (2006) Department of Computer Science and Technology, University of Bedfordshire. Operating Systems, 2013 25th February 2013 Outline Lecture #5 File Management David Goodwin 1 Introduction University of Bedfordshire 2 Interaction with the file manager Introduction Interaction with the file manager 3 Files Files Physical storage 4 Physical storage allocation allocation Directories 5 Directories File system Access 6 File system Data compression summary 7 Access 8 Data compression 9 summary Operating Systems 46 Lecture #5 File Management David Goodwin University of Bedfordshire Introduction 3 Interaction with the file manager Introduction Files Physical storage allocation Directories File system Access Data compression summary Operating Systems 46 Introduction Lecture #5 File Management David Goodwin University of Bedfordshire Introduction 4 Responsibilities of the file manager Interaction with the file manager 1 Keep track of where each file is stored Files 2 Use a policy that will determine where and how the files will Physical storage be stored, making sure to efficiently use the available storage allocation space and provide efficient access to the files. Directories 3 Allocate each file when a user has been cleared for access to File system it, and then record its use. Access 4 Deallocate the file when the file is to be returned to storage, Data compression and communicate its availability to others who may be summary waiting for it. Operating Systems 46 Definitions Lecture #5 File Management field is a group of related bytes that can be identified by David Goodwin University of the user with a name, type, and size. -
Introduction to ISO 9660
Disc Manufacturing, Inc. A QUIXOTE COMPANY Introduction to ISO 9660, what it is, how it is implemented, and how it has been extended. Clayton Summers Copyright © 1993 by Disc Manufacturing, Inc. All rights reserved. WHO IS DMI? Disc Manufacturing, Inc. (DMI) manufactures all compact disc formats (i.e., CD-Audio, CD-ROM, CD-ROM XA, CDI, PHOTO CD, 3DO, KARAOKE, etc.) at two plant sites in the U.S.; Huntsville, AL, and Anaheim, CA. To help you, DMI has one of the largest Product Engineering/Technical Support staff and sales force dedicated solely to CD-ROM in the industry. The company has had a long term commitment to optical disc technology and has performed developmental work and manufactured (laser) optical discs of various types since 1981. In 1983, DMI manufactured the first compact disc in the United States. DMI has developed extensive mastering expertise during this time and is frequently called upon by other companies to provide special mastering services for products in development. In August 1991, DMI purchased the U.S. CD-ROM business from the Philips and Du Pont Optical Company (PDO). PDO employees in sales, marketing and technical services were retained. DMI is a wholly-owned subsidiary of Quixote Corporation, a publicly owned corporation whose stock is traded on the NASDAQ exchange as QUIX. Quixote is a diversified technology company composed of Energy Absorption Systems, Inc. (manufactures highway crash cushions), Stenograph Corporation (manufactures shorthand machines and computer systems for court reporting) and Disc Manufacturing, Inc. We would be pleased to help you with your CD project or answer any questions you may have. -
File System Layout
File Systems Main Points • File layout • Directory layout • Reliability/durability Named Data in a File System index !le name directory !le number structure storage o"set o"set block Last Time: File System Design Constraints • For small files: – Small blocks for storage efficiency – Files used together should be stored together • For large files: – ConCguous allocaon for sequenCal access – Efficient lookup for random access • May not know at file creaon – Whether file will become small or large File System Design Opons FAT FFS NTFS Index Linked list Tree Tree structure (fixed, asym) (dynamic) granularity block block extent free space FAT array Bitmap Bitmap allocaon (fixed (file) locaon) Locality defragmentaon Block groups Extents + reserve Best fit space defrag MicrosoS File Allocaon Table (FAT) • Linked list index structure – Simple, easy to implement – SCll widely used (e.g., thumb drives) • File table: – Linear map of all blocks on disk – Each file a linked list of blocks FAT MFT Data Blocks 0 1 2 3 !le 9 block 3 4 5 6 7 8 9 !le 9 block 0 10 !le 9 block 1 11 !le 9 block 2 12 !le 12 block 0 13 14 15 16 !le 12 block 1 17 18 !le 9 block 4 19 20 FAT • Pros: – Easy to find free block – Easy to append to a file – Easy to delete a file • Cons: – Small file access is slow – Random access is very slow – Fragmentaon • File blocks for a given file may be scaered • Files in the same directory may be scaered • Problem becomes worse as disk fills Berkeley UNIX FFS (Fast File System) • inode table – Analogous to FAT table • inode – Metadata • File owner, access permissions, -
Architecture and Performance of Perlmutter's 35 PB Clusterstor
Lawrence Berkeley National Laboratory Recent Work Title Architecture and Performance of Perlmutter’s 35 PB ClusterStor E1000 All-Flash File System Permalink https://escholarship.org/uc/item/90r3s04z Authors Lockwood, Glenn K Chiusole, Alberto Gerhardt, Lisa et al. Publication Date 2021-06-18 Peer reviewed eScholarship.org Powered by the California Digital Library University of California Architecture and Performance of Perlmutter’s 35 PB ClusterStor E1000 All-Flash File System Glenn K. Lockwood, Alberto Chiusole, Lisa Gerhardt, Kirill Lozinskiy, David Paul, Nicholas J. Wright Lawrence Berkeley National Laboratory fglock, chiusole, lgerhardt, klozinskiy, dpaul, [email protected] Abstract—NERSC’s newest system, Perlmutter, features a 35 1,536 GPU nodes 16 Lustre MDSes PB all-flash Lustre file system built on HPE Cray ClusterStor 1x AMD Epyc 7763 1x AMD Epyc 7502 4x NVIDIA A100 2x Slingshot NICs E1000. We present its architecture, early performance figures, 4x Slingshot NICs Slingshot Network 24x 15.36 TB NVMe and performance considerations unique to this architecture. We 200 Gb/s demonstrate the performance of E1000 OSSes through low-level 2-level dragonfly 16 Lustre OSSes 3,072 CPU nodes Lustre tests that achieve over 90% of the theoretical bandwidth 1x AMD Epyc 7502 2x AMD Epyc 7763 2x Slingshot NICs of the SSDs at the OST and LNet levels. We also show end-to-end 1x Slingshot NICs performance for both traditional dimensions of I/O performance 24x 15.36 TB NVMe (peak bulk-synchronous bandwidth) and non-optimal workloads endemic to production computing (small, incoherent I/Os at 24 Gateway nodes 2 Arista 7804 routers 2x Slingshot NICs 400 Gb/s/port random offsets) and compare them to NERSC’s previous system, 2x 200G HCAs > 10 Tb/s routing Cori, to illustrate that Perlmutter achieves the performance of a burst buffer and the resilience of a scratch file system. -
Comparative Analysis of Distributed and Parallel File Systems' Internal Techniques
Comparative Analysis of Distributed and Parallel File Systems’ Internal Techniques Viacheslav Dubeyko Content 1 TERMINOLOGY AND ABBREVIATIONS ................................................................................ 4 2 INTRODUCTION......................................................................................................................... 5 3 COMPARATIVE ANALYSIS METHODOLOGY ....................................................................... 5 4 FILE SYSTEM FEATURES CLASSIFICATION ........................................................................ 5 4.1 Distributed File Systems ............................................................................................................................ 6 4.1.1 HDFS ..................................................................................................................................................... 6 4.1.2 GFS (Google File System) ....................................................................................................................... 7 4.1.3 InterMezzo ............................................................................................................................................ 9 4.1.4 CodA .................................................................................................................................................... 10 4.1.5 Ceph.................................................................................................................................................... 12 4.1.6 DDFS ..................................................................................................................................................