ORION: a Distributed File System for Non-Volatile Main Memories and RDMA-Capable Networks

Total Page:16

File Type:pdf, Size:1020Kb

ORION: a Distributed File System for Non-Volatile Main Memories and RDMA-Capable Networks ORION: A Distributed File System for Non-Volatile Main Memories and RDMA-Capable Networks Jian Yang, Joseph Izraelevitz, Steven Swanson Non-Volatile Systems Laboratory Department of Computer Science & Engineering University of California, San Diego 1 Accessing NVMM as Remote Storage Application Application Application DRAM Memory NVMM DRAM NVMM DRAM NVMM RDMA Network Non-Volatile Main Memory (NVMM) Remote Direct Memory Access (RDMA) • PCM, STT-RAM, ReRAM, Intel 3DXPoint • DMA from/to remote memory • Performant: DRAM-class latency/BW • Two-sided verbs (Send/Recv) • Byte-addressable • One-sided verbs (Read/Write) • Persistent over power failures • Bypasses remote CPU • Byte-addressable 2 Accessing Local NVMM vs. Remote NVMM Latency of write(4KB) + fsync() Better Access local NVMM Access remote NVMM over Dist. FS NVMMFS (NOVA) 3.86 us Application Application iSCSI/RDMA 19x Ceph/RDMA 241x RDMA NVMM FS Distributed FS Distributed FS Throughput of fileserver workload Better NVMMFS 6653MB/s NVMM NVMM (NOVA) iSCSI/RDMA 19% Ceph/RDMA 5% 3 Issue #1: Existing Dist. FSs are Slow on NVMM Application File Access RPC • Layered Design Storage Client Storage Server • Indirection overhead User File Request File Access Kernel • Expensive to persist (e.g., fsync()) File System/FUSE File System Block Access NVMM 4 NVMM is Faster than RDMA Harddrive (NVMe) RDMA Network NVMM Latency 70 μs >> 3 μs 300 ns Bandwidth 1.3-3.2 GB/s < 5 GB/s 2-16 GB/s Access Size 4 KB (Page) 2 KB (MTU) 64 Byte (Cacheline) (@ Max BW) Networking is faster than storage 5 NVMM is Faster than RDMA Harddrive (NVMe) RDMA Network NVMM Latency 70 μs 3 μs >> 300 ns Bandwidth 1.3-3.2 GB/s 5 GB/s < 2-16 GB/s Access Size 4 KB (Page) 2 KB (MTU) 64 Byte (Cacheline) (@ Max BW) NVMM is faster than networking 6 Issue #2: Lack of Support for Local NVMM • Use case of converged storage Application – Local NVMM supports Direct Access (DAX) Storage RDMA Storage Server • Existing systems do not store data at local Client NVMM FS • Run Local FS and Dist. FS File System – Expensive to move data DAX Page Cache Local NVMM Remote NVMM 7 ORION: A Distributed File System for NVMM and RDMA-Capable Networks • A clean slate design for NVMM and RDMA • A unified layer: kernel FS + networking Application Application POSIX I/O • Pooled NVMM storage ORION • Accessing metadata/data directly over DAX Access RDMA Access Direct Access (DAX) and RDMA NVMM Pooled NVMM NVMM • Designed for rack-scale scalability 8 Outline • Background • Design overview • Metadata and data management • Replication • RDMA persistence • Evaluation • Conclusion 9 ORION: Cluster Overview MDS Metadata NVMM • Metadata Server (MDS): Runs ORIONFS, Sync/Update keeps authoritative metadata of the whole FS Client • Client: Runs ORIONFS, keeps active metadata Client ClientMetadata and cached data. Access local NVMM DRAM • Data Store (DS): Pooled NVMM data Data $ RDMA Read/Write DAX Data Store • Metadata Access: Clients <=> MDS (Two-sided) Read/Write DataData Store Store • Data Access: Clients => DSs (One-sided) Data Data NVMM 10 The ORION File System • Inherited from NOVA [Xu, FAST 16] MDS MDS – Per-inode metadata (operation) log VFS a – Build in-DRAM data structures on recovery inodes b – Atomic log append dentries c inode log • Metadata: Client – DMA (Physical) memory region (MR) – RDMA-able metadata structures VFS b inodes c dentries • Data: Globally partitioned Data Store inode log [$] Data $ [Y] Data [X] Data 11 Operations in ORION File System MDS a b • Open(a) c – Client Allocate inode inode log RecvQ Client b c RPC: Open inode log Path: a [$] Data $ WAddr:&alloc [X] Data 12 Operations in ORION File System MDS a b • Open(a) c – Client Allocate inode inode log RecvQ – Client Issue an RPC via RDMA_Send Client b c inode log [$] Data $ [X] Data 13 Operations in ORION File System MDS a b • Open(a) c – Client Allocate inode inode log RecvQ – Client Issue an RPC via RDMA_Send Client – MDS RDMA_Write to allocated space b a c inode log [$] Data $ [X] Data 14 Operations in ORION File System MDS a b • Open(a) c – Client Allocate inode inode log RecvQ – Client Issue an RPC via RDMA_Send Client – MDS RDMA_Write to allocated space b a – Client RDMA_Read the rest of the log c inode log [$] Data $ [X] Data 15 Operations in ORION File System MDS 1a • Write(c) b – Client Allocate & CoW to client-owned pages c inode log RecvQ Client b a c inode log [$] Data $ [X] Data 16 Operations in ORION File System MDS 1a • Write(c) b – Client Allocate & CoW to client-owned pages c RecvQ – Client Append log entry inode log Client b a c Log: FileWrite inode log Addr=(X,addr) Size=4096 [$] Data $ [X] Data 17 Operations in ORION File System MDS 1a • Write(c) b – Client Allocate & CoW to client-owned pages c RecvQ – Client Append log entry inode log – Client Commit log entry via RDMA_Send Client b a c Log: FileWrite inode log Addr=(X,addr) Size=4096 [$] Data $ [X] Data 18 Operations in ORION File System MDS 1a • Write(c) b – Client Allocate & CoW to client-owned pages c RecvQ – Client Append log entry inode log – Client Commit log entry via RDMA_Send Client – MDS Append log entry b a c Log: FileWrite inode log Addr=(X,addr) Size=4096 [$] Data $ [X] Data 19 Operations in ORION File System MDS 1a • Write(c) b – Client Allocate & CoW to client-owned pages c RecvQ – Client Append log entry inode log – Client Commit log entry via RDMA_Send Client – MDS Append log entry b a – Client MDS Update tail pointers atomically c Log: FileWrite inode log Addr=(X,addr) Size=4096 [$] Data $ [X] Data 20 Operations in ORION File System MDS 1a • Tailcheck(b) b c – MDS Log commit from another client inode log RecvQ Client b c inode log [$] Data $ [X] Data 21 Operations in ORION File System MDS 1a • Tailcheck(b) b c – MDS Log commit from another client inode log RecvQ – Client RDMA_Read remote log tail Client b c inode log [$] Data $ [X] Data 22 Operations in ORION File System MDS 1a • Tailcheck(b) b c – MDS Log commit from another client inode log RecvQ – Client RDMA_Read remote log tail Client – Client Read from MDS if Len(Local) < Len(Remote) b c inode log [$] Data $ [X] Data 23 Operations in ORION File System • Read(b) MDS – Client Tailcheck (async) 1a – Client RDMA_Read from data store b c inode log RecvQ Client • Data locality Type=FileWrite Addr=(Y,3) – Future reads will hit DRAM cache b Size=4096 – Future writes will go to local NVMM c Data Store inode log [$] Data $ 3 [Y] Data [X] Data 24 Operations in ORION File System • Read(b) MDS – Client Tailcheck (async) 1a – Client RDMA_Read from data store b – Client In-place update to log entry c inode log RecvQ Client • Data locality Type=FileWrite Addr=($,0) – Future reads will hit DRAM cache b Size=4096 – Future writes will go to local NVMM c Data Store inode log [$] Data $ 3 [Y] Data [X] Data 25 Accelerating Metadata Accesses One-sided RDMA Throughtput Read [Inbound] 15Mop/s • Observations: Write[Outbound] 1.9Mop/s – RDMA prefers inbound operations [Su, EuroSys 17] RDMA Send Latency – RDMA prefers small operations [Kalia, ATC 16] 8 Bytes 1.3 us 512 Bytes 3.3 us • MDS request handling: – Tailcheck (8B RDMA_Read): MDS-bypass – Log Commit (~128B RDMA_Send): Single-inode operations – RPC (Varies): Other operations, less common 26 Optimizing Log Commits MDS Tail • Speculative log commit: Local Tail – Return when RDMA_Send verb is signaled inode log – Tailcheck before send #1 #2 Client 2 Client – Rebuild inode from log when necessary – RPCs for complex operations (e.g. inode log #1 #2 #3 O_APPEND) MDS RecvQ • Log commit + Persist: ~ 500 CPU Cycles (memcpy) (flush+fence) inode log #1 #3 #3 Client 1 Client Rebuild 27 Evaluation ORION Prototype • ORION kernel modules (~15K LOC) Networking • Linux Kernel 4.10 • 12 Nodes connected to a switch • RDMA Stack: MLNX_OFED 4.3 • InfiniBand Switch (QLogic 12300) • Bind to 1 core for each client Hardware • 2x Intel Westmere-EP CPU • 16GB DRAM as DRAM • 32GB DRAM as NVMM • RNIC: Mellanox ConnectX-2 VPI (40Gbps) 28 Evaluation: File Operations Orion vs. Distributed File Systems Orion vs. Local File Systems Orion Ceph Gluster Orion NOVA Ext4-DAX 582/932 417 100 30 75 20 50 1.3x 162x 10 1.3x Access Access Latency (us) 25 3x 0 0 Create Mkdir Unlink Rmdir FIO 4K FIO 4K Better Create Mkdir Unlink Rmdir FIO 4K FIO 4K Better Read Write Read Write 29 Evaluation: Applications Filebench workloads and MongoDB (YCSB-A) throughput (relative to NOVA) Orion NOVA Ext4-DAX Ceph Gluster Better 1 85% 0.8 68% 0.6 0.4 Throughput (Relative) 0.2 0 varmail fileserver webserver mongodb 30 Evaluation: Metadata Accesses Relative throughput of metadata operations Throughput of metadata operations 8 Better Tailcheck Log Commit RPC 7 15 6 5 4 10 3 2 1 5 0 Throughput (Relative Throughputto1 Client) (Relative Throughput (Mop/s) 1 2 3 4 5 6 7 8 # Clients 0 Tailcheck Log Commit RPC 8 Clients 31 Conclusion • Existing distributed file systems lack of NVMM support and have significant software overhead • ORION unifies the NVMM file system and the networking layer • ORION provides fast metadata accesses • ORION allows DAX to local NVMM data • Performance comparable to local NVMM file systems 32.
Recommended publications
  • Optimizing the Ceph Distributed File System for High Performance Computing
    Optimizing the Ceph Distributed File System for High Performance Computing Kisik Jeong Carl Duffy Jin-Soo Kim Joonwon Lee Sungkyunkwan University Seoul National University Seoul National University Sungkyunkwan University Suwon, South Korea Seoul, South Korea Seoul, South Korea Suwon, South Korea [email protected] [email protected] [email protected] [email protected] Abstract—With increasing demand for running big data ana- ditional HPC workloads. Traditional HPC workloads perform lytics and machine learning workloads with diverse data types, reads from a large shared file stored in a file system, followed high performance computing (HPC) systems consequently need by writes to the file system in a parallel manner. In this case, to support diverse types of storage services. Ceph is one possible candidate for such HPC environments, as Ceph provides inter- a storage service should provide good sequential read/write faces for object, block, and file storage. Ceph, however, is not performance. However, Ceph divides large files into a number designed for HPC environments, thus it needs to be optimized of chunks and distributes them to several different disks. This for HPC workloads. In this paper, we find and analyze problems feature has two main problems in HPC environments: 1) Ceph that arise when running HPC workloads on Ceph, and propose translates sequential accesses into random accesses, 2) Ceph a novel optimization technique called F2FS-split, based on the F2FS file system and several other optimizations. We measure needs to manage many files, incurring high journal overheads the performance of Ceph in HPC environments, and show that in the underlying file system.
    [Show full text]
  • Proxmox Ve Mit Ceph &
    PROXMOX VE MIT CEPH & ZFS ZUKUNFTSSICHERE INFRASTRUKTUR IM RECHENZENTRUM Alwin Antreich Proxmox Server Solutions GmbH FrOSCon 14 | 10. August 2019 Alwin Antreich Software Entwickler @ Proxmox 15 Jahre in der IT als Willkommen! System / Netzwerk Administrator FrOSCon 14 | 10.08.2019 2/33 Proxmox Server Solutions GmbH Aktive Community Proxmox seit 2005 Globales Partnernetz in Wien (AT) Proxmox Mail Gateway Enterprise (AGPL,v3) Proxmox VE (AGPL,v3) Support & Services FrOSCon 14 | 10.08.2019 3/33 Traditionelle Infrastruktur FrOSCon 14 | 10.08.2019 4/33 Hyperkonvergenz FrOSCon 14 | 10.08.2019 5/33 Hyperkonvergente Infrastruktur FrOSCon 14 | 10.08.2019 6/33 Voraussetzung für Hyperkonvergenz CPU / RAM / Netzwerk / Storage Verwende immer genug von allem. FrOSCon 14 | 10.08.2019 7/33 FrOSCon 14 | 10.08.2019 8/33 Was ist ‚das‘? ● Ceph & ZFS - software-basierte Storagelösungen ● ZFS lokal / Ceph verteilt im Netzwerk ● Hervorragende Performance, Verfügbarkeit und https://ceph.io/ Skalierbarkeit ● Verwaltung und Überwachung mit Proxmox VE ● Technischer Support für Ceph & ZFS inkludiert in Proxmox Subskription http://open-zfs.org/ FrOSCon 14 | 10.08.2019 9/33 FrOSCon 14 | 10.08.2019 10/33 FrOSCon 14 | 10.08.2019 11/33 ZFS Architektur FrOSCon 14 | 10.08.2019 12/33 ZFS ARC, L2ARC and ZIL With ZIL Without ZIL ARC RAM ARC RAM ZIL ZIL Application HDD Application SSD HDD FrOSCon 14 | 10.08.2019 13/33 FrOSCon 14 | 10.08.2019 14/33 FrOSCon 14 | 10.08.2019 15/33 Ceph Network Ceph Docs: https://docs.ceph.com/docs/master/ FrOSCon 14 | 10.08.2019 16/33 FrOSCon
    [Show full text]
  • Red Hat Openstack* Platform with Red Hat Ceph* Storage
    Intel® Data Center Blocks for Cloud – Red Hat* OpenStack* Platform with Red Hat Ceph* Storage Reference Architecture Guide for deploying a private cloud based on Red Hat OpenStack* Platform with Red Hat Ceph Storage using Intel® Server Products. Rev 1.0 January 2017 Intel® Server Products and Solutions <Blank page> Intel® Data Center Blocks for Cloud – Red Hat* OpenStack* Platform with Red Hat Ceph* Storage Document Revision History Date Revision Changes January 2017 1.0 Initial release. 3 Intel® Data Center Blocks for Cloud – Red Hat® OpenStack® Platform with Red Hat Ceph Storage Disclaimers Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at Intel.com, or from the OEM or retailer. You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
    [Show full text]
  • Red Hat Data Analytics Infrastructure Solution
    TECHNOLOGY DETAIL RED HAT DATA ANALYTICS INFRASTRUCTURE SOLUTION TABLE OF CONTENTS 1 INTRODUCTION ................................................................................................................ 2 2 RED HAT DATA ANALYTICS INFRASTRUCTURE SOLUTION ..................................... 2 2.1 The evolution of analytics infrastructure ....................................................................................... 3 Give data scientists and data 2.2 Benefits of a shared data repository on Red Hat Ceph Storage .............................................. 3 analytics teams access to their own clusters without the unnec- 2.3 Solution components ...........................................................................................................................4 essary cost and complexity of 3 TESTING ENVIRONMENT OVERVIEW ............................................................................ 4 duplicating Hadoop Distributed File System (HDFS) datasets. 4 RELATIVE COST AND PERFORMANCE COMPARISON ................................................ 6 4.1 Findings summary ................................................................................................................................. 6 Rapidly deploy and decom- 4.2 Workload details .................................................................................................................................... 7 mission analytics clusters on 4.3 24-hour ingest ........................................................................................................................................8
    [Show full text]
  • Unlock Bigdata Analytic Efficiency with Ceph Data Lake
    Unlock Bigdata Analytic Efficiency With Ceph Data Lake Jian Zhang, Yong Fu, March, 2018 Agenda . Background & Motivations . The Workloads, Reference Architecture Evolution and Performance Optimization . Performance Comparison with Remote HDFS . Summary & Next Step 3 Challenges of scaling Hadoop* Storage BOUNDED Storage and Compute resources on Hadoop Nodes brings challenges Data Capacity Silos Costs Performance & efficiency Typical Challenges Data/Capacity Multiple Storage Silos Space, Spent, Power, Utilization Upgrade Cost Inadequate Performance Provisioning And Configuration Source: 451 Research, Voice of the Enterprise: Storage Q4 2015 *Other names and brands may be claimed as the property of others. 4 Options To Address The Challenges Compute and Large Cluster More Clusters Storage Disaggregation • Lacks isolation - • Cost of • Isolation of high- noisy neighbors duplicating priority workloads hinder SLAs datasets across • Shared big • Lacks elasticity - clusters datasets rigid cluster size • Lacks on-demand • On-demand • Can’t scale provisioning provisioning compute/storage • Can’t scale • compute/storage costs separately compute/storage costs scale costs separately separately Compute and Storage disaggregation provides Simplicity, Elasticity, Isolation 5 Unified Hadoop* File System and API for cloud storage Hadoop Compatible File System abstraction layer: Unified storage API interface Hadoop fs –ls s3a://job/ adl:// oss:// s3n:// gs:// s3:// s3a:// wasb:// 2006 2008 2014 2015 2016 6 Proposal: Apache Hadoop* with disagreed Object Storage SQL …… Hadoop Services • Virtual Machine • Container • Bare Metal HCFS Compute 1 Compute 2 Compute 3 … Compute N Object Storage Services Object Object Object Object • Co-located with gateway Storage 1 Storage 2 Storage 3 … Storage N • Dynamic DNS or load balancer • Data protection via storage replication or erasure code Disaggregated Object Storage Cluster • Storage tiering *Other names and brands may be claimed as the property of others.
    [Show full text]
  • Evaluation of Active Storage Strategies for the Lustre Parallel File System
    Evaluation of Active Storage Strategies for the Lustre Parallel File System Juan Piernas Jarek Nieplocha Evan J. Felix Pacific Northwest National Pacific Northwest National Pacific Northwest National Laboratory Laboratory Laboratory P.O. Box 999 P.O. Box 999 P.O. Box 999 Richland, WA 99352 Richland, WA 99352 Richland, WA 99352 [email protected] [email protected] [email protected] ABSTRACT umes of data remains a challenging problem. Despite the Active Storage provides an opportunity for reducing the improvements of storage capacities, the cost of bandwidth amount of data movement between storage and compute for moving data between the processing nodes and the stor- nodes of a parallel filesystem such as Lustre, and PVFS. age devices has not improved at the same rate as the disk ca- It allows certain types of data processing operations to be pacity. One approach to reduce the bandwidth requirements performed directly on the storage nodes of modern paral- between storage and compute devices is, when possible, to lel filesystems, near the data they manage. This is possible move computation closer to the storage devices. Similarly by exploiting the underutilized processor and memory re- to the processing-in-memory (PIM) approach for random ac- sources of storage nodes that are implemented using general cess memory [16], the active disk concept was proposed for purpose servers and operating systems. In this paper, we hard disk storage systems [1, 15, 24]. The active disk idea present a novel user-space implementation of Active Storage exploits the processing power of the embedded hard drive for Lustre, and compare it to the traditional kernel-based controller to process the data on the disk without the need implementation.
    [Show full text]
  • Design and Implementation of Archival Storage Component of OAIS Reference Model
    Masaryk University Faculty of Informatics Design and implementation of Archival Storage component of OAIS Reference Model Master’s Thesis Jan Tomášek Brno, Spring 2018 Masaryk University Faculty of Informatics Design and implementation of Archival Storage component of OAIS Reference Model Master’s Thesis Jan Tomášek Brno, Spring 2018 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Jan Tomášek Advisor: doc. RNDr. Tomáš Pitner Ph.D. i Acknowledgement I would like to express my gratitude to my supervisor, doc. RNDr. Tomáš Pitner, Ph.D. for valuable advice concerning formal aspects of my thesis, readiness to help and unprecedented responsiveness. Next, I would like to thank to RNDr. Miroslav Bartošek, CSc. for providing me with long-term preservation literature, to RNDr. Michal Růžička, Ph.D. for managing the very helpful internal documents of the ARCLib project, and to the whole ARCLib team for the great col- laboration and willingness to meet up and discuss questions emerged during the implementation. I would also like to thank to my colleagues from the inQool com- pany for providing me with this opportunity, help in solving of practi- cal issues, and for the time flexibility needed to finish this work. Last but not least, I would like to thank to my closest family and friends for all the moral support. iii Abstract This thesis deals with the development of the Archival Storage module of the Reference Model for an Open Archival Information System.
    [Show full text]
  • Ceph Done Right for Openstack
    Solution Brief: + Ceph done right for OpenStack The rise of OpenStack For years, the desire to standardize on an OpenStack deployments vary drastically open platform and adopt uniform APIs was as business and application needs vary the primary driver behind OpenStack’s rise By its nature and by design from the outset, OpenStack introduces across public and private clouds. choice for every aspect and component that comprises the cloud system. Cloud architects who deploy OpenStack enjoy Deployments continue to grow rapidly; with more and larger a cloud platform that is flexible, open and customizable. clouds, deep adoption throughout users’ cloud infrastructure and maturing technology as clouds move into production. Compute, network and storage resources are selected by business requirement and application needs. For storage needs, OpenStack OpenStack’s top attributes are, not surprisingly, shared by the most cloud architects typically choose Ceph as the storage system for its popular storage software for deployments: Ceph. With Ceph, users scalability, versatility and rich OpenStack support. get all the benefits of open source software, along with interfaces for object, block and file-level storage that give OpenStack what While OpenStack cloud architects find Ceph integrates nicely it needs to run at its best. Plus, the combination of OpenStack without too much effort, actually configuring Ceph to perform and Ceph enables clouds to get faster and more reliable requires experienced expertise in Ceph Software Defined Storage the larger they get. (SDS) itself, hardware architecture, storage device selection and networking technologies. Once applications are deployed It sounds like a match made in heaven - and it is - but it can also be within an OpenStack cloud configured for Ceph SDS, running a challenge if you have your heart set on a DIY approach.
    [Show full text]
  • Shared File Systems: Determining the Best Choice for Your Distributed SAS® Foundation Applications Margaret Crevar, SAS Institute Inc., Cary, NC
    Paper SAS569-2017 Shared File Systems: Determining the Best Choice for your Distributed SAS® Foundation Applications Margaret Crevar, SAS Institute Inc., Cary, NC ABSTRACT If you are planning on deploying SAS® Grid Manager and SAS® Enterprise BI (or other distributed SAS® Foundation applications) with load balanced servers on multiple operating systems instances, , a shared file system is required. In order to determine the best shared file system choice for a given deployment, it is important to understand how the file system is used, the SAS® I/O workload characteristics performed on it, and the stressors that SAS Foundation applications produce on the file system. For the purposes of this paper, we use the term "shared file system" to mean both a clustered file system and shared file system, even though" shared" can denote a network file system and a distributed file system – not clustered. INTRODUCTION This paper examines the shared file systems that are most commonly used with SAS and reviews their strengths and weaknesses. SAS GRID COMPUTING REQUIREMENTS FOR SHARED FILE SYSTEMS Before we get into the reasons why a shared file system is needed for SAS® Grid Computing, let’s briefly discuss the SAS I/O characteristics. GENERAL SAS I/O CHARACTERISTICS SAS Foundation creates a high volume of predominately large-block, sequential access I/O, generally at block sizes of 64K, 128K, or 256K, and the interactions with data storage are significantly different from typical interactive applications and RDBMSs. Here are some major points to understand (more details about the bullets below can be found in this paper): SAS tends to perform large sequential Reads and Writes.
    [Show full text]
  • Cs 5412/Lecture 24. Ceph: a Scalable High-Performance Distributed File System
    CS 5412/LECTURE 24. CEPH: A Ken Birman SCALABLE HIGH-PERFORMANCE Spring, 2019 DISTRIBUTED FILE SYSTEM HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2019SP 1 HDFS LIMITATIONS Although many applications are designed to use the normal “POSIX” file system API (operations like file create/open, read/write, close, rename/replace, delete, and snapshot), some modern applications find POSIX inefficient. Some main issues: HDFS can handle big files, but treats them as sequences of fixed-size blocks. Many application are object-oriented HDFS lacks some of the “file system management” tools big-data needs HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2019SP 2 CEPH PROJECT Created by Sage Weihl, a PhD student at U.C. Santa Cruz Later became a company and then was acquired into Red Hat Linux Now the “InkStack” portion of Linux offers Ceph plus various tools to leverage it, and Ceph is starting to replace HDFS worldwide. Ceph is similar in some ways to HDFS but unrelated to it. Many big data systems are migrating to the system. HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2019SP 3 CEPH HAS THREE “APIS” First is the standard POSIX file system API. You can use Ceph in any situation where you might use GFS, HDFS, NFS, etc. Second, there are extensions to POSIX that allow Ceph to offer better performance in supercomputing systems, like at CERN. Finally, Ceph has a lowest layer called RADOS that can be used directly as a key-value object store. HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2019SP 4 WHY TALK DIRECTLY TO RADOS? SERIALIZATION/DESERIALIZATION! When an object is in memory, the data associated with it is managed by the class (or type) definition, and can include pointers, fields with gaps or other “subtle” properties, etc.
    [Show full text]
  • Openzfs on Linux Hepix 2014 October 16, 2014 Brian Behlendorf [email protected]
    OpenZFS on Linux HEPiX 2014 October 16, 2014 Brian Behlendorf [email protected] LLNL-PRES-XXXXXX This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC 1 Lawrence Livermore National Laboratory LLNL-PRES-xxxxxx High Performance Computing Advanced Simulation Massive Scale Data Intensive Top 500 (June 2014) #3 Sequoia 20.1 Peak PFLOP/s 1,572,864 cores 55 PB of storage at 850 GB/s #9 Vulcan 5.0 Peak PFLOPS/s 393,216 cores 6.7 PB of storage at 106 GB/s World class computing resources 2 Lawrence Livermore National Laboratory LLNL-PRES-xxxxxx Linux Clusters 2001 – First Linux cluster Today – ~10 large Linux clusters (100– 3,000 nodes) and 10-20 smaller ones Near-commodity hardware Open source software stack (TOSS) Tri-Labratory Operating System Stack Modified RHEL targeted for HPC RHEL kernel optimized for clusters Moab and SLURM for scheduling Lustre parallel filesystem Additional packages for monitoring, power/console, compilers, etc LLNL Loves Linux 3 Lawrence Livermore National Laboratory LLNL-PRES-xxxxxx Lustre Lustre is our parallel filesystem of choice Livermore Computing Filesystems: Open – 20PB in 5 filesystems Secure – 22PB in 3 filesystems Plus the 55PB Sequoia filesystem 1000+ storage servers Billions of files Users want: Access to their data from every cluster Good performance High availability How do we design a system to handle this? Lustre is used extensively
    [Show full text]
  • Comparative Analysis of Distributed and Parallel File Systems' Internal Techniques
    Comparative Analysis of Distributed and Parallel File Systems’ Internal Techniques Viacheslav Dubeyko Content 1 TERMINOLOGY AND ABBREVIATIONS ................................................................................ 4 2 INTRODUCTION......................................................................................................................... 5 3 COMPARATIVE ANALYSIS METHODOLOGY ....................................................................... 5 4 FILE SYSTEM FEATURES CLASSIFICATION ........................................................................ 5 4.1 Distributed File Systems ............................................................................................................................ 6 4.1.1 HDFS ..................................................................................................................................................... 6 4.1.2 GFS (Google File System) ....................................................................................................................... 7 4.1.3 InterMezzo ............................................................................................................................................ 9 4.1.4 CodA .................................................................................................................................................... 10 4.1.5 Ceph.................................................................................................................................................... 12 4.1.6 DDFS ..................................................................................................................................................
    [Show full text]