Failure-Atomic Updates of Application Data in a Linux File System

Total Page:16

File Type:pdf, Size:1020Kb

Failure-Atomic Updates of Application Data in a Linux File System Failure-Atomic Updates of Application Data in a Linux File System Rajat Verma and Anton Ajay Mendez, Hewlett-Packard; Stan Park, Hewlett-Packard Labs; Sandya Mannarswamy, Hewlett-Packard; Terence Kelly and Charles B. Morrey III, Hewlett-Packard Labs https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma This paper is included in the Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST ’15). February 16–19, 2015 • Santa Clara, CA, USA ISBN 978-1-931971-201 Open access to the Proceedings of the 13th USENIX Conference on File and Storage Technologies is sponsored by USENIX Failure-Atomic Updates of Application Data in a Linux File System Rajat Verma1 Anton Ajay Mendez1 Stan Park2 Sandya Mannarswamy1 Terence Kelly2 Charles B. Morrey III2 1Hewlett-Packard Storage Division 2Hewlett-Packard Laboratories Abstract erroneous behavior under power failures; the proprietary commercial databases tested lose data [36]. We present the design, implementation, and evaluation File systems strive to protect their internal metadata of a file system mechanism that protects the integrity of from corruption, but most offer no corresponding protec- application data from failures such as process crashes, tion for application data, providing neither transactions kernel panics, and power outages. A simple interface of- on application data nor any other unified solution to the fers applications a guarantee that the application data in a CMADD problem. Instead, file systems offer primitives file always reflects the most recent successful fsync or for controlling the order in which application data attains msync operation on the file. Our file system furthermore durability; applications shoulder the burden of restoring offers a new syncv mechanism that failure-atomically consistency to their data following failures. Added to commits changes to multiple files. Failure-injection tests the inconvenience and expense of implementing correct verify that our file system protects the integrity of ap- recovery is the inefficiency of the sequences of primi- plication data from crashes and performance measure- tive operations required for complex updates: Consider, ments confirm that our implementation is efficient. Our for example, the chore of failure-atomically updating a file system runs on conventional hardware and unmod- set of files scattered throughout a POSIX-like file sys- ified Linux kernels and will be released commercially. tem. Remarkably, the vast majority of file systems do We believe that our mechanism is implementable in any not provide the straightforward operation that CMADD file system that supports per-file writable snapshots. demands: the ability to modify application data in (sets of) files failure-atomically and efficiently. 1 Introduction We present the design, implementation, and evalua- tion of failure-atomic application data updates in HP’s Many applications modify data on durable media, and Advanced File System (AdvFS), a modern industrial- failures during updates—application process crashes, OS strength Linux file system derived from DEC’s Tru64 kernel panics, and power outages—jeopardize the in- file system [1]. AdvFS provides a simple interface tegrity of the application data. We therefore require so- that generalizes failure-atomic variants of writev [8] lutions to the fundamental problem of consistent modifi- and msync [20]: If a file is opened with a new cation of application durable data (CMADD), i.e., the O_ATOMIC flag, the state of its application data will al- problem of evolving durable application data without ways reflect the most recent successful msync, fsync, fear that failure will preclude recovery to a consistent or fdatasync. AdvFS furthermore includes a new state. syncv operation that combines updates to multiple files Existing mechanisms provide imperfect support for into a failure-atomic bundle, comparable to the multi- solving the CMADD problem. Relational databases of- file transaction support in Windows Vista TxF [17] and fer ACID transactions; similarly, many key-value stores TxOS [22] but much simpler than the former and more allow failure-atomic bundling of updates [2,13,14]. De- capable than the latter. The size of transactional updates spite the obvious attractions of transactions, both kinds in AdvFS is limited only by the free space in the file sys- of databases can lead to two difficulties: First, in- tem. AdvFS requires no special hardware and runs on memory data structures do not always translate conve- unmodified Linux kernels. niently or efficiently to and from database formats; re- The remainder of this paper is organized as follows: peated attempts to smooth over the “impedance mis- Section 2 situates our contributions in the context of prior match” between data formats have met with limited suc- work. Section 3 describes AdvFS and the features that cess [19]. Second, the complexity of modern databases made it possible to implement failure-atomic updates of offers fertile ground for implementation bugs that negate application data. Section 4 presents experimental evalua- the promise of ACID: A recent study has shown that tions of both the correctness and performance of AdvFS, widely used key-value and relational databases exhibit and Section 5 concludes with a discussion. USENIX Association 13th USENIX Conference on File and Storage Technologies (FAST ’15) 203 2 Related Work support inter-process isolation even in the presence of non-transactional accesses by legacy applications [25]. Most widely deployed mainstream file systems offer only The price that transaction-aware applications pay for this limited and indirect support for consistent modification sophisticated support includes a substantial burden of of application durable data (CMADD).1 Semantically logging: Applications must perform a Log Append weak OS interfaces are partly to blame. For example, syscall prior to modifying a page of a file within a trans- POSIX permits write to succeed partially, making it action, which is awkward at best for the important case difficult to define atomic semantics for this call [30]. of random STOREs to a memory-mapped file. Synchronization calls such as fsync and msync con- An attractive approach to the CMADD problem on strain the order in which application data reaches durable emerging durable media is a persistent heap support- media, and recent research has proposed decoupling or- ing atomic updates via a transactional memory (TM) dering from durability [5]. However applications re- interface. Mnemosyne [31] and Hathi [24] imple- main responsible for building CMADD solutions (e.g., ment such mechanisms for byte-addressable non-volatile atomicity mechanisms) atop ordering primitives and for memory (NVM) and flash storage, respectively. Per- reconstructing a consistent state of application data fol- sistent heaps obviate the need for separate in-memory lowing a crash. Experience has shown that custom recov- and durable data formats: Applications simply manipu- ery code is difficult to write and prone to bugs. Some- late in-memory data structures using LOAD and STORE times applications circumvent the need for recovery by instructions, which seems especially natural for byte- using the one failure-atomicity mechanism provided in addressable NVM. One limitation of these systems is that conventional file systems: file rename [11]. For ex- they do not support conventional file operations; another ample, desktop applications can open a temporary file, is that they are tailored to specific durable media. Fi- write the entire modified contents of a file to it, then use nally, they employ software TM, which carries substan- rename (or a specialized equivalent [7, 23]) to imple- tial overheads. ment an atomic file update—a reasonable expedient for Persistent heaps can be implemented for conventional small files but untenable for large ones. block storage and need not employ TM. Recent exam- FusionIO provides an elegant and efficient mechanism ples include Software Persistent Memory (SoftPM) [9] for solving the CMADD problem for data on their flash- and Ken [34], whose persistent heaps expose malloc- based storage devices: a failure-atomic writev sup- like interfaces and support atomic checkpointing. Such ported by the flash translation layer [8]. The MySQL approaches provide ergonomic benefits and are com- database exploits this new mechanism to eliminate patible with conventional hardware, but their atomic- application-level double writes and thereby improve per- update mechanisms entail substantial complexity and formance substantially [29]. Still more impressive gains overheads. For example, SoftPM automatically copies are available to applications architected from scratch volatile data into persistent containers as necessary around the new mechanism. For example, a key-value through a novel hybrid of static and dynamic pointer store designed to exploit the new feature achieves both analysis, making development easier and less error- performance and flash endurance benefits [15]. The lim- prone. However SoftPM tracks data modification in itations of failure-atomic writev are that it requires coarse-grained chunks of 512 KB or larger, which can special hardware, applies only to single-file updates, and lead to write amplification at the storage layer. Ken’s does not address modifications to memory-mapped files. user-space persistent heap tracks modifications at 4 KB Fully general support for failure-atomic bundles of file memory-page granularity, which may reduce write am- modifications is surprisingly rare. Windows
Recommended publications
  • ZFS 2018 and Onward
    SEE TEXT ONLY 2018 and Onward BY ALLAN JUDE 2018 is going to be a big year for ZFS RAID-Z Expansion not only FreeBSD, but for OpenZFS • 2018Q4 as well. OpenZFS is the collaboration of developers from IllumOS, FreeBSD, The ability to add an additional disk to an existing RAID-Z VDEV to grow its size with- Linux, Mac OS X, many industry ven- out changing its redundancy level. For dors, and a number of other projects example, a RAID-Z2 consisting of 6 disks to maintain and advance the open- could be expanded to contain 7 disks, source version of Sun’s ZFS filesystem. increasing the available space. This is accomplished by reflowing the data across Improved Pool Import the disks, so as to not change an individual • 2018Q1 block’s offset from the start of the VDEV. The new disk will appear as a new column on the A new patch set changes the way pools are right, and the data will be relocated to main- imported, such that they depend less on con- tain the offset within the VDEV. Similar to figuration data from the system. Instead, the reflowing a paragraph by making it wider, possibly outdated configuration from the sys- without changing the order of the words. In tem is used to find the pool, and partially open the end this means all of the new space shows it read-only. Then the correct on-disk configura- up at the end of the disk, without any frag- tion is loaded. This reduces the number of mentation.
    [Show full text]
  • Advfs Command Line and Application Programming Interface
    AdvFS Command Line and Application Programming Interface External Reference Specification Version 1.13 JA CASL Not Inspected/Date Inspected Building ZK3 110 Spit Brook Road Nashua, NH 03062 Copyright (C) 2008 Hewlett-Packard Development Company, L.P. Page 1 of 65 Preface Version 1.4 of the AdvFS Command Line and API External Reference Specification is being made available to all partners in order to allow them to design and implement code meeting the specifications contained herein. 1 INTRODUCTION ....................................................................................................................................................................4 1.1 ABSTRACT .........................................................................................................................................................................4 1.2 AUDIENCE ..........................................................................................................................................................................4 1.3 TERMS AND DEFINITIONS ...................................................................................................................................................4 1.4 RELATED DOCUMENTS ......................................................................................................................................................4 1.5 ACKNOWLEDGEMENTS ......................................................................................................................................................4
    [Show full text]
  • Best Practices for Openzfs L2ARC in the Era of Nvme
    Best Practices for OpenZFS L2ARC in the Era of NVMe Ryan McKenzie iXsystems 2019 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved. 1 Agenda ❏ Brief Overview of OpenZFS ARC / L2ARC ❏ Key Performance Factors ❏ Existing “Best Practices” for L2ARC ❏ Rules of Thumb, Tribal Knowledge, etc. ❏ NVMe as L2ARC ❏ Testing and Results ❏ Revised “Best Practices” 2019 Storage Developer Conference. © iXsystems. All Rights Reserved. 2 ARC Overview ● Adaptive Replacement Cache ● Resides in system memory NIC/HBA ● Shared by all pools ● Used to store/cache: ARC OpenZFS ○ All incoming data ○ “Hottest” data and Metadata (a tunable ratio) SLOG vdevs ● Balances between ○ Most Frequently Used (MFU) Data vdevs L2ARC vdevs ○ Most Recently zpool Used (MRU) FreeBSD + OpenZFS Server 2019 Storage Developer Conference. © iXsystems. All Rights Reserved. 3 L2ARC Overview ● Level 2 Adaptive Replacement Cache ● Resides on one or more storage devices NIC/HBA ○ Usually Flash ● Device(s) added to pool ARC OpenZFS ○ Only services data held by that pool ● Used to store/cache: ○ “Warm” data and SLOG vdevs metadata ( about to be evicted from ARC) Data vdevs L2ARC vdevs ○ Indexes to L2ARC zpool blocks stored in ARC headers FreeBSD + OpenZFS Server 2019 Storage Developer Conference. © iXsystems. All Rights Reserved. 4 ZFS Writes ● All writes go through ARC, written blocks are “dirty” until on stable storage ACK NIC/HBA ○ async write ACKs immediately ARC OpenZFS ○ sync write copied to ZIL/SLOG then ACKs ○ copied to data SLOG vdevs vdev in TXG ● When no longer dirty, written blocks stay in Data vdevs L2ARC vdevs ARC and move through zpool MRU/MFU lists normally FreeBSD + OpenZFS Server 2019 Storage Developer Conference.
    [Show full text]
  • Petascale Data Management: Guided by Measurement
    Petascale Data Management: Guided by Measurement petascale data storage institute www.pdsi-scidac.org/ MPP2 www.pdsi-scidac.org MEMBER ORGANIZATIONS & Lustre • Los Alamos National Laboratory – institute.lanl.gov/pdsi/ • Parallel Data Lab, Carnegie Mellon University – www.pdl.cmu.edu/ • Oak Ridge National Laboratory – www.csm.ornl.gov/ • Sandia National Laboratories – www.sandia.gov/ • National Energy Research Scientific Computing Center • Center for Information Technology Integration, U. of Michigan pdsi.nersc.gov/ www.citi.umich.edu/projects/pdsi/ • Pacific Northwest National Laboratory – www.pnl.gov/ • University of California at Santa Cruz – www.pdsi.ucsc.edu/ The Computer Failure Data Repository Filesystems Statistics Survey • Goal: to collect and make available failure data from a large variety of sites GOALS • Better understanding of the characteristics of failures in the real world • Gather & build large DB of static filetree summary • Now maintained by USENIX at cfdr.usenix.org/ • Build small, non-invasive, anonymizing stats gather tool • Distribute fsstats tool via easily used web site Red Storm NAME SYSTEM TYPE SYSTEM SIZE TIME PERIOD TYPE OF DATA • Encourage contributions (output of tool) from many FSs Any node • Offer uploaded statistics & summaries to public & Lustre 22 HPC clusters 5000 nodes 9 years outage . Label Date Type File Total Size Total Space # files # dirs max size max space max dir max name avg file avg dir . 765 nodes (2008) System TB TB M K GB GB ents bytes MB ents . 1 HPC cluster 5 years PITTSBURGH 3,400 disks
    [Show full text]
  • Oracle Database Administrator's Reference for UNIX-Based Operating Systems
    Oracle® Database Administrator’s Reference 10g Release 2 (10.2) for UNIX-Based Operating Systems B15658-06 March 2009 Oracle Database Administrator's Reference, 10g Release 2 (10.2) for UNIX-Based Operating Systems B15658-06 Copyright © 2006, 2009, Oracle and/or its affiliates. All rights reserved. Primary Author: Brintha Bennet Contributing Authors: Kevin Flood, Pat Huey, Clara Jaeckel, Emily Murphy, Terri Winters, Ashmita Bose Contributors: David Austin, Subhranshu Banerjee, Mark Bauer, Robert Chang, Jonathan Creighton, Sudip Datta, Padmanabhan Ganapathy, Thirumaleshwara Hasandka, Joel Kallman, George Kotsovolos, Richard Long, Rolly Lv, Padmanabhan Manavazhi, Matthew Mckerley, Sreejith Minnanghat, Krishna Mohan, Rajendra Pingte, Hanlin Qian, Janelle Simmons, Roy Swonger, Lyju Vadassery, Douglas Williams This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this software or related documentation is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable: U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S.
    [Show full text]
  • Referência Debian I
    Referência Debian i Referência Debian Osamu Aoki Referência Debian ii Copyright © 2013-2021 Osamu Aoki Esta Referência Debian (versão 2.85) (2021-09-17 09:11:56 UTC) pretende fornecer uma visão geral do sistema Debian como um guia do utilizador pós-instalação. Cobre muitos aspetos da administração do sistema através de exemplos shell-command para não programadores. Referência Debian iii COLLABORATORS TITLE : Referência Debian ACTION NAME DATE SIGNATURE WRITTEN BY Osamu Aoki 17 de setembro de 2021 REVISION HISTORY NUMBER DATE DESCRIPTION NAME Referência Debian iv Conteúdo 1 Manuais de GNU/Linux 1 1.1 Básico da consola ................................................... 1 1.1.1 A linha de comandos da shell ........................................ 1 1.1.2 The shell prompt under GUI ......................................... 2 1.1.3 A conta root .................................................. 2 1.1.4 A linha de comandos shell do root ...................................... 3 1.1.5 GUI de ferramentas de administração do sistema .............................. 3 1.1.6 Consolas virtuais ............................................... 3 1.1.7 Como abandonar a linha de comandos .................................... 3 1.1.8 Como desligar o sistema ........................................... 4 1.1.9 Recuperar uma consola sã .......................................... 4 1.1.10 Sugestões de pacotes adicionais para o novato ................................ 4 1.1.11 Uma conta de utilizador extra ........................................ 5 1.1.12 Configuração
    [Show full text]
  • Openzfs on Linux Hepix 2014 October 16, 2014 Brian Behlendorf [email protected]
    OpenZFS on Linux HEPiX 2014 October 16, 2014 Brian Behlendorf [email protected] LLNL-PRES-XXXXXX This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC 1 Lawrence Livermore National Laboratory LLNL-PRES-xxxxxx High Performance Computing Advanced Simulation Massive Scale Data Intensive Top 500 (June 2014) #3 Sequoia 20.1 Peak PFLOP/s 1,572,864 cores 55 PB of storage at 850 GB/s #9 Vulcan 5.0 Peak PFLOPS/s 393,216 cores 6.7 PB of storage at 106 GB/s World class computing resources 2 Lawrence Livermore National Laboratory LLNL-PRES-xxxxxx Linux Clusters 2001 – First Linux cluster Today – ~10 large Linux clusters (100– 3,000 nodes) and 10-20 smaller ones Near-commodity hardware Open source software stack (TOSS) Tri-Labratory Operating System Stack Modified RHEL targeted for HPC RHEL kernel optimized for clusters Moab and SLURM for scheduling Lustre parallel filesystem Additional packages for monitoring, power/console, compilers, etc LLNL Loves Linux 3 Lawrence Livermore National Laboratory LLNL-PRES-xxxxxx Lustre Lustre is our parallel filesystem of choice Livermore Computing Filesystems: Open – 20PB in 5 filesystems Secure – 22PB in 3 filesystems Plus the 55PB Sequoia filesystem 1000+ storage servers Billions of files Users want: Access to their data from every cluster Good performance High availability How do we design a system to handle this? Lustre is used extensively
    [Show full text]
  • Performance and Scalability of a Sensor Data Storage Framework
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Aaltodoc Publication Archive Aalto University School of Science Degree Programme in Computer Science and Engineering Paul Tötterman Performance and Scalability of a Sensor Data Storage Framework Master’s Thesis Helsinki, March 10, 2015 Supervisor: Associate Professor Keijo Heljanko Advisor: Lasse Rasinen M.Sc. (Tech.) Aalto University School of Science ABSTRACT OF Degree Programme in Computer Science and Engineering MASTER’S THESIS Author: Paul Tötterman Title: Performance and Scalability of a Sensor Data Storage Framework Date: March 10, 2015 Pages: 67 Major: Software Technology Code: T-110 Supervisor: Associate Professor Keijo Heljanko Advisor: Lasse Rasinen M.Sc. (Tech.) Modern artificial intelligence and machine learning applications build on analysis and training using large datasets. New research and development does not always start with existing big datasets, but accumulate data over time. The same storage solution does not necessarily cover the scale during the lifetime of the research, especially if scaling up from using common workgroup storage technologies. The storage infrastructure at ZenRobotics has grown using standard workgroup technologies. The current approach is starting to show its limits, while the storage growth is predicted to continue and accelerate. Successful capacity planning and expansion requires a better understanding of the patterns of the use of storage and its growth. We have examined the current storage architecture and stored data from different perspectives in order to gain a better understanding of the situation. By performing a number of experiments we determine key properties of the employed technologies. The combination of these factors allows us to make informed decisions about future storage solutions.
    [Show full text]
  • Guido Van Rossum on PYTHON 3 Get Your Sleep From
    JavaScript | Inform 6 & 7 | Falcon | Sleep | Enlightenment | PHP LINUX JOURNAL ™ THE STATE OF LINUX AUDIO SOFTWARE LANGUAGES Since 1994: The Original Magazine of the Linux Community OCTOBER 2008 | ISSUE 174 Inform 7 REVIEWED JavaScript | Inform 6 & 7 | Falcon | Sleep Enlightenment PHP Audio | Inform 6 & 7 Falcon JavaScript Don’t Get Eaten HP Media by a Grue! Vault 5150 Scalent’s Managing Virtual Operating PHP Code Environment Guido van Rossum on PYTHON 3 Get Your Sleep from OCTOBER Java www.linuxjournal.com 2008 $5.99US $5.99CAN 10 ISSUE Martin Messner Enlightenment E17 Insights from SUSE’s Lightweight Alternative 174 + Security Team Lead to KDE and GNOME 0 09281 03102 4 MULTIPLY ENERGY EFFICIENCY AND MAXIMIZE COOLING. THE WORLD’S FIRST QUAD-CORE PROCESSOR FOR MAINSTREAM SERVERS. THE NEW QUAD-CORE INTEL® XEON® PROCESSOR 5300 SERIES DELIVERS UP TO 50% 1 MORE PERFORMANCE*PERFORMANCE THAN PREVIOUS INTEL XEON PROCESSORS IN THE SAME POWERPOWER ENVELOPE.ENVELOPE. BASEDBASED ONON THETHE ULTRA-EFFICIENTULTRA-EFFICIENT INTEL®INTEL® CORE™CORE™ MICROMICROARCHITECTURE, ARCHITECTURE IT’S THE ULTIMATE SOLUTION FOR MANAGING RUNAWAY COOLING EXPENSES. LEARN WHYWHY GREAT GREAT BUSINESS BUSINESS COMPUTING COMPUTING STARTS STARTS WITH WITH INTEL INTEL INSIDE. INSIDE. VISIT VISIT INTEL.CO.UK/XEON INTEL.COM/XEON. RELION 2612 s 1UAD #ORE)NTEL®8EON® RELION 1670 s 1UAD #ORE)NTEL®8EON® PROCESSOR PROCESSOR s 5SERVERWITHUPTO4" s )NTEL@3EABURG CHIPSET s )DEALFORCOST EFFECTIVE&ILE$" WITH-(ZFRONTSIDEBUS APPLICATIONS s 5PTO'"2!-IN5CLASS s 2!32ELIABILITY !VAILABILITY LEADINGMEMORYCAPACITY 3ERVICEABILITY s -ANAGEMENTFEATURESTOSUPPORT LARGECLUSTERDEPLOYMENTS 34!24).'!4$2429.00 34!24).'!4$1969.00 Penguin Computing provides turnkey x86/Linux clusters for high performance technical computing applications.
    [Show full text]
  • Ted Ts'o on Linux File Systems
    Ted Ts’o on Linux File Systems An Interview RIK FARROW Rik Farrow is the Editor of ;login:. ran into Ted Ts’o during a tutorial luncheon at LISA ’12, and that later [email protected] sparked an email discussion. I started by asking Ted questions that had I puzzled me about the early history of ext2 having to do with the perfor- mance of ext2 compared to the BSD Fast File System (FFS). I had met Rob Kolstad, then president of BSDi, because of my interest in the AT&T lawsuit against the University of California and BSDi. BSDi was being sued for, among other things, Theodore Ts’o is the first having a phone number that could be spelled 800-ITS-UNIX. I thought that it was important North American Linux for the future of open source operating systems that AT&T lose that lawsuit. Kernel Developer, having That said, when I compared the performance of early versions of Linux to the current version started working with Linux of BSDi, I found that they were closely matched, with one glaring exception. Unpacking tar in September 1991. He also archives using Linux (likely .9) was blazingly fast compared to BSDi. I asked Rob, and he served as the tech lead for the MIT Kerberos explained that the issue had to do with synchronous writes, finally clearing up a mystery for me. V5 development team, and was the architect at IBM in charge of bringing real-time Linux Now I had a chance to ask Ted about the story from the Linux side, as well as other questions in support of real-time Java to the US Navy.
    [Show full text]
  • ZFS Basics Various ZFS RAID Lebels = & . = a Little About Zetabyte File System ( ZFS / Openzfs)
    = BDNOG11 | Cox's Bazar | Day # 3 | ZFS Basics & Various ZFS RAID Lebels. = = ZFS Basics & Various ZFS RAID Lebels . = A Little About Zetabyte File System ( ZFS / OpenZFS) =1= = BDNOG11 | Cox's Bazar | Day # 3 | ZFS Basics & Various ZFS RAID Lebels. = Disclaimer: The scope of this topic here is not to discuss in detail about the architecture of the ZFS (OpenZFS) rather Features, Use Cases and Operational Method. =2= = BDNOG11 | Cox's Bazar | Day # 3 | ZFS Basics & Various ZFS RAID Lebels. = What is ZFS? =0= ZFS is a combined file system and logical volume manager designed by Sun Microsystems and now owned by Oracle Corporation. It was designed and implemented by a team at Sun Microsystems led by Jeff Bonwick and Matthew Ahrens. Matt is the founding member of OpenZFS. =0= Its development started in 2001 and it was officially announced in 2004. In 2005 it was integrated into the main trunk of Solaris and released as part of OpenSolaris. =0= It was described by one analyst as "the only proven Open Source data-validating enterprise file system". This file =3= = BDNOG11 | Cox's Bazar | Day # 3 | ZFS Basics & Various ZFS RAID Lebels. = system also termed as the “world’s safest file system” by some analyst. =0= ZFS is scalable, and includes extensive protection against data corruption, support for high storage capacities, efficient data compression and deduplication. =0= ZFS is available for Solaris (and its variants), BSD (and its variants) & Linux (Canonical’s Ubuntu Integrated as native kernel module from version 16.04x) =0= OpenZFS was announced in September 2013 as the truly open source successor to the ZFS project.
    [Show full text]
  • Matt Ahrens [email protected] Brian Behlendorf Behlendorf1@Llnl
    LinuxCon 2013 September 17, 2013 Brian Behlendorf, Open ZFS on Linux LLNL-PRES-643675 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC High Performance Computing ● Advanced Simulation – Massive Scale – Data Intensive ● Top 500 – #3 Sequoia ● 20.1 Peak PFLOP/s ● 1,572,864 cores ● 55 PB of storage at 850 GB/s – #8 Vulcan ● 5.0 Peak PFLOPS/s ● 393,216 cores ● 6.7 PB of storage at 106 GB/s ● ● World class computing resources 2 Lawrence Livermore National Laboratory LLNL-PRES-xxxxxx Linux Clusters ● First Linux cluster deployed in 2001 ● Near-commodity hardware ● Open Source ● Clustered HiGh Availability OperatinG System (CHAOS) – Modified RHEL Kernel – New packaGes – monitorinG, power/console, compilers, etc – Lustre Parallel Filesystem – ZFS Filesystem LLNL Loves Linux 3 Lawrence Livermore National Laboratory LLNL-PRES-643675 Lustre Filesystem ● Massively parallel distributed filesystem ● Lustre servers use a modified ext4 – Stable and fast, but... – No sCalability – No data integrity – No online manageability ● Something better was needed – Use XFS, BTRFS, etC? – Write a filesystem from sCratCh? – Port ZFS to Linux? Existing Linux filesystems do not meet our requirements 4 Lawrence Livermore National Laboratory LLNL-PRES-643675 ZFS on Linux History ● 2008 – Prototype to determine viability ● 2009 – Initial ZVOL and Lustre support ● 2010 – Development moved to Github ● 2011 – POSIX layer
    [Show full text]