Lustre Today

Lustre Today, Lustre – the Future
Peter Braam, Cluster File Systems, Inc.

Lustre Today

• Scalable to 10,000s of CPUs – the scalable architecture gives users more than enough headroom for future growth through simple, incremental hardware addition.
• Supports multiple network protocols – native high-performance network I/O guarantees fast client- and server-side I/O; routing enables a multi-cluster, multi-network namespace.
• SW-only platform – hardware agnosticism gives users flexibility of choice and eliminates vendor lock-in.
• Proven capability – experience in multiple industries and hundreds of deployed systems demonstrate production quality that can exceed 99% uptime.
• Failover capable – Lustre's failover protocols eliminate single points of failure and increase system reliability.
• POSIX compliant – users benefit from strict semantics and data coherency.
• Broad industry support – with the support of many leading OEMs and a robust user base, Lustre is an emerging file system standard with a very promising future.

Lustre Architecture Today

[Architecture diagram: Lustre clients (10,000s) reach the Lustre metadata servers (MDS 1 active, MDS 2 standby, paired for failover) and the Lustre object storage servers (OSS 1–7, scaling to 100s) over multiple simultaneously supported networks – Quadrics Elan, Myrinet GM/MX, InfiniBand, Gigabit Ethernet, and 10Gb Ethernet. OSS nodes sit on either direct-connected commodity storage arrays or enterprise-class storage arrays and SAN fabrics.]

Lustre Today: Version 1.4.6

Scalability        Clients: 10,000 (130,000 CPUs); object storage servers: 256 OSSs;
                   metadata servers: 1 (+ failover); files: 500M
Performance        Single server: 2 GB/s+; single client: 2 GB/s+;
                   single namespace: 100 GB/s; file operations: 9,000 ops/second
Redundancy         No single point of failure with shared storage
Security           POSIX ACLs
Management         Quota
Kernel/OS support  RHEL 3, RHEL 4, SLES 9; patches required;
                   limited NFS support, full CIFS support
POSIX compliant    Yes
User profile       Primarily high-end HPC

Intergalactic Strategy

A roadmap spanning two axes, HPC scalability and enterprise data management:

• Lustre v1.4 (today)
• Lustre v1.6 (mid/late 2006) – online server addition; simple configuration; Kerberos; Lustre RAID; patchless client; NFS export
• Lustre v2.0 (late 2007) – 1 TB/s; writeback caches; 5–10x metadata improvement; small files; clustered MDS; pools; snapshots; backups
• Lustre v3.0 (~2008/9) – 10 TB/s; 1 PFlop systems; 1 trillion files; 1M file creates/sec; 30 GB/s on mixed files; HSM; proxy servers; disconnected operation

Near- and Mid-Term 12-Month Strategy

• Beginning to see much higher quality
• Adding desirable features and improvements
• Some rewrites (recovery)
• Focus largely on making existing installations work better
• Much improved manual; extended training; more community activity

In Testing Now

• Improved NFS performance (1.4.7); support for NFSv4 (pNFS when available)
• Broader support for data-center hardware; site-wide storage enablement
• Networking (1.4.7, 1.4.8): OpenFabrics (formerly OpenIB Gen 2) support, Cisco InfiniBand, Myricom MX
• Much simpler configuration (1.6.0) – see the sketch after this list
• Clients without patches (1.4.8)
• Ext3 large-partition support (1.4.8)
• Linux software RAID5 improvements (1.6.0)
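To illustrate the "much simpler configuration" item, here is a minimal sketch of the mkfs.lustre/mount workflow as it eventually shipped in Lustre 1.6, replacing the XML/lconf-based configuration of the 1.4 series. The hostnames (mds), device paths, and file system name (testfs) are hypothetical:

    # On the metadata node: format a combined MGS/MDT and mount it.
    mkfs.lustre --fsname=testfs --mgs --mdt /dev/sda1
    mount -t lustre /dev/sda1 /mnt/mdt

    # On each object storage server: format an OST, pointing it at the MGS.
    mkfs.lustre --fsname=testfs --ost --mgsnode=mds@tcp0 /dev/sdb1
    mount -t lustre /dev/sdb1 /mnt/ost0

    # On a client: a single mount command brings in the whole file system.
    mount -t lustre mds@tcp0:/testfs /mnt/lustre

The configuration lives on the targets themselves rather than in a central file that must be distributed to every node, which is the sense in which the 1.6 configuration is "simpler".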
More in 2006

• Metadata improvements (several, starting in 1.4.7): client-side metadata and OSS read-ahead; more parallel operations
• Recovery improvements: move from timeouts to health detection; more recovery situations covered
• Kerberos5 authentication; OSS capabilities; remote UID handling
• WAN: Lustre received an SC|05 award for long-distance efficiency – 5 GB/s over 200 miles on an HP system

Late 2006 / Early 2007

• Manageability:
  – Backups: restore files with exact or similar striping
  – Snapshots: use LVM with distributed control
  – Storage pools
• Troubleshooting: better error messages; error-analysis tools
• High-speed CIFS services: pCIFS Windows client
• OS X Tiger client – no patches to OS X
• Lustre network RAID 1 and RAID 5 across OSS stripes
• Pools: naming a group of OSSs

Pools

[Diagram: Pool A groups OSSs backed by commodity SAN or disks; Pool B groups OSSs backed by enterprise-class RAID storage; failover pairing and striping stay inside a pool.]

• Pools name a group of OSSs – commodity SAN or disks alongside enterprise-class RAID storage, heterogeneous servers, old and new OSSs
• Associate qualities with a pool: homogeneous QOS, failover inside pools, striping inside pools
• Policies: a subdirectory must place its stripes in a pool (setstripe); a Lustre network must use one pool (lctl); the destination pool can depend on file extension; LAID stripes are kept within a pool (see the sketch below)
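Pools eventually shipped after this deck (in Lustre 1.8), so the following is a sketch using the released command syntax rather than what was planned here; the file system name testfs, pool name poolA, directory path, and OST indices are hypothetical:

    # On the MGS node: create a named pool and add OSTs to it.
    lctl pool_new testfs.poolA
    lctl pool_add testfs.poolA testfs-OST[0000-0003]
    lctl pool_list testfs.poolA

    # On a client: direct a subdirectory's stripes into the pool,
    # two stripes per file.
    lfs setstripe --pool poolA -c 2 /mnt/lustre/projectA
    lfs getstripe /mnt/lustre/projectA

New files created under the directory then allocate their objects only from OSTs in poolA, which is how a pool associates storage qualities (commodity vs. enterprise RAID) with parts of the namespace.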
Lustre Network RAID

• Shared storage partitions for OSS targets (OSTs)
• OSS redundancy today: OSS1 is active for target 1 and standby for target 2; OSS2 is active for target 2 and standby for target 1 (a configuration sketch appears at the end of this document)
• Lustre network RAID (LAID): striping with redundancy across multiple OSS nodes (late 2006)

pCIFS Deployment

[Diagram: Windows clients reach Samba gateways sitting in front of the Lustre namespace – the Lustre metadata servers (MDS) and object storage servers (OSS).]

• Windows clients use the traditional CIFS protocol with Windows semantics to obtain the file layout
• A pCIFS redirector – a filter driver for NT CIFS – then performs parallel I/O directly, using the file layout

Longer-Term Future Enhancements

• Clustered metadata – towards 1M pure metadata ops/sec
• fsck-free file systems – immense capacity
• Client-memory metadata writeback cache – smoking performance
• Hot data migration – HSM and management
• Global namespace
• Proxy clusters – wide-area grid applications
• Disconnected operation

Clustered Metadata: v2.0

• Pools of MDS servers and pools of OSS servers serve the clients
• Previous scalability plus an enlarged metadata pool: enhances NetBench/SpecFS and client scalability
• Limits: 100s of billions of files, millions of ops/sec
• Load-balanced clustered metadata

Modular Architecture

• Lustre 1.x client: the Lustre client file system sits on logical object volumes spanning several OSCs and a single MDC, with the lock service, networking, and recovery layers below
• Lustre 2.x client: adds a logical metadata volume so the client file system can spread metadata across multiple MDCs

Preliminary Results – Creations

open(O_CREAT) with 0 objects; creates/second by number of files:

              1M     2M     3M     4M     5M     6M     7M     8M     9M    10M
Dirstripe 1   3700   3620   3350   3181   3156   2997   2970   2897   2824   2796
Dirstripe 2   7550   7356   6803   6478   6235   6070   5896   5171   5734   5618
Dirstripe 3  10802  10528   9660   9311   8953   8597   8284   8140   8094   7666
Dirstripe 4  14502  13966  12927  12450  11757  11459  11117  10714  10466  10267

Create rates scale nearly linearly with the number of directory stripes.

Preliminary Results – Stats

stat (random, 0 objects); stats/second by number of files:

              1M     2M     3M     4M     5M     6M     7M     8M     9M    10M
Dirstripe 1   4898   4879   4593   2502   1344   2564   1230   1229   1119   1064
Dirstripe 2   9456   9410   9411   9384   9365   4876   1731   1507   1449   1356
Dirstripe 3  14867  14817  14311  14828  14742  14773  14725  14736   4123   1747
Dirstripe 4  19886  19942  19914  19826  19843  19833  19868  19926  19803  19831

Extremely Large File Systems

• Remove limits (number of files, sizes) and never require fsck
• Start moving towards an "iron" ext3, like ZFS; work from the University of Wisconsin is a first step: checksum much of the data, replicate metadata, detect and repair corruption
• A new component handles relational corruption, caused by accidental reordering of writes
• Unlike the "port ZFS" approach, we expect a sequence of small fixes, each with its own benefits

Lustre Memory Client Writeback Cache

• A quantum leap in performance and capability
• AFS is the wrong approach: a local server acting as a cache is slow (disk writes, context switches)
• A file I/O writeback cache exists and is step 1
• The critical new element is asynchronous file creation; file identifiers are difficult to create without network traffic
• Everything gets flushed to the server asynchronously
• Target: one client sustaining 30 GB/s on a mix of small and big files

Lustre Writeback Cache: Nuts & Bolts

• The local file system does everything in memory – no context switches
• Sub-tree locks
• Client capability to allocate file identifiers (also required for proxy clusters)
• 100% asynchronous server API, except on a cache miss

Migration – Preface to HSM

• Object migration: intercept cache misses; migrate objects; transparent to clients (a virtual migrating OSS pool with a hot-object migrator)
• Restriping: a coordinating client controls the configuration change and migrates groups of objects
• Pool migration: controlled migration of groups of inodes (a virtual migrating MDS pool moves MDS Pool A to MDS Pool B; likewise OSS Pool A to OSS Pool B)

Cache-miss handling and migration build on SMFS, the Storage Management File System: logs, attributes, and cache-miss interception.

Proxy Servers

[Diagram: remote clients attach to proxy Lustre servers, which synchronize with the master Lustre servers through a proxy client.]

1. Disconnected operation
2. Asynchronous replay, capable of re-striping
3. Conflict detection and handling
4. Mandatory or cooperative lock policy
5. Namespace sub-tree locks give local performance; resynchronization avoids a tree walk by using a server log with subtree versioning

Wide-Area Applications

• Real, fast grid storage
• Worldwide R&D environments in a single namespace
• National security
• Internet content hosting and searching
• Work at home, offline, on the employee network
• Much, much more…

Summary of Targets

        Total       Transactions  Client      System
        throughput  per second    throughput  scale                     Capabilities
Today   ~100 GB/s   9K ops/s      800 MB/s    300 teraflops, 250 OSSs   POSIX
2008    1 TB/s      1M ops/s      30 GB/s     1 petaflop, 1,000 OSSs    Site-wide, WB cache, secure
2010    10 TB/s                               10 petaflops, 5,000 OSSs  Proxies, disconnected

Cluster File Systems, Inc.

• History: founded in 2001; shipping production code since 2003
• Financial status: a privately held, self-funded, US-based corporation with no external investment; profitable since the first day of business
• Company size: 40+ full-time engineers, with additional staff in management, finance, sales, and administration
• Installed base: ~200 commercially supported deployments worldwide; supporting the largest systems on three continents (North America, Europe, Asia); 20% of the Top 100 supercomputers (per the November '05 Top500 list); a growing list of private-sector users

Thank You!
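As referenced under Lustre Network RAID above, here is a minimal sketch of an active/standby OST pair on shared storage, using the mkfs.lustre failover syntax that shipped with Lustre 1.6; the hostnames (oss1, oss2, mds), shared device path, and file system name testfs are hypothetical:

    # Format the shared target once, naming the partner OSS as its failover node.
    mkfs.lustre --fsname=testfs --ost --mgsnode=mds@tcp0 \
        --failnode=oss2@tcp0 /dev/shared_lun1

    # Normal operation: oss1 serves the target.
    mount -t lustre /dev/shared_lun1 /mnt/ost1    # run on oss1

    # After oss1 fails: oss2 mounts the same shared LUN, and clients
    # recover by retrying their requests against the failover NID.
    mount -t lustre /dev/shared_lun1 /mnt/ost1    # run on oss2

Because the target lives on storage both servers can reach, only one server may mount it at a time; the failover software (not shown) is responsible for fencing the failed node before the standby takes over.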