Goodbye, XFS: Building a New, Faster Storage Backend for Ceph


SAGE WEIL – RED HAT
2017.09.12

OUTLINE
● Ceph background and context
  – FileStore, and why POSIX failed us
● BlueStore – a new Ceph OSD backend
● Performance
● Recent challenges
● Current status, future
● Summary

MOTIVATION

CEPH
● Object, block, and file storage in a single cluster
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● “A Scalable, High-Performance Distributed File System”
● “performance, reliability, and scalability”

LUMINOUS
● Released two weeks ago
● Erasure coding support for block and file (and object)
● BlueStore (our new storage backend)

CEPH COMPONENTS
● RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● OBJECT: RGW – a web services gateway for object storage, compatible with S3 and Swift
● BLOCK: RBD – a reliable, fully-distributed block device with cloud platform integration
● FILE: CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management

OBJECT STORAGE DAEMONS (OSDS)
[Diagram: each OSD stores its data in a local file system (xfs, btrfs, or ext4) on its own disk; monitors (M) run alongside]
[Diagram: the same stack, with the FileStore layer shown between each OSD and its local file system]

IN THE BEGINNING
[Timeline, 2004–2016: EBOFS and FakeStore are the original local storage backends]

OBJECTSTORE AND DATA MODEL
● ObjectStore
  – abstract interface for storing local data
  – EBOFS, FakeStore
● Object – “file”
  – data (file-like byte stream)
  – attributes (small key/value)
  – omap (unbounded key/value)
● Collection – “directory”
  – actually a shard of a RADOS pool
  – Ceph placement group (PG)
● All writes are transactions
  – Atomic + Consistent + Durable
  – Isolation provided by OSD

LEVERAGE KERNEL FILESYSTEMS!
[Timeline, 2004–2016: FakeStore evolves into FileStore + BtrFS and then FileStore + XFS; EBOFS is removed]

FILESTORE
● FileStore
  – evolution of FakeStore
  – PG = collection = directory
  – object = file
● Leveldb
  – large xattr spillover
  – object omap (key/value) data
● On-disk layout:
    /var/lib/ceph/osd/ceph-123/
      current/
        meta/
          osdmap123
          osdmap124
        0.1_head/
          object1
          object12
        0.7_head/
          object3
          object5
        0.a_head/
          object4
          object6
        omap/
          <leveldb files>

POSIX FAILS: TRANSACTIONS
● Most transactions are simple
  – write some bytes to object (file)
  – update object attribute (file xattr)
  – append to update log (kv insert)
  – ...but others are arbitrarily large/complex
● Serialize and write-ahead the transaction to a journal for atomicity
  – We double-write everything!
  – Lots of ugly hackery to make replayed events idempotent
● Example transaction (dumped as JSON):
    [
      { "op_name": "write",
        "collection": "0.6_head",
        "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
        "length": 4194304,
        "offset": 0,
        "bufferlist length": 4194304 },
      { "op_name": "setattrs",
        "collection": "0.6_head",
        "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
        "attr_lens": { "_": 269, "snapset": 31 } },
      { "op_name": "omap_setkeys",
        "collection": "0.6_head",
        "oid": "#0:60000000::::head#",
        "attr_lens": { "0000000005.00000000000000000006": 178, "_info": 847 } }
    ]

POSIX FAILS: KERNEL CACHE
● Kernel cache is mostly wonderful
  – automatically sized to all available memory
  – automatically shrinks if memory is needed elsewhere
  – efficient
● No cache implemented in the app, yay!
● Read/modify/write vs write-ahead
  – write must be persisted to the journal
  – then applied to the file system
  – only then can you read it

POSIX FAILS: ENUMERATION
● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
  – scrubbing
  – data rebalancing, recovery
  – enumeration via librados client API
● POSIX readdir is not well-ordered
● Need O(1) “split” for a given shard/range
● Build a directory tree by hash-value prefix
  – split big directories, merge small ones
  – read entire directory, sort in-memory
● Example hashed layout:
    ...
    DIR_A/
    DIR_A/A03224D3_qwer
    DIR_A/A247233E_zxcv
    ...
    DIR_B/
    DIR_B/DIR_8/
    DIR_B/DIR_8/B823032D_foo
    DIR_B/DIR_8/B8474342_bar
    DIR_B/DIR_9/
    DIR_B/DIR_9/B924273B_baz
    DIR_B/DIR_A/
    DIR_B/DIR_A/BA4328D2_asdf
    ...

THE HEADACHES CONTINUE
● New FileStore+XFS problems continue to surface
  – FileStore directory splits lead to throughput collapse when an entire pool’s PG directories split in unison
  – Cannot bound deferred writeback work, even with fsync(2)
  – QoS efforts thwarted by deep queues and periodicity in FileStore throughput
  – {RBD, CephFS} snapshots trigger inefficient 4MB object copies to create object clones

ACTUALLY, OUR FIRST PLAN WAS BETTER
[Timeline, 2004–2016: FakeStore → FileStore + BtrFS → FileStore + XFS; EBOFS removed; NewStore appears and becomes BlueStore]

BLUESTORE

BLUESTORE
● BlueStore = Block + NewStore
  – consume raw block device(s)
  – key/value database (RocksDB) for metadata
  – data written directly to the block device
  – pluggable inline compression
  – full data checksums
● Target hardware
  – current-gen HDD, SSD (SATA, SAS, NVMe)
● We must share the block device with RocksDB
[Diagram: BlueStore implements ObjectStore; data goes directly to the BlockDevice, while metadata goes to RocksDB via BlueRocksEnv and BlueFS, which share the same BlockDevice]

ROCKSDB: BlueRocksEnv + BlueFS
● class BlueRocksEnv : public rocksdb::EnvWrapper
  – passes “file” operations through to BlueFS
● BlueFS is a super-simple “file system”
  – all metadata lives in the journal
  – all metadata is loaded into RAM on start/mount
  – journal is rewritten/compacted when it gets large
  – no need to store a block free list
● Map “directories” to different block devices
  – db.wal/ – on NVRAM, NVMe, SSD
  – db/ – level0 and hot SSTs on SSD
  – db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space between itself and BlueFS
[Diagram: BlueFS on-disk layout – superblock, journal extents interleaved with data extents; journal entries such as “file 10”, “file 11”, “file 12”, “file 13”, “rm file 12” record all file metadata]
METADATA

ONODE – OBJECT
● Per-object metadata
  – lives directly in a key/value pair
  – serializes to 100s of bytes
● Size in bytes
● Attributes (user attr data)
● Inline extent map (maybe)

    struct bluestore_onode_t {
      uint64_t size;
      map<string,bufferptr> attrs;
      uint64_t flags;

      // extent map metadata
      struct shard_info {
        uint32_t offset;
        uint32_t bytes;
      };
      vector<shard_info> shards;

      bufferlist inline_extents;
      bufferlist spanning_blobs;
    };

CNODE – COLLECTION
● Collection metadata
  – an interval of the object namespace
● Nice properties
  – ordered enumeration of objects
  – we can “split” collections by adjusting the collection metadata only

    struct spg_t {
      uint64_t pool;
      uint32_t hash;
    };

    struct bluestore_cnode_t {
      uint32_t bits;
    };

● Key layout (pool, hash, name, snap; the collection key’s value carries the hash-prefix bits):
    C<12,3d3e0000> "12.e3d3" = <19>
    O<12,3d3d880e,foo,NOSNAP> = ...
    O<12,3d3d9223,bar,NOSNAP> = ...
    O<12,3d3e02c2,baz,NOSNAP> = ...
    O<12,3d3e125d,zip,NOSNAP> = ...
    O<12,3d3e1d41,dee,NOSNAP> = ...
    O<12,3d3e3832,dah,NOSNAP> = ...

BLOB AND SHAREDBLOB
● Blob
  – extent(s) on device
  – checksum metadata
  – data may be compressed
● SharedBlob
  – extent ref count on cloned blobs

    struct bluestore_blob_t {
      uint32_t flags = 0;
      vector<bluestore_pextent_t> extents;
      uint16_t unused = 0;  // bitmap
      uint8_t csum_type = CSUM_NONE;
      uint8_t csum_chunk_order = 0;
      bufferptr csum_data;
      uint32_t compressed_length_orig = 0;
      uint32_t compressed_length = 0;
    };

    struct bluestore_shared_blob_t {
      uint64_t sbid;
      bluestore_extent_ref_map_t ref_map;
    };

EXTENT MAP
● Maps object logical offsets → blob extents
● Serialized in chunks
  – stored inline in the onode value if small
  – otherwise stored in adjacent keys (shards)
● Blobs are stored inline in each shard
  – unless a blob is referenced across shard boundaries
  – “spanning” blobs are stored in the onode key
● Example key layout:
    O<,,foo,,>  = onode + inline extent map
    O<,,bar,,>  = onode + spanning blobs
    O<,,bar,,0> = extent map shard
    O<,,bar,,4> = extent map shard
    O<,,baz,,>  = onode + inline extent map

DATA PATH

DATA PATH BASICS
● Terms
  – TransContext – state describing an executing transaction
  – Sequencer – an independent, totally ordered queue of transactions; one per PG
● Three ways to write
  – New allocation: any write larger than min_alloc_size goes to a new, unused extent on disk; once that IO completes, we commit the transaction
  – Unused part of an existing blob
  – Deferred writes: commit a temporary promise to (over)write data with the transaction (it includes the data!), do the async (over)write, then clean up the temporary k/v pair
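The “three ways to write” mostly come down to a size check against min_alloc_size: large writes get fresh extents and the metadata commit waits for the data IO, while small overwrites piggyback their data on the key/value commit and are flushed to the final location later. The sketch below captures just that decision; the constant (64 KB was a typical HDD default in this era, but it is configurable) and the function names are illustrative assumptions, not BlueStore's internal API, and the third case (writing into unused space of an existing blob) is omitted for brevity.

    // bluestore_write_path_sketch.cc -- simplified decision logic, not Ceph code.
    #include <cstdint>
    #include <iostream>

    // Assumed HDD default for illustration; the real value is configurable.
    constexpr uint64_t MIN_ALLOC_SIZE_HDD = 64 * 1024;

    enum class WriteMode { NewAllocation, DeferredWrite };

    // Decide how a write is handled, per the slide above:
    //  - larger than min_alloc_size: allocate a new, unused extent, write the
    //    data there, and only after that IO completes commit the transaction's
    //    metadata to the key/value store.
    //  - otherwise (a small overwrite): store the data itself inside the
    //    key/value transaction ("deferred"), ack once it commits, perform the
    //    overwrite asynchronously, then delete the temporary key.
    WriteMode choose_write_mode(uint64_t length, uint64_t min_alloc_size) {
        return length > min_alloc_size ? WriteMode::NewAllocation
                                       : WriteMode::DeferredWrite;
    }

    int main() {
        for (uint64_t len : {4096ULL, 131072ULL, 4ULL * 1024 * 1024}) {
            std::cout << len << " bytes -> "
                      << (choose_write_mode(len, MIN_ALLOC_SIZE_HDD) ==
                                  WriteMode::NewAllocation
                              ? "new allocation"
                              : "deferred write")
                      << "\n";
        }
    }

The point of the deferred path is latency: for a small overwrite on an HDD, one sequential key/value commit is much cheaper than a read/modify/write of the existing extent, so the client is acknowledged as soon as the promise is durable.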
IN-MEMORY CACHE
● OnodeSpace per collection
  – in-memory name → Onode map of decoded onodes
● BufferSpace for in-memory Blobs
  – all in-flight writes
  – may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
  – TwoQCache – implements the 2Q cache replacement algorithm (default)
● Cache is sharded for parallelism
  – Collection → shard mapping matches the OSD’s op_wq
  – the same CPU context that processes a client request touches that shard’s LRU/2Q lists
  – (a sketch of this sharding idea appears at the end of this section)

PERFORMANCE

HDD: RANDOM WRITE
[Charts: BlueStore vs FileStore HDD random write throughput (MB/s, 0–450) and IOPS (0–2000) across IO sizes 4 KB–4 MB; series: FileStore, BlueStore (wip-bitmap-alloc-perf), BlueStore (master-a07452d), BlueStore (master-d62a4948), BlueStore (wip-bluestore-dw)]

HDD: SEQUENTIAL WRITE
[Charts: BlueStore vs FileStore HDD sequential write throughput (MB/s, 0–450, with a “WTH” annotation on an erratic FileStore result) and IOPS, for the same IO sizes and series]
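Returning to the in-memory cache slide above: because buffers and onodes hang off per-shard caches, and a collection always maps to the same shard as its OSD work queue, trimming and LRU/2Q updates never need a global lock. The sketch below illustrates that sharding idea with a plain LRU standing in for the 2Q policy; the types and the hash-based shard mapping are illustrative assumptions, not BlueStore's code.

    // sharded_cache_sketch.cc -- illustrative sharded onode cache, not Ceph code.
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <list>
    #include <memory>
    #include <mutex>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Onode {              // decoded per-object metadata (see ONODE slide)
        std::string name;
        uint64_t size = 0;
    };

    // One cache shard: its own lock and its own replacement list. BlueStore's
    // default policy is 2Q; a single LRU keeps this sketch short.
    struct CacheShard {
        std::mutex lock;
        std::list<std::shared_ptr<Onode>> lru;  // front = most recently used
        std::unordered_map<std::string,
                           std::list<std::shared_ptr<Onode>>::iterator> map;

        void touch(const std::shared_ptr<Onode>& o) {
            std::lock_guard<std::mutex> g(lock);
            auto it = map.find(o->name);
            if (it != map.end())
                lru.erase(it->second);
            lru.push_front(o);
            map[o->name] = lru.begin();
        }
        void trim(size_t max_onodes) {
            std::lock_guard<std::mutex> g(lock);
            while (lru.size() > max_onodes) {
                map.erase(lru.back()->name);
                lru.pop_back();
            }
        }
    };

    class ShardedCache {
    public:
        explicit ShardedCache(size_t n) : shards_(n) {}
        // Collection -> shard, so the CPU context already handling this PG's
        // requests is the one that touches this shard's lists.
        CacheShard& shard_for(const std::string& collection) {
            return shards_[std::hash<std::string>{}(collection) % shards_.size()];
        }
    private:
        std::vector<CacheShard> shards_;
    };

    int main() {
        ShardedCache cache(8);
        auto o = std::make_shared<Onode>(Onode{"foo", 4096});
        CacheShard& shard = cache.shard_for("12.e3d3");
        shard.touch(o);
        shard.trim(1024);
        std::cout << "cached onodes in shard: " << shard.map.size() << "\n";
    }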