Improving Copy-On-Write Performance in Container Storage Drivers


Frank Zhao*, Kevin Xu, Randy Shain — EMC Corp (now Dell EMC)
2016 Storage Developer Conference. © EMC Corp. All Rights Reserved.

Disclaimer
THE TECHNOLOGY CONCEPTS BEING DISCUSSED AND DEMONSTRATED ARE THE RESULT OF RESEARCH CONDUCTED BY THE ADVANCED RESEARCH & DEVELOPMENT (ARD) TEAM FROM THE EMC OFFICE OF THE CTO. ANY DEMONSTRATED CAPABILITY IS ONLY FOR RESEARCH PURPOSES AND AT A PROTOTYPE PHASE; THEREFORE: THERE ARE NO IMMEDIATE PLANS NOR INDICATION OF SUCH PLANS FOR PRODUCTIZATION OF THESE CAPABILITIES AT THE TIME OF PRESENTATION. THINGS MAY OR MAY NOT CHANGE IN THE FUTURE.

Outline
• Container (Docker) storage drivers
• Copy-on-write performance drawback
• Our solution: Data Relationship and Reference (DRR)
• Design & prototyping with DevMapper
• Test results: launch storm, data access, IO heatmap
• Summary and future work

Container/Docker Image and Layers
• An image references a list of read-only layers, or differences (deltas); a running container sees the merged logical view of all layers (e.g. File 1 … File 6 spread across layers L1–L3 of an Ubuntu 15.04 image).
• A container begins as a R/W snapshot of the underlying image.
• I/O in the container causes data to be copied up (CoW) from the image.
• CoW granularity may be file level (e.g. AUFS) or block level (e.g. DevMapper).
• CoW impacts I/O performance.

Current Docker Storage Drivers
• Pluggable storage driver design, configurable at start of the daemon and shared by all containers/images.
• Drivers use snapshots and copy-on-write for space efficiency.

  Driver          Backing FS                Linux distribution
  AUFS            Any (EXT4, XFS)           Ubuntu, Debian
  Device Mapper   Any (EXT4, XFS)           CentOS, RHEL, Fedora
  OverlayFS       Any (EXT4, XFS)           CoreOS
  ZFS             ZFS (FS+vol, block CoW)   Extra install
  Btrfs           Btrfs (block CoW)         Not very stable; no commercial Docker support

Recommendations from Docker
https://docs.docker.com/engine/userguide/storagedriver/selectadriver/
• No single driver is well suited to every use case.
• Consider stability, expertise, and future-proofing.
• "Use the default driver for your distribution."

Deployment in Practical Environments
• DM and AUFS are likely the most widely used: mature and stable, DM is in the kernel, and expertise is widely available (CentOS, RHEL, Fedora, Ubuntu, Debian, …).

CoW Performance Penalty (DM/AUFS as example)
                      DevMapper (dm-thin)                  AUFS
  Lookup              Good: single layer (Btree split)     Bad: cross-layer traverse
  I/O                 Good: block-level CoW                Bad: file-level CoW
                      Bad: can't share (page) cache        Good: shared page cache
  Memory efficiency   Bad: multiple data copies            Good: single/shared data copy
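Before the measured numbers, here is where the block-level copy-up cost comes from. The following is a minimal userspace C sketch of what a block-CoW driver conceptually does on the first write to a shared chunk: the whole chunk (64KB by default in dm-thin) is read from the base image and written to newly allocated space before the small application write can proceed. It is illustrative only; the file-backed "devices" and the function name are assumptions, not driver code.

```c
/*
 * Illustrative sketch of block-level copy-up cost. Assumes two flat files
 * stand in for the base image and the container's writable space.
 */
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (64 * 1024)   /* dm-thin default CoW granularity */

/* Copy chunk `chunk_no` from the base "device" to the container "device". */
static int copy_up_chunk(FILE *base, FILE *cow, long chunk_no)
{
    char buf[CHUNK_SIZE];
    long off = chunk_no * (long)CHUNK_SIZE;

    if (fseek(base, off, SEEK_SET) != 0 ||
        fread(buf, 1, CHUNK_SIZE, base) != CHUNK_SIZE)
        return -1;                        /* extra read IO: the CoW penalty   */
    if (fseek(cow, off, SEEK_SET) != 0 ||
        fwrite(buf, 1, CHUNK_SIZE, cow) != CHUNK_SIZE)
        return -1;                        /* extra write IO before user write */
    return 0;
}

int main(void)
{
    FILE *base = fopen("base.img", "rb");       /* shared read-only image     */
    FILE *cow  = fopen("container.img", "r+b"); /* container's writable space */
    if (!base || !cow) { perror("open"); return 1; }

    /* A 4KB application write to chunk 10 forces a full 64KB copy-up first. */
    if (copy_up_chunk(base, cow, 10) == 0)
        puts("chunk 10 copied up; the application write can now proceed");

    fclose(base);
    fclose(cow);
    return 0;
}
```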
CoW Performance Penalty Example
• The initial copy-up from image to container adds I/O latency.
• Sibling containers suffer the copy-up again even when accessing the same data.
• The first read by C1 is penalized by CoW; a re-read of the same data by C1 is satisfied from C1's page cache (20X faster).
• The first read by C2 is penalized again due to DM CoW (CentOS, HDD, DM-thin).
(Bar chart, read bandwidth: C1 first read ~118 MB/s, C1 re-read ~2500 MB/s; C2 first read ~120 MB/s, C2 re-read ~2500 MB/s.)

Our Goals
• Target dense container (snapshot) environments: multiple containers from the same image running similar workloads.
• Improve common CoW performance: speed up cross-layer lookup (AUFS's weak point) and reduce redundant disk IO (DM's weak point); in the future, facilitate a single in-memory copy.
• Software on top of the existing CoW infrastructure: no extra data cache, only efficient metadata.
• Works cross-container (C1→C2, C2→C3, …) and cross-image (e.g. between Ubuntu and TensorFlow images), as long as they derive from the same base layer.
• Good scalability.

DRR: Data Relationship & Reference
*Research project (i.e. not mature enough for production)
• Metadata that describes the latest ownership of data.
• Layer relationship: a tree of base/snapshot relationships; R/W containers reside at the leaves.
  • Look up the layer tree when the driver indicates data is shared.
  • Update the layer tree when a layer is created or deleted, e.g. image pull, commit changes, or delete image.
• Data access record: tracks the latest IO on shared data.
  Record: {LBN, Len, srcLayer, targetLayer, pageBitmap}
• DRR common operations:
  • Add/update: add a new record when IO completes on shared data.
  • Lookup: for new IO (read, small write) on shared data, reference the recent valid page cache instead of going to disk.
  • Remove: a write IO invalidates and removes the record, since the data is now owned by that container.

DRR I/O Diagram
(Diagram: containers C1, C2, C3, … each access File-A through the kernel; a global shared DRR metadata per host — logically per snapFamily — sits on top of the CoW infrastructure such as DevMapper or AUFS.)
1. C1 accesses the file data for the first time and loads it from disk.
2. A record is added to the DRR metadata after the IO completes.
3. The data is returned to C1.
4. C2 accesses the same data.
5. The DRR metadata is looked up for a valid reference.
6. The data is memory-copied from C1's page cache via the normal file interface at the corresponding offset, and DRR is updated to reflect the most recent reference.

Linux DM-thin
• One of the DM target drivers, in the kernel for ~5 years and mature enough.
• Shared block pool for multiple thin devices.
• Internally a single layer (metadata CoW), so no cross-layer traverse; supports many snapshots in depth (snap-of-snap).
• Metadata (Btree) and data CoW; granularity is a 64KB chunk by default.
(Diagram: 2-level Btree — a device tree keyed by Dev-ID and a block tree keyed by FS-LBN map to a PBN (in 64KB chunks) inside a shared thin pool holding the base device and its thin/snapshot devices.)

DRR Key Structures: LayerTree, AccessTable
• Layer_Tree (hierarchy): one PoolBase per host; one SnapFamily per base image (e.g. Ubuntu, TensorFlow); read-only layers are interior nodes (DevID1, ID2, ID3, …) and R/W container layers are the leaves.
• Access_Table (LBN → recent access entry): bio.LBN hashes to a per-chunk entry holding versions & references — OwnerDevID, refDevID[], pageRefList, state.
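To make the access-table idea concrete, below is a self-contained userspace C sketch of such a structure. The field names follow the slide's record {LBN, Len, srcLayer, targetLayer, pageBitmap}; the hash-table layout, sizes, and helper names are assumptions made for this example, not the actual prototype code.

```c
/* Illustrative sketch of a DRR-style access table keyed by LBN. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define DRR_BUCKETS 1024        /* assumed table size */
#define CHUNK_SIZE  (64 * 1024) /* dm-thin default chunk */

struct drr_record {
    uint64_t lbn;            /* logical block number of the shared chunk */
    uint32_t len;            /* length of the tracked access */
    uint32_t src_layer;      /* layer (thin dev) that owns the cached copy */
    uint32_t target_layer;   /* layer that most recently referenced it */
    uint64_t page_bitmap;    /* which 4KB pages of the chunk are valid */
    struct drr_record *next; /* hash-bucket chaining */
};

static struct drr_record *drr_table[DRR_BUCKETS];

static unsigned drr_hash(uint64_t lbn) { return (unsigned)(lbn % DRR_BUCKETS); }

/* Add or refresh a record after an IO completes on a shared chunk. */
static void drr_add(uint64_t lbn, uint32_t len, uint32_t src, uint32_t tgt,
                    uint64_t bitmap)
{
    unsigned b = drr_hash(lbn);
    struct drr_record *r;

    for (r = drr_table[b]; r; r = r->next) {
        if (r->lbn == lbn) {             /* update the existing entry */
            r->src_layer = src;
            r->target_layer = tgt;
            r->page_bitmap |= bitmap;
            return;
        }
    }
    r = calloc(1, sizeof(*r));
    if (!r)
        return;
    r->lbn = lbn;
    r->len = len;
    r->src_layer = src;
    r->target_layer = tgt;
    r->page_bitmap = bitmap;
    r->next = drr_table[b];
    drr_table[b] = r;
}

/* Look up a shared chunk: a hit means another layer's page cache holds it. */
static struct drr_record *drr_lookup(uint64_t lbn)
{
    struct drr_record *r;
    for (r = drr_table[drr_hash(lbn)]; r; r = r->next)
        if (r->lbn == lbn)
            return r;
    return NULL;
}

/* A write breaks sharing: the chunk is now owned by the writer, drop it. */
static void drr_remove(uint64_t lbn)
{
    struct drr_record **pp = &drr_table[drr_hash(lbn)];
    while (*pp) {
        if ((*pp)->lbn == lbn) {
            struct drr_record *dead = *pp;
            *pp = dead->next;
            free(dead);
            return;
        }
        pp = &(*pp)->next;
    }
}

int main(void)
{
    drr_add(4096, CHUNK_SIZE, /*src=*/1, /*tgt=*/1, 0xFFFF);  /* C1 first read */
    struct drr_record *hit = drr_lookup(4096);                /* C2 reads same LBN */
    printf("hit=%d owner_layer=%u\n", hit != NULL,
           (unsigned)(hit ? hit->src_layer : 0));
    drr_remove(4096);                                         /* C2 writes: break sharing */
    return 0;
}
```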
DRR with DM-Thin
• Transparent to Docker and applications: a kernel module sitting between Docker/FS and DM-thin.
• No change to the core DM-thin CoW essentials; leverages the existing metadata, hierarchy, and addressing.
• Determines whether a block is shared via dm_thin_find_block(); breaks block sharing via break_sharing().
• Applies to file data, and to shared blocks only.
(Diagram: the current Btree mapping packs Actual-PBN (40 bits) and CTime (24 bits). DRR associates a bio's DevID and LBN with the owning device's address_space — address_space *mapping, inode *host, page, fileOffset — with a 1:1 mapping between DevID and address_space, so the owning device and its cached page can be located.)

New I/O Flow with DRR
• The application / file system (XFS, Ext4) submits bio(dev, LBN, page) to DRR + DM-thin, which looks up the Btree.
• If the block is shared and the IO is a read or a small write, DRR is consulted; otherwise the bio follows the normal path.
• Read: look up DRR with (DevID, LBN).
  • Hit: memcpy the data from the most recent page cache (file inode and offset of the owner DevID).
  • Miss: issue the disk IO, then add a DRR entry (LBN, current DevID, file inode and offset from the current bio, clean).
• Write: break sharing and remove the DRR entry, since the data is now owned by the writing container.
(A minimal C sketch of this read path follows the benchmark slides below.)

DRR: Preliminary Benchmark of PoC
• Kernel driver integrated with DM-thin: 1800+ LOC, no Docker code change.
• Environment: CentOS 7, Docker 1.10, DM-thin (64KB chunk), Ext4.
• HDD-based system: E5410 @ 2.33GHz, 8 cores, 16GB DRAM, 600GB SATA disk configured as LVM-direct, CentOS 7, kernel 3.10.0-327.
• PCIe-SSD-based system: E5-2665 @ 2.40GHz, 32 cores, 48GB DRAM, OCZ PCIe SSD RM84 (1.2TB) as LVM-direct, CentOS 7, kernel 3.10.229.

DRR: Launch Storm Test (HDD)
• Launch TensorFlow containers; each container needs to load a ~104MB trained model plus parameters.
• Result: up to 47% faster with DRR.
(Bar chart: launch time in seconds for 2–64 containers, baseline vs. DRR; at 64 containers roughly 200 s baseline vs. ~142 s with DRR.)

Container Launch Internals
Steps to start a container:
1. Create a new thin device — DM-thin locking/serialization (out of current DRR scope).
2. Mount the Ext4 FS — touches FS metadata blocks, not file data (out of current DRR scope).
3. Load binaries and libraries — handled through DRR.
4. Load config files, parameter data, … — handled through DRR.

DRR: Container I/O Test (HDD)
• 10X faster: ~120MB/s → 1.2GB/s.
• Launch C1, C2 and C3 in order; then rm C1 and launch C4. Issue the same IO to read a big image file in each container.
• A local page-cache hit gets 2.5GB/s at ~2% CPU; DRR reaches 48% of that bandwidth (1.2/2.5).
(Bar chart: read bandwidth and CPU per container — C1 warm-up ~118 MB/s, C2–C4 with DRR ~1.1–1.2 GB/s at around 5% CPU, C4 re-read from the local page cache ~2.5 GB/s.)

DRR: Computing + IO Test (HDD)
• 2.7X faster execution time for md5sum <big file>.
• Slightly higher CPU (+4% vs. a local page-cache hit, due to block bio + DRR overhead).
(Bar chart: md5sum time and CPU — ~4.8 s for the baseline/warm-up run vs. ~1.78 s with a DRR hit and ~1.58 s on a page-cache re-run.)
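Tying the pieces together, the read path from the "New I/O Flow with DRR" slide can be summarized in a self-contained C sketch. All helpers here (thin_block_shared, drr_lookup, drr_insert, read_from_disk, copy_from_owner_cache) are hypothetical stand-ins for the real dm-thin sharing test (dm_thin_find_block() in the driver) and the DRR access-table operations; only the control flow mirrors the slide, and the page copy is reduced to a plain buffer fill.

```c
/* Sketch of the DRR read-path decision; stubs stand in for kernel calls. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (64 * 1024)

struct drr_entry { uint32_t owner_dev; uint64_t file_ino; uint64_t file_off; };

/* --- hypothetical stand-ins -------------------------------------------- */
static bool thin_block_shared(uint32_t dev, uint64_t lbn)   /* sharing test */
{ (void)dev; (void)lbn; return true; }

static struct drr_entry *drr_lookup(uint32_t dev, uint64_t lbn)
{ (void)dev; (void)lbn; static struct drr_entry e = {1, 42, 0}; return &e; }

static void drr_insert(uint32_t dev, uint64_t lbn)
{ printf("DRR: add (%u, %lu)\n", dev, (unsigned long)lbn); }

static void read_from_disk(char *buf)
{ memset(buf, 0, CHUNK_SIZE); puts("disk IO"); }

static void copy_from_owner_cache(const struct drr_entry *e, char *buf)
{
    memset(buf, 1, CHUNK_SIZE);
    printf("memcpy from dev %u page cache (ino %lu)\n",
           e->owner_dev, (unsigned long)e->file_ino);
}
/* ------------------------------------------------------------------------ */

/* Handle a read bio(dev, lbn) into buf, following the DRR flow. */
static void drr_read(uint32_t dev, uint64_t lbn, char *buf)
{
    if (!thin_block_shared(dev, lbn)) {      /* private block: normal path  */
        read_from_disk(buf);
        return;
    }
    struct drr_entry *e = drr_lookup(dev, lbn);
    if (e) {                                 /* hit: reuse owner page cache */
        copy_from_owner_cache(e, buf);
    } else {                                 /* miss: disk IO, then record  */
        read_from_disk(buf);
        drr_insert(dev, lbn);
    }
}

int main(void)
{
    char buf[CHUNK_SIZE];
    drr_read(/*dev=*/2, /*lbn=*/4096, buf);  /* C2 reads a chunk C1 already cached */
    return 0;
}
```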