
Operating Systems 11/8/2018

Disk Storage and File Systems

CS 256/456
Dept. of Computer Science, University of Rochester

Characteristics of I/O Devices

• Data transfer mode – block vs. character
• Access method – sequential vs. random
• Transfer schedule – synchronous vs. asynchronous
• Sharing mode – dedicated vs. sharable
• Device speed – latency, seek time, transfer rate, occupancy/delay between operations
• I/O direction – R, W, R/W

11/8/2018 CSC 2/456 1 11/8/2018 CSC 2/456 2

Recap: Disk Storage

• Disk drive
  – mechanical parts (cylinders, tracks, sectors) and how they move to access disk data
  – electronic part (disk controller) exposes a one-dimensionally addressable set of blocks
  – large seek/rotation time

Disk Management

• Formatting
  – Header: sector number etc.
  – Footer/tail: ECC codes
  – Gap
  – Initialize mapping from logical block number to defect-free sectors
• Logical
  – One or more groups of cylinders
  – Sector 0: master boot record loaded by BIOS firmware, which contains partition information
  – Boot record points to boot partition


Disk Drive – Mechanical Parts

[Figure: multi-surface disk, disk surface with tracks, and cylinders; a cylinder is a set of tracks]

Disk Structure

• Disk drives are addressed as large 1-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer.
• The 1-dimensional array of logical blocks is mapped into the sectors of the disk sequentially.
  – Sector 0 is the first sector of the first track on the outermost cylinder.
  – Mapping proceeds in order through that track, then the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost.
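The sequential mapping described under Disk Structure can be sketched as simple integer arithmetic. This is only an illustration: the geometry constants and the function name `block_to_chs` are assumptions, and real drives vary the number of sectors per track and hide the true layout behind the controller.

```python
# Hypothetical geometry (assumption, not from the slides).
HEADS_PER_CYLINDER = 4    # tracks per cylinder, one per surface
SECTORS_PER_TRACK = 63


def block_to_chs(block: int) -> tuple[int, int, int]:
    """Map a logical block number to (cylinder, head, sector), all 0-based.

    Blocks fill one track, then the remaining tracks of the same cylinder,
    then move on to the next cylinder (outermost to innermost).
    """
    sector = block % SECTORS_PER_TRACK
    track_index = block // SECTORS_PER_TRACK
    head = track_index % HEADS_PER_CYLINDER
    cylinder = track_index // HEADS_PER_CYLINDER
    return cylinder, head, sector


first = block_to_chs(0)            # first sector of the first track, outermost cylinder
next_track = block_to_chs(63)      # first sector of the next track in the same cylinder
next_cylinder = block_to_chs(252)  # first sector of the next cylinder
```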

Disk Performance Characteristics

• A disk operation has three major components
  – Seek – moving the heads to the cylinder containing the desired sector
  – Rotation – rotating the desired sector to the disk head
  – Transfer – sequentially moving data to or from disk

[Figure: seek time (millisecond) vs. seek distance, and read throughput (MB/sec) vs. starting transfer address, for a Seagate SCSI drive and an IBM SCSI drive]

Improvement of HDD characteristics over time
(https://en.wikipedia.org/wiki/Hard_disk_drive)

Parameter | Started with (1957) | Developed to (2017) | Improvement
Capacity (formatted) | 3.75 megabytes | 14 terabytes | 3.73-million-to-one
Physical volume | 68 cubic feet (1.9 m^3) | 2.1 cubic inches (34 cm^3) | 56,000-to-one
Weight | 2,000 pounds (910 kg) | 2.2 ounces (62 g) | 15,000-to-one
Average access time | approx. 600 milliseconds | 2.5 ms to 10 ms; RW RAM dependent | about 200-to-one
Price | US$9,200 per megabyte (1961) | US$0.032 per gigabyte by 2015 | 300-million-to-one
Data density | 2,000 bits per square inch | 1.3 terabits per square inch in 2015 | 650-million-to-one
Average lifespan | c. 2000 hrs MTBF | c. 2,500,000 hrs (~285 years) MTBF | 1250-to-one
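As a rough worked example of the three components of a disk operation (seek, rotation, transfer), the sketch below adds them up for a single request. The drive parameters are made-up illustrative values, and the model assumes an average rotational delay of half a revolution.

```python
def access_time_ms(seek_ms: float, rpm: float,
                   request_bytes: int, transfer_mb_per_s: float) -> float:
    """Estimate the time for one disk request as seek + rotation + transfer."""
    rotation_ms = 0.5 * 60_000 / rpm   # half a revolution on average
    transfer_ms = request_bytes / (transfer_mb_per_s * 1e6) * 1000
    return seek_ms + rotation_ms + transfer_ms


# Hypothetical drive: 4 ms average seek, 7200 RPM, 100 MB/s sustained transfer.
small_read = access_time_ms(4.0, 7200, 4096, 100.0)
```

With these assumed numbers a 4 KB read costs a little over 8 ms, almost all of it seek and rotation; the transfer itself is about 0.04 ms. This is why the scheduling and layout techniques in the rest of the lecture focus on reducing arm and platter movement.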


Disk Scheduling

• Disk scheduling – choose from outstanding disk requests when the disk is ready for a new request
  – can be done in both disk controller and the operating system
  – Disk scheduling non-preemptible
• Goals of disk scheduling
  – overall efficiency – small resource consumption for completing disk I/O workload
  – fairness – prevent starvation

FCFS (First-Come-First-Serve)

• Illustration shows the total head movement is 640.
• Starvation?
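A minimal sketch of how FCFS head movement is counted. The slide's illustration is not reproduced in this text, so the request queue and starting head position below are assumptions (a commonly used textbook example); with these values the total happens to be 640 cylinders.

```python
def fcfs_head_movement(start: int, requests: list[int]) -> int:
    """Total cylinders travelled when requests are served in arrival order."""
    total, position = 0, start
    for cylinder in requests:
        total += abs(cylinder - position)
        position = cylinder
    return total


# Assumed example: head starts at cylinder 53, requests listed in arrival order.
queue = [98, 183, 37, 122, 14, 124, 65, 67]
fcfs_total = fcfs_head_movement(53, queue)
```

Because every request is served in arrival order, no request can be overtaken indefinitely, at the price of long back-and-forth sweeps.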

SSTF (Shortest-Seek-Time-First)

• Selects the request with the minimum seek time from the current head position.
• SSTF scheduling is a form of SJF scheduling.
• Illustration shows the total head movement is 236.
• Starvation?

SCAN

• The disk arm starts at one end of the disk, and moves toward the other end, servicing requests until it gets to the other end, where the head movement is reversed and servicing continues.
• Sometimes called the elevator algorithm.
• Illustration shows the total head movement is 208.
• Starvation?
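The two policies above can be compared with a short sketch. The queue and head position are assumed values (the same textbook-style example, not taken from the slide figures), and the `turn_at_last_request` flag is a hypothetical parameter added to show how the turning point of the sweep changes the total.

```python
def sstf_head_movement(start: int, requests: list[int]) -> int:
    """Always serve the pending request closest to the current head position."""
    pending, position, total = list(requests), start, 0
    while pending:
        nearest = min(pending, key=lambda c: abs(c - position))
        total += abs(nearest - position)
        position = nearest
        pending.remove(nearest)
    return total


def scan_head_movement(start: int, requests: list[int],
                       turn_at_last_request: bool = False) -> int:
    """Sweep toward cylinder 0 first, then reverse and sweep upward.

    By default the arm travels all the way to cylinder 0 before reversing.
    With turn_at_last_request=True it reverses at the lowest pending request.
    """
    lower = [c for c in requests if c <= start]
    upper = [c for c in requests if c > start]
    if turn_at_last_request:
        turn = min(lower) if lower else start
    else:
        turn = 0
    total = start - turn
    if upper:
        total += max(upper) - turn
    return total


queue = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed request queue
sstf_total = sstf_head_movement(53, queue)
scan_total = scan_head_movement(53, queue)
scan_turning_early = scan_head_movement(53, queue, turn_at_last_request=True)
```

On this assumed queue SSTF travels 236 cylinders. The sweep travels 236 cylinders when the arm goes all the way to cylinder 0 and 208 when it turns at the lowest request (cylinder 14). SSTF can starve a far-away request if nearby requests keep arriving, whereas a sweep eventually reaches every cylinder.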


C-SCAN (Circular-SCAN)

• Provides a more uniform wait time than SCAN.
• The head moves from one end of the disk to the other, servicing requests as it goes. When it reaches the other end, however, it immediately returns to the beginning of the disk, without servicing any requests on the return trip.

C-LOOK

• Variation of C-SCAN
• Arm only goes as far as the last request in each direction, then reverses direction immediately, without first going all the way to the end of the disk.

Starvation?
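A small sketch of a C-LOOK service order, under one common interpretation: the arm sweeps upward from the current position to the highest pending request, then jumps back to the lowest pending request and sweeps upward again. The queue, the starting position, and the choice to count the return jump as head movement are all assumptions made for illustration.

```python
def c_look_order(start: int, requests: list[int]) -> list[int]:
    """Service order: upward sweep from the head, then restart at the lowest request."""
    upper = sorted(c for c in requests if c >= start)
    lower = sorted(c for c in requests if c < start)
    return upper + lower


def head_movement(start: int, order: list[int]) -> int:
    """Total cylinders travelled for a given service order (return jump included)."""
    total, position = 0, start
    for cylinder in order:
        total += abs(cylinder - position)
        position = cylinder
    return total


queue = [98, 183, 37, 122, 14, 124, 65, 67]   # assumed request queue
order = c_look_order(53, queue)
c_look_total = head_movement(53, order)
```

Requests are only ever served in one direction, so a request just behind the head waits roughly one full cycle, the same as any other. That is the source of the more uniform wait time.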


Deadline Scheduling in Linux

• A regular elevator-style scheduler similar to C-LOOK
• Additionally, all I/O requests are put into a FIFO queue with an expiration time (e.g., 500ms)
• When the head request in the FIFO queue expires, it will be executed next (even if it is not next in line according to C-LOOK).
• A mix of performance and fairness.

Concurrent I/O

• Consider two request handlers in a Web server
  – each accesses a different stream of sequential data (a file) on disk;
  – each reads a chunk (the buffer size) at a time; does a little CPU processing; and reads the next chunk
• What happens?

[Figure: timeline of two threads/processes, each alternating between disk I/O and CPU or waiting for I/O]


How to Deal with It?

• Aggressive prefetching
• Anticipatory scheduling [Iyer & Druschel, SOSP 2001]
  – at the completion of an I/O request, the disk scheduler will wait a bit (despite the fact that there is other work to do), in anticipation that a new request with strong locality will be issued; schedule another request if no such new request appears before timeout
  – included in Linux 2.6

Two Disks: Disk Striping

• Blocks divided into subblocks
• Subblocks stored on different disks

[Figure: CPU connected to Disk 1 and Disk 2]


Two Disks: Mirroring

• Make a copy of each block on each disk
• Provides redundancy

[Figure: CPU connected to Disk 1 and Disk 2]

Multiple Disks: Parity Block

• Have one disk contain parity bits of blocks on other devices
• Provides redundancy without full copy

[Figure: CPU connected to Disk 1, Disk 2, and Disk 3]
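The parity-block idea can be illustrated with bytewise XOR: the parity disk stores the XOR of the corresponding blocks on the data disks, and any single lost block is the XOR of the survivors. This is a minimal sketch with made-up block contents; the helper name `xor_blocks` is hypothetical and it assumes all blocks have the same length.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


disk1 = b"OPERATIN"                      # data block on disk 1
disk2 = b"GSYSTEMS"                      # data block on disk 2
parity = xor_blocks(disk1, disk2)        # stored on disk 3

# Disk 1 fails: rebuild its block from the surviving disk and the parity.
recovered = xor_blocks(disk2, parity)
```

Two data disks plus one parity disk tolerate one failure with 50% extra space, compared with the 100% overhead of mirroring.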


Exploiting Concurrency

• RAID: Redundant Arrays of Independent Disks
  – RAID 0: data striping at block level, no redundancy
  – RAID 1: mirrored disks (100% overhead)
  – RAID 2: bit-level striping with parity bits, synchronized writes
  – RAID 3: data striping at the bit level with parity disk, synchronized writes
  – RAID 4: data striping at block level with parity disk
  – RAID 5: scattered parity
  – RAID 6: handles multiple disk failures

Solid State Drives

• No mechanical component (moving parts)
• Lower energy requirements
• Speed
  – Reads and writes in the order of 10s of microseconds (reading faster than writing)
  – Erase on the order of a millisecond
• Finite number of erase and write cycles, requiring what is called “wear leveling”


Flash Memory (Based on Charge)

• Based on floating-gate transistor

File Systems

• A file system is the OS abstraction for storage resources
  – File is a logical storage unit in the OS abstract interface for storage resources
    • Extension of address space (temporary files)
    • Non-volatile storage that survives the execution of an individual program (persistent files)
  – Directory is a logical “container” for a group of files


Operations Supported

• Create – associate a name with a file
• Delete – remove the file
• Rename – associate a new name with a file
• Open – create cached context that is associated implicitly with future reads and writes
• Write – store data in a file
• Read – access the data associated with a file
• Close – discard cached context
• Seek – random access to any record or byte
• Map – place in address space for convenience (memory-based loads and stores), speed; disadvantages: lengths that are not multiples of the page size, consistency with open/read/write interface

File System Issues

• File naming and other attributes:
  – name, size, access time, sharing/protection, location
• Intra-file structure
  – None - sequence of words, bytes
  – Complex Structures
    • records/formatted document/executable
• File system organization: efficiency of disk access
• Concurrent access: allow multiple processes to read/write
• Reliability: integrity in the presence of failures
• Protection: sharing/protection attributes and access control lists (ACLs)
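The operations listed under Operations Supported correspond closely to the low-level calls that Python exposes in its `os` and `mmap` modules. The sketch below walks through them on a scratch file; the file name is hypothetical and the example assumes a POSIX-like system.

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.dat")   # hypothetical scratch file

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)     # Create + Open
os.write(fd, b"hello, file system")                   # Write
os.lseek(fd, 7, os.SEEK_SET)                          # Seek to byte offset 7
chunk = os.read(fd, 4)                                # Read 4 bytes from there

with mmap.mmap(fd, 0) as view:                        # Map the whole file
    view[0:5] = b"HELLO"                              # a store into memory updates the file
    view.flush()

os.lseek(fd, 0, os.SEEK_SET)
head = os.read(fd, 5)                                 # read back through the file interface
os.close(fd)                                          # Close: discard the open-file context

renamed = path + ".renamed"
os.rename(path, renamed)                              # Rename
os.remove(renamed)                                    # Delete
```

The descriptor `fd` is the cached context created by open: the read and write calls never name the file again, and the current offset travels with the descriptor.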


File Naming

• Fixed vs. variable length
  – Fixed: 8-255 characters
  – Variable: length:value encoding
• File extensions – system supported vs. convention

Naming Files Using Directory Structures

• Directory: maps names to files; directories may themselves be files
  – Single level (flat): no two files may have the same name
  – Two level: per-user single-level directory
  – Hierarchical: generalization of two level; each file system is assigned the root of a tree
  – Acyclic (or cyclic) graph: allow sharing of files across directories; hard versus soft (symbolic) links


Shared Files: Links

• File appears simultaneously in different directories
• File system is now a directed acyclic graph (DAG)
• Hard link – directory points to file inode, which maintains a count of pointers
• Soft link – new file type, containing the path of the file to which it is linked, along with permissions (symbolic linking) – no pointer to inode

File Types

• Control operations allowed on files
• Use file name extensions to indicate type (in UNIX, this is just a convention)
• Structured vs. unstructured data
  – None - sequence of words, bytes
  – Complex Structures
    • records/formatted document/executable
• Sequential, random, or key-based (indexed) access
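The difference between the two kinds of link from the Shared Files slide can be observed directly on a POSIX file system. This is a small sketch using scratch files with hypothetical names; it assumes the platform and file system support both `os.link` and `os.symlink`.

```python
import os
import tempfile

directory = tempfile.mkdtemp()
original = os.path.join(directory, "original.txt")
hard = os.path.join(directory, "hard.txt")
soft = os.path.join(directory, "soft.txt")

with open(original, "w") as f:
    f.write("data")

os.link(original, hard)       # second directory entry pointing at the same inode
os.symlink(original, soft)    # new file whose content is the path to the original

same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
link_count = os.stat(original).st_nlink   # count of directory entries for the inode

os.remove(original)           # drops one entry; the inode survives via the hard link

with open(hard) as f:
    hard_still_readable = f.read() == "data"
soft_is_dangling = os.path.islink(soft) and not os.path.exists(soft)
```

Removing the original name leaves the data reachable through the hard link, while the soft link now holds a path that no longer resolves.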


File Space Organization

• Disk basic allocation unit is a sector (e.g., 512 bytes)
• File system may choose to use a larger block size (e.g., 4KB)
• File allocation methods
  – How disk blocks are allocated for files
    • Contiguous allocation
    • Linked allocation
    • Indexed allocation
  – Metrics:
    • Access speed (sequential & random)
    • Space utilization

Contiguous File Allocation

• Each file occupies a set of contiguous blocks on the disk
• Advantage:
  – Simple – only starting location (block #) and length (number of blocks) are required
  – Fast sequential; also quite fast random access
• Disadvantage:
  – External fragmentation
  – Inflexible when appending to a file


Linked File Allocation

• Each file is a linked list of disk blocks
  – each block contains a next pointer
  – directory only needs to store the pointer to the first block
  – blocks may be scattered anywhere on the disk
• Advantage
  – Space efficient
  – Flexible in appending
• Disadvantage:
  – Poor access speed (sequential & random)
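To see why random access is slow under linked allocation, the sketch below follows next pointers to find the n-th block of a file. The block numbers and the `next_block` table are made up for illustration; on a real disk each pointer lives inside the block it belongs to, so every step of the loop would cost a disk read.

```python
END = -1

# next_block[b] stands in for the "next pointer" stored in disk block b.
# Hypothetical file occupying blocks 9 -> 16 -> 1 -> 10 -> 25.
next_block = {9: 16, 16: 1, 1: 10, 10: 25, 25: END}


def nth_block(first_block: int, n: int) -> int:
    """Return the disk block holding logical block n of the file (0-based)."""
    block = first_block
    for _ in range(n):              # one pointer chase (one disk read) per step
        block = next_block[block]
        if block == END:
            raise IndexError("logical block is past the end of the file")
    return block


first_data_block = nth_block(9, 0)   # the directory entry only stores block 9
last_data_block = nth_block(9, 4)    # needs four pointer follows
```

Reaching logical block n takes n pointer follows, and because the blocks may be scattered, even sequential access can incur a seek per block.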

Indexed File Allocation

• Brings all pointers together into the index block. ()

[Figure: index table pointing to the blocks of a file]

Multi-level Indexed File Allocation

[Figure: outer-index, index table, file]


UNIX (4K bytes per block)

Indexed Allocation (pros and cons)

• Space efficiency
  – no external fragmentation
  – overhead of index blocks
• Access speed
  – random access
  – sequential access
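For the UNIX slide with 4K-byte blocks, a worked calculation shows how multi-level indexing bounds file size and random-access cost. The slide's figure is not reproduced in this text, so the layout below is an assumption: a classic inode with 12 direct pointers, one single, one double, and one triple indirect pointer, and 4-byte block pointers.

```python
BLOCK_SIZE = 4096                    # from the slide title: 4K bytes per block
POINTER_SIZE = 4                     # assumption: 4-byte block pointers
N_DIRECT = 12                        # assumption: 12 direct pointers in the inode
PTRS = BLOCK_SIZE // POINTER_SIZE    # pointers that fit in one index block (1024)


def indirection_level(logical_block: int) -> int:
    """Index blocks to read before the data block: 0 = direct ... 3 = triple indirect."""
    limit = N_DIRECT
    for level in range(4):
        if logical_block < limit:
            return level
        limit += PTRS ** (level + 1)
    raise ValueError("beyond the maximum file size")


max_blocks = N_DIRECT + PTRS + PTRS**2 + PTRS**3
max_bytes = max_blocks * BLOCK_SIZE  # 48 KiB + 4 MiB + 4 GiB + 4 TiB
```

Under these assumptions small files (up to 48 KiB) need no index block at all, and even the largest file needs at most three extra block reads for a random access, in contrast to the linear pointer chase of linked allocation.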


Directory on the Disk

• Directory is a container of files
• For space management, similar to files
• But for directory, file system does care about its content
  – Linear list of file names and attributes (including pointers to the data blocks)
    • time-consuming to search an item
  – Hash Table – using a linked list to chain all files hashed to the same value
    • Pro: decreases directory search time
    • Con: increased complexity, a little waste of space
    • how much benefit does it really provide?

Where to put file attributes?

• File control block – data structure including all attributes for a file
• Where to put the file control block?
  – In the directory data structure
    • Hard to share files through links
  – In the system-level dedicated data structure
    • inode


Free-Space Management

• Free-space management for memory
• Bit map and linked free block list
• Space overhead: bit vs. word
• Efficiency
  – getting the address of one free block
  – getting the addresses of a number of free blocks
• Alternative: Grouping/clustering

[Figure: linked free block list with a head pointer]

Device Space Management

• Block size: internal fragmentation/wasted space vs. allocation efficiency and access latency
• Free space management
• Reducing disk arm motion
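The bit map approach from the Free-Space Management slide can be sketched as one bit per disk block. The class name `FreeBitmap` and the first-fit scan are illustrative choices, not a description of any particular file system.

```python
class FreeBitmap:
    """One bit per block: 0 = free, 1 = allocated."""

    def __init__(self, n_blocks: int) -> None:
        self.n_blocks = n_blocks
        self.bits = bytearray((n_blocks + 7) // 8)   # all zero: every block free

    def _is_used(self, block: int) -> bool:
        return bool((self.bits[block // 8] >> (block % 8)) & 1)

    def allocate(self) -> int:
        """Return the lowest-numbered free block and mark it allocated."""
        for block in range(self.n_blocks):
            if not self._is_used(block):
                self.bits[block // 8] |= 1 << (block % 8)
                return block
        raise OSError("no free blocks")

    def free(self, block: int) -> None:
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF


bitmap = FreeBitmap(10)
allocated = [bitmap.allocate() for _ in range(3)]   # blocks 0, 1, 2
bitmap.free(1)
reused = bitmap.allocate()                          # block 1 is handed out again
overhead_bytes = len(bitmap.bits)                   # 10 blocks need only 2 bytes
```

The space overhead is one bit per block, versus one pointer-sized word per free block for a linked free list. Finding several adjacent free blocks is a scan for a run of zero bits, which helps keep a file's blocks close together.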

File System Issues

• File naming and other attributes:
  – name, size, access time, sharing/protection, location
• Intra-file structure
  – None - sequence of words, bytes
  – Complex Structures
    • records/formatted document/executable
• File system organization: efficiency of disk access
• Concurrent access: allow multiple processes to read/write
• Reliability: integrity in the presence of failures
• Protection: sharing/protection attributes and access control lists (ACLs)

File Sharing and Protection

• Sharing of files on multi-user systems is desirable
• Sharing must be accompanied by a protection scheme
  – In general, a protection scheme specifies whether any specific user can access any specific file
    • Access control lists (ACL)
    • User, group, other permissions


File System Layout

[Figure: the entire disk holds the MBR, the partition table, and the disk partitions. Each partition holds a boot blk, a super blk, the root dir, reserved management space (free space mgmt, file attr. blocks), and “real” usable space (files, directories, free space)]

In-Memory Structures

• Used for file system management and performance improvement via caching
  – Mount table (info on each mounted volume)
  – Directory-structure cache
  – System-wide open file table
    • Copy of FCB (file control block) of each open file
  – Per-process open file table
    • Pointer to entry in system-wide table along with process-specific information
• Open system call returns a pointer to the appropriate entry in per-process file table (file descriptor or file handle)


Swap Space Management

• Part of file system?
  – Requires navigating directory structure
  – Disk allocation data structures
• Separate disk partition
  – No file system or directory structure
  – Optimize for speed rather than storage efficiency
  – When is swap space created?

Delayed Writes and Data Loss at Machine Crash

• Writes are commonly delayed for better performance
  – data to be written is cached
• A sudden machine crash may result in a loss of data
  – a completed write does not mean the data is safely stored on storage
• fsync() – flush all delayed writes of a file to disk
  – fsync() may not even be totally safe with delayed writes on disk controller buffer cache
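A minimal sketch of forcing delayed writes out with fsync(), using Python's `os.fsync`. The file name is hypothetical. Note the two layers of buffering: the language runtime's own buffer and the operating system's cache.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log.txt")   # hypothetical scratch file

with open(path, "w") as f:
    f.write("committed record\n")   # may sit in the runtime's user-space buffer
    f.flush()                       # hand the data to the OS (still cached in memory)
    os.fsync(f.fileno())            # ask the OS to push this file's cached data to the device

with open(path) as f:
    contents = f.read()
```

Without the fsync call, write returns as soon as the data is cached, so a crash shortly afterwards can lose it. As the slide notes, even after fsync the data may still be sitting in the disk controller's own buffer.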


Consistency: Weaker Form of Reliability

• File system operations are not atomic; a sudden machine crash may leave the file system in an inconsistent state
• (In-)Consistency
  – Missing blocks
  – Duplicate free blocks
  – Duplicate data blocks
• Consistency checking and fix (fsck, scandisk)
  – use redundant data on disk to recover consistency
  – E.g., free block cannot be on the free list and in a file

Journaling

• Journaling:
  – maintain a dedicated journal that logs all operations
  – the logging happens before the real operation
  – each logging is made to be atomic
  – after the completion of an operation, its entry is removed from the journal
  – at the recovery time, only journal entries need to be examined ⇒ fast recovery
  – similar to transactions in database systems
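The journaling steps above (log first, apply, remove the entry, replay on recovery) can be mimicked with an in-memory toy. Everything here is an assumption made for illustration: the class `JournaledStore`, the use of Python lists and dicts in place of on-disk structures, and the `crash_before_apply` flag that simulates a crash between logging and the real operation. Recovery is done by redoing logged operations, which is one possible strategy.

```python
class JournaledStore:
    def __init__(self) -> None:
        self.journal = []   # stands in for the dedicated on-disk journal
        self.data = {}      # stands in for the on-disk file system state

    def write(self, key, value, crash_before_apply: bool = False) -> None:
        entry = (key, value)
        self.journal.append(entry)      # 1. log before the real operation (assumed atomic)
        if crash_before_apply:
            return                      # simulated crash: state not yet updated
        self.data[key] = value          # 2. the real operation
        self.journal.remove(entry)      # 3. completed, so drop the journal entry

    def recover(self) -> None:
        for key, value in self.journal:  # only journal entries are examined
            self.data[key] = value       # redoing a write twice is harmless
        self.journal.clear()


store = JournaledStore()
store.write("a", 1)
store.write("b", 2, crash_before_apply=True)
state_after_crash = dict(store.data)     # "b" is missing, but it is in the journal
store.recover()
state_after_recovery = dict(store.data)
```

Recovery touches only the one pending entry rather than scanning the whole store, which is the source of the fast recovery compared with a full consistency check.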

Log-Structured File Systems

• With CPUs faster, memory larger
  – buffer caches can also be larger
  – most of read requests can come from the memory cache
  – thus, most disk accesses will be writes
  – poor disk performance when most writes are small
• LFS Strategy [Rosenblum & Ousterhout, SOSP 1991]
  – structures entire disk as a log
  – always write to the end of the disk log
  – when updates are needed, simply add new copies with updated content; old copies of the blocks are still in the earlier portion of the log
  – periodically purge out useless blocks

Log-Structured vs. Unix


Journaling

• LFS is a dynamic journal
• Physical journal ()
• Logical journal (NTFS)
• Snapshotting (ZFS)

Linux and the Extended File Systems

ext | Volume Size | File Size | Filename length
ext2 | 2-32 TiB | 16 GiB-2 TiB | 255 bytes
ext3 | 4-32 TiB | 16 GiB-2 TiB | 255 bytes
ext4 | 16 TiB-1 EiB | 16 TiB | 255 bytes

Ext 3 and 4: Journaling


“New” Motivations

• Fast recovery
  – Compared to fsck/scandisk
• Persistency
  – Availability

Solid State Drives

• No mechanical component (moving parts)
• Lower energy requirements
• Speed
  – Reads and writes in the order of 10s of microseconds (reading faster than writing)
  – Erase on the order of a millisecond
• Finite number of erase and write cycles, requiring what is called “wear leveling”


Flash Memory (Based on Charge)

• Based on floating-gate transistor

Solid State Drives: File System Implications?

• No need to “cluster” data to reduce seek time
• Need to avoid writes to the same block
• File system cache less useful due to lower speed mismatch
• Log-structured file system for SSD
  – Provides wear leveling


Flash File Systems for Solid State Drives

• E.g., JFFS, YAFFS, LogFS
• Log-structured file systems

Read and Write Bandwidths


Example File Systems

• MS-DOS/Windows – file allocation table (FAT), NTFS
• Berkeley-FFS
• Linux – VFS, ext2fs, ext3, ext4
• NFS
• JFFS (Journaling Flash File System)
• …

Miscellaneous Issues

• Directory structure – acyclic versus cyclic graphs
• File system mounting and the
• File protection
  – Types of access
    • File - r, w, x, append, delete, list
    • Directory – search for/rename/create/delete file, list directory, traverse file system
  – Access control
• Efficiency and performance
  – Positioning of blocks
  – Buffer and page caches

Virtual File System

Buffering, Caching, Spooling

• Techniques for handling speed mismatch
  – Buffering: e.g., wait for buffer to fill from slow device before writing to disk
  – Caching: copy in memory for fast access relative to a slower device (e.g., disk)
  – Spooling: buffer to hold output for a device such as a printer that cannot multiplex/handle more than one request at a time


I/O Software Layers

[Figure: software in the machine, top to bottom: Application Program, High-level OS software, Device driver, Device controller, Device]

• Device driver
  – Device-dependent OS I/O software; directly interacts with controller hardware
  – Interface to upper-layer OS code is standardized

I/O System Layers

• Device driver
  – Software program to manage device controller
  – System software (part of OS)
• Device controller
  – Contains control logic, command registers, status registers, and on-board buffer space
  – Firmware/hardware

Device Driver Reliability

• Device driver is the device-specific part of the kernel-space I/O software; it also includes interrupt handlers
• Device drivers must run in kernel mode
  ⇒ The crash of a device driver typically brings down the whole system
• Device drivers are probably the buggiest part of the OS
• How to make the system more reliable by isolating the faults of device drivers?
  – Run most of the device driver code at user level
  – Restrict and limit device driver operations in the kernel

High-level I/O Software

• Device independence
  – reuse software as much as possible across different types of devices
• Buffering
  – data coming off a device is stored in an intermediate buffer
  – purpose: access speed/granularity matching with I/O devices
    • caching
    • speculative I/O


File System Caching

• File content is cached in memory buffer for later reuse
  – what is the basic unit of such caching?
    • Disk blocks vs. clusters vs. pages
• Replacement policy for file system buffer cache
  – LRU replacement is one possibility; but sequential access is very likely in file system I/O
  – MRU or free-behind

File System Prefetching

• File content is read ahead of time for anticipated use in the near future
• Often sequential (based on past access history on the file)
• What is the advantage of file prefetching?
• What is the danger of file prefetching?
• A balanced scheme that provides competitive performance to the optimal scheme [Li et al. EuroSys 2007]


Informed Prefetching

• Informed prefetching – prefetching while utilizing some information about application data access pattern
• Application I/O hints [Cao et al. 1994] [Patterson et al. 1995]
• Automatic I/O hints based on speculative execution [Chang & Gibson 2000], [Fraser & Chang 2003]

Buffer Cache in Main Memory

• Memory-mapped I/O naturally shares page cache with the virtual memory system
• Problems:
  – double buffering
  – inconsistencies

[Figure: memory-mapped I/O, file system direct I/O, virtual memory system page cache, file system block cache, disk]


Unified Buffer Cache & Unified Virtual Memory

• A unified buffer cache uses the same page cache to store [Pai et al. 1999]
  – virtual memory pages
  – memory-mapped pages
  – file system direct I/O (page-based) data

[Figure: memory-mapped I/O and file system direct I/O both pass through the unified buffer cache to the disk]

Multi-level I/O Buffer

• buffer cache in the main memory
• track cache on the disk controller

[Figure: host machine memory buffer cache, disk controller buffer cache, disk magnetic media]


Protection in UNIX

• Protection domains: users
• Access matrix for files:
  – a simplified access control list
• Protection commands for files:
  – each user can change protection on files it owns
  – superuser can do everything

Disclaimer

• Parts of the lecture slides contain original work of Abraham Silberschatz, Peter B. Galvin, Greg Gagne, Andrew S. Tanenbaum, and Gary Nutt. The slides are intended for the sole purpose of instruction of operating systems at the University of Rochester. All copyrighted materials belong to their original owner(s).
