Introduction

• Unix first developed in 1969 at Bell Labs (Thompson & Ritchie).
• Originally written in PDP-7 assembler, but then (1973) rewritten in the 'new' high-level language C
  ⇒ easy to port, alter, read, etc.
• 6th edition ("V6") was widely available (1976).
  – source avail ⇒ people could write new tools.
  – nice features of other OSes rolled in promptly.
• By 1978, V7 available (for both the 16-bit PDP-11 and the new 32-bit VAX-11).
• Since then, two main families:
  – AT&T: "System V", currently SVR4.
  – Berkeley: "BSD", currently 4.3BSD/4.4BSD.
• Standardisation efforts (e.g. POSIX, X/OPEN) to homogenise.
• Best known "UNIX" today is probably Linux, but also get FreeBSD, NetBSD, and (commercially) Solaris, OSF/1, IRIX, and Tru64.

Unix Family Tree (Simplified)

[Figure: family tree, 1969–1993 — First Edition (1969), Fifth Edition (1973), Sixth Edition (1976), Seventh Edition and 32V (1978); the Berkeley line 3BSD, 4.0BSD, 4.1BSD, 4.2BSD, 4.3BSD, 4.3BSD/Tahoe, 4.3BSD/Reno, 4.4BSD; the AT&T line System III, System V, SVR2, SVR3, SVR4; the research Eighth/Ninth/Tenth Editions; plus Mach, SunOS, SunOS 3, SunOS 4, Solaris, Solaris 2 and OSF/1.]


Design Features

Ritchie and Thompson writing in CACM, July 74, identified the following (new) features of UNIX:
1. A hierarchical file system incorporating demountable volumes.
2. Compatible file, device and inter-process I/O.
3. The ability to initiate asynchronous processes.
4. System command language selectable on a per-user basis.
5. Over 100 subsystems including a dozen languages.
6. A high degree of portability.

Features which were not included:
• real time
• multiprocessor support
Fixing the above is pretty hard.

Structural Overview

[Figure: applications (processes) in user space sit above the system call interface; the kernel contains memory management, process management and the file system (block I/O and character I/O), layered over device drivers and the hardware.]

• Clear separation between user and kernel portions.
• Processes are unit of scheduling and protection.
• All I/O looks like operations on files.

File Abstraction

• A file is an unstructured sequence of bytes.
• Represented in user-space by a file descriptor (fd)
• Operations on files are:
  – fd = open(pathname, mode)
  – fd = creat(pathname, mode)
  – bytes = read(fd, buffer, nbytes)
  – count = write(fd, buffer, nbytes)
  – reply = seek(fd, offset, whence)
  – reply = close(fd)
• Devices represented by special files:
  – support above operations, although perhaps with bizarre semantics.
  – also have ioctl's: allow access to device-specific functionality.
• Hierarchical structure supported by directory files.

Directory Hierarchy

[Figure: tree rooted at / with bin/, dev/ (hda, hdb, tty), etc/, home/ (steve/, jean/) and usr/; steve/ holds unix.ps and index.html.]

• Directories map names to files (and directories).
• Have distinguished root directory called '/'
• Fully qualified pathnames ⇒ perform traversal from root.
• Every directory has '.' and '..' entries: refer to self and parent respectively.
• Shortcut: current working directory (cwd).
• In addition shell provides access to home directory as ~username (e.g. ~steve/)
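Returning to the file operations listed under File Abstraction: a minimal sketch of them in C (illustrative only — the helper name copy_file and the buffer size are not from the notes; modern POSIX spells seek as lseek):

    /* Copy one file to another using only the calls listed above. */
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    int copy_file(const char *src, const char *dst)
    {
        char buf[4096];
        ssize_t n;

        int in  = open(src, O_RDONLY);              /* fd = open(pathname, mode)  */
        int out = creat(dst, 0644);                 /* fd = creat(pathname, mode) */
        if (in < 0 || out < 0)
            return -1;

        while ((n = read(in, buf, sizeof buf)) > 0) /* bytes = read(fd, buf, n)   */
            write(out, buf, n);                     /* count = write(fd, buf, n)  */

        lseek(in, 0, SEEK_SET);                     /* seek(fd, offset, whence)   */
        close(in);                                  /* reply = close(fd)          */
        close(out);
        return (n < 0) ? -1 : 0;
    }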


Aside: Password File

• /etc/passwd holds list of password entries.
• Each entry roughly of the form:
    user-name:encrypted-passwd:home-directory:shell
• Use one-way function to encrypt passwords.
  – i.e. a function which is easy to compute in one direction, but has a hard to compute inverse.
• To login:
  1. Get user name
  2. Get password
  3. Encrypt password
  4. Check against version in /etc/passwd
  5. If ok, instantiate login shell.
• Publicly readable since lots of useful info there.
• Problem: off-line attack.
• Solution: shadow passwords (/etc/shadow)

File System Implementation

[Figure: an i-node holding type, mode, userid, groupid, size, nblocks, nlinks, flags, timestamps (×3), twelve direct block pointers to 512-byte data blocks, plus single, double and triple indirect pointers, each indirect block holding 512 entries.]

• Inside kernel, a file is represented by a data structure called an index-node or i-node.
• Holds file meta-data:
  1. Owner, permissions, reference count, etc.
  2. Location on disk of actual data (file contents).
• Where is the filename kept?
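The filename is kept in the directory, not in the i-node — which is exactly what stat(2) reflects: struct stat is essentially the in-memory i-node, with no name field. A sketch in C (the helper name show_inode is illustrative, not from the notes):

    #include <stdio.h>
    #include <sys/stat.h>

    void show_inode(const char *path)
    {
        struct stat st;
        if (stat(path, &st) < 0)         /* path is resolved via directories */
            return;
        printf("inode %lu: mode %o, links %lu, uid %d, size %lld, blocks %lld\n",
               (unsigned long)st.st_ino, (unsigned)st.st_mode,
               (unsigned long)st.st_nlink, (int)st.st_uid,
               (long long)st.st_size, (long long)st.st_blocks);
    }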

Directories and Links

[Figure: directory tree rooted at / with home/, bin/, doc/; home/ holds steve/ and jean/; steve/ holds hello.txt, misc/, unix.ps and index.html. Two directory files are shown as (filename, i-node) tables, e.g. { . → 13, .. → 2, hello.txt → 107, unix.ps → 78 } and { . → 56, .. → 214, unix.ps → 78, index.html → 385, misc → 47 } — note both directories contain a link to i-node 78.]

• Directory is a file which maps filenames to i-nodes.
• An instance of a file in a directory is a (hard) link.
• (this is why have reference count in i-node).
• Directories can have at most 1 (real) link. Why?
• Also get soft- or symbolic-links: a 'normal' file which contains a filename.

On-Disk Structures

[Figure: a hard disk laid out as blocks 0, 1, 2, ..., m: a boot block, then Partition 1 and Partition 2, each holding a super-block, an inode table and data blocks.]

• A disk is made up of a boot block followed by one or more partitions.
• (a partition is just a contiguous range of N fixed-size blocks of size k for some N and k).
• A Unix file-system resides within a partition.
• Superblock contains info such as:
  – number of blocks in file-system
  – number of free blocks in file-system
  – start of the free-block list
  – start of the free-inode list.
  – various bookkeeping information.
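Returning to links: a short C sketch of the distinction (the filenames are illustrative, not from the notes) — link() adds another directory entry for the same i-node, while symlink() creates a new file whose contents are just a pathname.

    #include <unistd.h>

    int make_links(void)
    {
        if (link("orig", "hard") < 0)     /* new directory entry, same i-node;  */
            return -1;                    /* i-node reference count becomes 2   */
        if (symlink("orig", "soft") < 0)  /* new i-node whose data is "orig"    */
            return -1;
        return unlink("hard");            /* remove the entry, count back to 1  */
    }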


Mounting File-Systems

[Figure: the root file-system / (bin/, dev/, etc/, usr/, home/) on partition hda1; a second file-system on /dev/hda2 (holding steve/ and jean/) mounted at the mount point /home; partition hdb1 also shown.]

• Entire file-systems can be mounted on an existing directory in an already mounted filesystem.
• At very start, only '/' exists ⇒ need to mount a root file-system.
• Subsequently can mount other file-systems, e.g. mount("/dev/hda2", "/home", options)
• Provides a unified name-space: e.g. access /home/steve/ directly.
• Cannot have hard links across mount points: why?
• What about soft links?

In-Memory Tables

[Figure: in user space, Process A and Process B each have a process-specific file table whose entries (0..N) hold small integers; these index the system-wide open file table in kernel space (which includes any process-specific info, e.g. the file pointer); those entries in turn point into the active inode table, e.g. descriptors in both processes reaching in-memory Inode 78.]

• Recall process sees files as file descriptors
• In implementation these are just indices into process-specific open file table
• Entries point to system-wide open file table. Why?
• These in turn point to (in memory) inode table.
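Why a system-wide table? Because its entries — and the file pointer they hold — can be shared. A C sketch (illustrative only): after fork() both processes' descriptors name the same open file table entry, so a read in one advances the offset seen by the other, unlike two independent open() calls.

    #include <fcntl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    void shared_offset_demo(void)
    {
        char c;
        int fd = open("/etc/passwd", O_RDONLY);

        if (fork() == 0) {            /* child inherits the same table entry  */
            read(fd, &c, 1);          /* advances the shared file pointer     */
            _exit(0);
        }
        wait(NULL);
        read(fd, &c, 1);              /* parent now reads the *second* byte   */
        close(fd);
    }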

Access Control

[Figure: two permission-bit patterns, read/write/execute for each of owner, group and world — rw-r----- = 0640 and rwxr-xr-x = 0755.]

• Access control information held in each inode.
• Three bits for each of owner, group and world: read, write and execute.
• What do these mean for directories?
• In addition have setuid and setgid bits:
  – normally processes inherit permissions of invoking user.
  – setuid/setgid allow user to "become" someone else when running a given program.
  – e.g. prof owns both executable test (0711 and setuid), and score file (0600)
    ⇒ any user can run it.
    ⇒ it can update score file.
    ⇒ but users can't cheat.

Consistency Issues

• To delete a file, use the unlink system call.
• From the shell, this is rm <filename>
• Procedure is:
  1. check if user has sufficient permissions on the file (must have write access).
  2. check if user has sufficient permissions on the directory (must have write access).
  3. if ok, remove entry from directory.
  4. Decrement reference count on inode.
  5. if now zero:
     (a) free data blocks.
     (b) free inode.
• If crash: must check entire file-system:
  – check if any block unreferenced.
  – check if any block double referenced.
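Returning to the setuid example under Access Control: in C it corresponds to two chmod(2) calls (a sketch assuming prof already owns both files; 04000 is the setuid bit, so 04711 = setuid + rwx--x--x):

    #include <sys/stat.h>

    int protect_scores(void)
    {
        if (chmod("test", 04711) < 0)   /* anyone can run it, and it runs as prof */
            return -1;
        return chmod("score", 0600);    /* only prof -- or test -- can update it  */
    }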


Unix File-System: Summary

• Files are unstructured byte streams.
• Everything is a file: 'normal' files, directories, symbolic links, special files.
• Hierarchy built from root ('/').
• Unified name-space (multiple file-systems may be mounted on any leaf directory).
• Low-level implementation based around inodes.
• Disk contains list of inodes (along with, of course, actual data blocks).
• Processes see file descriptors: small integers which map to system file table.
• Permissions for owner, group and everyone else.
• Setuid/setgid allow for more flexible control.
• Care needed to ensure consistency.

Unix Processes

[Figure: per-process address space — the kernel address space (shared by all) at the top; then the stack segment, which grows downward as functions are called; free space; the data segment, which grows upwards as more memory is allocated; and the text segment.]

• Recall: a process is a program in execution.
• Have three segments: text, data and stack.
• Unix processes are heavyweight.

Unix Process Dynamics

[Figure: a parent process forks a child; the child execve()'s a program, the program executes and exits, leaving a zombie process; the parent (which potentially continues execution) collects it with wait.]

• Process represented by a process id (pid)
• Hierarchical scheme: parents create children.
• Four basic primitives:
  – pid = fork()
  – reply = execve(pathname, argv, envp)
  – exit(status)
  – pid = wait(status)
• fork() nearly always followed by exec()
  ⇒ vfork() and/or COW.

Start of Day

• Kernel (/vmunix) loaded from disk (how?) and execution starts.
• Root file-system mounted.
• Process 1 (/etc/init) hand-crafted.
• init reads file /etc/inittab and for each entry:
  1. opens terminal special file (e.g. /dev/tty0)
  2. duplicates the resulting fd twice.
  3. forks an /etc/tty process.
• each tty process next:
  1. initialises the terminal
  2. outputs the string "login:" & waits for input
  3. execve()'s /bin/login
• login then:
  1. outputs "password:" & waits for input
  2. encrypts password and checks it against /etc/passwd.
  3. if ok, sets uid & gid, and execve()'s shell.
• Patriarch init resurrects /etc/tty on exit.
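Both the process-dynamics figure and the start-of-day sequence boil down to the same C idiom: fork, execve in the child, wait in the parent. A sketch (the wrapper name run is illustrative; error handling abbreviated):

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int run(const char *path, char *const argv[], char *const envp[])
    {
        int status;
        pid_t pid = fork();

        if (pid == 0) {                      /* child: becomes the new program */
            execve(path, argv, envp);
            _exit(127);                      /* only reached if execve failed  */
        }
        if (pid < 0)
            return -1;
        waitpid(pid, &status, 0);            /* parent: reap the zombie        */
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }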


The Shell

[Figure: the shell's main loop, repeated ad infinitum — write: issue prompt; read: get command line; fork; the child process execve()'s the program, which executes and exits (becoming a zombie); if the command is foreground (fg? yes) the shell waits for it, otherwise it loops straight back to the prompt.]

• Shell just a process like everything else.
• Uses path for convenience.
• Conventionally '&' specifies background.
• Parsing stage (omitted) can do lots. . .

Shell Examples

    # pwd
    /home/steve
    # ls -F
    IRAM.micro.ps             gnome_sizes      prog-nc.ps
    Mail/                     ica.tgz          rafe/
    OSDI99_self_paging.ps.gz  lectures/        rio107/
    TeX/                      linbot-1.0/      src/
    adag.pdf                  manual.ps        store.ps.gz
    docs/                     past-papers/     wolfson/
    emacs-lisp/               pbosch/          xeno_prop/
    fs.html                   pepsi_logo.tif
    # cd src/
    # pwd
    /home/steve/src
    # ls -F
    cdq/          emacs-20.3.tar.gz  misc/       read_mem.c
    emacs-20.3/   ispell/            read_mem*   rio007.tgz
    # wc read_mem.c
         95     225    2262 read_mem.c
    # ls -lF r*
    -rwxrwxr-x  1 steve  user    34956 Mar 21  1999 read_mem*
    -rw-rw-r--  1 steve  user     2262 Mar 21  1999 read_mem.c
    -rw-------  1 steve  user    28953 Aug 27 17:40 rio007.tgz
    # ls -l /usr/bin/X11/xterm
    -rwxr-xr-x  2 root   system 164328 Sep 24 18:21 /usr/bin/X11/xterm*

• Prompt is '#'.
• Use man to find out about commands.
• User friendly?
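The loop behind such a session, stripped down to C (illustrative only: the name mini_shell is not from the notes, and "parsing" is reduced to a single argument-free command plus an optional trailing '&'):

    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    void mini_shell(void)
    {
        char line[256];

        for (;;) {
            fputs("# ", stdout);                        /* issue prompt        */
            if (fgets(line, sizeof line, stdin) == NULL)
                break;                                  /* EOF: leave the loop */
            line[strcspn(line, "\n")] = '\0';

            size_t len = strlen(line);
            int bg = (len > 0 && line[len - 1] == '&'); /* '&' = background    */
            if (bg)
                line[--len] = '\0';
            if (len == 0)
                continue;

            char *argv[] = { line, NULL };
            if (fork() == 0) {                          /* child runs program  */
                execv(line, argv);
                _exit(127);                             /* execv failed        */
            }
            if (!bg)
                wait(NULL);                             /* fg? yes -> wait     */
        }
    }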

Standard I/O

• Every process has three fds on creation:
  – stdin: where to read input from.
  – stdout: where to send output.
  – stderr: where to send diagnostics.
• Normally inherited from parent, but shell allows redirection to/from a file, e.g.:
  – ls >listing.txt
  – ls >&listing.txt
  – sh <commands.sh
• e.g. ls >temp.txt; wc <temp.txt >results
• Pipeline is better (e.g. ls | wc >results)
• Most Unix commands are filters ⇒ can build almost arbitrarily complex command lines.
• Redirection can cause some buffering subtleties.

Pipes

[Figure: a pipe as a circular buffer between Process A (writing with write(fd, buf, n)) and Process B (reading with read(fd, buf, n)), showing free space, old data and new data.]

• Logically consists of a pair of fds, e.g. reply = pipe( int fds[2] )
• Concept of "full" and "empty" pipes.
• Only allows communication between processes with a common ancestor (why?).
• Named pipes address this.
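How the shell builds "ls | wc" from pipe(): a C sketch (illustrative only; it uses execlp and skips error handling). Each child swaps one end of the pipe onto its stdin or stdout with dup2 before exec'ing.

    #include <sys/wait.h>
    #include <unistd.h>

    void ls_pipe_wc(void)
    {
        int fds[2];
        pipe(fds);                        /* fds[0] = read end, fds[1] = write end */

        if (fork() == 0) {                /* writer: ls                    */
            dup2(fds[1], STDOUT_FILENO);
            close(fds[0]); close(fds[1]);
            execlp("ls", "ls", (char *)NULL);
            _exit(127);
        }
        if (fork() == 0) {                /* reader: wc                    */
            dup2(fds[0], STDIN_FILENO);
            close(fds[0]); close(fds[1]);
            execlp("wc", "wc", (char *)NULL);
            _exit(127);
        }
        close(fds[0]); close(fds[1]);     /* parent must close both ends   */
        wait(NULL); wait(NULL);
    }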


Signals

• Problem: pipes need planning ⇒ use signals.
• Similar to a (software) interrupt.
• Examples:
  – SIGINT : user hit Ctrl-C.
  – SIGSEGV : program error.
  – SIGCHLD : a death in the family. . .
  – SIGTERM : . . . or closer to home.
• Unix allows processes to catch signals.
• e.g. Job control:
  – SIGTTIN, SIGTTOU sent to bg processes
  – SIGCONT turns bg to fg.
  – SIGSTOP does the reverse.
• Cannot catch SIGKILL (hence kill -9)
• Signals can also be used for timers, window resize, process tracing, . . .

I/O Implementation

[Figure: beneath the user/kernel boundary, a generic file system layer sits over the buffer cache and the "cooked" character I/O layer; below these are raw block I/O and raw character I/O, the device drivers, and the hardware.]

• Recall:
  – everything accessed via the file system.
  – two broad categories: block and char.
• Low-level stuff gory and machdep ⇒ ignore.
• Character I/O low rate but complex ⇒ most functionality in the "cooked" interface.
• Block I/O simpler but performance matters ⇒ emphasis on the buffer cache.
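Returning to signals: a minimal C sketch of catching SIGINT with sigaction() (illustrative only; the handler just records the signal and the main loop exits cleanly — SIGKILL, by contrast, cannot be caught, hence kill -9).

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sigint = 0;

    static void on_sigint(int signo)
    {
        (void)signo;
        got_sigint = 1;                 /* keep handlers tiny: just set a flag */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigint;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);   /* install handler for Ctrl-C          */

        while (!got_sigint)
            pause();                    /* sleep until any signal arrives      */
        printf("caught SIGINT, exiting cleanly\n");
        return 0;
    }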

The Buffer Cache

• Basic idea: keep copy of some parts of disk in memory for speed.
• On read do:
  1. Locate relevant blocks (from inode)
  2. Check if in buffer cache.
  3. If not, read from disk into memory.
  4. Return data from buffer cache.
• On write do same first three, and then update version in cache, not on disk.
• "Typically" prevents 85% of implied disk transfers.
• Question: when does data actually hit disk?
• Answer: call sync every 30 seconds to flush dirty buffers to disk.
• Can cache metadata too — problems?
• Need mutual exclusion and condition synchronisation, e.g. WAIT for a buffer, e.g. WAIT for full (data transfer complete).

Unix Process Scheduling

• Priorities 0–127; user processes ≥ PUSER = 50.
• Round robin within priorities, quantum 100ms.
• Priorities are based on usage and nice, i.e.

      P_j(i) = Base_j + CPU_j(i-1)/4 + 2 × nice_j

  gives the priority of process j at the beginning of interval i, where:

      CPU_j(i) = (2 × load_j)/(2 × load_j + 1) × CPU_j(i-1) + nice_j

  and nice_j is a (partially) user controllable adjustment parameter ∈ [−20, 20].
• load_j is the sampled average length of the run queue in which process j resides, over the last minute of operation.
• so if e.g. load is 1 ⇒ ∼90% of 1 second's CPU usage "forgotten" within 5 seconds.
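The "forgotten within 5 seconds" claim can be checked numerically: with load = 1 and nice = 0 the decay factor is 2/3 per interval, and (2/3)^5 ≈ 13%, i.e. roughly 90% of the original usage has gone. A small C sketch (values illustrative):

    #include <stdio.h>

    int main(void)
    {
        double cpu = 100.0;                 /* ticks accumulated in interval 0   */
        double load = 1.0, nice = 0.0;

        for (int i = 1; i <= 5; i++) {
            cpu = (2.0 * load) / (2.0 * load + 1.0) * cpu + nice;
            printf("after interval %d: %.1f ticks remembered\n", i, cpu);
        }
        return 0;                           /* prints 66.7, 44.4, 29.6, 19.8, 13.2 */
    }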


Unix Process States

[Figure: process state-transition diagram — fork() creates a new process (c); it becomes runnable (rb) and is scheduled to run in kernel mode (rk); returning to user mode gives ru, and a syscall or interrupt re-enters rk; from rk a process may sleep (sl) until a wakeup makes it runnable again, be pre-empted (p, returning in the same state), or exit to become a zombie (z).]

    ru = running (user-mode)    rk = running (kernel-mode)
    z  = zombie                 p  = pre-empted
    sl = sleeping               rb = runnable
    c  = created

• Note: above is simplified — see CS section 23.14 for detailed descriptions of all states/transitions.

Summary

• Main Unix features are:
  – file abstraction
    ∗ a file is an unstructured sequence of bytes
    ∗ (not really true for device and directory files)
  – hierarchical namespace
    ∗ directed acyclic graph (if exclude soft links)
    ∗ can recursively mount filesystems
  – heavy-weight processes
  – IPC: pipes & signals
  – I/O: block and character
  – dynamic priority scheduling
    ∗ base priority level for all processes
    ∗ priority is lowered if process gets to run
    ∗ over time, the past is forgotten
• But V7 had inflexible IPC, inefficient memory management, and poor kernel concurrency.
• Later versions address these issues.

Windows NT: History

After OS/2, MS decide they need "New Technology":
• 1988: Dave Cutler recruited from DEC.
• 1989: team (∼10 people) starts work on a new OS with a micro-kernel architecture.
• July 1993: first version (3.1) introduced
  – bloated and suckful (smh22) ⇒
• NT 3.5 released in September 1994: mainly size and performance optimisations.
• Followed in May 1995 by NT 3.51 (support for the Power PC, and more performance tweaks)
• July 1996: NT 4.0
  – new (windows 95) look 'n feel
  – various functions pushed back into kernel (most notably graphics rendering functions)
  – ongoing upgrades via service packs

Windows NT: Evolution

• Feb 2000: NT 5.0 aka Windows 2000
  – borrows from windows 98 look 'n feel
  – both server and workstation versions, latter of which starts to get wider use
  – big push to finally kill DOS/Win 9x family
  – (fails due to internal politicking)
• Windows XP (NT 5.1) launched Oct 2001
  – home and professional ⇒ finally kills win 9x
  – various "editions" (media center [2003], 64-bit [2005]) and service packs (SP1, SP2).
• Server product 2K3 (NT 5.2) released 2003
  – basically the same, modulo registry tweaks, support contract and, of course, cost
  – a plethora of editions . . .
• Windows Vista (NT 6.0) due Q4 2006
  – new Aero UI, new WinFX API
  – missing Longhorn bits like WinFS, Msh
• Longhorn Server (NT x.y?) probably 2007 . . .


NT Design Principles

Key goals for the system were:
• portability
• security
• POSIX compliance
• multiprocessor support
• extensibility
• international support
• compatibility with MS-DOS/Windows applications

This led to the development of a system which was:
• written in high-level languages (C and C++)
• based around a micro-kernel, and
• constructed in a layered/modular fashion.

Structural Overview

[Figure: in user mode, applications (Posix, logon process, Win16, Win32, MS-DOS, OS/2) run over their environmental subsystems (Security, Win16, MS-DOS, POSIX, OS/2, Win32); beneath the Native NT Interface (system calls), kernel mode holds the Executive (I/O Manager with file system drivers and the cache manager, VM Manager, Object Manager, Process Manager, Security Manager, LPC Facility), the device drivers, the Kernel, and the Hardware Abstraction Layer (HAL) over the hardware.]

• Kernel Mode: HAL, Kernel, & Executive
• User Mode:
  – environmental subsystems
  – protection subsystem

HAL

• Layer of software (HAL.DLL) which hides details of underlying hardware
• e.g. interrupt mechanisms, DMA controllers, multiprocessor communication mechanisms
• Many HALs exist with same interface but different implementation (often vendor-specific)

Kernel

• Foundation for the executive and the subsystems
• Execution is never preempted.
• Four main responsibilities:
  1. CPU scheduling
  2. interrupt and exception handling
  3. low-level processor synchronisation
  4. recovery after a power failure
• Kernel is object-oriented; all objects are either dispatcher objects or control objects

Processes and Threads

NT splits the "virtual processor" into two parts:
1. A process is the unit of resource ownership. Each process has:
   • a security token,
   • a virtual address space,
   • a set of resources (object handles), and
   • one or more threads.
2. A thread is the unit of dispatching. Each thread has:
   • a scheduling state (ready, running, etc.),
   • other scheduling parameters (priority, etc),
   • a context slot, and
   • (generally) an associated process.
Threads are:
• co-operative: all threads in a process share the same address space & object handles.
• lightweight: require less work to create/delete than processes (mainly due to shared VAS).
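A minimal Win32 C sketch of the process/thread split (illustrative, not from the notes): one process creating a second thread, with both threads sharing the same address space and handle table.

    #include <windows.h>
    #include <stdio.h>

    static DWORD WINAPI worker(LPVOID arg)
    {
        printf("worker sees value = %d\n", *(int *)arg);  /* shared address space */
        return 0;
    }

    int main(void)
    {
        int value = 42;
        DWORD tid;
        HANDLE h = CreateThread(NULL, 0, worker, &value, 0, &tid);
        if (h == NULL)
            return 1;
        WaitForSingleObject(h, INFINITE);   /* the thread handle is just an object */
        CloseHandle(h);
        return 0;
    }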


CPU Scheduling

• Hybrid static/dynamic priority scheduling:
  – Priorities 16–31: "real time" (static priority).
  – Priorities 1–15: "variable" (dynamic) priority.
• Default quantum 2 ticks (∼20ms) on Workstation, 12 ticks (∼120ms) on Server.
• Threads have base and current (≥ base) priorities.
  – On return from I/O, current priority is boosted by driver-specific amount.
  – Subsequently, current priority decays by 1 after each completed quantum.
  – Also get boost for GUI threads awaiting input: current priority boosted to 14 for one quantum (but quantum also doubled)
  – Yes, this is true.
• On Workstation also get quantum stretching:
  – ". . . performance boost for the foreground application" (window with focus)
  – fg thread gets double or triple quantum.

Object Manager

[Figure: an object consists of an object header (object name, object directory, security descriptor, quota charges, open handle count, open handles list, temporary/permanent flag, type object pointer, reference count) and an object body holding object-specific data (perhaps including a kernel object); the type object records the type name, common info and methods (open, close, delete, parse, security, query name). Processes 1, 2 and 3 hold open handles to the object.]

• Every resource in NT is represented by an object
• The Object Manager (part of the Executive) is responsible for:
  – creating objects and object handles
  – performing security checks
  – tracking which processes are using each object
• Typical operation:
  – handle = open(objectname, accessmode)
  – result = service(handle, arguments)

Object Namespace

[Figure: hierarchical namespace rooted at "\": directories such as ??\ (holding A:, C:, COM1:), driver\, device\ (holding Floppy0\, Serial0\, Harddisk0\ with Partition1\ and Partition2\) and BaseNamedObjects\; file-system names such as doc\exams.tex, winnt\ and temp\ are integrated beneath.]

• Recall: objects (optionally) have a name
• Object Manager manages a hierarchical namespace:
  – shared between all processes ⇒ sharing
  – implemented via directory objects
  – each object protected by an access control list.
  – naming domains (implemented via parse) mean file-system namespaces can be integrated
• Also get symbolic link objects: allow multiple names (aliases) for the same object.
• Modified view presented at API level. . .

Process Manager

• Provides services for creating, deleting, and using threads and processes.
• Very flexible:
  – no built in concept of parent/child relationships or process hierarchies
  – processes and threads treated orthogonally.
⇒ can support Posix, OS/2 and Win32 models.

Virtual Memory Manager

• NT employs paged virtual memory management
• The VMM provides processes with services to:
  – allocate and free virtual memory
  – modify per-page protections
• Can also share portions of memory:
  – use section objects (≈ software segments)
  – based versus non-based.
  – also used for memory-mapped files


Security Reference Monitor

• NT's uniform object-oriented design supports a mechanism for runtime access and audit checks
  – every time a process opens a handle to an object, check process security token and object's ACL
  – cf. Unix (file-system, networking, shared memory etc . . . )

Local Procedure Call Facility

• LPC (or IPC) passes requests and results between client and server processes within a single machine.
• Used to request services from the various NT environmental subsystems.
• Three variants of LPC channels:
  1. small messages (≤ 256 bytes): copy messages between processes
  2. zero copy: avoid copying large messages by pointing to a shared memory section object created for the channel.
  3. quick LPC: used by the graphical display portions of the Win32 subsystem.

I/O Manager

[Figure: I/O requests pass from the I/O Manager down a stack of drivers — file system driver, intermediate driver, device driver — to the HAL.]

• The I/O Manager is responsible for:
  – file systems
  – cache management
  – device drivers
• Basic model is asynchronous:
  – each I/O operation explicitly split into a request and a response
  – I/O Request Packet (IRP) used to hold parameters, results, etc.
• File-system & device drivers are stackable. . .

Cache Manager

• Cache Manager caches "virtual blocks":
  – viz. keeps track of cache "lines" (blocks) as offsets within a file rather than a volume.
  – disk layout & volume concept abstracted away.
  ⇒ no translation required for cache hit.
  ⇒ can get more intelligent prefetching
• Completely unified cache:
  – cache blocks all in virtual address space.
  – decouples physical & virtual cache systems: e.g.
    ∗ virtually cache in 256K blocks,
    ∗ physically cluster up to 64K.
  – NT virtual memory manager responsible for actually doing the I/O.
  – allows lots of FS cache when VM system lightly loaded, little when system is thrashing.
• NT/2K also provides some user control:
  – if specify temporary attrib when creating file ⇒ will never be flushed to disk unless necessary.
  – if specify write through attrib when opening a file ⇒ all writes will synchronously complete.

File Systems: FAT16

[Figure: a FAT with entries 0..n−1 — entry 0 holds disk info; files A and B are chains of entries ending in EOF; unused entries are marked Free; some entries are reserved. A directory entry holds File Name (8 bytes), Extension (3 bytes), Attribute bits ADVSHR (A: Archive, D: Directory, V: Volume Label, S: System, H: Hidden, R: Read-Only), Reserved (10 bytes), Time (2 bytes), Date (2 bytes), First Cluster (2 bytes) and File Size (4 bytes).]

• A file is a linked list of clusters: a cluster is a set of 2^n contiguous disk blocks, n ≥ 0.
• Each entry in the FAT contains either:
  – the index of another entry within the FAT, or
  – a special value EOF meaning "end of file", or
  – a special value Free meaning "free".
• Directory entries contain index into the FAT
• FAT16 could only handle partitions up to (2^16 × c) bytes ⇒ max 2Gb partition with 32K clusters.
• (and big cluster size is bad)
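Reading a FAT16 file is therefore just chasing indices through the table. A toy C sketch (the marker values and the example chain are illustrative, not the exact on-disk reserved values):

    #include <stdint.h>
    #include <stdio.h>

    #define FAT_FREE 0x0000
    #define FAT_EOF  0xFFFF

    /* A file is a linked list of clusters: each FAT entry names the next. */
    static void print_chain(const uint16_t *fat, uint16_t first_cluster)
    {
        for (uint16_t c = first_cluster; c != FAT_EOF; c = fat[c])
            printf("read cluster %u\n", c);
    }

    int main(void)
    {
        uint16_t fat[16] = { FAT_FREE };            /* toy FAT, all entries free */
        fat[2] = 8; fat[8] = 7; fat[7] = FAT_EOF;   /* file occupies 2 -> 8 -> 7 */
        print_chain(fat, 2);                        /* 2 comes from the dir entry */
        return 0;
    }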


File Systems: FAT32

• Obvious extension: instead of using 2 bytes per entry, FAT32 uses 4 bytes per entry
  ⇒ can support e.g. 8Gb partition with 4K clusters
• Further enhancements with FAT32 include:
  – can locate the root directory anywhere on the partition (in FAT16, the root directory had to immediately follow the FAT(s)).
  – can use the backup copy of the FAT instead of the default (more fault tolerant)
  – improved support for demand paged executables (consider the 4K default cluster size . . . ).
• VFAT on top of FAT32 does long name support: unicode strings of up to 256 characters.
  – want to keep same directory entry structure for compatibility with e.g. DOS
    ⇒ use multiple directory entries to contain successive parts of name.
  – abuse V attribute to avoid listing these
Still pretty primitive. . .

File-Systems: NTFS

[Figure: the Master File Table (MFT) — an array of file records; the first records hold the metadata files $Mft, $MftMirr, $LogFile, $Volume, $AttrDef, \ (root directory), $Bitmap, $BadClus, . . . , with records from 16 onwards describing user files/directories. Each file record holds attributes such as Standard Information, Filename and Data.]

• Fundamental structure of NTFS is a volume:
  – based on a logical disk partition
  – may occupy a portion of a disk, an entire disk, or span across several disks.
• An NTFS file is described by an array of records in a special file called the Master File Table (MFT).
• The MFT is indexed by a file reference (a 64-bit unique identifier for a file)
• A file itself is a structured object consisting of a set of attribute/value pairs of variable length. . .

NTFS: Recovery

• To aid recovery, all file system data structure updates are performed inside transactions:
  – before a data structure is altered, the transaction writes a log record that contains redo and undo information.
  – after the data structure has been changed, a commit record is written to the log to signify that the transaction succeeded.
  – after a crash, the file system can be restored to a consistent state by processing the log records.
• Does not guarantee that all the user file data can be recovered after a crash — just that metadata files will reflect some prior consistent state.
• The log is stored in the third metadata file at the beginning of the volume ($Logfile)
  – NT has a generic log file service
  ⇒ could in principle be used by e.g. database
• Overall makes for far quicker recovery after crash
• (modern Unix fs [ext3, xfs] use similar scheme)

NTFS: Fault Tolerance

[Figure: Hard Disk A (partitions A1, A2, A3) and Hard Disk B (partitions B1, B2), with partitions from both disks combined into logical volumes.]

• FtDisk driver allows multiple partitions to be combined into a logical volume:
  – logically concatenate multiple disks to form a large logical volume, a volume set.
  – based on the concept of RAID = Redundant Array of Inexpensive Disks:
  – e.g. RAID level 0: interleave multiple partitions round-robin to form a stripe set
  – e.g. RAID level 1 increases robustness by using a mirror set: two equally sized partitions on two disks with identical data contents.
  – (other more complex RAID levels also exist)
• FtDisk can also handle sector sparing where the underlying SCSI disk supports it
• (if not, NTFS supports s/w cluster remapping)


NTFS: Other Features

• Security:
  – security derived from the NT object model.
  – each file object has a security descriptor attribute stored in its MFT record.
  – this attribute contains the access token of the owner of the file plus an access control list
• Compression:
  – NTFS can divide a file's data into compression units (blocks of 16 contiguous clusters)
  – NTFS also has support for sparse files
    ∗ clusters with all zeros not allocated or stored
    ∗ instead, gaps are left in the sequences of VCNs kept in the file record
    ∗ when reading a file, gaps cause NTFS to zero-fill that portion of the caller's buffer.
• Encryption:
  – Use symmetric key to encrypt files; file attribute holds this key encrypted with user public key
  – Problems: private key pretty easy to obtain; and administrator can bypass entire thing anyhow.

Environmental Subsystems

• User-mode processes layered over the native NT executive services to enable NT to run programs developed for other operating systems.
• NT uses the Win32 subsystem as the main operating environment; Win32 is used to start all processes. It also provides all the keyboard, mouse and graphical display capabilities.
• MS-DOS environment is provided by a Win32 application called the virtual dos machine (VDM), a user-mode process that is paged and dispatched like any other NT thread.
• 16-Bit Windows Environment:
  – Provided by a VDM that incorporates Windows on Windows
  – Provides the Windows 3.1 kernel routines and stub routines for window manager and GDI functions.
• The POSIX subsystem is designed to run POSIX applications following the POSIX.1 standard, which is based on the UNIX model.

Summary

• Main Windows NT features are:
  – layered/modular architecture
  – generic use of objects throughout
  – multi-threaded processes
  – multiprocessor support
  – asynchronous I/O subsystem
  – NTFS filing system (vastly superior to FAT32)
  – preemptive priority-based scheduling
• Design essentially more advanced than Unix.
• Implementation of lower levels (HAL, kernel & executive) actually rather decent.
• But: has historically been crippled by
  – almost exclusive use of Win32 API
  – legacy device drivers (e.g. VXDs)
  – lack of demand for "advanced" features
• Continues to evolve:
  – Windows Vista (NT 6.0) due Q4 2006
  – Longhorn due 2007–2009
  – Singularity research OS. . .

Course Review

• Part I: Computer Organisation
  – fetch-execute cycle, data representation, etc
  – mainly for getting up to speed for h/w courses
• Part II: Functions
  – OS structures: h/w support, kernel vs. µ-kernel
  – Processes: states, structures, scheduling
  – Memory: virtual addresses, sharing, protection
  – Filing: directories, meta-data, file operations.
• Part III: Concurrency Control
  – multithreaded processes
  – mutual exclusion and condition synchronisation
  – implementation of concurrency control
• Part IV: Case Studies
  – UNIX and Windows NT
