
File System Issues
• What is the role of files? What is the file abstraction?
• File naming. How to find the file we want? Sharing files. Controlling access to files.
• Performance issues - how to deal with the bottleneck of disks? What is the “right” way to optimize file access?

Role of Files
• Persistence - long-lived - for posterity
  → non-volatile storage media
  → semantically meaningful (memorable) names

File Abstractions
• UNIX-like files
  – Sequence of bytes
  – Operations: open (create), close, read, write, seek
• Memory mapped files
  – Sequence of bytes
  – Mapped into address space
  – Page fault mechanism does data transfer
• Named, possibly typed

Abstractions
[Figure: levels of abstraction, top to bottom. User view: Addressbook, record for Duke CPS. Application: addrfile -> fid, byte range*. Below the application: fid, bytes -> device, block #. Disk Subsystem: surface, cylinder, sector.]

Functions of File System
• ( subsystem) Map file names to fileids: open (create) syscall. Create kernel data structures. Maintain naming structure (, , )
• Determine layout of files and metadata on disk in terms of blocks. Disk block allocation. Bad blocks.
• Handle read and write system calls
• Initiate I/O operations for movement of blocks to/from disk.
• Maintain buffer cache

Functions of Device Subsystem
In general, deal with device characteristics
• Translate block numbers (the abstraction of device shown to file system) to physical disk addresses. Device specific (subject to change with upgrades in technology) intelligent placement of blocks.
• Schedule (reorder?) disk operations

Know your Workload!
• File usage patterns should influence design decisions. Do things differently depending on:
  – How large are most files? How long-lived? Read vs. write activity. Shared often?
  – Different levels “see” a different workload.
• Feedback loop: usage patterns observed today <-> file system design and impl

Unix File Syscalls
    int fd, num, success, bufsize;
    char data[bufsize]; long offset, pos;

    fd = open (, mode [,permissions]);
    success = close (fd);
    pos = lseek (fd, offset, mode);
    num = read (fd, data, bufsize);
    num = write (fd, data, bufsize);

  open mode: O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_APPEND, ...
  permissions: user, grp, others as rwx rwx rwx (e.g., 111 100 000)
  lseek mode: relative to beginning, current position, end of file

File System Data Structures
[Figure: a process descriptor holds a per-process file ptr array (stdin, stdout, stderr, ...); its entries point into the system-wide open file table (r-w pos, mode); those entries point into the system-wide file descriptor table, which holds the in-memory copy of the inode with a ptr to the on-disk inode; from there, the file data.]

UNIX
[Figure: file attributes followed by data block addresses, each pointing to a data block.]
• Decoupling meta-data from directory entries

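Pathname Resolution
[Figure: resolving “cps110/current/Proj/proj3” starting from the index node of the wd. Directory node cps110 (file attributes; entry current -> inode#), directory node current (file attributes; entry Proj -> inode#), directory node Proj (file attributes; entry proj3 -> inode#), and finally the proj3 data file with its file attributes.]

File Sharing Between Parent/Child
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[]) {
        char c;
        int fdrd, fdwt, fdpriv;

        if (argc < 4) exit(1);
        /* Opened before fork(): parent and child share these open
           file table entries, including the seek offsets. */
        if ((fdrd = open(argv[1], O_RDONLY)) == -1) exit(1);
        if ((fdwt = creat(argv[2], 0666)) == -1) exit(1);
        fork();
        /* Opened after fork(): each process gets its own entry. */
        if ((fdpriv = open(argv[3], O_RDONLY)) == -1) exit(1);
        while (1) {
            if (read(fdrd, &c, 1) != 1) exit(0);
            write(fdwt, &c, 1);
        }
    }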

Sharing Open File Instances
[Figure: parent and child process descriptors (user ID, process ID, process group ID, parent PID, signal state, siblings, children), each with its own process file descriptors. Descriptors inherited across fork point to the same system open file table entry, so parent and child have a shared seek offset in the shared file table entry, which refers to the shared file (inode or vnode). Remaining labels: forked process’s objects; open after fork.]

File System Data Structures
[Figure: process descriptor -> per-process file ptr array (stdin, stdout, stderr, ...) -> system-wide open file table (r-w pos, mode) -> system-wide file descriptor table (in-memory copy of inode, ptr to on-disk inode).]

Memory Mapped Files
    fd = open (somefile, consistent_mode);
    pa = mmap(addr, len, prot, flags, fd, offset);

  prot: R, W, X, none
  flags: Shared, Private, Fixed, Noreserve
[Figure: len bytes of the file starting at fd + offset are mapped into the VAS at pa.]
Reading performed by Load instr.

Nachos File Syscalls/Operations
    Create("zot");
    OpenFileId fd;
    fd = Open("zot");
    Close(fd);
    char data[bufsize];
    Write(data, count, fd);
    Read(data, count, fd);

Limitations:
1. small, fixed-size files and directories
2. single disk with a single directory
3. stream files only: no seek syscall
4. file size is specified at creation
5. no access control, etc.

Goals of File Naming
• Foremost function - to find files (e.g., in open()). Map file name to file object.
• To store meta-data about files.
• To allow users to choose their own file names without undue name conflict problems.
• To allow sharing.
• Convenience: short names, groupings.
• To avoid implementation complications

Naming Structures
• Flat name space - 1 system-wide table
  – Unique naming with multiple users is hard. Name conflicts.
  – Easy sharing, need for protection
• Per-user name space
  – Protection by isolation, no sharing
  – Easy to avoid name conflicts
  – Register identifies with directory to use to resolve names, possibility of user-settable ()

Naming Structures
Naming network
• Component names - pathnames
  – Absolute pathnames - from a designated root
  – Relative pathnames - from a working directory
  – Each name carries how to resolve it.
• Short names to files anywhere in the network produce cycles, but convenience in naming things.

Full Naming Network*
• /Jamie/joe/project/D
• /Jamie/d
• /Jamie/joe/jam/proj1
• (relative from Terry)
• (relative from Jamie)
[Figure: naming network with nodes root, Terry, Jo, Jamie, grp1, TA, joe, project, jam, proj1, A, B, C, D, E, d; a stray label /C.]
* not Unix

Full Naming Network*
• /Jamie/joe/project/D
• /Jamie/d
• /Jamie/joe/jam/proj1
• (relative from Terry)
• (relative from Jamie)
[Figure: naming network with nodes root, Terry, Jo, Jamie, grp1, TA, joe, project, jam, proj1, A, B, C, D, E, d; a stray label /C.]
Why? * Unix

Meta-Data
• File size
• File type
• Protection - access control information
• History: creation time, last modification, last access.
• Location of file - which device
• Location of individual blocks of the file on disk.
• Owner of file
• Group(s) of users associated with file

Restricting to a Hierarchy
• Problems with full naming network
  – What does it mean to “delete” a file?
  – Meta-data interpretation
• Eliminating cycles
  – allows use of reference counts for reclaiming file space
  – avoids garbage collection

Operations on Directories (UNIX)
• link (oldpathname, newpathname) - create entry pointing to file
• unlink (filename) - remove entry pointing to file
• mknod (, type, device) - used (e.g. by mkdir utility function) to create a directory (or named pipe, or special file)
• getdents(fd, buf, structsize) - reads dir entries

A Typical Unix File Tree
Each volume is a set of directories and files; a host’s file tree is the set of directories and files visible to processes on a given host.
File trees are built by grafting volumes from different devices or from network servers.
In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.

    mount (coveredDir, volume)
      coveredDir: directory pathname
      volume: device
      volume root contents become visible at pathname coveredDir

[Figure: file tree rooted at / with bin, etc, tmp, usr, vmunix at the top level; lower levels show sh, project, users, packages, and the mount point coveredDir.]

Garbage Collection
[Figure: the full naming network (root, Terry, Jo, Jamie, grp1, TA, joe, project, jam, proj1, A, B, C, D, E, d) after a series of unlinks, with X marks near root, TA, and Jamie.]

Reclaiming Convenience
• Symbolic links - indirect files: filename maps, not to file object, but to another pathname
  – allows short aliases
  – slightly different semantics
• Search rules

[Figure, A Typical Unix File Tree: / with bin, etc, tmp, usr, vmunix; ls, sh, project, users; packages; mount point coveredDir (volume root); tex, emacs. Example path: /usr/project/packages/coveredDir/emacs]

Unix File Naming (Hard Links)
A Unix file may have multiple names.
Each directory entry naming the file is called a hard link.
Each inode contains a reference count showing how many hard links name it.

[Figure: directory A holds wind: 18, rain: 32, hail: 48; directory B holds sleet: 48. Both hail and sleet name inode 48, whose link count = 2.]

link system call
    link (existing name, new name)
      create a new name for an existing file
      increment inode link count

unlink system call (“remove”)
    unlink(name)
      destroy directory entry
      decrement inode link count
      if count = 0 and file is not in active use
        free blocks (recursively) and on-disk inode

Unix Symbolic (Soft) Links
• Unix files may also be named by symbolic (soft) links.
  – A soft link is a file containing a pathname of some other file.

symlink system call
    symlink (existing name, new name)
      allocate a new file (inode) with type symlink
      initialize file contents with existing name
      create directory entry for new file with new name

[Figure: directory A holds wind: 18, rain: 32, hail: 48; directory B holds sleet: 67. inode 48 has link count = 1; inode 67 is the symlink and contains ../A/hail/0.]

The target of the link may be removed at any time, leaving a dangling reference.
How should the kernel handle recursive soft links?
Convenience, but not performance!

Access Control for Files
• Access control lists - detailed list attached to file of users allowed (denied) access, including kind of access allowed/denied.
• UNIX RWX - owner, group, everyone

UNIX access control
• Each file carries its access control with it.
    rwx          rwx          rwx               setuid
    Owner UID    Group GID    Everybody else
  When the setuid bit is set, it allows a process executing the object to assume the UID of the owner temporarily - enter owner domain (rights amplification)
• Owner has , rights (granting, revoking)

The Access Model
• Authorization problems can be represented abstractly by use of an access model.
  – each row represents a subject/principal/domain
  – each column represents an object
  – each cell: accesses permitted for the {subject, object} pair
    • read, write, delete, execute, search, control, or any other method
• In real systems, the access matrix is sparse and dynamic.
  • need a flexible, efficient representation

Two Representations
• ACL - Access Control Lists
  – Columns of previous matrix
  – Permissions attached to Objects
  – ACL for file hotgossip: Terry, rw; Lynn, rw
• Capabilities
  – Rows of previous matrix
  – Permissions associated with Subject
  – Tickets, Namespace (what it is that one can name)
  – Capabilities held by Lynn: luvltr, rw; hotgossip, rw


Access Control Lists
• Approach: represent the access matrix by storing its columns with the objects.
• Tag each object with an access control list (ACL) of authorized subjects/principals.
• To authorize an access requested by S for O
  – search O’s ACL for an entry matching S
  – compare requested access with permitted access
  – access checks are often made only at bind time

Capabilities
• Approach: represent the access matrix by storing its rows with the subjects.
• Tag each subject with a list of capabilities for the objects it is permitted to access.
  – A capability is an unforgeable object reference, like a pointer.
  – It endows the holder with permission to operate on the object
    • e.g., permission to invoke specific methods
  – Typically, capabilities may be passed from one subject to another.
    • Rights propagation and confinement problems

Dynamics of Protection Schemes
• How to endow software modules with appropriate privilege?
  – What mechanism exists to bind principals with subjects?
    • e.g., setuid syscall, setuid bit
  – What principals should a software module bind to?
    • privilege of creator: but may not be sufficient to perform the service
    • privilege of owner or system: dangerous

Dynamics of Protection Schemes
• How to revoke privileges?
• What about adding new subjects or new objects?
• How to dynamically change the set of objects accessible (or vulnerable) to different processes run by the same user?
  – Need-to-know principle / Principle of minimal privilege
  – How do subjects change identity to execute a privileged module?
    • protection domain, protection domain switch (enter)

Protection Domains
• Processes execute in a protection domain, initially inherited from subject
• Goal: to be able to change protection domains
• Introduce a level of indirection
• Domains become protected objects with operations defined on them: owner, copy, control

[Figure: two versions of the access matrix. Rows (domains): TA, grp, Terry, Lynn, Domain0. Columns (objects): hotgossip, gradefile, solutions, proj1, luvltr, Domain0. Cells hold rights such as r, rw, rwo, rwx, rxc, rc, ctl, and enter.]

• If domain contains copy on right to some object, then it can transfer that right to the object to another domain.
• If domain is owner of some object, it can grant that right to the object, with or without copy, to another domain
• If domain is owner or has ctl right to a domain, it can remove right to object from that domain
• Rights propagation.

Distributed File Systems
• Naming
  – Location transparency/independence
• Caching
  – Consistency
• Replication
  – Availability and updates

Naming
• \\His\d\pictures\castle.jpg
  – Not location transparent - both machine and drive embedded in name.
• NFS mounting
  – Remote directory mounted over local directory in local naming hierarchy.
  – /usr/m_pt/A
  – No global view

[Figure: her local directory tree (usr, m_pt) on the client and his local dir tree (for_export, A, B) on the server, connected by the network. After the mount, her local tree shows A and B under usr/m_pt. A further panel shows his tree after a mount on B.]

Global Name Space
Example:
[Figure: / with afs, tmp, bin, lib. Labels: local files; shared files - looks identical to all clients.]

VFS: the Filesystem Switch
Sun Microsystems introduced the virtual file system framework in 1985 to accommodate the Network File System cleanly.
• VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules.

[Figure: user space; syscall layer (file, uio, etc.); Virtual File System (VFS); NFS, FFS, LFS, *FS, etc.; network protocol stack (TCP/IP) and device drivers.]

VFS was an internal kernel restructuring with no effect on the syscall interface.
Incorporates object-oriented concepts: a generic procedural interface with multiple implementations.
Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.

Vnodes
In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.
• Each vnode has a standard file attributes struct.
• Generic vnode points at filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.
• Active vnodes are reference-counted by the structures that hold pointers to them, e.g., the system open file table.
• Vnode operations are macros that vector to filesystem-specific procedures.
• Each specific file system maintains a hash of its resident vnodes.
[Figure: syscall layer above free vnodes and the active vnodes of NFS and UFS.]

Example: Network File System (NFS)
[Figure: on the client, user programs -> syscall layer -> VFS -> UFS or NFS client; the NFS client reaches the server over the network. On the server, syscall layer and NFS server -> VFS -> UFS.]

Vnode Operations and Attributes
vnode/file attributes (vattr or fattr)
  type (VREG, VDIR, VLNK, etc.)
  mode (9+ bits of permissions)
  nlink (hard link count)
  owner user ID
  owner group ID
  filesystem ID
  unique file ID
  file size (bytes and blocks)
  access time
  modify time
  generation number
generic operations
  vop_getattr (vattr)
  vop_setattr (vattr)
  vhold()
  vholdrele()
directories only
  vop_lookup (OUT vpp, name)
  vop_create (OUT vpp, name, vattr)
  vop_remove (vp, name)
  vop_link (vp, name)
  vop_rename (vp, name, tdvp, tvp, name)
  vop_mkdir (OUT vpp, name, vattr)
  vop_rmdir (vp, name)
  vop_readdir (uio, cookie)
  vop_symlink (OUT vpp, name, vattr, contents)
  vop_readlink (uio)
files only
  vop_getpages (page**, count, offset)
  vop_putpages (page**, count, , offset)
  vop_fsync ()

Pathname Traversal
• When a pathname is passed as an argument to a system call, the syscall layer must “convert it to a vnode”.
• Pathname traversal is a sequence of vop_lookup calls to descend the tree to the named file or directory.

    open("/tmp/zot")
      vp = get vnode for / (rootdir)
      vp->vop_lookup(&cvp, "tmp");
      vp = cvp;
      vp->vop_lookup(&cvp, "zot");

Issues:
1. crossing mount points
2. obtaining root vnode (or current dir)
3. finding resident vnodes in memory
4. caching name->vnode translations
5. symbolic (soft) links
6. disk implementation of directories
7. locking/referencing to handle races with name create and delete operations

Hints
• A valuable distributed systems design technique that can be illustrated in naming.
• Definition: information that is not guaranteed to be correct. If it is, it can improve performance. If not, things will still work OK. Must be able to validate information.
• Example: Sprite prefix tables

Prefix Tables
    /A/m_pt1           -> blue
    /A/m_pt1/usr/B     -> pink
    /A/m_pt1/usr/m_pt2 -> pink

[Figure: tree / -> A -> m_pt1 -> usr -> B and m_pt2; example pathname /A/m_pt1/usr/m_pt2/stuff.below]
