File Systems and NFS: Representing Files on Disk
Representing Files On Disk: Nachos

An OpenFile represents a file in active use, with a seek pointer and read/write primitives for arbitrary byte ranges:
    OpenFile(sector)
    Seek(offset)
    Read(char* data, bytes)
    Write(char* data, bytes)

A file header (FileHdr) describes an on-disk file as an ordered sequence of sectors with a length, mapped by a logical-to-physical block map:
    Allocate(..., filesize)
    length = FileLength()
    sector = ByteToSector(offset)

[Figure: the file "tale" ("once upon a time\nin a land far far away,\nlived the wise and sage wizard.") held in logical blocks 0, 1, and 2; the FileHdr maps each logical block to a physical sector.]

    OpenFile* ofd = filesys->Open("tale");
    ofd->Read(data, 10)    gives "once upon "
    ofd->Read(data, 10)    gives "a time\nin "

File Metadata

On disk, each file is represented by a FileHdr structure. The FileHdr object is an in-memory copy of this structure.

The FileHdr is a file system "bookkeeping" structure that supplements the file data itself. Structures of this kind are called filesystem metadata. A Nachos FileHdr occupies exactly one disk sector.

In general, a file header holds two things. The first is the file attributes, which may include owner, access control, and time of create/modify/access. The second is a logical-physical block map that works like a translation table.

To operate on the file (e.g., to open it), the FileHdr must be read into memory.

Representing Large Files

The Nachos FileHdr occupies exactly one disk sector, which limits the maximum file size. The sector size is 128 bytes. Of these, 120 bytes hold the block map (the other 8 hold the file's byte and sector counts), giving 30 entries. Each entry maps a 128-byte sector, so the maximum file size is 30 * 128 = 3840 bytes.

In Unix, the FileHdr (called an index node or inode) represents large files using a hierarchical block map. The map has 12 direct block map entries plus indirect and double indirect blocks.
• Each file system block is a clump of sectors (4KB, 8KB, 16KB).
• Inodes are 128 bytes, packed into blocks.
• Each inode has 68 bytes of attributes and 15 block map entries.
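To make the hierarchical block map concrete, here is a minimal C++ sketch. It finds where the pointer for a given byte offset lives. The sketch assumes the parameters of the 8KB-block example that follows: 12 direct entries and 4-byte block pointers, so one indirect block holds 2K pointers. It models only the direct, single indirect, and double indirect levels. The names `Locate`, `MapSlot`, and `Level` are hypothetical and are not taken from any real FFS source.

```cpp
#include <cstdint>

// Assumed parameters (8KB example): 12 direct entries in the inode and
// 4-byte block pointers, so one 8KB indirect block holds 2K pointers.
constexpr std::uint64_t BlockSize = 8 * 1024;
constexpr std::uint64_t NumDirect = 12;
constexpr std::uint64_t PtrsPerBlock = BlockSize / 4;  // 2048

enum class Level { Direct, Single, Double, Beyond };

// Where the pointer for one logical block lives in the block map.
struct MapSlot {
    Level level;
    std::uint64_t outer;  // index in the inode's direct array, the single
                          // indirect block, or the double indirect block
    std::uint64_t inner;  // index inside the indirect block chosen by outer
                          // (Double only)
};

constexpr MapSlot Locate(std::uint64_t byteOffset) {
    std::uint64_t lbn = byteOffset / BlockSize;  // logical block number
    if (lbn < NumDirect) return {Level::Direct, lbn, 0};
    lbn -= NumDirect;
    if (lbn < PtrsPerBlock) return {Level::Single, lbn, 0};
    lbn -= PtrsPerBlock;
    if (lbn < PtrsPerBlock * PtrsPerBlock)
        return {Level::Double, lbn / PtrsPerBlock, lbn % PtrsPerBlock};
    return {Level::Beyond, 0, 0};  // would need the triple indirect entry
}

// Bytes reachable through the direct, single, and double indirect levels.
constexpr std::uint64_t MaxBytesThroughDouble =
    (NumDirect + PtrsPerBlock + PtrsPerBlock * PtrsPerBlock) * BlockSize;

static_assert(NumDirect * BlockSize == 96 * 1024, "12 direct entries map 96KB");
static_assert(PtrsPerBlock * BlockSize == 16 * 1024 * 1024, "one indirect block maps 16MB");
```

Because `Locate` is `constexpr`, the slide's numbers can be checked at compile time. Byte offset 96KB is the first byte mapped through the single indirect block. Offset 96KB + 16MB is the first byte mapped through the double indirect block.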
Suppose the block size is 8KB. Block pointers are 4 bytes: the inode's 15 entries fill the 60 bytes left after its 68 bytes of attributes. An 8KB indirect block therefore holds 2K pointers.
• The 12 direct block map entries in the inode can map 96KB of data.
• One indirect block (referenced by the inode) can map 16MB of data (2K * 8KB).
• One double indirect block pointer in the inode maps 2K indirect blocks (2K * 16MB = 32GB).
The maximum file size is therefore 96KB + 16MB + (2K*16MB) + ...

The physical block pointers in the Nachos block map are sector IDs. Any changes to the attributes or block map must be written back to the disk to make them permanent:
    FileHdr* hdr = new FileHdr();
    hdr->FetchFrom(sector);
    hdr->WriteBack(sector);

Representing Small Files

Internal fragmentation in file system blocks can waste significant space for small files. For example, 1KB files waste 87% of disk space (and bandwidth) in a naive file system with an 8KB block size. Most files are small: one study [Irlam93] shows a median of 22KB.

FFS solution: optimize small files for space efficiency.
• Subdivide blocks into 2, 4, or 8 fragments (or just frags).
• Free block maps contain one bit for each fragment. To determine if a block is free, examine the bits for all its fragments.
• The last block of a small file is stored on fragment(s). If it uses multiple fragments, they must be contiguous.

Basics of Directories

A directory is a set of file names, supporting lookup by symbolic name. In Nachos, each directory is a file containing a set of mappings from name->FileHdr:
    Directory(entries)
    sector = Find(name)
    Add(name, sector)
    Remove(name)

Each directory entry is a fixed-size slot with space for a FileNameMaxLen-byte name. Entries (slots) are found by a linear scan. A directory entry may hold a pointer to another directory, forming a hierarchical name space.

[Figure: a directory file, located through its fileHdr, with slots wind: 18, snow: 62, rain: 32, hail: 48 and unused slots marked 0; the entry rain: 32 leads to the FileHdr in sector 32.]
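A Nachos-style directory can be sketched as a fixed-size table of slots searched linearly. This is a minimal C++ sketch modeled on the `Find`/`Add`/`Remove` interface above, not the Nachos source. Several details are assumptions: the value `FileNameMaxLen = 9`, the `DirectoryEntry` layout, and the in-memory `std::vector` table. Reading and writing the table through the directory's own file is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t FileNameMaxLen = 9;  // assumed maximum name length, in bytes

// One fixed-size slot in the directory table.
struct DirectoryEntry {
    bool inUse;
    int sector;                     // sector holding the named file's FileHdr
    char name[FileNameMaxLen + 1];  // +1 for the terminating '\0'
};

class Directory {
  public:
    explicit Directory(std::size_t entries) : table_(entries, DirectoryEntry{false, 0, {}}) {
        assert(entries > 0 && "a directory needs at least one slot");
    }

    // Linear scan for name; returns the FileHdr sector, or -1 if absent.
    int Find(const char* name) const {
        int i = FindIndex(name);
        return i < 0 ? -1 : table_[static_cast<std::size_t>(i)].sector;
    }

    // Claim the first free slot; fails if the name is too long, already
    // present, or the table is full.
    bool Add(const char* name, int sector) {
        std::size_t len = std::strlen(name);
        if (len > FileNameMaxLen || FindIndex(name) >= 0) return false;
        for (DirectoryEntry& e : table_) {
            if (!e.inUse) {
                e.inUse = true;
                e.sector = sector;
                std::memcpy(e.name, name, len + 1);  // copy including '\0'
                return true;
            }
        }
        return false;
    }

    // Free the slot holding name (the file's FileHdr and data are untouched).
    bool Remove(const char* name) {
        int i = FindIndex(name);
        if (i < 0) return false;
        table_[static_cast<std::size_t>(i)].inUse = false;
        return true;
    }

  private:
    // Index of the in-use slot whose name matches, or -1.
    int FindIndex(const char* name) const {
        for (std::size_t i = 0; i < table_.size(); ++i)
            if (table_[i].inUse && std::strcmp(table_[i].name, name) == 0)
                return static_cast<int>(i);
        return -1;
    }

    std::vector<DirectoryEntry> table_;
};
```

With four slots, adding wind, snow, rain, and hail fills the table. `Find("rain")` then scans to the third slot and returns 32, and removing an entry frees its slot for reuse.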
A Nachos Filesystem On Disk

An allocation bitmap file maintains the free/allocated state of each physical block; its FileHdr is always stored in sector 0. A directory maintains name->FileHdr mappings for all existing files; its FileHdr is always stored in sector 1.

[Figure: sector 0 holds the allocation bitmap file's FileHdr and sector 1 the directory file's FileHdr. Other sectors hold the bitmap data (rows of bits such as 11100010), the directory entries (wind: 18, snow: 62, rain: 32, hail: 48), and the data of the file "tale". Every box in the diagram represents a disk sector.]

A Typical Unix File Tree

Each volume is a set of directories and files. A host's file tree is the set of directories and files visible to processes on that host. File trees are built by grafting together volumes from different devices or from network servers.

In Unix, the graft operation is the privileged mount system call, and each volume is a filesystem.
    mount(coveredDir, volume)
        coveredDir: directory pathname
        volume: device specifier or network volume
    The volume root's contents become visible at pathname coveredDir.

[Figure: a file tree rooted at / (bin, etc, tmp, usr, vmunix, with ls, sh, project, users, packages, tex, and emacs below), in which a mount point grafts another volume's root into the tree.]

Filesystems

Each file volume (filesystem) has a type, determined by its disk layout or the network protocol used to access it: ufs (ffs), lfs, nfs, rfs, cdfs, etc. Filesystems are administered independently.

Modern systems also include "logical" pseudo-filesystems in the naming tree, accessible through the file syscalls:
• procfs: the /proc filesystem allows access to process internals.
• mfs: the memory file system is a memory-based scratch store.

Processes access filesystems through common system calls.

VFS: the Filesystem Switch

Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly. VFS allows diverse specific file systems to coexist in a file tree, isolating all FS dependencies in pluggable filesystem modules.
• VFS was an internal kernel restructuring with no effect on the syscall interface.
• It incorporates object-oriented concepts: a generic procedural interface with multiple implementations.
• It is based on abstract objects with dynamic method binding by type...in C.

Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.

[Figure: user space above the syscall layer (file, uio, etc.). Below it, the Virtual File System (VFS) dispatches to NFS, FFS, LFS, and other (*FS) modules, which rest on the network protocol stack (TCP/IP) or on device drivers.]

Vnodes

In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.
• Each vnode has a standard file attributes struct.
• The generic vnode points at a filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.
• Each specific file system maintains a cache of its resident vnodes.
• Vnode operations are macros that vector to filesystem-specific procedures.

[Figure: the syscall layer holds active vnodes, each backed by NFS or UFS; inactive vnodes sit on a free list.]

Vnode Operations and Attributes

Vnode attributes (vattr):
    type (VREG, VDIR, VLNK, etc.)
    mode (9+ bits of permissions)
    nlink (hard link count)
    owner user ID
    owner group ID
    filesystem ID
    unique file ID
    file size (bytes and blocks)
    access time
    modify time
    generation number

Directories only:
    vop_lookup(OUT vpp, name)
    vop_create(OUT vpp, name, vattr)
    vop_remove(vp, name)
    vop_link(vp, name)
    vop_rename(vp, name, tdvp, tvp, name)
    vop_mkdir(OUT vpp, name, vattr)
    vop_rmdir(vp, name)
    vop_symlink(OUT vpp, name, vattr, contents)
    vop_readdir(uio, cookie)
    vop_readlink(uio)

Files only:
    vop_getpages(page**, count, offset)
    vop_putpages(page**, count, sync, offset)
    vop_fsync()

Generic operations:
    vop_getattr(vattr)
    vop_setattr(vattr)
    vhold()
    vholdrele()

V/Inode Cache

[Figure: the VFS free list head links inactive vnodes; resident vnodes are found through a hash table indexed by HASH(fsid, fileid).]

Active vnodes are reference-counted by the structures that hold pointers to them.
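The vnode design above has three parts: an operations vector chosen by filesystem type, a pointer to filesystem-specific data, and a reference count. It can be sketched in C-style C++. This is a minimal illustration, not the BSD or SunOS declarations. The following names are hypothetical: `VnodeOps`, the toy `toyfs` filesystem, and the inline `VOP_LOOKUP`/`VOP_GETATTR` wrappers (which stand in for the kernel's macros). Only two operations are shown, and locking is ignored.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>

// Simplified stand-ins for the slide's structures (illustrative only).
enum VType { VREG, VDIR, VLNK };

struct Vattr {
    VType type;
    long size;    // file size in bytes
    long fileid;  // unique file ID within its filesystem
};

struct Vnode;

// Operations vector filled in by each filesystem type.
struct VnodeOps {
    int (*vop_lookup)(Vnode* dvp, Vnode** vpp, const std::string& name);
    int (*vop_getattr)(Vnode* vp, Vattr* vap);
};

struct Vnode {
    const VnodeOps* ops;  // selects the filesystem-specific procedures
    void* data;           // filesystem-specific node (inode, rnode, ...)
    int refcount;         // held by open files, current directories, mounts, ...
};

// Generic entry points vector through ops (inline functions instead of macros).
inline int VOP_LOOKUP(Vnode* dvp, Vnode** vpp, const std::string& name) {
    return dvp->ops->vop_lookup(dvp, vpp, name);
}
inline int VOP_GETATTR(Vnode* vp, Vattr* vap) { return vp->ops->vop_getattr(vp, vap); }

inline void vref(Vnode* vp) { ++vp->refcount; }
inline void vrele(Vnode* vp) {
    assert(vp->refcount > 0);
    --vp->refcount;
}

// A toy filesystem implementing the two operations.
struct ToyNode {
    Vattr attr;
    std::map<std::string, Vnode*> entries;  // children, for directories only
};

int toyfs_lookup(Vnode* dvp, Vnode** vpp, const std::string& name) {
    const ToyNode* dir = static_cast<const ToyNode*>(dvp->data);
    if (dir->attr.type != VDIR) return ENOTDIR;
    auto it = dir->entries.find(name);
    if (it == dir->entries.end()) return ENOENT;
    vref(it->second);  // in this sketch the caller receives a referenced vnode
    *vpp = it->second;
    return 0;
}

int toyfs_getattr(Vnode* vp, Vattr* vap) {
    *vap = static_cast<const ToyNode*>(vp->data)->attr;
    return 0;
}

const VnodeOps toyfs_vnodeops = {toyfs_lookup, toyfs_getattr};
```

A second filesystem type would supply its own `VnodeOps` table. Generic code that calls `VOP_LOOKUP` never changes, which is the point of the filesystem switch.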
These structures include:
- system open file table
- process current directory
- file system mount points
- etc.

Each specific file system maintains its own hash of vnodes (BSD):
- the specific FS handles initialization
- the free list is maintained by VFS

    vget(vp): reclaim a cached inactive vnode from the VFS free list
    vref(vp): increment the reference count on an active vnode
    vrele(vp): release a reference count on a vnode
    vgone(vp): the vnode is no longer valid (the file is removed)

Network File System (NFS)

[Figure: client and server hosts, each with a syscall layer and VFS. On the client, user programs call through the syscall layer and VFS into the NFS client, which talks across the network to the NFS server, which reaches files through VFS and UFS.]

NFS Protocol

NFS is a network protocol layered above TCP/IP.
• Original implementations (and most today) use UDP datagram transport for low overhead. The maximum IP datagram size was increased to match the FS block size, to allow send/receive of entire file blocks. Some newer implementations use TCP as a transport.
• The NFS protocol is a set of message formats and types. The client issues a request message for a service operation. The server performs the requested operation and returns a reply message with status and (perhaps) the requested data.

NFS Vnodes

The NFS protocol has an operation type for (almost) every vnode operation, with similar arguments/results.
    struct nfsnode* np = VTONFS(vp);
The nfsnode holds information needed to interact with the server to operate on the file.

[Figure: on the client, the syscall layer and VFS call nfs_vnodeops. These use the NFS client stubs and the nfsnode to send RPC/UDP messages over the network to the NFS server, which accesses UFS.]

File Handles

Question: how does the client tell the server which file or directory an operation applies to?
• Similarly, how does the server return the result of a lookup?
More generally, how can a pointer or object reference be passed as an argument or result of an RPC call?

In NFS, the reference is a file handle or fhandle, a 32-byte token/ticket whose value is determined by the server.
• It includes all information needed to identify the file/object on the server and get a pointer to it quickly:
    volume ID | inode # | generation #

Pathname Traversal

When a pathname is passed as an argument to a system call, the syscall layer must "convert it to a vnode". Pathname traversal is a sequence of vop_lookup calls that descend the tree to the named file or directory:
    open("/tmp/zot")
    vp = get vnode for / (rootdir)
    vp->vop_lookup(&cvp, "tmp");
    vp = cvp;
    vp->vop_lookup(&cvp, "zot");

Issues:
1. crossing mount points
2. obtaining the root vnode (or current dir)
3. finding resident vnodes in memory
4. caching name->vnode translations
5. symbolic (soft) links
6. disk implementation of directories
7. locking/referencing to handle races with name create and delete operations

From Servers to Services

Are Web servers and RPC servers scalable? Available? A single server process can only use one machine.

NFS: From Concept to Implementation

Now that we understand the basics, how do we make it work
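The pathname traversal sequence described earlier can be sketched as follows: one lookup per component, starting from the root or the current directory. This minimal C++ sketch also handles the first two issues in that list (mount points and the starting vnode) through a hypothetical `mountedHere` field. It ignores `..`, symbolic links, the name cache, and locking. Its `Vnode` type and `Traverse` function are illustrative stand-ins, not kernel structures.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical in-memory vnode with just enough state for traversal.
struct Vnode {
    bool isDir = true;
    std::map<std::string, Vnode*> entries;  // name -> child, directories only
    Vnode* mountedHere = nullptr;           // root of a volume mounted on this directory

    // Stand-in for vop_lookup: resolve one name in this directory, or nullptr.
    Vnode* Lookup(const std::string& name) const {
        if (!isDir) return nullptr;
        auto it = entries.find(name);
        return it == entries.end() ? nullptr : it->second;
    }
};

// Issue 1, crossing mount points: continue in the mounted volume's root.
Vnode* CrossMounts(Vnode* vp) {
    while (vp != nullptr && vp->mountedHere != nullptr) vp = vp->mountedHere;
    return vp;
}

// Issue 2: start at the root for absolute paths, else at the current
// directory; then descend with one lookup per pathname component.
Vnode* Traverse(const std::string& path, Vnode* rootdir, Vnode* cwd) {
    assert(rootdir != nullptr && cwd != nullptr);
    Vnode* vp = CrossMounts(!path.empty() && path[0] == '/' ? rootdir : cwd);
    std::size_t pos = 0;
    while (vp != nullptr && pos < path.size()) {
        std::size_t slash = path.find('/', pos);
        if (slash == std::string::npos) slash = path.size();
        std::string component = path.substr(pos, slash - pos);
        pos = slash + 1;
        if (component.empty()) continue;  // leading or repeated '/'
        vp = CrossMounts(vp->Lookup(component));
    }
    return vp;  // nullptr if some component was missing or not a directory
}
```

For example, suppose a volume is mounted on /usr. `Traverse("/usr/tex", root, root)` looks up usr in the root directory, crosses into the mounted volume's root, and then looks up tex there.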