USENIX Association

Proceedings of the BSDCon 2002 Conference

San Francisco, California, USA February 11-14, 2002

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2002 by The USENIX Association. All Rights Reserved. For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: [email protected] WWW: http://www.usenix.org Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

Rethinking /dev and devices in the UNIX kernel

Poul-Henning Kamp
The FreeBSD Project

Abstract

An outstanding novelty in UNIX at its introduction was the notion of ‘‘a file is a file is a file and even a device is a file.’’ Going from ‘‘hardware only changes when the DEC Field engineer is here’’ to ‘‘my toaster has USB’’ has put serious strain on the rather crude implementation of the ‘‘devices as files’’ concept, an implementation which has survived practically unchanged for 30 years in most UNIX variants. Starting from a high-level view of devices and the semantics that have grown around them over the years, this paper takes the audience on a grand tour of the redesigned FreeBSD device-I/O system, to convey an overview of how it all fits together, and to explain why things ended up as they did, how to use the new features and in particular how not to.

1. Introduction

There are really only two fundamental ways to conceptualise I/O devices in an operating system: the usual way and the UNIX way.

The usual way is to treat I/O devices as their own class of things, possibly several classes of things, and provide APIs tailored to the semantics of the devices. In practice this means that a program must know what it is dealing with: it has to interact with disks one way, tapes another and rodents yet a third way, all of which are different from how it interacts with a plain disk file.

The UNIX way has never been described better than in the very first paper published on UNIX by Ritchie and Thompson [Ritchie74]:

    Special files constitute the most unusual feature of the UNIX filesystem. Each supported I/O device is associated with at least one such file. Special files are read and written just like ordinary disk files, but requests to read or write result in activation of the associated device. An entry for each special file resides in directory /dev, although a link may be made to one of these files just as it may to an ordinary file. Thus, for example, to write on a magnetic tape one may write on the file /dev/mt.

    Special files exist for each communication line, each disk, each tape drive, and for physical main memory. Of course, the active disks and the memory special files are protected from indiscriminate access.

    There is a threefold advantage in treating I/O devices this way: file and device I/O are as similar as possible; file and device names have the same syntax and meaning, so that a program expecting a file name as a parameter can be passed a device name; finally, special files are subject to the same protection mechanism as regular files.

At the time, this was quite a strange concept; it was totally accepted for instance, that neither the system administrator nor the users were able to interact with a disk as a disk. Operating systems simply did not provide access to disk other than as a filesystem. Most vendors did not even release a program to initialise a disk-pack with a filesystem: selling pre-initialised and ‘‘quality tested’’ disk-packs was quite a profitable business.

In many cases some kind of API for reading and writing individual sectors on a disk pack did exist in the operating system, but more often than not it was not listed in the public documentation.

1.1. The traditional implementation

The initial implementation used hardcoded inode numbers [Ritchie98]. The console device would be inode number 5, the paper-tape-punch number 6 and so on, even if those inodes were also actual regular files in the filesystem.

For reasons one can only too vividly imagine, this was changed and Thompson [Thompson78] describes how the implementation now used ‘‘major and minor’’ device numbers to index through the devsw array to the correct device driver.

For all intents and purposes, this is the implementation which survives in most UNIX-like systems even to this day. Apart from the access control and timestamp information which is found in all inodes, the special inodes in the filesystem contain only one piece of information: the major and minor device numbers, often logically OR’ed to one field.

When a program opens a special file, the kernel uses the major number to find the entry points in the device driver, and passes the combined major and minor numbers as a parameter to the device driver.

2. The challenge

Now, we did not talk much about where the special inodes came from to begin with. They were created by hand, using the mknod(2) system call, usually through the mknod(8) program.

In those days a computer had a very static hardware configuration [1] and it certainly did not change while the system was up and running, so creating device nodes by hand was certainly an acceptable solution.

    [1] Unless your assigned field engineer was present on site.

The first sign that this would not hold up as a solution came with the advent of TCP/IP and the telnet(1) program, or more precisely with the telnetd(8) daemon. In order to support remote login a ‘‘pseudo-tty’’ device driver was implemented, basically as a tty driver which instead of hardware had another device which would allow a process to ‘‘act as hardware’’ for the tty. The telnetd(8) daemon would read and write data on the ‘‘master’’ side of the pseudo-tty and the user would be running on the ‘‘slave’’ side, which would act just like any other tty: you could change the erase character if you wanted to and all the signals and all that stuff worked.

Obviously with a device requiring no hardware, you can compile as many instances into the kernel as you like, as long as you do not use too much memory. As system after system was connected to the ARPANet, ‘‘increasing number of ptys’’ became a regular task for system administrators, and part of this task was to create more special nodes in the filesystem.

Several UNIX vendors also noticed an issue when they sold minicomputers in many different configurations: explaining to system administrators just which special nodes they would need and how to create them was a significant documentation hassle. Some opted for the simple solution and pre-populated /dev with every conceivable device node, resulting in a predictable slowdown on access to filenames in /dev.

System V UNIX provided a band-aid solution: a special boot sequence would take effect if the kernel or the hardware had changed since last reboot. This boot procedure would amongst other things create the necessary special files in the filesystem, based on an intricate system of per device driver configuration files.

In the recent years, we have become used to hardware which changes configuration at any time: people plug USB, Firewire and PCCard devices into their computers. These devices can be anything from modems and disks to GPS receivers and fingerprint authentication hardware. Suddenly maintaining the correct set of special devices in ‘‘/dev’’ became a major headache.

Along the way, UNIX kernels had learned to deal with multiple filesystem types [Heidemann91a] and a ‘‘device-pseudo-filesystem’’ was a pretty obvious idea. The device drivers have a pretty good idea which devices they have found in the configuration, so all that is needed is to present this information as a filesystem filled with just the right special files. Experience has shown that this, like most other ‘‘pseudo filesystems’’, sounds a lot simpler in theory than in practice.

3. Truly understanding devices

Before we continue, we need to fully understand the ‘‘device special file’’ in UNIX.

First we need to realize that a special file has the nature of a pointer from the filesystem into a different namespace; a little understood fact with far reaching consequences.

One implication of this is that several special files can exist in the filename namespace all pointing to the same device but each having their own access and timestamp attributes:

    guest# ls -l /dev/fd0 /tmp/fd0
    crw-r-----  1 root  operator  9, 0 Sep 27 19:21 /dev/fd0
    crw-rw-rw-  1 root  wheel     9, 0 Sep 27 19:24 /tmp/fd0

Obviously, the administrator needs to be on top of this: one popular way to exploit an unguarded root prompt is to create a replica of the special file /dev/kmem in a location where it will not be noticed. Since /dev/kmem gives access to the kernel memory, gaining any particular privilege can be arranged by suitably modifying the kernel’s data structures through the illicit special file.

When NFS appeared it opened a new avenue for this attack: people may have root privilege on one machine but not another. Since device nodes are not interpreted on the NFS server but rather on the local computer, a user with root privilege on a NFS client computer can create a device node to his liking on a filesystem mounted from an NFS server. This device node can in
turn be used to circumvent the security of other computers which mount that filesystem, including the server, unless they protect themselves by not trusting any device entries on untrusted filesystems by mounting such filesystems with the nodev mount-option.

The fact that the device itself does not actually exist inside the filesystem which holds the special file makes it possible to perform boot-strapping stunts in the spirit of Baron Von Münchausen [raspe1785], where a filesystem is (re)mounted using one of its own device vnodes:

    guest# mount -o ro /dev/fd0 /mnt
    guest# fsck /mnt/dev/fd0
    guest# mount -u -o rw /mnt/dev/fd0 /mnt

Other interesting details are chroot(2) and jail(2) [Kamp2000] which provide filesystem isolation for process-trees. Whereas chroot(2) was not implemented as a security tool [Mckusick1999] (although it has been widely used as such), the jail(2) security facility in FreeBSD provides a pretty convincing ‘‘virtual machine’’ where even the root privilege is isolated and restricted to the designated area of the machine. Obviously chroot(2) and jail(2) may require access to a well-defined subset of devices like /dev/null, /dev/zero and /dev/tty, whereas access to other devices such as /dev/kmem or any disks could be used to compromise the integrity of the jail(2) confinement.

For a long time FreeBSD, like almost all UNIX-like systems, had two kinds of devices, ‘‘block’’ and ‘‘character’’ special files, the difference being that ‘‘block’’ devices would provide caching and alignment for disk device access. This was one of those minor architectural mistakes which took forever to correct.

The argument that block devices were a mistake is really very very simple: many devices other than disks have multiple modes of access which you select by choosing which special file to use.

Pick any old timer and he will be able to recite painful sagas about the crucial difference between the /dev/rmt and /dev/nrmt devices for tape access. [2]

Tapes, asynchronous ports, line printer ports and many other devices have implemented submodes, selectable by the user at a special filename level, but that has not earned them their own special file types. Only disks [3] have enjoyed the privilege of getting an entire file type dedicated to a minor device mode.

Caching and alignment modes should have been enabled by setting some bit in the minor device number on the disk special file, not by polluting the filesystem code with another file type.

In FreeBSD block devices were not even implemented in a fashion which would be of any use, since any write errors would never be reported to the writing process. For this reason, and since no applications were found to be in existence which relied on block devices and since historical usage was indeed historical [Mckusick2000], block devices were removed from the FreeBSD system. This greatly simplified the task of keeping track of open(2) reference counts for disks and removed much magic special-case code throughout.

4. Files, sockets, pipes, SVID IPC and devices

It is an instructive lesson in inconsistency to look at the various types of ‘‘things’’ a process can access in UNIX-like systems today.

First there are normal files, which are our reference yardstick here: they are accessed with open(2), read(2), write(2), (2), close(2) and various other auxiliary system calls.

Sockets and pipes are also accessed via file handles but each has its own namespace. That means you cannot open(2) a socket, [4] but you can read(2) and write(2) to it. Sockets and pipes vector off at the file descriptor level and do not get in touch with the vnode based part of the kernel at all.

Devices land somewhere in the middle between pipes and sockets on one side and normal files on the other. They use the filesystem namespace, are implemented with vnodes, and can be operated on like normal files, but don’t actually live in the filesystem.

Devices are in fact special-cased all the way through the vnode system. For one thing devices break the ‘‘one file-one vnode’’ rule, making it necessary to chain all vnodes for the same device together in order to be able to find ‘‘the canonical vnode for this device node’’, but more importantly, many operations have to be specifically denied on special file vnodes since they do not make any sense.

For true inconsistency, consider the SVID IPC mechanisms - not only do they not operate via file handles, but they also sport a singularly ill-conceived 32 bit numeric namespace and a dedicated set of system calls for access.

    [2] Make absolutely sure you know the difference before you take important data on a multi-file 9-track tape to remote locations.

    [3] Well, OK: and some 9-track tapes.

    [4] This is particularly bizarre in the case of UNIX domain sockets which use the filesystem as their namespace and appear in directory listings.

Several people have convincingly argued that this is an inconsistent mess, and have proposed and implemented more consistent operating systems like the Plan9 from Bell Labs [Pike90a] [Pike92a]. Unfortunately reality is that people are not interested in learning a new operating system when the one they have is pretty darn good, and consequently research into better and more consistent ways is a pretty frustrating [Pike2000] but by no means irrelevant topic.

5. Solving the /dev maintenance problem

There are a number of obvious, simple but wrong ways one could go about solving the ‘‘/dev’’ maintenance problem.

The very straightforward way is to hack the namei() kernel function responsible for filename translation and lookup. It is only a minor matter of programming to add code to special-case any lookup which ends up in ‘‘/dev’’. But this leads to problems: in the case of chroot(2) or jail(2), the administrator will want to present only a subset of the available devices in ‘‘/dev’’, so some kind of state will have to be kept per chroot(2)/jail(2) about which devices are visible and which devices are hidden, but no obvious location for this information is available in the absence of a mount data structure.

It also leads to some unpleasant issues because of the fact that ‘‘/dev/foo’’ is a synthesised directory entry which may or may not actually be present on the filesystem which seems to provide ‘‘/dev’’. The vnodes either have to belong to a filesystem or they must be special-cased throughout the vnode layer of the kernel.

Finally there is the simple matter of generality: hardcoding the string "/dev" in the kernel is not very general.

A cruder solution is to leave it to a daemon: make a special device driver, have a daemon read messages from it and create and destroy nodes in ‘‘/dev’’ in response to these messages.

The main drawback to this idea is that now we have added IPC to the mix, introducing new and interesting race conditions.

Otherwise this solution is surprisingly effective, but the chroot(2)/jail(2) requirements prevent a simple implementation and running a daemon per jail would become an administrative nightmare.

Another pitfall of this approach is that we are not able to remount the root filesystem read-write at boot until we have a device node for the root device, but if this node is missing we cannot create it with a daemon since the root filesystem (and hence /dev) is read-only. Adding a read-write memory-filesystem mount on /dev to solve this problem does not improve the architectural qualities further and certainly the KISS principle has been violated by now.

The final and in the end only satisfactory solution is to write a ‘‘DEVFS’’ which mounts on ‘‘/dev’’.

The good news is that it does solve the problem with chroot(2) and jail(2): just mount a DEVFS instance on the ‘‘dev’’ directory inside the filesystem subtree where the chroot or jail lives. Having a mountpoint gives us a convenient place to keep track of the local state of this DEVFS mount.

The bad news is that it takes a lot of cleanup and care to implement a DEVFS into a UNIX kernel.

6. DEVFS architectural decisions

Before implementing a DEVFS, it is necessary to decide on a range of corner cases in behaviour, and some of these choices have proved surprisingly hard to settle for the FreeBSD project.

6.1. The ‘‘persistence’’ issue

When DEVFS in FreeBSD was initially presented at a BoF at the 1995 USENIX Technical Conference in New Orleans, a group of people demanded that it provide ‘‘persistence’’ for administrative changes.

When trying to get a definition of ‘‘persistence’’, people can generally agree that if the administrator changes the access control bits of a device node, they want that mode to survive across reboots.

Once more tricky examples of the sort of manipulations one can do on special files are proposed, people rapidly disagree about what should be supported and what should not.

For instance, imagine a system with one floppy drive which appears in DEVFS as ‘‘/dev/fd0’’. Now the administrator, in order to get some badly written software to run, links this to ‘‘/dev/fd1’’:

    ln /dev/fd0 /dev/fd1

This works as expected and with persistence in DEVFS, the link is still there after a reboot. But what if after a reboot another floppy drive has been connected to the system? This drive would naturally have the name ‘‘/dev/fd1’’, but this name is now occupied by the administrator’s hard link. Should the link be broken? Should the new floppy drive be called ‘‘/dev/fd2’’? Nobody can agree on anything but the ugliness of the situation.

Given that we are no longer dependent on DEC Field engineers to change all four wheels to see which one is flat, the basic assumption that the machine has a constant hardware configuration is simply no longer true.

The new assumption one should start from when analysing this issue is that when the system boots, we cannot know what devices we will find, and we cannot know if the devices we do find are the same ones we had when the system was last shut down.

And in fact, this is very much the case with laptops today: if I attach my IOmega Zip drive to my laptop it appears like a SCSI disk named ‘‘/dev/da0’’, but so does the RAID-5 array attached to the PCI SCSI controller installed in my laptop’s docking station. If I change mode to ‘‘a+rw’’ on the Zip drive, do I want that mode to apply to the RAID-5 as well? Unlikely.

And what if we have persistent information about the mode of device ‘‘/dev/sio0’’, but we boot and do not find any sio devices? Do we keep the information in our device-persistence registry? How long do we keep it? If I borrow a modem card, set the permissions to some non-standard value like 0666, and then attach some other serial device a year from now - do I want some old permissions changes to come back and haunt me, just because they both happened to be ‘‘/dev/sio0’’? Unlikely.

The fact that more people have laptop computers today than five years ago, and the fact that nobody has been able to credibly propose where a persistent DEVFS would actually store the information about these things in the first place has settled the issue.

Persistence may be the right answer, but to the wrong question: persistence is not a desirable property for a DEVFS when the hardware configuration may change literally at any time.

6.2. Who decides on the names?

In a DEVFS-enabled system, the responsibility for creating nodes in /dev shifts to the device drivers, and consequently the device drivers get to choose the names of the device files. In addition, initial values for owner, group and mode bits are provided by the device driver.

But should it be possible to rename ‘‘/dev/lpt0’’ to ‘‘/dev/myprinter’’? While the obvious affirmative answer is easy to arrive at, it leaves a lot to be desired once the implications are unmasked.

Most device drivers know their own name and use it purposefully in their debug and log messages to identify themselves. Furthermore, the ‘‘NewBus’’ [NewBus] infrastructure facility, which ties hardware to device drivers, identifies things by name and unit numbers.

A very common way to report errors in fact:

    #define LPT_NAME "lpt" /* our official name */
    [...]
    printf(LPT_NAME ": cannot alloc ppbus (%d)!", error);

So despite the user renaming the device node pointing to the printer to ‘‘myprinter’’, this has absolutely no effect in the kernel and can be considered a userland aliasing operation.

The decision was therefore made that it should not be possible to rename device nodes since it would only lead to confusion and because the desired effect could be attained by giving the user the ability to create symlinks in DEVFS.

6.3. On-demand device creation

Pseudo-devices like pty, tun and bpf, but also some real devices, may not pre-emptively create entries for all possible device nodes. It would be a pointless waste of resources to always create 1000 ptys just in case they are needed, and in the worst case more than 1800 device nodes would be needed per physical disk to represent all possible slices and partitions.

For pseudo-devices the task at hand is to make a magic device node, ‘‘/dev/pty’’, which when opened will magically transmogrify into the first available pty subdevice, maybe ‘‘/dev/pty123’’.

Device submodes, on the other hand, work by having multiple entries in /dev, each with a different minor number, as a way to instruct the device driver in aspects of its operation. The most widespread example is probably ‘‘/dev/mt0’’ and ‘‘/dev/nmt0’’, where the node with the extra ‘‘n’’ instructs the tape device driver to not rewind on close. [5]

    [5] This is the answer to the question in footnote number 2.

Some UNIX systems have solved the problem for pseudo-devices by creating magic cloning devices like ‘‘/dev/tcp’’. When a cloning device is opened, it finds a free instance and through vnode and file descriptor mangling returns this new device to the opening process.

This scheme has two disadvantages: the complexity of switching vnodes in midstream is non-trivial, but even worse is the fact that it does not work for submodes for a device because it only reacts to one particular /dev entry.

The solution for both needs is a more flexible on-demand device creation, implemented in FreeBSD as a two-level lookup. When a filename is looked up in DEVFS, a match in the existing device nodes is sought first and if found, returned. If no match is found, device drivers are polled in turn to ask if they would be able to synthesise a device node of the given name.

The device driver gets a chance to modify the name and create a device with make_dev(). If one of the drivers succeeds in this, the lookup is started over and the newly found device node is returned:

    pty_clone()
        if (name != "pty")
            return(NULL);   /* no luck */
        n = find_next_unit();
        dev = make_dev(...,n,"pty%d",n);
        name = dev->name;
        return(dev);

An interesting mixed use of this mechanism is with the sound device drivers. Modern sound devices have multiple channels, presumably to allow the user to listen to CNN, Napstered MP3 files and Quake sound effects at the same time. The only problem is that all applications attempt to open ‘‘/dev/dsp’’ since they have no concept of multiple sound devices. The sound device drivers use the cloning facility to direct ‘‘/dev/dsp’’ to the first available sound channel completely transparently to the process.

There are very few drawbacks to this mechanism, the major one being that ‘‘ls /dev’’ now errs on the sparse side instead of the rich when used as a system device inventory, a practice which has always been of dubious precision at best.

6.4. Deleting and recreating devices

Deleting device nodes is no problem to implement, but as likely as not, some people will want a method to get them back. Since only the device driver knows how to create a given device, recreation cannot be performed solely on the basis of the parameters provided by a process in userland.

In order to not complicate the code which updates the directory structure for a mountpoint to reflect changes in the DEVFS inode list, a deleted entry is merely marked with DE_WHITEOUT instead of being removed entirely. Otherwise a separate list would be needed for inodes which we had deleted so that they would not be mistaken for new inodes.

The obvious way to recreate deleted devices is to let mknod(2) do it by matching the name and disregarding the major/minor arguments. Recreating the device with mknod(2) will simply remove the DE_WHITEOUT flag.

6.5. Jail(2), chroot(2) and DEVFS

The primary requirement from facilities like jail(2) and chroot(2) is that it must be possible to control the contents of a DEVFS mount point.

Obviously, it would not be desirable for dynamic devices to pop into existence in the carefully pruned /dev of jails so it must be possible to mark a DEVFS mountpoint as ‘‘no new devices’’. And in the same way, the jailed root should not be able to recreate device nodes which the real root has removed.

These behaviours will be controlled with mount options, but these have not yet been implemented because FreeBSD has run out of bitmap flags for mount options, and a new unlimited mount option implementation is still not in place at the time of writing.

One mount option, ‘‘jaildevfs’’, will restrict the contents of the DEVFS mountpoint to the ‘‘normal set’’ of devices for a jail and automatically hide all future devices and make it impossible for a jailed root to un-hide hidden entries while letting an un-jailed root do so.

Mounting or remounting read-only will prevent all future devices from appearing and will make it impossible to hide or un-hide entries in the mountpoint. This is probably only useful for chroots or jails where no tty access is intended since cloning will not work either.

More mount options may be needed as more experience is gained.

6.6. Default mode, owner & group

When a device driver creates a device node, and a DEVFS mount adds it to its directory tree, it needs to have some values for the access control fields: mode, owner and group.

Currently, the device driver specifies the initial values in the make_dev() call, but this is far from optimal. For one thing, embedding magic UIDs and GIDs in the kernel is simply bad style unless they are numerically zero. More seriously, they represent compile-time defaults which in these enlightened days is rather old-fashioned.

7. Cleaning up before we build: struct specinfo and dev_t

Most of the rest of the paper will be about the various challenges and issues in the implementation of DEVFS in FreeBSD. All of this should be applicable to other systems derived from 4.4BSD-Lite as well.

POSIX has defined a type called ‘‘dev_t’’ which is the identity of a device. This is mainly for use in the few system calls which know about devices: stat(2), fstat(2) and mknod(2). A dev_t is constructed by logically OR’ing the major# and minor# for the device. Since those have been defined as having no overlapping bits, the major# and minor# can be retrieved from the dev_t by a simple masking operation.

Although the kernel had a well-defined concept of any particular device it did not have a data structure to represent "a device". The device driver has such a structure, traditionally called ‘‘softc’’, but the high kernel does not (and should not!) have access to the device driver’s private data structures.

It is an interesting tale how things got to be this way, [6] but for now just record for a fact how the actual relationship between the data structures was in the 4.4BSD release (Fig. 1). [44BSDBook]

    [6] Basically, devices should have been moved up with sockets and pipes at the file descriptor level when the VFS layering was introduced, rather than have all the special casing throughout the vnode system.

As for all other files, a vnode references a filesystem inode, but in addition it points to a ‘‘specinfo’’ structure. In the inode we find the dev_t which is used to reference the device driver.

Access to the device driver happens by extracting the major# from the dev_t, indexing through the global devsw[] array to locate the device driver’s entry point.

The device driver will extract the minor# from the dev_t and use that as the index into the softc array of private data per device.

The ‘‘specinfo’’ structure is a little sidekick vnodes grew underway, and is used to find all vnodes which reference the same device (i.e. they have the same major# and minor#). This linkage is used to determine which vnode is the ‘‘chosen one’’ for this device, and to keep track of open(2)/close(2) against this device. The actual implementation was an inefficient hash implementation, which depending on the vnode reclamation rate and /dev directory lookup traffic, may become a measurable performance liability.

[Fig. 1 - Data structures in 4.4BSD. Diagram labels: file handle, vnode, specinfo, inode, devsw[] [major#], device driver, softc[] [minor#].]

7.1. The new vnode/inode/dev_t layout

In the new layout (Fig. 2) the specinfo structure takes a central role. There is only one instance of struct specinfo per device (i.e. unique major# and minor# combination) and all vnodes referencing this device point to this structure directly.

In userland, a dev_t is still the logical OR of the major# and minor#, but this entity is now called a udev_t in the kernel. In the kernel a dev_t is now a pointer to a struct specinfo.

All vnodes referencing a device are linked to a list hanging directly off the specinfo structure, removing the need for the hash table and consequently simplifying and speeding up a lot of code dealing with vnode instantiation, retirement and name-caching.

The entry points to the device driver are stored in the specinfo structure, removing the need for the devsw[] array and allowing device drivers to use separate entry-points for various minor numbers.

This is very convenient for devices which have a ‘‘control’’ device for management and tuning. The control device almost always has entirely separate open/close/ioctl implementations [MD.C].

In addition to this, two data elements are included in the specinfo structure but ‘‘owned’’ by the device driver. Typically the device driver will store a pointer to the softc structure in one of these, and unit number or mode information in the other.

This removes the need for drivers to find the softc using array indexing based on the minor#, and at the same time has obviated the need for the compiled-in ‘‘NFOO’’ constants which traditionally determined how many softc structures and therefore devices the driver could support. [7]

    [7] Not to mention all the drivers which implemented panic(2) because they forgot to perform bounds checking on the index before using it on their softc arrays.

[Fig. 2 - The new FreeBSD data structures. Diagram labels: file handle, vnode, specinfo, inode, device driver.]

There are some trivial technical issues relating to allocating the storage for specinfo early in the boot sequence and how to find a specinfo from the udev_t/major#+minor#, but they will not be discussed here.

7.2. Creating and destroying devices

Ideally, devices should only be created and destroyed by the device drivers which know what devices are present. This is accomplished with the make_dev() and destroy_dev() function calls.

Life is seldom quite that simple. The operating system might be called on to act as a NFS server for a diskless workstation, possibly even of a different architecture, so we still need to be able to represent device nodes with no device driver backing in the filesystems and consequently we need to be able to create a specinfo from the major#+minor# in these inodes when we encounter them. In practice this is quite trivial, but in a few places in the code one needs to be aware of the existence of both ‘‘named’’ and ‘‘anonymous’’ specinfo structures.

The make_dev() call creates a specinfo structure and populates it with driver entry points, major#, minor#, device node name (for instance ‘‘lpt0’’), UID, GID and access mode bits. The return value is a dev_t (i.e., a pointer to struct specinfo). If the device driver determines that the device is no longer present, it calls destroy_dev(), giving a dev_t as argument and the dev_t will be cleaned and converted to an anonymous dev_t.

Once created with make_dev() a named dev_t exists until destroy_dev() is called by the driver. The driver can rely on this and keep state in the fields in dev_t which are reserved for driver use.

8. DEVFS

By now we have all the relevant information about each device node collected in struct specinfo but we still have one problem to solve before we can add the DEVFS filesystem on top of it.

8.1. The interrupt problem

Some device drivers, notably the CAM/SCSI subsystem in FreeBSD, will discover changes in the device configuration inside an interrupt routine.

This imposes some limitations on what can and should be done: first one should minimise the amount of work done in an interrupt routine for performance reasons; second, to avoid deadlocks, vnodes and mountpoints should not be accessed from an interrupt routine.

Also, in addition to the locking issue, a machine can have many instances of DEVFS mounted: for a jail(8) based virtual-machine web-server several hundred instances is not unheard of, making it far too expensive to update all of them in an interrupt routine.

The solution to this problem is to do all the filesystem work on the filesystem side of DEVFS and use atomically manipulated integer indices (‘‘inode numbers’’) as the barrier between the two sides.

The functions called from the device drivers, make_dev(), destroy_dev() &c. only manipulate the DEVFS inode number of the dev_t in question and do not even get near any mountpoints or vnodes.

For make_dev() the task is to assign a unique inode number to the dev_t and store the dev_t in the DEVFS-global inode-to-dev_t array.

    make_dev(...)
        store argument values in dev_t
        assign unique inode number to dev_t
        atomically insert dev_t into inode_array

For destroy_dev() the task is the opposite: clear the inode number in the dev_t and NULL the pointer in the devfs-global inode-to-dev_t array.

    destroy_dev(...)
        clear fields in dev_t
        zero dev_t inode number.
        atomically clear entry in inode_array

Both functions conclude by atomically incrementing a global variable devfs_generation to leave an indication to the filesystem side that something has changed.

By modifying the global state only with atomic instructions, locks have been entirely avoided in this part of the code which means that the make_dev() and destroy_dev() functions can be called from practically anywhere in the kernel at any time.

On the filesystem side of DEVFS, the only two vnode methods which examine or rely on the directory structure, VOP_LOOKUP and VOP_READDIR, call the function devfs_populate() to update their mountpoint’s view of the device hierarchy to match current reality before doing any work.

    devfs_readdir(...)
        devfs_populate(...)
        ...

The devfs_populate() function compares the current devfs_generation to the value saved in the mountpoint last time devfs_populate() completed and if (actually: while) they differ a linear run is made through the devfs-global inode-array and the directory tree of the mountpoint is brought up to date.

The actual code is slightly more complicated than shown in the pseudo-code here because it has to deal with subdirectories and hidden entries.

    devfs_populate(...)
        while (mount->generation != devfs_generation)
            for i in all inodes
                if (inode created)
                    create directory entry

    #ifndef NFOO
    #define NFOO 4
    #endif

    struct foo_softc {
        ...
    } foo_softc[NFOO];

    int nfoo = 0;

    foo_open(dev, ...)
    {
        int unit = minor(dev);
        struct foo_softc *sc;

        if (unit >= NFOO || unit >= nfoo)
            return (ENXIO);

        sc = &foo_softc[unit];
        ...
    }

    foo_attach(...)
    {
        struct foo_softc *sc;
        static int once;
        ...
else if inode destroyed if (nfoo >= NFOO) { remove directory entry /* Have hardware, can't handle */ return (-1); } Access to the global DEVFS inode table is again imple- sc = &foo_softc[nfoo++]; mented with atomic instructions and failsafe retries to if (!once) { avoid the need for locking. cdevsw_add(&cdevsw); once++; From a performance point of viewthis scheme also } means that a particular DEVFS mountpoint is not ... } updated until it needs to be, and then always by a pro- cess belonging to the jail in question thus minimising Fig. 3 - Device-driver, old style. and distributing the CPU load. Also notice howrange checking is needed to makesure that the minor# is inside range. This code gets more 9. Device-driverimpact complexifdevice-numbering is sparse. Code equiv- All these changes have had a significant impact on how alent to that shown in the foo_open() routine would device drivers interact with the rest of the kernel also be needed in foo_read(), foo_write(), foo_ioctl() regarding registration of devices. &c. If we look first at the ‘‘before’’image in Fig. 3, we Finally notice howthe attach routine needs to remem- notice first the NFOO define which imposes a firm ber to register the cdevsw structure (not shown) when upper limit on the number of devices the kernel can the first device is found. deal with. Also notice that the softc structure for all of Now, compare this to our ‘‘after’’image in Fig. 4. them is allocated at compile time. This is because most NFOO is totally gone and so is the compile time alloca- device drivers (and texts on writing device drivers) are tion of space for softc structures. from before the general kernel malloc facility [Mcku- The foo_open (and foo_close, foo_ioctl &c) functions sick1988] was introduced into the BSD kernel. can nowderive the softc pointer directly from the dev_t theyreceive asanargument. appears, mount it. When a network interface appears, struct foo_softc { .... configure it. 
And in some configurable way allowthe }; user to customise the action, so that for instance images will automatically be copied offthe flash-based media int nfoo; from a camera, &c. foo_open(dev, ...) { In this context it is good to question howweview struct foo_softc *sc = dev->si_drv1; dynamic devices. If for instance a printer is removedin the middle of a print job and another printer arrivesa ... } moment later,should the system automatically continue the print job on this newprinter? When adisk-like foo_attach(...) device arrives, should we always mount it? Should we { struct foo_softc *sc; have a database of known disk-likedevices to tell us where to mount it, what permissions to give the mount- ... point? Some computers come in multiple configura- sc = MALLOC(..., M_ZERO); if (sc == NULL) { tions, for instance laptops with and without their dock- /* Have hardware, can't handle */ ing station. Howdowewant to present this to the users return (-1); and what behaviour do the users expect? } sc->dev = make_dev(&cdevsw, nfoo, UID_ROOT, GID_WHEEL, 0644, "foo%d", nfoo); 10.2. Pathname length limitations nfoo++; In order to simplify memory management in the early sc->dev->si_drv1 = sc; ... stages of boot, the pathname relative tothe mountpoint } is presently stored in a small fixed size buffer inside struct specinfo. It should be possible to use filenames Fig. 4 - Device-driver, new style. as long as the system otherwise permits, so some kind In foo_attach() we can nowattach to all the devices we of extension mechanism is called for. can allocate memory for and we register the cdevsw Since it cannot be guaranteed that memory can be allo- structure per dev_t rather than globally. cated in all the possible scenarios where make_dev() This last trick is what allows us to discard all bounds can be called, it may be necessary to mandate that the checking in the foo_open() &c. 
routines, because they caller allocates the buffer if the content will not fit can only be called through the cdevsw,and the cdevsw inside the default buffer size. is only attached to dev_t’swhich foo_attach() has cre- ated. There is no way to end up in foo_open() with a 10.3. Initial access parameter selection dev_t not created by foo_attach(). As it is now, device drivers propose the initial mode, In the twoexamples here, the difference is only 10 lines owner and group for the device nodes, but it would be of source code, primarily because only one of the more flexible if it were possible to give the kernel a set worker functions of the device driverisshown. In real of rules, much likepacket filtering rules, which allow device drivers it is not uncommon to save 50ormore the user to set the wanted policyfor newdevices. Such lines of source code which typically is about a percent amechanism could also be used to filter newdevices or twoofthe entire driver. for mount points in jails and to determine other behaviour. 10. Futurework Doing these things from userland results in some awk- Apart from some minor issues to be cleaned up, ward race conditions and software bloat for embedded DEVFS is nowareality and future work therefore is systems, so a kernel approach may be more suitable. likely concentrate on applying the facilities and func- tionality of DEVFS to FreeBSD. 10.4. Applications of on-demand device creation 10.1. devd The facility for on-demand creation of devices has It would be logical to complement DEVFS with a some very interesting possibilities. ‘‘device-daemon’’which could configure and de-con- One planned use is to enable user-controlled labelling figure devices as theycome and go. When a disk of disks. Today disks have names like/dev/da0, /dev/ad4, but since this numbering is topological any credit for inspiration. 
change in the hardware configuration may rename the Julian Elischer implemented a DEVFS for FreeBSD disks, causing /etc/fstab and backup procedures to get around 1994 which neverquite made it to maturity and out of sync with the hardware. subsequently was abandoned. The current idea is to store on the media of the disk a Bruce Evans deserves special credit not only for his user-chosen disk name and allowaccess through this keen eye for detail, and his competent criticism but also name, so that for instance /dev/mydisk0 would be a for his enthusiastic resistance to the very concept of symlink to whatevertopological name the disk might DEVFS. have atany giv entime. Manythanks to the people who took time to help me To simplify this and to avoid a forest of symlinks, it stamp out ‘‘Danglish’’through their reviews and com- will probably be decided to move all the sub-divisions ments: Chris Demetriou, Paul Richards, Brian Somers, of a disk into one subdirectory per disk so just a single Nik Clayton, and Hanne Munkholm. Anyremaining symlink can do the job.Inpractice that means that the insults to proper use of english language are my own current /dev/ad0s2f will become something like fault. /dev/ad0/s2f and so on. Obviously,inthe same way, disks could also be accessed by their topological address, down to the specific path in a SAN environ- 13. References ment. [44BSDBook] Mckusick, Bostic, Karels & Quarter- Another potential use could be for automated offline man: ‘‘The Design and Implementation of 4.4 BSD data media libraries. It would be quite trivial to makeit Operating System.’’ Addison Wesley, 1996, ISBN possible to access all the media in the library using 0-201-54979-4. /dev/lib/$LABEL which would be a remarkable simpli- [Heidemann91a] John S. Heidemann: ‘‘Stackable lay- fication compared with most current automated ers: an architecture for filesystem development.’’ Mas- retrievalfacilities. 
ter’sthesis, University of California, Los Angeles, July Another use could be to access devices by parameter 1991. Available as UCLA technical report rather than by name. One could imagine sending a CSD-910056. printjob to /dev/printer/color/A2 and behind the scenes [Kamp2000] Poul-Henning Kamp and Robert N. M. asearch would be made for a device with the correct Watson: ‘‘Confining the Omnipotent root.’’ Proceed- properties and paper-handling facilities. ings of the SANE 2000 Conference. Av ailable in FreeBSD distributions in /usr/share/papers. 11. Conclusion [MD.C] Poul-Henning Kamp et al: FreeBSD memory DEVFS has been successfully implemented in disk driver: src/sys/dev/md/md.c FreeBSD, including a powerful, simple and flexible [Mckusick1988] Marshall Kirk Mckusick, MikeJ. solution supporting pseudo-devices and on-demand Karels: ‘‘Design of a General Purpose Memory Alloca- device node creation. tor for the 4.3BSD UNIX-Kernel’’Proceedings of the Contrary to the trend, the implementation added func- San Francisco USENIX Conference, pp. 295-303, June tionality with a net decrease in source lines, primarily 1988. because of the improvedAPI seen from device drivers [Mckusick1999] Dr.Marshall Kirk Mckusick: Private point of view. email communication. ‘‘According to the SCCS logs, Even if DEVFS is not desired, other 4.4BSD derived the chroot call was added by on March 18, UNIX variants would probably benefit from adopting 1982 approximately 1.5 yearsbefore4.2BSD was the dev_t/specinfo related cleanup. released. That was well beforewehad ftp serversof any sort (ftp did not show up in the source tree until January 1983). My best guess as to its purpose was to 12. 
Acknowledgements allow Bill to chroot into the /4.2BSD build directory Ifirst got started on DEVFS in 1989 because the and build a system using only the files, include files, etc abysmal performance of the Olivetti M250 computer contained in that tree.That was the only use of chroot forced me to implement a network-disk-device for that I remember from the early days.’’ Minix in order to retain my sanity.That initial work [Mckusick2000] Dr.Marshall Kirk Mckusick: Private led to a crude but working DEVFS for Minix, so obvi- communication at BSDcon2000 conference. ‘‘Ihave ously both AndrewTannenbaum and Olivetti deserve not used blockdevices since I wrote FFS and that was many yearsago.'' [NewBus] NewBus is a subsystem which provides most of the glue between hardware and device drivers. Despite the importance of this there has neverbeen published anygood overviewdocumentation for it. The following article by Alexander Langer in ‘‘Dæmonnews’’isthe best reference I can come up with: http://www.daemonnews.org/200007/newbus- intro.html [Pike2000] Rob Pike: ‘‘Systems Software Research is Irrelevant.’’ http://www.cs.bell−labs.com/who/rob/utah2000.pdf [Pike90a] Rob Pike, Dave Presotto, Ken Thompson and Howard Trickey:‘‘Plan 9 from Bell Labs.’’ Proceed- ings of the Summer 1990 UKUUG Conference. [Pike92a] Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey and Phil Winterbottom: ‘‘The Use of Name Spaces in Plan 9.’’ Proceedings of the 5th ACM SIGOPS Workshop. [Raspe1785] Rudolf Erich Raspe: ‘‘Baron Münch- hausen’sNarrative ofhis marvellous Travels and Cam- paigns in Russia.’’ Kearsley, 1785. [Ritchie74] D.M. Ritchie and K. Thompson: ‘‘The UNIX Time-Sharing System’’Communications of the ACM, Vol. 17, No. 7, July 1974. [Ritchie98] Dennis Ritchie: private conversation at USENIX Annual Technical Conference NewOrleans, 1998. [Thompson78] Ken Thompson: ‘‘UNIX Implementa- tion’’The Bell System Technical Journal, vol 57, 1978, number 6 (part 2) p. 1931ff.