
COMPUTING PRACTICES

Operating System Support for Database Management

Michael Stonebraker
University of California, Berkeley

SUMMARY: Several operating system services are examined with a view toward their applicability to support of database management functions. These services include buffer pool management; the file system; scheduling, process management, and interprocess communication; and consistency control.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

This research was sponsored by U.S. Air Force Office of Scientific Research Grant 78-3596, U.S. Army Research Office Grant DAAG29-76-G-0245, Naval Electronics Systems Command Contract N00039-78-G-0013, and National Science Foundation Grant MCS75-03839-A01.

Key words and phrases: database management, operating systems, buffer management, file systems, scheduling, interprocess communication
CR Categories: 3.50, 3.70, 4.22, 4.33, 4.34, 4.35
Author's address: M. Stonebraker, Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720.
(c) 1981 ACM 0001-0782/81/0700-0412 $00.75.

1. Introduction

Database management systems (DBMS) provide higher level user support than conventional operating systems. The DBMS designer must work in the context of the OS he/she is faced with, and different operating systems are designed for different uses. In this paper we examine several popular operating system services and indicate whether they are appropriate for support of database management functions. Often we will see that the wrong service is provided or that severe performance problems exist. When possible, we offer some suggestions concerning improvements. In the next several sections we look at the services provided by buffer pool management; the file system; scheduling, process management, and interprocess communication; and consistency control. We then conclude with a discussion of the merits of including all files in a paged virtual memory.

The examples in this paper are drawn primarily from the UNIX operating system [17] and the INGRES relational database system [19, 20], which was designed for use with UNIX. Most of the points made for this environment have general applicability to other operating systems and data managers.

2. Buffer Pool Management

Many modern operating systems provide a main memory cache for the file system. Figure 1 illustrates this service. In particular, UNIX provides a buffer pool whose size is set when the operating system is compiled. Then, all file I/O is handled through this cache. A file read (e.g., read X in Figure 1) returns data directly from a block in the cache, if possible; otherwise, it causes a block to be "pushed" to disk and replaced by the desired block. In Figure 1 we show block Y being pushed to make room for block X. A file write simply moves data into the cache; at some later time the buffer manager writes the block to the disk. The UNIX buffer manager uses the popular LRU [15] replacement strategy. Finally, when UNIX detects sequential access to a file, it prefetches blocks before they are requested.

[Fig. 1. Structure of a cache.]

Conceptually, this service is desirable because blocks for which there is so-called locality of reference [15, 18] will remain in the cache over repeated reads and writes. However, the problems enumerated in the following subsections arise in using this service for database management.
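To make the discipline concrete, the following is a minimal sketch of the cache behavior just described: a fixed pool of frames, LRU replacement, and write-back of dirty victims. It is an illustration under simplifying assumptions (a simulated "disk", hypothetical names such as Frame and pool_read), not the actual UNIX buffer code.

```c
/* Minimal sketch of an LRU buffer pool. Frame, pool_read, and the toy
 * disk routines are hypothetical; they only illustrate the discipline. */
#include <stdio.h>
#include <string.h>

#define NFRAMES 4
#define BLKSIZE 512

typedef struct {
    int  blkno;          /* which disk block this frame holds; -1 = free */
    int  dirty;          /* must be written back before reuse            */
    long last_used;      /* logical clock for LRU                        */
    char data[BLKSIZE];
} Frame;

static Frame pool[NFRAMES];
static long  clock_now = 0;

/* Stand-ins for real disk I/O. */
static void disk_read(int blkno, char *buf)  { memset(buf, blkno, BLKSIZE); }
static void disk_write(int blkno, char *buf) { (void)blkno; (void)buf; }

/* Return a cached copy of block blkno, faulting it in if necessary. */
static char *pool_read(int blkno)
{
    Frame *victim = &pool[0];
    for (int i = 0; i < NFRAMES; i++) {
        if (pool[i].blkno == blkno) {          /* hit: serve from the cache */
            pool[i].last_used = ++clock_now;
            return pool[i].data;
        }
        if (pool[i].last_used < victim->last_used)
            victim = &pool[i];                 /* remember the LRU frame    */
    }
    if (victim->dirty)                         /* "push" block Y to disk... */
        disk_write(victim->blkno, victim->data);
    disk_read(blkno, victim->data);            /* ...and replace it with X  */
    victim->blkno = blkno;
    victim->dirty = 0;
    victim->last_used = ++clock_now;
    return victim->data;
}

int main(void)
{
    for (int i = 0; i < NFRAMES; i++) pool[i].blkno = -1;
    pool_read(1); pool_read(2); pool_read(3);
    printf("block 1 byte: %d\n", pool_read(1)[0]);   /* still cached */
    return 0;
}
```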

2.1 Performance

The overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move. For UNIX on a PDP-11/70 the cost to fetch 512 bytes exceeds 5,000 instructions; to fetch 1 byte from the buffer pool requires about 1,800 instructions. It appears that these numbers are somewhat higher for UNIX than for other contemporary operating systems, and they can be cut somewhat on VAX 11/780 hardware [10]. It is hoped that this trend toward lower overhead access will continue.

However, many DBMSs, including INGRES [20] and System R [4], choose to put a DBMS managed buffer pool in user space to reduce overhead. Hence, each of these systems has gone to the trouble of constructing its own buffer pool manager to enhance performance.

In order for an operating system (OS) provided buffer pool manager to be attractive, the access overhead must be cut to a few hundred instructions. The trend toward providing the file system as a part of shared virtual memory (e.g., [16]) may provide a solution to this problem. This topic is examined in detail in Section 6.

2.2 LRU Replacement

Although the folklore indicates that LRU is a generally good algorithm for buffer management, it appears to perform only marginally in a database environment. Database access in INGRES is a combination of:

(1) sequential access to blocks which will not be rereferenced;
(2) sequential access to blocks which will be cyclically rereferenced;
(3) random access to blocks which will not be referenced again; and
(4) random access to blocks for which there is a nonzero probability of rereference.

Although LRU works well for case 4, it is a bad strategy for the other situations. Since a DBMS knows which blocks are in each category, it can use a composite strategy: for case 4 it should use LRU, while for cases 1 and 3 it should toss blocks immediately. For blocks in class 2 the reference pattern is 1, 2, 3, ..., n, 1, 2, 3, .... Clearly, LRU is the worst possible replacement algorithm for this situation; unless all n blocks can be kept in the cache, the strategy should again be to toss immediately. Initial studies [9] suggest that the miss ratio can be cut 10-15% by a DBMS specific algorithm.

In order for an OS to provide buffer management, some means must be found to allow it to accept "advice" from an application program (e.g., a DBMS) concerning the replacement strategy. Designing a clean buffer management interface with this feature would be an interesting problem.
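A sketch of the composite strategy, behind a hypothetical "advice" interface: the DBMS tags each block with the reference pattern it expects, and the replacement routine prefers blocks the DBMS has promised not to reread. No existing OS offers such a call; the names (Advice, choose_victim) are illustrative only.

```c
/* Hypothetical advice-driven victim selection for the composite strategy. */
#include <stdio.h>

typedef enum {
    ADV_TOSS_IMMEDIATE,  /* cases 1-3 above: no rereference expected, or a  */
                         /* cyclic pattern longer than the pool (LRU-worst) */
    ADV_LRU              /* case 4: nonzero probability of rereference      */
} Advice;

typedef struct {
    int    blkno;
    Advice advice;       /* supplied by the DBMS with each request */
    long   last_used;
} Frame;

/* Pick a victim frame: any toss-immediate block first, else true LRU. */
static int choose_victim(const Frame *f, int n)
{
    int lru = 0;
    for (int i = 0; i < n; i++) {
        if (f[i].advice == ADV_TOSS_IMMEDIATE)
            return i;                   /* reusable at once, no LRU aging */
        if (f[i].last_used < f[lru].last_used)
            lru = i;
    }
    return lru;
}

int main(void)
{
    Frame pool[3] = { { 7, ADV_LRU,            41 },
                      { 8, ADV_TOSS_IMMEDIATE, 99 },  /* sequential scan */
                      { 9, ADV_LRU,            17 } };
    /* Block 8 is evicted even though it is the most recently used. */
    printf("victim: block %d\n", pool[choose_victim(pool, 3)].blkno);
    return 0;
}
```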
2.3 Prefetch

Although UNIX correctly prefetches pages when sequential access is detected, there are important instances in which it fails. Except in rare cases, INGRES knows at (or very shortly after) the beginning of its examination of a block exactly which block it will access next. Unfortunately, this block is not necessarily the next one in logical file order. Hence, there is no way for an OS to implement the correct prefetch strategy.

2.4 Crash Recovery

An important DBMS service is to provide recovery from hard and soft crashes. The desired effect is for a unit of work (a transaction), which may be quite large and span multiple files, to be either completely done or to look like it had never started.

The way many DBMSs provide this service is to maintain an intentions list. When the intentions list is complete, a commit flag is set. The last step of a transaction is to process the intentions list, making the actual updates. The DBMS makes this last operation idempotent (i.e., it generates the same final outcome no matter how many times the intentions list is processed) by careful programming. The general procedure is described in [6, 13]. An alternate approach is to do updates as they are found and maintain a log of before images so that backout is possible.

During recovery from a crash the commit flag is examined. If it is set, the DBMS recovery utility processes the intentions list to correctly install the changes made by updates in progress at the time of the crash. If the flag is not set, the utility removes the intentions list, thereby backing out the transaction.

The impact of crash recovery on the buffer pool manager is the following. The page on which the commit flag exists must be forced to disk after all pages in the intentions list. Moreover, the transaction is not reliably committed until the commit flag is forced out to the disk, and no response can be given to the person submitting the transaction until this time.

The service required from an OS buffer manager is a selected force out which would push the intentions list and the commit flag to disk in the proper order. Such a service is not present in any buffer manager known to us.
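What selected force out must guarantee can be sketched with the much later POSIX fsync call (no such primitive existed in the systems discussed here) and hypothetical file names. The intentions list must be durable before the commit flag: if the flag survives a crash, recovery replays the idempotent list; if not, the list is discarded and the transaction is backed out.

```c
/* Sketch of the required write ordering, using modern POSIX calls.
 * File names and record format are hypothetical. */
#include <fcntl.h>
#include <unistd.h>

static int commit(const char *intentions, size_t len)
{
    /* Step 1: write the intentions list and force it to disk. */
    int ifd = open("intent.log", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (ifd < 0) return -1;
    if (write(ifd, intentions, len) != (ssize_t)len || fsync(ifd) < 0) {
        close(ifd);
        return -1;
    }
    close(ifd);

    /* Step 2: only after step 1 is durable, write and force the flag. */
    int ffd = open("commit.flag", O_CREAT | O_WRONLY, 0600);
    if (ffd < 0) return -1;
    if (write(ffd, "C", 1) != 1 || fsync(ffd) < 0) {
        close(ffd);
        return -1;
    }
    close(ffd);
    return 0;   /* only now may the user be told "committed" */
}

int main(void)
{
    return commit("replace page 12 with new image ...", 34) ? 1 : 0;
}
```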

2.5 Summary

Although it is possible to provide an OS buffer manager with the required features, none currently exists, at least to our knowledge. Designing such a facility with prefetch advice, block management advice, and selected force out would be an interesting exercise. It would be of interest in the context of both a paged virtual memory and an ordinary file system.

The strategy used by most DBMSs (for example, System R [4] and IMS [8]) is to maintain a separate cache in user space. This buffer pool is managed by a DBMS specific algorithm to circumvent the problems mentioned in this section. The result is a "not quite right" service provided by the OS going unused and a comparable application specific service being provided by the DBMS. Throughout this paper we will see variations on this theme in several service delivery areas.

3. The File System

The file system provided by UNIX supports objects (files) which are character arrays of dynamically varying size. On top of this abstraction, a DBMS can provide whatever higher level objects it wishes.

This is one of two popular approaches to file systems; the second is to provide a record management system inside the OS (e.g., RMS-11 for DEC machines or Enscribe for Tandem machines). In this approach structured files are provided (with or without variable length records). Moreover, efficient access is often supported for fetching records corresponding to a user supplied value (or key) for a designated field or fields. Multilevel directories, hashing, and secondary indexes are often used to provide this service.

The point to be made in this section is that the second service, which is what a DBMS wants, is not always efficient when constructed on top of a character array object. The following subsections explain why.

3.1 Physical Contiguity

The character array object can usually be expanded one block at a time. Often the result is that the blocks of a given file are scattered over a disk volume. Hence, the next logical block in a file is not necessarily physically close to the previous one. Since a DBMS does considerable sequential access, the result is considerable disk arm movement.

The desired service is for blocks to be stored physically contiguous and for a whole collection to be read when sequential access is desired. This naturally leads a DBMS to prefer a so-called extent based file system (e.g., VSAM [11]) to one which scatters blocks. Of course, such files must grow an extent at a time rather than a block at a time.
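The extent idea fits in a few lines: a file is described by a short list of (start, length) runs of physically contiguous disk blocks and grows an extent at a time. The structures below are hypothetical; real extent based systems such as VSAM keep an equivalent map in the file control block.

```c
/* Sketch of an extent map: logical-to-physical translation for a file
 * built from contiguous runs of blocks. */
#include <stdio.h>

#define MAXEXTENTS 16

typedef struct { int start, len; } Extent;
typedef struct { Extent ext[MAXEXTENTS]; int nextents; } ExtFile;

/* Grow the file by one physically contiguous run of blocks. */
static int grow(ExtFile *f, int start, int len)
{
    if (f->nextents == MAXEXTENTS) return -1;
    f->ext[f->nextents].start = start;
    f->ext[f->nextents].len   = len;
    f->nextents++;
    return 0;
}

/* Map a logical block number to a physical one; -1 means past EOF.
 * Within one extent, logical order is physical order, so a sequential
 * scan causes essentially no arm movement. */
static int physical(const ExtFile *f, int lblk)
{
    for (int i = 0; i < f->nextents; i++) {
        if (lblk < f->ext[i].len) return f->ext[i].start + lblk;
        lblk -= f->ext[i].len;
    }
    return -1;
}

int main(void)
{
    ExtFile f = { {{0, 0}}, 0 };
    grow(&f, 1000, 8);                       /* blocks 1000-1007 */
    grow(&f, 5000, 8);                       /* blocks 5000-5007 */
    printf("logical 3  -> physical %d\n", physical(&f, 3));   /* 1003 */
    printf("logical 10 -> physical %d\n", physical(&f, 10));  /* 5002 */
    return 0;
}
```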
3.2 Tree Structured File Systems

UNIX implements two services by means of data structures which are trees. First, the blocks in a given file are kept track of in a tree (of indirect blocks) pointed to by a file control block (i-node). Second, the files in a given mounted file system have a user visible hierarchical structure composed of directories, subdirectories, etc., implemented by a second tree. A DBMS such as INGRES then adds a third tree structure to support keyed access via a multilevel directory structure (e.g., ISAM [7], B-trees [1, 12], VSAM [11], etc.).

Clearly, one tree with all three kinds of information is more efficient than three separately managed trees; the extra overhead for three separate trees is probably substantial.

3.3 Summary

It is clear that a character array is not a useful object to a DBMS. Rather, it is the object presumably desired by language processors, editors, etc. Instead of providing records management on top of character arrays, it is possible to do the converse; the only issue is one of efficiency. Moreover, editors can possibly use records management structures as efficiently as those they create themselves [2]. It is our feeling that OS designers should contemplate providing DBMS facilities as lower level objects and character arrays as higher level ones. This philosophy has already been presented in [5].

4. Scheduling, Process Management, and Interprocess Communication

Often, the simplest way to organize a multiuser database system is to have one OS process per user; i.e., each concurrent database user runs in a separate process. It is hoped that all users will share the same copy of the code segment of the database system and perhaps one or more data segments. In particular, a DBMS buffer pool and lock table should be handled as a shared segment (see the sketch below). This structure is followed by System R and, in part, by INGRES. Since UNIX has no shared data segments, INGRES must put the lock table inside the operating system and provide buffering private to each user.

The alternative organization is to allocate one run-time database process which acts as a server. All concurrent users send messages to this server with their work requests. The one run-time server schedules requests through its own mechanisms and may support its own multitasking system. This organization is followed by Enscribe [21]. Figure 2 shows both possibilities.

[Fig. 2. Two approaches to organizing a multiuser database system: the process-per-user structure and the server structure.]

Although Lauer [14] points out that the two methods are equally viable in a conceptual sense, the design of most operating systems strongly favors the first approach. For example, UNIX contains a message system (pipes) which is incompatible with the notion of a server process. Hence, it forces the use of the first alternative. There are at least two problems with the process-per-user approach.
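Before turning to those problems, here is the shared-segment arrangement mentioned above, sketched with the later System V shared memory interface. The UNIX of this era had no shared data segments, which is exactly the limitation INGRES ran into; the layout here is purely illustrative.

```c
/* Each process-per-user DBMS process would attach the same segment
 * holding the buffer pool and lock table. Hypothetical layout. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define NFRAMES 64
#define BLKSIZE 512

struct shared_state {
    char buffer_pool[NFRAMES][BLKSIZE]; /* one cache shared by all users */
    int  lock_table[NFRAMES];           /* who holds which page          */
};

int main(void)
{
    int id = shmget(IPC_PRIVATE, sizeof(struct shared_state), IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    /* In a real system every DBMS process would attach this same id. */
    struct shared_state *s = shmat(id, NULL, 0);
    if (s == (void *)-1) { perror("shmat"); return 1; }
    s->lock_table[0] = 42;              /* visible to all attached processes */
    printf("segment of %zu bytes attached\n", sizeof(struct shared_state));

    shmdt(s);
    shmctl(id, IPC_RMID, NULL);         /* tear down the demo segment */
    return 0;
}
```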

4.1 Performance

Every time a run-time database process issues an I/O request that cannot be satisfied by data in the buffer pool, a task switch is inevitable. The DBMS suspends while waiting for the required data and another process is run. It is possible to make task switches very efficiently, and some operating systems can perform a task switch in a few hundred instructions. However, many operating systems have "large" processes, i.e., ones with a great deal of state information (e.g., accounting) and a sophisticated scheduler. This tends to cause task switches costing a thousand instructions or more. That is a high price to pay for a buffer pool miss.

4.2 Critical Sections

Blasgen [3] has pointed out that some DBMS processes have critical sections. If the buffer pool is a shared data segment, then portions of the buffer pool manager are necessarily critical sections. System R handles critical sections by setting and releasing short-term locks which basically simulate semaphores. A problem arises if the operating system scheduler deschedules a database process while it is holding such a lock. All other database processes cannot execute very long without accessing the buffer pool; hence, they quickly queue up behind the locked resource. Although the probability of this occurring is low, the resulting convoy [3] has a devastating effect on performance.
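The "short-term locks which basically simulate semaphores" can be sketched as a spinlock guarding the shared buffer pool (C11 atomics here; any test-and-set primitive would do). If the scheduler deschedules a process between acquire() and release(), every other process spins at the acquire: the convoy.

```c
/* Sketch of a short-term buffer pool latch. Names are hypothetical. */
#include <stdatomic.h>

static atomic_flag pool_latch = ATOMIC_FLAG_INIT;

static void acquire(void) { while (atomic_flag_test_and_set(&pool_latch)) ; }
static void release(void) { atomic_flag_clear(&pool_latch); }

/* A typical buffer manager critical section. */
static void pin_page(int blkno)
{
    acquire();
    /* ... look up blkno, possibly choose a victim, bump a use count ... */
    (void)blkno;
    release();
}

int main(void)
{
    pin_page(7);   /* uncontended acquire and release */
    return 0;
}
```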
As a result of these two problems with the process-per-user model, one might expect the server model to be especially attractive. The following subsection explores this point of view.

4.3 The Server Model

A server model becomes viable if the operating system provides a message facility which allows n processes to originate messages to a single destination process. However, such a server must do its own scheduling and multitasking. This involves a painful duplication of operating system facilities. In order to avoid such duplication, one must resort to the following tactics.

One can avoid multitasking and a scheduler by building a first-come-first-served server with no internal parallelism: a work request is read from the message system and executed to completion before the next one is started (a skeleton of such a loop is sketched below). This approach makes little sense if there is more than one physical disk. Each work request will tend to have one disk read outstanding at any instant; hence, at most one disk will be active with a non-multitasking server. Even with a single disk, a long work request will be processed to completion while shorter requests must wait, and the penalty on average response time may be considerable [18].
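A skeleton of the first-come-first-served server: one request read, executed to completion, and answered before the next is started, with no scheduler and no multitasking. The request format and execute_request() are hypothetical, and pipes stand in for the message facility.

```c
/* Sketch of a non-multitasking server loop. */
#include <unistd.h>

struct request { int user; char query[64]; };

/* Stand-in for running one work request to completion. */
static int execute_request(const struct request *r) { return r->user; }

static void server_loop(int req_fd, int reply_fd)
{
    struct request r;
    while (read(req_fd, &r, sizeof r) == (ssize_t)sizeof r) {
        int status = execute_request(&r);        /* long requests make   */
        write(reply_fd, &status, sizeof status); /* short ones wait      */
    }
}

int main(void)
{
    int req[2], rep[2];
    if (pipe(req) < 0 || pipe(rep) < 0) return 1;

    struct request r = { 1, "range of e.salary ..." };
    write(req[1], &r, sizeof r);
    close(req[1]);                               /* EOF terminates the loop */

    server_loop(req[0], rep[1]);
    int status;
    read(rep[0], &status, sizeof status);
    return status == 1 ? 0 : 1;
}
```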

To achieve internal parallelism yet avoid multitasking, one could have user processes send work requests to one of perhaps several common servers, as noted in Figure 3. However, such servers would have to share a lock table and are only slightly different from the shared code process-per-user model. Alternately, one could have a collection of servers, each of which would send low-level requests to a group of disk processes which actually perform the I/O and locking, as suggested in Figure 4. A disk process would process requests in first-in-first-out order. Although this organization appears potentially desirable, it still may have the response time penalty mentioned above. Moreover, it results in one message per I/O request. In reality one has traded a task switch per I/O for a message per I/O; the latter may turn out to be more expensive than the former. In the next subsection, we discuss message costs in more detail.

[Fig. 3. Server pool structure.]

[Fig. 4. Disk server structure.]

4.4 Performance of Message Systems

Although we have never been offered a good explanation of why messages are so expensive, the fact remains that in most operating systems the cost of a round-trip message is several thousand instructions. For example, in PDP-11/70 UNIX the number is about 5,000. As a result, care must be exercised in a DBMS to avoid overuse of a facility that is not cheap. Consequently, viable DBMS organizations will sometimes be rejected because of excessive message overhead.

4.5 Summary

There appears to be no way out of the scheduling dilemma; both the server model and the individual process model seem unattractive. The problem is, at least in part, the overhead in some operating systems of task switches and messages. Either operating system designers must make these facilities cheaper or they must provide special fast path functions for DBMS consumers. If this does not happen, DBMS designers will presumably continue the present practice: implementing their own multitasking, scheduling, and message systems entirely in user space. The result is a "mini" operating system running in user space in addition to the DBMS.

One ultimate solution to task-switch overhead might be for an operating system to create a special scheduling class for the DBMS and other "favored" users. Processes in this class would never be forcibly descheduled but might voluntarily relinquish the CPU at appropriate intervals. This would solve the convoy problem mentioned in Section 4.2. Moreover, such special processes might also be provided with a fast path through the task switch/scheduler loop to pass control to one of their sibling processes. Hence, a DBMS process could pass control to another DBMS process at low overhead.

5. Consistency Control

The services provided by an operating system in this area include the ability to lock objects for shared or exclusive access and support for crash recovery. Although most operating systems provide locking for files, fewer support finer granularity locks, such as those on pages or records. Such smaller locks are deemed essential in some database environments.

Moreover, many operating systems provide some cleanup after crashes. If they do not offer support for database transactions as discussed in Section 2.4, then a DBMS must provide transaction crash recovery on top of whatever is supplied.

It has sometimes been suggested that both locking and crash recovery for transactions be provided entirely inside the operating system (e.g., [13]). Conceptually, they should be at least as efficient as if provided in user space. The only problem with this approach is buffer management. If a DBMS provides buffer management in addition to whatever is supplied by the operating system, then transaction management by the operating system is impacted as discussed in the following subsections.

5.1 Commit Point

When a transaction commits, a user space buffer manager must ensure that all appropriate blocks are flushed and a commit record delivered to the operating system. Hence, the buffer manager cannot be immune from knowledge of transactions, and operating system functions are duplicated.

5.2 Ordering Dependencies

Consider the following employee data:

    Empname    Salary    Manager
    Smith      10,000    Brown
    Jones       9,000    None
    Brown      11,000    Jones

and the update which gives a 20% pay cut to all employees who earn more than their managers. Presumably, Brown will be the only employee to receive a decrease, although there are alternative semantic definitions.

Suppose the DBMS updates the data set as it finds "overpaid" employees, depending on the operating system to provide backout or recover-forward on crashes. If so, Brown might be updated before Smith was examined, and as a result, Smith would also receive the pay cut. It is clearly undesirable to have the outcome of an update depend on the order of execution.
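A worked version of the anomaly, assuming a scan order that happens to visit Brown first: applying the cut in place lets Brown's new salary change the test for Smith, while deferring the updates (an intentions list) evaluates every predicate against the original state. The data is as in the text above.

```c
/* Demonstration of the ordering dependency on the employee data. */
#include <stdio.h>
#include <string.h>

struct emp { const char *name; double salary; const char *mgr; };

static double salary_of(const struct emp *e, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (name && strcmp(e[i].name, name) == 0) return e[i].salary;
    return -1;                            /* no manager */
}

int main(void)
{
    /* Assumed scan order: Brown is examined before Smith. */
    struct emp in_place[] = { {"Brown", 11000, "Jones"},
                              {"Smith", 10000, "Brown"},
                              {"Jones",  9000, NULL} };
    int n = 3;

    /* Wrong: update as we scan. Brown drops to 8,800, so Smith's
     * 10,000 now exceeds his manager's salary and Smith is cut too. */
    for (int i = 0; i < n; i++) {
        double m = salary_of(in_place, n, in_place[i].mgr);
        if (m >= 0 && in_place[i].salary > m) in_place[i].salary *= 0.8;
    }
    printf("in place: Smith=%.0f (wrongly cut)\n",
           salary_of(in_place, n, "Smith"));

    /* Right: decide against the original state, apply afterwards. */
    struct emp deferred[] = { {"Brown", 11000, "Jones"},
                              {"Smith", 10000, "Brown"},
                              {"Jones",  9000, NULL} };
    int cut[3] = {0};
    for (int i = 0; i < n; i++) {
        double m = salary_of(deferred, n, deferred[i].mgr);
        if (m >= 0 && deferred[i].salary > m) cut[i] = 1;
    }
    for (int i = 0; i < n; i++)
        if (cut[i]) deferred[i].salary *= 0.8;
    printf("deferred: Smith=%.0f, Brown=%.0f\n",
           salary_of(deferred, n, "Smith"), salary_of(deferred, n, "Brown"));
    return 0;
}
```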

If the operating system maintains the buffer pool and an intentions list for crash recovery, it can avoid this problem [19]. However, if there is a buffer pool manager in user space, it must maintain its own intentions list in order to properly process this update. Again, operating system facilities are being duplicated.

5.3 Summary

It is certainly possible to have buffering, concurrency control, and crash recovery all provided by the operating system. In order for such a system to be successful, however, the performance problems mentioned in Section 2 must be overcome. It is also reasonable to consider having all three services provided by the DBMS in user space. However, if buffering remains in user space and consistency control does not, then much code duplication appears inevitable. Presumably, this will cause performance problems in addition to increased human effort.

6. Paged Virtual Memory

It is often claimed that the appropriate operating system tactic for database management support is to bind files into a user's paged virtual address space. In Figure 5 we show the address space of a process containing code to be executed, data that the code uses, and the files F1 and F2. Such files can be referenced by a program as if they are program variables. Consequently, a user never needs to do explicit reads or writes; he can depend on the paging facilities of the OS to move his file blocks into and out of main memory. Here, we briefly discuss the problems inherent in this approach.

[Fig. 5. Binding files into an address space.]
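The binding idea can be illustrated with the much later mmap interface (not available on the systems discussed here). Once mapped, F1's bytes are ordinary memory; the pager, not an explicit read, performs the I/O, and with that convenience come the page table and buffering problems examined below. The file name F1 is hypothetical.

```c
/* Sketch of binding a file into an address space via mmap. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("F1", O_RDONLY);      /* hypothetical data file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    char *f1 = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (f1 == MAP_FAILED) { perror("mmap"); return 1; }

    /* No explicit read: touching f1[i] faults the page in. */
    printf("first byte of F1: %d\n", f1[0]);

    munmap(f1, st.st_size);
    close(fd);
    return 0;
}
```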
6.1 Large Files

Any virtual memory scheme must handle files which are large objects. Popular paging hardware creates an overhead of 4 bytes per 4,096-byte page. Consequently, a 100M-byte file will have an overhead of 100K bytes for its page table. Although main memory is decreasing in cost, it may not be reasonable to assume that a page table of this size is entirely resident in primary memory. Therefore, there is the possibility that an I/O operation will induce two page faults: one for the page containing the page table for the data in question and one for the data itself. To avoid the second fault, one must wire down a large page table in main memory.

Conventional file systems include the information contained in the page table in a file control block. Especially in extent based file systems, a very compact representation of this information is possible: a run of 1,000 consecutive blocks can be represented as a starting block and a length field, whereas a page table would store each of the 1,000 addresses even though each differs by just one from its predecessor. Consequently, a file control block is usually made main memory resident at the time the file is opened; as a result, the second I/O need never be paid.

The alternative is to bind chunks of a file into one's address space. Not only does this present a multiuser DBMS with a substantial bookkeeping problem concerning whether needed data is currently addressable, but it also may require a number of bind-unbind pairs in a transaction. Since the overhead of a bind is likely to be comparable to that of a file open, this may substantially slow down performance.

It is an open question whether or not novel paging organizations can assist in solving the problems mentioned in this section.

6.2 Buffering

All of the problems discussed in Section 2 concerning buffering (e.g., prefetch, non-LRU management, and selected force out) exist in a paged virtual memory context. How they can be cleanly handled in this context is another unanswered question.

7. Conclusions

The bottom line is that operating system services in many existing systems are either too slow or inappropriate. Current DBMSs usually provide their own and make little or no use of those offered by the operating system. It is important that future operating system designers become more sensitive to DBMS needs.

A DBMS would prefer a small efficient operating system with only desired services. Of those currently available, the so-called real-time operating systems which efficiently provide minimal facilities come closest to this ideal. On the other hand, most general-purpose operating systems offer all things to all people at much higher overhead. It is our hope that future operating systems will be able to provide both sets of services in one environment.

References

1. Bayer, R. Organization and maintenance of large ordered indices. Proc. ACM-SIGFIDET Workshop on Data Description and Access, Houston, Texas, Nov. 1970. This paper defines a particular form of balanced n-ary tree, called a B-tree. Algorithms to search and maintain this structure on inserts and deletes are presented. The original paper on this popular file organization tactic.

2. Birss, E. Hewlett-Packard Corp., General Systems Div. (private communication).

3. Blasgen, M., et al. The convoy phenomenon. Operating Systs. Rev. 13, 2 (April 1979), 20-25. This article points out the problem with descheduling a process which has a short-term lock on an object which other processes require regularly. The impact on performance is noted and possible solutions proposed.

4. Blasgen, M., et al. System R: An architectural update. Rep. RJ 2581, IBM Res. Ctr., San Jose, Calif., July 1979. Blasgen describes the architecture of System R, a full function relational database manager implemented at IBM Research. The discussion centers on the changes made since the original System R paper was published in 1976.

5. Epstein, R., and Hawthorn, P. Design decisions for the Intelligent Database Machine. Proc. Nat. Comptr. Conf., Anaheim, Calif., May 1980, pp. 237-241. An overview of the philosophy of the Intelligent Database Machine is presented. This system provides a database manager on a dedicated "back end" computer which can be attached to a variety of host machines.

6. Gray, J. Notes on data base operating systems. Rep. RJ 3120, IBM Res. Ctr., San Jose, Calif., Oct. 1978. A definitive report on locking and recovery in a database system. It pulls together most of the ideas on these subjects, including two-phase locking protocols, write ahead log, and variable granularity locks. Should be read every six months by anyone interested in these matters.

7. IBM Corp. OS ISAM Logic. GY28-6618, IBM, White Plains, N.Y., June 1966.

8. IBM Corp. IMS/VS General Information Manual. GH20-1260, IBM, White Plains, N.Y., April 1974.

9. Kaplan. Buffer management policies in a database system. M.S. Th., Univ. of California, Berkeley, Calif., 1980. This thesis simulates various non-LRU buffer management policies on traced data obtained from the INGRES database system. It concludes that the miss rate can be cut 10-15% by a DBMS specific algorithm compared to LRU management.

10. Kashtan, D. UNIX and VMS: Some performance comparisons. SRI International, Menlo Park, Calif. (unpublished working paper). Kashtan's paper contains timings of operating system commands in UNIX and VMS for DEC VAX-11/780 computers. These include timings of file reads, flags, task switches, and pipes.

11. Keehn, D., and Lacy, J. VSAM data set design parameters. IBM Systs. J. (Sept. 1974).

12. Knuth, D. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, Mass., 1973.

13. Lampson, B., and Sturgis, H. Crash recovery in a distributed data storage system. Xerox Res. Ctr., Palo Alto, Calif., 1976 (working paper). The first paper to present the now popular two-phase commit protocol. Also, an interesting model of computer system crashes is discussed and the notion of "safe" storage suggested.

14. Lauer, H., and Needham, R. On the duality of operating system structures. Operating Systs. Rev. 13, 2 (April 1979), 3-19. This article explores in detail the "process-per-user" approach to operating systems versus the "server model." It argues that they are inherently duals of each other and that either should be implementable as efficiently as the other. Very interesting reading.

15. Mattson, R., et al. Evaluation techniques for storage hierarchies. IBM Systs. J. (June 1970). Discusses buffer management in detail. The paper presents and analyzes several policies including FIFO, LRU, OPT, and RANDOM.

16. Redell, D., et al. Pilot: An operating system for a personal computer. Comm. ACM 23, 2 (Feb. 1980), 81-92. Redell et al. focus on Pilot, the operating system for a personal computer developed at Xerox. It is closely coupled with the Mesa language and makes interesting choices in areas like protection that are appropriate for a personal computer.
17. Ritchie, D., and Thompson, K. The UNIX time-sharing system. Comm. ACM 17, 7 (July 1974), 365-375. The original paper describing UNIX, an operating system for PDP-11 computers. Novel points include accessing files, physical devices, and pipes in a uniform way and running the command-line interpreter as a user program. Strongly recommended reading.

18. Shaw, A. The Logical Design of Operating Systems. Prentice-Hall, Englewood Cliffs, N.J., 1974.

19. Stonebraker, M., et al. The design and implementation of INGRES. ACM Trans. Database Systs. 1, 3 (Sept. 1976), 189-222. The original paper describing the structure of the INGRES database management system, a relational data manager for PDP-11 computers.

20. Stonebraker, M. Retrospection on a database system. ACM Trans. Database Systs. 5, 2 (June 1980), 225-240. A self-critique of the INGRES system by one of its designers. The article discusses design flaws in the system and indicates the historical progression of the project.

21. Tandem Computers, Inc. Enscribe Reference Manual. Tandem, Cupertino, Calif., Aug. 1979.
