Persistent Storage for 64-bit Systems

Srinivasa Rao Palnati and Gautam Barua

Abstract: Modern computer architectures have introduced 64-bit addresses, and a 64-bit address space is huge. Single Address Space architectures have been proposed in the literature to utilize this address space and to provide efficiencies in context switching and in implementing persistent storage. The problems of implementing single address space systems without adding hardware support are well known. This paper proposes an architecture with multiple address spaces for part of the virtual address space of a system, but with a set of shared address spaces for the rest. Files are mapped to this shared set of spaces, allowing an implementation of persistent storage that can store complex data structures with pointers. The sharing is not global but within groups. This allows overlapping of address spaces and avoids problems of conflicts with backed-up information. The result is a flexible sharing mechanism, more powerful than the current methods of sharing persistent storage.

Index Terms: memory management, operating systems, persistence, single address space.

Manuscript received June 15, 2001. Srinivasa Rao Palnati was at IIT Guwahati. He is now with General Electric (India). Gautam Barua is at IIT Guwahati, Guwahati 781039, India.

I. INTRODUCTION

In current systems, data is stored in persistent storage in the form of files, and there are well defined operations to manipulate the data. Data has to be explicitly brought into the address space of a program by read operations and has to be put back into storage by explicit write operations. The data is not bound to any addresses of the program's address space. This allows the data in a file to be read into any available space in a program. If a data structure in a program is stored in a file exactly as it is, the pointers in the structure are also stored, and the data then gets bound to an address range. It can be brought into memory by a program only into that address range. If a large amount of data is created and stored in files, the virtual address space of a program may not be large enough to accommodate all the data. So data in different files (or in the same file, for that matter) may get bound to the same address range. Such data cannot be in a program's address space at the same time, which creates many programming difficulties. Not binding the data to any address range removes these difficulties.

Mechanisms for sharing data structures have been sought for a long time in various fields of operating system research. Existing data sharing mechanisms, however, provide only limited support for sharing complex data structures containing pointers, due to the limited address space available in 32-bit systems. For example, if processes share complex data structures with mmap in UNIX BSD or shm in UNIX System V, they have to agree with each other on the location of the shared data structure, otherwise pointers lose their meaning. Consequently, if several data structures occupy the same location, they cannot be used simultaneously, nor can they be combined into a single data structure.

Recently proposed operating systems address this problem by using a linear 64-bit address space as a distributed, linear single-level store [2, 3, 6, 7, 8]. That is, all processes on these systems share a single global address space where all volatile and persistent data are assigned unique addresses. Although processes on these systems can share arbitrary data structures, transferring data structures (between processes and between files) becomes more difficult than in conventional systems. For example, if we copy a file (which is a part of the single-level store), we have to rewrite all pointers in it because it moves to another location in the single-level store. Other problems with Single Address Space Operating Systems are given later in the paper. The goals of this design are to

• remove the problems of a Single Address Space Operating System.
• provide persistent storage bound to some address space.
• propose a solution that can be implemented on existing systems, with current applications running without any change.
• propose a solution that can be implemented without any extra hardware support.
• build a prototype by making changes to the Linux kernel.

The solution proposed in this paper uses multiple virtual address spaces to enable sharing of persistent storage. A solution proposed in the literature ("Using Huge Address Spaces to Construct Cooperative Systems" [4]) also uses multiple address spaces. The differences with our solution are given later in the paper.

II. BASIC ARCHITECTURE

Every process runs in its own 64-bit address space. However, a portion of the address space of each process (from address M to address N) is reserved for mapping files. The values of M and N are system configurable parameters. Since a 64-bit address space is very large, the range of addresses from M to N will also be large enough to map all the files in a system. A number of named address spaces are defined. Since these names are global, they are called named Global Allocated Spaces (GAS). Every mapped file in a system is allocated space in one of these GASes. Figure 1 illustrates this. A file is statically mapped into a GAS. A system call (named mmap) is used to load a file into the address space of a running process. The map information of a file is stored in a control file, which has the same name as the original file except for a "." prefix and which is in the same directory as the main file. The kernel reads this control file and loads the file in the appropriate address range. Allocation of space in a GAS is done by a user level daemon process called the Space Manager. Figure 2 illustrates the various components of the architecture. User programs access a mapped file through memory operations. The operating system uses its virtual memory system to update the contents of the file with page outs. Since a file is statically mapped, it can contain pointers.

Figure 1: The File Mapping Mechanism

Figure 2: Private Address space and Global Allocated Spaces

III. THE SPACE MANAGER

The Space Manager is an interface to the user to map files into the global allocated spaces. It manages the allocated spaces and allows the efficient manipulation of data in files. In 64-bit architectures the address space is very large (a full 64-bit address space consumed at the rate of 100MB/sec will last for more than 5000 years). So the reserved area in a process's address space can be very large. The Space Manager performs the following operations:

• It creates Global Allocated Spaces (GAS) upon requests from users. Access control is provided to control mapping of files into the GASes.
• It maps files into user created GASes. The user can specify the bind address and the maximum size the file can have. If the specified bind address is already allocated to some other file, the Space Manager informs the user. If the user does not specify any bind address, the Space Manager allocates an appropriate bind address. The user specified maximum file size is the same as the virtual memory region allocated to that file. The bind address is stored in a control file in the same directory as the actual file and with the same name, but with a "." before the file name. Along with the bind address, other information needed by the kernel, such as the maximum file size and the access control bits, is also stored in the control file.
• It creates File Groups in an allocated space. A file group is a collection of files that can contain inter-file references among them. Files in a file group cannot be unmapped individually by users, because references may become dangling or invalid if a file is unmapped and the space is allocated to another file. File groups are identified by names, called file group names.
• It performs a number of operations on a file group, such as adding a file to a file group, deleting files from a file group, freeing the whole file group, and showing the files in a file group.
• It performs Garbage Collection in an allocated space. Some files might be deleted by users, inadvertently or intentionally. Garbage collection frees their address space for further use by other files.
• It checks for conflicts in the mapping of a file. When a backed-up file is restored, it checks the bound address range of the backed-up file to see if it is free. If that range is occupied by some other file, a new GAS is created and the backed-up file is allotted space in this new address space.
• It handles mapping changes on Copy and Move of files. When a file is copied, the copy has to be allocated the same address range the original file occupies, but in another GAS, and a new control file has to be created. When a file is moved, the control file also has to be moved.
• It extends the mapped region of a file. This extension can be done only if contiguous free memory is available in an allocated space. If such space is not available, the only way to increase the size of a file is to remove it from the current GAS and to create a new GAS in which the file is allocated space in the same address range.
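The bind-address handling described for the Space Manager can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the Space Manager's actual code: the names gas_space, gas_region, gas_init, gas_bind_at and gas_bind_any are hypothetical, a GAS is modelled as a fixed window of addresses with a small sorted array of reserved regions, and first-fit placement is assumed where the paper only says that an appropriate bind address is allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GAS_MAX_REGIONS 64

typedef struct {
    uint64_t start;   /* bind address of a mapped file */
    uint64_t size;    /* maximum file size, i.e. the reserved region */
} gas_region;

typedef struct {
    uint64_t base, limit;                 /* window [base, limit) of the GAS */
    size_t count;
    gas_region regions[GAS_MAX_REGIONS];  /* sorted by start, non-overlapping */
} gas_space;

void gas_init(gas_space *g, uint64_t base, uint64_t limit)
{
    /* base must be non-zero so that 0 can signal failure in gas_bind_any */
    assert(g != NULL && base != 0 && base < limit);
    g->base = base;
    g->limit = limit;
    g->count = 0;
}

/* Returns 1 if [start, start + size) lies inside the window and overlaps
 * no region that is already reserved, 0 otherwise. */
int gas_range_is_free(const gas_space *g, uint64_t start, uint64_t size)
{
    size_t i;
    if (size == 0 || start < g->base || start > g->limit ||
        size > g->limit - start)
        return 0;
    for (i = 0; i < g->count; i++) {
        const gas_region *r = &g->regions[i];
        if (start < r->start + r->size && r->start < start + size)
            return 0;
    }
    return 1;
}

/* Inserts a region while keeping the array sorted by start address. */
static int gas_insert(gas_space *g, uint64_t start, uint64_t size)
{
    size_t i = g->count;
    if (g->count == GAS_MAX_REGIONS)
        return -1;
    while (i > 0 && g->regions[i - 1].start > start) {
        g->regions[i] = g->regions[i - 1];
        i--;
    }
    g->regions[i].start = start;
    g->regions[i].size = size;
    g->count++;
    return 0;
}

/* User-specified bind address: 0 on success, -1 if the range conflicts. */
int gas_bind_at(gas_space *g, uint64_t start, uint64_t size)
{
    if (!gas_range_is_free(g, start, size))
        return -1;
    return gas_insert(g, start, size);
}

/* No bind address given: first-fit search. Returns the chosen bind
 * address, or 0 if no sufficiently large gap exists. */
uint64_t gas_bind_any(gas_space *g, uint64_t size)
{
    uint64_t candidate = g->base;
    size_t i;
    for (i = 0; i < g->count; i++) {
        if (g->regions[i].start - candidate >= size)
            break;                      /* the gap before region i fits */
        candidate = g->regions[i].start + g->regions[i].size;
    }
    if (size == 0 || size > g->limit - candidate)
        return 0;
    if (gas_insert(g, candidate, size) != 0)
        return 0;
    return candidate;
}
```

In this sketch, a request that carries its own bind address would go through gas_bind_at, with a failure reported back to the user as a conflict, while a request without a bind address would go through gas_bind_any. Reserving the full maximum file size up front mirrors the statement that the maximum file size equals the virtual memory region allocated to the file.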

IV. KERNEL IMPLEMENTATION

To provide persistent storage, operating system help is necessary. We have modified the Linux kernel, version 2.2.14. The mmap system call has been modified to provide the mapping of files. The new functionality is provided via the MAP_FIXED flag, whose handling in the kernel has been changed so that the bind address is taken from the control file.

The base address of the file and the maximum virtual memory region allocated to it are obtained from the control file. Conflicts with existing mapped files are checked. If the file size is smaller than the mapped virtual memory region (the Space Manager can allocate a region larger than the actual file size), the user may access addresses beyond the physical size of the file. Normally an exception would occur. However, to keep compatibility with the usual file semantics (files grow when written into at the end), the physical size of the file is increased instead. This increase is allowed only up to the extent of the allocated virtual address space. Two new system calls have been provided to the user:

• The mmstat() system call is used to show the current status of virtual memory allocation. Using this system call, the current memory mapped files and their address ranges can be shown to the user.
• The getmaxsize() system call is provided to get the maximum virtual memory region reserved for a file. The argument is the address returned by a previous mmap system call.

V. USER LIBRARY FUNCTIONS

A system to provide persistence has been described in the previous sections. This section describes a set of library functions that have been implemented to help in the use of the system. The persistent storage described in this paper can be used in many ways depending on the applications of the users, so users can implement their own libraries as per the requirements of a specific application.

A set of library functions has been provided to support dynamic data structures. These library functions operate directly on the address space of a file mapped region. Space can be allocated to data structures within a mapped file using a routine that behaves in a manner similar to the "malloc()" routine. Similarly, a routine like "mfree" is used to free space. Since this area is persistent, there may already be data structures in a file that has been mapped, and there must be some way to access these structures. Relying on base addresses may not suffice, as these structures may move during the execution of a program using them. It is therefore assumed that these structures have names (which are strings) and that a program refers to a structure by its name. The types of these structures are also assumed to be known. So the names of structures, their starting addresses and their sizes are stored at the beginning of a file in a table called an Initvar table. Depending on the size of a file mapped region, some space is reserved for the Initvar table. The free space in the file is tracked using a standard free list mechanism.

The following library functions have been provided to the user. A function, gasminit, prepares a mapped file for access. The first time it is called, it reserves space for the Initvar table and forms a free space list in the file mapped region. Other functions are gasmalloc and gasfree, to allocate and free space respectively; addsize, to add more size to the free space; putvaraddr, to insert the name of an allocated data structure; and getvaraddr, to get the base address of a named data structure. getallvars can be used to get a list of all named structures with their base addresses and sizes, and finally, gasmspacefree can be used to find the amount of free space in the file. The architecture of the system is shown in Figure 3.

Figure 3: The System Architecture

VI. AN EXAMPLE: A TREE

A GAS has been created (using the request creatgas) with the name "datastruct", and a file "f1.dat" has been mapped into it (using mapfile).

Now, the system call mmap is used by a process to map the file "f1.dat" from the GAS "datastruct" into its address space. As discussed, the binding takes place at the address range specified by the control file of "f1.dat". Then the file is initialized with the gasminit() function, and memory is allocated (by using the gasmalloc() function) for the first node. The name of the structure ("btree") and the address of the first node (returned by gasmalloc) are stored in the Initvar table by using the putvaraddr() function. New nodes in the tree are created by calling gasmalloc() for space allocation and by storing a pointer to them in the appropriate node fields. A conceptual view of the file after the creation of such a tree is shown in Figure 4.
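The steps of the tree example can be sketched in C as shown below. This is a hedged illustration rather than the library described in the paper: the paper does not give the signatures of gasminit, gasmalloc, putvaraddr or getvaraddr, so the demo_* functions and types here are hypothetical stand-ins. The sketch also assumes that the caller supplies a region that is always found at the same address (as a statically bound file would be), and it uses a simple bump allocator in place of the free list mechanism mentioned in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All demo_* names are hypothetical; they only mimic the roles of
 * gasminit, gasmalloc, putvaraddr and getvaraddr. */
#define DEMO_MAGIC    0x47415331UL   /* marks an already formatted region */
#define DEMO_MAX_VARS 8
#define DEMO_NAME_LEN 16

typedef union { long long ll; long double ld; void *p; } demo_align_t;

typedef struct {
    char   name[DEMO_NAME_LEN];
    void  *addr;   /* absolute address: valid only at the fixed bind address */
    size_t size;
} demo_var;

typedef struct {
    unsigned long magic;
    size_t region_size;
    size_t next_free;               /* offset of the first unallocated byte */
    size_t nvars;
    demo_var vars[DEMO_MAX_VARS];   /* plays the role of the Initvar table */
} demo_header;

typedef struct node { int key; struct node *left, *right; } node;

static size_t demo_round_up(size_t n)
{
    size_t a = sizeof(demo_align_t);
    return (n + a - 1) / a * a;
}

/* Like gasminit: format the region on first use, otherwise just attach. */
demo_header *demo_init(void *region, size_t region_size)
{
    demo_header *h = region;
    assert(region != NULL && region_size >= demo_round_up(sizeof *h));
    if (h->magic != DEMO_MAGIC) {
        memset(h, 0, sizeof *h);
        h->magic = DEMO_MAGIC;
        h->region_size = region_size;
        h->next_free = demo_round_up(sizeof *h);
    }
    return h;
}

/* Like gasmalloc, but a bump allocator: space is never reused. */
void *demo_alloc(demo_header *h, size_t size)
{
    size_t need;
    if (size == 0 || size > h->region_size - h->next_free)
        return NULL;
    need = demo_round_up(size);
    if (need > h->region_size - h->next_free)
        return NULL;
    h->next_free += need;
    return (unsigned char *)h + h->next_free - need;
}

/* Like putvaraddr: record a named structure in the table. */
int demo_putvar(demo_header *h, const char *name, void *addr, size_t size)
{
    demo_var *v;
    if (h->nvars == DEMO_MAX_VARS || strlen(name) >= DEMO_NAME_LEN)
        return -1;
    v = &h->vars[h->nvars++];
    strcpy(v->name, name);
    v->addr = addr;
    v->size = size;
    return 0;
}

/* Like getvaraddr: base address of a named structure, or NULL. */
void *demo_getvar(const demo_header *h, const char *name)
{
    size_t i;
    for (i = 0; i < h->nvars; i++)
        if (strcmp(h->vars[i].name, name) == 0)
            return h->vars[i].addr;
    return NULL;
}

/* Insert a key into a binary search tree whose nodes live in the region. */
node *tree_insert(demo_header *h, node *root, int key)
{
    if (root == NULL) {
        node *n = demo_alloc(h, sizeof *n);
        if (n != NULL) {
            n->key = key;
            n->left = n->right = NULL;
        }
        return n;
    }
    if (key < root->key)
        root->left = tree_insert(h, root->left, key);
    else if (key > root->key)
        root->right = tree_insert(h, root->right, key);
    return root;
}

int tree_contains(const node *root, int key)
{
    while (root != NULL && root->key != key)
        root = key < root->key ? root->left : root->right;
    return root != NULL;
}
```

A program would call demo_init on the mapped region, build the tree with tree_insert, and register its root under a name such as "btree" with demo_putvar. A later run that finds the region at the same address would call demo_init again (which leaves the formatted region untouched) and recover the root with demo_getvar. The stored pointers are raw addresses, which is exactly why the static binding of the file to an address range matters.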

Figure 4: The layout of a file containing a tree named "Btree"

Once such a data structure is created, it can be accessed later on by getting the base address of the tree using the getvaraddr() function. Using the base address, the whole data structure can be traversed.

VII. DISCUSSION

A. Why Not a Single Linear Address Space?

One possible design was to employ a single linear address space that is shared by all processes and data in the system. Several operating systems with 64-bit address spaces use this approach. The Opal [3] system at the University of Washington uses a single address space approach where all processes share a single flat address space. Other operating systems such as Sombrero [7], Angel [2], Nemesis [6] and Mungi [8] also use the single address space approach. Several other systems are also under active research. Although these approaches seem attractive at first glance, it is not so simple to manage long-lived data structures in a single linear address space. Some of the difficulties are given below.

Copying and deleting files: Copying and deleting a file in such a system need special treatment. Deleting a file results in complications because many other files may contain references to it, and so all those references become dangling, or invalid if another file occupies that space. Copying a file can only be done if all the pointers are relocated, and this is difficult to do with dynamic data structures.

Problem of backup of files: Reusing addresses is complicated due to backups. As long as the backups of files are available somewhere in the world, the addresses occupied by these files cannot be reused when they are removed from the address space. This is because the files must be restored at exactly the same addresses as they were before.

Protection problem: This is inherent in the system itself because all processes execute in the same address space, so one process may access the data of another process. Providing protection requires either extra hardware support, such as a Protection Lookaside Buffer (PLB) [5], or software support, such as per-process protection IDs [2, 3, 6, 8].

Relocation problem: If a shareable executable binary program is being executed by more than one process at a time, it has to be mapped at different regions in the same address space. Since the base addresses are different, the executable binary program cannot be shared.

Sharing among several systems: Since there is no uniform method to move and copy data structures without destroying their contents and referencing relationships, it is difficult to share information among several systems that each have their own single linear address space.

B. Rationale for our Design

Having multiple allocated address spaces and a per-process address space architecture removes the problems with single address space solutions described above. The following are the advantages and limitations of our design.

• By having multiple spaces, virtual space allocation becomes easier and the chances of conflicts (both address and name) are reduced.
• Access control to shared files becomes easier, as it can be done on the basis of virtual spaces rather than on the basis of individual files.
• The problem of backed-up files occupying virtual address spaces is also removed. As files allocated to one space are likely to be related, all of them can be backed up together and the allocated address space "frozen." When a file (or a group of files) is brought in from back-up, the frozen allocated address space can be used. Alternatively, if files are backed up without freezing the allocated address space and conflicts arise on restoration, a new address space can be created for the restored files.
• Sharing is usually not global. It is restricted to groups of files and/or groups of users. Having multiple virtual spaces allows this form of restricted sharing to be done easily. In fact, a lot of files may not be shared at all among users but may be created and used by the programs of one user.
• There is no need for global consensus on data location for sharing complex data structures among processes. The Space Manager takes care of storing bind addresses and managing files. Arbitrary data structures in primary and secondary storage can be shared among processes.
• The proposed mechanism can be implemented on a standard paged virtual memory architecture. It requires no special hardware support such as segmentation hardware and multi-level pointer indirection, or protection hardware, as is required in some single address space solutions.

• The problem of dangling and incorrect references on the deletion of a file is controlled by the introduction of file groups. A file can contain a pointer to another file only if both files belong to the same file group. When a file is deleted, the space allotted to it is not freed till all the files of the group are deleted. A reference to a non-existent file will result in an error, as it should, but a pointer can no longer be an incorrect reference.
• The Space Manager creates a new global allocated space and maps a file into it whenever conflicts occur. So if a file is copied, it is allocated space in a new GAS or in an existing GAS where the address range occupied by the file is free.
• Sharing among several systems is not difficult, unlike in a single address space solution, due to the presence of multiple GASes. On transferring a file, it has to be placed in a GAS in the destination system where the required address space is available. In the worst case, a new GAS can be created.
• The current design introduces a feature of shared persistent storage without affecting the existing architecture of a typical general purpose system. The implementation has added the feature to a standard Linux system. Existing files and programs do not have to be changed. The new feature is visible only to applications using it. There is no change in the file system design to accommodate bind information. Instead, separate control files have been introduced to store this information. While such control files can in rare cases cause naming conflicts (a file with the same name may already exist), this is unlikely to be a major issue. After all, existing systems have been living with "reserved" file names for years. No doubt, integrating the bind information into the file system would be a more elegant solution, but the need for compatibility with existing applications was of greater priority.
• A limitation of the current design is that, due to the presence of multiple GASes, addressing conflicts may occur if a process maps files from more than one GAS. A single address space system does not have this problem. However, the design has been made on the premise of the "locality" of sharing. A typical application is likely to use files from at most one GAS. A user may create a GAS to bind files being used by one application, and in that case that will be the only GAS being used by the application. Some GASes may have files that are shared by many users. All the shared files can be put into one GAS. If an application needs to use files in more than one shared GAS, then conflicts among these GASes have to be avoided. This can be done during the creation of such shared GASes by identifying the existing GASes that the new one should not overlap with. Due to the presence of a 64-bit address space, avoiding such overlap will not create any address space problems. Ultimately, only experience based on usage will reveal whether this limitation is a major one.
• Another limitation is that, in the worst case, a file can grow only up to the extent of the address range allocated to it at creation time (when the space after the end of the current allocation has already been allotted to another file). We feel this is not a major limitation, because pre-allocating a large address space for a file that is likely to grow is unlikely to cause extra overhead or resource constraints (given the "huge" size of a 64-bit address space). Further, "sparse" allocation of the address space in a GAS is implemented in the Space Manager to avoid the worst case scenario.

C. Comparison with another Proposal

An existing proposal, "Using Huge Address Spaces to Construct Cooperative Systems" [4], also uses a separate address space per process and a common address space for mapping shared files. In that proposal there is only one shared address space, unlike the multiple GASes proposed here. Their strategy is basically to have per-process address spaces to avoid the disadvantages of a single address space scheme in providing protection during run time and in sharing code segments. At the same time, by having one shared space, the advantages of a single address space are sought to be retained. This strategy eliminates conflicts of address spaces at load time, since the conflicts are resolved at the time of address space allocation (there being only one shared address space). However, the problems of copying and of back-ups remain in their system due to the single shared address space. They have made the unit of sharing a contiguous area called a region. A file is composed of a number of regions. This allows files to grow dynamically. However, this has resulted in a system in which the file system has had to be changed, and so existing files cannot co-exist on the same system. The arguments for our choices have already been given above. In their design, the management of the shared address space is done in the kernel, while our proposal uses a user level daemon called the Space Manager to handle the multiple address spaces. This, we feel, is a more flexible and portable design. Due to their management of the shared address space in the kernel, and due to their dealing with regions as the unit of sharing, major changes have been proposed to a standard kernel, particularly to memory management. Our proposal needs minor changes to a Linux kernel, and further, this change is limited to a small portion of the memory management code (the code for the mmap system call needs to be changed).

VIII. CONCLUSIONS

In this paper, a design for a persistent storage system, with files being bound statically to a given address range, has been proposed. The advent of 64-bit address spaces has opened up the possibility of such static mapping of files without causing a shortage of address space and without resulting in crippling conflicts. These difficulties prevented the widespread use of such persistent storage in existing 32-bit systems. A number of proposals have been made to have a single, system-wide address space across all processes in a 64-bit system. While having a single address space is attractive from the point of view of sharing and of avoiding conflicts, there are a number of difficulties with having a single address space. There have therefore been proposals to retain per-process address spaces while providing static binding of files to addresses. Our proposal is along these lines, and it provides for multiple shareable address spaces in which files are allocated space and get statically bound. Our proposal avoids problems of back-ups and of deleting and copying of such files. It makes a trade-off by making address conflict resolution at creation time optional, thus leaving open the possibility of load-time conflicts. The design requires minor changes to an existing commercial kernel (Linux has been used in our prototype), leaving all the management of the shared address spaces to a user-level daemon. The changes have no effect on existing code or data, so there is no need for any major overhaul in a system. This, we feel, is a major advantage over many of the other proposals in the literature. The trade-offs made in achieving our objectives are, we feel, justified, and the major assumption is that global sharing of data is limited to only a few applications and that most of the time there is a "locality" of sharing. What needs to be done in the future is to create applications to see how they can best make use of such a persistent storage architecture, with all the attendant attractions of making programming easier, but without compromising on the performance issues that play such an important part in current day I/O programming.

REFERENCES

[1] M. Talluri, D. Hill, and Y.A. Khalidi, "A New Page Table for 64-bit Address Spaces", in Proceedings of the Fifth ACM Symposium on Operating System Principles, SIGOPS'95, pp. 184-200, April 1995.
[2] K. Murray, T. Wilkinson, P. Osmon, A. Salisbury, T. Stiemerling and P. Kelly, "Design and Implementation of an Object-Oriented 64-bit Single Address Space Microkernel", in Proceedings of the USENIX Symposium on Microkernels and other Kernel Architectures, (San Diego), pp. 31-43, Sep. 1993.
[3] J.S. Chase, H. Levy, M. Backer-Harvey, and Ed Lazowska, "Opal: A Single Address Space System for 64-bit Architectures", in Proceedings of the Third Workshop on Workstation Operating Systems, WWOS-III, IEEE Computer Society, pp. 80-85, April 1992.
[4] S. Inohara, and T. Masuda, "Using Huge Address Spaces to Construct Cooperative Systems", in Proceedings of the International Seminar on Autonomous Decentralization Systems, ISADS'93, IEEE Computer Society, pp. 85-92, Feb. 1993. (URL: http://www.is.s.u-tokyo.ac.jp).
[5] E.J. Koldinger, J.S. Chase, and S.J. Eggers, "Architectural Support for Single Address Space Operating Systems", Technical Report 92-03-10, University of Washington, Department of Computer Science & Engineering, March 1992.
[6] D. Reed, and R. Fairbairns, "Nemesis: The Kernel Overview", Published by the University of Cambridge, May 1997. (URL: http://www.cl.cam.ac.uk).
[7] A.C. Skousen, D.S. Miller, and R.G. Feigen, "The Sombrero Operating System", Technical Report TR-96-005, Arizona State University, Department of Computer Science and Engineering, April 1996. (URL: http://www.eas.asu.edu/sasos).
[8] J. Vochteloo, S. Russell, and G. Heiser, "Capability Based Protection in the Mungi Operating System", The 17th Annual Computer Science Conference, Australian Computer Science Communications, January 1994. (URL: http://www.cs.unsw.oz.au).
[9] A. Dearle, J. Rosenberg, F. Henskens, F. Vaughan, and K. Maciunas, "An Examination of Operating System Support for Persistent Object Systems", in Proceedings of the 25th Hawaii International Conference on System Sciences, Vol 1, No. 5, pp. 779-789, 1992.
[10] D. Deller, and G. Heiser, "Linking Programs in a Single Address Space", in Proceedings of the USENIX Annual Technical Conference, pp. 283-294, June 1999. (URL: http://www.cse.unsw.edu.au/disy).