Object Storage on Berkeley DB
Casey Marshall
University of California, Santa Cruz
CMPS 229: Storage Systems

(This work was done in partial fulfillment of the requirements for the UCSC course CMPS 229: Storage Systems, Winter quarter 2007, Dr. Carlos Maltzahn.)

Abstract

We have implemented an object storage system, intended to simulate the operation of an Object Storage Device (OSD), on Berkeley DB, a simple database library distributed by Oracle Corporation (formerly Sleepycat Software) [1]. This object storage layer is used in the Ceph [2] distributed file system as an alternative to the EBOFS object storage system. Our results indicate that object storage systems built on Berkeley DB provide good performance for many small objects and competitive performance for large objects; integrating this system into Ceph has revealed a few performance problems.

1 Introduction

Ceph is an advanced, distributed, scalable file system being developed at UCSC. Ceph uses smart object storage devices managed by an object distribution and lookup layer called RADOS (reliable, autonomous, distributed object store). The current object store implementation in Ceph is an extent- and B-tree-based block management system called EBOFS [2].

Object storage systems like EBOFS, which provide fairly simple key/value mappings, turn out to be very similar to the interfaces provided by database systems like Berkeley DB. Berkeley DB [1] also provides a number of benefits over a custom solution: it is a very mature product, it has robust transaction and recovery support built in, and writing an object store implementation on top of it is simple.

Our project was to implement an object storage layer on top of Berkeley DB and to see how it compares to EBOFS in performance. Given that it took one person less than three months to write a fairly stable and efficient object store, the benefits of using an existing, stable system are quite clear. We also believe that robust, simple, and scalable storage systems built on top of database systems such as Berkeley DB are feasible.

2 Implementation

Our object store is an implementation of the C++ class ObjectStore, which Ceph uses internally for its object storage. The ObjectStore interface provides a simple set of commands that emulate the command set of the T10 object storage specification [3]. These commands fall into the following logical groups:

1. Object storage commands. These include basic operations for reading, writing, and querying objects. Objects are plain byte sequences, and are referenced by 128-bit object identifiers.

2. Collection commands. Collections are bags of object identifiers, and are referenced by 32-bit collection identifiers. Object identifiers may be added to or removed from a collection, and the contents of a collection can be queried.

3. Attributes. These are simply key/value pairs, with simple, arbitrary-length keys and byte sequences for values. Both objects and collections may have attributes assigned to them, so an attribute key is an object identifier or collection identifier paired with a simple key.

We call our implementation "OSBDB." OSBDB was, relatively speaking, simple to implement, and, ignoring the size and complexity of Berkeley DB itself, it has much less code to maintain and debug, a fact illustrated in Table 1:

Store    SLOC
ebofs    6,600
osbdb    2,161

Table 1: Lines of C++ code in the ebofs and osbdb object stores (as measured by SLOCCount [4]).

2.1 Objects

Berkeley DB has a simple application interface, providing obvious methods such as get, which takes a key argument and returns the value mapped to that key, and put, which takes key and value arguments and creates a mapping between the two. Both keys and values are byte sequences of arbitrary length, up to 4 GiB.

Because this interface closely matches the object storage interface, objects are keyed directly by the 128-bit object identifier. In Ceph, object identifiers have some amount of structure, but at the ObjectStore level we ignore this and pass the 128-bit value (that is, the raw C++ struct) as the key. Object values are just as simple, and are stored as-is; however, Berkeley DB offers no simple way to query the size of a mapped object without reading it, so in our implementation we include an additional "inode" record for each object, which has the form:

struct stored_object {
  uint32_t length;
};

These "inode" records are mapped by a 17-byte key formed by taking the 16-byte object identifier and appending a single byte 'i'.
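To make the keying scheme concrete, the following is a minimal sketch of how an object write might map onto Berkeley DB's C++ API (Db and Dbt from db_cxx.h). This is not the actual OSBDB code: the object_t struct below is a hypothetical stand-in for Ceph's 128-bit object identifier, and error handling is reduced to returning Berkeley DB's status codes.

#include <db_cxx.h>    // Berkeley DB C++ API: Db, Dbt
#include <string.h>
#include <stdint.h>

// Hypothetical stand-in for Ceph's 128-bit object identifier.
struct object_t { uint8_t bytes[16]; };

// The "inode" record described above: only the object's length.
struct stored_object { uint32_t length; };

// Store an object under its raw 16-byte identifier, plus an "inode"
// record under the 17-byte key formed by appending the byte 'i'.
int write_object(Db &db, const object_t &oid,
                 const void *data, uint32_t length)
{
  // Data record: key = 16-byte object identifier, value = raw bytes.
  Dbt key(const_cast<object_t *>(&oid), sizeof(object_t));
  Dbt value(const_cast<void *>(data), length);
  int r = db.put(NULL, &key, &value, 0);
  if (r != 0)
    return r;

  // "Inode" record: key = identifier with a trailing 'i' (17 bytes).
  uint8_t ikey[17];
  memcpy(ikey, oid.bytes, sizeof(oid.bytes));
  ikey[16] = 'i';
  stored_object inode = { length };
  Dbt inode_key(ikey, sizeof(ikey));
  Dbt inode_value(&inode, sizeof(inode));
  return db.put(NULL, &inode_key, &inode_value, 0);
}

With this layout, querying an object's size is a single get on the 17-byte inode key, without touching the (possibly large) data record; in a transactional configuration, a DbTxn handle would be passed in place of the NULL transaction arguments.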
2.2 Collections

Collections are, in principle, simple bags of object identifiers. In OSBDB, a collection is stored as a sorted array of object identifiers, represented by the structure:

struct stored_coll {
  uint32_t count;
  object_t *objects;
};

where object_t is the 128-bit object identifier type.

Insertion and deletion of members is done by performing a binary search for the insertion point or for the object identifier to be deleted, then performing a memmove up or down by 16 bytes (a short sketch of this procedure is given at the end of section 2.3). The ObjectStore interface also requires that all collections can be enumerated, so in OSBDB we keep a master list of all valid collection identifiers. This list has a form similar to a collection:

struct stored_colls {
  uint32_t count;
  coll_t *colls;
};

where coll_t is a 32-bit collection identifier. The storage of collection identifiers is equivalent to a collection: we store them as a sorted list. This master collection list is referenced by the one-byte key 'c'.

2.3 Attributes

Attributes are rather unfriendly, since they both add extra overhead to the database management (including keeping a list of attributes, so they can be removed when the object is deleted) and present a problem when dealing with variable-length keys. Our solution is to keep a list of attribute keys for each object and collection, keyed by the object or collection identifier appended with the byte 'a' (making for a 17-byte or 5-byte identifier). Both attribute lists have the form:

struct attr_list {
  uint32_t count;
  attr_key *key;
};

The attr_key type is simply a wrapper structure around a 128-byte array. This means that attribute keys are currently limited to 128 bytes, and that keys are stored padded with as many NUL bytes as needed to fill the buffer. Since the ObjectStore code only uses NUL-terminated strings for attribute keys, this solution works out well. The attribute keys are sorted lexicographically, and insertion and deletion work similarly to the collection lists.

Attributes themselves are keyed by appending the 128-byte attribute key to the object identifier or collection identifier, for referencing an object attribute or collection attribute, respectively. Attribute values are stored directly.
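The insertion path described in section 2.2 amounts to a lower-bound search over an array of 16-byte identifiers followed by a memmove. The following is a minimal sketch of that procedure, not the actual OSBDB code; it assumes the collection record has already been read into a heap-allocated stored_coll, that object identifiers compare as raw byte strings, and it omits error handling.

#include <stdlib.h>
#include <string.h>
#include <stdint.h>

// Hypothetical stand-in for Ceph's 128-bit object identifier.
struct object_t { uint8_t bytes[16]; };

// Sorted collection layout from section 2.2.
struct stored_coll {
  uint32_t count;
  object_t *objects;    // sorted array of 16-byte identifiers
};

// Compare identifiers as raw byte strings.
static int cmp_oid(const object_t &a, const object_t &b)
{
  return memcmp(a.bytes, b.bytes, sizeof(a.bytes));
}

// Insert oid into the collection, keeping the array sorted: binary
// search for the insertion point, then shift the tail up by one
// 16-byte slot with memmove.
void coll_insert(stored_coll &c, const object_t &oid)
{
  // Binary search for the first element >= oid (the lower bound).
  uint32_t lo = 0, hi = c.count;
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (cmp_oid(c.objects[mid], oid) < 0)
      lo = mid + 1;
    else
      hi = mid;
  }
  if (lo < c.count && cmp_oid(c.objects[lo], oid) == 0)
    return;                       // already a member

  // Make room, then shift the tail up by sizeof(object_t) == 16 bytes.
  c.objects = static_cast<object_t *>(
      realloc(c.objects, (c.count + 1) * sizeof(object_t)));
  memmove(&c.objects[lo + 1], &c.objects[lo],
          (c.count - lo) * sizeof(object_t));
  c.objects[lo] = oid;
  c.count++;
}

Deletion is symmetric: locate the identifier, memmove the tail down by 16 bytes, and decrement the count; the updated record is then written back to the database under its 4-byte collection key.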
2.4 Summary

To summarize briefly, keys in OSBDB fall into the following categories, which have unique lengths, or at least unique suffixes:

• Object identifiers, 16 bytes.
• Object inode identifiers, 17 bytes (final byte is always 'i').
• Collection identifiers, four bytes.
• Master collection list identifier, one byte ('c').
• Object attribute list identifiers, 17 bytes (final byte always 'a').
• Collection attribute list identifiers, five bytes (final byte always 'a').
• Object attribute keys, 272 bytes.
• Collection attribute keys, 260 bytes.

Using this partitioning of the key space, we can easily make the object name spaces independent, largely by simply using the length of the keys.

3 Object Store Performance

For an initial test, we wrote a micro-benchmark program around the ObjectStore interface that repeatedly writes out a number of uniformly sized objects, remounts the file system, then reads these objects in again in a different order. We ran this benchmark with a variety of object sizes and object counts; a rough sketch of the procedure appears at the end of this section.

The large-object read and write throughput of EBOFS, OSBDB, and OSBDB configured with the Btree database type, for different object sizes, is shown in Figure 1. Each test was run 1,024 times, writing out two objects of the same size, and the mean bandwidth was computed over all runs.

[Figure 1: two plots of write throughput (MiB/s, top) and read throughput (MiB/s, bottom) versus object size (MiB), with series for ebofs, osbdb, and osbdb-btree.]

Figure 1: Write (top) and read (bottom) throughput for a few large objects. The lighter band indicates the standard deviation away from the mean, plotted by the solid line.

This benchmark, like all others in this paper, was run on an Ubuntu Linux 6.10 system with an Intel Pentium D 2.8 GHz processor, 1 GiB of memory, and an internal 80 GiB, 7200 RPM disk on the SATA bus. EBOFS used a raw disk partition, and OSBDB used a dedicated ext2 file system that was unmounted and remounted between operations.

We see generally stable throughput for all the stores, with EBOFS outpacing OSBDB for nearly all tests. The hash database type produces some interesting spikes in the variance, at seemingly logarithmic intervals.

We also ran a similar test with increasing object count, with a fixed object size of 1,024 bytes. This test was run 128 times for each object count, and the mean 1,024-object throughput is presented in

objects (although this is not so great a concern in Ceph) and that the Berkeley DB-based object stores scale rather well for stores that contain many small objects.
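As a rough illustration of the micro-benchmark described at the start of section 3 (this is not the actual benchmark program; the file name, key format, and parameters are placeholders), the sketch below writes a number of uniformly sized objects into a Berkeley DB Btree database, closes and reopens the database in place of a remount, then reads the objects back in a shuffled order and reports the mean read bandwidth.

#include <db_cxx.h>      // Berkeley DB C++ API: Db, Dbt
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
#include <stdint.h>

int main()
{
  const uint32_t count = 1024;           // number of objects (placeholder)
  const uint32_t size  = 1024;           // object size in bytes (placeholder)
  std::vector<char> payload(size, 'x');  // uniform object contents

  // Write phase: objects are keyed here by a 32-bit counter for brevity;
  // OSBDB proper keys objects by 128-bit object identifiers.
  Db db(NULL, 0);
  db.open(NULL, "bench.db", NULL, DB_BTREE, DB_CREATE, 0644);
  for (uint32_t i = 0; i < count; i++) {
    Dbt key(&i, sizeof(i));
    Dbt val(payload.data(), size);
    db.put(NULL, &key, &val, 0);
  }
  db.close(0);

  // Stand-in for the remount: close and reopen the database file.
  Db db2(NULL, 0);
  db2.open(NULL, "bench.db", NULL, DB_BTREE, 0, 0644);

  // Read phase, in a shuffled order; only the reads are timed.
  std::vector<uint32_t> order(count);
  for (uint32_t i = 0; i < count; i++) order[i] = i;
  std::shuffle(order.begin(), order.end(), std::mt19937(42));

  auto start = std::chrono::steady_clock::now();
  for (uint32_t i = 0; i < count; i++) {
    Dbt key(&order[i], sizeof(uint32_t));
    Dbt val;
    db2.get(NULL, &key, &val, 0);
  }
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;

  double mib = double(count) * size / (1024.0 * 1024.0);
  std::printf("read %.2f MiB in %.3f s (%.2f MiB/s)\n",
              mib, elapsed.count(), mib / elapsed.count());
  db2.close(0);
  return 0;
}

Reporting a mean and standard deviation, as in Figure 1, would simply repeat this procedure over many runs and aggregate the per-run bandwidth figures.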