Object Storage on Berkeley DB¹

Casey Marshall
University of California, Santa Cruz
CMPS 229: Storage Systems

Abstract

We have implemented an object storage system, intended to simulate the operation of an Object Storage Device (OSD), on Berkeley DB, a simple database library distributed by Oracle (formerly Sleepycat Software) [1]. This object storage layer is used in the Ceph [2] distributed file system as an alternative to the EBOFS object storage system. Our results indicate that object storage systems built on Berkeley DB provide good performance for many small objects, and competitive performance for large objects; integrating this system into Ceph has revealed a few performance problems.

1 Introduction

Ceph is an advanced, distributed, scalable file system being developed at UCSC. Ceph uses smart object storage devices managed by an object distribution and lookup layer called RADOS (reliable, autonomous, distributed object store). The current object store implementation in Ceph is an extent and B-tree based block management system called EBOFS [2].

Object storage systems like EBOFS, which provide fairly simple key/value mappings, turn out to be very similar to the interfaces provided by database systems like Berkeley DB. Berkeley DB [1] also happens to provide a number of benefits over a custom solution: it is a very mature product, it has robust transaction and recovery support built in, and writing an object store implementation on top of it is simple.

Our project was to implement an object storage layer on top of Berkeley DB, and to see how it compares to EBOFS in performance. Given that it took one person less than three months to write a fairly stable and efficient object store, the benefits of using an existing, stable system are quite clear. We also believe that robust, simple, and scalable storage systems built on top of database systems such as Berkeley DB are feasible.

2 Implementation

Our object store is an implementation of the C++ class ObjectStore, which Ceph uses internally for its object storage. The ObjectStore interface provides a simple set of commands that emulate the command set of the T10 object storage specification [3]. These commands fall into the following logical groups:

1. Object storage commands. These include basic operations for reading, writing, and querying objects. Objects are plain byte sequences, and are referenced by 128-bit object identifiers.

2. Collection commands. Collections are bags of object identifiers, and are referenced by 32-bit collection identifiers. Object identifiers may be added to or removed from a collection, and the contents of a collection can be queried.

3. Attributes. These are simply key/value pairs, with simple, arbitrary-length keys and byte sequences for values. Both objects and collections may have attributes assigned to them, so an attribute key is an object identifier or collection identifier paired with a simple key.

We call our implementation "OSBDB."
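The Ceph class itself is larger and differs in detail; the following is only a rough C++ sketch of an interface with this shape, with simplified, assumed names and signatures, intended just to make the three command groups concrete.

    // Hypothetical, simplified sketch of an ObjectStore-like interface.
    // The real Ceph ObjectStore class differs in names and signatures.
    #include <stdint.h>
    #include <sys/types.h>
    #include <list>

    struct object_t { uint8_t id[16]; };   // 128-bit object identifier
    typedef uint32_t coll_t;               // 32-bit collection identifier

    class SimpleObjectStore {
    public:
      virtual ~SimpleObjectStore() {}

      // Object commands: read, write, and query plain byte sequences.
      virtual int write(object_t oid, off_t off, size_t len, const char *data) = 0;
      virtual int read(object_t oid, off_t off, size_t len, char *out) = 0;
      virtual int stat(object_t oid, size_t *size) = 0;
      virtual int remove(object_t oid) = 0;

      // Collection commands: collections are bags of object identifiers.
      virtual int create_collection(coll_t cid) = 0;
      virtual int collection_add(coll_t cid, object_t oid) = 0;
      virtual int collection_remove(coll_t cid, object_t oid) = 0;
      virtual int collection_list(coll_t cid, std::list<object_t> *out) = 0;

      // Attributes: key/value pairs attached to objects or collections.
      virtual int setattr(object_t oid, const char *name, const void *value, size_t len) = 0;
      virtual int getattr(object_t oid, const char *name, void *value, size_t len) = 0;
      virtual int collection_setattr(coll_t cid, const char *name, const void *value, size_t len) = 0;
    };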

¹ This work was done in partial fulfillment of the requirements for the UCSC course CMPS 229: Storage Systems, Winter quarter 2007, taught by Dr. Carlos Maltzahn.

OSBDB was, relatively speaking, simple to implement, and, ignoring the size and complexity of Berkeley DB itself, it has much less code to maintain and debug, a fact illustrated in table 1:

    Store    SLOC
    ebofs    6,600
    osbdb    2,161

Table 1: Lines of C++ code in the ebofs and osbdb object stores (as measured by SLOCCount [4]).

2.1 Objects

Berkeley DB has a simple application interface, providing obvious methods such as get, which takes a key argument and returns the value mapped to that key, and put, which takes key and value arguments and creates a mapping between the two. Both keys and values are byte sequences of arbitrary length up to 4GiB.

Because this interface closely matches the object storage interface, objects are keyed directly by the 128-bit object identifier. In Ceph, object identifiers have some amount of structure, but at the ObjectStore level we ignore this and pass the 128-bit value (that is, the raw C++ struct) as the key. Object values are just as simple, and are stored as-is; however, Berkeley DB offers no simple way to query the size of a mapped object without reading it, so in our implementation we include an additional "inode" record for each object, which has the form:

    struct stored_object {
        uint32_t length;
    };

These "inode" records are mapped by a 17-byte key formed by taking the 16-byte object identifier and appending a single byte 'i'.

2.2 Collections

Collections are, in principle, simple bags of object identifiers. In OSBDB, a collection is stored as a sorted array of object identifiers, represented by the structure:

    struct stored_coll {
        uint32_t count;
        object_t *objects;
    };

where object_t is the 128-bit object identifier type.

Insertion and deletion of members is done by performing a binary search for the insertion point or for the object identifier to be deleted, then performing a memmove up or down by 16 bytes. The ObjectStore interface also requires that all collections be enumerated, so in OSBDB we keep a master list of all valid collection identifiers. This list has a form similar to a collection:

    struct stored_colls {
        uint32_t count;
        coll_t *colls;
    };

where coll_t is a 32-bit collection identifier. The storage of collection identifiers is equivalent to a collection: we store them as a sorted list. This master collection list is referenced by the one-byte key 'c'.
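Concretely, membership insertion amounts to a binary search followed by a 16-byte-per-entry shift. The following is a minimal sketch of that operation; the names and the in-memory layout here are simplifications, not OSBDB's actual code, and the caller is assumed to have already grown the array to hold one more entry. Deletion is symmetric: locate the identifier with the same search and memmove the tail down by one entry.

    // Sketch (assumed, not OSBDB's actual code): insert an object identifier
    // into a sorted array of 16-byte identifiers.  'objects' must already
    // have room for count + 1 entries.
    #include <stdint.h>
    #include <string.h>

    struct object_t { uint8_t id[16]; };

    static int oid_cmp(const object_t &a, const object_t &b) {
      return memcmp(a.id, b.id, sizeof a.id);
    }

    // Returns the new count.
    uint32_t coll_insert(object_t *objects, uint32_t count, const object_t &oid) {
      // Binary search for the insertion point.
      uint32_t lo = 0, hi = count;
      while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (oid_cmp(objects[mid], oid) < 0)
          lo = mid + 1;
        else
          hi = mid;
      }
      if (lo < count && oid_cmp(objects[lo], oid) == 0)
        return count;                                // already a member
      // Shift the tail up by one 16-byte entry and insert.
      memmove(&objects[lo + 1], &objects[lo], (count - lo) * sizeof(object_t));
      objects[lo] = oid;
      return count + 1;
    }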

2.3 Attributes

Attributes are rather unfriendly, since they both add extra overhead to the database management (including keeping a list of attributes, so they can be removed when the object is deleted) and present a problem when dealing with variable-length keys. Our solution is to keep a list of attribute keys for each object and collection, keyed by the object or collection identifier appended with the byte 'a' (making for a 17-byte or 5-byte identifier). Both attribute lists have the form:

    struct attr_list {
        uint32_t count;
        attr_key *key;
    };

The attr_key type is simply a wrapper structure around a 128-byte array. This means that attribute keys are currently limited to 128 bytes, and that keys are stored padded with as many NUL bytes as needed to fill the buffer. Since the ObjectStore code only uses NUL-terminated strings for attribute keys, this solution works out well. The attribute keys are sorted lexicographically, and insertion and deletion work similarly to the collection lists.

Attributes themselves are keyed by appending the attribute key to the object identifier or collection identifier, for referencing an object attribute or collection attribute, respectively. Attribute values are stored directly.

2.4 Summary

To summarize briefly, keys in OSBDB fall into the following categories, which have unique lengths, or at least unique suffixes:

• Object identifiers, 16 bytes.

• Object inode identifiers, 17 bytes (final byte always 'i').

• Collection identifiers, four bytes.

• Master collection list identifier, one byte ('c').

• Object attribute list identifiers, 17 bytes (final byte always 'a').

• Collection attribute list identifiers, five bytes (final byte always 'a').

• Object attribute keys, 272 bytes.

• Collection attribute keys, 260 bytes.

Using this partitioning of the key space, we can easily make the object name spaces independent, largely by simply using the length of keys.
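As a small illustration of how such keys can be formed, the inode and attribute-list keys for an object are just the 16-byte identifier with a distinguishing byte appended; the helper names below are ours, not OSBDB's. Because every category has a distinct length or final byte, a key can be classified without any additional metadata.

    // Sketch (assumed helpers): building OSBDB-style keys by suffix byte.
    #include <stdint.h>
    #include <string.h>

    struct object_t { uint8_t id[16]; };   // 16-byte object identifier

    // 17-byte "inode" key: object identifier followed by 'i'.
    static void make_inode_key(const object_t &oid, uint8_t key[17]) {
      memcpy(key, oid.id, 16);
      key[16] = 'i';
    }

    // 17-byte attribute-list key: object identifier followed by 'a'.
    static void make_attr_list_key(const object_t &oid, uint8_t key[17]) {
      memcpy(key, oid.id, 16);
      key[16] = 'a';
    }

    // One-byte key for the master collection list.
    static void make_master_coll_key(uint8_t key[1]) {
      key[0] = 'c';
    }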

3 Object Store Performance

For an initial test, we wrote a micro-benchmark program around the ObjectStore interface that repeatedly writes out a number of uniformly-sized objects, remounts the file system, then reads these objects in again, in a different order. We ran this benchmark with a variety of object sizes and object counts.
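The following sketch shows the general shape of such a loop under assumed types and method names (Store, object_t, mount, umount, and so on are stand-ins, not the real harness); it reports the mean read bandwidth, and writes would be timed analogously.

    // Sketch (assumed shapes): write N uniformly sized objects, "remount"
    // the store to drop its state, then read them back in shuffled order.
    #include <stdint.h>
    #include <string.h>
    #include <sys/time.h>
    #include <algorithm>
    #include <random>
    #include <vector>

    struct object_t { uint8_t id[16]; };

    // Minimal store interface assumed for the sketch.
    class Store {
    public:
      virtual ~Store() {}
      virtual int write(const object_t &oid, const char *data, size_t len) = 0;
      virtual int read(const object_t &oid, char *out, size_t len) = 0;
      virtual int umount() = 0;
      virtual int mount() = 0;
    };

    static double now_sec() {
      struct timeval tv;
      gettimeofday(&tv, 0);
      return tv.tv_sec + tv.tv_usec / 1e6;
    }

    // Returns read bandwidth in MiB/s.
    double bench(Store *store, size_t nobjects, size_t object_size) {
      std::vector<char> buf(object_size, 'x');
      std::vector<object_t> oids(nobjects);
      for (size_t i = 0; i < nobjects; i++) {
        memset(oids[i].id, 0, sizeof oids[i].id);
        memcpy(oids[i].id, &i, sizeof i);            // synthetic identifier
        store->write(oids[i], &buf[0], object_size);
      }
      store->umount();                               // remount between phases
      store->mount();
      std::mt19937 rng(12345);
      std::shuffle(oids.begin(), oids.end(), rng);   // read back in a different order
      double t0 = now_sec();
      for (size_t i = 0; i < nobjects; i++)
        store->read(oids[i], &buf[0], object_size);
      double t1 = now_sec();
      return (nobjects * object_size) / (t1 - t0) / (1024.0 * 1024.0);
    }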

The read and write large-object throughput of EBOFS, OSBDB, and OSBDB configured with the Btree database type, given different object sizes, is shown in figure 1. Each test was run 1,024 times, writing out two objects of the same size, and the mean bandwidth was computed over all runs.

[Figure 1 charts omitted: writes (MiB/s) and reads (MiB/s) versus object size (MiB), for ebofs, osbdb, and osbdb-btree.]

Figure 1: write (top) and read (bottom) throughput for a few large objects. The lighter band indicates the standard deviation away from the mean, plotted by the solid line.

This benchmark, like all others in this paper, was run on an Ubuntu 6.10 system with an Intel Pentium D 2.8GHz processor, 1GiB of memory, and an internal 80GiB, 7200 RPM disk on the SATA bus. EBOFS used a raw disk partition, and OSBDB used a dedicated ext2 file system that was unmounted and remounted between operations.

We see generally stable throughput for all the stores, with EBOFS outpacing OSBDB for nearly all tests. The hash database type produces some interesting spikes in the variance, at seemingly logarithmic intervals.

We also ran a similar test with increasing object count, with a fixed object size of 1,024 bytes. This test was run 128 times for each object count, and the mean throughput is presented in figure 2. The decreasing throughput of the Berkeley DB hash database type may be caused by re-hashing the database as the object count grows; BDB offers control over the hash table size and fill factor, which we haven't experimented with yet.
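For reference, Berkeley DB exposes these knobs as Db::set_h_nelem (an estimate of the final number of elements in the hash table) and Db::set_h_ffactor (the desired number of items per bucket), both of which must be set before the database is opened. A sketch of opening a hash database with them set follows; the values are arbitrary examples, not settings we have tuned.

    // Sketch: opening a Berkeley DB hash database with the hash-table
    // sizing knobs set.  The values below are arbitrary, untuned examples.
    #include <db_cxx.h>

    int open_hash_db(Db **out, const char *path) {
      // DB_CXX_NO_EXCEPTIONS makes errors come back as return codes.
      Db *db = new Db(NULL, DB_CXX_NO_EXCEPTIONS);
      db->set_h_nelem(1000000);   // estimate of the final number of keys
      db->set_h_ffactor(16);      // desired keys per bucket (fill factor)
      int r = db->open(NULL, path, NULL, DB_HASH, DB_CREATE, 0644);
      if (r != 0) {
        delete db;
        return r;
      }
      *out = db;
      return 0;
    }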

Berkeley DB overall does very well in this test:

[Figure 2 charts omitted: objects written per second and objects read per second (thousands) versus 1KiB object count (thousands), for ebofs, osbdb, and osbdb-btree.]

Figure 2: write (top) and read (bottom) throughput, in thousands of objects per second, given object stores of increasing utilization. The lighter band again denotes the standard deviation away from the mean, plotted by the solid line.

One conclusion to draw from these benchmarks is that EBOFS performs slightly better for very large objects (although this is not so great a concern in Ceph) and that the Berkeley DB-based object stores scale rather well for stores that contain many small objects. Since most objects in Ceph are around 1MiB or less, OSBDB should be well suited for it.

4 Ceph Benchmarks

For these benchmarks we used the fakesyn test program included in Ceph. Fakesyn supports a number of synthetic workloads, which run in a single process. For these benchmarks, we instrumented the SyntheticClient class to record how long each operation took to run (omitting start-up and shut-down times). These benchmarks also simply used the default cache sizes for each store, and made no attempt to flush system or in-program caches during the run. Here we present comparisons for runs of some of these workloads.
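The instrumentation itself is just a pair of timestamps around the workload body; a trivial scoped timer of the kind one might use for this (an illustration, not the actual change to SyntheticClient) is sketched below.

    // Sketch (assumed): a scoped timer around a workload body, excluding
    // client start-up and shut-down.
    #include <stdio.h>
    #include <sys/time.h>

    class ScopedTimer {
      struct timeval start_;
      const char *label_;
    public:
      explicit ScopedTimer(const char *label) : label_(label) {
        gettimeofday(&start_, 0);
      }
      ~ScopedTimer() {
        struct timeval end;
        gettimeofday(&end, 0);
        double sec = (end.tv_sec - start_.tv_sec) +
                     (end.tv_usec - start_.tv_usec) / 1e6;
        fprintf(stderr, "%s: %.3f s\n", label_, sec);
      }
    };

    // Usage: wrap only the workload, not mount/unmount, e.g.
    //   { ScopedTimer t("writefile/1MiB"); /* run the workload here */ }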

writefile and readfile are workloads that write out a file in chunks, then read that file back in again. In figure 3, we show a run of writefile followed by readfile for a 1GiB file, reading and writing in 256KiB and 1MiB chunks. EBOFS has a clear advantage on reads, while both kinds of Berkeley DB store achieve good write performance²:

[Figure 3 chart omitted: time (s) for writefile/256KiB, readfile/256KiB, writefile/1MiB, and readfile/1MiB, for ebofs, osbdb, and osbdb-btree.]

Figure 3: readfile and writefile, operating on a 1GiB file in 256KiB and 1MiB chunks. These tests represent 64 runs of the workload.

This seems to contradict the results in the previous section, where we found that OSBDB does very well for small objects.

² The charts in this section are all quartile plots; the dot denotes the median, the horizontal line the mean, and the verticals the regions from the minimum to the first quartile and from the third quartile to the maximum.

The situation, in fact, degrades considerably for OSBDB if the block size is reduced to a few kibibytes:

[Figure 4 chart omitted: time (s) for writefile/1024, readfile/1024, writefile/4096, and readfile/4096, for ebofs, osbdb, and osbdb-btree.]

Figure 4: readfile and writefile, operating on a 1GiB file in 1KiB and 4KiB chunks. The above represents sixteen runs of the workload.

makedirs is a workload that creates a directory tree with files scattered about, given a directory count, number of files, and directory depth. walk walks through the entire created hierarchy; readdirs reads the directory hierarchy (walk and readdirs are, therefore, very similar tests). The makedirs results, with the directory count, file count, and depth set to 2, 16, and 4, respectively (which means a binary directory tree of depth 4 with 16 files in every directory), speak unfavorably for OSBDB:

[Figure 5 chart omitted: time (s) for makedirs, for ebofs and osbdb.]

Figure 5: results of running makedirs with a count of 2, a file count of 16, and a depth of 4. Averaged over sixteen runs.

Reading these directories again, however, shows a better picture:

[Figure 6 chart omitted: time (s) for fullwalk/2/16/4 and readdirs/2/16/4, for ebofs, osbdb, and osbdb-btree.]

Figure 6: results of running walk and readdirs on the hierarchy created with makedirs. Averaged over sixteen runs.

We are not certain what the cause of these performance differences is. One possible cause is the relative inefficiency of partial reads and writes in Berkeley DB; another may be our implementation of collections, and the rather large number of data copies that must be made to perform certain operations. Locking in OSBDB is also rather coarse-grained in this implementation, even though Berkeley DB supports shared and exclusive per-record locks.
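For context, partial reads and writes in Berkeley DB go through the DBT partial-record interface (the DB_DBT_PARTIAL flag, with a data offset and length). A rough sketch of reading a byte range of an object this way, assuming an already opened Db handle and the 16-byte identifiers described earlier, is shown below; this illustrates the mechanism and is not OSBDB's actual code.

    // Sketch (assumed usage): read 'len' bytes at offset 'off' from the
    // record keyed by a 16-byte object identifier, using Berkeley DB's
    // partial-record support.
    #include <db_cxx.h>
    #include <stdint.h>

    int read_partial(Db *db, const uint8_t oid[16], size_t off, size_t len, char *out) {
      Dbt key((void *)oid, 16);

      Dbt data;
      data.set_data(out);
      data.set_ulen(len);                    // capacity of the caller's buffer
      data.set_doff(off);                    // byte offset within the stored record
      data.set_dlen(len);                    // number of bytes wanted
      data.set_flags(DB_DBT_USERMEM | DB_DBT_PARTIAL);

      return db->get(NULL, &key, &data, 0);  // 0 on success, DB_NOTFOUND if missing
    }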

5 Related Work

Ceph already has a fairly robust and efficient object storage system called EBOFS [2]. Possibly the chief advantage OSBDB has over EBOFS is that it is simpler and easier to maintain, and provides excellent support for transactions, logging, concurrency, recovery, and "hot" backups.

Many systems have used Berkeley DB and its predecessors, going back to the original dbm systems on Berkeley Unix; it is impossible to enumerate them all here. Notable free software packages that use Berkeley DB for storage include the Cyrus mail server [5], the Subversion version control system [6], the MySQL database (up until version 5.1) [7], and the OpenLDAP directory server [8]. Many organizations have employed Berkeley DB for back-end storage in private systems.

The PVFS2 system, the underlying storage system for the pNFS extension to the NFSv4 protocol, uses Berkeley DB for file metadata storage [9]. The Panasas ActiveScale cluster management system and the Veritas Enterprise Administrator object bus are two enterprise storage systems that use Berkeley DB [10].

6 Future Work

We consider this project a success; we have shown that it is possible to write an efficient storage system very quickly using Berkeley DB. We have, however, only scratched the surface of what is possible in this area.

Examples of things to consider include the transaction and logging support in Berkeley DB, and investigating good ways to incorporate these features into a storage system. Backups and hot copies, which Berkeley DB supports, offer a useful way to manage backups and replication in a storage system.

While we have developed OSBDB in the context of Ceph, storage systems built specifically around Berkeley DB are an interesting realm of research, and such systems are useful in commercial and enterprise storage systems.

7 Conclusions

We have implemented an object storage interface using Berkeley DB. We have found that even though the code we had to write was simple and straightforward, our object storage system has good performance characteristics, even when compared against an advanced, complex system dedicated to this purpose. There are some clear performance issues remaining, but we believe that there is a simple cause underlying these problems. Robust, simple systems such as Berkeley DB, which provide fast and safe object storage, are compelling bases for object-based storage systems.

References

[1] M. A. Olson, K. Bostic, and M. Seltzer. Berkeley DB. In Proceedings of the 1999 Annual USENIX Technical Conference.

[2] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A Scalable, High-Performance Distributed File System. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06).

[3] R. O. Weber, editor. SNIA Project Specification T10/1355-D: Information Technology; SCSI Object Based Storage Device Commands (OSD).

[4] D. Wheeler. SLOCCount. http://www.dwheeler.com/sloccount/.

[5] Carnegie Mellon University. Project Cyrus. http://cyrusimap.web.cmu.edu/.

[6] CollabNet. Subversion. http://subversion.tigris.org/.

[7] MySQL AB. MySQL Database System. http://mysql.com/.

[8] OpenLDAP Foundation. OpenLDAP. http://www.openldap.org/.

[9] D. Hildebrand and P. Honeyman. Exporting Storage Systems in a Scalable Manner with pNFS. In Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2005).

[10] Oracle Corporation. Building Robust Storage Systems with Oracle Berkeley DB. White paper, October 2006.

[11] D. Gilbert et al. JFreeChart. http://www.jfree.org/jfreechart/.

OSBDB is available as a part of Ceph, which is free software. Ceph and OSBDB can be downloaded from, and you can join the project at, http://ceph.sourceforge.net/.
