CSE 598D: Storage Systems Survey
Object-Based Storage
By: Kanishk Jain
Dated: 10th May, 2007

To the best of the author's knowledge, this literature survey is the first work to analyze the overall progress of object-based storage technology following its standardization. It also illustrates how the trends of recent research fit into the broad scope of the object-based storage environment, bringing out the current status of the technology while simultaneously providing a reality check for the concept. A special feature of this survey is an attempted comparison of existing object-based file systems, mainly from a design perspective.

Abstract

Object-based storage is a new technology that provides intelligence at the storage device. The object storage device (OSD) interface has recently been standardized. The main characteristic of an OSD is “intelligent data layout”. Its advantages include scalability, security, reliability, performance and ease of management. This literature survey looks at the capabilities of object-based storage and explores how it improves data sharing, security, and device intelligence. It analyzes various aspects of object-based storage, such as design and application-based optimizations, in order to understand the advantages of this upcoming storage technology. Many of the ideas presented in this survey suggest extensions to the OSD interface to enhance performance, security, quality of service and so on. Hence the interface is still evolving.

1. Introduction

The evolution and stability of current storage interfaces (SCSI and ATA/IDE) has allowed continual advances in both storage devices and applications, without frequent changes to the standards. However, since the interface ultimately determines the functionality supported by the devices, current interfaces are holding system designers back. Storage technology has progressed to the point that a change in the device interface is needed. Object-based storage [1] is an emerging technology designed to address this problem. The OSD (object storage device) interface has recently been standardized (as the ANSI T10 Object-based Storage Devices Standard). The main characteristic of an OSD is “intelligent data layout”.

A storage object is a logical collection of bytes on a storage device, with well-known methods for access, attributes describing characteristics of the data, and security policies that prevent unauthorized access. Unlike blocks, objects are of variable size and can be used to store entire data structures, such as files, database tables, medical images, or multimedia.

Objects can be regarded as the convergence of two technologies: files and blocks. Files provide user applications with a higher-level storage abstraction that enables secure data sharing across different operating system platforms, but often at the cost of limited performance due to file server contention. Blocks offer fast, scalable access to shared data; but without a file server to authorize the I/O and maintain the metadata, this direct access comes at the cost of limited security and data sharing. Objects can provide the advantages of both files and blocks. Like blocks, objects are a primitive unit of storage that can be directly accessed on a storage device (i.e., without going through a server); this direct access offers performance advantages similar to blocks. Like files, objects are accessed using an interface that abstracts storage applications from the metadata necessary to store the object, thus making the object easily accessible across different platforms. Providing direct, file-like access to storage devices is therefore the key contribution of object-based storage.

Figure 1 shows the OSD model and the object interface it provides. [9] provides a detailed description of the advantages of the OSD model. An illustration of the hierarchy of OSD objects and attributes can be found in [2].

Figure 1: Object Storage Device model and interface

This literature survey looks at the capabilities of object-based storage and explores how it improves data sharing, security, and device intelligence. The rest of the paper is organized as follows. Section 2 outlines the advantages of object-based storage in a cluster computing environment. Section 3 analyzes object-based storage from a file systems perspective. Section 4 discusses the use of application-specific attributes and illustrates an example of their use in database storage management. Section 5 indicates trends of recent research in object-based storage. Section 6 discusses related work. Finally, section 7 concludes the paper.

2. Object-Based Storage for Cluster Computing

Instead of using proprietary, expensive supercomputers to solve the most challenging computing problems, nearly every new supercomputing system installed today is composed of thousands of low-cost Linux servers united into a cluster. Supercomputing applications, apart from having high computational complexity, need high-performance data access. Without rapid and efficient access to data, scarce computing resources sit idle. Traditional networked storage systems are simply incapable of providing the data throughput needed to keep ever-growing Linux clusters operating efficiently. Equally important, these massive datasets need to be made globally available to all processes executing across the compute cluster, to simplify application development and to ease the burden of managing data repositories. Here again, traditional networked storage systems fall short: they are incapable of scaling capacity within a single namespace and thereby increase the time and complexity of managing networked data.

To understand the need for a new approach to scalable storage, it is essential to explore the manner in which many cluster computing applications address the storage bottleneck. Linux cluster applications use a scale-out approach to parallel computing. In this model, applications employ a 'divide-and-conquer' approach, decomposing the problem to be solved into thousands of independently executed tasks. The most common decomposition approach exploits a problem's inherent data parallelism: breaking the problem into pieces by identifying the data partitions that make up the individual tasks, then distributing each task and corresponding partition to the compute nodes for processing.

The natural inclination of cluster computing developers is to deploy a networked storage solution that can be accessed by all nodes in the cluster. Such a solution greatly simplifies management of the compute jobs: all data partitions and replicas can be made available to all nodes, and hence any of the tasks can be computed on any node. Additionally, the output of these jobs can then be used directly elsewhere: in post-processing, visualization, or even as the input to the next processing task in a computational pipeline. However, neither storage area network (SAN) nor network attached storage (NAS) architectures support the aggressive concurrency and per-client throughput requirements of scalable cluster computing applications [1,4].

Figure 2 illustrates NAS being used to share files among a number of clients. The files themselves may be stored on a fast SAN. However, because the clients often suffer from queuing delays at the server, they rarely see the full performance of the SAN. The file server intermediates all requests and thus becomes the bottleneck.

Figure 2: The NAS architecture

Figure 3: The SAN architecture

Figure 3 shows a SAN file system being used to share files among a number of clients. The files themselves are stored on a fast storage area network (e.g., iSCSI) to which the clients are also attached. File server queuing delays are avoided by having the file server share metadata with the clients, who can then directly access the storage devices. However, since the devices cannot authorize I/O, the file server must assume that the clients are trusted. Hence, while the file server is removed as a bottleneck, security is a concern. Because of these limitations, organizations are forced to adopt a process in which data from a shared storage system is staged (copied) to the compute nodes, processing is performed, and results are de-staged from the nodes back to shared storage when done. In many applications, the staging setup time can be appreciable: up to several hours for large clusters.

Object-based storage clustering [4] is useful in unlocking the full potential of these Linux compute clusters, as object storage clusters have the intrinsic ability to scale linearly in capacity and performance to meet the demands of supercomputing applications (the scalability of the object-storage architecture is explained in detail in [9]). Object-based storage offers high-bandwidth parallel data access between thousands of Linux cluster nodes and a unified storage cluster over standard TCP/IP networks. It is a solution in which the storage system's scalability can be precisely matched and then scaled to the needs of the cluster computer. Together, Linux clusters and object-based storage clusters deliver commodity-like supercomputers able to keep pace with increasingly voracious applications.

In an object-based storage architecture, dynamic, self-managing data objects are stored across a cluster of intelligent object storage devices (OSDs). Data objects are fundamental containers that house both application data (including metadata describing the "mapping" of object data to physical disk drives) and an extensible set of storage attributes (application-specific attributes). User and application files are decomposed into a set of data objects and distributed across one or more OSDs. Each OSD is an easily scalable cluster element: it includes one or more disk drives, local processing to manage data flow, memory for data caching, and a high-speed network connection. Together, data objects stored on object storage devices form the core of a scalable storage system. Uniquely, each object-based cluster element has the intelligence to deliver data directly and securely to the Linux cluster. This is how highly parallel data access is achieved: Linux cluster nodes can securely read and write data objects in parallel to all object storage devices in the storage cluster. The intelligence of this system offers further benefits: all data is virtualized into a single seamless namespace for ease of manageability, and the entire system can be dynamically rebalanced to ensure ongoing, self-managed operational efficiency.

While object storage devices form the foundation of massively parallel storage architectures, they do not constitute a storage system by themselves. To deliver a complete system, a scalable file-level metadata management layer must be added. Metadata managers in an object-based system manage information such as directory membership, file ownership, and permission attributes.
Metadata managers are responsible for striping data objects (portions of files) across OSDs and ensuring file-level data integrity (for example, by computing and storing parity objects that implement RAID-5 redundancy).

Figure 4 illustrates the object-based storage security architecture. Metadata managers grant capabilities to clients; clients present these capabilities to the devices on every I/O. Secrets shared between the manager and the storage devices are used to generate a keyed hash of the capability, thereby protecting it from modification (a minimal sketch of this exchange is given at the end of this section).

Figure 4: Object Based Architecture

In an object-based design, metadata managers represent the control path between the Linux cluster and the storage cluster. This is the path through which compute cluster nodes make requests (e.g., to open or close files), are authenticated, and receive authorization credentials and a map of object locations and their host OSDs. The node then uses the map and credentials to securely access the cluster of OSDs, reading and writing file data without additional intervention by the metadata manager (this represents an important separation of the data and control paths). Furthermore, metadata managers can also be clustered, like object storage devices, for optimal performance and reliability. Many of these basic concepts are used in the design of file systems for object storage devices, which is explained in the next section.
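To make the capability mechanism above concrete, the following minimal sketch (in Python) shows how a metadata manager might sign a capability with a secret shared with the OSDs, and how an OSD would verify it on each I/O. All field and function names here are illustrative assumptions; the actual wire format and protocol are defined by the T10 OSD standard [13].

import hashlib
import hmac
import time

SHARED_SECRET = b"mgr-osd-shared-secret"   # provisioned out of band

def issue_capability(object_id, rights, ttl=3600):
    """Metadata manager: build a capability and protect it with a keyed hash."""
    cap = {"object_id": object_id, "rights": rights,
           "expires": int(time.time()) + ttl}
    msg = f"{cap['object_id']}|{cap['rights']}|{cap['expires']}".encode()
    cap["tag"] = hmac.new(SHARED_SECRET, msg, hashlib.sha256).hexdigest()
    return cap

def osd_verify(cap, requested_right):
    """OSD: recompute the keyed hash; reject tampered or expired capabilities."""
    msg = f"{cap['object_id']}|{cap['rights']}|{cap['expires']}".encode()
    tag = hmac.new(SHARED_SECRET, msg, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(tag, cap["tag"])
            and requested_right in cap["rights"]
            and cap["expires"] > time.time())

# The client presents the capability on every I/O; it cannot forge or widen
# its rights because it never sees the shared secret.
cap = issue_capability(object_id=42, rights="r")
assert osd_verify(cap, "r") and not osd_verify(cap, "w")

Because the OSD can check the keyed hash locally, no round trip to the metadata manager is needed on the data path, which is precisely the control/data separation described above.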

3. Object-Based Storage: A File Systems Perspective

The interface to object-based storage is very similar to that of a file system, and hence most basic file system principles are applicable to it. Objects can be created or deleted, read or written, and even queried for certain attributes, just like files today. File interfaces have proven easy to understand and straightforward to standardize, and are thus used to enable sharing between different platforms.

In [9], Ruwart discusses the difference between an object-based file system and current, conventional file systems. Current file system technologies that access disk drives directly are "block-based" in nature. These file systems are responsible for the management of all available disk blocks on the disk storage devices they manage. The "file system manager" is the program, running on a computer system, that manages all the data structures on a disk storage device that make up a "file system". The file system manager performs file creation, data block allocation, tracking of which files occupy which data blocks, control of access to these files, file deletion, and management of the list of free or unused data blocks. In performing these functions the file system manager examines and manipulates on-disk data structures such as information nodes (inodes) and directory trees. The file system manager manages both basic types of data: metadata and user data. It therefore understands the structure of the file system, but not the contents of the user data contained in the file system. Also, from the point of view of the file system manager, a disk storage device is simply a sequential set of disk blocks, into which all the metadata and user data are mapped. From the point of view of the storage device, it only knows how to access blocks; it has no concept of the structure of these blocks as it relates to the file system, or of the data contained within the blocks.

The object-based storage model in effect splits the file system into two components, a user component and a storage component, by moving the management of the individual blocks (i.e. the storage component) to the devices themselves. The file system manager (i.e. the user component) then only needs to manage objects, a far more manageable problem. The fact that a disk device has blocks is completely hidden from anything outside the disk drive itself. In fact, it does not even have to be a "disk" drive: it could be a solid-state device, a MEMS device, or a quantum crystal device. It no longer matters to the file system manager, as long as the device can store and retrieve "objects". For example, a file system manager in such a scenario might only be required to manage 500,000 objects; the fact that they take up the equivalent of 30 trillion 512-byte blocks is no longer directly relevant.

The file system design principles of different object-based storage systems are analyzed below.
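First, as a rough illustration of the file-like interface just described, the following Python toy sketch models an OSD exposing create/read/write/get-attribute operations over a flat object namespace, with all block management hidden inside the device. The class and method names are illustrative assumptions, not the T10 command set.

class ObjectStorageDevice:
    """Toy OSD: clients see objects and attributes, never blocks."""

    def __init__(self):
        self._objects = {}      # flat namespace: object_id -> bytearray
        self._attrs = {}        # object_id -> {attribute name: value}
        self._next_id = 1

    def create(self, attrs=None):
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = bytearray()
        self._attrs[oid] = dict(attrs or {})
        return oid

    def write(self, oid, offset, data):
        obj = self._objects[oid]
        if len(obj) < offset + len(data):
            obj.extend(b"\x00" * (offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data   # block placement stays internal

    def read(self, oid, offset, length):
        return bytes(self._objects[oid][offset:offset + length])

    def get_attr(self, oid, name):
        return self._attrs[oid].get(name)

osd = ObjectStorageDevice()
oid = osd.create(attrs={"expected_access": "sequential"})
osd.write(oid, 0, b"hello object world")
assert osd.read(oid, 6, 6) == b"object"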

3.1 General-purpose file systems and object storage devices

Much research has gone into hierarchy management, scalability, and availability of distributed file systems in projects such as AFS, GPFS, and GFS, but relatively little research has been aimed toward improving the performance of the storage manager (i.e. the storage component of a file system). Because modern distributed file systems may employ thousands of storage devices, even a small inefficiency in the storage manager could result in a significant loss of performance in the overall storage system. In practice, general-purpose file systems are often used as the storage manager. For example, Lustre [described below] uses Linux file systems as its object storage targets (OSTs). Since the workload offered to OSDs may be quite different from that of general-purpose file systems, a better storage manager can be built by matching its characteristics to the workload.

File systems such as Ext2 and Ext3 are optimized for general-purpose Unix environments in which small files dominate and file sizes vary significantly. They have several disadvantages that limit their effectiveness in large object-based storage systems. Ext2 caches metadata updates in memory for better performance. Although it flushes the metadata back to disk periodically, it cannot provide the high reliability we require. Both Ext3 and XFS employ write-ahead logs to record metadata changes, but the lazy log write policy used by both of them can still lose important metadata (and therefore data) in some situations. These general-purpose file systems trade off reliability for better performance. If we force them to synchronously update object data and metadata for better reliability, their performance degrades significantly. Many general-purpose file systems such as Ext2 and Ext3 use flat directories in a tree-like hierarchy, which results in relatively poor searching performance for directories of more than a thousand objects. XFS uses B+-trees to address this problem. Some object-based file systems such as OBFS [described below] use hash tables to obtain very high performance directory operations on the flat object namespace.

In an object-based storage system, as in many others, RAID-style striping with parity and/or replication is used to achieve high performance, reliability, availability, and scalability. Unlike RAID, the devices are semi-autonomous, internally managing all allocation and scheduling details for the storage they contain. The devices themselves may use RAID internally to achieve high performance. In this architecture, each stripe unit is stored in a single object. Thus, the maximum size of the objects is the stripe unit size of the distributed file system, and most of the objects will be this size (see the sketch below). At the OSD level, objects typically have no logical relationship, presenting a flat name space. As a result, general-purpose file systems, which are usually optimized for workloads exhibiting relatively small variable-sized files, relatively small hierarchical directories, and some degree of locality, do not perform particularly well under this workload.
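A minimal sketch of this striping scheme, assuming a fixed stripe-unit size and naive round-robin placement (real systems use more sophisticated placement functions; see Section 5.3). Each stripe unit of a file becomes one object, so most objects are exactly one stripe unit in size; the constants are assumptions.

STRIPE_UNIT = 64 * 1024   # bytes per object (assumed stripe unit size)
NUM_OSDS = 8              # assumed cluster size

def locate(file_id, offset):
    """Map a byte offset in a file to (object name, OSD, offset in object)."""
    stripe_index = offset // STRIPE_UNIT
    object_name = (file_id, stripe_index)   # one object per stripe unit
    osd = stripe_index % NUM_OSDS           # naive round-robin placement
    return object_name, osd, offset % STRIPE_UNIT

# A 1 MB file decomposes into 16 stripe-unit objects spread over 8 OSDs.
assert locate(file_id=7, offset=0) == ((7, 0), 0, 0)
assert locate(file_id=7, offset=130 * 1024) == ((7, 2), 2, 2 * 1024)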

3.2 Lustre File System

The Lustre file system [7] is an object-based, open-source, high-performance file system from Cluster File Systems, Inc., designed to improve performance, availability, and scalability in a distributed, network-oriented computing environment. Such an environment requires high-performance, network-aware file systems that can satisfy both the requirements of individual systems and the data sharing requirements of workgroups and clusters of cooperating systems. Lustre provides high I/O throughput in clusters and shared-data environments, independence from the location of data on the physical storage, protection from single points of failure, and fast recovery from cluster reconfiguration and server or network outages.

Lustre provides a unique combination of the advantages of journaling and distributed file systems. Some of the important design features of Lustre are as follows:

- Decoupling of computation and storage resources: Like many other distributed file systems, Lustre decouples computational and storage resources. Lustre runs on commodity hardware, using object-based disks for storage and metadata servers for storing file system metadata. This design provides a substantially more efficient division of labor between computing and storage resources. Replicated, failover metadata servers (MDSs) maintain a transactional record of high-level file and file system changes. Distributed object storage targets (OSTs) are responsible for actual file system I/O and for interfacing with storage devices. This division of labor and responsibility leads to a truly scalable file system and more reliable recoverability from failure conditions. The Lustre system provides several abstractions designed to improve performance and scalability. At the file system level, Lustre treats files as objects that are located through metadata servers. Metadata servers support all file system namespace operations, directing actual file I/O requests to object storage targets (OSTs) [8], which manage the storage that is physically located on underlying object-based storage disks. Metadata servers are only updated when namespace changes to a file are required. Divorcing file system metadata operations from actual file data operations improves immediate performance and also improves long-term aspects of the file system such as recoverability and availability, since using distributed object storage targets and providing a failover metadata server eliminates any one OSD as a single point of failure.

- Use of underlying journaling file systems: Lustre leverages the journaling file systems provided by Linux to enable persistent state recovery, providing resiliency and recoverability from failed OSTs. Many modern file systems use metadata journaling to maximize file system consistency. In Lustre, metadata servers keep a transactional record of file system metadata changes and cluster status, and support failover so that the hardware and network outages that affect one metadata server do not affect the operation of the file system itself.

- Security: Lustre provides security in the form of authentication, authorization and privacy by leveraging existing OSD security mechanisms.

- Use of open standards: Lustre uses open, flexible standards that make it easy to integrate new and emerging network storage technologies.

Lustre’s flexibility, reliable and highly available design and inherent scalability make it well suited for use as a cluster file system.

Some of the disadvantages of Lustre are:

- Use of object storage targets: In Lustre, object storage targets (OSTs) are separate from the object storage devices (OSDs); OSTs are responsible for allocating objects to the OSDs. In Ceph (and OBFS) [described below], the intelligence of the OSTs is assumed to be present on the OSD itself. Lustre therefore does not use device intelligence to manage storage layout, and hence cannot use application-specific attributes to optimize layout at the devices themselves [2].

- Metadata resides on the metadata server itself: In OBFS [described below], the metadata is packed with the data blocks on the disk itself, allowing very low overhead metadata updates. Lustre thus incurs an extra overhead because metadata is kept on the metadata server.

- Limited by existing security mechanisms: The basic security approach implemented by Lustre may not scale to millions of clients [6].

3.3 Panasas File System (PanFS)

The Panasas ActiveScale Storage Cluster [11] is a product of Panasas Inc. The cluster has five major components:

- The object, the primary component, which contains the data and enough additional information to allow the data to be autonomous and self-managing.

- The object-based storage device (OSD), a more intelligent evolution of today’s disk drive that can lay out, manage, and serve objects.

- The Panasas File System (PanFS) client module, which integrates into the client, accepting POSIX file system commands and data from the operating system, addressing the OSDs directly, and striping the objects across multiple OSDs.

- The PanFS metadata server (MDS), which intermediates amongst multiple clients in the environment, allowing them to share data while maintaining cache consistency on all nodes.

- The Gigabit Ethernet network fabric that ties the clients to the OSDs and metadata servers.

Panasas is very similar to Lustre in its design. Like Lustre, it fails to delegate responsibility to the OSDs, and it has limited support for efficient distributed metadata management, limiting its scalability and performance.

3.4 Ceph File System

The Ceph file system [5,6] is another reliable, high-performance, object-based distributed file system with excellent scalability. Ceph also uses network-attached object storage devices to hold file data for applications ranging from workstation-style individual file access to coordinated access for high-performance parallel applications.

Some of the important design features of Ceph are as follows:

- Distribution and replication of a file across a sequence of objects on many OSDs: Like Lustre (and Panasas), Ceph distributes and replicates a file across a sequence of objects on many OSDs. Traditional NASD techniques would bottleneck at the OSD if multiple hosts accessed the same file.

- Separation of metadata and data paths: As in Lustre (and Panasas), separate metadata servers (MDSs) manage the directory hierarchy, permissions and file-to-object mapping.

- Partitioning the directory tree: To efficiently balance load, the MDSs partition the directory tree across the cluster. A client guesses which metadata server is responsible for a file and contacts that server to open the file; that MDS forwards the request to the correct MDS if necessary. The responsible MDS replies with a file handle and information about which MDS manages each component along the full pathname. The client may use this information to improve future guesses about which MDS to contact for a particular file (a toy model of this guessing step is sketched after this list).

- Use of a specialized mapping algorithm: The file handle returned by the metadata server describes which objects on which OSDs contain the file data. A special algorithm, CRUSH [described in a later section], maps a sequence index to the OSD holding the object at that position in the sequence, distributing the objects in a uniform way. Unlike a simple hash function modulo the number of available OSDs, the CRUSH algorithm requires Ceph to redistribute only a small portion of the file data when adding or removing OSDs from the system. The CRUSH algorithm also allows Ceph to replicate files in different patterns rather than replicate an entire OSD. Thus, when one OSD fails, multiple OSDs participate in its restoration. This narrows Ceph’s window of vulnerability to a second failure that could cause irrecoverable data loss.

- Limit on object size and use of regions: Ceph limits objects to a maximum size (e.g., 1 MB), so files are a sequence of bytes broken into chunks on the maximum object size boundary. Since only the MDSs hold the directory tree, OSDs do not have directory information to suggest layout hints for file data. Instead, the OSDs organize objects into small and large object regions, using small block sizes (e.g., 4 KB or 8 KB) for small objects and large block sizes (e.g., 50-100% of the maximum object size) for large objects. This layout guarantees that bulk reads and writes for a large object are likely to be contiguous on disk.

- Use of a more advanced security mechanism: Security issues are addressed by the OSD standard. However, a naive implementation using the basic security approach may not scale to hundreds of millions of pairs. Hence, in [5] Olson and Miller suggest a scalable security approach for use with the Ceph file system.

- POSIX interface: Ceph implements the standard POSIX interface, making it easy to integrate with existing applications.
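The toy model promised above: a client-side first guess made by hashing the path, corrected and cached once the responsible MDS answers. The hash choice, the cache, and the forwarding shortcut are all assumptions for illustration; Ceph's actual mechanism (dynamic subtree partitioning) is described in [6].

import hashlib

NUM_MDS = 4
authority = {}        # path -> MDS index, the real partitioning (server side)
client_cache = {}     # path -> MDS index learned from earlier replies

def guess_mds(path):
    """Client: use cached knowledge if available, else hash the path."""
    if path in client_cache:
        return client_cache[path]
    return hashlib.md5(path.encode()).digest()[0] % NUM_MDS

def open_file(path):
    """Contact the guessed MDS; it forwards to the responsible one if needed."""
    guessed = guess_mds(path)
    responsible = authority.get(path, guessed)   # wrong guess -> one forward
    client_cache[path] = responsible             # improve future guesses
    return responsible

authority["/data/results.bin"] = 3
open_file("/data/results.bin")                   # may be forwarded once
assert guess_mds("/data/results.bin") == 3       # later guesses are exact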

3.5 OBFS: Object-Based File System

Many of the design principles of Ceph are drawn from a basic, conceptual object-based file system (OBFS), suggested by Wang et al. in [3]. The design considerations of this OBFS are:

- Workload differences: By distributing objects across many devices, these systems have the potential to provide high capacity, throughput, reliability, availability and scalability. Because files will generally consist of many objects and objects will be distributed across many OSDs, there will be little locality of reference within each OSD. The workload presented to the OSDs in this system will thus be quite different from that of general-purpose file systems. The basic idea of OBFS is to optimize disk layout based on available knowledge of the workload. This knowledge could be obtained through application-specific attributes [2] [discussed in a later section of the survey].

- Disk layout policy: OBFS introduced the concept of regions later used in Ceph. OBFS uses two block sizes: small and large. Small objects are kept with other small objects in small-object regions, while large objects are kept with other large objects in large-object regions. This maintains good disk utilization in addition to increasing throughput, and leads to relatively little fragmentation as the file system ages [3] (see the sketch after this list).

- Flat name space: On the OSD there is a complete lack of information about relationships between objects. Hence a flat name space is used to manage objects. Since there is no relationship between objects, efficient searching methods are required (in a hierarchical name space, by contrast, data lookup is implemented by following the path associated with the object to the destination directory). A hash table called the Object Lookup Table is used to increase searching efficiency.

- No caching at the OSDs: Since there is no locality to be leveraged, OBFS does not cache writes on the OSD. Hence all writes to the OSD are made persistent and thus do not need to be made synchronous explicitly.

- Object metadata on the disk: Object metadata is referred to as an onode and is used to track the status of each object. The metadata is packed with the data blocks on the disk itself, allowing very low overhead metadata updates. This implies that the role of a metadata server is only to provide security capabilities for authenticated access.
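A minimal sketch of the two ideas above, region-based block sizing plus a hash-table Object Lookup Table, assuming 4 KB small blocks, 512 KB large blocks, and a 64 KB small/large threshold (all invented constants; the real OBFS on-disk layout is described in [3]):

SMALL_BLOCK = 4 * 1024          # assumed block size in small-object regions
LARGE_BLOCK = 512 * 1024        # assumed block size in large-object regions
SMALL_OBJECT_LIMIT = 64 * 1024  # assumed threshold between the two regions

object_lookup_table = {}        # flat namespace: object_id -> (region, blocks)

def allocate(object_id, size):
    """Place an object in a small- or large-block region based on its size."""
    if size <= SMALL_OBJECT_LIMIT:
        region, block = "small", SMALL_BLOCK
    else:
        region, block = "large", LARGE_BLOCK
    blocks_needed = -(-size // block)            # ceiling division
    object_lookup_table[object_id] = (region, blocks_needed)
    return region, blocks_needed

assert allocate(1, 10 * 1024) == ("small", 3)    # 10 KB -> 3 small blocks
assert allocate(2, 900 * 1024) == ("large", 2)   # 900 KB -> 2 large blocks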

3.6 Performance Results

A user-level implementation of OBFS [3] outperforms Linux Ext2 and Ext3 by a factor of two or three. While OBFS is 1/25 the size of XFS, it provides only slightly lower read performance and 10%-40% higher write performance. This can be seen as an indication of the potential of object-based file systems, in general, compared to conventional file systems.

4. Application-Specific Optimization: Example of Database Storage Management

Object attributes improve data sharing by allowing storage applications to share a common set of information describing the data (e.g., access times). They are also the key to giving storage devices an awareness of how objects are being accessed, so that the device can use this information to optimize disk layout for the specific application. This section illustrates this advantage with a specific example: database storage management, for which an object-based storage solution has been suggested in [2].

Storage performance is of utmost importance for database applications, and is problematic because database software often has very little detailed information about the storage subsystem beyond a few crude parameters and basic rules of thumb. The standard interfaces to the storage subsystem (e.g., SCSI and IDE) virtualize storage as a simple linear array of fixed-size logical blocks. Such virtualization is important for reasons of compatibility and ease of implementation, but it can hide useful information that is critical for storage-intensive workloads like databases. Previous research took the view that a storage device can provide relevant characteristics to applications (by letting applications query disk parameters) to allow for optimized I/O access. In [2], however, Schlosser and Iren suggest an alternative approach: rather than have the storage subsystem inform the application of its capabilities, the application (a database system in this case) should communicate semantic information (e.g., the schema of a relation) and other quality of service requirements to the storage subsystem, allowing it to make data allocation decisions on behalf of the application. The main advantage of such an approach is that all of the requisite device-specific information is known to the storage subsystem, which is thus better equipped to manage low-level storage tasks. Their work is a first step towards application-aware storage management, considering low-level data placement by object-based storage systems.

Object attributes can contain information about the expected behavior of an object, such as its expected read/write ratio, access pattern (sequential vs. random), or expected size, dimension, and content. Having access to such attributes enables the device to better organize and serve the data to applications, including database applications. By moving this storage component of a DBMS (or file system) to the storage devices, object-based storage removes the biggest obstacle to data sharing: since metadata is offloaded to the storage device, the dependency between the metadata and the storage system/application is removed. This assists data sharing between different storage applications (i.e. cross-platform data sharing). It also improves the scalability of clusters [as discussed in previous sections], since hosts no longer need to coordinate metadata updates.

The shortcomings of existing database storage models stem from the fact that the common storage interface is inherently linear, requiring serialization of a relation along one axis or the other. Using OSD, a DBMS can inform the storage subsystem of the geometry of a relation, thereby passing responsibility for low-level data layout to the storage device, where the requisite disk parameters reside.
Once data placement tasks are handled by the storage subsystem, the higher-level storage functions must be able to access the data in a clean way, independent of low-level placement. For example, the buffer pool manager should be able to specify the data it requires from the storage subsystem in terms of a relation's schema, rather than in terms of block or byte addresses on disk. The approach used in [2] is to store an entire table in a single object, with the schema of the relation expressed through attributes assigned to the object (a toy version of this idea is sketched below). The work suggests extensions to the OSD interface to allow further optimization of database storage management, and considers only relational database systems as a starting point. However, apart from detailing the advantages of such an approach, the authors provide no experimental results and call their work a “first look on using OSD to improve database systems”.
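A toy version of the idea: a relation is stored in one object, its schema and expected access pattern are attached as attributes, and a hypothetical device consults them to choose a layout. The attribute names and the layout policy are invented for illustration and are not part of [2] or of the OSD standard.

class ToyOSD:
    """Minimal attribute-carrying OSD stand-in (see the Section 3 sketch)."""
    def __init__(self):
        self._attrs, self._next_id = {}, 1
    def create(self, attrs):
        oid = self._next_id
        self._next_id += 1
        self._attrs[oid] = dict(attrs)
        return oid
    def get_attr(self, oid, name):
        return self._attrs[oid].get(name)

def create_table_object(osd, name, schema, access_pattern):
    """DBMS side: store a whole relation in one object, described by attributes."""
    return osd.create(attrs={
        "table_name": name,
        "schema": schema,                 # e.g., [("id", 8), ("amount", 8)]
        "row_size": sum(width for _, width in schema),
        "expected_access": access_pattern,
    })

def device_choose_layout(osd, oid):
    """Device side: pick row-major vs column-major from the declared workload."""
    if osd.get_attr(oid, "expected_access") == "column_scan":
        return "column-major"             # favor sequential scans of one column
    return "row-major"                    # favor whole-row point lookups

osd = ToyOSD()
oid = create_table_object(osd, "orders",
                          schema=[("id", 8), ("amount", 8), ("note", 84)],
                          access_pattern="column_scan")
assert device_choose_layout(osd, oid) == "column-major"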

5. Trends of Recent Research

5.1 Security

Security is one of the key advantages of object-based storage compared to block-based storage. Recent research has been directed both towards improving the security mechanisms of the object-based storage interface itself and towards using the object-based storage interface to provide security in other systems.

In [13], Factor et al. present the requirements, the design tradeoffs, and the final security protocol as defined in the ANSI T10 Object-based Storage Devices (OSD) Standard. The resulting protocol is based on a secure capability-based model, enabling fine-grained access control that protects both the entire storage device and individual objects from unauthorized access. The protocol defines three methods of security based on the applications' requirements. Furthermore, the protocol's key management algorithm allows keys to be changed quickly, without disrupting normal operations. The authors also suggest future extensions to the interface, including data encryption and access control on sections of storage objects. Some of these ideas have been furthered by Olson and Miller in [5], who suggest a more scalable security mechanism, which is used by the Ceph file system.

A prominent example of the application of object-based storage to improve the security of other systems is the work by Zhang and Wang in [14] on object-storage-based intrusion detection. Intrusion detection is a much-researched area of computer security. Storage-based intrusion detection systems (IDS) can be valuable tools for monitoring for intrusions on a host computer. However, because traditional storage devices work at the block level while intrusions happen at the file level, this gap has to be bridged by the detection software, which is a hard and time-consuming task. To address this problem, Zhang and Wang present a novel design for an IDS based on object-based storage devices, and analyze how the features of OSD can be used for intrusion detection and for responding to violations. They also suggest enhancements to the existing OSD interface. While OSD-based intrusion detection is more straightforward to implement than detection on block-level storage devices, their testing results show that it does incur extra overhead.

5.2 Quality of Service

Quality of service (QoS) is crucial for certain applications, such as multimedia. QoS guarantees for applications running on multiple clients accessing shared storage devices are becoming more important than they were when every application had essentially a direct connection to the storage device. In [12], Lu et al. consider QoS provisioning for OSD-based systems. They analyze the QoS requirements of applications on OSD clients and, based on their observations, propose a three-level QoS specification consisting of an object level, an operation level and a session level. They go on to suggest extensions to the existing OSD interface and iSCSI protocol to support this specification, and propose future work on resource allocation and scheduling in an object storage device to enforce these QoS requirements.
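One plausible shape for such a three-level specification, sketched as Python data structures. The field names, units, and grouping are assumptions made here for illustration, not the actual proposal in [12]:

from dataclasses import dataclass

@dataclass
class ObjectQoS:            # object level: per-object storage guarantees
    object_id: int
    min_bandwidth_mbps: float
    max_latency_ms: float

@dataclass
class OperationQoS:         # operation level: per-request requirements
    deadline_ms: float
    priority: int

@dataclass
class SessionQoS:           # session level: aggregate guarantees per client
    client_id: int
    reserved_iops: int

# A multimedia client might ask for a streaming reservation like this:
spec = (SessionQoS(client_id=9, reserved_iops=500),
        ObjectQoS(object_id=42, min_bandwidth_mbps=8.0, max_latency_ms=40.0),
        OperationQoS(deadline_ms=40.0, priority=1))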

5.3 Object Placement

Placement of objects in an object-based storage system (i.e. the distribution of objects across different OSDs) is key to distributing data and workload evenly, in order to efficiently utilize available resources and maximize system performance, while facilitating dynamic system growth and the management of hardware failures.

Dividing a large and complicated storage system into a reasonable number of domains facilitates management of the system. This partitioning is the main concept used by Zeng et al. in [16] as they discuss the principles of a load balancing policy for object storage devices. They also suggest that an object storage system can initiate load balancing policy at three levels: the metadata level (using global information), the object storage controller level (using local information), and eventually the level of the storage object itself.

Qin and Feng, in [17], suggest that instead of the traditional approach of studying the two operations of data replication and migration separately, they should be studied together. They present an adaptive load balancing algorithm which combines replication and migration in a single uniform model. To evaluate the load of an object-based storage device in a timely and accurate manner, they use a hybrid load metric as a cost function. To handle the online problem with a changing workload, the algorithm employs an adaptive mechanism to keep track of the characteristics of workloads. Without assuming any a priori knowledge of the workload, the algorithm records the access history of each object and learns the traffic intensity for each object. They compare their algorithm with three other algorithms (one without load balancing, one with only object replication, and one with only object migration). Their simulation results show that the algorithm with both replication and migration outperforms the other three. Such an algorithm can dramatically reduce the system response time in situations where the load imbalance is caused by uneven object distribution or by a large number of accesses to hot objects.

Both of the above concepts, partitioning and a unified model of data replication and migration, are addressed by Weil et al. in [15] as they discuss CRUSH (Controlled Replication Under Scalable Hashing), a scalable pseudorandom data distribution function designed for distributed object-based storage systems that efficiently maps data objects to storage devices without relying on a central directory. Because large systems are inherently dynamic, CRUSH is designed to facilitate the addition and removal of storage while minimizing unnecessary data movement. The algorithm accommodates a wide variety of data replication and reliability mechanisms and distributes data in terms of user-defined policies that enforce separation of replicas across failure domains. This algorithm is used by the Ceph file system [discussed in a previous section]. Ceph must distribute petabytes of data among an evolving cluster of thousands of storage devices such that device storage and bandwidth resources are effectively utilized. In order to avoid imbalance (e.g., recently deployed devices mostly idle or empty) or load asymmetries (e.g., new, hot data on new devices only), Ceph adopts a strategy that distributes new data randomly, migrates a random sub-sample of existing data to new devices, and uniformly redistributes data from removed devices.
This stochastic approach is robust in that it performs equally well under any potential workload. Ceph first maps objects into placement groups (PGs) using a simple hash function, with an adjustable bit mask to control the number of PGs. The value chosen gives each OSD on the order of 100 PGs, balancing the variance in OSD utilizations against the amount of replication-related metadata maintained by each OSD. Placement groups are then assigned to OSDs using CRUSH, which efficiently maps each PG to an ordered list of OSDs upon which to store object replicas (a simplified stand-in for this two-step mapping is sketched below). This differs from conventional approaches (including other object-based file systems) in that data placement does not rely on any block or object list metadata. To locate any object, CRUSH requires only the placement group and an OSD cluster map: a compact, hierarchical description of the devices comprising the storage cluster. This approach has two key advantages: first, it is completely distributed, such that any party (client, OSD, or MDS) can independently calculate the location of any object; and second, the map is infrequently updated, virtually eliminating any exchange of distribution-related metadata. In doing so, CRUSH simultaneously solves both the data distribution problem (“where should the data be stored?”) and the data location problem (“where was the data stored?”). By design, small changes to the storage cluster have little impact on existing PG mappings, minimizing data migration due to device failures or cluster expansion.

The cluster map hierarchy is structured to align with the cluster's physical or logical composition and potential sources of failure. For instance, one might form a four-level hierarchy for an installation consisting of shelves full of OSDs, rack cabinets full of shelves, and rows of cabinets. Each OSD also has a weight value to control the relative amount of data it is assigned. CRUSH maps PGs onto OSDs based on placement rules, which define the level of replication and any constraints on placement. For example, one might replicate each PG on three OSDs, all situated in the same row (to limit inter-row replication traffic) but separated into different cabinets (to minimize exposure to a power circuit or edge switch failure). The cluster map also includes a list of down or inactive devices and an epoch number, which is incremented each time the map changes. All OSD requests are tagged with the client's map epoch, so that all parties can agree on the current distribution of data. Incremental map updates are shared between cooperating OSDs, and piggyback on OSD replies if the client's map is out of date.

CRUSH is an efficient algorithm, both in terms of computation and of the required metadata. Mapping calculations have O(log n) running time, requiring only tens of microseconds to execute with thousands of devices.
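The two-step mapping can be sketched as follows. This is not the CRUSH algorithm itself (which walks a weighted cluster-map hierarchy under placement rules); it substitutes a simple rendezvous-hashing stand-in for the PG-to-OSD step, purely to show the object -> PG -> OSD-list structure and its independence from any central directory. The constants are assumptions.

import hashlib

NUM_PGS = 256            # power of two so a bit mask works (assumed)
OSDS = list(range(12))   # toy cluster map: flat list of healthy OSD ids
REPLICAS = 3

def h(*parts):
    """Deterministic pseudorandom integer from the given parts."""
    data = "/".join(map(str, parts)).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def object_to_pg(object_name):
    """Step 1: a simple hash with a bit mask controls the number of PGs."""
    return h(object_name) & (NUM_PGS - 1)

def pg_to_osds(pg):
    """Step 2 (stand-in for CRUSH): rank OSDs by a pseudorandom score.
    Any party holding the cluster map computes the same ordered replica list."""
    return sorted(OSDS, key=lambda osd: h(pg, osd), reverse=True)[:REPLICAS]

pg = object_to_pg("file7.chunk2")
replicas = pg_to_osds(pg)
assert len(set(replicas)) == REPLICAS   # three distinct OSDs, no directory lookup

Like CRUSH, a rendezvous-style mapping moves only a small fraction of PGs when OSDs are added or removed; CRUSH additionally handles device weights and failure-domain placement rules, which this sketch omits.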

6. Related Work: Object-Based Storage versus Active Disks

Object-based storage can be broadly categorized under storage intelligence. The object-based storage interface can also be easily extended with application-specific methods for manipulating data within an object, a technology referred to as active disks. An active disk storage device combines on-drive processing and memory with the ability to download software, allowing disks to execute application-level functions directly at the device. Moving portions of an application's processing to a storage device significantly reduces data traffic and leverages the parallelism already present in large systems, dramatically reducing the execution time for many basic data mining tasks. A comparison of active disks and object-based storage with respect to database storage management has been made by Pramod in [10] and is summarized in Table 1.

Table 1: Object Based Storage versus Active Disks for database storage management

              Layout awareness   Processing   Flexibility   User        Scalability
              on disk            on disk                    interface
Active Disk   x                  √            somewhat      √           √
OSD + DB      √                  √            √             √           √

According to Ruwart in [9], the object-based storage model is intended to be used with the concept of active disks. It can be extended by scaling the processing power of an OSD to meet the requirements of the functions an active disk is expected to perform. The object-based storage concept is thus more generic than the active disk concept, and can be seen as an "enabling technology" for active storage devices.

7. Conclusions

Object-based storage provides intelligence at the storage device, in the form of intelligent data layout. Its advantages include scalability, security, reliability, performance and ease of management. The characteristics of object-based storage make it an excellent choice for use in large cluster systems, for both performance and security reasons. However, the file system design principles of object storage devices differ from those of conventional file systems because of differences in the workload. Object-based storage devices can optimize the data layout for a specific application by making use of application-specific object attributes. Some of the recent trends of research are in the areas of security, quality of service and object placement. However, most of the existing work on object-based storage devices is still in a nascent stage: some of it is entirely conceptual, while some results are based on simulations. Many of the ideas presented in this survey suggest extensions to the OSD interface to enhance performance (for example, in the case of database storage management), security, quality of service and so on. Hence the interface, though recently standardized, is still evolving.

8. References

[1] M. Mesnier, G. Ganger and E. Riedel. Object-Based Storage. In IEEE Communications Magazine, August 2003.
[2] S. Schlosser and S. Iren. Database Storage Management with Object-based Storage Devices. In Proceedings of the First International Workshop on Data Management on New Hardware (DaMoN 2005), June 2005, Baltimore, MD.
[3] F. Wang, S. Brandt, E. Miller, and D. Long. OBFS: A File System for Object-based Storage Devices. In 12th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2004), April 2004.
[4] R. Schrock. Smart Object Based Storage Cluster Computing: Storage Networking. In Computer Technology Review, October 2003.
[5] C. Olson and E. Miller. Secure Capabilities for a Petabyte-Scale Object-Based Distributed File System. In StorageSS '05, November 2005, Virginia, USA.
[6] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn. Ceph: A Scalable, High-Performance Distributed File System.
[7] Cluster File Systems Inc. Lustre: A Scalable High-Performance File System. Cluster File Systems Inc.
[8] P. Braam and M. Callahan. Lustre: A SAN File System for Linux. Stelias Computing Inc.
[9] T. Ruwart. OSD: A Tutorial on Object Storage Devices. Ciprico Inc.
[10] N. Pramod. A Survey of Techniques Used to Reduce the Semantic Gap between Database Management Systems and Storage Subsystems. http://www-users.cs.umn.edu/~npramod/project_report_v1.doc
[11] D. Nagle, D. Serenyi and A. Matthews. The Panasas ActiveScale Storage Cluster. In Proceedings of the ACM/IEEE SC2004 Conference, November 2004.
[12] Y. Lu, D.H.C. Du and T. Ruwart. QoS Provisioning Framework for an OSD-based Storage System. In Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2005), April 2005.
[13] M. Factor, D. Nagle, D. Naor, E. Riedel and J. Satran. The OSD Security Protocol. In Proceedings of the 3rd IEEE International Security in Storage Workshop (SISW '05), 2005.
[14] Y. Zhang and D. Wang. Research on Object-Storage-Based Intrusion Detection. In Proceedings of the 12th International Conference on Parallel and Distributed Systems (PADS '06), 2006.
[15] S. Weil, S. Brandt, E. Miller and C. Maltzahn. CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data. In Proceedings of the ACM/IEEE SC2006 Conference on Supercomputing, 2006.
[16] L. Zeng, D. Feng, F. Wang and K. Zhou. Object Replication and Migration Policy Based on OSS. In Proceedings of the International Conference on Machine Learning and Cybernetics, 2005.
[17] L. Qin and D. Feng. An Adaptive Load Balancing Algorithm in Object-Based Storage Systems. In Proceedings of the International Conference on Machine Learning and Cybernetics, August 2006.