CSE 598D: Storage Systems Survey
Object-Based Storage
By: Kanishk Jain
Dated: 10th May, 2007

To the best of the author's knowledge, this literature survey is the first work to analyze the overall progress of object-based storage technology following its standardization. It also illustrates how the trends of recent research fit into the broad scope of the object-based storage environment, bringing out the current status of the technology while simultaneously providing a reality check for the concept. A special feature of this survey is an attempted comparison of existing object-based file systems, mainly from a design perspective.

Abstract

Object-based storage is a new technology that provides intelligence at the storage device. The object storage device (OSD) interface has recently been standardized. The main characteristic of an OSD is “intelligent data layout”. Its advantages include scalability, security, reliability, performance and ease of management. This literature survey looks at the capabilities of object-based storage and explores how it improves data sharing, security, and device intelligence. It analyzes various aspects of object-based storage, such as design and application-based optimizations, in order to understand the advantages of this upcoming storage technology. Many of the ideas presented in this survey suggest extensions to the OSD interface to enhance performance, security, quality of service and so on. Hence the interface is still evolving.

1. Introduction

The evolution and stability of current storage interfaces (SCSI and ATA/IDE) has allowed continual advances in both storage devices and applications, without frequent changes to the standards. However, since the interface ultimately determines the functionality supported by the devices, current interfaces are holding system designers back. Storage technology has progressed to the point that a change in the device interface is needed. Object-based storage [1] is an emerging technology designed to address this problem. The OSD (object storage device) interface has recently been standardized (as the ANSI T10 Object-based Storage Devices Standard). The main characteristic of an OSD is “intelligent data layout”.

A storage object is a logical collection of bytes on a storage device, with well-known methods for access, attributes describing characteristics of the data, and security policies that prevent unauthorized access. Unlike blocks, objects are of variable size and can be used to store entire data structures, such as files, database tables, medical images, or multimedia.

Objects can be regarded as the convergence of two technologies: files and blocks. Files provide user applications with a higher-level storage abstraction that enables secure data sharing across different operating system platforms, but often at the cost of limited performance due to file server contention. Blocks offer fast, scalable access to shared data; but without a file server to authorize the I/O and maintain the metadata, this direct access comes at the cost of limited security and data sharing. Objects can provide the advantages of both files and blocks. Like blocks, objects are a primitive unit of storage that can be directly accessed on a storage device (i.e., without going through a server); this direct access offers performance advantages similar to blocks. Like files, objects are accessed using an interface that abstracts storage applications from the metadata necessary to store the object, thus making the object easily accessible across different platforms. Providing direct, file-like access to storage devices is therefore the key contribution of object-based storage.

Figure 1 shows the OSD model and the object interface it provides. [9] provides a detailed description of the advantages of the OSD model. An illustration of the hierarchy of OSD objects and attributes can be found in [2].

Figure 1: Object Storage Device model and interface

This literature survey looks at the capabilities of object-based storage and explores how it improves data sharing, security, and device intelligence. The rest of the paper is organized as follows. Section 2 outlines the advantages of object-based storage in a cluster computing environment. Section 3 analyzes object-based storage from a file systems perspective. Section 4 discusses the use of application-specific attributes and illustrates an example of their use in database storage management. Section 5 indicates trends of recent research in object-based storage. Section 6 discusses related work. Finally, section 7 concludes the paper.

2. Object-Based Storage for Cluster Computing

Instead of using proprietary, expensive supercomputers to solve the most challenging computing problems, nearly every new supercomputing system installed today is composed of thousands of low-cost Linux servers united into a cluster. Supercomputing applications, apart from having high computational complexity, need high-performance data access. Without rapid and efficient access to data, scarce computing resources sit idle. Traditional networked storage systems are simply incapable of providing the data throughput needed to keep ever-growing Linux clusters operating efficiently. Equally important, these massive datasets need to be made globally available to all processes executing across the compute cluster, to simplify application development and to ease the burden of managing data repositories. Here again, traditional networked storage systems fall short: they are incapable of scaling capacity within a single namespace and thereby increase the time and complexity of managing networked data.

To understand the need for a new approach to scalable storage, it is essential to explore the manner in which many cluster computing applications address the storage bottleneck. Linux cluster applications use a scale-out approach to parallel computing. In this model, applications employ a 'divide-and-conquer' approach, decomposing the problem to be solved into thousands of independently executed tasks. The most common decomposition approach exploits a problem's inherent data parallelism: breaking the problem into pieces by identifying the data partitions that make up the individual tasks, then distributing each task and corresponding partition to the compute nodes for processing.

The natural inclination of cluster computing developers is to deploy a networked storage solution that can be accessed by all nodes in the cluster. Such a solution greatly simplifies management of the compute jobs: all data partitions and replicas can be made available to all nodes, and hence any of the tasks can be computed on any node. Additionally, the output of these jobs can then be used directly elsewhere: in post-processing, visualization, or even as the input to the next processing task in a computational pipeline. However, neither storage area network (SAN) nor network attached storage (NAS) architectures support the aggressive concurrency and per-client throughput requirements of scalable cluster computing applications [1,4].

Figure 2 illustrates NAS being used to share files among a number of clients. The files themselves may be stored on a fast SAN. However, because the clients often suffer from queuing delays at the server, they rarely see the full performance of the SAN. The file server intermediates all requests and thus becomes the bottleneck.

Figure 2: The NAS architecture

Figure 3: The SAN architecture

Figure 3 shows a SAN file system being used to share files among a number of clients. The files themselves are stored on a fast storage area network (e.g., iSCSI) to which the clients are also attached. File server queuing delays are avoided by having the file server share metadata with the clients, who can then directly access the storage devices. However, since the devices cannot authorize I/O, the file server must assume that the clients are trusted. Hence, while the file server is removed as a bottleneck, security is a concern. Because of these limitations, organizations are forced to adopt a process in which data from a shared storage system is staged (copied) to the compute nodes, processing is performed, and results are de-staged from the nodes back to shared storage when done. In many applications, the staging setup time can be appreciable: up to several hours for large clusters.

Object-based storage clustering [4] is useful in unlocking the full potential of these Linux compute clusters, as object storage clusters have the intrinsic ability to scale linearly in capacity and performance to meet the demands of supercomputing applications (the scalability of the object-storage architecture is explained in detail in [9]). Object-based storage offers high-bandwidth parallel data access between thousands of Linux cluster nodes and a unified storage cluster over standard TCP/IP networks. It is a solution in which the storage system's scalability can be precisely matched and then scaled to the needs of the cluster computer. Together, Linux clusters and object-based storage clusters deliver commodity-like supercomputers able to keep pace with increasingly voracious applications.

In an object-based storage architecture, dynamic, self-managing data objects are stored across a cluster of intelligent object storage devices (OSDs). Data objects are fundamental containers that house both application data (including metadata describing the "mapping" of object data to physical disk drives) and an extensible set of storage attributes (application-specific attributes). User and application files are decomposed into a set of data objects and distributed across one or more OSDs. Each OSD is an easily scalable cluster element: it includes one or more disk drives, local processing to manage data flow, memory for data caching, and a high-speed network connection. Together, data objects stored on object storage devices form the core of a scalable storage system. Uniquely, each object-based cluster element has the intelligence to deliver data directly and securely to the Linux cluster. This is how highly parallel data access is achieved: Linux cluster nodes can securely read and write data objects in parallel to all object storage devices in the storage cluster. The intelligence of this system offers further benefits: all data is virtualized into a single seamless namespace for ease of manageability, and the entire system can be dynamically rebalanced to ensure ongoing, self-managed operational efficiency.

While object storage devices form the foundation of massively parallel storage architectures, they do not constitute a storage system by themselves. To deliver a complete system, a scalable file-level metadata management layer must be added. Metadata managers in an object-based system manage information such as directory membership, file ownership, and permission attributes.
Metadata managers are responsible for striping data objects (portions of files) across OSDs and ensuring file-level data integrity (for example, by computing and storing parity objects that implement RAID-5 redundancy).

Figure 4 illustrates the object-based storage security architecture. Metadata managers grant capabilities to clients; clients present these capabilities to the devices on every I/O. Secrets shared between the manager and the storage devices are used to generate a keyed hash of the capability, thereby protecting it from modification (a minimal sketch of this exchange is given at the end of this section).

Figure 4: Object Based Architecture

In an object-based design, metadata managers represent the control path between the Linux cluster and the storage cluster. This is the path through which compute cluster nodes make requests (e.g., to open or close files), are authenticated, and receive authorization credentials and a map of object locations and their host OSDs. The node then uses the map and credentials to securely access the cluster of OSDs, reading and writing file data without additional intervention by the metadata manager (this represents an important separation of the data and control paths). Furthermore, metadata managers can also be clustered, like object storage devices, for optimal performance and reliability. Many of these basic concepts are used in the design of file systems for object storage devices, which is explained in the next section.
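To make the capability mechanism above concrete, the following minimal sketch (in Python) shows how a metadata manager might sign a capability with a secret shared with the OSDs, and how an OSD would verify it on each I/O. All field and function names here are illustrative assumptions; the actual wire format and protocol are defined by the T10 OSD standard [13].

import hashlib
import hmac
import time

SHARED_SECRET = b"mgr-osd-shared-secret"   # provisioned out of band

def issue_capability(object_id, rights, ttl=3600):
    """Metadata manager: build a capability and protect it with a keyed hash."""
    cap = {"object_id": object_id, "rights": rights,
           "expires": int(time.time()) + ttl}
    msg = f"{cap['object_id']}|{cap['rights']}|{cap['expires']}".encode()
    cap["tag"] = hmac.new(SHARED_SECRET, msg, hashlib.sha256).hexdigest()
    return cap

def osd_verify(cap, requested_right):
    """OSD: recompute the keyed hash; reject tampered or expired capabilities."""
    msg = f"{cap['object_id']}|{cap['rights']}|{cap['expires']}".encode()
    tag = hmac.new(SHARED_SECRET, msg, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(tag, cap["tag"])
            and requested_right in cap["rights"]
            and cap["expires"] > time.time())

# The client presents the capability on every I/O; it cannot forge or widen
# its rights because it never sees the shared secret.
cap = issue_capability(object_id=42, rights="r")
assert osd_verify(cap, "r") and not osd_verify(cap, "w")

Because the OSD can check the keyed hash locally, no round trip to the metadata manager is needed on the data path, which is precisely the control/data separation described above.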

3. Object-Based Storage: A File Systems Perspective

The interface to object-based storage is very similar to that of a file system, and hence most basic file system principles are applicable to it. Objects can be created or deleted, read or written, and even queried for certain attributes, just like files today. File interfaces have proven easy to understand and straightforward to standardize, and are thus used to enable sharing between different platforms.

In [9], Ruwart discusses the difference between an object-based file system and current, conventional file systems. Current file system technologies that access disk drives directly are "block-based" in nature. These file systems are responsible for the management of all available disk blocks on the disk storage devices they manage. The "file system manager" is the program, running on a computer system, that manages all the data structures on a disk storage device that make up a "file system". The file system manager performs file creation, data block allocation, tracking of which files occupy which data blocks, control of access to these files, file deletion, and management of the list of free or unused data blocks. In performing these functions the file system manager examines and manipulates on-disk data structures such as information nodes (inodes) and directory trees. The file system manager manages both basic types of data: metadata and user data. It therefore understands the structure of the file system, but not the contents of the user data contained in the file system. Also, from the point of view of the file system manager, a disk storage device is simply a sequential set of disk blocks, into which all the metadata and user data are mapped. From the point of view of the storage device, it only knows how to access blocks; it has no concept of the structure of these blocks as it relates to the file system, or of the data contained within the blocks.

The object-based storage model in effect splits the file system into two components, a user component and a storage component, by moving the management of the individual blocks (i.e. the storage component) to the devices themselves. The file system manager (i.e. the user component) then only needs to manage objects, a far more manageable problem. The fact that a disk device has blocks is completely hidden from anything outside the disk drive itself. In fact, it does not even have to be a "disk" drive: it could be a solid-state device, a MEMS device, or a quantum crystal device. It no longer matters to the file system manager, as long as the device can store and retrieve "objects". For example, a file system manager in such a scenario might only be required to manage 500,000 objects; the fact that they take up the equivalent of 30 trillion 512-byte blocks is no longer directly relevant.

The file system design principles of different object-based storage systems are analyzed below.
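First, as a rough illustration of the file-like interface just described, the following Python toy sketch models an OSD exposing create/read/write/get-attribute operations over a flat object namespace, with all block management hidden inside the device. The class and method names are illustrative assumptions, not the T10 command set.

class ObjectStorageDevice:
    """Toy OSD: clients see objects and attributes, never blocks."""

    def __init__(self):
        self._objects = {}      # flat namespace: object_id -> bytearray
        self._attrs = {}        # object_id -> {attribute name: value}
        self._next_id = 1

    def create(self, attrs=None):
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = bytearray()
        self._attrs[oid] = dict(attrs or {})
        return oid

    def write(self, oid, offset, data):
        obj = self._objects[oid]
        if len(obj) < offset + len(data):
            obj.extend(b"\x00" * (offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data   # block placement stays internal

    def read(self, oid, offset, length):
        return bytes(self._objects[oid][offset:offset + length])

    def get_attr(self, oid, name):
        return self._attrs[oid].get(name)

osd = ObjectStorageDevice()
oid = osd.create(attrs={"expected_access": "sequential"})
osd.write(oid, 0, b"hello object world")
assert osd.read(oid, 6, 6) == b"object"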

3.1 General-purpose file systems and object storage devices

Much research has gone into hierarchy management, scalability, and availability of distributed file systems in projects such as AFS, GPFS, and GFS, but relatively little research has been aimed toward improving the performance of the storage manager (i.e. the storage component of a file system). Because modern distributed file systems may employ thousands of storage devices, even a small inefficiency in the storage manager could result in a significant loss of performance in the overall storage system. In practice, general-purpose file systems are often used as the storage manager. For example, Lustre [described below] uses Linux file systems as its object storage targets (OSTs). Since the workload offered to OSDs may be quite different from that of general-purpose file systems, a better storage manager can be built by matching its characteristics to the workload.

File systems such as Ext2 and Ext3 are optimized for general-purpose Unix environments in which small files dominate and file sizes vary significantly. They have several disadvantages that limit their effectiveness in large object-based storage systems. Ext2 caches metadata updates in memory for better performance. Although it flushes the metadata back to disk periodically, it cannot provide the high reliability we require. Both Ext3 and XFS employ write-ahead logs to record metadata changes, but the lazy log write policy used by both of them can still lose important metadata (and therefore data) in some situations. These general-purpose file systems trade off reliability for better performance. If we force them to synchronously update object data and metadata for better reliability, their performance degrades significantly. Many general-purpose file systems such as Ext2 and Ext3 use flat directories in a tree-like hierarchy, which results in relatively poor searching performance for directories of more than a thousand objects. XFS uses B+-trees to address this problem. Some object-based file systems such as OBFS [described below] use hash tables to obtain very high performance directory operations on the flat object namespace.

In an object-based storage system, as in many others, RAID-style striping with parity and/or replication is used to achieve high performance, reliability, availability, and scalability. Unlike RAID, the devices are semi-autonomous, internally managing all allocation and scheduling details for the storage they contain. The devices themselves may use RAID internally to achieve high performance. In this architecture, each stripe unit is stored in a single object. Thus, the maximum size of the objects is the stripe unit size of the distributed file system, and most of the objects will be this size (see the sketch below). At the OSD level, objects typically have no logical relationship, presenting a flat name space. As a result, general-purpose file systems, which are usually optimized for workloads exhibiting relatively small variable-sized files, relatively small hierarchical directories, and some degree of locality, do not perform particularly well under this workload.
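A minimal sketch of this striping scheme, assuming a fixed stripe-unit size and naive round-robin placement (real systems use more sophisticated placement functions; see Section 5.3). Each stripe unit of a file becomes one object, so most objects are exactly one stripe unit in size; the constants are assumptions.

STRIPE_UNIT = 64 * 1024   # bytes per object (assumed stripe unit size)
NUM_OSDS = 8              # assumed cluster size

def locate(file_id, offset):
    """Map a byte offset in a file to (object name, OSD, offset in object)."""
    stripe_index = offset // STRIPE_UNIT
    object_name = (file_id, stripe_index)   # one object per stripe unit
    osd = stripe_index % NUM_OSDS           # naive round-robin placement
    return object_name, osd, offset % STRIPE_UNIT

# A 1 MB file decomposes into 16 stripe-unit objects spread over 8 OSDs.
assert locate(file_id=7, offset=0) == ((7, 0), 0, 0)
assert locate(file_id=7, offset=130 * 1024) == ((7, 2), 2, 2 * 1024)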

3.2 Lustre File System

The Lustre file system [7] is an object-based, open-source, high-performance file system from Cluster File Systems, Inc., designed to improve performance, availability, and scalability in a distributed, network-oriented computing environment. Such an environment requires high-performance, network-aware file systems that can satisfy both the requirements of individual systems and the data sharing requirements of workgroups and clusters of cooperating systems. Lustre provides high I/O throughput in clusters and shared-data environments, independence from the location of data on the physical storage, protection from single points of failure, and fast recovery from cluster reconfiguration and server or network outages.

Lustre provides a unique combination of the advantages of journaling and distributed file systems. Some of the important design features of Lustre are as follows:

- Decoupling of computation and storage resources: Like many other distributed file systems, Lustre decouples computational and storage resources. Lustre runs on commodity hardware, using object-based disks for storage and metadata servers for storing file system metadata. This design provides a substantially more efficient division of labor between computing and storage resources. Replicated, failover metadata servers (MDSs) maintain a transactional record of high-level file and file system changes. Distributed object storage targets (OSTs) are responsible for actual file system I/O and for interfacing with storage devices. This division of labor and responsibility leads to a truly scalable file system and more reliable recoverability from failure conditions. The Lustre system provides several abstractions designed to improve performance and scalability. At the file system level, Lustre treats files as objects that are located through metadata servers. Metadata servers support all file system namespace operations, directing actual file I/O requests to object storage targets (OSTs) [8], which manage the storage that is physically located on underlying object-based storage disks. Metadata servers are only updated when namespace changes to a file are required. Divorcing file system metadata operations from actual file data operations improves immediate performance and also improves long-term aspects of the file system such as recoverability and availability, since using distributed object storage targets and providing a failover metadata server eliminates any one OSD as a single point of failure.

- Use of underlying journaling file systems: Lustre leverages the journaling file systems provided by Linux to enable persistent state recovery, providing resiliency and recoverability from failed OSTs. Many modern file systems use metadata journaling to maximize file system consistency. In Lustre, metadata servers keep a transactional record of file system metadata changes and cluster status, and support failover so that the hardware and network outages that affect one metadata server do not affect the operation of the file system itself.

- Security: Lustre provides security in the form of authentication, authorization and privacy by leveraging existing OSD security mechanisms.

- Use of open standards: Lustre uses open, flexible standards that make it easy to integrate new and emerging network storage technologies.

Lustre’s flexibility, reliable and highly available design and inherent scalability make it well suited for use as a cluster file system.

Some of the disadvantages of Lustre are:

- Use of object storage targets: In Lustre, object storage targets (OSTs) are separate from the object storage devices (OSDs); OSTs are responsible for allocating objects to the OSDs. In Ceph (and OBFS) [described below], the intelligence of the OSTs is assumed to be present on the OSD itself. Lustre therefore does not use device intelligence to manage storage layout, and hence cannot use application-specific attributes to optimize layout at the devices themselves [2].

- Metadata resides on the metadata server itself: In OBFS [described below], the metadata is packed with the data blocks on the disk itself, allowing very low overhead metadata updates. Lustre thus incurs an extra overhead because metadata is kept on the metadata server.

- Limited by existing security mechanisms: The basic security approach implemented by Lustre may not scale to millions of clients [6].

3.3 Panasas File System (PanFS)

The Panasas ActiveScale Storage Cluster [11] is a product of Panasas Inc. The cluster has five major components:

- The object, the primary component, which contains the data and enough additional information to allow the data to be autonomous and self-managing.

- The object-based storage device (OSD), a more intelligent evolution of today’s disk drive that can lay out, manage, and serve objects.

- The Panasas File System (PanFS) client module, which integrates into the client, accepting POSIX file system commands and data from the operating system, addressing the OSDs directly, and striping the objects across multiple OSDs.

- The PanFS metadata server (MDS), which intermediates amongst multiple clients in the environment, allowing them to share data while maintaining cache consistency on all nodes.

- The Gigabit Ethernet network fabric that ties the clients to the OSDs and metadata servers.

Panasas is very similar to Lustre in its design. Like Lustre, it fails to delegate responsibility to the OSDs, and it has limited support for efficient distributed metadata management, limiting its scalability and performance.

3.4 Ceph File System

The Ceph file system [5,6] is another reliable, high-performance, object-based distributed file system with excellent scalability. Ceph also uses network-attached object storage devices to hold file data for applications ranging from workstation-style individual file access to coordinated access for high-performance parallel applications.

Some of the important design features of Ceph are as follows:

- Distribution and replication of a file across a sequence of objects on many OSDs: Like Lustre (and Panasas), Ceph distributes and replicates a file across a sequence of objects on many OSDs. Traditional NASD techniques would bottleneck at the OSD if multiple hosts accessed the same file.

- Separation of metadata and data paths: As in Lustre (and Panasas), separate metadata servers (MDSs) manage the directory hierarchy, permissions and file-to-object mapping.

- Partitioning the directory tree: To efficiently balance load, the MDSs partition the directory tree across the cluster. A client guesses which metadata server is responsible for a file and contacts that server to open the file; that MDS forwards the request to the correct MDS if necessary. The responsible MDS replies with a file handle and information about which MDS manages each component along the full pathname. The client may use this information to improve future guesses about which MDS to contact for a particular file (a toy model of this guessing step is sketched after this list).

- Use of a specialized mapping algorithm: The file handle returned by the metadata server describes which objects on which OSDs contain the file data. A special algorithm, CRUSH [described in a later section], maps a sequence index to the OSD holding the object at that position in the sequence, distributing the objects in a uniform way. Unlike a simple hash function modulo the number of available OSDs, the CRUSH algorithm requires Ceph to redistribute only a small portion of the file data when adding or removing OSDs from the system. The CRUSH algorithm also allows Ceph to replicate files in different patterns rather than replicate an entire OSD. Thus, when one OSD fails, multiple OSDs participate in its restoration. This narrows Ceph’s window of vulnerability to a second failure that could cause irrecoverable data loss.

- Limit on object size and use of regions: Ceph limits objects to a maximum size (e.g., 1 MB), so files are a sequence of bytes broken into chunks on the maximum object size boundary. Since only the MDSs hold the directory tree, OSDs do not have directory information to suggest layout hints for file data. Instead, the OSDs organize objects into small and large object regions, using small block sizes (e.g., 4 KB or 8 KB) for small objects and large block sizes (e.g., 50-100% of the maximum object size) for large objects. This layout guarantees that bulk reads and writes for a large object are likely to be contiguous on disk.

- Use of a more advanced security mechanism: Security issues are addressed by the OSD standard. However, a naive implementation using the basic security approach may not scale to hundreds of millions of pairs. Hence, in [5] Olson and Miller suggest a scalable security approach for use with the Ceph file system.

- POSIX interface: Ceph implements the standard POSIX interface, making it easy to integrate with existing applications.
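The toy model promised above: a client-side first guess made by hashing the path, corrected and cached once the responsible MDS answers. The hash choice, the cache, and the forwarding shortcut are all assumptions for illustration; Ceph's actual mechanism (dynamic subtree partitioning) is described in [6].

import hashlib

NUM_MDS = 4
authority = {}        # path -> MDS index, the real partitioning (server side)
client_cache = {}     # path -> MDS index learned from earlier replies

def guess_mds(path):
    """Client: use cached knowledge if available, else hash the path."""
    if path in client_cache:
        return client_cache[path]
    return hashlib.md5(path.encode()).digest()[0] % NUM_MDS

def open_file(path):
    """Contact the guessed MDS; it forwards to the responsible one if needed."""
    guessed = guess_mds(path)
    responsible = authority.get(path, guessed)   # wrong guess -> one forward
    client_cache[path] = responsible             # improve future guesses
    return responsible

authority["/data/results.bin"] = 3
open_file("/data/results.bin")                   # may be forwarded once
assert guess_mds("/data/results.bin") == 3       # later guesses are exact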

3.5 OBFS: Object-Based File System

Many of the design principles of Ceph are drawn from a basic, conceptual object-based file system (OBFS), suggested by Wang et al. in [3]. The design considerations of this OBFS are:

- Workload differences: By distributing objects across many devices, these systems have the potential to provide high capacity, throughput, reliability, availability and scalability. Because files will generally consist of many objects and objects will be distributed across many OSDs, there will be little locality of reference within each OSD. The workload presented to the OSDs in this system will thus be quite different from that of general-purpose file systems. The basic idea of OBFS is to optimize disk layout based on available knowledge of the workload. This knowledge could be obtained through application-specific attributes [2] [discussed in a later section of the survey].

- Disk layout policy: OBFS introduced the concept of regions later used in Ceph. OBFS uses two block sizes: small and large. Small objects are kept with other small objects in small-object regions, while large objects are kept with other large objects in large-object regions. This maintains good disk utilization in addition to increasing throughput, and leads to relatively little fragmentation as the file system ages [3] (see the sketch after this list).

- Flat name space: On the OSD there is a complete lack of information about relationships between objects. Hence a flat name space is used to manage objects. Since there is no relationship between objects, efficient searching methods are required (in a hierarchical name space, by contrast, data lookup is implemented by following the path associated with the object to the destination directory). A hash table called the Object Lookup Table is used to increase searching efficiency.

- No caching at the OSDs: Since there is no locality to be leveraged, OBFS does not cache writes on the OSD. Hence all writes to the OSD are made persistent and thus do not need to be made synchronous explicitly.

- Object metadata on the disk: Object metadata is referred to as an onode and is used to track the status of each object. The metadata is packed with the data blocks on the disk itself, allowing very low overhead metadata updates. This implies that the role of a metadata server is only to provide security capabilities for authenticated access.
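A minimal sketch of the two ideas above, region-based block sizing plus a hash-table Object Lookup Table, assuming 4 KB small blocks, 512 KB large blocks, and a 64 KB small/large threshold (all invented constants; the real OBFS on-disk layout is described in [3]):

SMALL_BLOCK = 4 * 1024          # assumed block size in small-object regions
LARGE_BLOCK = 512 * 1024        # assumed block size in large-object regions
SMALL_OBJECT_LIMIT = 64 * 1024  # assumed threshold between the two regions

object_lookup_table = {}        # flat namespace: object_id -> (region, blocks)

def allocate(object_id, size):
    """Place an object in a small- or large-block region based on its size."""
    if size <= SMALL_OBJECT_LIMIT:
        region, block = "small", SMALL_BLOCK
    else:
        region, block = "large", LARGE_BLOCK
    blocks_needed = -(-size // block)            # ceiling division
    object_lookup_table[object_id] = (region, blocks_needed)
    return region, blocks_needed

assert allocate(1, 10 * 1024) == ("small", 3)    # 10 KB -> 3 small blocks
assert allocate(2, 900 * 1024) == ("large", 2)   # 900 KB -> 2 large blocks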

3.6 Performance Results

A user-level implementation of OBFS [3] outperforms Linux Ext2 and Ext3 by a factor of two or three. While OBFS is 1/25 the size of XFS, it provides only slightly lower read performance and 10%-40% higher write performance. This can be seen as an indication of the potential of object-based file systems, in general, compared to conventional file systems.

4. Application-Specific Optimization: Example of Database Storage Management

Object attributes improve data sharing by allowing storage applications to share a common set of information describing the data (e.g., access times). They are also the key to giving storage devices an awareness of how objects are being accessed, so that the device can use this information to optimize disk layout for the specific application. This section illustrates this advantage with a specific example: database storage management, for which an object-based storage solution has been suggested in [2].

Storage performance is of utmost importance for database applications, and is problematic because database software often has very little detailed information about the storage subsystem beyond a few crude parameters and basic rules of thumb. The standard interfaces to the storage subsystem (e.g., SCSI and IDE) virtualize storage as a simple linear array of fixed-size logical blocks. Such virtualization is important for reasons of compatibility and ease of implementation, but it can hide useful information that is critical for storage-intensive workloads like databases. Previous research took the view that a storage device can provide relevant characteristics to applications (by letting applications query disk parameters) to allow for optimized I/O access. In [2], however, Schlosser and Iren suggest an alternative approach: rather than have the storage subsystem inform the application of its capabilities, the application (a database system in this case) should communicate semantic information (e.g., the schema of a relation) and other quality of service requirements to the storage subsystem, allowing it to make data allocation decisions on behalf of the application. The main advantage of such an approach is that all of the requisite device-specific information is known to the storage subsystem, which is thus better equipped to manage low-level storage tasks. Their work is a first step towards application-aware storage management, considering low-level data placement by object-based storage systems.

Object attributes can contain information about the expected behavior of an object, such as its expected read/write ratio, access pattern (sequential vs. random), or expected size, dimension, and content. Having access to such attributes enables the device to better organize and serve the data to applications, including database applications. By moving this storage component of a DBMS (or file system) to the storage devices, object-based storage removes the biggest obstacle to data sharing: since metadata is offloaded to the storage device, the dependency between the metadata and the storage system/application is removed. This assists data sharing between different storage applications (i.e. cross-platform data sharing). It also improves the scalability of clusters [as discussed in previous sections], since hosts no longer need to coordinate metadata updates.

The shortcomings of existing database storage models stem from the fact that the common storage interface is inherently linear, requiring serialization of a relation along one axis or the other. Using OSD, a DBMS can inform the storage subsystem of the geometry of a relation, thereby passing responsibility for low-level data layout to the storage device, where the requisite disk parameters reside.
Once data placement tasks are handled by the storage subsystem, the higher-level storage functions must be able to access the data in a clean way, independent of low-level placement. For example, the buffer pool manager should be able to specify the data it requires from the storage subsystem in terms of a relation's schema, rather than in terms of block or byte addresses on disk. The approach used in [2] is to store an entire table in a single object, with the schema of the relation expressed through attributes assigned to the object (a toy version of this idea is sketched below). The work suggests extensions to the OSD interface to allow further optimization of database storage management, and considers only relational database systems as a starting point. However, apart from detailing the advantages of such an approach, the authors provide no experimental results and call their work a “first look on using OSD to improve database systems”.
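A toy version of the idea: a relation is stored in one object, its schema and expected access pattern are attached as attributes, and a hypothetical device consults them to choose a layout. The attribute names and the layout policy are invented for illustration and are not part of [2] or of the OSD standard.

class ToyOSD:
    """Minimal attribute-carrying OSD stand-in (see the Section 3 sketch)."""
    def __init__(self):
        self._attrs, self._next_id = {}, 1
    def create(self, attrs):
        oid = self._next_id
        self._next_id += 1
        self._attrs[oid] = dict(attrs)
        return oid
    def get_attr(self, oid, name):
        return self._attrs[oid].get(name)

def create_table_object(osd, name, schema, access_pattern):
    """DBMS side: store a whole relation in one object, described by attributes."""
    return osd.create(attrs={
        "table_name": name,
        "schema": schema,                 # e.g., [("id", 8), ("amount", 8)]
        "row_size": sum(width for _, width in schema),
        "expected_access": access_pattern,
    })

def device_choose_layout(osd, oid):
    """Device side: pick row-major vs column-major from the declared workload."""
    if osd.get_attr(oid, "expected_access") == "column_scan":
        return "column-major"             # favor sequential scans of one column
    return "row-major"                    # favor whole-row point lookups

osd = ToyOSD()
oid = create_table_object(osd, "orders",
                          schema=[("id", 8), ("amount", 8), ("note", 84)],
                          access_pattern="column_scan")
assert device_choose_layout(osd, oid) == "column-major"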

5. Trends of Recent Research

5.1 Security

Security is one of the key advantages of object-based storage compared to block-based storage. Recent research has been directed both towards improving the security mechanisms of the object-based storage interface itself and towards using the object-based storage interface to provide security in other systems.

In [13], Factor et al. present the requirements, the design tradeoffs, and the final security protocol as defined in the ANSI T10 Object-based Storage Devices (OSD) Standard. The resulting protocol is based on a secure capability-based model, enabling fine-grained access control that protects both the entire storage device and individual objects from unauthorized access. The protocol defines three methods of security based on the applications' requirements. Furthermore, the protocol's key management algorithm allows keys to be changed quickly, without disrupting normal operations. The authors also suggest future extensions to the interface, including data encryption and access control on sections of storage objects. Some of these ideas have been furthered by Olson and Miller in [5], who suggest a more scalable security mechanism, which is used by the Ceph file system.

A prominent example of the application of object-based storage to improve the security of other systems is the work by Zhang and Wang in [14] on object-storage-based intrusion detection. Intrusion detection is a much-researched area of computer security. Storage-based intrusion detection systems (IDS) can be valuable tools for monitoring for intrusions on a host computer. However, because traditional storage devices work at the block level while intrusions happen at the file level, this gap has to be bridged by the detection software, which is a hard and time-consuming task. To address this problem, Zhang and Wang present a novel design for an IDS based on object-based storage devices, and analyze how the features of OSD can be used for intrusion detection and for responding to violations. They also suggest enhancements to the existing OSD interface. While OSD-based intrusion detection is more straightforward to implement than detection on block-level storage devices, their testing results show that it does incur extra overhead.

5.2 Quality of Service

Quality of service (QoS) is crucial for certain applications, such as multimedia. QoS guarantees for applications running on multiple clients accessing shared storage devices are becoming more important than they were when every application had essentially a direct connection to the storage device. In [12], Lu et al. consider QoS provisioning for OSD-based systems. They analyze the QoS requirements of applications on OSD clients and, based on their observations, propose a three-level QoS specification consisting of an object level, an operation level and a session level. They go on to suggest extensions to the existing OSD interface and iSCSI protocol to support this specification, and propose future work on resource allocation and scheduling in an object storage device to enforce these QoS requirements.
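One plausible shape for such a three-level specification, sketched as Python data structures. The field names, units, and grouping are assumptions made here for illustration, not the actual proposal in [12]:

from dataclasses import dataclass

@dataclass
class ObjectQoS:            # object level: per-object storage guarantees
    object_id: int
    min_bandwidth_mbps: float
    max_latency_ms: float

@dataclass
class OperationQoS:         # operation level: per-request requirements
    deadline_ms: float
    priority: int

@dataclass
class SessionQoS:           # session level: aggregate guarantees per client
    client_id: int
    reserved_iops: int

# A multimedia client might ask for a streaming reservation like this:
spec = (SessionQoS(client_id=9, reserved_iops=500),
        ObjectQoS(object_id=42, min_bandwidth_mbps=8.0, max_latency_ms=40.0),
        OperationQoS(deadline_ms=40.0, priority=1))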

5.3 Object Placement

Placement of objects in an object-based storage system (i.e. the distribution of objects across different OSDs) is key to distributing data and workload evenly, in order to efficiently utilize available resources and maximize system performance, while facilitating dynamic system growth and the management of hardware failures.

Dividing a large and complicated storage system into a reasonable number of domains facilitates management of the system. This partitioning is the main concept used by Zeng et al. in [16] as they discuss the principles of a load balancing policy for object storage devices. They also suggest that an object storage system can initiate load balancing policy at three levels: the metadata level (using global information), the object storage controller level (using local information), and eventually the level of the storage object itself.

Qin and Feng, in [17], suggest that instead of the traditional approach of studying the two operations of data replication and migration separately, they should be studied together. They present an adaptive load balancing algorithm which combines replication and migration in a single uniform model. To evaluate the load of an object-based storage device in a timely and accurate manner, they use a hybrid load metric as a cost function. To handle the online problem with a changing workload, the algorithm employs an adaptive mechanism to keep track of the characteristics of workloads. Without assuming any a priori knowledge of the workload, the algorithm records the access history of each object and learns the traffic intensity for each object. They compare their algorithm with three other algorithms (one without load balancing, one with only object replication, and one with only object migration). Their simulation results show that the algorithm with both replication and migration outperforms the other three. Such an algorithm can dramatically reduce the system response time in situations where the load imbalance is caused by uneven object distribution or by a large number of accesses to hot objects.

Both of the above concepts, partitioning and a unified model of data replication and migration, are addressed by Weil et al. in [15] as they discuss CRUSH (Controlled Replication Under Scalable Hashing), a scalable pseudorandom data distribution function designed for distributed object-based storage systems that efficiently maps data objects to storage devices without relying on a central directory. Because large systems are inherently dynamic, CRUSH is designed to facilitate the addition and removal of storage while minimizing unnecessary data movement. The algorithm accommodates a wide variety of data replication and reliability mechanisms and distributes data in terms of user-defined policies that enforce separation of replicas across failure domains. This algorithm is used by the Ceph file system [discussed in a previous section]. Ceph must distribute petabytes of data among an evolving cluster of thousands of storage devices such that device storage and bandwidth resources are effectively utilized. In order to avoid imbalance (e.g., recently deployed devices mostly idle or empty) or load asymmetries (e.g., new, hot data on new devices only), Ceph adopts a strategy that distributes new data randomly, migrates a random sub-sample of existing data to new devices, and uniformly redistributes data from removed devices.
This stochastic approach is robust in that it performs equally well under any potential workload. Ceph first maps objects into placement groups (PGs) using a simple hash function, with an adjustable bit mask to control the number of PGs. The value chosen gives each OSD on the order of 100 PGs, balancing the variance in OSD utilizations against the amount of replication-related metadata maintained by each OSD. Placement groups are then assigned to OSDs using CRUSH, which efficiently maps each PG to an ordered list of OSDs upon which to store object replicas (a simplified stand-in for this two-step mapping is sketched below). This differs from conventional approaches (including other object-based file systems) in that data placement does not rely on any block or object list metadata. To locate any object, CRUSH requires only the placement group and an OSD cluster map: a compact, hierarchical description of the devices comprising the storage cluster. This approach has two key advantages: first, it is completely distributed, such that any party (client, OSD, or MDS) can independently calculate the location of any object; and second, the map is infrequently updated, virtually eliminating any exchange of distribution-related metadata. In doing so, CRUSH simultaneously solves both the data distribution problem (“where should the data be stored?”) and the data location problem (“where was the data stored?”). By design, small changes to the storage cluster have little impact on existing PG mappings, minimizing data migration due to device failures or cluster expansion.

The cluster map hierarchy is structured to align with the cluster's physical or logical composition and potential sources of failure. For instance, one might form a four-level hierarchy for an installation consisting of shelves full of OSDs, rack cabinets full of shelves, and rows of cabinets. Each OSD also has a weight value to control the relative amount of data it is assigned. CRUSH maps PGs onto OSDs based on placement rules, which define the level of replication and any constraints on placement. For example, one might replicate each PG on three OSDs, all situated in the same row (to limit inter-row replication traffic) but separated into different cabinets (to minimize exposure to a power circuit or edge switch failure). The cluster map also includes a list of down or inactive devices and an epoch number, which is incremented each time the map changes. All OSD requests are tagged with the client's map epoch, so that all parties can agree on the current distribution of data. Incremental map updates are shared between cooperating OSDs, and piggyback on OSD replies if the client's map is out of date.

CRUSH is an efficient algorithm, both in terms of computation and of the required metadata. Mapping calculations have O(log n) running time, requiring only tens of microseconds to execute with thousands of devices.
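The two-step mapping can be sketched as follows. This is not the CRUSH algorithm itself (which walks a weighted cluster-map hierarchy under placement rules); it substitutes a simple rendezvous-hashing stand-in for the PG-to-OSD step, purely to show the object -> PG -> OSD-list structure and its independence from any central directory. The constants are assumptions.

import hashlib

NUM_PGS = 256            # power of two so a bit mask works (assumed)
OSDS = list(range(12))   # toy cluster map: flat list of healthy OSD ids
REPLICAS = 3

def h(*parts):
    """Deterministic pseudorandom integer from the given parts."""
    data = "/".join(map(str, parts)).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def object_to_pg(object_name):
    """Step 1: a simple hash with a bit mask controls the number of PGs."""
    return h(object_name) & (NUM_PGS - 1)

def pg_to_osds(pg):
    """Step 2 (stand-in for CRUSH): rank OSDs by a pseudorandom score.
    Any party holding the cluster map computes the same ordered replica list."""
    return sorted(OSDS, key=lambda osd: h(pg, osd), reverse=True)[:REPLICAS]

pg = object_to_pg("file7.chunk2")
replicas = pg_to_osds(pg)
assert len(set(replicas)) == REPLICAS   # three distinct OSDs, no directory lookup

Like CRUSH, a rendezvous-style mapping moves only a small fraction of PGs when OSDs are added or removed; CRUSH additionally handles device weights and failure-domain placement rules, which this sketch omits.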

6. Related Work: Object-Based Storage versus Active Disks

Object-based storage can be broadly categorized under storage intelligence. The object-based storage interface can also be easily extended with application-specific methods for manipulating data within an object, a technology referred to as active disks. An active disk storage device combines on-drive processing and memory with the ability to download software, allowing disks to execute application-level functions directly at the device. Moving portions of an application's processing to a storage device significantly reduces data traffic and leverages the parallelism already present in large systems, dramatically reducing the execution time for many basic data mining tasks. A comparison of active disks and object-based storage with respect to database storage management has been made by Pramod in [10] and is summarized in Table 1.

Table 1: Object Based Storage versus Active Disks for database storage management

              Layout awareness   Processing   Flexibility   User        Scalability
              on disk            on disk                    interface
Active Disk   x                  √            somewhat      √           √
OSD + DB      √                  √            √             √           √

According to Ruwart in [9], the object-based storage model is intended to be used with the concept of active disks. It can be extended by scaling the processing power of an OSD to meet the requirements of the functions an active disk is expected to perform. The object-based storage concept is thus more generic than the active disk concept, and can be seen as an "enabling technology" for active storage devices.

7. Conclusions

Object-based storage provides intelligence at the storage device, in the form of intelligent data layout. Its advantages include scalability, security, reliability, performance and ease of management. The characteristics of object-based storage make it an excellent choice for use in large cluster systems, for both performance and security reasons. However, the file system design principles of object storage devices differ from those of conventional file systems because of differences in the workload. Object-based storage devices can optimize the data layout for a specific application by making use of application-specific object attributes. Some of the recent trends of research are in the areas of security, quality of service and object placement. However, most of the existing work on object-based storage devices is still in a nascent stage: some of it is entirely conceptual, while some results are based on simulations. Many of the ideas presented in this survey suggest extensions to the OSD interface to enhance performance (for example, in the case of database storage management), security, quality of service and so on. Hence the interface, though recently standardized, is still evolving.

8. References

[1] M. Mesnier, G. Ganger and E. Riedel. Object-Based Storage. In IEEE Communications Magazine, August 2003.
[2] S. Schlosser and S. Iren. Database Storage Management with Object-based Storage Devices. In Proceedings of the First International Workshop on Data Management on New Hardware (DaMoN 2005), June 2005, Baltimore, MD.
[3] F. Wang, S. Brandt, E. Miller, and D. Long. OBFS: A File System for Object-based Storage Devices. In 12th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2004), April 2004.
[4] R. Schrock. Smart Object Based Storage Cluster Computing: Storage Networking. In Computer Technology Review, October 2003.
[5] C. Olson and E. Miller. Secure Capabilities for a Petabyte-Scale Object-Based Distributed File System. In StorageSS '05, November 2005, Virginia, USA.
[6] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn. Ceph: A Scalable, High-Performance Distributed File System.
[7] Cluster File Systems Inc. Lustre: A Scalable High-Performance File System. Cluster File Systems Inc.
[8] P. Braam and M. Callahan. Lustre: A SAN File System for Linux. Stelias Computing Inc.
[9] T. Ruwart. OSD: A Tutorial on Object Storage Devices. Ciprico Inc.
[10] N. Pramod. A Survey of Techniques Used to Reduce the Semantic Gap between Database Management Systems and Storage Subsystems. http://www-users.cs.umn.edu/~npramod/project_report_v1.doc
[11] D. Nagle, D. Serenyi and A. Matthews. The Panasas ActiveScale Storage Cluster. In Proceedings of the ACM/IEEE SC2004 Conference, November 2004.
[12] Y. Lu, D.H.C. Du and T. Ruwart. QoS Provisioning Framework for an OSD-based Storage System. In Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2005), April 2005.
[13] M. Factor, D. Nagle, D. Naor, E. Riedel and J. Satran. The OSD Security Protocol. In Proceedings of the 3rd IEEE International Security in Storage Workshop (SISW '05), 2005.
[14] Y. Zhang and D. Wang. Research on Object-Storage-Based Intrusion Detection. In Proceedings of the 12th International Conference on Parallel and Distributed Systems (PADS '06), 2006.
[15] S. Weil, S. Brandt, E. Miller and C. Maltzahn. CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data. In Proceedings of the ACM/IEEE SC2006 Conference on Supercomputing, 2006.
[16] L. Zeng, D. Feng, F. Wang and K. Zhou. Object Replication and Migration Policy Based on OSS. In Proceedings of the International Conference on Machine Learning and Cybernetics, 2005.
[17] L. Qin and D. Feng. An Adaptive Load Balancing Algorithm in Object-Based Storage Systems. In Proceedings of the International Conference on Machine Learning and Cybernetics, August 2006.