Block-level Virtualization: How far can we go?

Michail D. Flouris†, Stergios V. Anastasiadis∗ and Angelos Bilas‡
Institute of Computer Science (ICS), Foundation for Research and Technology - Hellas,
P.O. Box 1385, Heraklion, GR 71110, Greece
{flouris, stergios, bilas}@ics.forth.gr

† Also a graduate student at the Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada.
∗ Also with the Department of Electronic and Computer Engineering, Technical University of Crete, Chania, GR 73100, Greece.
‡ Also with the Department of Computer Science, University of Crete, P.O. Box 2208, Heraklion, GR 71409, Greece.

Abstract

In this paper, we present our vision on building large-scale storage systems from commodity components in a data center. For reasons of efficiency and flexibility, we advocate maintaining sophisticated data management functions behind a block-based interface. We anticipate that such an interface can be customized to meet the diverse needs of end-users and applications through the extensible storage system architecture that we propose.

1. Introduction

Data storage is becoming increasingly important as larger amounts of data assets need to be stored for archival or online-processing purposes. Increasing requirements for improving storage efficiency and cost-effectiveness introduce the need for scalable storage systems in a data center that consolidate system resources into a single system image. Consolidation enables continuous system operation uninterrupted by device faults, easier storage management, and ultimately lower cost of purchase and maintenance.

Our target is primarily a data-center environment, where application servers and user workstations are able to access the storage system through potentially different network paths and protocols. In this environment, although storage consolidation has potential for lower costs and more efficient storage management, its success depends on the ability of the system architecture to face three main challenges:

1. The first significant challenge is the platform, that is, how to provide a scalable system with minimal hardware costs.

2. The second crucial challenge is the software infrastructure required for (i) virtualizing, (ii) sharing, and (iii) managing the physical storage in a data center. This includes the virtualization capability to mask applications from each other and to facilitate storage management tasks. It also includes the significant objective of lowering administration costs through automated administration. Furthermore, the software infrastructure should provide monitoring and dynamic reconfiguration mechanisms to allow the definition and implementation of high-level quality-of-service policies.

3. Finally, the third challenge is efficient storage sharing between remote data centers, enabling global-scale storage distribution.

Next we explore each of these challenges in more detail.

1.1. Scalable, Low-cost Platform

Today, commercial storage systems mostly use expensive custom-made hardware components and proprietary protocols. We advocate building scalable, low-cost storage systems from storage clusters composed of common off-the-shelf components. Current commodity computers and networks have (or will soon have) significant computing power and network performance, at a fraction of the cost of custom storage-specific hardware (controllers or networks). Using such components, we can build low-cost storage clusters with enough capacity and performance to meet practically any storage needs, and scale them economically. This idea is not new. It has been proposed and predicted a few years ago by several researchers and vendors [8, 9]. We think that the necessary commodity technologies are maturing now with the emergence of new standards, protocols and interconnects, such as Serial ATA, PCI-X/Express/AS and iSCSI. For the rest of this discussion, we assume that future storage will be based on clusters built of commodity PCs (e.g. x86, SATA disks) and commodity interconnects (e.g. gigabit Ethernet, PCI-Express/AS) accessible through standard protocols (e.g. SCSI).

A commodity PC with PCI-X or PCI-Express can currently accommodate at least 2 TBytes of storage using commodity I/O controllers with 8 or 16 SATA disks each. The PCI-X bus can support aggregate throughput in excess of 800 MBytes/sec to the disks, while gigabit Ethernet NICs and switches are an inexpensive means for interconnecting the storage nodes. The nodes can also run an existing high-performance operating system, such as Linux or FreeBSD, at very low cost. Similarly, interconnect technology that will allow efficient scaling is currently being developed (e.g. PCI-Express/AS). Thus, we assume that the commodity components needed to build such a storage platform are or will be available at a low cost. We believe that the next missing component for building cost-effective storage systems is addressing challenge 2. This will provide the software infrastructure that will run on top, integrating storage resources into a system that provides flexibility, sharing, and automated management. This is the main motivation for our work.

1.2. Software Infrastructure

A consolidated storage system requires increased flexibility to serve multiple applications and satisfy diverse needs in both storage management and data access. Moreover, as we accumulate massive amounts of storage resources behind a single system image, the automation of administration tasks becomes crucial for sustaining efficient and cost-effective operation. Continuous, unobtrusive measurement of device efficiency and user-perceived performance should drive data reorganization and device reconfiguration towards load-balanced system operation. Finally, policy-based data administration should meet preagreed performance objectives for individual applications. These are the challenges for the software infrastructure in a data center. We argue that such requirements can be satisfied through a virtualization infrastructure.

The term "virtualization" has been used to describe mainly two concepts: indirection and sharing. The first refers to the addition of an indirection layer between the physical resources and the logical devices accessed by the applications. Such a layer allows administrators to create, and applications to access, various types of virtual volumes that are mapped to physical devices, while offering higher-level semantics through multiple layers of mappings. This kind of virtualization has several advantages: (i) It enables many useful management and reconfiguration operations, such as non-disruptive data reorganization and migration during system operation. (ii) It adds flexibility to the system configuration, since physical storage space can be arbitrarily mapped to virtual volumes. For example, a virtual volume can be defined as any combination of basic operators, such as aggregation, partitioning, striping or mirroring. (iii) It allows the implementation of useful functionality, such as fault tolerance with RAID levels, backup with volume snapshots, or encryption with data filters. (iv) It allows extensibility, that is, the addition of new functionality to the system. This is achieved by implementing new virtualization modules with the desired functions and incrementally adding them to the system.

Currently, the "indirection" virtualization mechanisms of most storage systems and products [11, 13, 14] support a set of predefined functions and leave the administrator freedom only in the way of combining them. That is, they provide configuration flexibility, but very little functional flexibility. For example, predefined virtualization semantics include virtual volumes mapped to an aggregation of disks or to RAID levels. In this category belong both research prototypes of volume managers [2, 5, 7, 12, 19] as well as commercial products [3, 4, 10, 21, 22]. In all these cases the storage administrator can switch various features on or off at the volume level. However, there is no support for extending the I/O protocol stack with new functionality, such as snapshots, encryption, compression, virus scanning, application-specific data placement or content hash computation. Moreover, it is not possible to combine such new functions in a flexible manner.

The second notion of virtualization refers to the ability to share the storage resources across many nodes and multiple applications without compromising the performance of each application in any significant way. Even though some vendors claim that they can support such sharing, efficient virtualization remains an elusive target in large-scale systems because of the software complexity. Maintaining consistency is complex and usually introduces significant overheads in the system.

1.3. Global-scale Sharing

In this area, issues include interoperability, hiding network latencies, and achieving high throughput in networks with large delay-bandwidth products. We consider this area out of the scope of our work, but we believe that the virtualization infrastructure we propose facilitates global storage connectivity and interoperability between data centers.

The rest of this paper is organized as follows. Section 2 presents our position on virtualization issues, while Section 3 discusses our approach to block-level virtualization. Finally, Section 4 discusses related work and Section 5 draws our conclusions.

2. Our Approach

Our goal in this work is to address the software infrastructure challenges by exploring storage virtualization solutions. The term virtualization is used here to signify both an indirection layer between physical resources and virtual volumes, and resource sharing. Next we describe the directions we have taken in our approach.

2.1. Block-level functionality

We advocate block-level virtualization (we use the indirection notion here), as opposed to file-level virtualization, for the following reasons. First, certain functionality, such as compression or encryption, may be simpler and more efficient to provide on unstructured, fixed-size data blocks rather than on variable-size files. Second, storage hardware at the back-end has evolved significantly from simple disks and fixed controllers to powerful storage nodes [1, 8, 9] that offer block-level storage to multiple applications over a storage area network [16, 17]. Block-level virtualization modules can exploit the processing capacity of these storage nodes, where filesystems (running mainly on the application servers and the clients) cannot. Third, storage management (e.g. migrating data, backup, redundancy) at the block level is less complex to implement and easier to administer, because of the unstructured form of data blocks. For these reasons, and with the evolution of storage technology over time, a number of virtualization features, e.g. volume management functions, RAID and snapshots, have moved from higher system layers to the block level.

With the traditional file vs. block abstractions, intelligent control functions remain at higher-level abstractions, such as filesystems or databases, running at the front-end to support application servers and client workstations. In this manner, all the processing power of the storage hardware has to be hidden behind virtual disks for transparency reasons. This imbalance in favor of front-end control leaves the hardware capabilities of the back-end underutilized. On the other hand, offloading functionality to the back-end could free up the processor at the front-end to actually run application code instead of filesystem functions. The concept is similar to using DMA and smart hardware to relieve the main processor from performing memory transfers between main memory and the hardware on the bus (e.g. a RAID controller on the PCI bus). Similarly, we could offload processing and data transfers to the storage back-end. For example, instead of the client filesystem moving data over the network in order to copy a file, it could instruct the back-end to perform the copy locally at the level of data blocks. Thus, we could improve the performance of individual applications and the entire system.
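
To make this offloading idea concrete, the following sketch (plain C, with a made-up command structure; none of the names below come from Violin or from a standard such as SCSI EXTENDED COPY) shows how a front-end could describe a block-range copy and submit it to the back-end as a single control message, so that only the command, and not the data, crosses the network.

```c
/*
 * Hypothetical sketch: offloading a block-range copy to the storage
 * back-end as one control command, instead of reading every block over
 * the network and writing it back. Illustration only, not Violin's API.
 */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 4096u

/* Made-up control command describing a copy inside one back-end volume. */
struct copy_cmd {
    uint64_t src_lba;   /* first source block address      */
    uint64_t dst_lba;   /* first destination block address */
    uint64_t nblocks;   /* number of blocks to copy        */
};

/* Stand-in for submitting one control message to the back-end. */
static int backend_submit_copy(const struct copy_cmd *cmd)
{
    /* The back-end performs the copy locally; the data never leaves it. */
    printf("back-end: copy %llu blocks from LBA %llu to LBA %llu\n",
           (unsigned long long)cmd->nblocks,
           (unsigned long long)cmd->src_lba,
           (unsigned long long)cmd->dst_lba);
    return 0;
}

int main(void)
{
    /* Copy a 1 GiB extent: one small message instead of roughly 2 GiB of
     * network traffic (1 GiB read to the client, 1 GiB written back).   */
    struct copy_cmd cmd = {
        .src_lba = 0,
        .dst_lba = 1u << 20,
        .nblocks = (1024ull * 1024 * 1024) / BLOCK_SIZE,
    };
    return backend_submit_copy(&cmd);
}
```

In a real system such a request would also have to be routed through the virtualization hierarchy, so that mappings such as striping or encryption are applied consistently to both the source and the destination ranges.
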

[Figure 1. Violin in the OS. (Diagram: user-space applications issue filesystem, block I/O, and raw I/O requests; in kernel space, the Violin kernel module sits in the block I/O path alongside the buffer cache and the disk device drivers, hosting user-extension modules behind a high-level I/O API and keeping persistent metadata.)]

[Figure 2. The virtual device graph in Violin. (Diagram: two hierarchies, A and B, of internal virtual devices A-M inside the Violin context, mapping source devices with input capacities to sink devices with output capacities.)]

By using block-level virtualization we have to restrict our system to the traditional block-level interface. A higher API, such as objects, may be better in this case and we are considering it for the future. However, we currently aim to explore the limits of the block-level API before dealing with higher abstractions.

3. Block-level Virtualization

Our work addresses block-level virtualization, both for indirection and for sharing purposes. In this section we present our approach to both notions in more detail.

3.1. Indirection Virtualization

3.1.1. Single-storage-node Virtualization

As a building block towards a distributed storage system with flexible and efficient virtualization support, we have designed and implemented Violin [6]. Violin (Virtual I/O Layer INtegrator) is a kernel-level framework for (i) building and (ii) combining block-level virtualization functions. Violin is a virtual I/O framework for commodity storage nodes that replaces the current block-level I/O stack with an improved I/O hierarchy that allows for (i) easy extension of the virtual storage hierarchy with new mechanisms and (ii) flexible combination of these mechanisms to create modular hierarchies with rich semantics. Figure 1 illustrates a high-level view of Violin in the operating system context.

The main contributions of Violin are: (i) it significantly reduces the effort of introducing new functionality in the block I/O stack of a single storage node and (ii) it provides ample configuration flexibility to combine simple virtualization operators into hierarchies with semantics that can satisfy diverse application needs. To achieve configuration flexibility, Violin allows storage administrators to create arbitrary, acyclic graphs of virtual devices, each adding to the functionality of the successor devices in the graph. Figure 2 shows such a device graph, where mappings between higher and lower devices are represented by arrows. Every virtual device's functionality is provided by independent virtualization modules that are linked to the main Violin framework. A linked device graph is called a hierarchy and represents practically a virtualization layer stack. In each hierarchy, the blocks of each virtual device can be mapped in arbitrary ways to the successor devices, enabling advanced storage functions, such as dynamic relocation of blocks.
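
As a rough illustration of such a device graph (using an invented in-memory representation rather than Violin's actual configuration interface or user-level tools), the sketch below aggregates four sink disks under a striping device and layers an encryption device on top, forming a single hierarchy whose top device is the virtual volume that applications see.

```c
/* Hypothetical sketch of a virtualization hierarchy as an acyclic graph
 * of virtual devices. Illustration only; this is not Violin's real API. */
#include <stdio.h>

enum vdev_kind { VDEV_DISK, VDEV_STRIPE, VDEV_MIRROR, VDEV_ENCRYPT };

struct vdev {
    const char      *name;
    enum vdev_kind   kind;
    struct vdev     *succ[8];   /* successor (lower) devices in the graph */
    int              nsucc;
};

/* Add an arrow from an upper device to one of its successors (max 8 here). */
static void link_vdev(struct vdev *upper, struct vdev *lower)
{
    upper->succ[upper->nsucc++] = lower;
}

/* Print the hierarchy top-down, indenting one level per layer. */
static void print_hierarchy(const struct vdev *v, int depth)
{
    printf("%*s%s\n", depth * 2, "", v->name);
    for (int i = 0; i < v->nsucc; i++)
        print_hierarchy(v->succ[i], depth + 1);
}

int main(void)
{
    /* Sink devices: the physical disks. */
    struct vdev d0 = { "disk0", VDEV_DISK }, d1 = { "disk1", VDEV_DISK };
    struct vdev d2 = { "disk2", VDEV_DISK }, d3 = { "disk3", VDEV_DISK };

    /* A striping (RAID-0) device aggregates the four disks... */
    struct vdev stripe = { "stripe0", VDEV_STRIPE };
    link_vdev(&stripe, &d0); link_vdev(&stripe, &d1);
    link_vdev(&stripe, &d2); link_vdev(&stripe, &d3);

    /* ...and an encryption device maps its blocks onto the stripe,
     * forming the virtual volume exposed to applications (the source). */
    struct vdev crypt = { "crypt0", VDEV_ENCRYPT };
    link_vdev(&crypt, &stripe);

    print_hierarchy(&crypt, 0);
    return 0;
}
```

Each node would correspond to an instance of a virtualization module, and the succ pointers play the role of the arrows from higher to lower devices in Figure 2.
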
Violin also provides virtual devices with full access to both the request and completion paths of I/Os, allowing for easy implementation of synchronous and asynchronous I/O. Supporting asynchronous I/O is important for performance reasons, but also raises significant challenges when implemented in real systems. Additionally, Violin deals with metadata persistence for the full storage hierarchy, offloading the related complexity from individual virtual devices. It supports persistent objects for storing virtual device metadata and automatically synchronizes these persistent objects to stable storage, resulting in much less effort for the module developer.

Systems such as Violin can be combined with standard storage access protocols, such as iSCSI, to build large-scale distributed volumes. Figure 3 shows a system with multiple storage nodes that provide a common view of the physical storage in a cluster. We believe that future large-scale storage systems will be built in this manner to satisfy application needs in a cost-effective manner. We have implemented Violin as a block device driver under Linux. To demonstrate the effectiveness of our approach in extending the I/O hierarchy, we implemented various virtual modules as dynamically loadable kernel devices that bind to Violin's API. We also provide simple user-level tools that are able to perform on-line, fine-grain configuration, control, and monitoring of arbitrary hierarchies of instances of these modules.

In [6], we evaluate the effectiveness of our approach in three areas: ease of module development, configuration flexibility, and performance. In the first area, we are able to quickly prototype modules for RAID levels (0, 1 and 5), versioning, partitioning, aggregation, MD5 hashing, migration and encryption. Further examples of useful functionality that can be implemented as modules include all kinds of encryption algorithms, storage virus-scanning modules, online migration and transparent disk layout algorithms (e.g. log-structured allocation or cylinder group placement). In many cases, writing a new module is just a matter of recompiling existing user-level library code. Overall, using Violin encourages the development of simple virtual modules that can later be combined into more complex hierarchies. Regarding configuration flexibility, we are able to easily configure I/O hierarchies that combine the functionality of multiple layers and provide complex high-level semantics that are difficult to achieve otherwise. Finally, we use Postmark and IOmeter to examine the overhead that Violin introduces over traditional block-level I/O hierarchies. We find that, overall, internal modules perform within 10% (in throughput) of their native Linux block-driver counterparts.
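
The module API itself is not spelled out in this paper, so the sketch below invents a minimal one for illustration: a module is a small table of callbacks on the block write (request) and read (completion) paths, which is roughly what a transforming layer such as the encryption or hashing modules mentioned above needs. The structure, the callback names and the toy XOR transform are all hypothetical, not Violin's real interface.

```c
/*
 * Hypothetical sketch of a block-level virtualization module interface:
 * a module exports callbacks that the framework invokes per request.
 * Illustration only; this is not Violin's actual API.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct vmodule {
    const char *name;
    /* Transform a block in place on its way down (write) or up (read). */
    void (*on_write)(uint64_t lba, uint8_t *buf, size_t len);
    void (*on_read)(uint64_t lba, uint8_t *buf, size_t len);
};

/* A toy "encryption" layer: XOR with a fixed key byte (for illustration). */
static void xor_transform(uint64_t lba, uint8_t *buf, size_t len)
{
    (void)lba;                        /* a real cipher would use the address */
    for (size_t i = 0; i < len; i++)
        buf[i] ^= 0x5a;
}

static const struct vmodule xor_module = {
    .name     = "xor-crypt",
    .on_write = xor_transform,        /* applied on the request path    */
    .on_read  = xor_transform,        /* applied on the completion path */
};

int main(void)
{
    uint8_t block[16] = "hello, violin!";

    xor_module.on_write(0, block, sizeof(block));   /* "encrypt" */
    xor_module.on_read(0, block, sizeof(block));    /* "decrypt" */
    printf("round trip: %s\n", (char *)block);
    return 0;
}
```

A real module would additionally handle request splitting, asynchronous completion and its own persistent metadata, which is the kind of boilerplate the framework is meant to absorb.
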

[Figure 3. Violin in a distributed environment. (Diagram: cluster nodes A, B and C, each running Violin over iSCSI/TCP; one node also runs an application and a filesystem on top of Violin.)]

3.1.2. Multi-storage-node Virtualization

Our step towards a scalable virtualization platform is to enhance the Violin framework with the necessary primitives to allow multi-node hierarchies. We find that two primitives are required to achieve this: (i) locking of layer metadata, and (ii) control messaging through the distributed I/O stack.

The single-node Violin I/O stack allows the creation of multi-layered storage volumes. These are exported to other storage nodes through block-level network protocols, such as iSCSI. The nodes can further build I/O stacks on top of the remote storage volumes, creating higher-level volumes as needed. This organization is depicted in Figure 3. We find that data I/O requests are correctly forwarded through this distributed hierarchy by the layers distributed to different nodes. Reliability issues regarding node or device failures are effectively handled by fault-tolerance layers (e.g. RAID levels) that map storage volumes between nodes or devices. The system can thus tolerate storage node failures in the same way that a RAID system tolerates device failures. A remaining issue with layer distribution is the consistency of layer metadata: since a layer can be distributed to two or more nodes (i.e. every node runs a separate instance of the layer on the same volume), each layer has to synchronize its metadata to the storage volume on every metadata operation in order to keep the layer metadata consistent between nodes. Since this can be very expensive in terms of performance, the system minimizes the overhead by supporting metadata locking for layers. Thus, the metadata can be locked and updated asynchronously.

A second issue with distributing the virtualization hierarchy is effectively passing control messages to layers both downstream (from a higher layer to a lower one) and upstream (from a lower layer to a higher one). Storage network protocols such as iSCSI offer some support for transferring control commands between remote volumes through SCSI command blocks (SCBs), but only in the downstream direction. We propose more generic and flexible support for control message transfers through the distributed I/O stack. To maintain compatibility with current standards (e.g. iSCSI), we advocate using the common data I/O mechanisms, combined with virtualization layers, to handle control messages. One approach we are considering for the downstream direction is defining special virtual block addresses in each volume, where the usual I/O operations are translated to control messages. For the upstream direction we are considering other options, such as ticket callbacks from higher to lower layers. The upstream direction, however, is currently considered future work.
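
One possible shape for that downstream encoding is sketched below, with an invented address layout: requests that fall into a small reserved window at the top of a volume's virtual address space are intercepted by the layer and reinterpreted as control messages, while all other addresses follow the normal data path. Because these control messages travel as ordinary block writes, they can pass through protocols such as iSCSI unchanged; the window size, operation codes and names here are hypothetical.

```c
/*
 * Hypothetical sketch: encoding downstream control messages as ordinary
 * writes to a reserved virtual block address range. Illustration only.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define VOL_BLOCKS   (1ull << 32)             /* volume size in blocks      */
#define CTRL_WINDOW  256ull                   /* reserved blocks at the top */
#define CTRL_BASE    (VOL_BLOCKS - CTRL_WINDOW)

enum ctrl_op { CTRL_LOCK_RANGE = 1, CTRL_UNLOCK_RANGE, CTRL_SYNC_METADATA };

/* Dispatch one incoming write: data I/O or control message? */
static void handle_write(uint64_t lba, const void *payload, size_t len)
{
    (void)payload;   /* the write's data would carry the message body */

    if (lba >= CTRL_BASE) {
        /* The offset inside the reserved window selects the operation. */
        enum ctrl_op op = (enum ctrl_op)(lba - CTRL_BASE);
        printf("control message: op=%d, payload of %zu bytes\n", (int)op, len);
    } else {
        printf("data write to block %llu\n", (unsigned long long)lba);
    }
}

int main(void)
{
    uint64_t range[2] = { 1000, 1063 };        /* e.g. blocks to be locked */

    handle_write(42, range, sizeof(range));                          /* data path    */
    handle_write(CTRL_BASE + CTRL_LOCK_RANGE, range, sizeof(range)); /* control path */
    return 0;
}
```

Vendor-specific SCSI command blocks could serve the same downstream purpose; the point of the sketch is only that control traffic can reuse the existing data path.
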
3.2. Sharing Virtualization

The second notion of virtualization concerns storage sharing. Sharing is required at two levels: at the block level, where storage volumes are shared, and at the file level, where applications share files.

3.2.1. Shared Block-level Access

Many block-level applications, such as databases, web caches or video servers using raw disk access, must share concurrent access to volumes in a consolidated storage system. Thus, we need to support concurrent volume sharing. We find that a single primitive is required for this purpose: locking support for data block address ranges on the volumes. Using such locks, applications can synchronize accesses to shared blocks on the volume.

We have designed and are currently implementing this feature in Violin as a virtualization module. The "locking" layer captures lock commands and generates distributed control commands in Violin. These commands are forwarded through the virtualization hierarchy to special "lock server" modules, which serialize and enforce the locks. Lock commands are extensions to the block API, but they are compatible with the block device interface of most operating systems. The SCSI interface also supports custom "command blocks" that can transfer such commands.

3.2.2. Shared File Access

Violin achieves shared block-level access and layer distribution within the storage cluster. However, the most common API for accessing storage is the file interface, which we can offer with a filesystem on top of a Violin shared block volume. Our goal for the filesystem implementation is that the filesystem itself should be as simple as possible, while independent functionality should be moved to the storage back-end. Our approach to the filesystem design is similar to Frangipani [20], where many independent filesystem instances run on a single shared block volume without direct communication among the instances themselves. The filesystem instances share the blocks of the volume as any other block-level application, using locks on the volume's blocks to maintain consistency of the filesystem metadata. This organization initially simplifies the filesystem design. However, in contrast to previous work, our goal is to simplify filesystem design and implementation by reconsidering the division of functionality between the filesystem and the block-level layer.

We find that the block-level layer can be enhanced with three types of support: (a) block allocation, (b) block-range locking, and (c) metadata consistency. The free-block allocation and the lock services greatly reduce the filesystem complexity in comparison to other distributed filesystems. The filesystem code is only aware of a large shared volume and is not aware of other instances running on it. It simply locks the blocks it needs for atomic metadata updates, and uses the free-block allocation commands to allocate or free data blocks on the shared volume.
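
The sketch below illustrates what such an enhanced block-level interface might look like to a filesystem instance, and how it could be used for an atomic metadata update: lock the affected metadata range, obtain a free block from the volume's allocation service, write the update, then unlock. The function names and signatures are invented placeholders, not the actual Violin extensions.

```c
/*
 * Hypothetical sketch of the block-level support a simple pass-through
 * filesystem could rely on: block-range locking plus free-block allocation.
 * Names and signatures are illustrative, not the actual Violin interface.
 */
#include <stdint.h>
#include <stdio.h>

/* --- services assumed to be offered by the shared block volume --- */
static int vol_lock_range(uint64_t lba, uint64_t nblocks)
{
    printf("lock   [%llu, +%llu)\n", (unsigned long long)lba,
           (unsigned long long)nblocks);
    return 0;                   /* would block until the range is granted */
}

static void vol_unlock_range(uint64_t lba, uint64_t nblocks)
{
    printf("unlock [%llu, +%llu)\n", (unsigned long long)lba,
           (unsigned long long)nblocks);
}

static uint64_t vol_alloc_block(void)
{
    static uint64_t next = 5000;    /* stand-in for the free-block service */
    return next++;
}

static void vol_write_block(uint64_t lba, const char *buf)
{
    printf("write  block %llu: %s\n", (unsigned long long)lba, buf);
}

/* --- one filesystem-level operation built on top --- */
/* Create a file: lock the directory's metadata block, allocate a data block,
 * record the new entry, then release the lock. Other filesystem instances
 * sharing the volume are serialized by the same lock.                      */
static int fs_create_file(uint64_t dir_meta_lba, const char *name)
{
    char dirent[64];

    if (vol_lock_range(dir_meta_lba, 1) != 0)
        return -1;

    uint64_t data_lba = vol_alloc_block();
    snprintf(dirent, sizeof(dirent), "%s -> block %llu", name,
             (unsigned long long)data_lba);
    vol_write_block(dir_meta_lba, dirent);

    vol_unlock_range(dir_meta_lba, 1);
    return 0;
}

int main(void)
{
    return fs_create_file(128, "report.txt");
}
```

Because allocation and locking live below the filesystem, each instance can stay unaware of its peers, as described above.
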

These extensions allow us to build a simple pass-through filesystem that provides only naming and file-level access rights.

4. Related Work

Storage Tank (ST) is a scalable distributed file and storage management system running on heterogeneous machines over storage area networks [14]. It offers virtualization and centralized policy-based management of storage resources. It separates data storage from metadata management and enables clients to access data directly for improved scalable performance. Although we share several goals with ST, the main difference of our system is its design approach, which aims at pushing significant storage system functionality behind the block-based interface. A second difference is the versatility of our system in serving diverse application needs through our extensible design. Another difference is the architecture of our distributed filesystem. Our filesystem is based on independent instances operating on a shared block volume, similar to Frangipani [20]. In contrast, the ST filesystem has metadata servers, uses many disk volumes, and each filesystem instance is aware of the others and directly cooperates with them.

Object-based storage advocates that the traditional block-based interface of storage devices should be changed to directly support the management of objects [15]. Although blocks offer fast, scalable access to shared data, a file server is currently needed to authorize the I/O and maintain the metadata, at the cost of limited security and data sharing. Support for data objects essentially moves issues of internal space management, quality-of-service guarantees and access authorization to the storage devices. Although decentralized storage management is one of our goals, we aim at supporting metadata management on top of commodity devices rather than modifying their hardware interface. Additionally, we strive to hide storage management functions behind the block-based interface whenever possible.

Cluster filesystems such as GPFS offer shared access to distributed homogeneous resources [18]. Metadata management is distributed across multiple file server nodes, and shared data can be accessed in parallel from different storage devices. Even though cluster filesystems successfully manage storage space in data centers, their design is monolithic and non-extensible. Furthermore, such systems offer little flexibility in reorganizing data or reconfiguring devices, making administration tasks tedious.

Stackable filesystems facilitate file system development through incremental refinement of features and extensible interfaces [23]. In the present paper, we advocate using software modularity and extensibility to build sophisticated storage system functionality while preserving the block-based interface. In essence, our approach is complementary to the stackable filesystem design, since we target different layers of the data storage hierarchy.

5. Conclusions

In this paper we identify two main challenges for consolidated storage in a data center: (i) the scalable, low-cost platform issue and (ii) the software infrastructure issue. Our goal is to address the latter. We propose a virtualization infrastructure to tackle the storage management, flexibility and reconfiguration issues. We have designed and implemented Violin, a virtualization framework for (i) building and (ii) combining block-level virtualization functions. We have also designed and are currently implementing extensions that will allow distributed virtualization stacks and sharing at both the block and the file level.

REFERENCES

[1] A. Acharya, M. Uysal, and J. Saltz. Active Disks: Programming Model, Algorithms and Evaluation. In Proc. of the 8th ASPLOS, pages 81-91, San Jose, California, Oct. 3-7, 1998.
[2] M. de Icaza, I. Molnar, and G. Oxman. The Linux RAID-1,-4,-5 code. In LinuxExpo, Apr. 1997.
[3] EMC. Enginuity(TM): The Storage Platform Operating Environment (White Paper). http://www.emc.com/pdf/techlib/c1033.pdf.
[4] EMC. Introducing RAID 5 on Symmetrix DMX. http://www.emc.com/products/systems/enginuity/pdf/H1114_Intro_raid5_DMX_ldv.pdf.
[5] Enterprise Volume Management System. evms.sourceforge.net.
[6] M. D. Flouris and A. Bilas. Violin: A Framework for Extensible Block-level Storage. In Proc. of the 13th IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2005), Monterey, CA, Apr. 11-14, 2005.
[7] FreeBSD: GEOM Modular Disk I/O Request Transformation Framework. http://kerneltrap.org/node/view/454.
[8] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka. A Cost-Effective, High-Bandwidth Storage Architecture. In Proc. of the 8th ASPLOS, pages 92-103, San Jose, California, Oct. 3-7, 1998.
[9] J. Gray. Storage Bricks Have Arrived. Invited talk at the 1st USENIX Conference on File and Storage Technologies (FAST '02), 2002.
[10] HP. OpenView Storage Area Manager. http://h18006.www1.hp.com/products/storage/software/sam/index.html.
[11] E. K. Lee and C. A. Thekkath. Petal: Distributed Virtual Disks. In Proc. of ASPLOS VII, pages 84-93, Oct. 1996.
[12] G. Lehey. The Vinum Volume Manager. In Proc. of the FREENIX Track (FREENIX-99), pages 57-68, Berkeley, CA, June 6-11, 1999. USENIX Association.
[13] J. MacCormick, N. Murphy, M. Najork, C. A. Thekkath, and L. Zhou. Boxwood: Abstractions as the Foundation for Storage Infrastructure. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04). USENIX, Dec. 6-8, 2004.
[14] J. Menon, D. A. Pease, R. Rees, L. Duyanovich, and B. Hillsberg. IBM Storage Tank - A heterogeneous scalable SAN file system. IBM Systems Journal, 42(2):250-267, 2003.
[15] M. Mesnier, G. R. Ganger, and E. Riedel. Object-based storage. IEEE Communications Magazine, 41(8):84-90, Aug. 2003.
[16] D. Patterson. The UC Berkeley ISTORE Project: Bringing availability, maintainability, and evolutionary growth to storage-based clusters. http://roc.cs.berkeley.edu, January 2000.
[17] B. Phillips. Industry Trends: Have Storage Area Networks Come of Age? Computer, 31(7):10-12, July 1998.
[18] F. Schmuck and R. Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proc. of the USENIX Conference on File and Storage Technologies (FAST '02), pages 231-244, Monterey, CA, Jan. 2002.
[19] D. Teigland and H. Mauelshagen. Volume Managers in Linux. In Proc. of the USENIX 2001 Technical Conference, June 2001.
[20] C. A. Thekkath, T. Mann, and E. K. Lee. Frangipani: A Scalable Distributed File System. In Proc. of the 16th Symposium on Operating Systems Principles (SOSP '97), pages 224-237. ACM Press, Oct. 5-8, 1997.
[21] Veritas. Storage Foundation(TM). http://www.veritas.com/Products/www?c=product&refId=203.
[22] Veritas. Volume Manager(TM). http://www.veritas.com/vmguided.
[23] E. Zadok and J. Nieh. FiST: A Language for Stackable File Systems. In Proc. of the 2000 USENIX Annual Technical Conference, pages 55-70. USENIX Association, June 18-23, 2000.
