Automated Performance Control in a Virtual Distributed Storage System
Total Page:16
File Type:pdf, Size:1020Kb
Automated Performance Control in a Virtual Distributed Storage System H. Howie Huang 1, Andrew S. Grimshaw Department of Computer Science University of Virginia [email protected] Abstract on the fact that large hard disks have been prolifically distributed on PCs at the edge of the network, thanks to Storage demand in large organizations has grown rapidly falling prices. We refer to storage on desktop rapidly in the last few decades. In this paper, we machines as desktop storage. The abundance of describe Storage@desk, a new virtual distributed desktop storage, unfortunately, does not lead to great storage system that utilizes a large number of utilization - a typical desktop machine in an exemplar distributed machines to provide storage services with organization contains hundreds of gigabytes with more quality of service guarantees. We present the than half left unused [1]. Further, resources available Storage@desk architecture and core components. The on desktop machines are in abundance: (1) dual-core machines on which Storage@desk relies exist in an CPUs are appearing on desktop machines with their environment that is unreliable and dynamic. We utilize CPUs having clock rates over 3 GHz; (2) RAM of 1GB replication to provide high levels of availability and or 2GB is not unusual; (3) 1 Gigabit Ethernet has reliability. Our evaluation reveals that Storage@desk become a common network interface on new PCs, achieves better read and write performance compared replacing fast Ethernet. In general, desktop users do not to CIFS. As Storage@desk serves clients and fully utilize the computing power, disk space, and applications on shared storage resources, it is crucial network bandwidth at all times [1-3], which presents an to ensure predictable storage access even when the opportunity to take advantage of the abundant yet idle workloads are unknown a priori. To this end, we take a resources. control theoretic approach for automated performance Storage@desk will not only provide a useful storage control in Storage@desk. Given a reference value, the service, but one with quality of service (QoS) feedback controller is able to effectively regulate guarantees in terms of capacity, life time, availability, service requests to virtual storage resources under reliability, and performance. As the machines on which various scenarios. Storage@desk relies exist in an environment that is unreliable and dynamic, we utilize replication to 1. Introduction provide high levels of availability and reliability. Because Storage@desk utilizes shared storage Storage demand in large organizations has grown resources, predictable storage access even when the rapidly in the last few decades. Over the years workloads are unknown a priori is challenging. We distributed storage – network-attached storage (NAS), take a control theoretic approach for automated and storage area network (SAN) – has emerged as a performance control in Storage@desk. Feedback standard practice to provide high performance, fault control has shown success on performance control of resilience, and data integrity. We refer to this as various computer systems, including replica managed storage. However, depending on the quality, management system [4], email server [5], web server the vendor, discounts, and the target market, managed [6], and real-time systems [7]. Given a reference value, storage remains expensive despite the hardware cost of the feedback controller can effectively regulate access storage falling every year. There has been a lot of bandwidth to virtual storage resources. interest in researching new methods to relieve the The remainder of this paper has the following increasing storage demand. In this paper, we present structure. We introduce the background materials in Storage@desk, a new virtual distributed storage Section 2. Storage@desk architecture and core system, which has the ability to harvest underutilized components are described in Section 3. We discuss the storage resources in the existing information evaluation results in Section 4. Finally, we present the infrastructure of large organizations. This idea is based related work and conclusion in Section 5 and 6, respectively. 1 2. Background applications (e.g., video-on-demand) demand high performance services of great urgency, while others Storage@desk is motivated by three key (e.g., text processing) may tolerate less perfect observations: strong demand for disk storage, abundant performance. How to achieve predictable performance yet idle resources on desktop machines, and various is a challenging task for two reasons. First, in essence performance requirements for storage services. Storage@desk is a virtual storage system shared by a large number of clients and applications. It would be 2.1 Strong demand for disk storage preferable to isolate one client from others; otherwise an I/O surge from one client would negatively affect Data play a critical role in enterprise and scientific others sharing the same storage resource. Second, the computing. According to a study from Berkeley [8], the workloads are generally unknown a priori . Without amount of new information produced worldwide is accurate information on access patterns, human estimated to grow about 30% a year since 1999, with 5 regulation is inherently slow in reaction to changes in million terabytes in 2002. More than 90% of that data the workload. To address this problem, we will design was estimated to be stored on magnetic media (e.g., a feedback controller for automated performance hard disks, tapes), and hard disks accounted for most control in Storage@desk. storage, nearly 2 million terabytes. As data volumes continue to explode, there is always a consistently 3 Storage@desk architecture strong demand for cost effective, massive-scale disk storage. Storage@desk achieves storage virtualization by utilizing the iSCSI protocol [11]. We choose iSCSI 2.2 Abundant yet idle resources on desktop because this IETF standard can transport the SCSI machines commands and data over TCP/IP networks. From the perspective of a client, Storage@desk storage appears Today, the advent of hard drive technology is as a virtual volume that consists of an array of fixed- capable of packing hundreds of gigabytes onto size blocks. Therefore, the clients can transparently inexpensive hard disks, and these large disks are access data in a uniform, standard-based fashion. A pervasive. The same Berkeley study estimated that client treats virtual volumes as locally-attached hard 90% of 9.9 million terabytes of disk storage shipped in drives, isolating from the distributed resource 2002, or 9 million terabytes, sits in PCs, workstations, management behind the scene. Once logging on to a and laptops. Clearly, PCs are holding the greatest virtual volume, the client can make partitions, create amount of storage resources even though those various file systems (e.g., NTFS, ext2, ext3), make resources are distributed and managed individually. directories and subdirectories, and perform file Unfortunately, the utilization of desktop storage is operations (e.g., create, copy, move, modify, delete). low. Our 2005 study [1] of 729 desktop machines In Storage@desk architecture presented in Fig. 1, revealed that on average 62% of a machine’s raw disk iSCSI servers (including controllers on them) capacity was unused. Previous research studies [9, 10] implement the iSCSI layer and serve the requests; showed that the unused portion of disk space from storage machines as resource providers save data on 4800 desktop machines was 49%, 50%, and 58% local hard drives; and one or more databases maintain respectively in 1998, 1999, and 2000. These numbers all the metadata in the system. indicate a clear uptrend of unused disk space on an Upon receiving a client request, the iSCSI server will average machine over the years. As this trend continues need to contact the databases for the necessary in the future, we see Storage@desk is a good metadata if they are not up to date. Otherwise, the complement to managed storage by making the best use iSCSI server may directly forward the request to the of unutilized desktop storage in large organizations. corresponding storage machines to read and write the blocks. 2.3 Various performance requirements for In this section, we present databases and metadata management in Section 3.1, storage machines in Storage Service Section 3.2, and iSCSI servers in Section 3.3. We discuss the feedback controller in Section 3.4. In an organization, different clients have theirs own performance requirements. Furthermore, a single client may have different performance requirements for multiple applications at various times. Some 2 Client Services and many physical chunks. The mapping table reflects Storage this relationship by mapping a replica of a virtual Machine (3.2) chunk to a physical chunk. The number of mappings iSCSI Server (3.3) Client for a virtual volume is the multiplication of the number Controller (3.4) Storage of virtual chunks and the replication degree. With the Machine mapping from the mapping table, an iSCSI server can iSCSI Server locate physical chunks on one or many storage Client Controller Storage machines to read and write blocks in a virtual volume. Machine iSCSI iSCSI iSCSI Server Volume Controller Storage Volume ID Name Replication Degree Virtual Chunks QoS Client Machine iSCSI Server Controller VolumeVolume Mapping DatabaseDatabase (3.1)(3.1) Volume Replica Virtual Chunk Machine Physical Chunk Fig. 1. Storage@desk architecture.