Disk Mirroring with Alternating Deferred Updates
Christos A. Polyzois    Anupam Bhide    Daniel M. Dias
IBM T. J. Watson Research Center
P.O. Box 704, Yorktown Heights, NY 10598
{christos, dias}@watson.ibm.com

Abstract

Mirroring is often used to enhance the reliability of disk systems, but it is usually considered expensive because it duplicates storage cost and increases the cost of writes. Transaction processing applications are often disk arm bound. We show that for such applications mirroring can be used to increase the efficiency of data accesses. The extra disk arms pay for themselves by providing extra bandwidth to the data, so that the cost of the overall system compares favorably with the cost of non-redundant disks. The basic idea is to have the mirrored disks out of phase, with one handling reads and the other applying a batch of writes. We also present an efficient recovery procedure that allows reconstruction of a failed disk while guaranteeing the same level of read performance as during normal operation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 19th VLDB Conference, Dublin, Ireland, 1993

1 Introduction

Two goals have driven disk systems research in recent years: reliability and performance. Reliability is typically achieved through redundancy, e.g., mirrored disks or redundant arrays of inexpensive disks (RAID). Mirroring has good reliability characteristics and read performance, but it incurs a disk arm overhead for writes and a significant storage overhead for replicating the data. Compared with RAID, mirroring has simpler and more efficient recovery from disk failure. RAID architectures [15] try to ameliorate the space overhead of mirroring at the expense of increasing disk arm overhead under normal operation and complicating recovery.

The selection among reliable disk architectures depends on the application characteristics. An application may be:

1. Disk capacity bound: For example, storage of historical data requires only disk space; neither performance nor reliability is critical. For such applications the goal is minimal $/GB, so that large, cheap disks with inexpensive controllers can be used. Redundancy is not important, since backups can be kept on tapes.

2. Sequential bandwidth bound: A logging process is typically sequential bandwidth bound. For such applications file striping improves bandwidth; striping data on a RAID-5 array [15] can provide both high sequential bandwidth and reliability at a low cost.

3. Random access bound: On-line transaction processing applications usually make many small random accesses. As indicated in Reference [9], installations often leave as much as 55% of their disk space empty, to ensure that the performance for access to the actual data is satisfactory. Thus, the number of disks required is often determined by the number of arms necessary to support the workload (disk arm bound), not by the space occupied by the data. More efficient utilization of the disk arms can increase space utilization and actually reduce the number of disks required.

Our focus in this paper is on case (3) above, i.e., random access bound applications. In Reference [4] it is shown that mirroring is better than RAID-5 in such cases. We show that the performance of mirroring can be enhanced substantially.
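The "disk arm bound" regime of case (3) can be made concrete with a back-of-the-envelope sizing calculation. The sketch below is illustrative, not from the paper; the workload numbers are hypothetical, and the ~35 I/Os per second per arm is simply the reciprocal of a 28.4 msec average random access time (the default in Table 1).

```python
# Illustrative sizing: a workload is "disk arm bound" when the arm (IOPS)
# requirement, not the data volume, dictates how many disks are needed.

def disks_required(data_gb, iops_needed, disk_capacity_gb, disk_iops):
    """Disks needed to satisfy both the space and the arm (IOPS) demands."""
    by_space = -(-data_gb // disk_capacity_gb)   # ceiling division
    by_arms = -(-iops_needed // disk_iops)
    return max(by_space, by_arms), by_space, by_arms

# Hypothetical OLTP workload: modest data volume, many small random reads.
# One arm sustains roughly 1 / 0.0284 s ~= 35 random I/Os per second.
total, by_space, by_arms = disks_required(
    data_gb=40, iops_needed=1000, disk_capacity_gb=10, disk_iops=35)
print(total, by_space, by_arms)   # 29 4 29 -- arm bound: 29 arms vs 4 for space
```

With these numbers the installation buys 29 disks for their arms while only 4 would hold the data, which is exactly the empty-space phenomenon cited above.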
We exploit the redundancy inherent in mirroring to gain performance by taking advantage of technology trends: large memories and efficient sequential disk I/O.

Disk access times have not improved significantly in the past decade, so research efforts have recently focused on delaying writes and applying them to the disk in batches, in order to benefit from the higher performance of sequential vs. random access. Such efforts include [1, 8, 16, 17, 18, 19]. As large memories become available, it is possible to delay writes and apply them in a batch; therefore, we expect this technique to gain significance. A potential drawback is the response time for read operations: reads must be suspended during the application of a batch, which may in turn affect transaction response time. Our scheme exploits the availability of two mirrored arms, which ensures good read response time by alternating batch updates between the two mirrors. We address batch updates on a single disk in Section 5.

Our goal is to provide both availability and performance. Since redundant data must be stored to ensure availability, this data might as well be used to enhance performance. We start from the observation that mirroring and batch updates are complementary, in the sense that the weak point of batch updates (read response time) is the strong point of mirroring [2, 3]. Careful scheduling of disk accesses in the presence of mirroring can lead to increased disk arm utilization, and thus to lower cost.

Our technique also gives rise to an efficient scheme for recovery from disk failure. In contrast to previous schemes, our method does not hurt read access to data that had a copy on the failed disk.

Section 2 describes our technique during normal operation. Section 3 evaluates the fault-free performance of the proposed technique using a simple analytical model (Section 3.1) and simulation (Section 3.2). Section 4 presents our procedure for recovering from single disk failure and estimates the recovery time. Finally, the region of applicability of the proposed scheme is examined in Section 5.

2 Normal Operation

We assume that the system has a large buffer pool. Transaction processing systems typically employ a write-ahead log, so that data pages can be written to their home location on disk asynchronously. For random access bound applications that do not use a write-ahead log, non-volatile, reliable memory may be required. We return to this issue in Section 5. The buffer can be located either in main memory or at the disk controller; our technique is applicable in both cases.

During normal operation, updates are not applied to disks immediately. Periodically, when a certain number of updates have accumulated in the buffer, they are sorted in physical layout order and applied to the disks as a batch. Algorithms for applying updates in batches efficiently have been discussed and analyzed in [17]. The improvement in bandwidth with respect to random accesses can be significant.

While a disk is not applying updates, it services read requests. Thus, each disk alternates between time spans of write-only and read-only activity. Let the period during which the updates accumulate be T, and consider a pair of mirrored disks A and B. Both disks apply the updates in batches. However, they do not apply the batches simultaneously. Instead, they operate at a phase difference of 180°. For example, assume that disk A starts applying updates at time 0. Disk B will start applying updates at time T/2. Disk A applies the second batch at time T, then disk B applies the second batch at time 3T/2, and so on. Figure 1 illustrates this alternation with a timing diagram, where the high value represents write activity and the low value represents read activity for the corresponding disk.

Figure 1: Read/write alternation. [Timing diagram; time axis marked at 0, T/2, T, 3T/2, 2T.]
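The alternating schedule can be sketched as a small simulation. This is an illustrative sketch, not the paper's implementation: the disks are in-memory dictionaries, the function names are our own, and each batch is assumed to finish within T/2.

```python
# Alternating deferred updates on a mirrored pair (illustrative sketch).
# Disk A installs its batch during [kT, kT + T/2); disk B during
# [kT + T/2, (k+1)T) -- a 180-degree phase difference.

T = 1.0  # batch accumulation period

def writing_disk(t):
    """Which mirror is in its write-only span at time t (batch <= T/2)."""
    return 'A' if (t % T) < T / 2 else 'B'

def route_read(t):
    """Route a read to the mirror currently in read-only mode."""
    return 'B' if writing_disk(t) == 'A' else 'A'

def apply_batch(disk, pending):
    """Install accumulated updates sorted in physical layout order
    (here: by cylinder, then block), so the arm sweeps sequentially."""
    for (cyl, blk), data in sorted(pending.items()):
        disk[(cyl, blk)] = data

disk_a, disk_b = {}, {}
pending = {(7, 3): b'x', (2, 1): b'y', (7, 0): b'z'}
apply_batch(disk_a, pending)   # A installs the batch at time 0
print(route_read(0.25))        # during A's write span, reads go to B
apply_batch(disk_b, pending)   # B installs the same batch at time T/2
```

Note that both mirrors install the identical batch, merely half a period apart, so consistency between the copies lags by at most T/2.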
If each disk can apply the batch in time less than T/2, then the write-only time spans of the two disks do not overlap, so that there is always at least one disk available to service read requests; there may even be some time when both disks are servicing reads. This case is shown in Figure 1. Read requests are routed to the disk operating in read-only mode and are serviced immediately. Thus, batches can be made large to gain efficiency, without affecting read response time. The limiting factor for batch size is the amount of memory available.

In cases of extremely write-intensive loads, it may take each disk longer than T/2 to install a batch, so that the write time spans of the two disks will overlap. This situation is shown in Figure 2. Read requests that arrive during the overlap period are deferred until one of the disks finishes its write batch, which hurts read response time. Alternatively, such reads can be serviced immediately, thus disrupting the efficient installation of updates. If the second option is adopted, the reads can be routed to the disk whose head is closest to the cylinder where the read must be performed.

Figure 2: Overlapping write periods. [Timing diagram; time axis marked at 0, T/2, T, 3T/2, 2T.]

Table 1: System Parameters († default values)

    Parameter             Symbol       Value
    No. of Cylinders      cyls         840
    No. of Surfaces       surfs        20
    Block Size            blksiz       4K
    Blocks per Track      blkstrk      8†
    Rotation Time         t_rot        16.6 msec
    Half Rotation Time    t_hrot       8.3 msec
    Average Seek          t_avseek     18 msec
    Block Transfer Time   t_xfer       2 msec
    Avg. Random Access    t_ran        28.4 msec
    Single Track Seek     t_trkseek    5.47 msec
    End to End Seek       t_endseek    35 msec
    Memory Fraction       f            1%†
    Write Ratio           w            0.5†

3 Evaluation of Fault-Free Performance

3.1 Simple Analytical Model
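The closest-head routing rule for the overlap case in Section 2 can be sketched as follows. The function name and the use of cylinder distance as a seek proxy are our illustrative assumptions, not the paper's implementation.

```python
# Closest-head read routing during overlapping write periods (illustrative).
# When both mirrors are mid-batch and a read is serviced immediately, send
# it to whichever arm needs the shorter seek to the target cylinder.

def route_read_overlap(head_a, head_b, target_cyl):
    """Pick the mirror whose head is nearest the read's cylinder
    (ties go to disk A, arbitrarily)."""
    return 'A' if abs(head_a - target_cyl) <= abs(head_b - target_cyl) else 'B'

# Disk A's arm sits at cylinder 100, disk B's at 700; a read for cylinder
# 650 interrupts disk B, whose seek is far shorter.
print(route_read_overlap(100, 700, 650))   # -> B
```

Because each disk applies its batch in sorted layout order, its head position is predictable, which is what makes this distance estimate cheap to maintain.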