

On the Performance and Scalability of a Data Mirroring Approach for I2-DSI

Bert J. Dempsey, Debra Weiss
University of North Carolina at Chapel Hill
[email protected]

Abstract

This paper presents work on a scalable design for the automated synchronization of large collections of files replicated across multiple hosts. Unlike conventional mirroring tools, our approach addresses multiple-site file synchronization by capturing file-tree update information in an output file during an initial file synchronization session. Once the update file is available, it can be transmitted over the network using parallel point-to-point file transfers or reliable multicasting and then processed at the remote sites. This paper outlines how the above concept has been implemented as a modification to the open-source mirroring tool, rsync. It then presents performance experiments designed to characterize the server-side processing costs and network throughput requirements under realistic workloads on large storage servers. The performance experiments use the WAN testbed of the Internet2 Distributed Storage Infrastructure (I2-DSI) project, the context for this work, and the results provide guidance on the scalability of I2-DSI using the proposed data mirroring scheme.

Keywords: mirroring, distributed storage, replication, Internet2, reliable multicast, I2-DSI.

1. Introduction

Client-server applications on the server benefit from the use of content replication through improved access for widely distributed clients and load balancing across different servers and network paths. Content replication can be employed either in front of the server (proxy-based caching of server responses) or behind the server (mirroring of source objects). On-demand replication of server responses at proxy-based caches is widely deployed within the current WWW, where hierarchical or distributed arrangements of cooperating caches handle HTTP requests and, less commonly, those of other access protocols. While an important component of the current and future Internet, proxy-based caching is tied to specific protocols and has well-studied performance limitations. Problematic aspects of proxy caching include delays and bottlenecks caused by hierarchical processing of client requests, the limits of caching dynamically generated responses, and the tendency of some content providers to eschew caching in order to retain server-side control [1].

Mirroring source objects across server platforms is another approach to content replication. Among its key advantages are (1) the ability to replicate server-side functionality such as server-managed secure logins, (2) replication benefits for any client-server protocol, not just HTTP services, and (3) a basis for constructing manageable, deterministic control of the replication process for the content provider and service provider. In the current Internet, server-side replication is most commonly employed for (1) the limited (but valuable) case of sharing highly popular static file archives served by FTP or HTTP or for (2) more general replication between servers controlled by a single organization where strong uniformity can be imposed on the server platforms.

The context of our work is the Internet2 Distributed Storage Infrastructure (I2-DSI) project [2], which is developing replication middleware to extend source-object replication to loosely-coupled, heterogeneous platforms. I2-DSI relies on a set of dedicated, geographically distributed replication hosts (I2-DSI servers) in the network on which user content will be hosted. Application developers publish collections of source objects (I2-DSI content channels [3]) in conjunction with the I2-DSI interface describing supported content types and the standard services available on the replication hosts. Current content channels are linked at [2]. As part of Internet2, the I2-DSI project aims to develop an open architecture for replication services to serve the 140 Internet2 universities and, subsequently, the wider Internet community. Proprietary solutions for application hosting and replication are now appearing in commercial efforts [4] [5].

In this paper a design is presented for scalable replication of large collections of files in support of file-oriented replication across the I2-DSI servers.¹ Our design is a novel adaptation of an existing data mirroring tool, rsync, that runs in user space and manages data replication through operating system calls to the filesystem. Alternative approaches include block-level management of file replication either at the middleware layer [6] or within a distributed filesystem such as NFS or AFS. A primary advantage of our approach is that it provides a highly portable and easily modified data mirroring mechanism. Application-level solutions, however, are unlikely to match the throughput achievable with kernel-level solutions. This paper provides empirical evidence that the file replication solution proposed here for I2-DSI will provide adequate performance for the current I2-DSI testbed and scale up with the number of I2-DSI replication hosts.

¹ I2-DSI ultimately expects to support multiple replication mechanisms in order to enable the largest possible set of applications to benefit from its replication services.

The rest of the paper proceeds as follows. Section 2 presents the motivation and design of a novel file mirroring approach and the scenario for its use within I2-DSI. In Section 3 performance aspects of the solution are explored through controlled experiments that measure server load and the available network bandwidth using the I2-DSI project testbed. Section 4 gives additional discussion of related work, and Section 5 summarizes the findings of the paper.

2. Rsync+

The data mirroring tool developed here, rsync+, is a modification to the open-source file-mirroring tool, rsync. Rsync [7] is a widely used data mirroring tool designed to automate synchronization of file hierarchies residing on the same machine or, alternatively, on machines across the Internet. Rsync has gained popularity due to its rich feature set for controlling the synchronization process, and its open-source C code implementation has enabled it to be ported to many Unix platforms and MS NT. A distinctive feature of rsync (and thus rsync+) is its use of block-level checksumming to perform differential file update when an existing file has been updated, i.e., has a new disk image under the same file name.
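To make the block-checksum differential update concrete, the following command sketch contrasts rsync's default differential transfer with a whole-file transfer of the same tree. The host and module names are hypothetical, and the --stats and --whole-file options are standard rsync flags used here only for illustration; they are not part of the rsync+ modifications described below.

    # Differential update: for files whose disk image has changed, only the
    # blocks that differ are sent; --stats reports the bytes actually transferred.
    rsync -a --stats src/ mirror.example.edu::channel/src/

    # The same synchronization with block checksumming disabled, so that
    # changed files are retransmitted in full.
    rsync -a --whole-file --stats src/ mirror.example.edu::channel/src/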

In its current form, rsync synchronizes remote directory hierarchies during an interactive network session involving two sites, the master site and a slave. Thus, for example, its use in the I2-DSI context would follow the scenario shown in Figure 2.1. In the scenario shown, a channel provider places new or modified files for a file-based channel in the directory src/ on the master I2-DSI server, denoted as M. For our purposes, the import mechanism here is not specified. In the current I2-DSI architecture, modifications to the source objects in a channel are localized to a master channel copy at one I2-DSI site, here M. The set of all other I2-DSI replication servers on which the channel is carried are shown within the shaded area as S1, S2, and S3. On M, then, a script managing channel replication invokes a set of rsync sessions, one for each remote I2-DSI replication host that carries the channel. As shown, the clients for this channel access the replication hosts nearest to them using mechanisms that are part of the I2-DSI replication architecture.

Figure 2.1: An Rsync Scenario for Content Replication in I2-DSI. (The channel provider imports content into src/ on the master M; M runs "rsync src/ S1::src/", "rsync src/ S2::src/", and "rsync src/ S3::src/" to update the slave hosts S1, S2, and S3, whose clients resolve to the nearest replication host.)

The modifications developed for rsync+ are a response to two factors in the scenario in Figure 2.1 that result in inefficiency and limit scalability. First, M performs the same processing on the src/ directory for each of the three rsync sessions required to update the slave hosts (S1, S2, and S3). This processing includes checking file status information and, if a file has been modified, computing checksums, and thus it can be compute-intensive for large file trees and/or large individual file objects. With multiple point-to-point rsync sessions, the processing load on M increases with the number of remote servers being updated. Secondly, M transmits identical data streams over the network in each of the three master-slave rsync sessions (S1, S2, S3).

To enable efficient multiple-site synchronization, the rsync source code was modified to create new modes of operation, as shown in Figure 2.2. Options in rsync+ (the -f and -F flags) enable the end-to-end update process to be decomposed into three independent actions:

Step 1: the generation of replica update information at the master server, here M,
Step 2: the network transport of the update information from the master to the slave sites, and
Step 3: the processing of the update information at each slave site in order to synchronize the slave's files with those at the master.
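A minimal command sketch of this decomposition, following the invocations shown in Figure 2.2, is given below. The -F and -f modes are taken from the rsync+ design; the name and location of the generated updates file and the transfer commands in Step 2 are assumptions made only for illustration.

    # Step 1, at the master M: a single pass over the file tree produces an
    # updates file describing the changes needed to bring src/ into sync.
    rsync+ -F srcMaster/ src/              # writes the updates file

    # Step 2, at M: deliver one copy of the updates file to each slave site.
    # Parallel point-to-point copies are shown here; a reliable multicast
    # send (the "(m)ftp" of Figure 2.2) could replace this loop without
    # changing Steps 1 or 3.
    for slave in S1 S2 S3; do
        rcp updates "$slave":/channel/updates &
    done
    wait

    # Step 3, at each slave: apply the locally received updates file.
    rsync+ -f src/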


The key feature of rsync+ is then that it decouples network communication (Step 2) from filesystem update actions (Steps 1 and 3). This decoupling eliminates redundant processing at the master site and enables the use of parallel network (TCP) connections or a reliable multicast transport protocol to deliver data (Step 2) from a single master site to the multiple slave sites, resulting in a scalable approach to multiple-site file synchronization.

In Figure 2.2, it should be noted that the directory srcMaster/ that contains the latest updates may be located on M (as suggested in the figure) or at a remote machine. In particular, srcMaster/ can be on the channel provider machine and rsync+ used as the import mechanism for updating src/ on M. Full details on the rsync+ implementation are in [8].

Figure 2.2: Rsync+ Scenario for Content Replication in I2-DSI. (The channel provider supplies updates to srcMaster/; at M, "rsync+ -F srcMaster/ src/" generates the updates file, which is delivered by (m)ftp to S1, S2, and S3; at each of S1, S2, and S3, "rsync+ -f src/" applies the updates.)

3. Performance Analysis

This section focuses on empirical results gathered to characterize the performance of, and to provide insight into the scalability of, the data mirroring scenario depicted in Figure 2.2. The data represent performance measurements gathered from network and server experiments running on high-performance servers within the I2-DSI WAN testbed. In particular, the results quantify the balance between the server processing throughput for file synchronization actions, or server throughput (Steps 1 and 3 in the discussion above), and the achievable network throughput (Step 2 above) when using parallel TCP connections for data delivery. Understanding this balance is crucial in identifying the current system bottlenecks, in planning future system capacity, in assessing the impact that hardware advances in server and network technologies will have, and in understanding the scale at which deployment of reliable multicast protocols becomes valuable and ultimately critical for the I2-DSI system of replication hosts.

The section has three subsections: an instrumented mirroring experiment is presented that measures server throughput under rsync+, followed by data from network throughput measurements using the well-known ttcp tool [9], and finally a short discussion on the scalability implications of the server and network measurements.

3.1 Experiment 1: Server Throughput

Methodology: In this section, performance data from a data mirroring experiment utilizing rsync+ is presented. The experiment created a mirror site of approximately 8 gigabytes (GB) of Linux-related files from a busy WWW archive site (the channel provider in the language of Figure 2.2), metalab.unc.edu. The Linux archives at UNC MetaLab were chosen because the file archive is diverse (e.g., source code, program executables, documentation, html, and graphics files) and active, receiving several significant contributions from developers and users across the open-source Linux community on a typical day. The processing costs associated with managing the mirror's update with rsync+ thus represent changes associated with a real and quite active WWW archive with a modestly large content base.

Seven active subdirectories under the /pub/Linux tree (X11, apps, devel, docs, games, system and utils) were chosen for the mirror experiment, and two copies of these directories were placed on an I2-DSI server to initialize the mirror. To match the terminology in Figure 2.2, the mirror machine will be called M here. It is an IBM enterprise-class storage server in the I2-DSI testbed with 2 GB main memory and 72 GB disk. The two copies of the Linux materials are referred to as the srcMaster/ and src/ copies to indicate their roles. The srcMaster/ was synchronized with the directories at MetaLab twice daily using an unmodified rsync as the import mechanism. After srcMaster/ had been updated, rsync+ -F was run on M to synchronize src/ with srcMaster/. Each of the seven subdirectories was updated with a separate run of rsync+ -F. In addition, for a direct comparison of rsync and rsync+ processing, rsync was run to synchronize separate copies of the srcMaster/ and src/ directories on M.

Finally, to measure the processing that would take place for the remote update at each slave site (Step 3 in rsync+ updates), we ran rsync+ again on M, this time with the -f option, using a pre-updated version of each directory. In this way, each update period in the instrumented mirror generated data on the rsync+ processing times at both the master (the -F option) and the slave (the -f option) side. The mirroring data shown represent approximately one week in duration, i.e., 15 updates separated by 12-hour intervals. Further details on the experimental methodology can be found in [8], in which the mirror was carried out as described here, only using a less powerful mirror host.
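One 12-hour update cycle of the instrumented mirror can be sketched as follows. The MetaLab rsync module path and the local directory layout are illustrative rather than the exact ones used in the experiment; only the twice-daily import with unmodified rsync and the per-subdirectory rsync+ -F runs are taken from the methodology above.

    SUBDIRS="X11 apps devel docs games system utils"

    # Import step: bring srcMaster/ up to date from the MetaLab archive
    # using unmodified rsync (module path assumed).
    for d in $SUBDIRS; do
        rsync -a metalab.unc.edu::pub/Linux/$d/ srcMaster/$d/
    done

    # Instrumented step: one timed rsync+ -F run per subdirectory to
    # synchronize the local src/ copy and record master-side costs.
    for d in $SUBDIRS; do
        /usr/bin/time rsync+ -F srcMaster/$d/ src/$d/
    done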


Results: Figure 3.1 and Figure 3.2 show the rsync+ processing in the mirror experiment. In Figure 3.1, the rsync+ processing times (both -f and -F modes) for each update of the experimental mirror are plotted as a percentage of the rsync processing time over the same data. The data show that rsync+ -F processing costs are very comparable to those for the original unmodified rsync code. The -f mode in rsync+ adds some overhead, on the order of 1.5 to 2 times the unmodified rsync processing.

Figure 3.2 presents data on the measurement of server throughput under the rsync+ -F update on M to synchronize src/ with srcMaster/. The dark squares show the absolute running times of rsync+ -F processing in seconds. The updating of Linux files at the original MetaLab archive site was bursty, and several 12-hour periods resulted in no updates, though the rsync-based mirrors must still run and traverse the file trees to determine that no files have been changed.

The two other data sets in Figure 3.2 give two calculations of server throughput (in Mbit/s), based on the amount of file update information generated. The unnormalized throughput is defined as the sum of the sizes of all files updated in the mirror (added or changed) divided by the running time of rsync+ -F. The normalized throughput is defined similarly, except that the fixed overhead for rsync+ to traverse the file tree is removed by subtracting the minimum rsync+ -F runtime in the data (13.3 seconds) from each measured rsync+ -F runtime value.

Throughput measures under both metrics span a wide range, depending on the size of the mirror update. Over the non-zero entries, the average unnormalized throughput is 2.4 Mbit/s and the average normalized throughput is 11.4 Mbit/s. For data mirroring in I2-DSI, then, an important performance element is to aggregate data sets such that non-null updates under rsync+ run over large data sets and thus achieve throughputs in the high end of the measured range.
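As a hypothetical worked example of the two metrics (the numbers below are invented rather than taken from Figure 3.2), consider an update that changes 20 MB of files with an rsync+ -F runtime of 70 seconds:

    # unnormalized: updated bytes / runtime
    # normalized:   updated bytes / (runtime - 13.3 s minimum traversal time)
    awk 'BEGIN {
        bytes = 20 * 1024 * 1024; t = 70; tmin = 13.3
        printf "unnormalized: %.2f Mbit/s\n", bytes * 8 / t / 1e6          # about 2.4
        printf "normalized:   %.2f Mbit/s\n", bytes * 8 / (t - tmin) / 1e6 # about 3.0
    }'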

Rsync+ preserves the differential file update capability in rsync. In the experiment reported here, the aggregate amount of data generated by rsync+ -F processing represented only a 2% reduction in the total number of bytes for transmission across the network, when compared with summing the bytes in all files added or modified over the lifetime of the mirror. At this level of compression, the rsync feature of differential update (which can be disabled with a flag) seems less than useful. However, the month-long mirror experiment reported in [8] showed that differential file update reduced the total byte count of update information by almost 20%. In the I2-DSI context, the decision to use differential file update will likely be made on a channel-by-channel basis since update behavior will vary between channels.

Figure 3.1: Rsync+ processing (-F and -f modes) as a percentage of rsync processing, per mirror update (one per 12 hours).

Figure 3.2: Rsync+ -F processing times and data throughput per mirror update: runtime (sec), unnormalized throughput, and normalized throughput.


3.2 Experiment 2: Network Throughput

Data mirroring under the model in Figure 2.2 allows for transmission of the updates file as soon as rsync+ -F processing completes. If a reliable multicast transport is available, only a single copy of the file need be transmitted into the network. While this is the most desirable scenario, multicasting is not universally available in WANs today. Instead, in the absence of reliable multicast solutions, the natural strategy is to transmit the file using multiple, concurrent TCP connections, one to each slave site. This section presents experiments measuring the throughputs attainable in current high-speed WANs using concurrent TCP connections. These results benchmark the network throughputs under the most likely scenario for network distribution between the master and slave sites in I2-DSI in the near term.

Methodology: The experiments presented use the well-known ttcp tool for measuring TCP throughput. The critical parameters for the data presented in Figure 3.3 are shown in Table 3.1. The file size of 5.45 MB was selected based on the average size of the updates file generated by rsync+ in the month-long Linux mirror reported in [8]. The network path for the measurement shown is between the I2-DSI server on the Internet2 backbone (the same as used in Experiment 1) and a local campus server across a regional network. The minimum speed of all network links between these machines is 100 Mbit/s, and both are large enterprise-class servers. Measurements using ttcp are sensitive to socket buffer settings at the receiver (see [10]). The setting here was determined from ttcp runs between the two machines using a range of values. No single value is optimal since network conditions fluctuate, but the 240 KB value is in the range of values offering the highest achievable throughputs.

Parameter                        Values for experiments in Figure 3.3
File size                        5.45 MB
Network path (100 Mbit/s min)    dsi.ncni.net -> ils.unc.edu
Concurrent ttcp connections      1, 2, 4, 8, 16, 24, 32
Receiver socket buffer size      240 KB
Buffer policy                    Policy 1: Same-value; Policy 2: Shared-by-n

Table 3.1: Parameters for the ttcp experiment shown in Figure 3.3
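A command sketch of the concurrent transmissions for one data point is given below, using the Shared-by-n policy (Buffer Policy 2 in Table 3.1) with n = 8 connections. The port numbering and the option spellings (-t/-r for transmit/receive, -s to discard data at the receiver, -b for the socket buffer size, -p for the port) follow the classic ttcp tool but should be treated as assumptions, since ttcp variants differ; under Buffer Policy 1 every connection would instead use -b 245760.

    n=8

    # On the receiving host (ils.unc.edu): one receiver per connection,
    # discarding incoming data so disk writes are not measured.
    for i in $(seq 1 $n); do
        ttcp -r -s -b $((245760 / n)) -p $((5000 + i)) &
    done

    # On the transmitting host (dsi.ncni.net): one sender per connection,
    # each reading the 5.45 MB updates file so that file I/O is included.
    for i in $(seq 1 $n); do
        ttcp -t -b $((245760 / n)) -p $((5000 + i)) ils.unc.edu < updates &
    done
    wait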


Figure 3.3: Network throughput for concurrent ttcp transmissions. (Series: throughput and aggregate throughput under Buffer Policy 1 and Buffer Policy 2; x-axis: number of concurrent connections, 1 to 32; y-axis: throughput in Mbit/s on a logarithmic scale.)

For multiple concurrent ttcps, two different policies for setting the socket buffer size were used: Buffer Policy 1 assigns all concurrent ttcps the same value (e.g., 240 KB), while under Buffer Policy 2 the concurrent ttcps "share" the buffer size (e.g., all use 240 KB divided by the number of concurrent ttcps). Note that, while our experiment uses only two machines and thus simulates a worst-case sharing of the network path, the ttcp receivers discarded data upon receiving it (the -s option) so that disk bandwidth at the receiving machine would not be the measured bottleneck. At the transmitting site, data transmissions were taken from a file in order to include the overhead of file I/O, as would be the case under rsync+.

Results: Figure 3.3 shows the average ttcp throughput, averaged over a series of runs (6 per data point) taken in off-peak hours. The vertical axis is shown on a logarithmic scale. The curves labeled as Aggregate Throughput are calculated by multiplying the average throughput by the number of concurrent sessions. As shown, our data showed that single ttcp connections achieved, on average, only 27 Mbit/s of throughput, but when running parallel ttcp transmissions, the I2-DSI server was able to utilize up to 70 Mbit/s of the network path. The use of multiple TCP connections increased the parallelism in the overall network delivery and actually achieved a better utilization of the network, though the reduced per-connection throughputs result in greater network transit times for updates.

The data suggest that using TCP to emulate multicasting to at least a few dozen slave sites is a feasible strategy. However, as the number of concurrent sessions grows, the burden of concurrent processing and data movement at the transmitter grows. In our experiments, starting up 48 processes running ttcps resulted in blocked processes, possibly related to the exhaustion of kernel buffers, and hence the ttcp sessions could not operate in parallel. While further tweaking of transmission parameters will likely break down this barrier, such tweaking may be difficult to perform in operational settings. More fundamentally, the limitations of TCP-based delivery are evident as the number of receivers moves beyond the range of a few dozen.

4. Related Work

Rsync is a recent entry in the class of application-level mirroring tools such as rdist, ftp-mirror [11], and mirror [12] that automate the synchronization of file hierarchies on different hosts. Key benefits of application-level mirroring tools such as rsync+ are their high degree of portability, their ease and flexibility in modifying and tuning the replication process, and the ability to propagate a solution quickly. Rsync is unique in its support of differential file update using block-level checksumming, and our experimental data suggest that differential file update provides a significant reduction in update bytes using real-world data sets. However, the processing costs of checksumming are high, and further investigation of non-checksumming approaches would be interesting.

Network filesystems represent an alternative approach for file replication and distribution across distributed platforms. The widely used commercial solutions AFS [13] and DFS [14] both implement a limited form of read-only replication that is generally statically configured for load balancing. Other approaches to distributed filesystems, such as Coda [15] and Bayou [16], incorporate write-log management in order to maximize file availability for their distributed and/or mobile users. The initial application paradigm in I2-DSI is oriented towards a publisher-subscriber model that makes centralized update a reasonable constraint and therein makes the lightweight rsync+-based file distribution model of Figure 2.2 feasible for file-oriented channels.

An interesting case of distributing large file sets to widely dispersed hosts is described in [17]. Here a 6-level Network Filesystem (NFS) replication hierarchy for shared read-only data (11 GB, 300,000 files) across a set of very widely distributed hosts is presented as an alternative to hierarchical rdists. Changes at a master site are pulled down the hierarchy by NFS clients polling for changes at their parent. An update file containing commands to describe the new filesystem snapshot at the parent is used to avoid each child performing redundant processing to discover the new snapshot, analogous to our single-shot processing and batch file generation under rsync+ -F. This approach is attractive in that it leverages the robustness, error resiliency, and kernel-level performance advantages of the NFS protocol. The key difference between it and rsync+ is that the latter has the ability to exploit and readily incorporate reliable multicasting in the network transport component.

A recent distributed filesystem design, JetFile [18], does explicitly incorporate reliable multicast transport. JetFile is explicitly designed to be a distributed filesystem for WAN-connected Internet machines, and it uses multicast for replica location, synchronization, and management. JetFile exploits multicast for efficient multipoint distribution of immutable file objects. Likewise, other researchers have proposed multicast file distribution in the context of enhancing rdist-like file distribution [19], push caching [20], simple file distribution to WWW servers [21], document delivery applications [22], and other applications. As noted by the JetFile designers, multicasting offers not only advantages in efficient bulk-data distribution, but also flexibility in organizing host communication that avoids the fragility inherent in hierarchical arrangements.

Netlib [23] has been used for some time now in the scientific numerical computing community to share repositories of freely available mathematical software and documentation.

Netlib is a set of tools that automates management of the replication procedure across loosely coupled servers administered by separate organizations. Each server may have a portion of the aggregate repository for which it controls all updates (master) and other portions for which it mirrors remote sites (slave). Netlib sites use background processes to generate an index of each file paired with a checksum over the file's contents. Updating remote slave filesystems uses checksum comparisons to determine which files have been changed, and network transport involves an FTP script generated by a Netlib process.

A key difference between the Netlib scheme and our data mirroring approach is the rsync+ emphasis on timely and synchronous distribution of updates to all participating sites. By contrast, the Netlib project emphasizes automating the replication process in a manageable and robust fashion across loosely cooperating inter-organizational servers. These servers are not simultaneously updated when content changes occur. Moreover, Netlib clients most often access the server within their organization, in contrast with the dynamic binding in I2-DSI between a client and a replication server.

Performance measurements of the vBNS backbone using ttcp are reported daily by researchers at [24]. These measurements of individual ttcp connections (and others) report generally high (> 100 Mbit/s), though variable, throughput across the vBNS. Factors in our measurements that result in lower throughputs include the use of a production server as the receiving machine, runs over a range of evening times as opposed to the 2 AM tests of the vBNS measurements, and the inclusion of disk I/O at the transmitting node.

5. Conclusions

We have presented a modified open-source data mirroring tool, rsync+, as a basis for constructing scalable, multiple-site file replication across backbone-speed networks connecting replication hosts. Controlled performance experiments in the I2-DSI testbed with live WWW content were presented to provide insight into the relative match between network and server-side data throughput.

Our performance experiments with rsync+ confirm that server-side processing is a significant cost in the end-to-end update process during file replication. These empirical observations validate the value of the rsync+ design in eliminating redundant server-side processing at a master site for multiple mirrors.

As for the scalability of the rsync+ design, our performance numbers show that TCP-only delivery to multiple remote mirrors is feasible for small slave sets (e.g., roughly fewer than 50). For properly aggregated data sets, the server throughput measurements suggest that rsync+ can update local file hierarchies at 10 Mbit/s or greater. Under these conditions, as the number of slave sites grows, the mirroring mechanism quickly becomes network-bandwidth limited (see Figure 3.3). Thus, for replication host sets beyond a few dozen, reliable multicast solutions will be very valuable in speeding system performance and reducing system load.


References

[1] P. Rodriguez, C. Spanner, and E. Biersack, "Web Caching Architectures: Hierarchical and Distributed Caching," presented at the 4th International Web Caching Workshop (WCW 99), San Diego, CA, 1999.
[2] M. Beck, B. J. Dempsey, and T. Moore, "The Internet2 Distributed Storage Infrastructure (I2-DSI) Project," 1999.
[3] M. Beck and T. Moore, "The Internet2 Distributed Storage Infrastructure Project: An Architecture for Internet Content Channels," Computer Networking and ISDN Systems, vol. 30, pp. 2141-2148, 1998.
[4] Akamai, Inc., 1999.
[5] Sandpiper, Sandpiper Networks, Inc., 1999.
[6] Y. Chen, J. Edler, A. Goldberg, A. Gottlieb, S. Sobti, and P. Yianilos, "A Prototype Implementation of Archival Intermemory," presented at the 4th ACM Conference on Digital Libraries (DL '99), Berkeley, CA, 1999.
[7] Rsync, 1999.
[8] B. J. Dempsey and D. Weiss, "Towards An Efficient, Scalable Replication Mechanism for the I2-DSI Project," School of Information and Library Science, University of North Carolina at Chapel Hill, TR-1999-01, April 1999.
[9] Test TCP measurement tool, "TTCP archive site," 1999.
[10] G. Miller, K. Thompson, and R. Wilder, "Performance Measurements on the vBNS," presented at the InterOp '98 Engineering Conference, Las Vegas, NV, May 1998.
[11] L. McLoughlin, "ftp-mirror".
[12] F. Leitner, "Mirror 1.0", 1996.
[13] A. Z. Spector and M. L. Kazar, "Wide Area File Service and the AFS Experimental System," Unix Review, vol. 7, March 1989.
[14] M. L. Kazar, B. Leverett, O. Anderson, et al., "Decorum File System Architectural Overview," presented at the 1990 USENIX Conference, Anaheim, CA, 1990.
[15] M. Satyanarayanan, "Scalable, Secure, and Highly Available Distributed File Access," IEEE Computer, vol. 23, 1990.
[16] A. J. Demers, K. Peterson, M. Spreitzer, D. Terry, M. Theimer, and B. Welch, "The Bayou Architecture: Support for Data Sharing among Mobile Users," presented at the Workshop on Mobile Computing Systems and Applications, Santa Cruz, CA, 1994.
[17] B. Callaghan, "An NFS Replication Hierarchy," presented at the W3C Push Technologies Workshop, Peabody, MA, 1997.
[18] B. Gronvall, A. Westerlund, and S. Pink, "The Design of a Multicast-based Distributed File System," presented at the Third Symposium on Operating Systems Design and Implementation (OSDI '99), New Orleans, LA, 1999.
[19] J. Cooperstock and K. Kotsopoulos, "Why use a fishing line when you have a net?: An Adaptive Multicast Data Distribution Protocol," presented at Usenix '96, 1996.


[20] J. Touch, "The LSAM Proxy: a multicast distributed virtual cache," presented at the 3rd International WWW Caching Workshop, Manchester, England, 1998.
[21] J. Donnelley, "WWW Media Distribution via Hopwise Reliable Multicast," presented at the 3rd International WWW Conference, Darmstadt, Germany, 1995.
[22] T. Shiroshita, O. Takahashi, and S. Shiokawa, "A large-scale contents publishing architecture based on reliable multicast," presented at the 15th Annual Conference on Computer Documentation (SIGDOC '97), 1997.
[23] Netlib, 1999.
[24] vBNS Performance Measurement, vBNS Engineering, 1999.
