External Sorting on a Parallel Interleaved File System
Univ. of Rochester 1989–90 Computer Science and Engg. Research Review
Peter C. Dibble and Michael L. Scott†
Department of Computer Science

† This work was supported in part by DARPA/ETL Contract DACA76-85-C-0001, NSF/CER Grant DCR-8320136, Microware Systems Corporation, and an IBM Faculty Development Award.

Abstract

Parallel computers with non-parallel file systems find themselves limited by the performance of the processor running the file system. We have designed and implemented a parallel file system called Bridge that eliminates this problem by spreading both data and file system computation over a large number of processors and disks. To assess the effectiveness of Bridge we have used it as the basis of a parallel external merge sort, an application requiring significant amounts of interprocessor communication and data movement. A detailed analysis of this application indicates that Bridge can profitably be used on configurations in excess of one hundred processors with disks. Empirical results on a 32-processor implementation agree closely with the analysis, providing us with a high degree of confidence in this prediction. Based on our experience, we argue that file systems such as Bridge will satisfy the I/O needs of a wide range of parallel architectures and applications.

1. Introduction

Parallelism is a widely-applicable technique for maximizing computer performance. Within limits imposed by algorithms and interprocessor communication, the computing speed of a multiple-processor system is directly proportional to the number of processing nodes, but for all but the most compute-intensive applications, overall system throughput cannot increase without corresponding improvements in the speed of the I/O subsystem.

Internally-parallel I/O devices can provide a conventional file system with effectively unlimited data rates [Manuel and Barney 1986], but a bottleneck remains if the file system software itself is sequential or if interaction with the file system is confined to only one process of a parallel application. Ideally, one would write parallel programs so that each individual process accessed only local files. Such files could be maintained under separate file systems, on separate processors, with separate storage devices. Unfortunately, such an approach would force the programmer to assume complete responsibility for the physical partitioning of data among file systems, destroying the coherence of data logically belonging to a single file. In addition, frequent restructuring might be required if the same data were to be used in several applications, each of which had its own idea about how to organize explicit partitions.

We believe that it is possible to combine convenience and performance by designing a file system that operates in parallel and that maintains the logical structure of files while physically distributing the data. Our approach is based on the notion of an interleaved file, in which consecutive logical records are assigned to different physical nodes. We have realized this approach in a prototype system called Bridge [Dibble et al. 1988].

To validate Bridge, we must demonstrate that it can provide most I/O-intensive applications with significant speedups on a significant number of processors. We have therefore implemented several data-intensive applications, including utilities to copy and sort sequential files and to transpose image bitmaps. Sorting is a particularly significant example; it is important to a large number of real-world users, and it reorganizes files thoroughly enough to require a large amount of interprocessor communication.

We have analyzed and implemented a parallel external merge sort on Bridge. Our analysis suggests that the parallel portions of the algorithm, especially disk I/O, will overlap the sequential portions under reasonable assumptions regarding the number and speed of processors and disks and the speed of interprocessor communication. Not until disks are attached to over a hundred different processors will the system become CPU or communication-bound. In an attempt to confirm this analysis, we have measured the merge sort on a 32-processor prototype of Bridge. The results are within three percent of the analytical predictions.

2. Parallel Interleaved Files

An interleaved file can be regarded as a two-dimensional array. Each of p disk drives (or multi-drive subsystems) is attached to a distinct processor, and is managed by a separate local file system (LFS). Files span all the disks, interleaved with a granularity of logical records. For example, the lines in a text file would be distributed such that consecutive lines would be located on logically adjacent disks. The main file system directory lists the names of the constituent LFS files for each interleaved file. This information suffices to map an interleaved file name and record number to the corresponding local file name and record number. Formally, with p instances of the LFS, numbered 0 ... p-1, record R of an interleaved file will be record (R div p) in the constituent file on LFS (R mod p). The part of the file located on node x consists of records {y | y mod p = x}. Round-robin interleaving guarantees that programs can access p consecutive records in parallel. For random access, it scatters records at least as well as any other distribution strategy.

Our approach to data distribution resembles that of several other researchers. At the file system level, disk striping can be used to interleave data across the disks comprising a sequential file system [Salem and Garcia-Molina 1986]. In the second-generation Connection Machine [Thinking Machines Inc. 1987], data is interleaved across processors and disks at the level of individual bits. At the physical device level, storage arrays encapsulate multiple disks inside a single logical device. Work is underway at Berkeley to construct such arrays on a very large scale [Patterson et al. 1988]. Our work is distinguished by its emphasis on the design and use of parallel file system software, particularly its use of tools to dynamically add high-level operations to the file system.

2.1 The Bridge File System

Bridge is an implementation of a parallel interleaved file system on the BBN Butterfly Parallel Processor [BBN Laboratories 1986]. Bridge has two main functional layers. The lower layer consists of a Local File System (LFS) on each of the processors with disks. The upper layer is called the Bridge Server; it maintains the integrity of the file system as a whole and provides the initial interface to user applications. Except for a few functions that act on the state of the server itself, the Bridge Server interprets I/O requests and dispatches them to the appropriate LFSs.

An LFS sees only one column sliced out of each interleaved file, but this column can be viewed locally as a complete file. The LFS instances are self-sufficient, fully-competent file systems. They operate in ignorance of one another. LFSs can even maintain local files outside the Bridge file system without introducing any problems. Our LFS implementation is based on the Elementary File System developed for the Cronus distributed operating system [Gurwitz et al. 1986]. Since our goal is simply to demonstrate the feasibility of Bridge, we have not purchased real disk drives. Instead of invoking a device driver, the lowest level of the LFS maintains an image of the disk in RAM and executes an appropriate delay with each I/O request.

In order to meet the needs of different types of users, the Bridge Server implements three different system views.

Between the file system and the application, interprocessor communication remains a potential bottleneck. It can be addressed by reducing communication to the minimum required for each file operation, by exporting as much functionality as possible out of the application, across the communication medium, and into the processors that run the LFS.

Bridge tools are applications that become part of the file system. A standard set of tools (copy, sort, grep, etc.) can be viewed as part of the top layer of the file system, but an application need not be a standard utility program to become a tool. Any process that requires knowledge of the LFS structure may be written as a tool. Tools communicate with the Bridge Server to obtain structural information from the Bridge directory. Thereafter they have direct access to the LFS level of the file system. All accesses to the Bridge directory (Create, Delete, and Open) are performed by the Bridge Server in order to ensure consistency. In essence, Bridge tools communicate with the Server as application programs, but they communicate with the local file systems as if they were the Server.

Our simplest tool copies files. It requires communication between nodes only for startup and completion. Where a sequential file system requires time O(n) to copy an n-block file, the Bridge Copy Tool can accomplish the same thing in time O(n/p), plus O(log(p)) for startup and completion. Any one-to-one filter will display the same behavior; simple modifications to the Copy Tool would allow us to perform a large number of other tasks, including character translation, block encryption, or lexical analysis. By returning a small amount of information at completion time, we could also perform sequential searches or produce summary information.

Other tools can be expected to require non-trivial communication between parallel components. We focus in this paper on the problem of sorting, first because it is an important operation in practice (files are frequently sorted), and second because it is in some sense an inherently hard problem for interleaved files.

3. Sorting Parallel Interleaved Files

Several researchers have addressed the problem of parallel