
Virtual Memory for Data-Parallel Computing

by

Thomas H. Cormen

B.S.E., Princeton University (1978)
S.M., Massachusetts Institute of Technology (1986)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 1993

© Massachusetts Institute of Technology 1993

Signature of Author: Department of Electrical Engineering and Computer Science, September 11, 1992

Certified by: Charles E. Leiserson, Professor of Computer Science and Engineering, Thesis Supervisor

Accepted by: Campbell L. Searle, Chairman, Departmental Committee on Graduate Students

Virtual Memory for Data-Parallel Computing by Thomas H. Cormen

Submitted to the Department of Electrical Engineering and Computer Science on September 11, 1992, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

This thesis explores several issues that arise in the design and implementation of virtual-memory systems for data-parallel computing. Chapter 1 presents an overview of virtual memory for data-parallel computing. The chapter lists some applications that may benefit from large address spaces in a data-parallel model. It also describes the view of virtual memory for data-parallel computing used in this thesis, along with a machine model. Chapter 1 also summarizes VM-DP, a system we designed and implemented as a first cut at a system for data-parallel computing with virtual memory.

Chapter 2 covers bit-defined permutations on parallel disk systems. It gives an algorithm for performing bit-permute/complement, or BPC, permutations, a class of permutations in which each target address is formed by permuting the bits of its corresponding source address according to a fixed bit permutation and then complementing a fixed subset of the resulting bits. This class includes common permutations such as transpose (with dimensions integer powers of 2), bit-reversal permutations, hypercube permutations, and vector-reversal permutations. Chapter 2 also shows how to perform bit-matrix-multiply/complement, or BMMC, permutations efficiently. In a BMMC permutation, each target address is formed by multiplying its corresponding source address, treated as a vector, by a nonsingular matrix over GF(2) and then complementing a fixed subset of the resulting bits. BMMC permutations include all BPC permutations and classes such as Gray code and inverse Gray code permutations. Not only are the algorithms for BPC and BMMC permutations efficient, but they are also deterministic, easily programmed, and can be performed “on-line” in the sense that they take little time and space. Chapter 2 also proves lower bounds for BMMC and BPC permutations, showing that the BPC algorithm is asymptotically optimal.

Chapter 3 examines the differences between performing general and special permutations. It presents a simple, though asymptotically suboptimal, algorithm to perform general permutations by sorting records according to their target addresses. Chapter 3 also explores several classes of special permutations that we can perform much faster than general permutations: monotonic and k-monotonic routes, mesh and torus permutations, BMMC and BPC permutations, and general matrix transpose. These classes arise frequently. Chapter 3 shows, for each of these classes, how to perform them quickly and how to detect them at run-time given a vector of target addresses. Chapter 3 also focuses on the question of how to invoke special permutations. It argues that although we can detect many special permutations quickly at run time, it is better to invoke them by specifying them in the source code. Chapter 3 supports this argument with empirical data for BPC permutations.

In a data-parallel computer with virtual memory, the way in which vectors are laid out on the disk system affects the performance of data-parallel operations. Chapter 4 presents a general method of vector layout called banded layout, in which we divide a vector into bands of a number of consecutive vector elements laid out in column-major order. Chapter 4 analyzes the effect of band size on the major classes of data-parallel operations, deriving the following results:

• For permuting operations, the best band sizes are a track or smaller. Moreover, regardless of the number of grid dimensions, we can perform mesh and torus permutations efficiently.

• For scan operations, the best band size equals the size of the I/O buffer, and these sizes depend on several machine parameters.

• The band size has no effect on the performance of reduce operations.

• When there are several different record sizes, band sizes for elementwise operations should be based on the largest record size, which has the smallest band size.

Chapter 5 presents the design of the demand-paging portion of the VM-DP system. We implemented three different paging schemes and collected empirical data on their I/O performance. One of the schemes was straight LRU (least-recently used) paging, in which all tracks are treated equally. The other two schemes treat tracks differently according to the sizes of vectors they contain. The empirical tests yielded two somewhat surprising results. First, the observed I/O counts for the test suite were roughly equal under all three schemes. Moreover, as problem sizes increased, performance differences among the three schemes decreased. Second, the best scheme overall of the three was the simplest one: straight LRU paging. The other two schemes, which attempt to account for vast differences in vector sizes, were not quite as good overall. Chapter 5 also discusses some of the addressing and implementation issues that arose in the VM-DP system.

Thesis Supervisor: Charles E. Leiserson
Title: Professor of Computer Science and Engineering

Contents

Acknowledgments

1 Virtual Memory for Data-Parallel Computing
  1.1 Who needs virtual memory for data-parallel computing?
  1.2 Virtual memory and data-parallel computing
  1.3 The machine model
  1.4 Previous results and related work
  1.5 The VM-DP system
  1.6 Contributions of this thesis

2 Performing Bit-Defined Permutations on Parallel Disk Systems
  2.1 Classes of permutations
  2.2 BPC permutations
  2.3 Permutable sets of blocks
  2.4 MRC permutations
  2.5 Block BMMC and block BPC permutations
  2.6 BMMC permutations
  2.7 Arbitrary block permutations
  2.8 Lower bounds
  2.9 Performing BMMC and BPC permutations with other vector layouts
  2.10 Conclusions

3 Performing General and Special Permutations
  3.1 Performing general permutations
  3.2 Special permutations
  3.3 Monotonic routes, mesh permutations, and torus permutations
  3.4 BMMC and BPC permutations
  3.5 General matrix transpose
  3.6 Invoking special permutations
  3.7 Conclusions

4 Vector Layout
  4.1 Banded vector layout
  4.2 Data-parallel operations
  4.3 Permuting operations and banded layout
  4.4 Scans and reduces with banded layout
  4.5 Choosing optimal band and I/O buffer sizes for scans
  4.6 Band sizes for elementwise operations
  4.7 Conclusions

5 Paging and Addressing in the VM-DP System
  5.1 Performance of paging schemes in VM-DP
  5.2 Vector allocation
  5.3 Implementation issues in the VM-DP paging system
  5.4 Conclusions

6 Conclusions

Bibliography

Acknowledgments

I’ll try to keep this short, but it won’t be easy. Over the course of eight years, so many people have helped me, at MIT and elsewhere. I had the good fortune to meet Charles Leiserson as soon as I arrived at MIT. I cannot imagine a better advisor.1 Charles didn’t have all the answers, but he had something even more valuable: the right questions. He also knew when to scratch me behind the ears and when to let me sink or swim. And for five of my eight years at MIT, Charles arranged my support.2 Charles, I am proud to call you my advisor, coauthor, and friend. I also consider myself fortunate to know Ron Rivest and Tom Leighton, the other members of my thesis committee. I worked with Ron for over three years on Introduction to Algorithms. Thank you, Ron, for being so easy to work with.3 I have known Tom since we were classmates at Princeton. Tom, I’ve finally realized why it’s 1992 and you’re a full professor and I’m still a graduate student: you’re a lot smarter than me. Lars Bader implemented the VM-DP system with me. As hard as I worked finishing up this thesis, Lars worked harder. Not only was he reworking our paging system on his own, he was also trying to finish up his own thesis. Thank you, Lars. I also received help from many other people in the course of researching and writing this thesis. I hope I’ve got them all: Tim Bartel, Bonnie Berger, Bobby Blumofe, Guy Blelloch, Ted Charrette, Marc D’Alarcao, Dan Davenport, Michael Ernst, Michelangelo Grigni, Steve Heller, Peter Highnam, Marshall Isman, C. Esther Jesurum, Michael Klugerman, Steven Kratzer, Martin Lewitt, Bruce Maggs, Nashat Mansour, Mark Nodine, James Park, Cindy Phillips, C. Greg Plaxton, John Prentice, Jim Salem, Eric Schwabe, Sandy Seale, Mark Sears, Jay

1Except maybe for one who gets off the pitcher’s mound and covers first base quickly on grounders to the right side. 2I was supported for three years by a National Science Foundation fellowship and for five years by the Defense Advanced Research Projects Agency under Grant N00014-91-J-1698. 3One should not infer, because I thank Ron for being easy to work with but I don’t thank Charles for same, that Charles is not easy to work with. He’s just, well, different.


Sipelstein, Burton Smith, Stephen Smith, Cliff Stein, Roberto Stolow, Lisa Tucker, and Marco Zagha.

I have had several fine officemates at MIT, total shredders all. In chronological order: Val Breazu-Tannen, Bruce Maggs, Alex Ishii, Sally Goldman, Cliff Stein, Jon Riecke, Cynthia Dwork, and Igal Galperin.

Thanks also to the support staff in the Theory of Computation Group at MIT. William Ang, David Jones, and Tim Wright answered my picky LaTeX questions. And I’m glad to have Denise Sergent-Leventhal, Be Hubbard, Cheryl Patton, and Arline Benford as my friends. Thanks also to all the other students, faculty, and staff in the Theory Group over the last eight years for making my MIT experience way more fun than I had imagined when I moved to Cambridge from Santa Cruz. I’ll miss you all.

The MIT crowd is only part of the story. Equally essential has been the support of my family and friends. Among my friends, I want to specially thank Paul Daro and Stanley Geberer. I guess I should thank Craig and Mary Ward, too, for letting me eat off Craig’s back. At least I no longer have to hear Craig ask, “Wait a minute. Are you still working on the same book?” The staff of Sports Etc. in Arlington thanks Bruce Maggs and Cindy Phillips for getting me started in hockey. And I thank Rejean Lemelin, for showing me that some old goalies can still play, and all the Sloths who actually showed up.

I am blessed to have had the love and support of my parents, Renee and Perry Cormen, and my wife’s parents, Colette and Paul Sage. With all my love, I thank you.

Finally, there’s one person whose interest in virtual memory for data-parallel computing increased markedly over the last three months. I guess that’s what the lust to leave the Geek Tower will do to you. Anyway, Nicole, it’s now done and you’ll never have to worry about virtual memory again. How you’ve put up with me for the last eight years, and especially the last three months, I’ll never know. I give you big thanks and baisers. You’re the one who wears my ring, and I love you.

THC
September, 1992

Chapter 1

Virtual Memory for Data-Parallel Computing

It was 1974. I had been programming for about a year, and I was trying to bring up my most ambitious program yet. It was a baseball simulation, and it was to run on the IBM 1130 in my high school. The 1130 had 8K of core memory, and my program wouldn’t fit. The operating system valiantly tried to overlay the program, but to no avail. I related my core-memory woes to a friend. My recollection is a tad fuzzy, but I recall him telling me about a computer in Israel that had 150K of core. I suspected there might be an easier solution than going all the way to Israel.

It was 1980. I was working with a small team of programmers at a startup company developing the first microprocessor-based workstation for VLSI design. Despite an elaborate overlay system for our code, we were having a lot of trouble getting the code and data to fit in the 256K of RAM we had to work with. We lobbied the company management to put another 128K into the machine. RAM wasn’t all that cheap in those days. Management acceded to our request, but only after commenting, “If we give you more RAM, you’re just going to use it.” We programmers thought that was the idea.

It was 1990. I was finishing up my summer job at Thinking Machines. I had spent the summer designing and implementing a segment library for the C* language. I had asked Jim Salem, who develops visualization code on the Connection Machine, what sort of features he wanted in a segment library, expecting him to list a few more that I might easily add in to what I had done. Instead, Jim told me about visualization code that ran on a dataset so large that the data had to be swapped into RAM from the attached disk array. That didn’t seem like

it had much to do with the segment library, but I thought about how to implement segment routines that could deal with such huge datasets. Charles Leiserson pointed out to me that the scope of what Jim wanted was far wider than the segment library. Jim wanted virtual memory for the Connection Machine. I could relate to that.

The problem, as we all know, is that faster memory is more expensive. The cost per byte of semiconductor RAM is dropping, but it is still not competitive with the cost per byte of disk storage. As a quick example, an advertisement in a current magazine [Mac92, pp. 390–391] is selling 16 megabytes of Macintosh RAM for $499, or $31.19 per megabyte, and Seagate disks, 1900 megabytes formatted, for $2949, or $1.55 per megabyte. The RAM costs 20 times more per byte than the disk, and this RAM price is very good. RAM for workstations costs even more; a good price is $50–$100 per megabyte. There may yet come a day when the price of RAM is low enough, and its power and packaging requirements are not too stringent, and reliability is a moot issue, that disk storage is priced out of the picture. But that day is not at hand. Gibson [Gib92, Section 2.4] presents several scenarios extrapolating when the price of DRAM will become competitive with disk prices. His scenarios place the crossover point somewhere between the years 2001 and 2027, with later dates more likely. In the meantime—for the next nine to 36 years—it appears that if your data doesn’t fit in RAM, your next best choice is a disk.

OK. We’re stuck with RAM as a limited resource. Almost thirty years ago, computer architects devised virtual memory for sequential machines. But parallel machines that support data-parallel programming don’t have it yet. What can we do about that?

1.1 Who needs virtual memory for data-parallel computing?

Throughout the short history of electronic computing, no matter how big and fast the top machines have been, there have always been applications that needed them to be bigger and faster. The largest Thinking Machines Corporation CM-2 has 2^34 bytes of RAM. That’s 16 gigabytes, and it isn’t enough for Jim Salem.

If we give Jim a 64-gigabyte machine, will that satisfy him? Maybe for a while, but eventually he’ll need more RAM. Hey, Jim, if we give you more RAM, you’re just going to use it.

Jim Salem is not alone. There are many applications that run on parallel machines and use huge datasets. I have compiled a list of some applications that a quick scan of the literature and a well-placed request over a computer network1 turned up. I don’t claim to be knowledgeable about these fields, and it may be the case that some of them don’t need virtual memory today. But these applications all have large memory requirements. Moreover, because the huge address space provided by virtual memory is not yet available, people might not yet be thinking about running larger problems than they are running today. Here’s the list, in alphabetic order, including quantitative information that was readily available:

• A* search [EHMN90], a heuristic search algorithm.

• Blackboard systems [Cor91, Dav92].

• Computational biology [Can90, NGHG+91, Wat90]. The human genome project is attempting to determine the human DNA sequence of 3 billion base pairs.

• Computational fluid dynamics [Has87, MK91, SBGM91].

• Conjugate gradient methods [Chr91].

• Genetic algorithms [Man92, MF92].

• Geophysics [MS91, Wil91, YZ91]. Seismic problems are highly parallel and have huge datasets. Highnam [Hig92] and Seale [Sea92] report datasets well into the gigabytes, some over 30 gigabytes.

• Intelligence. According to Myers and Williams [MW91], “Mass storage currently implies the storage of data in excess of 10^12 bits and it is not uncommon in this community to see requirements of over 10^15 bits.”

• Iterative parallel region growing [Til90].

• Light propagation simulation [Koc90].

• Logic simulation [CC90].

• Low-energy scattering of neutral atoms from crystal surfaces. Prentice [Pre92a, Pre92b] reports that dense, complex-valued, linear systems with over 10,000 rows and columns require over 1.6 gigabytes just to store the matrix, to say nothing of the memory needed to solve it.

• Meteorology [Che91, Kau88, MMF+91].

• Molecular dynamics [BME90, SBGM91, TMB91].

• Ocean modeling [SDM91].

• Partial differential equations [PF91].

• Synthetic aperture radar image processing [PMG91].

• Visualization and graphics [Dem88, Sal92].

1Thanks to Cindy Phillips.

Apparently, a wide range of applications, especially scientific ones, would benefit from virtual memory for data-parallel computing.

1.2 Virtual memory and data-parallel computing

Virtual memory for data-parallel computing has some significant differences from virtual mem- ory for sequential computing. We adopt Denning’s [Den70] definition of virtual memory: giving “the programmer the illusion that he has a very large main memory at his disposal, even though the computer actually has a relatively small main memory.” According to Denning, virtual memory (which we’ll sometimes refer to as VM from here on) was first proposed by the research group at Manchester, England in 1961. It was in use by the mid-1960s. For example, VM was an integral part of the Multics system [BCD72, DD68]. Many computer scientists confuse virtual memory, as defined above, with implementations of it. The mention of “virtual memory” usually evokes thoughts of “demand paging” rather than “illusion of a large address space.” This distinction is important in VM for data-parallel computing because many issues arise other than demand paging.

Virtual memory for sequential machines

In sequential computing, virtual memory—the illusion of a large address space—is generally accomplished by a combination of hardware and software. Only a portion of the large logical address space appears in RAM at any one time. The remainder resides on a disk or drum. The hardware maps logical addresses appearing in registers to physical addresses in RAM, detecting faults—references by the program to logical addresses that are not currently in RAM. The hardware usually maintains rudimentary usage statistics for the physical addresses in RAM, typically information about how recently a range of addresses has been referenced. Software takes over during a page fault, using these statistics to determine which data to eject from RAM and performing disk I/O as necessary to write changed information out to the disk or drum and read the requested data into RAM.

The logical address space is partitioned into pages or segments of contiguous addresses. Segments usually are of varying lengths but with a common logical function. Examples would be segments of code or a segment representing a task’s run-time stack. Pages have fixed lengths and usually no logical grouping of function within each page. A typical page might be 4K bytes, and it could contain both code and data. Segments or pages are usually the smallest unit moved in and out of RAM by the VM system. When a reference to a logical address causes a fault, the entire segment or page containing the address is brought into RAM, and an entire segment or page may be ejected.

To limit the cost of disk accesses, VM systems for sequential machines rely on locality: references to logical addresses tend to cluster. That is, a page or segment that is referenced once is likely to be referenced again soon. Some referencing paradigms exhibit locality, e.g., straight-line code, sequential array accesses, and stack operations. Other paradigms can cause non-local references, e.g., branches in code, subroutine calls and returns, and following pointers.
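The division of labor just described can be made concrete with a small sketch. The following C fragment is illustrative only (the frame table, routine names, and LRU policy are our own invention, not the code of any particular operating system); it shows the software half of the mechanism, which runs after the hardware has detected a fault:

    #define NUM_FRAMES 1024                /* physical page frames in RAM */

    struct frame {
        long page;                         /* logical page held here; -1 if free */
        int dirty;                         /* must be written back before reuse */
        unsigned long last_use;            /* recency statistic kept by the hardware */
    };

    static struct frame frames[NUM_FRAMES];   /* initialize each .page to -1 at startup */

    /* Stubs standing in for the real disk traffic. */
    static void write_page_to_disk(long page, int f) { (void)page; (void)f; }
    static void read_page_from_disk(long page, int f) { (void)page; (void)f; }

    /* Software half of the mechanism: runs when the hardware detects a fault,
       that is, a reference to a logical page that is not currently in RAM. */
    int page_fault(long page)
    {
        /* Use the recency statistics to pick the least-recently used frame. */
        int victim = 0;
        for (int f = 1; f < NUM_FRAMES; f++)
            if (frames[f].last_use < frames[victim].last_use)
                victim = f;

        /* Write changed information out to the disk before ejecting it. */
        if (frames[victim].page != -1 && frames[victim].dirty)
            write_page_to_disk(frames[victim].page, victim);

        /* Read the requested page into RAM and record the new mapping. */
        read_page_from_disk(page, victim);
        frames[victim].page = page;
        frames[victim].dirty = 0;
        return victim;                     /* frame that now holds the page */
    }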

Virtual memory for data-parallel computing

As used in this thesis, virtual memory for data-parallel computing differs from virtual memory for sequential computing in two fundamental ways: it is for data only and not code, and it exploits predictable patterns of data access.

Data-parallel virtual memory is for data only. Data-parallel computing supports operations on vectors. A given computation may use many vectors, and some of them may be rather large. In a VM setting, the amount of data contained in the vectors is far greater than the size of the code. Consequently, the illusion of a large address space is more important for the data than for the code. Moreover, many data-parallel machines use SIMD control. There is a separate, sequential front-end machine that broadcasts an instruction at a time to the processors of the parallel machine. The entire program resides in the memory of the front-end machine and not in the memories of the individual parallel processors. If the code goes through any VM mechanism, it is that of the sequential front-end machine, not that of the parallel machine.

Data-parallel virtual memory exploits predictable data-access patterns. The vector operations supported by data-parallel computing have access patterns that tend to be either highly regular or can be choreographed using algorithmic techniques. In sequential VM systems, locality is more of a hope than a guarantee. In data-parallel VM systems, locality is often guaranteed.

Consider, for example, elementwise operations. An elementwise operation applies a function to one or more source vectors, or operands, to compute a target vector, or result. All vectors involved are of equal length, and the value of each element of the result depends only on the corresponding elements in the operands. In an elementwise operation, a reference to any element of a given vector implies references to all elements of the vector.

What do we mean by access patterns that can be choreographed using algorithmic techniques? Consider a permuting operation. We are given a source vector, a target vector, and a vector of target addresses that map elements of the source vector to positions in the target vector. One way to perform a permutation is to sort by target addresses. In general, there is plenty of algorithmic technique that can be brought to bear when sorting. Data-access patterns for some sorting algorithms are predictable to some degree. In fact, some sorting algorithms (such as Leighton’s Columnsort [Lei85]) are oblivious and access sections of the vector independently of the values being sorted. For some permutations, sorting is not necessary. There are algorithms to perform them that are more efficient than sorting, and their data-access patterns are even more predictable. We shall see examples of such permutations in Chapters 2 and 3.
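To make the guaranteed locality of an elementwise operation concrete, here is a minimal sketch in C (our own illustration; the track-I/O helpers simulate the disk with an in-RAM array, much as a simulator might, and a real system would issue parallel disk I/Os instead):

    #include <stddef.h>
    #include <string.h>

    /* In this sketch the "disk-resident" vectors are plain arrays and a track is
       BD consecutive records.  We assume N is a multiple of BD, as when both are
       powers of 2. */
    static void read_track(const double *vec, size_t t, double *buf, size_t bd)
    { memcpy(buf, vec + t * bd, bd * sizeof *buf); }

    static void write_track(double *vec, size_t t, const double *buf, size_t bd)
    { memcpy(vec + t * bd, buf, bd * sizeof *buf); }

    /* Elementwise z = x + y: each track of each vector is read or written exactly
       once, in order, so the access pattern is known before the operation starts. */
    void elementwise_add(const double *x, const double *y, double *z,
                         size_t n, size_t bd,
                         double *bufx, double *bufy, double *bufz)
    {
        for (size_t t = 0; t < n / bd; t++) {
            read_track(x, t, bufx, bd);
            read_track(y, t, bufy, bd);
            for (size_t i = 0; i < bd; i++)
                bufz[i] = bufx[i] + bufy[i];   /* result depends only on corresponding elements */
            write_track(z, t, bufz, bd);
        }
    }

Every track of every operand is touched exactly once and in order, so the paging system can predict the references perfectly; this is the sense in which locality is guaranteed rather than hoped for.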

1.3 The machine model

In this section, we present the model of a parallel machine with an attached parallel disk system that we use throughout this thesis. It extends the parallel-disk-system model proposed by Vitter and Shriver [VS90a, VS90b] to include processors and a mapping from disk locations to local processor memories.

Figure 1.1 shows the model. There are P processors P_0, P_1, ..., P_{P-1}, and they can hold a total of M records in RAM. Each processor has its own local RAM and can hold up to M/P records. In this thesis, records are usually of a fixed size, say 4 or 8 bytes. Only occasionally will we concern ourselves with the exact nature of a record.

Figure 1.1: The VM data-parallel machine model. The P processors P_0, P_1, ..., P_{P-1} can hold a total of M records in RAM, and an array of D disks D_0, D_1, ..., D_{D-1} is connected to the processors. Records are transferred between disk D_i and the local RAM of processors P_{iP/D}, P_{iP/D+1}, ..., P_{(i+1)P/D-1}, with disk I/O occurring in blocks of B records per disk.

An array of D disks D_0, D_1, ..., D_{D-1} is connected to the processors. We assume that D ≤ P, so that each disk corresponds to one or more processors.2 In particular, for i = 0, 1, ..., D-1, we can transfer records between disk D_i and the local RAM of the P/D processors P_{iP/D}, P_{iP/D+1}, ..., P_{(i+1)P/D-1}.

Disk I/O occurs in blocks of B records. That is, when disk D_i is read or written, B records are transferred to or from the disk. Each block of B records is partitioned into sections of BD/P records, and each section is transferred to or from the RAM of one of the P/D processors connected to disk D_i. All blocks start on multiples of B records on each disk.
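To make the disk-to-processor mapping concrete, the following short C sketch (our own illustration; the struct and function names are not from any real system) computes which processors are connected to each disk and how large each processor's section of a block is:

    #include <stdio.h>

    /* Model parameters: P processors, D disks (D <= P), B records per block,
       M records of RAM in total; all are powers of 2 in this thesis. */
    struct machine { long P, D, B, M; };

    /* Disk i serves processors i*P/D through (i+1)*P/D - 1, and each of those
       P/D processors receives a section of B*D/P records of every block. */
    long first_proc(const struct machine *m, long disk) { return disk * m->P / m->D; }
    long procs_per_disk(const struct machine *m)        { return m->P / m->D; }
    long section_size(const struct machine *m)          { return m->B * m->D / m->P; }

    int main(void)
    {
        struct machine m = { 16, 8, 2, 64 };    /* example values only */
        for (long i = 0; i < m.D; i++)
            printf("disk %ld -> processors %ld..%ld, section of %ld record(s) each\n",
                   i, first_proc(&m, i),
                   first_proc(&m, i) + procs_per_disk(&m) - 1,
                   section_size(&m));
        return 0;
    }

For example, with P = 16, D = 8, and B = 2, each disk is shared by two processors, and each of those processors receives a one-record section of every block read from that disk.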

Parallel I/O operations

We will be interested primarily in counting the number of parallel I/O operations. In a parallel read operation, we read a block from any subset of the D disks into the processors’ local RAMs. In a parallel write operation, we write blocks from the local RAMs to any subset of the disks. Each parallel I/O operation used in this thesis will use all D disks.

2If D > P, we gang together groups of D/P disks to act in concert as one disk of D/P times the size.

Parallel I/O operations can be either independent or striped. In a striped I/O operation, all blocks read or written must reside at the same location on each disk. We call such a set of blocks a track.3 In an independent I/O operation, disk blocks read or written may reside at any location as long as only one block per disk is accessed. Striped I/O has the constraining effect of transforming the block size B and number of disks D to the values B′ = BD and D′ = 1. Some of the algorithms and data-access methods in this thesis will use striped I/O and others will use independent I/O. Note that up to BD records may be transferred in one parallel I/O operation, whether independent or striped.

Let us look briefly at how striped and independent I/O operations influence the maintaining of checksums for data reliability. With striped I/O, we can maintain a separate checksum disk. Checksum maintenance is then fast because all the data needed to compute a checksum is present in the write buffer; no additional disk read is required. If we try to maintain checksum information at all times with independent I/O, however, we generally need an extra disk read for each write [PGK88]. With independent I/O, therefore, it may be faster overall to recompute checksums only at occasional checkpoint times over the course of a computation.

There are two restrictions placed on the parameters P, B, M, and D. First, for ease of exposition and analysis, we shall assume throughout this thesis that each is an integer power of 2. Second, in order for the RAM to accommodate the records transferred in a parallel I/O operation to all D disks, we require that BD ≤ M.
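One way to picture the distinction is to represent a parallel I/O operation by one block index per disk, as in the following illustrative C sketch (the types and names are ours, not an actual driver interface). A striped operation is then the special case in which every participating disk names the same block index, which is exactly the set of blocks forming one track:

    #include <stdbool.h>
    #include <stddef.h>

    /* A parallel I/O operation names at most one block per disk; block[i] is the
       block index to access on disk i, or -1 if disk i sits out this operation. */
    struct parallel_io {
        size_t D;
        long *block;
    };

    /* Striped I/O: every participating disk accesses the same block index, so the
       blocks form one track.  Independent I/O lets the indices differ per disk. */
    bool is_striped(const struct parallel_io *req)
    {
        long common = -1;
        for (size_t i = 0; i < req->D; i++) {
            if (req->block[i] == -1)
                continue;
            if (common == -1)
                common = req->block[i];
            else if (req->block[i] != common)
                return false;
        }
        return true;
    }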

Data organization

Data on a parallel I/O system is organized into vectors. A vector is an array of records. Although a given application may view a vector as a multidimensional array, the VM system views each vector as one-dimensional. We often call an individual record within a vector an element, and we shall usually use N to denote the number of records, or length, of the vector.

We shall often index the elements of vector X as X_0, X_1, ..., X_{N-1}. We require that for each vector X, element X_0 maps to the RAM of processor P_0.

3We use the term “track” to maintain consistency with the Vitter-Shriver terminology; the term “stripe” is sometimes used elsewhere.

Data-parallel instructions operate on one or more vectors at a time. We will generally focus on vectors that occupy several tracks. Vectors are laid out a track at a time. Indexing the tracks from zero,4 the first BD elements are in track 0, the next BD are in track 1, and so on. In general, element X_k is in track number ⌊k/BD⌋. In some chapters, we will pay attention to how vectors are laid out within tracks, but in other chapters it will not matter.
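A two-line helper in C makes the layout rule explicit (the function names are ours, assuming the track-at-a-time layout just described):

    /* With vectors laid out a track at a time, element X_k resides in track
       floor(k / (B*D)) and is record number k mod (B*D) within that track. */
    long track_of_element(long k, long B, long D)   { return k / (B * D); }
    long index_in_track(long k, long B, long D)     { return k % (B * D); }

For instance, with B = 2 and D = 8 as in Figure 2.1 of Chapter 2, track_of_element(37, 2, 8) returns 2, matching that figure.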

1.4 Previous results and related work

In this section, we summarize previous results for the I/O complexity of problems and other related work.

Early work with no parallelism

Early work on external problems generally assumed just one disk with one read/write head. External sorting has been studied more with tapes than with disks as the external storage device. With either tapes or disks, external sorting is generally performed as a variation on merge sort [CLR90, Section 1.3]. The hard part is determining the best sequence of merges; Knuth [Knu73, Section 5.4.9] contains a study for disks.

The Fast Fourier Transform, or FFT, was studied as early as 1967 for cases in which the data did not fit in fast memory and had to be stored on a single disk. Using eight disk files, Singleton [Sin67] showed how to compute the FFT of an N-element vector in lg N I/O passes over the data. Brenner [Bre69] presented two external FFT methods. One, to be used when the data is slightly larger than RAM, makes Θ(N/M) I/O passes. The other, for when N² ≫ M, makes Θ(lg M) I/O passes. Fraser [Fra76] showed a bound of Θ(lg(N/B)/lg(M/B)) passes, matching the number of passes in the later bounds of Aggarwal and Vitter and of Vitter and Shriver. Kim et al. [KNPF87] explored the use of disk interleaving to speed up data transfers and reduce I/O times.

4Just about all indexing in this thesis is from zero.

Floyd [Flo72] showed that general permutations on external storage require Ω((N lg B)/M) I/Os. Fraser [Fra76] also studied some special cases of external permutations, in particular bit-reversal permutations and matrix transpose. McKellar and Coffman [MC69] studied how to organize matrices on disk in order to minimize the number of disk accesses for matrix addition, multiplication, and inversion.

Parallel I/O models

Aggarwal and Vitter [AV88] introduced a model that seems to be more powerful than the Vitter-Shriver model. The Aggarwal-Vitter model has just one disk but with D independent read/write heads. Each head can read or write a B-record block in one parallel I/O. Unlike the Vitter-Shriver model, which requires each of the D blocks accessed simultaneously to reside on different disks, this D-headed model requires no separation of the blocks. They can be any D blocks at all. Curiously, no known problem separates these two models. Obviously, the D-headed model can simulate the D-disk model directly. Conversely, there is no problem known that requires asymptotically fewer I/O operations on the D-headed model than on the D-disk model. It remains an open question whether these two models are equivalent.

Aggarwal and Vitter proved asymptotically tight bounds on the number of parallel I/Os for several problems on their D-headed model, where each parallel I/O operation reads or writes with multiple heads. Vitter and Shriver subsequently proved the same asymptotic bounds on parallel I/Os for the same problems on the D-disk model. In the following list of problems, each bound has a Θ(N/BD) factor. Since each parallel I/O accesses at most BD records, it takes N/BD parallel I/Os to access each record once. The Θ(N/BD) factor is thus the parallel I/O analogue of linear time in sequential computing.

• Sorting and FFT (pebbling an FFT network):

$$\Theta\!\left(\frac{N}{BD}\,\frac{\lg(N/B)}{\lg(M/B)}\right)$$

parallel I/Os.

• Performing a general permutation:

$$\Theta\!\left(\min\left\{\frac{N}{D},\ \frac{N}{BD}\,\frac{\lg(N/B)}{\lg(M/B)}\right\}\right)$$

parallel I/Os. The first term comes into play when the block size is small, and the second term comes from the sorting bound.

• Transpose of an R × S matrix:

$$\Theta\!\left(\frac{N}{BD}\left(1 + \frac{\lg\min(R,S,B,N/B)}{\lg(M/B)}\right)\right)$$

parallel I/Os.

• Pebbling permutation networks:5

$$\Omega\!\left(\frac{N}{BD}\,\frac{\lg(N/B)}{\lg(M/B)}\right)$$

parallel I/Os are required for any permutation network, and

$$O\!\left(\frac{N}{BD}\,\frac{\lg(N/B)}{\lg(M/B)}\right)$$

parallel I/Os are achieved by a permutation network formed by concatenating three FFT networks together.

• The standard matrix-multiplication algorithm on two √N × √N matrices:

$$\Theta\!\left(\frac{N\sqrt{N}}{BD\,\min(\sqrt{N},\sqrt{M})}\right)$$

parallel I/Os. (This result was shown only by Vitter and Shriver, for their D-disk model.)

A few comments about these theoretical results are in order.

5Here, we are not given a permutation network. Rather, we are interested in finding permutation networks that can be pebbled using the fewest possible parallel I/Os.

• The upper bounds come from algorithms that carefully choreograph their disk I/O and use independent I/O. These characteristics are essential in order to make use of blocked I/O.

• Sorting is important, because the upper bound for general permutations relies on it. Practical sorting algorithms yield practical permuting algorithms as well.

• The Vitter-Shriver upper bound for sorting uses a randomized algorithm and holds with high probability. Nodine and Vitter [NV90, NV91, NV92] subsequently developed deterministic sorting algorithms for the D-disk model.

• Some permutations are harder than others. In particular, matrix transpose requires fewer I/Os than do general permutations when either of the matrix dimensions R and S or the block size B is smaller than the number of blocks N/B. This question of which classes of permutations can be performed faster than general permutations led to the work in Chapters 2 and 3 of this thesis.

Other work in parallel I/O

The above results for the D-headed and D-disk models are of a theoretical bent, although some of the algorithms are practical. There is also a body of more systems-oriented work on parallel disk systems, most notably the RAID (Redundant Arrays of Inexpensive Disks) project at Berkeley [CGK+88, CP90, GHK+89, Gib92, LK91, PGK88]. A RAID is given a sequence of I/O operations to perform; it does not determine them, unlike the algorithms above. RAIDs perform disk I/O quickly and with reliability as an important criterion. That is, they can tolerate the catastrophic failure of a number of disks with no loss of data integrity and a minimal impact on performance.

1.5 The VM-DP system

As a testbed for some of the concepts in this thesis, Lars Bader and I developed the VM-DP system. VM-DP is by no means a production-quality system, but it has been of great help in understanding the costs of VM for data-parallel computing. Once we know where the costs accrue, we know where to focus our efforts.

The current implementation of VM-DP is as the back end of the Nesl/Vcode/CVL system developed by Guy Blelloch and his research group at Carnegie Mellon University. Source programs are written in Nesl [Ble92], a strongly-typed, applicative, data-parallel language. In Nesl, all data is organized into vectors; what most languages view as scalars are 1-element vectors in Nesl. The Nesl compiler produces Vcode [BC90, BCK+92], a stack-based, intermediate, data-parallel language. Vcode is interpreted, and the interpreter’s interface to the underlying machine is CVL [BCSZ91], the C Vector Library. VM-DP is a complete implementation of CVL plus some slight modifications to the Vcode interpreter.

Currently, VM-DP is a simulator of a data-parallel machine with VM. The code comprises about 7500 lines of C and runs under Unix on Sun Sparcstations. Operations on the parallel disk system are simulated by operations on a very large Unix file.

Accessing vectors

The implementations of CVL functions that make up the VM-DP system access vectors in two basic ways: through a demand paging system and by algorithms that choreograph their disk I/O.

We use the demand paging system for vector operations that make sequential passes through vectors. Such operations include elementwise operations, scans, and reduces. We examined elementwise operations briefly in Section 1.2, and Chapter 4 will focus on scans. In addition, one type of permuting operation, monotonic routes (see Section 3.3), goes through the demand paging system.

Chapter 5 presents the VM-DP paging system in detail, but a few key characteristics of it are worth noting here. We were willing to have all vector accesses go through a software mechanism to check whether the requested vector already resides in RAM and to read it in if necessary. This approach is unlike conventional serial machines with VM, which typically have hardware support. We were willing to go through software for three reasons. First, our code is only a simulation and so we really had no choice. Second, implementing the entire system in software permits rapid prototyping of paging policies. Third, accesses are of vectors, which often contain many elements. An individual access, therefore, may very well be for many values, not just one. The software overhead is then spread over the many elements accessed.

Although VM-DP is designed for amounts of data in excess of the RAM size, we want performance to be good when all the data fits in RAM. In other words, computations that do not really need VM should not pay a high price for its presence in the system. Although we incur a performance penalty for accessing all vectors through software, the VM-DP paging system is designed to avoid disk I/O whenever possible.

The CVL functions that perform permutations in VM-DP bypass the demand paging system to choreograph the disk I/O on their own. As we shall see in Section 3.1, VM-DP performs general permutations by sorting according to target addresses using an external radix sort algorithm. Moreover, VM-DP can also detect and perform certain special permutations. It is worthwhile to do so because, as Chapters 2 and 3 will show, we can perform some special permutations that arise frequently much faster than we can perform general permutations. The algorithms for these special permutations require complete control over the disk accesses. More important, they require as much RAM as possible for their own use. We shall see in Chapters 2 and 3 that the number of disk accesses required by permutation algorithms decreases as the RAM size M increases. These algorithms, therefore, disable the demand paging system and clear out RAM so they can use all of it.
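A heavily simplified sketch of this software check follows, in C; the structures and function names are ours and do not reflect the actual VM-DP or CVL interfaces:

    #include <stdbool.h>
    #include <stddef.h>

    struct vector {
        size_t length;         /* number of records */
        bool in_ram;           /* is the whole vector currently resident? */
        long first_track;      /* location on the (simulated) parallel disk system */
        void *ram_copy;        /* RAM image when resident */
    };

    /* Stubs standing in for the paging machinery. */
    static void page_in(struct vector *v) { v->in_ram = true; }   /* would read tracks, evicting as needed */
    static void flush_all(void) { }                               /* would write back and empty RAM */

    /* Every vector operation calls this before touching v.  When the data
       already fits in RAM, the only cost of VM is this software test. */
    void *access_vector(struct vector *v)
    {
        if (!v->in_ram)
            page_in(v);        /* demand paging happens only on a miss */
        return v->ram_copy;
    }

    /* Permutation routines bypass demand paging entirely: they clear out RAM
       and then choreograph their own disk I/O over all of it. */
    void begin_choreographed_permutation(void)
    {
        flush_all();
    }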

1.6 Contributions of this thesis

This thesis studies issues that arise in developing VM systems for data-parallel computing. The principal contributions of this thesis are the following:

• A system, VM-DP, for data-parallel computing with virtual memory.

• Theoretical results for classes of bit-defined permutations on parallel disk systems. Chapter 2 presents efficient, practical algorithms for performing bit-permute/complement (BPC) and bit-matrix-multiply/complement (BMMC) permutations. These permutation classes are generalizations of the matrix-transpose permutation, and they also include such common permutations as bit-reversal, hypercube, vector-reversal, Gray code, and inverse Gray code. Chapter 2 defines additional classes of permutations that can be performed in only one pass over records. A lower-bound argument in Chapter 2 proves that the BPC algorithm is asymptotically optimal and that the BMMC algorithm is optimal in some cases.

• Practical methods for performing permutations. Chapter 3 presents how VM-DP performs general permutations and lists several special permutations that can be performed faster: monotonic routes, mesh and torus permutations, BPC and BMMC permutations, and general matrix transpose. Moreover, each of these special permutations can be detected efficiently at run time.

Nevertheless, Chapter 3 argues that the best way to handle special permutations is to invoke them in the source code. Perhaps the best reason is that source-level specification obviates the potentially high cost of generating target addresses.

• A study of how the layout of vectors on disk affects data-parallel operations. Chapter 4 presents a parameterized layout method. In this method, called banded layout, we divide a vector into bands of a number of consecutive vector elements laid out in column-major order. Chapter 4 shows how the band size affects the performance of data-parallel operations.

• A comparison of alternative demand paging strategies in VM-DP. Chapter 5 contains an empirical study of three demand-paging strategies that were implemented as part of VM-DP and concludes that straight LRU paging, the simplest of the three schemes, is best overall.

It is important to understand what this thesis is not. It is not the last word in VM for data-parallel computing. It leaves many open questions. Moreover, this thesis does not address issues in languages and compilers, which may ultimately decide the success or failure of virtual memory for data-parallel computing. Chapter 6 discusses the future of VM for data-parallel computing.

Chapter 2

Performing Bit-Defined Permutations on Parallel Disk Systems

Data-parallel computations must often permute the elements of a vector. This type of operation is expensive enough when the vector fits in RAM, but the cost increases even more when the vector must reside on disk. This chapter presents efficient, practical algorithms to perform many common classes of permutations on vectors larger than the RAM size.

This work was motivated by the observation that there is a gap in the Vitter-Shriver results for permuting (see [VS90a, VS90b] or Section 1.4). Vitter and Shriver showed that although general permutations on N elements require

$$\Theta\!\left(\min\left\{\frac{N}{D},\ \frac{N}{BD}\,\frac{\lg(N/B)}{\lg(M/B)}\right\}\right)$$

parallel I/Os, a specific class of permutations, transposing an R × S matrix, can be performed using only

$$\Theta\!\left(\frac{N}{BD}\left(1 + \frac{\lg\min(R,S,B,N/B)}{\lg(M/B)}\right)\right)$$

parallel I/Os. This gap raised the question of which related permutation classes can also be performed faster than general permutations.

This chapter presents an asymptotically optimal algorithm for the class of bit-permute/complement (BPC) permutations. In a BPC permutation, each target address is formed by permuting the bits of its corresponding source address according to a fixed bit permutation and then complementing a fixed subset of the resulting bits. The BPC algorithm uses at most

$$\frac{2N}{BD}\left(2\left\lceil\frac{\rho(A)}{\lg(M/B)}\right\rceil + 1\right)$$

parallel I/Os, where ρ(A), which we refer to as the “cross-rank” of the permutation, is a measure of the bit movement of the bit permutation. The class of BPC permutations includes matrix transpose as a subclass, and it includes other common permutations, such as bit-reversal, hypercube, and vector-reversal. The algorithm herein is not only asymptotically optimal, but its small constant factors render it very practical.

Most of the material in this chapter appears in [Cor92].

Just as BPC permutations generalize matrix transpose, the class of bit-matrix-multiply/complement (BMMC) permutations generalizes BPC permutations. In a BMMC permutation, each target address is formed by multiplying its corresponding source address, treated as a vector, by a nonsingular “characteristic” matrix over GF(2) and then complementing a fixed subset of the resulting bits. This chapter presents an algorithm that performs BMMC permutations using at most

$$\frac{2N}{BD}\left(2\left\lceil\frac{\lg M - \operatorname{rank}(A_{0..\lg M-1,\,0..\lg M-1})}{\lg(M/B)}\right\rceil + H(N,M,B)\right)$$

parallel I/Os, where

$$H(N,M,B) = \begin{cases}
4\left\lceil\dfrac{\lg B}{\lg(M/B)}\right\rceil + 9 & \text{if } M \le \sqrt{N}\,, \\[2ex]
4\left\lceil\dfrac{\lg(N/B)}{\lg(M/B)}\right\rceil + 1 & \text{if } \sqrt{N} < M < \sqrt{NB}\,, \\[2ex]
5 & \text{if } \sqrt{NB} \le M\,,
\end{cases}$$

and where A_{0..lg M-1, 0..lg M-1} is the leading (lg M) × (lg M) submatrix of the characteristic matrix A. BMMC permutations include all BPC permutations and classes such as Gray code and inverse Gray code permutations. As we shall see, the BMMC algorithm is asymptotically optimal in some cases but not in others.

Not only are the algorithms for BPC and BMMC permutations efficient, but they are also deterministic, easily programmed, and can be performed “on-line” in the sense that they take little time and space. The data structures are vectors of length lg N or lg N × lg N matrices, and serial algorithms for the harder computations take time polynomial in lg N.

This chapter focuses on minimizing the number of parallel I/O operations rather than on processing time. We will not examine the in-RAM permutations performed by the algorithms, primarily because permutation algorithms for parallel machines are highly dependent on aspects of the machine architecture beyond the scope of this thesis. Examples of the types of in-RAM permutation algorithms one might use appear in [EHJ92, JH91, NS81, NS82].

To perform BPC and BMMC permutations, we define some restricted classes of permutations that we can perform in one pass each. We perform a BPC permutation by factoring it into the composition of these restricted permutations. We do the same for BMMC permutations, but with BPC permutations among the factors.

This chapter includes the BPC and BMMC algorithms and much more. Section 2.1 formally defines the classes of bit-defined permutations that we work with in this chapter. It presents examples of these classes and discusses some of their general properties. Section 2.2 contains the BPC algorithm, proves it correct, and analyzes its I/O requirements. Section 2.3 describes a technique that Sections 2.4 and 2.5 use to perform the restricted classes of permutations in one pass each. Using results from the previous sections, we see how to perform BMMC permutations quickly in Section 2.6. Section 2.7 presents an off-line algorithm for arbitrary block permutations. In Section 2.8, we prove that our BPC algorithm is asymptotically optimal and prove a lower bound for BMMC permutations. The algorithms in this chapter assume a particular style of data layout on the parallel disk system. Section 2.9 shows that when other data-layout organizations are used, the BPC and BMMC algorithms in this chapter need not be changed; we need only adjust the input permutations slightly. Finally, Section 2.10 contains some concluding remarks about performing bit-defined permutations.

2.1 Classes of permutations

This section defines the classes of permutations that appear in the remainder of this chapter.1 We start by defining some notation and the style of data layout assumed by the permutation algorithms.

Notation and data layout

For convenience, this chapter uses the following notation extensively:

b = lg B ,

d = lg D ,

1We’ll also see BPC and BMMC permutations in Chapter 3.

               D0      D1      D2      D3      D4      D5      D6      D7
    track 0   0  1    2  3    4  5    6  7    8  9   10 11   12 13   14 15
    track 1  16 17   18 19   20 21   22 23   24 25   26 27   28 29   30 31
    track 2  32 33   34 35   36 37   38 39   40 41   42 43   44 45   46 47
    track 3  48 49   50 51   52 53   54 55   56 57   58 59   60 61   62 63

Figure 2.1: The layout of N = 64 records in a parallel disk system with B = 2 and D = 8. Each box is one block. The number of tracks needed is N/BD = 4. Numbers indicate record indices.

m = lg M ,

t = lg(N/BD) ,

n = lg N .

Because B, D, and M are assumed to be exact powers of 2 throughout this thesis, b, d, and m are nonnegative integers. As usual, N is the number of elements of the vector that we wish to permute. In this chapter, we assume that N is an exact power of 2, and so n is a nonnegative integer. We also assume that N > M, for otherwise we can perform any permutation easily by reading the entire vector into RAM, permuting it, and writing it back out to the parallel disk system. Note that n = b + d + t and that, since BD ≤ M < N, we have b + d ≤ m < n.

This chapter follows the Vitter-Shriver scheme for vector layout on the parallel disk system, as shown in Figure 2.1. Record indices vary most rapidly within a block, then among disks, and finally among tracks. Thus, we can parse the address of a record as shown in Figure 2.2: the least significant b bits give the offset of the record within its block, the next most significant d bits give the number of the disk that the record resides on, and the most significant t bits give the number of the track containing the record. We indicate the address of a record as an n-bit vector x with the least significant bit first: x = (x_0, x_1, ..., x_{n-1}). The offset is given by the b bits x_0, x_1, ..., x_{b-1}, the disk number by the d bits x_b, x_{b+1}, ..., x_{b+d-1}, and the track number by the t bits x_{b+d}, x_{b+d+1}, ..., x_{n-1}. Figure 2.2 also shows the relationship of m to b, d, and t, and the bit positions x_b, x_{b+1}, ..., x_{m-1}, which will be important later on.

Figure 2.2: Parsing the address x = (x_0, x_1, ..., x_{n-1}) of a record. Here, n = 13, b = 3, d = 4, t = 6, and m = 9. The least significant b bits contain the offset of a record within its block, the next d bits contain the disk number, and the most significant t = n − (b + d) bits contain the track number. The least significant m bits and bits x_b, x_{b+1}, ..., x_{m-1} are also indicated.
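Because B, D, and N are powers of 2, parsing an address into these three fields amounts to shifts and masks. The following small C helper is our own illustration (not code from the thesis), using the notation b = lg B and d = lg D defined above:

    /* Parse an n-bit record address into its offset (low b bits), disk number
       (next d bits), and track number (remaining high bits), as in Figure 2.2. */
    struct parsed_address { unsigned long offset, disk, track; };

    struct parsed_address parse_address(unsigned long x, int b, int d)
    {
        struct parsed_address p;
        p.offset = x & ((1UL << b) - 1);            /* bits x_0 .. x_{b-1}     */
        p.disk   = (x >> b) & ((1UL << d) - 1);     /* bits x_b .. x_{b+d-1}   */
        p.track  = x >> (b + d);                    /* bits x_{b+d} .. x_{n-1} */
        return p;
    }

For the parameters of Figure 2.2 (b = 3, d = 4), parse_address(x, 3, 4) returns the offset bits x_0..x_2, the disk bits x_3..x_6, and the track bits x_7..x_12.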

A permutation of records is a one-to-one mapping of input addresses from the set {0, 1, ..., N-1} onto itself. We specify a permutation by considering each record’s source address x = (x_0, x_1, ..., x_{n-1}) and its corresponding target address y = (y_0, y_1, ..., y_{n-1}). Table 2.1 shows the classes of permutations we study in this chapter and summarizes the upper-bound results herein.

BPC permutations

In the class of bit-permute/complement, or BPC, permutations2 we form each record’s target address y from its source address x by applying a fixed permutation π to the address bits and then complementing a fixed subset of bits of the result. The complementing is equivalent to exclusive-oring by an n-bit complement vector c = (c_0, c_1, ..., c_{n-1}). A source address x maps to a target address y by the equation

$$y_{\pi(j)} = x_j \oplus c_{\pi(j)} \tag{2.1}$$

for j = 0, 1, ..., n-1, where ⊕ denotes the exclusive-or (XOR) operation.

2Johnsson and Ho [JH91] call BPC permutations dimension permutations, and Aggarwal, Chandra, and Snir [ACS87] call BPC permutations without complementing rational permutations.

Permutation: BPC (bit-permute/complement)
  Characteristic matrix: a permutation matrix A
  Number of passes: 2⌈ρ(A)/lg(M/B)⌉ + 1

Permutation: BMMC (bit-matrix-multiply/complement)
  Characteristic matrix: any nonsingular matrix A
  Number of passes: 2⌈(lg M − r)/lg(M/B)⌉ + H(N, M, B)

Permutation: Block BMMC
  Characteristic matrix: leading b × b submatrix nonsingular, trailing (n − b) × (n − b) submatrix nonsingular, all other entries 0
  Number of passes: 1

Permutation: Block BPC
  Characteristic matrix: leading b × b submatrix nonsingular, trailing (n − b) × (n − b) submatrix a permutation matrix, all other entries 0
  Number of passes: 1

Permutation: MRC (memory-rearrangement/complement)
  Characteristic matrix: leading m × m submatrix nonsingular, trailing (n − m) × (n − m) submatrix nonsingular, upper right m × (n − m) submatrix arbitrary, lower left (n − m) × m submatrix 0
  Number of passes: 1

Permutation: Arbitrary block (off-line scheduling)
  Number of passes: 1

Table 2.1: Classes of permutations, their characteristic matrices, and upper bounds shown in this chapter on the number of passes needed to perform them. A pass uses exactly 2N/BD parallel I/Os. All characteristic matrices are n × n. For block BMMC, block BPC, and MRC permutations, the characteristic matrix is described by the structure of its submatrices. For BPC permutations, the function ρ(A) is defined in equation (2.6). For BMMC permutations, the term r is equal to rank(A_{0..lg M-1, 0..lg M-1}), and the function H(N, M, B) is given by equation (2.18). Each of these classes, with the exception of arbitrary block permutations, can be performed on-line.

Many familiar permutations fit into the class of BPC permutations. For example, matrix transposition is a type of BPC permutation. Suppose that the N records are the entries of an R × S matrix stored in row-major order. The n-bit address of each record is comprised of a (lg R)-bit row number followed by a (lg S)-bit column number. To transpose this matrix, we map the (i, j) entry to the (j, i) position. We can do so by performing the permutation in which the upper lg R bits and the lower lg S bits of the source address are swapped to form the target address. Here, the fixed permutation of address bits is a cyclic rotation by either lg R positions in one direction or lg S positions in the other, and no bits are complemented. (Cyclic rotation by one bit position is equivalent to the perfect shuffle and inverse perfect shuffle permutations.) In the context of equation (2.1), we have

$$\pi(j) = (j + \lg R) \bmod n = (j - \lg S) \bmod n \tag{2.2}$$

and c_j = 0 for j = 0, 1, ..., n-1.

Bit-reversal permutations, which are often used in performing FFTs, are another example of BPC permutations. Here, we reverse the bits of each record’s source address to form its target address. Thus, if a record’s n-bit source address is (x_0, x_1, ..., x_{n-1}), its target address is (x_{n-1}, x_{n-2}, ..., x_0). In the context of equation (2.1), we have π(j) = (n-1) − j and c_j = 0 for j = 0, 1, ..., n-1.

Vector-reversal permutations are yet another type of BPC permutation. In a vector-reversal permutation, the record with source address i has target address (N-1) − i for i = 0, 1, ..., N-1. This permutation of records is accomplished by complementing each address bit. Here, π(j) = j and c_j = 1 for j = 0, 1, ..., n-1. Complementing just one address bit yields a hypercube permutation, which corresponds to swapping records across one dimension of a hypercube.

We can also formulate a BPC permutation as a matrix-vector product followed by an XOR operation. We define an n × n permutation matrix3 A = (a_{ij}), which is related to the address-bit permutation π of equation (2.1) by the relationship

$$a_{ij} = \begin{cases} 1 & \text{if } i = \pi(j)\,, \\ 0 & \text{otherwise} \end{cases} \tag{2.3}$$

for i, j = 0, 1, ..., n-1. We treat a source address x as an n-vector (equivalent to an n × 1 matrix, as in Figure 2.2).

Using the same complement vector c as in equation (2.1), the same target address y from equation (2.1) is given by y = Ax ⊕ c, or

$$\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{n-1} \end{pmatrix} =
\begin{pmatrix}
a_{00} & a_{01} & a_{02} & \cdots & a_{0,n-1} \\
a_{10} & a_{11} & a_{12} & \cdots & a_{1,n-1} \\
a_{20} & a_{21} & a_{22} & \cdots & a_{2,n-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{n-1,0} & a_{n-1,1} & a_{n-1,2} & \cdots & a_{n-1,n-1}
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{pmatrix}
\oplus
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_{n-1} \end{pmatrix} . \tag{2.4}$$

We shall characterize BPC permutations by either a matrix or an address-bit permutation according to which is more convenient at the time.

3A permutation matrix has exactly one 1 in each row and exactly one 1 in each column.
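As a concrete illustration of equation (2.1), the following C sketch (our own, illustrative only, and concerned with address arithmetic rather than with disk I/O) applies a BPC permutation to a single n-bit address, given the address-bit permutation π and the complement vector c, and builds the π of equation (2.2) for matrix transpose:

    /* Map a source address x to its target address under the BPC permutation
       defined by pi[] and the complement vector c (equation (2.1)):
       bit j of x becomes bit pi[j] of the result, and then c is XORed in. */
    unsigned long bpc_map(unsigned long x, const int *pi, unsigned long c, int n)
    {
        unsigned long y = 0;
        for (int j = 0; j < n; j++)
            if ((x >> j) & 1UL)
                y |= 1UL << pi[j];
        return y ^ c;
    }

    /* Fill in pi[] for transposing an R x S matrix stored in row-major order,
       as in equation (2.2): pi(j) = (j + lg R) mod n, where n = lg R + lg S. */
    void transpose_pi(int *pi, int lgR, int lgS)
    {
        int n = lgR + lgS;
        for (int j = 0; j < n; j++)
            pi[j] = (j + lgR) % n;
        /* Bit-reversal would instead set pi[j] = (n - 1) - j, and vector reversal
           would leave pi[j] = j and set every bit of c to 1. */
    }

With c = 0, bpc_map together with transpose_pi realizes the cyclic rotation of address bits described above.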

BMMC permutations

The class of bit-matrix-multiply/complement, or BMMC, permutations4 is a generalization of the class of BPC permutations. If the entries of the matrix A in equation (2.4) are drawn from {0, 1} and A is nonsingular (i.e., invertible) over GF(2),5 then equation (2.4) defines a permutation.

(We shall prove this property later in this section.) Each target-address bit y_i is given by

y = a x c , i  ij j i 0 j n 1 ⊕ ≤M≤ −   where the product aijxj is simply the logical-and of the two bits. We call the matrix A the characteristic matrix of the BMMC permutation. BMMC permutations include all BPC permutations, since any permutation matrix is non- singular. The class of BMMC permutations also includes permutations such as the Gray code and inverse Gray code permutations. If y = Gray(x), then

xi xi+1 if 0 i < n 1 , yi = ⊕ ≤ −  x if i = n 1 .  i −  4Edelman, Heller, and Johnsson [EHJ92] call BMMC permutations affine transformations or, if there is no complementing, linear transformations. 5Matrix multiplication over GF (2) is like standard over the reals but with all arithmetic performed modulo 2. Equivalently, multiplication is replaced by logical-and, and addition is replaced by XOR. 2.1. Classes of permutations 33

For example, if N = 26, the corresponding BMMC permutation is given by

y0 1 1 0 0 0 0 x0 0 y 0 1 1 0 0 0 x 0  1     1    y2 0 0 1 1 0 0 x2 0   =       .  y3   0 0 0 1 1 0   x3  ⊕  0           y   0 0 0 0 1 1   x   0   4     4     y   0 0 0 0 0 1   x   0   5     5            1 If y = Gray− (x), then

yi = xj , i j n 1 ≤M≤ − and as a BMMC permutation for N = 26, we have

y0 1 1 1 1 1 1 x0 0 y 0 1 1 1 1 1 x 0  1     1    y2 0 0 1 1 1 1 x2 0   =       .  y3   0 0 0 1 1 1   x3  ⊕  0           y   0 0 0 0 1 1   x   0   4     4     y   0 0 0 0 0 1   x   0   5     5            BMMC permutations are also useful for reducing data-access conflicts in parallel systems. When data accesses are not to consecutive records, but rather occur with a stride s that is not relatively prime to the number of storage devices, many parallel references will access the same device. As a result, one device is a “hot spot”—overloaded with I/O requests—while the rest are idle. This situation can occur, for example, when accessing entire columns of matrices stored in row-major order. Norton and Melton [NM87] show how by rearranging the records according to BMMC permutations, the conflicts can be dramatically reduced. When the devices in question form a parallel disk system, rather than endure the data-access conflicts induced by the stride-s accesses, it can be faster to first permute the records according to some BMMC permutation with matrix A, then perform the stride-s accesses, and finally restore the 1 original record ordering by performing the BMMC permutation with matrix A− . We refer the interested reader to [NM87] for details on computing the matrix A. The next three permutation classes to be defined are useful as subroutines in performing 34 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

BMMC and BPC permutations. Each is a restricted BMMC permutation and is described by a characteristic matrix and a complement vector.

Block BMMC permutations

Block BMMC permutations are BMMC permutations with the restrictions shown in Table 2.1. We require that both the leading b b submatrix and the trailing (n b) (n b) submatrix be × − × − nonsingular and that the rest of the characteristic matrix be 0.6 A block BMMC permutation can be thought of as a BMMC permutation in which the records are entire blocks; the trailing (n b) (n b) submatrix gives this BMMC permutation. Moreover, information within the − × − records can be reordered according to the leading b b submatrix. ×

Block BPC permutations

Block BPC permutations are block BMMC permutations with the additional restriction shown in Table 2.1: the trailing (n b) (n b) submatrix must be a permutation matrix. A block BPC − × − permutation can be thought of as a BPC permutation in which the records are entire blocks and information within the records can be reordered according to the leading b b submatrix. × Any hypercube or vector-reversal permutation is trivially a block BPC permutation, since its characteristic matrix is an .

MRC permutations

Table 2.1 shows the form of a memory-rearrangement/complement, or MRC, characteristic matrix. Both the leading m m and trailing (n m) (n m) submatrices are nonsingular, × − × − the upper right m (n m) submatrix can contain any 0-1 values at all, and the lower left × − (n m) m submatrix is all 0. As we shall see more formally in Section 2.4, an MRC permutation − × can be performed by reading in a memoryload (i.e., M records) of M/BD tracks, permuting the records read within memory, and writing them out to M/BD tracks. Gray code and inverse Gray code permutations are MRC permutations.

6A leading R × S submatrix of a matrix A consists of the intersection of the first R rows and S columns of A, and a trailing R × S submatrix consists of the intersection of the last R rows and S columns. 2.1. Classes of permutations 35

Note that BMMC and BPC permutations are independent of the machine parameters B and M, unlike block BMMC, block BPC, and MRC permutations. Given an n n matrix, × one can easily check whether it characterizes a BMMC or BPC permutation, but unless we know the block size B, we cannot check whether it characterizes a block BMMC or block BPC permutation. Similarly, we need to know the RAM size M to determine whether a matrix characterizes an MRC permutation.

Arbitrary block permutations

There is one more class of permutation that we shall study in this chapter, although we won’t use it to perform other classes of permutations. In an arbitrary block permutation, all records within a source block remain together in a target block, and the blocks may be permuted arbitrarily. That is, the permutation on block addresses may be any mapping from 0, 1, . . . , N/B 1 { − } onto itself. This class includes block BMMC and block BPC permutations. Section 2.7 will show how to perform arbitrary block permutations off-line.

Further notational conventions

The remainder of this chapter will rely on a few more notational conventions. Matrix row and column numbers are indexed from 0 starting from the upper left. We index rows and columns by sets to indicate submatrices, using “. .” notation to indicate sets of contiguous numbers. For example, if U = 0, 1, 2 and V = 1, 3 , then { } { }

a01 a03 A = A = a a . U,V 0..2,V  11 13  a21 a23  

When a submatrix index is a singleton set, we shall often omit the enclosing braces: Ai,j =

A i , j . We denote an identity matrix by I and a matrix whose entries are all 0s by 0; the { } { } dimensions of such matrices will be clear from their contexts. We will frequently switch between bit vectors and their interpretations as base-2 integers; we interpret the vector x = (x0, x1, n 1 i . . . , xn 1) as the integer i=0− xi2 . Vectors are treated as 1-column matrices in context. − P 36 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

Properties of the permutations

The reader might not find it obvious that if A is nonsingular, then matrix multiplication by A over GF (2) yields a permutation of the source addresses. The following lemma and its corollary prove this property. For a p q matrix A with 0-1 entries, we define ×

(A) = y : y = Ax for some x 0, 1, . . . , 2q 1 , R { ∈ { − }} that is, (A) is the set of values that can be produced by multiplying 0, 1, . . . , 2q 1 by A R − over GF (2). The rank of a matrix A, denoted rank(A), is the cardinality of the largest set of linearly independent rows or columns of A.

Lemma 2.1 Let A be a p q matrix whose entries are drawn from 0, 1 , and let r = rank(A). × { } Then (A) = 2r. |R |

Proof: Let S index a maximal set of linearly independent columns of A, so that S 0, ⊆ { 1, . . . , q 1 , S = r, the columns of the submatrix A0..p 1,S are linearly independent, and − } | | − for any column number j S, the column A0..p 1,j is linearly dependent on the columns of 6∈ − A0..p 1,S. By the definition of linear dependence, we have that (A) = (A0..p 1,S). But − R R − S r (A0..p 1,S) = 2| | = 2 , since each column index in S may or may not be included in a sum |R − | of the columns. Thus, (A) = 2r. |R |

Corollary 2.2 Let A be a q q matrix whose entries are drawn from 0, 1 . Then A is non- × { } singular over GF (2) if and only if (A) is a permutation of 0, 1, . . . , 2q 1 . R { − }

Proof: If A is nonsingular, then it has rank q. Thus, (A) = 2q, and each element of 0, 1, |R | { . . . , 2q 1 is represented in (A), making it a permutation of 0, 1, . . . , 2q 1 . − } R { − } Conversely, if A is singular, then its rank is strictly less than q, which implies that (A) < |R | 2q. Thus, there is some number in 0, 1, . . . , 2q 1 that is not in (A), so (A) is not a { − } R R permutation of 0, 1, . . . , 2q 1 . { − } 2.1. Classes of permutations 37

To show that equation (2.4) defines a permutation when the characteristic matrix is non- singular, we also need to show that complementing by a fixed q-vector defines a permutation.7

Theorem 2.3 Let c be a q-vector, and let x, z 0, 1, . . . , q 1 . Then x = z if and only if ∈ { − } x c = z c. ⊕ ⊕

Proof: That x = z implies x c = z c is obvious. Now suppose that x = z. Let x and z ⊕ ⊕ 6 differ in position i, for some i 0, 1, . . . , q 1 . Then x c = z c , and so x c = z c. ∈ { − } i ⊕ i 6 i ⊕ i ⊕ 6 ⊕

Next, we show that block BMMC, block BPC, and MRC permutations are indeed permu- tations. To do so, we need only show that their characteristic matrices are nonsingular over GF (2).

Theorem 2.4 The characteristic matrices for block BMMC, block BPC, and MRC permuta- tions are nonsingular.

α 0 Proof: Let A = characterize a block BMMC permutation, where α, which is b b, 0 δ   × α 1 0 and δ, which is (n b) (n b), are both nonsingular submatrices. Then, A 1 = − , − 0 δ 1 − × −  −  and so A is nonsingular. The proof for block BPC permutations is similar, except that δ is a permutation matrix. α β Now let A = characterize an MRC permutation, where α is m m and nonsingular, 0 δ   × β is any 0-1 submatrix that is m (n m), and δ is (n m) (n m) and nonsingular. Then, × − − × − α 1 α 1 β δ 1 A 1 = − − − , and so A is nonsingular. − 0 δ 1  −  Cross-ranks

We will characterize the I/O complexity of BPC permutations by their cross-ranks, which we can define on the characteristic matrix or the equivalent address-bit permutation. Intuitively, for any bit position k, the k-cross-rank, denoted ρk, is the number of bits that cross a dividing

7Theorem 2.3 is essentially a special case of a cancellation lemma for groups such as Lemma 2.3.2 in [Her75, p. 34]. 38 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems line between bit positions k 1 and k one way. More formally, for a bit permutation π or the − equivalent characteristic matrix A, and for k = 0, 1, . . . , n 1, we define −

ρ (π) = j : 0 j k 1 and k π(j) n 1 k |{ ≤ ≤ − ≤ ≤ − }| = j : k j n 1 and 0 π(j) k 1 , |{ ≤ ≤ − ≤ ≤ − }|

ρk(A) = rank(A0..k 1,k..n 1) − −

= rank(Ak..n 1,0..k 1) , (2.5) − − and we define the cross-rank ρ by

ρ(π) = max(ρm(π), ρb(π)) ,

ρ(A) = max(ρm(A), ρb(A)) . (2.6)

The cross-rank, therefore, is the maximum of the m- and b-cross-ranks and is thus dependent on the RAM size and the block size. Note that ρk(π) = ρk(A) for all k, and thus ρ(π) = ρ(A), when π and A are related by equation (2.3).

2.2 BPC permutations

In this section, we examine how to perform BPC permutations, which, as we have seen, include many commonly used permutations. We are given an n n characteristic matrix A that is a × permutation matrix and a complement vector c of length n. Our algorithm performs at most 2N ρ(A) BD 2 lg(M/B) + 1 parallel I/Os. It isl more conm venient to think in terms of address-bit permutations than in terms of permu- tation matrices. Given a characteristic matrix A for a BPC permutation, we shall work with the equivalent address-bit permutation π that satisfies equation (2.3). For now, we ignore the complement part of the BPC permutation. We shall perform BPC permutations by alternately performing MRC and block BPC per- 2.2. BPC permutations 39 mutations. As we shall see in Sections 2.4 and 2.5, we can perform any MRC or any block BPC permutation using exactly 2N/BD parallel I/Os. The characteristic matrices of these permutations have particular forms. For the MRC permutations, the leading m m submatrix × and the trailing (n m) (n m) submatrix are permutation matrices. For the block BPC − × − permutations, the leading b b submatrix and the trailing (n b) (n b) submatrix are × − × − permutation matrices. The idea in performing BPC permutations is to sort the bits according to the permutation π.

We maintain a running “remaining permutation” πrem and a running “permutation performed so far” πperf , updating them each time we perform an MRC or block BPC permutation. Initially,

πperf is the identity permutation and πrem = π; the algorithm terminates when πrem is the identity permutation, at which time πperf = π.

We sort the πrem permutation by alternately performing MRC and block BPC permuta- tions. Intuitively, each MRC permutation sorts the first m items of πrem, and each block BPC permutation sorts the last n b items. There is an (m b)-sized area of overlap through which − − address bits can pass between the first b and last n m positions. − The pseudocode for the BPC algorithm uses the following conventions. The symbol ◦ indicates function composition—(f g)(j) = f(g(j))—which is associative. The input to the ◦ procedure Perform-BPC is the permutation π to be performed, and the procedure’s side effect is to perform π. Parameters are passed by reference, so that changes to the parameters

πperf and πrem in the subroutines Perform-MRC and Perform-Block-BPC are seen by

Perform-BPC. Inspection of the code reveals that the permutation πperf is not really needed; we include it to facilitate the proof of correctness below. Section 2.4 shows how to perform line 5 of Perform-MRC, and Section 2.5 shows how to perform line 5 of Perform-Block-BPC. 40 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

Perform-BPC(π) 1 π identity permutation perf ← 2 π π rem ← 3 Perform-MRC(πperf , πrem)

4 while πrem is not the identity permutation

5 do Perform-Block-BPC(πperf , πrem)

6 Perform-MRC(πperf , πrem)

Perform-MRC(πperf , πrem) 1 for j 0 to m 1 ← − 2 do π (j) i, where π (j) is the ith smallest member (starting from i = 0) MRC ← rem of π (0), π (1), . . . , π (m 1) { rem rem rem − } 3 for j m to n 1 ← − 4 do π (j) m + i, where π (j) is the ith smallest member (starting from MRC ← rem i = 0) of π (m), π (m + 1), . . . , π (n 1) { rem rem rem − } 5 perform the permutation πMRC

6 πperf πMRC πperf ← ◦ 1 7 π π π− rem ← rem ◦ MRC

Perform-Block-BPC(πperf , πrem) 1 for j 0 to b 1 ← − 2 do π (j) i, where π (j) is the ith smallest member (starting from BlockBPC ← rem i = 0) of π (0), π (1), . . . , π (b 1) { rem rem rem − } 3 for j b to n 1 ← − 4 do π (j) b + i, where π (j) is the ith smallest member (starting from BlockBPC ← rem i = 0) of π (b), π (b + 1), . . . , π (n 1) { rem rem rem − } 5 perform the permutation πBlockBPC

6 πperf πBlockBPC πperf ← 1◦ 7 π π π− rem ← rem ◦ BlockBPC

At this point, an example is in order. Let b = 6, m = 9, and n = 16, and suppose that we 2.2. BPC permutations 41 have the following permutation π:

j 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 π(j) 10 7 14 8 2 13 11 15 9 3 12 0 5 4 1 6

(Vertical lines separate positions 0, 1, . . . , b 1 , b, b+1, . . . , m 1 , and m, m+1, . . . , n 1 .) { − } { − } { − } For the first call of Perform-MRC, we have

j 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

πperf (j) before 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

πrem(j) before 10 7 14 8 2 13 11 15 9 3 12 0 5 4 1 6

πMRC(j) 4 1 7 2 0 6 5 8 3 11 15 9 13 12 10 14

πperf (j) after 4 1 7 2 0 6 5 8 3 11 15 9 13 12 10 14

πrem(j) after 2 7 8 9 10 11 13 14 15 0 1 3 4 5 6 12 and we perform the MRC permutation πMRC. In the first iteration of the while loop, we call Perform-Block-BPC and Perform-MRC, with the following effects:

j 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

πperf (j) before 4 1 7 2 0 6 5 8 3 11 15 9 13 12 10 14

πrem(j) before 2 7 8 9 10 11 13 14 15 0 1 3 4 5 6 12

πBlockBPC(j) 0 1 2 3 4 5 13 14 15 6 7 8 9 10 11 12

πperf (j) in between 4 1 14 2 0 13 5 15 3 8 12 6 10 9 7 11

πrem(j) in between 2 7 8 9 10 11 0 1 3 4 5 6 12 13 14 15

πMRC(j) 2 4 5 6 7 8 0 1 3 9 10 11 12 13 14 15

πperf (j) after 7 4 14 5 2 13 8 15 6 3 12 0 10 9 1 11

πrem(j) after 0 1 2 3 7 8 9 10 11 4 5 6 12 13 14 15 42 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems and we perform the permutations πBlockBPC and πMRC. The second iteration yields

j 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

πperf (j) before 7 4 14 5 2 13 8 15 6 3 12 0 10 9 1 11

πrem(j) before 0 1 2 3 7 8 9 10 11 4 5 6 12 13 14 15

πBlockBPC(j) 0 1 2 3 4 5 9 10 11 6 7 8 12 13 14 15

πperf (j) in between 10 4 14 5 2 13 11 15 9 3 12 0 7 6 1 8

πrem(j) in between 0 1 2 3 7 8 4 5 6 9 10 11 12 13 14 15

πMRC(j) 0 1 2 3 7 8 4 5 6 9 10 11 12 13 14 15

πperf (j) after 10 7 14 8 2 13 11 15 9 3 12 0 5 4 1 6

πrem(j) after 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 and we again perform πBlockBPC and πMRC. The algorithm then terminates, since the permu- tation πrem is now the identity permutation. Having presented the algorithm, we need to prove it correct, establish an upper bound on the number of parallel I/Os, and show how to extend it to the case in which the complement vector is nonzero.

Proof of correctness

Lemma 2.5 When Perform-BPC terminates, it has performed the permutation π.

Proof: We start by showing that between subroutine calls, Perform-BPC preserves the fol- lowing invariant: π π = π . rem ◦ perf

Initially, π is the identity permutation and π = π, so that π π = π. Now we perf rem rem ◦ perf consider the effects of calling the subroutines. Let πperf and πrem be the permutations before a subroutine call, and π and π be the permutations after the call. We have π π = π perf rem rem ◦ perf before the call. After calling Perform-MRC, we have b b

1 π π = (π π− ) (π π ) rem ◦ perf rem ◦ MRC ◦ MRC ◦ perf b b 2.2. BPC permutations 43

1 = π (π− π ) π (associativity of ) rem ◦ MRC ◦ MRC ◦ perf ◦ = π π rem ◦ perf = π .

After calling Perform-Block-BPC, we have

1 π π = (π π− ) (π π ) rem ◦ perf rem ◦ BlockBPC ◦ BlockBPC ◦ perf 1 = π (π− π ) π (associativity of ) b b rem ◦ BlockBPC ◦ BlockBPC ◦ perf ◦ = π π rem ◦ perf = π .

Having proven that the invariant holds, we conclude that when πrem is the identity per- mutation, πperf = π. Since πperf is the permutation performed so far, we have performed the desired bit permutation π.

Bounding the number of parallel I/Os

Lemma 2.6 Perform-BPC achieves the termination condition that πrem is the identity per- mutation after at most ρ(π)/(m b) iterations of the while loop. d − e

Proof: We start by observing that, as we noted above, Permute-BPC works by sorting the

πrem permutation. It calls Perform-MRC to sort the first m items and Perform-Block- BPC to sort the last n b items. There is an (m b)-sized area of overlap. Thus, we can − − view Permute-BPC as sorting the numbers 0, 1, . . . , n 1 in an array of size n in which we { − } alternately sort the first m numbers and the last n b numbers until the entire array is sorted. − Once a number in 0, 1, . . . , b 1 is placed into one of the first b positions, it remains in the { − } first b positions, since there aren’t enough smaller numbers to push it out. Similarly, once a number in m, m + 1, . . . , n 1 is placed into one of the last n m positions, it remains in { − } − the last n m positions. − Let κ be the set of numbers that start in the first m positions but belong in the last n m m − 44 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems positions, so that ρ (π) = κ . Consider the combined effect of calling Perform-MRC m | m| followed by calling Perform-Block-BPC. First, Perform-MRC moves the m b largest − numbers in the first m positions into the overlap area, and then Perform-Block-BPC moves the members of κ that are in the overlap area into the last n m positions, where they will m − remain. Each time we call Perform-MRC and Perform-Block-BPC, we move another m b members of κ into the last n m positions. The numbers so moved are drawn from − m − the set κ as long as any remain in the first m positions. After κ /(m b) pairs of calls, m d| m| − e therefore, all the numbers of κ have been moved into the last n m positions. m − A similar argument shows that if κ is the set of numbers that start in the last n b positions b − but belong in the first b positions (so that ρ (π) = κ ), after κ /(m b) pairs of calls of b | b| d| b| − e Perform-Block-BPC and Perform-MRC, all the numbers in κb have been moved into the first b positions. Therefore, one call of Perform-MRC followed by the calls of Perform-Block-BPC and Perform-MRC made during max( κ , κ )/(m b) iterations of the while loop moves d | m| | b| − e all the numbers in κm and κb into their correct ranges. The sorts performed in the last call place all the numbers into their actual correct positions. Since ρ(π) = max(ρm(π), ρb(π)) = max( κ , κ ), we have proven the lemma. | m| | b|

Perform-BPC 2N ρ(π) Corollary 2.7 performs at most BD 2 lg(M/B) + 1 parallel I/Os.  l m  Proof: Since m b = lg(M/B), Perform-BPC makes at most 2 ρ(π) + 1 calls of − lg(M/B) Perform-MRC and Perform-Block-BPC. Each call performs 2N/BlD paralmlel I/Os.

Theorem 2.8 A BPC permutation with characteristic matrix A and complement vector c can 2N ρ(A) be performed using at most BD 2 lg(M/B) + 1 parallel I/Os.  l m  Proof: Letting π be the permutation for which A is the permutation matrix, Lemma 2.5 and Corollary 2.7 prove the theorem for cases in which the complement vector c is all 0. To handle cases in which c = 0, we simply modify Perform-BPC so that the last call of Perform-MRC 6 also complements according to c. 2.3. Permutable sets of blocks 45

2.3 Permutable sets of blocks

In this section, we describe the technique of decomposing a permutation into permutable sets of blocks. We shall use this technique in later sections to perform MRC and block BMMC permutations. We assume that the records to be permuted initially reside in areas on their disks that we call the source portion of the parallel disk system. Each disk holds N/D records. We also assume that each disk has enough spare storage to hold another copy of the data it holds, that is, that in addition to the N/D records initially on each disk, there is room for another N/D records as well; we call this spare storage area the target portion. The track number in a record address is relative to the beginning of the disk portion in which the record resides. A pass consists of repeatedly reading source blocks from the source portion of each disk into RAM, permuting the records of these blocks in RAM, and then writing the records out in target blocks to the target portion of each disk. The roles of the source and target portions can then be reversed for the next pass. Each record, and thus each block, is read exactly once and written exactly once during a pass. One pass, therefore, performs exactly 2N/BD parallel I/Os. To perform a pass, we need to decompose the permutation into disjoint sets of blocks that we can read in k blocks per disk at a time, permute their records within RAM, and then write out k blocks per disk. For a permutation of source addresses to target addresses and a positive integer k, we define a k-permutable set of blocks as a set of kD source blocks such that

1. each disk contains exactly k of these source blocks,

2. the kBD records in the source blocks are mapped to exactly kD full target blocks, and

3. each disk has exactly k of these target blocks mapped to it.

Any permutation that can be decomposed into disjoint k-permutable sets of blocks can be performed in one pass as long as k M/BD, so that each set fits in RAM. ≤ 46 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

We call such a sequence of N/kBD reads and writes of k-permutable sets of blocks a schedule. In Sections 2.4 and 2.5, we shall see how to devise schedules for MRC and block BMMC permutations.

2.4 MRC permutations

In this section, we show how to perform any MRC permutation in just one pass, thus using exactly 2N/BD parallel I/Os. We shall do so by decomposing the permutation into M/BD- permutable sets of blocks to form a schedule. The schedule is simply comprised of the following N/M sets:

0, 1, . . . , M 1 , { − } M, M + 1, . . . , 2M 1 , { − } (2.7) . .

N M, N M + 1, . . . , N 1 . { − − − }

Theorem 2.9 The sets (2.7) form a schedule for any MRC permutation. Therefore, any MRC permutation can be performed in one pass.

Proof: The sets (2.7) obviously partition all N source addresses. Moreover, each set consists of M/BD source tracks, and hence each set contains exactly M/BD source blocks per disk. To show that the sets (2.7) form a schedule, we need to show that for each set, the M source records map to exactly M/B target blocks and that each disk has exactly M/BD of these target blocks mapped to it. It suffices to show that any pair of source addresses in the same memoryload are mapped to the same memoryload by any MRC permutation. That is, we wish to show that if x and z are two distinct source addresses in the kth set of (2.7) and their corresponding target addresses are y and w, then ym..n 1 = wm..n 1. (Since an MRC mapping is a permutation, it must also − − follow that y0..m 1 = w0..m 1.) − 6 − 2.4. MRC permutations 47

Consider two distinct source addresses x and z in the kth set. That is, x = kM + i and z = kM + j, where 0 k N/M 1, 0 i, j M 1, and i = j. Then ≤ ≤ − ≤ ≤ − 6

x0..m 1 = i = j = z0..m 1 (2.8) − 6 − and

xm..n 1 = zm..n 1 = k . (2.9) − −

Let y = Ax c and w = Az c be the corresponding target addresses for an MRC permutation ⊕ ⊕ with characteristic matrix A and complement vector c. An MRC permutation maps a source address x0 to a target address y0 by the equation

y0..m 1 α β x0..m 1 c0..m 1 − = − − , ym..n0 1 0 δ xm0 ..n 1 ⊕ cm..n 1  −     −   −  where α is m m and nonsingular and δ is (n m) (n m) and nonsingular. Thus, × − × −

y0..m 1 = α x0..m 1 β xm0 ..n 1 c0..m 1 , (2.10) − − ⊕ − ⊕ −

ym0 ..n 1 = δ xm..0 n 1 cm..n 1 . (2.11) − − ⊕ −

By equations (2.9) and (2.11), therefore, we have ym..n 1 = wm..n 1, which completes the proof. − − Note also that because the submatrix α in equation (2.10) is nonsingular, equation (2.8) implies that y0..m 1 = w0..m 1. − 6 −

In one pass, therefore, we can step through the data one memoryload at a time by reading in a memoryload from the source portion, permuting it in RAM, and writing it back out, although to possibly a different set of track numbers in the target portion. For some MRC permutations, it may be possible to form k-permutable sets of blocks for some value of k that is strictly less than M/BD. But k = M/BD always works for MRC permutations. Moreover, in practice it is often more efficient to read and write contiguous sets of tracks at once than to break the I/O into several pieces. 48 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

2.5 Block BMMC and block BPC permutations

In this section, we show how to perform any block BMMC permutation in just one pass, using exactly 2N/BD parallel I/Os. We also show how some of the mathematics of the block-BMMC algorithm can be avoided for the more restricted class of block BPC permutations. The algorithm for block BMMC permutations is easier to understand if we assume for now that B = 1, and thus b = 0 and n = d + t. That is, we assume that blocks contain just one record each and that we perform a BMMC permutation on them. We are given an n n × characteristic matrix A and a complement vector c of length n. The algorithm has two parts. First, we find a 1-permutable set of blocks. Second, we construct a schedule for the entire permutation given any 1-permutable set of blocks. After presenting these two parts, we shall see how to remove the restriction that B = 1.

2.5.1 Finding a 1-permutable set of blocks

Our description of finding a 1-permutable set of blocks starts with a simple observation: we can ignore all but the first d rows of the characteristic matrix A and all but the first d positions of the complement vector c. Why? The last n d rows of A and positions of c determine only − target-address track numbers, which do not affect 1-permutability. We find a 1-permutable set of blocks in four steps:

1. Find a set S of d “basis” columns for the first d rows of A.

2. Given the basis set S, define three sets of columns T , U, and V .

3. Based on the sets T and U, define a permutation R on the set 0, 1, . . . , D 1 . { − } 4. Given all of the above, define a set of source addresses x(0), x(1), . . . , x(D 1) , which { − } constitutes a 1-permutable set of blocks.

After describing these steps, we shall then prove that the set x(0), x(1), . . . , x(D 1) is indeed { − } a 1-permutable set of blocks. We use a running example to help describe the four steps. 2.5. Block BMMC and block BPC permutations 49

Finding a set S of basis columns

We start by finding a set S of d columns such that the submatrix A0..d 1,S is nonsingular. Such − a set exists by the following argument. Because A is nonsingular, all of its rows are linearly independent. In particular, rows 0, 1, . . . , d 1 are. Thus, the submatrix A0..d 1,0..n 1 has full − − − row rank. Hence, there exists a subset of column indices S 0, 1, . . . , n 1 such that S = d ⊆ { − } | | and the d d submatrix A0..d 1,S is nonsingular. Let us define × −

Q = A0..d 1,S , −

1 so that Q− exists. The basis S can be determined by performing Gaussian elimination to create an LUP decomposition of AT. (See [CLR90, pp. 749–761] for example.) The sequential running time of Gaussian elimination is Θ(n3) = Θ(lg3 N). In an LUP decomposition of AT, we compute a permutation matrix P , a unit lower triangular matrix8 L, and an upper U such that P AT = LU. The permutation matrix P specifies the basis S, as we shall show. Since 1 T P − = P for any permutation matrix P , we have

T T T 1 (P A ) = AP = AP − .

Since transposing a matrix does not affect its invertibility, for any k, the leading k k submatrix × of P AT is nonsingular if and only if the leading k k submatrix of AP 1 is nonsingular. Because × − 1 1 P − is a permutation matrix, the product AP − consists of the matrix A with its columns 1 1 rearranged. In particular, if (P − )i,j = 1, then the ith column of A is the jth column of AP − . If we can show that the leading d d submatrix of P AT is nonsingular, then by letting ×

1 S = i : (P − ) = 1 for some 0 j d 1 , { i,j ≤ ≤ − }

8A lower triangular matrix has all 0s above the main diagonal, and a upper triangular matrix has all 0s below the main diagonal. A unit triangular matrix is a triangular matrix with 1s along the main diagonal. 50 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems we will have our basis. We make use of the following facts from linear algebra.9

Lemma 2.10 Any triangular matrix whose diagonal elements are all nonzero is nonsingular.

Lemma 2.11 Any leading submatrix of a nonsingular triangular matrix is nonsingular.

Lemma 2.12 In an LUP decomposition P AT = LU of a nonsingular matrix AT, the upper triangular matrix U is nonsingular.

We now show that the leading d d submatrix of P AT is nonsingular, thus obtaining our × basis.

Theorem 2.13 In an LUP decomposition P AT = LU of a nonsingular matrix AT, any leading square submatrix of P AT is nonsingular.

Proof: We shall show that for any k, the leading k k submatrix of P AT is nonsingular. Let × X = P AT, so that X = LU. Let us divide X, L, and U to separate out their leading k k × submatrices χ, λ, and υ, respectively:

χ β λ 0 υ σ = . γ δ µ ξ 0 τ      

The submatrix λ is unit lower triangular, so by Lemma 2.10, it is nonsingular. By Lemmas 2.11 and 2.12, the submatrix υ is nonsingular, too. Since χ = λ υ, the leading k k submatrix × χ is nonsingular.

Running example: As an example, let n = 5 and d = 3, and consider the characteristic matrix

0 1 1 1 0 0 0 0 1 1   A = 0 1 1 0 0 .    1 1 0 0 1     1 0 0 1 0      9Lemmas 2.10–2.12 and Theorem 2.13 apply to standard matrices over the reals as well. 2.5. Block BMMC and block BPC permutations 51

We choose the basis S = 1, 3, 4 , so that { }

1 1 0 0 0 1 1 Q = A = 0 1 1 and Q− = 1 0 1 . 0..d 1,S     − 1 0 0 1 1 1     Defining the sets T , U, and V

Given a set of basis columns S, we define

T = 0, 1, . . . , d 1 S , { − } − U = S 0, 1, . . . , d 1 , ∩ { − } V = 0, 1, . . . , n 1 (S T ) . { − } − ∪

The sets T and U form a partition of 0, 1, . . . , d 1 , and the sets S, T , and V form a partition { − } of 0, 1, . . . , n 1 . Think of T as the subset of the first d columns that are not in the basis S { − } and of U as the set of basis columns that are also among the first d columns. We order T and U into increasing order (we don’t need to order V ) so that

T = t0, t1, . . . , t T 1 , tj > tj 1 for j = 1, 2, . . . , T 1 { | |− } − | | −

U = u0, u1, . . . , u U 1 , uj > uj 1 for j = 1, 2, . . . , U 1 . { | |− } − | | −

Running example: With S = 1, 3, 4 , we have T = 0, 2 , U = 1 , and V = . { } { } { } ∅

Defining the permutation R

For the sake of clarity, we shall be more explicit in the remainder of this section about conver- sions between integers and their binary representations. Let us denote the d-bit representation of an integer i 0, 1, . . . , D 1 by bin(i); this representation can be thought of as a vector of ∈ { − } length d. Let us also denote the jth bit of this representation, for j = 0, 1, . . . , d 1, by bin (i). − j (Thus, bin (i) = i/2j mod 2.) We extend this notation from individual bits to sets of bits in j b c the natural way. 52 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

We define the permutation R on 0, 1, . . . , D 1 as follows. Let i 0, 1, . . . , D 1 , and { − } ∈ { − } let the binary representation of i be bin(i) = (i0, i1, . . . , id 1). Then R(i) = k, where the binary − representation of k is bin(k) = (k0, k1, . . . , kd 1) and −

k = i for j = 0, 1, . . . , U 1 , j uj | | −

k U +j = itj for j = 0, 1, . . . , T 1 . | | | | −

Because R is a BPC permutation on bin(i), it defines a permutation on 0, 1, . . . , D 1 . Note { − } that R is constructed so that

bin0.. U 1(R(i)) = binU (i) (2.12) | |− for i = 0, 1, . . . , D 1. −

Running example: Since d = 3, we have D = 8. One way to see how to form the permutation R is to write out 0 through D 1 in binary and label each of the d columns of bits according to − whether it belongs to T or U. Then rearrange the columns so that all the columns of U precede all the columns of T :

T U T U T T i 0 1 2 1 0 2 R(i) 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 2 2 0 1 0 1 0 0 1 = 3 1 1 0 ⇒ 1 1 0 3 4 0 0 1 0 0 1 4 5 1 0 1 0 1 1 6 6 0 1 1 1 0 1 5 7 1 1 1 1 1 1 7 2.5. Block BMMC and block BPC permutations 53

() () (D ) Defining the set of source addresses {x , x , . . . , x − }

We define the set x(0), x(1), . . . , x(D 1) according to bit indices: { − }

(i) 1 xS = Q− A0..d 1,T binT (i) bin(R(i)) c0..d 1 , (2.13) − ⊕ ⊕ − (i) xT = binT (i) , (2.14) (i) xV = 0 (2.15) for i = 0, 1, . . . , D 1. −

Running example: We let the complement vector c be all 0s to keep the example simple. Let us see how to compute x(6). We start by computing the bits in positions 1, 3, and 4, since S = 1, 3, 4 . Writing lower-order bits on the left, and indexing from zero, we have that { } bin(6) = 011 and, taking bits 0 and 2 because T = 0, 2 , we have bin (6) = 01. From above, { } T we have bin(R(6)) = 101, and so

(6) 1 x 1,3,4 = Q− A0..2,T binT (6) bin(R(6)) { } ⊕ 0 0 1 0 1 1 0 = 1 0 1 0 0 0    1 ⊕   1 1 1 0 1   1      0 1 1 0 = 0 0 0   1 ⊕   0 0   1     1 1 = 0 0   ⊕   0 1     0 = 0 .   1   Filling these values into the positions indexed by set S in the source address, we have that x(6) =?0?01, where the positions marked by question marks correspond to set T . For these (6) remaining positions, we use the bits of binT (6) = 01, and so x = 00101. We also have that y(6) = Ax(6) = 11110. 54 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

If we were to do this calculation for i = 0, 1, . . . , 7, we would compute the following source and target addresses:

0 1 2 3 4 0 1 2 3 4 x(0) 0 0 0 0 0 y(0) 0 0 0 0 0 x(1) 1 0 0 1 0 y(1) 1 1 0 1 0 x(2) 0 1 0 0 0 y(2) 1 0 1 1 0 x(3) 1 1 0 1 0 y(3) 0 1 1 0 0 x(4) 0 1 1 0 1 y(4) 0 1 0 0 0 x(5) 1 1 1 1 1 y(5) 1 0 0 1 0 x(6) 0 0 1 0 1 y(6) 1 1 1 1 0 x(7) 1 0 1 1 1 y(7) 0 0 1 0 0 Note that for both the set of source addresses and the set of target addresses, the disk numbers, which appear in the leftmost three columns, are permutations of 0, 1, . . . , 7 . { }

() () (D ) Proving that {x , x , . . . , x − } is a 1-permutable set of blocks

It remains to show that the set x(0), x(1), . . . , x(D 1) is a 1-permutable set of blocks. Letting { − } (i) (i) (0) (1) (D 1) y = Ax c for i = 0, 1, . . . , D 1, it suffices to show that both x0..d 1, x0..d 1, . . . , x0..d− 1 ⊕ − { − − − } (0) (1) (D 1) and y0..d 1, y0..d 1, . . . , y0..d− 1 are permutations of 0, 1, . . . , D 1 . The following two the- { − − − } { − } orems prove these properties.

Theorem 2.14 Let b = 0, and define x(0), x(1), . . . , x(D 1) according to equations (2.13)– { − } (0) (1) (D 1) (2.15). Then x0..d 1, x0..d 1, . . . , x0..d− 1 is a permutation of 0, 1, . . . , D 1 . { − − − } { − } (k) (l) Proof: It suffices to show that for all k, l 0, 1, . . . , D 1 , k = l implies x0..d 1 = x0..d 1. ∈ { − } 6 − 6 − Since T U = 0, 1, . . . , d 1 , then k = l implies that bin (k) = bin (l) or bin (k) = bin (l). ∪ { − } 6 T 6 T U 6 U (k) (l) (k) (l) If binT (k) = binT (l), then by equation (2.14), we have xT = xT and hence x0..d 1 = x0..d 1. 6 6 − 6 − Now suppose that bin (k) = bin (l). Then U = and bin (k) = bin (l). By equa- T T 6 ∅ U 6 U tion (2.13), we have

(k) 1 xS = Q− A0..d 1,T binT (k) bin(R(k)) c0..d 1 , − ⊕ ⊕ − 2.5. Block BMMC and block BPC permutations 55

(l) 1 xS = Q− A0..d 1,T binT (l) bin(R(l)) c0..d 1 , − ⊕ ⊕ − which implies that x(k) x(l) = bin(R(k)) bin(R(l)) S ⊕ S ⊕ since bin (k) = bin (l). Because U = S 0, 1, . . . , d 1 , if we were to order the column T T ∩ { − } numbers in S, the first U of them would be the set U. Thus, the first U positions of x(k) | | | | S (k) (l) comprise xU , and the same holds for x . We therefore have

(k) (l) xU xU = bin0.. U 1(R(k)) bin0.. U 1(R(l)) ⊕ | |− ⊕ | |− = bin (k) bin (l) (by equation (2.12)) U ⊕ U = 0 6

(k) (l) (k) (l) since binU (k) = binU (l). We conclude that xU = xU and thus x0..d 1 = x0..d 1. 6 6 − 6 − Theorem 2.15 Let b = 0, define x(0), x(1), . . . , x(D 1) according to equations (2.13)–(2.15), { − } (i) (i) (0) (1) (D 1) and let y = Ax c for i = 0, 1, . . . , D 1. Then y0..d 1, y0..d 1, . . . , y0..d− 1 is a permutation ⊕ − { − − − } of 0, 1, . . . , D 1 . { − } (k) (l) Proof: It suffices to show that for all k, l 0, 1, . . . , D 1 , k = l implies y0..d 1 = y0..d 1. ∈ { − } 6 − 6 − We start by noting that

(k) (l) 1 xS xS = (Q− A0..d 1,T binT (k) bin(R(k)) c0..d 1) ⊕ − ⊕ ⊕ − 1 (Q− A0..d 1,T binT (l) bin(R(l)) c0..d 1) ⊕ − ⊕ ⊕ − 1 = Q− A0..d 1,T (binT (k) binT (l)) bin(R(k)) bin(R(l)) . (2.16) − ⊕ ⊕ ⊕

Thus,

(k) (l) (k) (l) y0..d 1 y0..d 1 = (A0..d 1,0..n 1 x c0..d 1) (A0..d 1,0..n 1 x c0..d 1) − ⊕ − − − ⊕ − ⊕ − − ⊕ − (k) (k) (k) = (A0..d 1,S xS A0..d 1,T xT A0..d 1,V xV ) − ⊕ − ⊕ − (l) (l) (l) (A0..d 1,S xS A0..d 1,T xT A0..d 1,V xV ) ⊕ − ⊕ − ⊕ − 56 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

(k) (l) (k) (l) = Q(xS xS ) A0..d 1,T (xT xT ) A0..d 1,V 0 ⊕ ⊕ − ⊕ ⊕ − · 1 = Q[Q− A0..d 1,T (binT (k) binT (l)) bin(R(k)) bin(R(l))] − ⊕ ⊕ ⊕

A0..d 1,T (binT (k) binT (l)) (by equations (2.14) and (2.16)) ⊕ − ⊕ 1 = QQ− A0..d 1,T (binT (k) binT (l)) Q(bin(R(k)) bin(R(l))) − ⊕ ⊕ ⊕

A0..d 1,T (binT (k) binT (l)) ⊕ − ⊕ = Q(bin(R(k)) bin(R(l))) ⊕ = 0 , 6 since k = l implies R(k) = R(l) and the submatrix Q is nonsingular. We conclude that 6 6 (k) (l) y0..d 1 = y0..d 1, which completes the proof. − 6 −

2.5.2 Devising a schedule given a 1-permutable set of blocks

Having seen how to find a 1-permutable set of blocks, we now show how to use this set to decompose the BMMC permutation into 1-permutable sets of blocks that together form a schedule. We still assume that B = 1 and thus b = 0. The method is simple. We define a set of N/D 1 n-vectors p(0), p(1), . . . , p(N/D 1) such − { − } (j) (j) that p0..d 1 = 0 and pd..n 1 is the binary representation of j for j = 0, 1, . . . , N/D 1. Given − − − a 1-permutable set of blocks x(0), x(1), . . . , x(D 1) , we form the following N/D sets: { − }

x(0) p(0), x(1) p(0), . . . , x(D 1) p(0) , { ⊕ ⊕ − ⊕ } x(0) p(1), x(1) p(1), . . . , x(D 1) p(1) , { ⊕ ⊕ − ⊕ } x(0) p(2), x(1) p(2), . . . , x(D 1) p(2) , (2.17) { ⊕ ⊕ − ⊕ } . .

x(0) p(N/D 1), x(1) p(N/D 1), . . . , x(D 1) p(N/D 1) . { ⊕ − ⊕ − − ⊕ − }

Theorem 2.16 Let B = 1, and let x(0), x(1), . . . , x(D 1) be a 1-permutable set of blocks { − } for a BMMC permutation with characteristic matrix A and complement vector c. For j = 0, 1, (j) (j) (j) . . . , N/D 1, let p be an n-vector for which p0..d 1 = 0 and pd..n 1 is the binary representation − − − 2.5. Block BMMC and block BPC permutations 57 of j. Then the N/D sets (2.17) form a schedule to perform the BMMC permutation in one pass.

Proof: We start by showing that each set is a 1-permutable set of blocks. For i = 0, 1, . . . , D 1 − and j = 0, 1, . . . , N/D 1, let y(i) = Ax(i) c, z(i,j) = x(i) p(j), and w(i,j) = Az(i,j) c. − ⊕ ⊕ ⊕ (0,j) (1,j) (D 1,j) For all j, we need to show that the sets z0..d 1, z0..d 1, . . . , z0..d− 1 of source addresses and { − − − } (0,j) (1,j) (D 1,j) w0..d 1, w0..d 1, . . . , w0..d− 1 of target addresses are permutations of 0, 1, . . . , D 1 . Because { − − − } { − } (j) (i,j) (i) (0) p0..d 1 = 0 for j = 0, 1, . . . , N/D 1, we have z0..d 1 = x0..d 1 for all i and j. Since x0..d 1, − − − − { − (1) (D 1) (0,j) (1,j) (D 1,j) x0..d 1, . . . , x0..d− 1 is a permutation of 0, 1, . . . , D 1 , so is z0..d 1, z0..d 1, . . . , z0..d− 1 . − − } { − } { − − − } (0,j) (1,j) (D 1,j) For w0..d 1, w0..d 1, . . . , w0..d− 1 , we have that for all i and j, { − − − }

w(i,j) = Az(i,j) c ⊕ = A(x(i) p(j)) c ⊕ ⊕ = Ax(i) Ap(j) c ⊕ ⊕ = y(i) Ap(j) . ⊕

(0) (1) (D 1) (0) (1) (D 1) Because x , x , . . . , x − is a 1-permutable set of blocks, y0..d 1, y0..d 1, . . . , y0..d− 1 is a { } { − − − } permutation of 0, 1, . . . , D 1 . The vector Ap(j) that we XOR into each y(i) depends only on { − } (0) (1) (D 1) j and not on i, and thus we XOR the same vector into y0..d 1, y0..d 1, . . . , y0..d− 1 . Therefore, { − − − } (0,j) (1,j) (D 1,j) w0..d 1, w0..d 1, . . . , w0..d− 1 is a permutation of 0, 1, . . . , D 1 . { − − − } { − } It remains to show that each source block appears exactly once in the sets (2.17). Consider a block whose source address is the n-vector z. We must show that z = x(i) p(j) for some i ⊕ (i) and j. Choose i 0, 1, . . . , D 1 so that z0..d 1 = x0..d 1. Having chosen i, choose j 0, 1, ∈ { − } − − ∈ { (j) (i) . . . , N/D 1 so that pd..n 1 = xd..n 1 zd..n 1. − } − − ⊕ −

Thus we see that, assuming that B = 1, once we have found any 1-permutable set of blocks, we can construct an entire schedule by simply XORing the most signficant t bits by 0, 1, . . . , N/D 1 in turn. − 58 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

2.5.3 Removing the restriction on the block size

Removing the restriction that B = 1 is easy. By the structure of block BMMC matrices, the b least significant bits of a target address depend only on the b least significant bits of the corresponding source address. (Recall that the lower left (n b) b submatrix of the − × characteristic matrix is 0.) Moreover, the leading b b submatrix of the characteristic matrix is × nonsingular. Therefore, records that start in the same block end up in the same block. Likewise, the most significant n b bits of a target address depend only on the most significant n b − − bits of the corresponding source address. Blocks, therefore, are permuted according only to their block addresses. Thus, the “permute the records read in RAM” part of performing a pass consists of simply permuting records within the boundaries of their current blocks. We have thus shown how to perform block BMMC permutations in one pass, using exactly 2N/BD parallel I/Os.

2.5.4 Block BPC permutations

For block BPC permutations, which are included in the class of block BMMC permutations, some of the mathematics in finding an initial 1-permutable set of blocks becomes considerably simpler. Let us again work under the assumption that B = 1; we use the method we have just seen to extend the scheme to the case in which B > 1. We are given a characteristic matrix

A = (aij) that is a permutation matrix and a complement vector c. Finding the set of basis columns S is then easy. We don’t need to perform Gaussian elimination. We only need to find the columns that have a 1 within the first d rows:

S = j : a = 1 for some 0 i d 1 . { ij ≤ ≤ − }

(Note that Q = A0..d 1,S is a permutation matrix.) Since S T = and all d of the 1s in the − ∩ ∅ first d rows of A are contained in A0..d 1,S, we have that A0..d 1,T = 0. The computation of − − x(i), for i = 0, 1, . . . , D 1, becomes simpler, involving no matrix operations. Starting from S − 2.6. BMMC permutations 59 equation (2.13), we have

(i) 1 xS = Q− A0..d 1,T binT (i) bin(R(i)) c0..d 1 − ⊕ ⊕ −

= bin(R(i)) c0..d 1 . ⊕ −

(i) (i) The definitions of xT = binT (i) and xV = 0 remain the same.

2.6 BMMC permutations

In this section, we show how to perform BMMC permutations using MRC, block BPC, and BPC permutations. As usual, we assume that the BMMC permutation is given by an n n × characteristic matrix A and a complement vector c of length n. The number of parallel I/Os is at most 2N lg M rank(A0.. lg M 1,0.. lg M 1) 2 − − − + H(N, M, B) , BD lg(M/B)     where lg B 4 + 9 if M √N , lg(M/B) ≤     lg(N/B) H(N, M, B) =  4 + 1 if √N < M < √NB , (2.18)  lg(M/B)    √  5 if NB M .  ≤  Our strategy is to factor the matrix A into a product of matrices, each of which is the characteristic matrix of an MRC, block BPC, or BPC permutation. For now, we ignore the complement vector c. We read the factors right to left to determine the order in which to perform these permutations. For example, if we factor A = V W , then we perform A by performing the permutation with characteristic matrix W and then performing the permutation with characteristic matrix V . The reason for this right-to-left order is that if y = Ax, then we

first multiply W x, giving y0, and then multiply V y0, giving y. Factoring the matrix A in this way will make it easy to count how many passes the BMMC permutation takes. 60 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

Permuting columns to create a nonsingular leading submatrix

We start by permuting the columns of A so that the leading m m submatrix is nonsingular. × That is, we factor A = A Π ,

b where Π is a permutation matrix and the submatrix A0..m 1,0..m 1 is nonsingular. We would − − like to permute the columns of A so that the number of bI/Os required to perform Π is minimum. We call such a permutation a minimum-impact permutation. The size of the largest set S of columns of A0..m 1,0..m 1 that is linearly independent is equal to rank(A0..m 1,0..m 1). As the − − − − following theorem shows, we can come up with a minimum-impact permutation by choosing a set T of m rank(A0..m 1,0..m 1) columns from the rightmost n m columns of A that, along − − − − with S, provide the needed set of m linearly independent columns.

Theorem 2.17 Let A be a nonsingular n n matrix, m be an integer such that 1 m × ≤ ≤ n, and r = rank(A0..m 1,0..m 1). Then for any set S of r linearly independent columns of − − A0..m 1,0..m 1, there is a set T m, m + 1, . . . , n 1 such that T = m r and A0..m 1,S T − − ⊆ { − } | | − − ∪ is a nonsingular m m matrix. ×

Proof: We start by showing that for any k < m, if S 0, 1, . . . , n 1 is such that S = k 0 ⊆ { − } | 0| and the columns of A0..m 1,S0 are linearly independent, then there exists a column j S 0 such − 6∈ that the columns of A0..m 1,S0 j are linearly independent. Suppose that no such j exists. − ∪{ } Then all columns in 0, 1, . . . , n 1 S are linearly dependent on columns in S . Because { − } − 0 0 S0 < m, there is no set of m linearly independent columns of A0..m 1,0..n 1. The column | | − − rank of A0..m 1,0..n 1 is therefore less than m, which implies that the row rank is as well. But − − then there are linearly dependent rows in A0..m 1,0..n 1, contradicting our assumption that A − − is nonsingular. We can therefore choose m r columns not in the set S of the theorem statement one at − a time, to form a set T such that A0..m 1,S T is nonsingular. Since r = rank(A0..m 1,0..m 1), − ∪ − − all the columns in A0..m 1, 0,1,...,m 1 S are linearly dependent on columns in A0..m 1,S. The − { − }− − column indices in T , therefore, must be drawn from m, m + 1, . . . , n 1 . { − } 2.6. BMMC permutations 61

To find these sets S and T , we can use Gaussian elimination on AT, just as we did in performing block BMMC permutations in Section 2.5. In Gaussian elimination, we choose a pivot row by finding a row with a nonzero diagonal entry at or below the current row position. In matrices with entries drawn from the real numbers, we often choose the diagonal with the greatest absolute value for the sake of numerical stability. Here, however, matrix entries are 0 or 1. By choosing the first row at or below the current row position that has a 1 in the diagonal, we get an LUP decomposition P AT = LU in which the permutation matrix P has the minimum-impact property with respect to its rows. By setting

T 1 Π = P = P − , (2.19) the matrix Π has the minimum-impact property with respect to its columns, as desired. How many I/Os are required to perform the permutation Π? Since Π specifies that m − rank(A0..m 1,0..m 1) columns move from the upper n m positions to the lower m positions, − − − we have ρm(Π) = m rank(A0..m 1,0..m 1). We claim that ρb(Π) ρm(Π). Why? For each − − − ≤ column that moves from the upper n b positions to the lower b positions, there is a column − that moves from the upper n m positions to the lower m positions. A column that moves − from the upper n m positions to the lower b positions is moving into the lower m positions − as well, since m b. A column that moves from positions b, b + 1, . . . , m 1 into the lower ≥ { − } b positions vacates a position that is filled by a column from the upper n m positions. Thus, − ρ(Π) = ρm(Π) = m rank(A0..m 1,0..m 1). By Theorem 2.8, therefore, we can perform Π using − − − at most 2N lg M rank(A0.. lg M 1,0.. lg M 1) 2 − − − + 1 (2.20) BD lg(M/B)     parallel I/Os. 62 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

Factoring the remaining matrix

Our next task is to factor the matrix A, which is nonsingular since A and Π are. We divide A between the lower m rows and columns and the upper n m rows and columns: b − b

m n − m α β m , A = " γ δ # n − m b so that the leading m m submatrix α is nonsingular. Then we factor A as × b m n − m m n − m I 0 α β m . (2.21) A = V W = 1 1 " γ α− I # " 0 γ α− β δ # n − m ⊕ b Because the matrix V is unit lower triangular, Lemma 2.10 says that it is nonsingular. There- fore, so is the matrix W . In fact, it is easy to show that W characterizes an MRC permutation, which requires only one pass. The leading m m submatrix α is nonsingular by construction. × The trailing (n m) (n m) submatrix γ α 1 β δ must be nonsingular as well, for if it − × − − ⊕ contained linearly dependent rows, then so would W , contradicting the nonsingularity of W . We still have to factor the matrix V , which would be an identity matrix if the lower left 1 submatrix, equal to γ α− , were 0. The location of this submatrix is exactly the location that 1 is required to be all 0s in both MRC and block BMMC permutations. If γ = 0, then γ α− = 0, and we don’t need to factor any further. In general, however, we factor V by applying BPC permutations to move the entries of this submatrix to locations that allow us to perform MRC or block BPC permutations. The way in which we factor V depends on whether m n m, ≤ − which is equivalent to m n/2 or M √N. ≤ ≤ First, we handle the case in which m > n/2. We divide the (n m) m submatrix product − × 1 γ α− as 2m − n n − m 1 γ α− = [ σ τ ] n − m 2.6. BMMC permutations 63 so that we divide V as

2m − n n − m n − m I 0 0 2m − n . V =  0 I 0  n − m σ τ I   n − m   We factor V into

V = Π0 Z Π0 , where 2m − n n − m n − m I 0 0 2m − n (2.22) Π0 =  0 0 I  n − m 0 I 0 n − m     and 2m − n n − m n − m I 0 0 2m − n . (2.23) Z =  σ I τ  n − m 0 0 I n − m     The matrix Z is characteristic of an MRC permutation, since its leading m m submatrix is × unit lower triangular and by Lemma 2.10, is thus nonsingular. The permutation characterized by Z can therefore be performed in one pass.

There are two cases in the analysis for the permutation matrix Π0:

1. If b 2m n or, equivalently, √NB M, then Π is the characteristic matrix of a block ≤ − ≤ 0 BPC permutation, so Π0 can be performed in only one pass.

2. If b > 2m n or, equivalently, √NB > M, then we use Theorem 2.8 to obtain a bound. − We have

ρ (Π0) = n m , m −

ρ (Π0) = b (2m n) b − − n m , ≤ − 64 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems

so that ρ(Π ) = n m. Therefore, we can perform Π in at most 0 − 0 n m n b 2 − + 1 = 2 − 1 + 1 m b m b − −  −  l m lg(N/B) = 2 1 (2.24) lg(M/B)   −

passes.

Now we factor V for the case in which m n/2 or, equivalently, M √N. We divide ≤ ≤ 1 γ α− as m

1 φ m γ α− = " ψ # n − 2m so that we divide V as

m m n − 2m I 0 0 m . V =  φ I 0  m ψ 0 I   n − 2m   We factor V into

V = Π00 Y Π00 , where m m n − 2m 0 I 0 m (2.25) Π00 =  I 0 0  m 0 0 I n − 2m     characterizes a BPC permutation and

m m n − 2m I φ 0 m . (2.26) Y =  0 I 0  m 0 ψ I n − 2m     I 0 Because the trailing (n m) (n m) submatrix of Y , equal to , is unit lower ψ I − × −   2.6. BMMC permutations 65 triangular, this submatrix is nonsingular. Hence, Y characterizes an MRC permutation, which can be performed in one pass. For the permutation matrix Π00, we have

ρ (Π00) = b m = ρ (Π00) , b ≤ m which implies that the number of passes to perform Π00 is at most

m b 2 + 1 = 2 + 1 + 1 m b m b −  −  l m lg B = 2 + 3 . (2.27) lg(M/B)  

Analysis

Putting all the factors together, we get the following theorem.

Theorem 2.18 A BMMC permutation with characteristic matrix A and complement vector c can be performed using at most

2N lg M rank(A0.. lg M 1,0.. lg M 1) 2 − − − + H(N, M, B) BD lg(M/B)     parallel I/Os, where

lg B 4 + 9 if M √N , lg(M/B) ≤     lg(N/B) H(N, M, B) =  4 + 1 if √N < M < √NB ,  lg(M/B)    √  5 if NB M .  ≤  

Proof: Assume for the moment that the complement vector c is all 0s. We factor the matrix A into a product of matrices and perform the permutations given by the factors from right to left. If M √N, we factor A as ≥ A = Π0 Z Π0 W Π , 66 Chapter 2. Performing Bit-Defined Permutations on Parallel Disk Systems where the factors are defined in equations (2.19) and (2.21)–(2.23). By expression (2.20), at

lg M rank(A − − ) − 0.. lg M 1,0.. lg M 1 most 2 lg(M/B) + 1 passes are needed to perform the BPC permutation Π. If √NBl M, then the remainingmfour factors can be performed in one pass each. If √N < ≤ M < √NB, then by expression (2.24), the two Π0 BPC permutations can be performed in at most lg(N/B) lg(N/B) 2 2 1 = 4 2  lg(M/B)  −  lg(M/B)  − passes. Adding in the two passes for W and Z achieves the stated bound. If M < √N, we factor A as

A = Π00 Y Π00 W Π , where the factors are defined by equations (2.19), (2.21), (2.25), and (2.26). Again, expres- sion (2.20) gives the I/O bound to perform Π. By expression (2.27), we can perform the two

Π00 BPC permutations in at most

lg B lg B 2 2 + 3 = 4 + 6 lg(M/B) lg(M/B)       passes. Adding in the two passes for W and Y achieves the stated bound. To handle cases in which the complement vector c is nonzero, we include c in the last BPC permutation performed, either Π0 or Π00. If V = I, in which case we perform neither Π0 nor Π00, we complement when performing the MRC permutation W .

We conclude this section with an observation about the function H(N, M, B). It is Θ lg B + 1 if M √N and Θ lg(N/B) + 1 if √N < M < √NB. Although the latter lg(M/B) ≤ lg(M/B) boun d is asympto tically no better than we get by sorting, we can actually write the asymptotic bound as Θ lg min(B,N/B) + 1 for M < √NB. If M √N, then we have B M √N, lg(M/B) ≤ ≤ ≤ which implies that N/B √N B. If √N < M < √NB, then lg(N/B) = O lg B + 1 , ≥ ≥ lg(M/B) lg(M/B) as follows. Let M = √NB, where 0 <  < 1/2. Then,  

\[
\frac{\lg(N/B)}{\lg(M/B)}
= \frac{\lg(N/M)}{\lg(M/B)} + 1
= \frac{\lg(\sqrt{N}/B^\epsilon)}{\lg(\sqrt{N}/B^{1-\epsilon})} + 1
= \log_{\sqrt{N}/B^{1-\epsilon}} \frac{\sqrt{N}}{B^\epsilon} + 1
= \log_{\sqrt{N}/B^{1-\epsilon}} B^{1-2\epsilon} + 2
< \frac{\lg B}{\lg(M/B)} + 2 .
\]

Although our method for performing BMMC permutations is not asymptotically faster than sorting when $\sqrt{N} < M < \sqrt{NB}$, the value of $\frac{\lg(N/B)}{\lg(M/B)}$ is relatively small, being bounded asymptotically by $\frac{\lg B}{\lg(M/B)}$ in this range.
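To make the bound of Theorem 2.18 concrete, here is a small helper of my own (not part of VM-DP) that evaluates the upper bound for given $N$, $M$, $B$, $D$, and the rank of the leading $\lg M \times \lg M$ submatrix of the characteristic matrix; the example parameters are arbitrary.

from math import ceil, log2, sqrt

def bmmc_io_upper_bound(N, M, B, D, leading_rank):
    # Upper bound of Theorem 2.18 on parallel I/Os for a BMMC permutation.
    lgM, lgB = log2(M), log2(B)
    lgMB = log2(M / B)
    if M <= sqrt(N):
        H = 4 * ceil(lgB / lgMB) + 9
    elif M < sqrt(N * B):
        H = 4 * ceil(log2(N / B) / lgMB) + 1
    else:
        H = 5
    return (2 * N // (B * D)) * (2 * ceil((lgM - leading_rank) / lgMB) + H)

# Example: N = 2**20 records, M = 2**16, B = 2**10, D = 8, full-rank leading block.
print(bmmc_io_upper_bound(2**20, 2**16, 2**10, 8, leading_rank=16))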

2.7 Arbitrary block permutations

In this section, we show how to perform arbitrary block permutations in one pass with off-line scheduling. By “off-line,” we mean that, unlike all the other algorithms in this chapter, the schedule is not easily computed on-line; the amount of data required to compute a schedule for an arbitrary block permutation may exceed the RAM available in the computer. Let us assume that we are given a permutation $\pi : \{0, 1, \ldots, N/B-1\} \rightarrow \{0, 1, \ldots, N/B-1\}$ of block addresses. We shall compute off-line a schedule for $\pi$ consisting of a set of $N/BD$ 1-permutable sets of blocks. The key idea is to view the permutation as an undirected bipartite multigraph¹⁰ and find a good edge coloring on the multigraph. A $k$-edge coloring of a (multi)graph $G = (V, E)$ is an assignment of $k$ colors to the edges of $G$ such that no two edges incident on a common vertex have the same color.

We interpret a block permutation $\pi$ as a bipartite multigraph as follows. The multigraph is $G_\pi = (V, E)$, where $V = S \cup T$, $S = \{s_0, s_1, \ldots, s_{D-1}\}$, $T = \{t_0, t_1, \ldots, t_{D-1}\}$, and $E$ contains one edge $(s_i, t_j)$ for each block whose source address is on disk $\mathcal{D}_i$ and target address is on disk $\mathcal{D}_j$. Since each disk contains $N/BD$ source addresses and $N/BD$ target addresses, each vertex of $G_\pi$ has exactly $N/BD$ incident edges. That is, $G_\pi$ is $N/BD$-regular. By a well-known theorem (see Bondy and Murty [BM76, p. 93] for example), $G_\pi$ has an $N/BD$-edge coloring. The set of edges colored by any given color in such a coloring corresponds directly to a 1-permutable set of blocks: each edge in the set corresponds to a block, and each disk appears as a source exactly once and as a target exactly once.

The problem then is to find an $N/BD$-edge coloring. Unfortunately, this problem can get big. There are $N/B$ edges, requiring $\Theta(N/B)$ storage; if $N/B = \omega(M)$, then we cannot fit $G_\pi$ into RAM.¹¹ If $N/B = \omega(D^2)$, we can reduce this storage to $\Theta(D^2)$ by representing $G_\pi$ by a $D \times D$ integer-valued matrix in which each entry represents the number of duplicate edges connecting two vertices. If $D^2 = \omega(M)$, however, this scheme does not work. And, even if we can fit the representation of $G_\pi$ into RAM, finding an $N/BD$-edge coloring can be time-consuming. The sequential algorithm by Gabow and Kariv [GK82] finds the coloring in $O(\min(|E| \lg^2 |V|,\; |V|^2 \lg |V|,\; |V|\,|E_{\mathrm{unique}}| \lg \mu))$ time, where $E_{\mathrm{unique}}$ is the set of “unique” edges of $G_\pi$ (all edges between the same pair of vertices are coalesced into one) and $\mu$ is the maximum edge multiplicity (the maximum over all pairs of vertices of the number of multiedges between them).

¹⁰A multigraph is a graph that can have multiple edges between vertices.
¹¹We use the condition $N/B = \omega(M)$ rather than $N/B > M$ to gloss over the difference between the size of the data records to be permuted and the size of the representation of the edges of $G_\pi$.
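The following rough sketch (my own, not the thesis's implementation) illustrates the scheduling idea on a small scale: it builds the $D \times D$ edge-multiplicity matrix of $G_\pi$ and peels off one perfect matching per round, each matching giving the source/target disk pairs of a 1-permutable set. It assumes that block address $a$ resides on disk $a \bmod D$, as in the Vitter-Shriver layout, and it omits the bookkeeping that would identify the specific blocks moved in each round.

def schedule_block_permutation(pi, D):
    n_blocks = len(pi)                       # N/B block addresses
    mult = [[0] * D for _ in range(D)]       # mult[i][j] = #blocks going from disk i to disk j
    for src, dst in enumerate(pi):
        mult[src % D][dst % D] += 1

    rounds = []
    for _ in range(n_blocks // D):           # N/BD rounds in all
        match = [-1] * D                     # match[j] = source disk writing to target disk j

        def augment(i, seen):
            for j in range(D):
                if mult[i][j] > 0 and j not in seen:
                    seen.add(j)
                    if match[j] < 0 or augment(match[j], seen):
                        match[j] = i
                        return True
            return False

        for i in range(D):                   # a perfect matching exists because the
            augment(i, set())                # remaining multigraph stays regular
        for j, i in enumerate(match):
            mult[i][j] -= 1
        rounds.append([(i, j) for j, i in enumerate(match)])
    return rounds

# Example: N/B = 8 blocks on D = 4 disks, permuted by reversal.
print(schedule_block_permutation([7, 6, 5, 4, 3, 2, 1, 0], D=4))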

2.8 Lower bounds

In this section, we prove lower bounds on parallel I/Os for BPC and BMMC permutations. We show that any algorithm that performs a BMMC permutation with characteristic matrix A

requires $\Omega\!\left(\frac{N}{BD}\left(1 + \frac{\mathrm{rank}(A_{b..n-1,\,0..b-1})}{\lg(M/B)}\right)\right)$ parallel I/Os. We then use this lower bound to show that our BPC algorithm in Section 2.2 is asymptotically optimal. We also use this bound

to prove an $\Omega\!\left(\frac{N}{BD} \cdot \frac{\lg M - \mathrm{rank}(A_{0..\lg M-1,\,0..\lg M-1})}{\lg(M/B)}\right)$ bound for BMMC permutations. The additive term of $\Theta\!\left(\frac{N}{BD} H(N, M, B)\right)$ represents a gap between the lower and upper bounds. We will rely heavily on the technique used by Aggarwal and Vitter [AV88] for a lower bound

on I/Os in matrix transposition; their proof is based in turn on a method by Floyd [Flo72]. We prove the lower bound for the case in which $D = 1$, the general case following by dividing by $D$. We consider only I/Os that are simple. An input is simple if each record read is removed from the disk and moved into an empty location in RAM. An output is simple if the records are removed from the RAM and written to empty locations on the disk. When all I/Os are simple, exactly one copy of each record exists at any time during the execution of an algorithm. The following lemma, proven by Aggarwal and Vitter, allows us to consider only simple I/Os when proving lower bounds.

Lemma 2.19 For each computation that implements a permutation of records, there is a corresponding computation strategy involving only simple I/Os such that the total number of I/Os is no greater.

Potential function

The basic scheme of the proof uses a potential function argument. We call the time interval starting when the $q$th I/O occurs and ending just before the $(q+1)$st I/O time $q$. We define a potential function $\Phi$ so that $\Phi(q)$ is the potential at time $q$. We compute the initial and final potentials and bound the amount that the potential can increase in each I/O operation. The lower bound then follows. To be more precise, we start with some definitions. For $i = 0, 1, \ldots, N/B - 1$, we define the $i$th target group to be the set of records that belong in block $i$ according to the permutation $\pi$.

(Remember that D = 1.) We denote by gblock(i, k, q) the number of records in the ith target group that are in block k at time q, and gmem(i, q) denotes the number of records in the ith target group that are in RAM at time q. We define the continuous function

\[
f(x) = \begin{cases} x \lg x & \text{if } x > 0 , \\ 0 & \text{if } x = 0 , \end{cases}
\]
and we define togetherness functions

\[
G_{\mathrm{block}}(k, q) = \sum_{i=0}^{N/B-1} f(g_{\mathrm{block}}(i, k, q))
\]
for each block $k$ at time $q$ and

\[
G_{\mathrm{mem}}(q) = \sum_{i=0}^{N/B-1} f(g_{\mathrm{mem}}(i, q))
\]
for RAM at time $q$. Finally, we define the potential at time $q$, denoted $\Phi(q)$, as the sum of the togetherness functions:
\[
\Phi(q) = G_{\mathrm{mem}}(q) + \sum_{k=0}^{N/B-1} G_{\mathrm{block}}(k, q) . \qquad (2.28)
\]
Aggarwal and Vitter embed the following key lemma in their lower-bound argument.

Lemma 2.20 If $D = 1$, any algorithm that performs a permutation uses $\Omega\!\left(\frac{N}{B} + \frac{N \lg B - \Phi(0)}{B \lg(M/B)}\right)$ parallel I/Os, where the function $\Phi$ is defined in equation (2.28).
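For concreteness, here is a small illustrative sketch (my own, not from the thesis) of the potential function $\Phi$ of equation (2.28); `ram` is the list of records currently in RAM, `blocks` maps each block number to the records it holds, and `target_block` gives each record's destination block under the permutation.

from collections import Counter
from math import log2

def f(x):
    return x * log2(x) if x > 0 else 0.0

def togetherness(records, target_block):
    # Sum of f over the sizes of the target groups present in `records`.
    return sum(f(c) for c in Counter(target_block(r) for r in records).values())

def potential(ram, blocks, target_block):
    return togetherness(ram, target_block) + sum(
        togetherness(recs, target_block) for recs in blocks.values())

# Example with B = 4: records 0..7, record r is destined for block r // 4.
blocks = {0: [0, 4, 1, 5], 1: [2, 6, 3, 7]}
print(potential([], blocks, lambda r: r // 4))   # 8.0 = 4 * f(2)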

Domains

To prove the lower bounds, we shall require the following lemma. For a $p \times q$ matrix $A$ and a $p$-vector $y \in \mathcal{R}(A)$, we define the domain of $y$ as

\[
\mathrm{Dom}(A, y) = \{ x : Ax = y \} .
\]

That is, Dom(A, y) is the set of q-vectors x that map to y when multiplied by A.

Lemma 2.21 Let $A$ be a $p \times q$ matrix whose entries are drawn from $\{0, 1\}$, let $y$ be any $p$-vector in $\mathcal{R}(A)$, and let $r = \mathrm{rank}(A)$. Then $|\mathrm{Dom}(A, y)| = 2^{q-r}$.

Proof: Let $S$ index a maximal set of linearly independent columns of $A$, so that $S \subseteq \{0, 1, \ldots, q-1\}$, $|S| = r$, the columns of the submatrix $A_{0..p-1,S}$ are linearly independent, and for any column number $j \not\in S$, the column $A_{0..p-1,j}$ is linearly dependent on the columns of $A_{0..p-1,S}$. Let $S' = \{0, 1, \ldots, q-1\} - S$.

We claim that for any value $i \in \{0, 1, \ldots, 2^{q-r}-1\}$, there is a unique $q$-vector $x^{(i)}$ for which $x^{(i)}_{S'}$ is the binary representation of $i$ and $y = Ax^{(i)}$. Why? The columns of $A_{0..p-1,S}$ span $\mathcal{R}(A)$, which implies that for all $z \in \mathcal{R}(A)$, there is a unique $r$-vector $w$ such that $z = A_{0..p-1,S}\, w$. Letting $z = y \oplus A_{0..p-1,S'}\, x^{(i)}_{S'}$ and $x^{(i)}_S = w$ proves the claim.

Thus, we have shown that $|\mathrm{Dom}(A, y)| \ge 2^{q-r}$. If we had $|\mathrm{Dom}(A, y)| > 2^{q-r}$, then because $y$ is arbitrarily chosen from $\mathcal{R}(A)$, we would have that $\sum_{y' \in \mathcal{R}(A)} |\mathrm{Dom}(A, y')| > 2^q$. But this inequality contradicts there being only $2^q$ possible domain vectors. We conclude that $|\mathrm{Dom}(A, y)| = 2^{q-r}$.
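A brute-force check of Lemma 2.21 is easy to run for small matrices; the sketch below (illustrative only, not thesis code) picks a random 0/1 matrix, computes its rank over GF(2), and confirms that every vector in the range has exactly $2^{q-r}$ preimages.

import itertools
import numpy as np

def gf2_rank(A):
    # Gaussian elimination over GF(2).
    A = A.copy() % 2
    r = 0
    for col in range(A.shape[1]):
        pivot = next((i for i in range(r, A.shape[0]) if A[i, col]), None)
        if pivot is None:
            continue
        A[[r, pivot]] = A[[pivot, r]]
        for i in range(A.shape[0]):
            if i != r and A[i, col]:
                A[i] ^= A[r]
        r += 1
    return r

p, q = 4, 6
A = np.random.default_rng(1).integers(0, 2, (p, q))
r = gf2_rank(A)
domains = {}
for bits in itertools.product((0, 1), repeat=q):
    y = tuple(A @ np.array(bits) % 2)
    domains[y] = domains.get(y, 0) + 1
assert all(count == 2 ** (q - r) for count in domains.values())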

The key lemma

The following lemma is the basis for the lower bounds.

Lemma 2.22 Any algorithm that performs a BMMC permutation with characteristic matrix

$A$ requires $\Omega\!\left(\frac{N}{BD}\left(1 + \frac{\mathrm{rank}(A_{b..n-1,\,0..b-1})}{\lg(M/B)}\right)\right)$ parallel I/Os.

Proof: We prove the lower bound for the case in which $D = 1$, which implies that $d = 0$, the general case following by dividing by $D$. We assume that all I/Os are simple and transfer exactly $B$ records, some possibly empty. Since all records start on disk and I/Os are simple, RAM is initially empty. We need to compute the initial potential in order to apply Lemma 2.20. Let us view the binary representation of an $(n-b)$-bit block number $k$ as being $b$-indexed rather than 0-indexed. That is, we define

\[
\mathrm{bin}(k) = (k_b, k_{b+1}, \ldots, k_{n-1}) ,
\]
where $k_b$ is the least significant bit of the $(n-b)$-bit binary representation of $k$ and $k_{n-1}$ is the most significant bit. The initial potential depends on the number of records that start in the same source block and are in the same target group. Consider a record with source address $x = (x_0, x_1, \ldots, x_{n-1})$.

For a record with source address $w = (w_0, w_1, \ldots, w_{n-1})$ to be in the same source block as $x$, we must have that $w_{b..n-1} = x_{b..n-1}$. Also, if we let $y = Aw$ and $z = Ax$, then $x$ and $w$ are in the same target group if $y_{b..n-1} = z_{b..n-1}$, or

\[
A_{b..n-1,\,0..n-1}\, w_{0..n-1} = A_{b..n-1,\,0..n-1}\, x_{0..n-1} ,
\]
which is equivalent to

\[
A_{b..n-1,\,0..b-1}\, w_{0..b-1} = A_{b..n-1,\,0..b-1}\, x_{0..b-1} ,
\]

since $w_{b..n-1} = x_{b..n-1}$. Letting $r = \mathrm{rank}(A_{b..n-1,\,0..b-1})$, Lemma 2.21 says that there are $2^{b-r}$ distinct records $w$ in $\mathrm{Dom}(A_{b..n-1,\,0..b-1},\, A_{b..n-1,\,0..b-1}\, x_{0..b-1})$. Thus, there are $2^{b-r} = B/2^{\mathrm{rank}(A_{b..n-1,\,0..b-1})}$ records in $x$'s source block that are in $x$'s target group.

Now we compute $g_{\mathrm{block}}(i, k, 0)$ for all blocks $i$ and $k$. Block $k$ contains some record that maps to block $i$ if and only if there exists some source address $x = (x_0, x_1, \ldots, x_{n-1})$ such that $x_{b..n-1} = \mathrm{bin}(k)$ (i.e., $x$ is in source block $k$) and

\begin{align*}
\mathrm{bin}(i) &= A_{b..n-1,\,0..n-1}\, x_{0..n-1} \\
&= A_{b..n-1,\,0..b-1}\, x_{0..b-1} \oplus A_{b..n-1,\,b..n-1}\, x_{b..n-1} \\
&= A_{b..n-1,\,0..b-1}\, x_{0..b-1} \oplus A_{b..n-1,\,b..n-1}\, \mathrm{bin}(k) .
\end{align*}

For each block number $k$, Lemma 2.1 says that there are $2^{\mathrm{rank}(A_{b..n-1,\,0..b-1})}$ different values of

$A_{b..n-1,\,0..b-1}\, x_{0..b-1}$ as $x_{0..b-1}$ takes on all values in $\{0, 1, \ldots, B-1\}$. Since $k$ is fixed, there is only 1 value of $A_{b..n-1,\,b..n-1}\, \mathrm{bin}(k)$. Thus, there are $2^{\mathrm{rank}(A_{b..n-1,\,0..b-1})}$ different values of $\mathrm{bin}(i)$.

Now we can compute $\Phi(0)$. Since RAM is initially empty, $g_{\mathrm{mem}}(i, 0) = 0$ for all blocks $i$, which implies that $G_{\mathrm{mem}}(0) = 0$. Letting $r = \mathrm{rank}(A_{b..n-1,\,0..b-1})$, we have

\begin{align*}
\Phi(0) &= G_{\mathrm{mem}}(0) + \sum_{k=0}^{N/B-1} G_{\mathrm{block}}(k, 0) \\
&= 0 + \sum_{k=0}^{N/B-1} \sum_{i=0}^{N/B-1} f(g_{\mathrm{block}}(i, k, 0)) \\
&= \sum_{k=0}^{N/B-1} 2^r \cdot \frac{B}{2^r} \lg \frac{B}{2^r} \\
&= \frac{N}{B} \cdot B \lg \frac{B}{2^r} \\
&= N(\lg B - r) . \qquad (2.29)
\end{align*}

Combining Lemma 2.20 and equation (2.29), we get a lower bound of

\[
\Omega\!\left( \frac{N}{B} + \frac{N \lg B - N(\lg B - r)}{B \lg(M/B)} \right)
= \Omega\!\left( \frac{N}{B} \left( 1 + \frac{\mathrm{rank}(A_{b..n-1,\,0..b-1})}{\lg(M/B)} \right) \right)
\]
parallel I/Os. Dividing through by $D$ yields a bound of

\[
\Omega\!\left( \frac{N}{BD} \left( 1 + \frac{\mathrm{rank}(A_{b..n-1,\,0..b-1})}{\lg(M/B)} \right) \right) ,
\]
which completes the proof.

A lower bound for BPC permutations

Using Lemma 2.22, we can prove that the Perform-BPC procedure in Section 2.2, which uses $O\!\left(\frac{N}{BD}\left(1 + \frac{\rho(A)}{\lg(M/B)}\right)\right)$ parallel I/Os, is asymptotically optimal.

Theorem 2.23 Any algorithm that performs a BPC permutation with characteristic matrix $A$ requires $\Omega\!\left(\frac{N}{BD}\left(1 + \frac{\rho(A)}{\lg(M/B)}\right)\right)$ parallel I/Os.

Proof: From equation (2.5), we have $\rho_b(A) = \mathrm{rank}(A_{b..n-1,\,0..b-1})$. Because the characteristic matrix is a permutation matrix, the rank of any submatrix is equal to the number of 1s in the submatrix. Looking at submatrices as sets of positions, we have that

\[
A_{m..n-1,\,0..m-1} = \left( A_{b..n-1,\,0..b-1} - A_{b..m-1,\,0..b-1} \right) \cup A_{m..n-1,\,b..m-1} ,
\]
which implies that

\begin{align*}
\rho_m(A) &= \mathrm{rank}(A_{m..n-1,\,0..m-1}) \\
&\le \mathrm{rank}(A_{b..n-1,\,0..b-1}) + \mathrm{rank}(A_{m..n-1,\,b..m-1}) \\
&\le \rho_b(A) + (m - b) ,
\end{align*}
since $A_{m..n-1,\,b..m-1}$ contains $m - b$ columns and therefore has rank at most $m - b$. Thus,
\[
\frac{\rho_m(A)}{\lg(M/B)} \le \frac{\rho_b(A)}{\lg(M/B)} + 1 .
\]

Since $\rho(A) = \max(\rho_m(A), \rho_b(A))$, we have shown that $\frac{\rho_b(A)}{\lg(M/B)} = \Theta\!\left(\frac{\rho(A)}{\lg(M/B)}\right)$. Combined with Lemma 2.22, we get a lower bound of $\Omega\!\left(\frac{N}{BD}\left(1 + \frac{\rho(A)}{\lg(M/B)}\right)\right)$ parallel I/Os.

A lower bound for BMMC permutations

The following theorem gives a lower bound for BMMC permutations that matches one term in the upper bound proven in Section 2.6.

Theorem 2.24 Any algorithm that performs a BMMC permutation with characteristic matrix

$A$ requires $\Omega\!\left(\frac{N}{BD} \cdot \frac{\lg M - \mathrm{rank}(A_{0..\lg M-1,\,0..\lg M-1})}{\lg(M/B)}\right)$ parallel I/Os.

Proof: By Lemma 2.22, it suffices to show that

\[
\frac{\mathrm{rank}(A_{b..n-1,\,0..b-1})}{\lg(M/B)} \ge \frac{m - \mathrm{rank}(A_{0..m-1,\,0..m-1})}{\lg(M/B)} - 1 .
\]

For any subset of row indices S, any subset of column indices T , and any column index j, we have

\[
\mathrm{rank}(A_{S,\,T - \{j\}}) \ge \mathrm{rank}(A_{S,T}) - 1 , \qquad (2.30)
\]
since $\mathrm{rank}(A_{S,\,T - \{j\}}) = \mathrm{rank}(A_{S,T})$ if $A_{S,j}$ is linearly dependent on other columns in $A_{S,\,T - \{j\}}$ and $\mathrm{rank}(A_{S,\,T - \{j\}}) = \mathrm{rank}(A_{S,T}) - 1$ otherwise. Applying equation (2.30) once for each column $j = b, b+1, \ldots, m-1$, we get

\[
\mathrm{rank}(A_{m..n-1,\,0..b-1}) \ge \mathrm{rank}(A_{m..n-1,\,0..m-1}) - (m - b) . \qquad (2.31)
\]

Because $A$ is nonsingular, the submatrix $A_{0..n-1,\,0..m-1}$ contains a subset of $m$ linearly independent rows. The quantity $\mathrm{rank}(A_{m..n-1,\,0..m-1})$ is the cardinality of a maximal set of linearly independent rows in $A_{m..n-1,\,0..m-1}$; there are at least $m - \mathrm{rank}(A_{0..m-1,\,0..m-1})$ such rows. Thus,

\[
\mathrm{rank}(A_{m..n-1,\,0..m-1}) \ge m - \mathrm{rank}(A_{0..m-1,\,0..m-1}) . \qquad (2.32)
\]

We also have

\[
\mathrm{rank}(A_{b..n-1,\,0..b-1}) \ge \mathrm{rank}(A_{m..n-1,\,0..b-1}) \qquad (2.33)
\]
since appending rows to any submatrix cannot decrease the rank. Putting equations (2.31)–(2.33) together, we have

\begin{align*}
\frac{\mathrm{rank}(A_{b..n-1,\,0..b-1})}{m - b} &\ge \frac{\mathrm{rank}(A_{m..n-1,\,0..b-1})}{m - b} \\
&\ge \frac{\mathrm{rank}(A_{m..n-1,\,0..m-1}) - (m - b)}{m - b} \\
&\ge \frac{m - \mathrm{rank}(A_{0..m-1,\,0..m-1})}{m - b} - 1 ,
\end{align*}
which completes the proof.

We do not know how tight a bound the additive $\Theta\!\left(\frac{N}{BD} H(N, M, B)\right)$ term is. There are some BMMC permutations, such as $\Pi'$ in equation (2.22) and $\Pi''$ in equation (2.25), that require $\Omega\!\left(\frac{N}{BD} H(N, M, B)\right)$ parallel I/Os, but there are other BMMC permutations, such as BPC permutations with low cross-ranks, that can be performed with fewer. We'll have a little more to say on this topic in Section 2.10.

2.9 Performing BMMC and BPC permutations with other vector layouts

The algorithms in this chapter assume that the vectors they permute are ordered on the parallel disk system according to the Vitter-Shriver scheme shown in Figure 2.1. In practice, however, vectors might not be organized in this fashion. This section shows that when the difference between the actual vector organization and the Vitter-Shriver scheme is itself a BMMC permutation, we can perform BMMC and BPC permutations by running the algorithms in this chapter without changing the code. We compensate instead by altering the input characteristic matrix and complement vector. Before seeing how to alter the inputs in general, we examine a case in which we want to do so.

Converting banded layouts to the Vitter-Shriver scheme

Chapter 4 describes situations in which vector layouts other than the Vitter-Shriver scheme may occur. We summarize them here. In a banded layout, we divide a vector of length $N$ into bands of $\beta$ elements each. We restrict the band size $\beta$ to be a power of 2 times the number of processors. Figure 2.3 shows two examples of banded layout with $P = 8$ processors and a track size of $BD = 16$. In Figure 2.3(a), the band size is $\beta = 64$, and so each band occupies $\beta/BD = 4$ tracks. In Figure 2.3(b), the band size equals the track size ($\beta = BD = 16$), and so each band occupies one track. This case is precisely the Vitter-Shriver layout. Each row in Figure 2.3 contains one element per processor. Element indices vary most rapidly within each processor, then among processors within a band, and they vary least rapidly from band to band. Within each band, elements are in column-major order. The mapping from a banded layout with band size $\beta > BD$, such as in Figure 2.3(a), to the Vitter-Shriver layout with band size $BD$ is itself a BPC permutation. The complement vector

(a) Band size $\beta = 64$ (one band of eight rows):

P0 P1 P2 P3 P4 P5 P6 P7
 0  8 16 24 32 40 48 56
 1  9 17 25 33 41 49 57
 2 10 18 26 34 42 50 58
 3 11 19 27 35 43 51 59
 4 12 20 28 36 44 52 60
 5 13 21 29 37 45 53 61
 6 14 22 30 38 46 54 62
 7 15 23 31 39 47 55 63

(b) Band size $\beta = BD = 16$ (four bands of two rows each):

P0 P1 P2 P3 P4 P5 P6 P7
 0  2  4  6  8 10 12 14
 1  3  5  7  9 11 13 15
16 18 20 22 24 26 28 30
17 19 21 23 25 27 29 31
32 34 36 38 40 42 44 46
33 35 37 39 41 43 45 47
48 50 52 54 56 58 60 62
49 51 53 55 57 59 61 63

Figure 2.3: Two banded layouts with $P = 8$ processors and a track size of $BD = 16$. Track boundaries are drawn with heavy lines. (a) The band size is $\beta = 64$. (b) The band size is $\beta = BD = 16$.

is all 0s, and the bit permutation πβ>BD is given by

\[
\pi_{\beta>BD}(j) =
\begin{cases}
j & \text{if } 0 \le j \le \lg\frac{BD}{P} - 1 \text{ or } \lg\beta \le j , \\[1ex]
\lg\frac{BD}{P} + (j - \lg BD) \bmod \lg\frac{\beta P}{BD} & \text{if } \lg\frac{BD}{P} \le j \le \lg\beta - 1 .
\end{cases}
\qquad (2.34)
\]
Let us see why mapping (2.34) works. To start, observe that each record stays within its group of $\beta$ records. That is, records do not cross band boundaries. The only address bits that change, therefore, are the least significant $\lg\beta$. Hence, $\pi_{\beta>BD}(j) = j$ for $j \ge \lg\beta$.

We treat each $\beta$-record band as a matrix in which each entry actually contains several elements, and we transpose this matrix. Elements that are in the same track and processor travel together in the mapping. That is, each group of $BD/P$ elements (2 in Figure 2.3) travels together. Because the $BD/P$ elements stay in the same order relative to each other, the least significant $\lg(BD/P)$ address bits do not change, and so $\pi_{\beta>BD}(j) = j$ for $0 \le j \le \lg(BD/P) - 1$. We treat each $\beta$-record band as a $\beta/BD \times P$ matrix, with each matrix entry consisting of $BD/P$ records. We transpose this matrix. In the example of Figure 2.3, we perform the following transpose of a $4 \times 8$ matrix, where each entry consists of two records:

(0,1)  (8,9)   (16,17) (24,25) (32,33) (40,41) (48,49) (56,57)
(2,3)  (10,11) (18,19) (26,27) (34,35) (42,43) (50,51) (58,59)
(4,5)  (12,13) (20,21) (28,29) (36,37) (44,45) (52,53) (60,61)
(6,7)  (14,15) (22,23) (30,31) (38,39) (46,47) (54,55) (62,63)

⇓

(0,1)   (2,3)   (4,5)   (6,7)
(8,9)   (10,11) (12,13) (14,15)
(16,17) (18,19) (20,21) (22,23)
(24,25) (26,27) (28,29) (30,31)
(32,33) (34,35) (36,37) (38,39)
(40,41) (42,43) (44,45) (46,47)
(48,49) (50,51) (52,53) (54,55)
(56,57) (58,59) (60,61) (62,63)

Each consecutive group of $BD/P$ rows (2 in this example) of this transposed matrix comprises a track of the result. The transpose is of a $\beta/BD \times P$ matrix, affecting address bits $\lg(BD/P)$ through $\lg\beta - 1$. We adapt equation (2.2), but with $P$ columns, $\beta P/BD$ elements altogether, and adjusting the first bit position to $\lg(BD/P)$ instead of 0:

\begin{align}
\pi_{\beta>BD}(j) &= \lg\frac{BD}{P} + \left( j - \lg\frac{BD}{P} - \lg P \right) \bmod \lg\frac{\beta P}{BD} \tag{2.35} \\
&= \lg\frac{BD}{P} + (j - \lg BD) \bmod \lg\frac{\beta P}{BD} \tag{2.36}
\end{align}
for $\lg(BD/P) \le j \le \lg\beta - 1$.

β β BDP π (j) = lg + j lg + lg P mod lg β

Performing BMMC and BPC permutations by adjusting the characteristic matrix and complement vector

In the remainder of this section, we solve the following problem. We are given a mapping $f$ to perform; this mapping is a BMMC or BPC permutation. The vector we wish to permute is laid out according to some scheme other than the Vitter-Shriver one. A BMMC mapping $g$ maps this actual layout to the Vitter-Shriver layout. How do we perform the mapping $f$ under these conditions? In effect, we wish to simulate the BMMC mapping $f$ by another mapping $h$ that takes the actual layout into account. We shall see in a moment that $h$ is a BMMC permutation if $f$ and $g$ are and, in addition, $h$ is a BPC permutation if $f$ and $g$ are. We describe the mapping $h$ in three steps. First, we permute the vector according to mapping $g$ to convert the vector from the actual layout to Vitter-Shriver order. Second, we perform mapping $f$. Third, we permute the vector according to $g^{-1}$ to restore the actual layout order. Thus,

\[
h(x) = g^{-1}(f(g(x))) . \qquad (2.38)
\]

To determine the nature of the mapping $h$, we define the characteristic matrices and complement vectors for the BMMC mappings $f$ and $g$ by

\begin{align*}
f(x) &= A_f x \oplus c_f , \\
g(x) &= A_g x \oplus c_g .
\end{align*}

Note that because $g$ is a BMMC permutation, the matrix $A_g^{-1}$ exists. Since $g(x) = A_g x \oplus c_g$, we have that $g^{-1}(x) = A_g^{-1}(x \oplus c_g)$. We are now ready to expand equation (2.38):

\begin{align*}
h(x) &= g^{-1}(f(g(x))) \\
&= A_g^{-1}((A_f (A_g x \oplus c_g) \oplus c_f) \oplus c_g) \\
&= A_g^{-1}(A_f A_g x \oplus A_f c_g \oplus c_f \oplus c_g) \\
&= (A_g^{-1} A_f A_g) x \oplus A_g^{-1}((A_f \oplus I) c_g \oplus c_f) .
\end{align*}

We conclude that the mapping $h$ is a BMMC permutation whose characteristic matrix $A_h$ is the product $A_g^{-1} A_f A_g$ and whose complement vector $c_h$ is $A_g^{-1}((A_f \oplus I) c_g \oplus c_f)$. Moreover, if $f$ and $g$ are BPC permutations, then each of the matrices $A_g^{-1}$, $A_f$, and $A_g$ is a permutation matrix. The product $A_h$ is then also a permutation matrix. In this case, the mapping $h$ is also a BPC permutation. Thus, we can perform a BMMC or BPC permutation with a different layout order without changing the algorithm.
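A hedged numpy sketch of this adjustment (my own, not VM-DP code) builds $A_h$ and $c_h$ over GF(2); it assumes $A_g$ is nonsingular, as it must be for a BMMC mapping $g$.

import numpy as np

def gf2_inv(A):
    # Gauss-Jordan inverse over GF(2); assumes A is nonsingular.
    n = len(A)
    M = np.concatenate([A % 2, np.eye(n, dtype=int)], axis=1)
    for col in range(n):
        pivot = next(i for i in range(col, n) if M[i, col])
        M[[col, pivot]] = M[[pivot, col]]
        for i in range(n):
            if i != col and M[i, col]:
                M[i] ^= M[col]
    return M[:, n:]

def adjust_for_layout(A_f, c_f, A_g, c_g):
    # A_h = A_g^{-1} A_f A_g and c_h = A_g^{-1}((A_f xor I) c_g xor c_f) over GF(2).
    A_g_inv = gf2_inv(A_g)
    A_h = A_g_inv @ A_f @ A_g % 2
    c_h = A_g_inv @ (((A_f ^ np.eye(len(A_f), dtype=int)) @ c_g % 2) ^ c_f) % 2
    return A_h, c_h

# Tiny example with n = 3 address bits: f complements bit 0, g swaps bits 0 and 2.
A_f, c_f = np.eye(3, dtype=int), np.array([1, 0, 0])
A_g, c_g = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]]), np.zeros(3, dtype=int)
print(adjust_for_layout(A_f, c_f, A_g, c_g))   # -> (identity matrix, [0, 0, 1])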

2.10 Conclusions

In this section, we recap the results in this chapter and suggest related research topics. We have shown that many permutations can be performed on parallel disk systems faster than sorting. In particular, we generalized the matrix-transpose algorithm of Vitter and Shriver to BPC permutations and BMMC permutations. These are broad classes that include many commonly used permutations: matrix transpose (or cyclic rotation of address bits), bit-reversal, hypercube, vector-reversal, Gray codes, inverse Gray codes, and permutations that help reduce data-access conflicts. Along the way, we came up with restricted permutations (MRC, block BPC, and block BMMC) that can be performed in just one pass over the records. We also showed how to perform arbitrary block permutations with off-line scheduling. We proved lower bounds for BPC and BMMC permutations, showing that our algorithm for BPC permutations is asymptotically optimal. Finally, we showed how to perform BMMC and BPC permutations without changing the algorithm when the vector to be permuted violates the Vitter-Shriver layout scheme. The VM-DP system includes an implementation of the BPC algorithm. As we shall see in Chapter 3, its actual performance matches the performance predicted in this chapter exactly.

Future research

We have seen that the $\Omega\!\left(\min\!\left(\frac{N}{D},\, \frac{N}{BD}\frac{\lg(N/B)}{\lg(M/B)}\right)\right)$ lower bound for general permutations fails to apply for a fairly broad class of permutations. We shall see even more classes that we can perform quickly in Chapter 3. How broadly applicable then is this lower bound? Can we succinctly characterize classes of permutations for which it does not apply?

Is the BPC algorithm in this chapter optimal? It is asymptotically optimal, but what about the constant factors? They are small, but are they the best possible? The lower bounds proven in Section 2.8 are only asymptotic lower bounds. Can we determine the constants? The key to doing so, at least using the framework of Section 2.8, is to replace the asymptotic notation in Lemma 2.20 by an expression with exact constants. How can we adapt the Aggarwal-Vitter [AV88] argument that establishes this lemma to use constants rather than asymptotics?

There is a gap in the known asymptotics for BMMC permutations. Where in this gap is their true complexity? It depends on the exact nature of the submatrix denoted by $\gamma\alpha^{-1}$ in equation (2.21). As we observed in Section 2.6, when $\gamma\alpha^{-1} = 0$, the cost of performing $\hat{A}$ includes only the cost of performing the MRC permutation $W$, and so it is constant. It seems that the fewer nonzero rows or columns $\gamma\alpha^{-1}$ contains, the faster it can be performed. Can we precisely characterize this property?

Chapter 3

Performing General and Special Permutations

We saw in Chapter 2 that if we carefully choreograph the disk I/O in virtual-memory systems for data-parallel computing, we can reduce the cost of performing bit-defined permutations. This chapter continues our examination of the differences between performing general and special permutations. This chapter takes a more practical and slightly less theoretical approach than Chapter 2. It presents a simple, though asymptotically suboptimal, algorithm to perform general permutations. This algorithm uses $\Theta\!\left(\frac{N}{BD}\frac{\lg N}{\lg(M/BD)}\right)$ parallel I/Os by performing external radix sort on target addresses. Then this chapter explores several classes of special permutations that we can perform much faster than general permutations:

• Monotonic routes, which we can perform with $\lceil N_s/BD \rceil + 2\lceil N_t/BD \rceil$ parallel I/Os for a source vector of $N_s$ elements and a target vector of $N_t$ elements.

• $k$-monotonic routes, which are the superposition of $k$ monotonic routes and can be performed with $\lceil N_s/BD \rceil + 2k\lceil N_t/BD \rceil$ parallel I/Os, subject to the restriction that $(k+1)BD \le M$.

• Mesh permutations, which are a special case of monotonic routes and can be performed with $3\lceil N/BD \rceil$ parallel I/Os.

• Torus permutations in $d$ dimensions, which are a special case of $2^d$-monotonic routes and can be performed with $(2^{d+1} + 1)\lceil N/BD \rceil$ parallel I/Os if $(2^d + 1)BD \le M$.

• BMMC and BPC permutations, which we saw in Chapter 2.

• General matrix transpose of any $R \times S$ matrix, which we can perform with less than $9\left\lceil \frac{RS}{BD} \frac{\lg \min(R, S, B, RS/B)}{\lg(M/B)} \right\rceil + \frac{53}{2}\frac{RS}{BD} + 11$ parallel I/Os, and often fewer.

These classes arise frequently. We show, for each of these classes, not only how to perform them quickly, but also how to detect them at run-time given a vector of target addresses using barely more than N/BD parallel I/Os. We then focus on the question of how to invoke special permutations. We shall argue that although we can detect these special permutations quickly at run time, it is better to invoke them by specifying them in the source code. We support this argument with empirical data for BPC permutations.

Importance of permuting in a data-parallel system with virtual memory

Just as communication costs dominate computations on parallel computers, permuting costs dominate data-parallel computing with virtual memory. Common operations, such as elementwise operations, scans (parallel prefix), and reduces, can be performed in one pass through the data. As we saw in Section 1.4, however, permuting in general is much more expensive, requiring more than a constant number of passes. Suppose we examine a data-parallel program running on a parallel machine at the level of Vcode or Paris on the CM-2 [Thi89]. Although it may be the case that only a few operations permute data, the cost of the entire computation may very well be dominated by the cost of performing the few permutations. When the data must reside on disk, the relative cost of permuting is even greater. For any hope of reasonable performance, we must perform permutations as fast as possible.

Outline of this chapter

The remainder of this chapter is organized as follows. Section 3.1 presents how the VM-DP system performs general permutations, sorting target addresses by an external radix sort. Section 3.2 is an overview of several special permutations that Sections 3.3–3.5 show how to perform faster than general permutations and also detect quickly at run time. We argue in Section 3.6 that it is best to invoke these permutations at the source-code level. The argument includes a case study of how the VM-DP system performs BPC permutations, with empirical statistics. Finally, Section 3.7 contains some concluding remarks and discusses future work.

3.1 Performing general permutations

The VM-DP system performs general permutations by sorting records according to their associated target addresses. The asymptotically optimal algorithms of Vitter and Shriver [VS90a, VS90b] and Nodine and Vitter [NV90, NV91, NV92] sort $N$ records using $\Theta\!\left(\frac{N}{BD}\frac{\lg(N/B)}{\lg(M/B)}\right)$ parallel I/Os. These algorithms, however, are rather complicated to implement. We instead opted for simplicity, sorting target addresses with an external radix sort. This method uses $\Theta\!\left(\frac{N}{BD}\frac{\lg N}{\lg(M/BD)}\right)$ parallel I/Os.

Radix sort is an old sorting algorithm. See [CLR90] for a detailed description. It works as follows. The $N$ keys to be sorted are assumed to be $k$-bit integers. When the keys are indices of an $N$-element vector, $k = \lceil\lg N\rceil$. We form $\beta$ buckets, where $\beta$ is an integer power of 2. We then make $\lceil k/\lg\beta\rceil$ passes over the keys, placing each key into a bucket according to the value of $\lg\beta$ of its bits. In particular, in the $j$th pass (starting from 0), a record with key $x$ is placed into bucket number $\lfloor x/\beta^j \rfloor \bmod \beta$. This scheme sorts from the least significant bits first. As long as each pass is stable (if key $x$ precedes key $y$ in a bucket after the $(j+1)$st pass, then $x$ preceded $y$ in the total order after the $j$th pass), this method correctly sorts the keys in $\lceil k/\lg\beta\rceil$ passes.

External radix sort

The VM-DP system uses an external version of this algorithm that performs only striped I/O. Each key is a target address and has some associated data. Although we move each key and its data together, they are elements of separate vectors. For the purpose of this exposition, we assume that keys are 4-byte integers, as are the data elements. A track frame consists of $BD$ records in RAM, and each bucket is two track frames, one for the data and one for the keys, occupying $2BD$ records in RAM. Although it seems that we ought to be able to set the number of buckets $\beta$ to $M/2BD$, we cannot. We must instead reserve an additional bucket's worth of track frames


as input buffers for keys and data. In order to keep everything a power of 2, therefore, we set $\beta = M/4BD$. Each pass processes $\lg\beta = \lg(M/4BD)$ bits of the keys.

Initially, all $N$ records are in one bucket. The $j$th pass, for $j = 0, 1, \ldots, \lceil(\lg N)/(\lg\beta)\rceil - 1$, processes $\beta$ input buckets, which are the output buckets of the $(j-1)$st pass, in order. That is, it processes input bucket 0 from beginning to end, then input bucket 1, and so on up to input bucket $\beta - 1$. (The pass for $j = 0$ processes the lone initial bucket.) All $\beta$ output buckets are empty at the beginning of each pass. Input buckets are read into the input buffers a track at a time; we read a track of $BD$ keys and a track of their associated $BD$ data elements. As an optimization, we can check whether all keys are sorted after each pass, thereby avoiding unnecessary passes when the sorting completes before all $\lceil\lg N\rceil$ bits are processed.

After reading each track, we distribute its records to the $\beta$ buckets according to the appropriate bits of the keys. As Figure 3.1 shows, we loop through buckets $i = 0, 1, \ldots, \beta - 1$, finding all the records in the input buffer with key $x$ such that $\lfloor x/\beta^j \rfloor \bmod \beta = i$. That is, we find the keys for which the $j$th group of $\lg\beta$ bits, starting from the least significant position, is the binary representation of $i$. We perform an internal (in-RAM) route of these keys and their data into the next available positions in bucket $i$.¹ If this route fills up bucket $i$ in RAM, we write it out to disk, empty it in RAM, and continue on. All writes are to a different set of tracks from the tracks containing the input buckets. We call the tracks containing the input buckets the input portion of the parallel disk system, and the output portion contains the output tracks. Since we need the output buckets of a pass only as the input buckets to the next pass, we swap the input and output portions from pass to pass.

Figure 3.1: The state of RAM (a) before and (b) after distributing the records of an input buffer to $\beta = 8$ buckets. Light shading represents records in buckets before distribution, and dark shading represents the new records to be distributed. Buckets 2 and 5 have only new records, and fewer of them, afterward because they were filled during the distribution, written out to disk, emptied in RAM, and then more records were routed to them.

Having read in and processed all $N$ records in the input buckets, we are left with many tracks having been written out to disk and up to $\beta$ buckets that are partially full in RAM. We write these partially full buckets out to disk, and we are then ready for the next pass.

We must perform some special processing before the first pass and after the last pass. In order to know where on disk to write each track of each output bucket, we need to know how much disk space is needed to hold each bucket. Moreover, we need to know this information in each pass. Before the first pass (pass $j = 0$), therefore, we read all the keys, creating a census of how many records will end up in each bucket in each pass. With one read pass through the records, we create a $\lceil(\lg N)/(\lg\beta)\rceil \times \beta$ array of this census information, which we use to allocate disk space. After the last pass, we have $\beta$ buckets on disk, but the last track of many of these buckets may be only partially full.² Therefore, we perform one more pass through the buckets to compact the keys and data.

¹Because we require radix sort to be stable, this route must be monotonic (see Section 3.3).
²If $N$ is a power of 2 and the keys are the set $\{0, 1, \ldots, N-1\}$, all buckets will be full after each pass. The routines that call external radix sort in the VM-DP system, however, cannot assume that $N$ is a power of 2, that all keys are distinct, or even that all values between 0 and $N-1$ appear in the set of keys.
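The pass structure is easy to model in memory. The sketch below (my own simplification, not the VM-DP implementation, and with no disk I/O) performs $\lceil k/\lg\beta\rceil$ stable passes, each distributing (key, data) records into $\beta$ buckets by $\lg\beta$ key bits.

def radix_sort_passes(records, key_bits, beta):
    lg_beta = beta.bit_length() - 1            # beta is a power of 2
    for j in range((key_bits + lg_beta - 1) // lg_beta):
        buckets = [[] for _ in range(beta)]
        for key, data in records:              # stable: buckets preserve input order
            buckets[(key >> (j * lg_beta)) & (beta - 1)].append((key, data))
        records = [rec for bucket in buckets for rec in bucket]
    return records

# Example: 16 records with keys 15..0, beta = 4 buckets, 4-bit keys.
print(radix_sort_passes([(k, chr(ord('a') + k)) for k in reversed(range(16))],
                        key_bits=4, beta=4))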

Analysis of external radix sort

We now analyze the asymptotic I/O behavior of external radix sort. Although it is not difficult to determine the constant factors in general, it is somewhat tedious and yields little additional insight. In Section 3.6, we determine the constant factors for the case in which the keys are the set $\{0, 1, \ldots, N-1\}$ and $N$ is a power of 2.

External radix sort uses $\Theta\!\left(\frac{N}{BD}\frac{\lg N}{\lg(M/BD)}\right)$ parallel I/Os. Each bucket may have one partially full track, so the number of parallel reads for keys and for data per pass is at most $\lceil N/BD \rceil + \beta$

each. The same holds for the number of parallel writes. Thus, there are $\Theta(N/BD + \beta)$ parallel I/Os per pass. Presumably, we invoke external radix sort only if $N$ is large enough that the keys and data do not fit in RAM. Let us assume therefore that $2N > M$, which implies that $\beta = M/4BD < N/2BD$ and thus $\Theta(N/BD + \beta) = \Theta(N/BD)$. Thus, there are $\Theta\!\left(\frac{N}{BD}\frac{\lg N}{\lg(M/BD)}\right)$ parallel I/Os in all $\left\lceil\frac{\lg N}{\lg(M/4BD)}\right\rceil$ passes together. The remaining parallel I/Os number only $\Theta(N/BD)$. There are $\lceil N/BD \rceil$ parallel reads in the census count, and compaction takes $\Theta(N/BD)$ parallel I/Os. The total number of I/Os used by external radix sort is therefore $\Theta\!\left(\frac{N}{BD}\frac{\lg N}{\lg(M/BD)}\right)$.

Asymptotically, the external radix sort bound is worse than the optimal I/O bound because it has a numerator of $\lg N$ rather than $\lg(N/B)$ and a denominator of $\lg(M/BD)$ rather than $\lg(M/B)$. External radix sort is much simpler to implement than the optimal sorting algorithms, however. Moreover, it uses only striped I/O, unlike the optimal algorithms, which use independent I/O.

3.2 Special permutations

In the three following sections, we explore some classes of permutations that can be performed faster than general permutations. In particular, we examine monotonic routes, mesh and torus permutations, BMMC and BPC permutations, and general matrix transpose. In addition to being performed quickly, these permutations have two additional qualities: they occur frequently and we can detect them efficiently at run time. We will see how to both perform and detect these classes. Each permutation is specified by a source vector, a target vector, and a mapping from indices in the source vector to indices in the target vector. Two of them, monotonic routes and mesh permutations, are partial permutations rather than full permutations. In a partial permutation, the mapping does not necessarily cover all indices in the source vector in its domain, nor does it necessarily cover all indices in the target vector in its range. Many of the run-time detection schemes take as input a vector of $N$ target addresses and

1. Decide which class of special permutation we are trying to detect. The detection schemes vary among the special permutations.

2. Read target addresses for a small number of individual elements. From these target addresses, hypothesize the particular parameters of the permutation.

3. Read target addresses for all elements in order to verify that the permutation is indeed described by the hypothesized parameters. We can terminate this stage early, thus saving parallel I/Os, if any source-target address pair fails the verification test. We must check all $N$ elements, however, if all pairs pass the test. Although checking all $N$ elements entails $\lceil N/BD \rceil$ parallel I/Os, we more than compensate for this cost by running the appropriate special permutation code rather than the general permutation routine.

To minimize the number of passes through the target-address vector, we can perform the verification step simultaneously for all the permutation classes under consideration.

3.3 Monotonic routes, mesh permutations, and torus permutations

In this section, we see how to perform and detect monotonic routes and a generalization called $k$-monotonic routes. For a source vector of $N_s$ elements and a target vector of $N_t$ elements, we can perform a monotonic route using $\lceil N_s/BD \rceil + 2\lceil N_t/BD \rceil$ parallel I/Os, and we can perform a $k$-monotonic route using $\lceil N_s/BD \rceil + 2k\lceil N_t/BD \rceil$ parallel I/Os. We shall see how to detect monotonic and $k$-monotonic routes in one pass over the mapping. We then see how mesh permutations are special cases of monotonic routes and $d$-dimensional torus permutations are special cases of $2^d$-monotonic routes. Therefore, we can detect mesh and torus permutations in one pass, we can perform mesh permutations using $3\lceil N/BD \rceil$ parallel I/Os, and we can perform torus permutations using $(2^{d+1} + 1)\lceil N/BD \rceil$ parallel I/Os.


Figure 3.2: (a) A monotonic route, with each element mapping indicated by an arrow. No arrows cross in a monotonic route. Heavy lines indicate track boundaries. We can perform any monotonic route by making just one pass over the source and target vectors. (b) A k-way split operation is an example of a k-monotonic route, shown here for k = 3. The three component monotonic routes are drawn with solid, dashed, and dotted lines, and arrowheads are omitted.

Monotonic routes

A monotonic route is a partial permutation in which if elements $i$ and $j$ are mapped to positions $i'$ and $j'$, respectively, then

\[
i < j \quad \text{if and only if} \quad i' < j' . \qquad (3.1)
\]

If we view each individual mapping in a monotonic route as one of the arrows in Figure 3.2(a), then no arrows cross. Because elements that are routed appear in the same relative order in the source and target vectors, we can perform a monotonic route in only one pass over the source and target vectors. We perform the individual mappings (arrows) in order, reading and writing tracks of the source and target vectors as necessary. Since no arrows cross, tracks are read and written in order. We read each track of the source vector once, and we read and write each track of the target vector once. (We have to read target-vector tracks to avoid overwriting data in positions that are not routed to.) Thus, we can perform a monotonic route with a source vector of length $N_s$ and a target vector of length $N_t$ using at most $\lceil N_s/BD \rceil + 2\lceil N_t/BD \rceil$ parallel I/Os.
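The one-pass structure is easy to see in a small in-memory model. The sketch below (my own illustration, not VM-DP code) processes the mappings in order and counts how many source tracks are read and how many target tracks are read and written; the counts never exceed the bound above.

def monotonic_route(source, target, mapping, track_size):
    """mapping: list of (src_index, dst_index) pairs, increasing in both fields."""
    reads = writes = 0
    cur_src = cur_dst = None
    for s, t in mapping:
        if s // track_size != cur_src:          # fetch the next source track
            cur_src, reads = s // track_size, reads + 1
        if t // track_size != cur_dst:          # flush the old target track, fetch the next
            if cur_dst is not None:
                writes += 1
            cur_dst, reads = t // track_size, reads + 1
        target[t] = source[s]
    if cur_dst is not None:
        writes += 1
    return reads, writes

src, dst = list(range(8)), [None] * 8
print(monotonic_route(src, dst, [(0, 1), (2, 3), (5, 6)], track_size=2), dst)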

k-monotonic routes

A $k$-monotonic route is the superposition of $k$ monotonic routes. The source and target vectors of the $k$ routes are the same, but the domains of the routes are disjoint, as are their ranges. As Figure 3.2(b) shows, a $k$-monotonic route can be used to implement a $k$-way split operation. Here, each element is assigned to one of $k$ partitions, and the vector is permuted to place all elements in the same partition together.

We can perform a $k$-monotonic route as $k$ individual (1-)monotonic routes, using a total of $k\lceil N_s/BD \rceil + 2k\lceil N_t/BD \rceil$ parallel I/Os, but in fact we can do somewhat better. Observe that if there is enough RAM to hold one track of the source vector plus $k$ tracks of the target vector (one for each component monotonic route), we can arrange for one read pass to serve all $k$ of the component monotonic routes. We have to take care to ensure that each target track appears in RAM only once at a time, and we read and write each target track up to $k$ times. As long as $(k+1)BD \le M$, therefore, we can perform a $k$-monotonic route with at most $\lceil N_s/BD \rceil + 2k\lceil N_t/BD \rceil$ parallel I/Os.

For a $k$-way split operation, as in Figure 3.2(b), we can do even better. Each target track that contains at most one of the split partitions needs to be read and written only once. Target tracks that contain portions of more than one partition need to be read and written more than once, but there are at most $k - 1$ such reads and writes. We can perform a $k$-way split, therefore, with at most $\lceil N_s/BD \rceil + 2\lceil N_t/BD \rceil + 2(k-1)$ parallel I/Os. When $k = 1$, this bound is the same as the one above for monotonic routes.

Detecting monotonic and k-monotonic routes

Monotonic routes are the simplest of all permutations or partial permutations to detect. We simply verify that condition (3.1) holds for all source-target pairs.

Detecting $k$-monotonic routes, for $k > 1$, is harder. Trivially, we can let $k$ be the size of the domain of the mapping, but the domain size is often too large for this value of $k$ to be useful in performing the route. To determine a more reasonable value of $k$, we adapt Pinter's method [Pin82, Section V.2.2] for sequentially determining the minimum number of layers needed for river routing two-terminal nets in a channel. We consider only the domain of the mapping, i.e., indices in the source vector that map to the target vector. For the $i$th such index in the source vector, let $\pi(i)$ be the position it maps to in the target vector. We will assign to each mapping $i$ the number $R(i)$ of the monotonic route it belongs to. At each point in the algorithm, we maintain the $k$ component routes formed so far. For the $j$th such monotonic route, we define $\pi_{\max}(j) = \max\{\pi(i) : R(i) = j\}$. In other words, $\pi_{\max}(j)$ is the highest target address in the $j$th component route. When we process a mapping of $i$ to $\pi(i)$, we assign $i$ to the component route $j$ for which $\pi_{\max}(j) < \pi(i)$ (so that the component route, extended by this mapping, remains monotonic) and $\pi_{\max}(j)$ is maximum over all such routes. If no existing component routes can be extended by the mapping, we create a new one and increment $k$. The number $k$ of component routes at the end is the minimum possible.

Pinter's method requires only one pass over the target-address vector, so it is efficient in terms of disk-access cost. We omit discussion of the processing cost, since we are more concerned with I/O. Pinter observed, however, that we can use binary search to find the appropriate component route in $O(\lg k)$ time sequentially for each individual mapping.
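A compact sketch of this greedy (my own illustration, not thesis code) keeps only the $\pi_{\max}$ value of each component route in a sorted list, so that the binary search mentioned above finds the route to extend in $O(\lg k)$ time. `targets` lists $\pi(i)$ for the mapped source indices in increasing source order.

import bisect

def min_monotonic_routes(targets):
    tops = []                                  # pi_max of each route, kept ascending
    for t in targets:
        p = bisect.bisect_left(tops, t)        # routes with pi_max < t are tops[:p]
        if p == 0:
            bisect.insort(tops, t)             # no route can be extended: start a new one
        else:
            tops[p - 1] = t                    # extend the route with largest pi_max < t
    return len(tops)

print(min_monotonic_routes([3, 0, 4, 1, 5, 2]))   # -> 2, e.g. routes (3,4,5) and (0,1,2)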

Mesh and torus permutations

Many applications use data that is organized into multidimensional grids, or meshes. A class of permutation commonly performed in d-dimensional meshes adds an offset o = (o1, o2, . . . , od) to the element originally in position p = (p1, p2, . . . , pd), mapping it to position

mesh(p, o) = (p1 + o1, p2 + o2, . . . , pd + od) .

When the number of dimensions is $d = 1$, this type of permutation is a shift operation. Some people think of mesh permutations exclusively in terms of the offset $o$ as a unit vector; the above definition generalizes this view.

There are two common ways to handle boundary conditions. Let the dimensions of the mesh be $m = (m_1, m_2, \ldots, m_d)$, with positions in dimension $i$ indexed from 0 to $m_i - 1$, so that $-m_i < o_i < m_i$ for $i = 1, 2, \ldots, d$. One choice is to not map elements that would be mapped across a boundary. That is, map only elements $p$ for which $0 \le p_i + o_i < m_i$. In this case, only $(m_1 - |o_1|)(m_2 - |o_2|) \cdots (m_d - |o_d|)$ elements are actually mapped, and the mapping does not cover all indices in the source or target vectors. We call this type of partial permutation a mesh permutation. The other common choice is a full permutation called a torus permutation, in which we wrap around at the boundaries:

torus(p, o, m) = ((p1 + o1) mod m1, (p2 + o2) mod m2, . . . , (pd + od) mod md) .

We will show that mesh permutations are monotonic routes and that torus permutations are $2^d$-monotonic routes. First, however, we need to define the mapping from grid locations to positions in the underlying one-dimensional vector. We assume that we store a $d$-dimensional grid in row-major order, with 0-origin indexing in each dimension. Under these assumptions, we define the indexing function $\iota$ mapping a grid position $p = (p_1, p_2, \ldots, p_d)$ to its corresponding index in row-major order:
\[
\iota(p, m) = \sum_{i=1}^{d} \left( \prod_{j=i+1}^{d} m_j \right) p_i . \qquad (3.2)
\]
For a 2-dimensional mesh, for example, equation (3.2) reduces to the familiar form $\iota(p, m) = m_2 p_1 + p_2$, where $m_2$ is the number of columns.

Mesh permutations as monotonic routes

We now show that mesh permutations are monotonic routes. In fact, as the following lemma shows, mesh permutations are a special type of monotonic route in which the difference between the positions in the row-major order of the source and target addresses is the same for each element permuted.

Lemma 3.1 Let $p = (p_1, p_2, \ldots, p_d)$ be any grid location mapped by a mesh permutation with offset $o = (o_1, o_2, \ldots, o_d)$ on a $d$-dimensional grid with dimensions $m = (m_1, m_2, \ldots, m_d)$. Then $\iota(\mathrm{mesh}(p, o), m) - \iota(p, m) = \iota(o, m)$.

Proof: We have

\begin{align*}
\iota(\mathrm{mesh}(p, o), m) - \iota(p, m)
&= \sum_{i=1}^{d} \left( \prod_{j=i+1}^{d} m_j \right) (p_i + o_i) - \iota(p, m) \\
&= \sum_{i=1}^{d} \left( \prod_{j=i+1}^{d} m_j \right) p_i + \sum_{i=1}^{d} \left( \prod_{j=i+1}^{d} m_j \right) o_i - \iota(p, m) \\
&= \iota(p, m) + \iota(o, m) - \iota(p, m) \\
&= \iota(o, m) ,
\end{align*}
which completes the proof.

Because mesh permutations are monotonic routes, we can perform them by reading each track at most once and writing each track at most once. The total number of parallel I/Os is thus at most $3\lceil N/BD \rceil$, where $N = m_1 m_2 \cdots m_d$ is the total number of elements in the grid.
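A small check of equation (3.2) and Lemma 3.1 (illustrative only, not from the thesis): a mesh offset shifts every mapped element's row-major index by the same amount, $\iota(o, m)$.

from itertools import product

def iota(p, m):
    idx = 0
    for p_i, m_i in zip(p, m):
        idx = idx * m_i + p_i          # Horner form of equation (3.2)
    return idx

def check_lemma_3_1(m, o):
    for p in product(*(range(mi) for mi in m)):
        q = tuple(p_i + o_i for p_i, o_i in zip(p, o))
        if all(0 <= q_i < m_i for q_i, m_i in zip(q, m)):   # p is mapped
            assert iota(q, m) - iota(p, m) == iota(o, m)

check_lemma_3_1(m=(4, 5, 3), o=(1, -2, 1))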

Torus permutations as d-monotonic routes

We now show that torus permutations are not monotonic routes, but they are $2^d$-monotonic routes. They are not monotonic routes because of wraparound. Consider a 1-dimensional torus, for example, with dimension $m$. In a 1-dimensional torus, $\iota(p, m) = p$ for all $p = 0, 1, \ldots, m-1$. Consider the source addresses 0 and 1, so that $\iota(0, m) = 0$ and $\iota(1, m) = 1$. Let the offset of the permutation be $o = m - 1$. Then $\iota(\mathrm{torus}(0, m-1, m), m) = \mathrm{torus}(0, m-1, m) = m - 1$ and $\iota(\mathrm{torus}(1, m-1, m), m) = \mathrm{torus}(1, m-1, m) = 0$. Thus, we see that $\iota(0, m) < \iota(1, m)$ but $\iota(\mathrm{torus}(0, m-1, m), m) > \iota(\mathrm{torus}(1, m-1, m), m)$, and so the torus permutation is not a monotonic route.

Intuitively, a torus permutation is a $2^d$-monotonic route because if we partition the source addresses into $2^d$ sets according to whether or not they wrap around in each dimension, each such set forms a monotonic route. To formalize this notion, we assume without loss of generality that for each dimension $i = 1, 2, \ldots, d$, offset $o_i$ is nonnegative. For each dimension $i = 1, 2, \ldots, d$, we define

\[
\mathrm{wrap}_i(p, o, m) = \begin{cases} 0 & \text{if } (p_i + o_i) \bmod m_i \ge p_i , \\ 1 & \text{if } (p_i + o_i) \bmod m_i < p_i , \end{cases}
\]
and we define the vector

wrap(p, o, m) = (wrap1(p, o, m), wrap2(p, o, m), . . . , wrapd(p, o, m)) .

For each vector $j = (j_1, j_2, \ldots, j_d) \in \{0, 1\}^d$, we define the set

\[
W(j, o, m) = \{ p : \mathrm{wrap}(p, o, m) = j \} ;
\]
that is, $W(j, o, m)$ is the set of grid locations $p$ for which the vector $j$ describes whether the torus permutation causes it to wrap around in each dimension. Next, we define modified offset values $o'(j, o, m)$, which may now be negative, that describe the actual offsets used for each set $W(j, o, m)$. For $j = (j_1, j_2, \ldots, j_d) \in \{0, 1\}^d$ and $i = 1, 2, \ldots, d$,

\[
o'_i(j, o, m) = \begin{cases} o_i & \text{if } j_i = 0 , \\ o_i - m_i & \text{if } j_i = 1 , \end{cases}
\]
and we define

\[
o'(j, o, m) = (o'_1(j, o, m), o'_2(j, o, m), \ldots, o'_d(j, o, m)) .
\]

The following lemma shows that for each set $W(j, o, m)$, the offset vector $o'(j, o, m)$ describes a mesh permutation that is equivalent to the torus permutation performed on the grid locations in the set.

Lemma 3.2 For each $j = (j_1, j_2, \ldots, j_d) \in \{0, 1\}^d$ and for all grid locations $p \in W(j, o, m)$,

\[
\mathrm{torus}(p, o, m) = \mathrm{mesh}(p, o'(j, o, m)) .
\]

Proof: Consider a given value of $j \in \{0, 1\}^d$ and any dimension $i \in \{1, 2, \ldots, d\}$.

If $j_i = 0$, then $o'_i(j, o, m) = o_i$ and so the $i$th position of $\mathrm{mesh}(p, o'(j, o, m))$ is $p_i + o_i$. But by the definition of the set $W(j, o, m)$, if $j_i = 0$ then $\mathrm{wrap}_i(p, o, m) = 0$. This in turn implies that $(p_i + o_i) \bmod m_i \ge p_i$, and thus $(p_i + o_i) \bmod m_i = p_i + o_i$. Therefore, $\mathrm{torus}(p, o, m)$ and $\mathrm{mesh}(p, o'(j, o, m))$ agree in the $i$th dimension.

If instead $j_i = 1$, then $o'_i(j, o, m) = o_i - m_i$ and so the $i$th position of $\mathrm{mesh}(p, o'(j, o, m))$ is $p_i + o_i - m_i$. By the definition of the set $W(j, o, m)$, if $j_i = 1$ then $\mathrm{wrap}_i(p, o, m) = 1$. This in turn implies that $(p_i + o_i) \bmod m_i < p_i$, and thus $(p_i + o_i) \bmod m_i = p_i + o_i - m_i$. Therefore, $\mathrm{torus}(p, o, m)$ and $\mathrm{mesh}(p, o'(j, o, m))$ agree in the $i$th dimension.

In either case, we have shown that $\mathrm{torus}(p, o, m)$ and $\mathrm{mesh}(p, o'(j, o, m))$ are equal in each dimension.

Finally, we show that each set W (j, o, m) defines a monotonic route.

Lemma 3.3 For each $j = (j_1, j_2, \ldots, j_d) \in \{0, 1\}^d$ and for all grid locations $p \in W(j, o, m)$,

\[
\iota(\mathrm{torus}(p, o, m), m) - \iota(p, m) = \iota(o'(j, o, m), m) .
\]

Proof: By Lemmas 3.1 and 3.2,

\begin{align*}
\iota(\mathrm{torus}(p, o, m), m) - \iota(p, m) &= \iota(\mathrm{mesh}(p, o'(j, o, m)), m) - \iota(p, m) \\
&= \iota(o'(j, o, m), m) ,
\end{align*}
which completes the proof.

Because each set $W(j, o, m)$ defines a monotonic route and there are $2^d$ such sets, we conclude that a $d$-dimensional torus permutation is a $2^d$-monotonic route. As long as $(2^d + 1)BD \le M$, therefore, we can perform a $d$-dimensional torus permutation on $N = m_1 m_2 \cdots m_d$ elements using at most $(2^{d+1} + 1)\lceil N/BD \rceil$ parallel I/Os.

In fact, when all dimensions are powers of 2, the number of parallel I/Os needed to perform a torus permutation is independent of the number of dimensions. Section 4.3 shows that as long as vectors are laid out with element $X_k$ residing in track number $\lfloor k/BD \rfloor$, the elements of each source-vector track map to at most 2 target-vector tracks for any torus permutation regardless of the number of dimensions. When all dimensions are powers of 2, therefore, we can perform any torus permutation using only $5N/BD$ parallel I/Os.
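An exhaustive check of the decomposition (illustrative only, not thesis code): within each wrap set $W(j, o, m)$, the row-major index shifts by the constant $\iota(o'(j, o, m), m)$, as Lemmas 3.2 and 3.3 state. The offsets are assumed nonnegative, as in the text.

from itertools import product

def iota(p, m):
    idx = 0
    for p_i, m_i in zip(p, m):
        idx = idx * m_i + p_i                    # row-major index, equation (3.2)
    return idx

def torus(p, o, m):
    return tuple((p_i + o_i) % m_i for p_i, o_i, m_i in zip(p, o, m))

def check_torus_decomposition(m, o):
    for p in product(*(range(mi) for mi in m)):
        q = torus(p, o, m)
        j = tuple(int(q_i < p_i) for q_i, p_i in zip(q, p))            # wrap(p, o, m)
        o_prime = tuple(o_i - j_i * m_i for o_i, j_i, m_i in zip(o, j, m))
        assert iota(q, m) - iota(p, m) == iota(o_prime, m)

check_torus_decomposition(m=(4, 6), o=(3, 5))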

3.4 BMMC and BPC permutations

Chapter 2 defined the classes of BMMC and BPC permutations, demonstrated that these classes include many useful classes of permutations, and presented efficient algorithms to perform them. In this section, we show how to detect BMMC permutations using only $N/BD + \left\lceil \frac{\lg(N/B)+1}{D} \right\rceil$ parallel reads.

Once we can detect BMMC permutations, we can detect the other classes in Chapter 2 that are defined by characteristic matrices. The BMMC detection method below determines a candidate characteristic matrix and a candidate complement vector. To detect a subclass of BMMC permutations, we check that the characteristic matrix is of the correct form for the subclass. For BPC permutations, for example, we would check that each row and each column contains exactly one 1.

Detecting BMMC permutations

We detect BMMC permutations by first checking that $N$ is a power of 2; if not, the permutation cannot be BMMC. We then form a candidate characteristic matrix $A$ and complement vector $c$ by a method we are about to see that uses only $\left\lceil \frac{\lg(N/B)+1}{D} \right\rceil$ parallel reads. Next, we check that the matrix $A$ is nonsingular. Many data-parallel machines have an attached front-end machine, which is where this nonsingularity check would typically take place. If there is no front end and the check must occur on the data-parallel machine, $A$ and $c$ together are described by only $\lg^2 N + \lg N$ bits, so they easily fit in RAM. Finally, we verify that $A$ and $c$ define a BMMC permutation by checking for all $N$ source addresses $x$ and all $N$ target addresses $y$ whether $y = Ax \oplus c$. Verification takes $N/BD$ parallel reads, for a total of $N/BD + \left\lceil \frac{\lg(N/B)+1}{D} \right\rceil$ parallel reads for BMMC detection.

The method for forming the candidate characteristic matrix $A$ and candidate complement vector $c$ is based on two observations. First, if the permutation is BMMC, then the complement vector $c$ must be the target address corresponding to source address 0. This relationship holds because $x = 0$ and $y = Ax \oplus c$ imply that $y = c$.

The second observation is as follows. Consider a source address $x$ with binary representation $\mathrm{bin}(x) = (x_0, x_1, \ldots, x_{\lg N - 1})$, and suppose that bit position $k$ holds a 1, i.e., $x_k = 1$. Let us denote the $j$th column of matrix $A$ by $A_j$. Also, let $S_k$ denote the set of bit positions other than $k$ that hold a 1: $S_k = \{ j : j \ne k \text{ and } x_j = 1 \}$. If $y = Ax \oplus c$, then we have

\[
y = \left( \bigoplus_{j \in S_k} A_j \right) \oplus A_k \oplus c , \qquad (3.3)
\]
since only the bit positions $j$ for which $x_j = 1$ contribute a column of $A$ to the sum of columns that forms the matrix-vector product. If we know the target address $y$, the complement vector $c$, and the columns $A_j$ for all $j \ne k$, we can rewrite equation (3.3) to yield the $k$th column of $A$:

\[
A_k = y \oplus \left( \bigoplus_{j \in S_k} A_j \right) \oplus c . \qquad (3.4)
\]
We shall compute the complement vector $c$ first and then the columns of the characteristic matrix $A$ one at a time, from $A_0$ up to $A_{\lg N - 1}$. When computing $A_k$, we will have already computed $A_0, A_1, \ldots, A_{k-1}$, and these will be the only columns we need in order to apply equation (3.4). In other words, $S_k \subseteq \{0, 1, \ldots, k-1\}$. It is useful to think of each source address as being comprised of three fields of bits: the lower $\lg B$ bits give the record's offset within its block, the middle $\lg D$ bits give the disk number, and the upper $\lg(N/BD)$ bits give the track number.

From equation (3.4), it would be easy to compute $A_k$ if $S_k$ were empty. The set $S_k$ is empty if the source address is a unit vector, with its only 1 in position $k$. If we look at these addresses, however, we find that the target addresses for a disproportionate number (all but $\lg D$ of them) reside on disk $\mathcal{D}_0$. The block whose disk and track fields are all zero contains $\lg B$ such addresses, so they can be fetched in one disk read. A problem arises for the $\lg(N/BD)$ source addresses with one 1 in the track field: their target addresses all reside on different blocks of disk $\mathcal{D}_0$. Each must be fetched in a separate read. The total number of parallel reads to fetch all these addresses is $\lg(N/BD) + 1$.

To achieve only $\left\lceil \frac{\lg(N/B)+1}{D} \right\rceil$ parallel reads, each read fetches one block from each of the $D$ disks. The first parallel read determines the complement vector, the first $\lg B + \lg D = \lg BD$ columns, and the next $D - \lg D - 1$ columns. Each subsequent read determines another $D$ columns, until all $\lg N$ columns have been determined.

In the first parallel read, we do the same as above for the first $\lg BD$ bits. That is, we fetch blocks containing target addresses whose corresponding source addresses are unit vectors with one 1 in the first $\lg BD$ positions. As before, $\lg B$ of them are in the same block on disk $\mathcal{D}_0$. This block also contains address 0, which we need to compute the complement vector. The remaining $\lg D$ are in track number 0 of disks $\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_4, \mathcal{D}_8, \ldots, \mathcal{D}_{D/2}$. Having fetched the corresponding target addresses, we have all the information we need to compute the complement vector $c$ and columns $A_0, A_1, \ldots, A_{\lg BD - 1}$.

The columns we have yet to compute correspond to bit positions in the track field. If we were to compute these columns in the same fashion as the first $\lg BD$, we would again encounter the problem that all the blocks we need to read are on disk $\mathcal{D}_0$. In the first parallel read, the only unused disks remaining are those whose numbers are not a power of 2 ($\mathcal{D}_3, \mathcal{D}_5, \mathcal{D}_6, \mathcal{D}_7, \mathcal{D}_9, \ldots$). The key observation is that we have already computed all $\lg D$ columns corresponding to the disk field, and we can thus apply equation (3.4). For example, let us compute column $A_{\lg BD}$, which corresponds to the first bit of the track number. We read track 1 on disk $\mathcal{D}_3$ and find the first target address $y$ in this block. Disk number 3 corresponds to the first two disk-number columns, $A_{\lg B}$ and $A_{\lg B+1}$. Applying equation (3.4) with $S_{\lg BD} = \{\lg B, \lg B + 1\}$, we compute $A_{\lg BD} = y \oplus A_{\lg B} \oplus A_{\lg B+1} \oplus c$. The next column we compute is $A_{\lg BD+1}$. Reading the block at track 2 on disk $\mathcal{D}_5$, we fetch a target address $y$ and then compute $A_{\lg BD+1} = y \oplus A_{\lg B} \oplus A_{\lg B+2} \oplus c$. Continuing on in this fashion, we compute a total of $D - \lg D - 1$ track-bit columns from the first parallel read.

The remaining parallel reads compute the remaining track-bit columns, following the track-bit pattern of the first read, but we use all disks, not just those whose disk numbers are not powers of 2. Each block read fetches a target address $y$, which we exclusive-or with a set of columns from the disk field and with the complement vector to compute a new column from the track field. The first parallel read computes $\lg B + D - 1$ columns and all subsequent parallel reads compute $D$ columns. The total number of parallel reads is thus

    1 + ⌈(lg N − (lg B + D − 1))/D⌉ = 1 + ⌈(lg(N/B) − D + 1)/D⌉ = ⌈(lg(N/B) + 1)/D⌉ .
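To make equation (3.4) concrete, here is a minimal sketch, not from the thesis, in which each column A_k, the complement vector c, and each target address are represented as Python integers holding lg N bits; a new column is recovered from a fetched target address by exclusive-oring out the columns that are already known. The function name and arguments are illustrative only.

    # Sketch of equation (3.4): A_k = y XOR (XOR of A_j for j in S_k) XOR c,
    # with bit vectors represented as integers.
    def recover_column(y, known_columns, S_k, c):
        """y is the target address of the source address whose 1 bits lie in
        S_k union {k}; known_columns maps an already-computed index j to A_j."""
        a_k = y ^ c
        for j in S_k:
            a_k ^= known_columns[j]
        return a_k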

3.5 General matrix transpose

We can transpose any matrix efficiently, even when the number of elements is not a power of 2. This section presents a method that uses less than

    9 (RS/BD) ⌈lg min(R, S, B, RS/B) / lg(M/B)⌉ + (53/2)(RS/BD) + 11

parallel I/Os to transpose any R × S matrix, and it often uses fewer. Detection of general matrix transpose is also easy, requiring only ⌈N/BD⌉ parallel reads, where N = RS.

Performing general matrix transpose

We perform a general matrix transpose in three steps:

1. Partition the given matrix X into four submatrices, each with dimensions that are powers of 2.

2. Use the BPC algorithm to transpose each submatrix.

3. Reassemble the transposed submatrices to form X^T.

We determine the submatrix dimensions as follows. Let X have R rows and S columns, and define R′ = 2^{⌊lg R⌋} and S′ = 2^{⌊lg S⌋}. The value R′ is the largest power of 2 that does not exceed R, so R′ = R when R itself is a power of 2. The same holds for S′ and S. Note that R/2 < R′ ≤ R and S/2 < S′ ≤ S, which imply that R − R′ < R/2 and S − S′ < S/2. We partition X into four submatrices with the following dimensions:

    X = [ X_11  X_12 ]
        [ X_21  X_22 ] ,

where the column blocks have widths S′ and S − S′ and the row blocks have heights R′ and R − R′.

The transpose of X is then

    X^T = [ X_11^T  X_21^T ]
          [ X_12^T  X_22^T ] ,

where the column blocks have widths R′ and R − R′ and the row blocks have heights S′ and S − S′.

We can use the BPC algorithm to compute X_11^T because the dimensions R′ × S′ of X_11 are powers of 2. To compute X_12^T, X_21^T, and X_22^T using the BPC algorithm, however, we need to pad the dimensions of X_12, X_21, and X_22 to powers of 2. That is, we define the following matrices:

• X̂_12 is an R′ × S/2 matrix whose leading R′ × (S − S′) submatrix is equal to X_12.

• X̂_21 is an R/2 × S′ matrix whose leading (R − R′) × S′ submatrix is equal to X_21.

• X̂_22 is an R/2 × S/2 matrix whose leading (R − R′) × (S − S′) submatrix is equal to X_22.

We then run the BPC algorithm on X̂_12, X̂_21, and X̂_22 to compute their transposes. After computing X̂_12^T, X̂_21^T, and X̂_22^T, we set

• X_12^T to the leading (S − S′) × R′ submatrix of X̂_12^T,

• X_21^T to the leading S′ × (R − R′) submatrix of X̂_21^T, and

• X_22^T to the leading (S − S′) × (R − R′) submatrix of X̂_22^T.

We create the matrices X_11, X̂_12, X̂_21, and X̂_22 using a particularly efficient version of a 4-way split. As we saw in Section 3.3, we can perform a k-way split in at most ⌈N_s/BD⌉ + 2⌈N_t/BD⌉ + 2(k − 1) parallel I/Os. Here we have k = 4, N_s = RS, and

    N_t = R′S′ + R′(S − S′) + (R − R′)S′ + (R − R′)(S − S′)
        < RS + RS/2 + RS/2 + RS/4
        = (9/4) RS .

We can reduce the constants in the 4-way split in two ways. First, there is no need to read tracks of the matrices we create, since we don't care about the values in their pad regions. This observation allows us to reduce the term 2⌈N_t/BD⌉ to only ⌈N_t/BD⌉. The second way is to start each of the four matrices on a track boundary, which we need to do anyway. The 2(k − 1) term comes only from target tracks that share partitions in a k-way split. Since there are no such tracks, this term drops out altogether. The total number of parallel I/Os to create the four matrices is thus at most

    ⌈N_s/BD⌉ + ⌈N_t/BD⌉ ≤ ⌈RS/BD⌉ + ⌈(9/4)(RS/BD)⌉
                        < (13/4)(RS/BD) + 2 .

We then transpose the four matrices X_11, X̂_12, X̂_21, and X̂_22 using the BPC algorithm. We'll analyze the I/O requirements for these transpose operations in a moment.

Having computed X_11^T, X̂_12^T, X̂_21^T, and X̂_22^T, we need to reassemble them into the matrix X^T.

The reassembly is simply a 4-monotonic route with the source and target vectors reversing their roles from the partitioning. Thus, we have k = 4, N_s ≤ (9/4)RS, and N_t = RS, so the I/O count for reassembly is at most

    ⌈N_s/BD⌉ + 2k⌈N_t/BD⌉ = ⌈(9/4)(RS/BD)⌉ + 8⌈RS/BD⌉
                           < (41/4)(RS/BD) + 9 .

Note that this method is not recursive; we only partition and reassemble once. It remains for us to compute the I/O cost of performing the four submatrix transpose operations. Before doing so, we present a more concise characterization of the cross-rank for matrix transpose.
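The following is a minimal in-memory sketch, not from the thesis, of the partition/transpose/reassemble structure described above. It ignores disk I/O entirely, omits the padding (which matters only for the on-disk layout), and substitutes an ordinary transpose for the BPC algorithm; matrices are plain lists of rows.

    # In-memory sketch of the three-step general transpose: partition at the
    # largest powers of 2, transpose each block, reassemble.
    def largest_power_of_two_at_most(n):
        p = 1
        while 2 * p <= n:
            p *= 2
        return p

    def transpose_block(block):
        return [list(row) for row in zip(*block)] if block and block[0] else []

    def general_transpose(X):
        R, S = len(X), len(X[0])
        Rp, Sp = largest_power_of_two_at_most(R), largest_power_of_two_at_most(S)
        # Partition into X11, X12, X21, X22.
        X11 = [row[:Sp] for row in X[:Rp]]
        X12 = [row[Sp:] for row in X[:Rp]]
        X21 = [row[:Sp] for row in X[Rp:]]
        X22 = [row[Sp:] for row in X[Rp:]]
        # Transpose each block (the thesis would run the BPC algorithm here).
        T11, T12, T21, T22 = map(transpose_block, (X11, X12, X21, X22))
        # Reassemble: X^T = [[X11^T, X21^T], [X12^T, X22^T]].
        top = [T11[i] + (T21[i] if T21 else []) for i in range(Sp)]
        bottom = [T12[i] + (T22[i] if T22 else []) for i in range(S - Sp)]
        return top + bottom

For example, general_transpose([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) returns [[1, 4, 7], [2, 5, 8], [3, 6, 9]].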

The cross-rank of matrix transpose

It turns out that the cross-rank of a matrix transpose has a simple characterization when we normalize it by lg(M/B). We can use this characterization to give a tight bound on the number of parallel I/Os for the transpose of any matrix. We start by proving a useful fact about cross-ranks. Since cross-ranks apply only to BPC permutations, which require the number of elements to be a power of 2, we assume for now that the matrix dimensions are powers of 2.

Lemma 3.4 For any matrix A that characterizes a BPC permutation, ρ_{lg B}(A) − ρ_{lg M}(A) ≤ lg(M/B) and ρ_{lg M}(A) − ρ_{lg B}(A) ≤ lg(M/B).

Proof: Because A characterizes a BPC permutation, it is a permutation matrix. The rank of any R × S submatrix of a permutation matrix is the number of 1s in the submatrix, and it is no greater than min(R, S). Thus,

    ρ_{lg B}(A) = rank(A_{0..lg B−1, lg B..lg N−1})
                = rank(A_{0..lg B−1, lg B..lg M−1}) + rank(A_{0..lg B−1, lg M..lg N−1})

and

    ρ_{lg M}(A) = rank(A_{0..lg M−1, lg M..lg N−1})
                = rank(A_{0..lg B−1, lg M..lg N−1}) + rank(A_{lg B..lg M−1, lg M..lg N−1}) ,

and so

    ρ_{lg B}(A) − ρ_{lg M}(A) = [rank(A_{0..lg B−1, lg B..lg M−1}) + rank(A_{0..lg B−1, lg M..lg N−1})]
                               − [rank(A_{0..lg B−1, lg M..lg N−1}) + rank(A_{lg B..lg M−1, lg M..lg N−1})]
                             = rank(A_{0..lg B−1, lg B..lg M−1}) − rank(A_{lg B..lg M−1, lg M..lg N−1})
                             ≤ lg(M/B) − 0

and

    ρ_{lg M}(A) − ρ_{lg B}(A) = rank(A_{lg B..lg M−1, lg M..lg N−1}) − rank(A_{0..lg B−1, lg B..lg M−1})
                             ≤ lg(M/B) − 0 ,

which completes the proof.

Next, we show that the (lg B)-cross-rank of a matrix transpose is concisely characterized.

Lemma 3.5 Let A be a (lg N) × (lg N) permutation matrix characterizing a BPC permutation that is the transpose of an R × S matrix. Then ρ_{lg B}(A) = lg min(R, S, B, N/B).

Proof: We use the address-bit permutation formulation of matrix transpose in equation (2.2).

We start by showing that ρ_{lg B}(π) = lg min(R, S, B, N/B). We consider each of the four cases separately.

1. min(R, S, B, N/B) = R: As Figure 3.3(a) shows, lg R address bits cross from the upper side of position lg B to the lower side. The remaining lg(B/R) bits that end up on the lower side of position lg B start on the lower side. We have ρ_{lg B}(π) = lg R.

2. min(R, S, B, N/B) = S: As Figure 3.3(b) shows, lg S address bits cross from the lower side of position lg B to the upper side. The remaining lg(N/BS) bits that end up on the upper side of position lg B start on the upper side. We have ρ_{lg B}(π) = lg S.

3. min(R, S, B, N/B) = B: As Figure 3.3(c) shows, all of the lower lg B bits of the column number cross position lg B. We have ρ_{lg B}(π) = lg B.

4. min(R, S, B, N/B) = N/B: As Figure 3.3(d) shows, all of the upper lg(N/B) bits of the row number cross position lg B. We have ρ_{lg B}(π) = lg(N/B).


Figure 3.3: Cases in the proof of Lemma 3.5. Lower-order bit positions are on the left. (a) When min(R, S, B, N/B) = R, the lg R bits that end up on the left must come from the right of position lg B, and the remaining leftmost lg(B/R) bits start on the left of position lg B. (b) When min(R, S, B, N/B) = S, the lg S bits that end up on the right must come from the left of position lg B, and the remaining rightmost lg(N/BS) bits start on the right of position lg B. (c) When min(R, S, B, N/B) = B, all of the leftmost lg B bits cross position lg B. (d) When min(R, S, B, N/B) = N/B, all of the rightmost lg(N/B) bits cross position lg B.

When the address-bit permutation π and the characteristic matrix A are related by equations (2.1) (page 29) and (2.3) (page 31), we have that ρ_k(π) = ρ_k(A) for all k, which completes the proof.
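A quick numeric check of Lemma 3.5, not from the thesis, is sketched below. It represents the transpose's address-bit permutation as a rotation of bit positions and counts the bits that cross position lg B; the cross-rank convention used (source bits at or above position k whose targets fall below k) is an assumption, though the count is the same in either direction for a permutation. All of R, S, and B are taken to be powers of 2.

    def lg(n):
        return n.bit_length() - 1

    def transpose_bit_permutation(R, S):
        lgR, lgS = lg(R), lg(S)
        # source bit p is a column-number bit if p < lg S, otherwise a row-number bit
        return [p + lgR if p < lgS else p - lgS for p in range(lgR + lgS)]

    def cross_rank(k, pi):
        # number of source bits at positions >= k whose targets lie below position k
        return sum(1 for p, q in enumerate(pi) if p >= k and q < k)

    for R, S, B in [(2, 16384, 32), (256, 4, 64), (8, 8, 4)]:
        N = R * S
        pi = transpose_bit_permutation(R, S)
        assert cross_rank(lg(B), pi) == lg(min(R, S, B, N // B))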

We are now ready to give a simple characterization of the cross-rank for matrix transpose. This form matches the bound for matrix transpose given by Vitter and Shriver [VS90a, VS90b].

Theorem 3.6 We can transpose any R × S matrix, where R and S are powers of 2, using between

    (2N/BD)(2⌈lg min(R, S, B, N/B)/lg(M/B)⌉ + 1)  and  (2N/BD)(2⌈lg min(R, S, B, N/B)/lg(M/B)⌉ + 3)

parallel I/Os.

Proof: Let A be the (lg N) × (lg N) characteristic matrix for the matrix transpose, where N = RS. By equation (2.6) (page 38), ρ(A) equals either ρ_{lg B}(A) or ρ_{lg M}(A). In the former case,

we apply Lemma 3.5 to the bound of (2N/BD)(2⌈ρ(A)/lg(M/B)⌉ + 1) parallel I/Os for BPC permutations from Theorem 2.8 (page 44) to prove the theorem. In the latter case, we apply Lemmas 3.4 and 3.5:

    ⌈ρ(A)/lg(M/B)⌉ ≤ ⌈(ρ_{lg B}(A) + lg(M/B))/lg(M/B)⌉
                   = ⌈lg min(R, S, B, N/B)/lg(M/B)⌉ + 1 .    (3.5)

Plugging inequality (3.5) into the bound from Theorem 2.8, we get that the number of parallel I/Os is at most

    (2N/BD)(2(⌈lg min(R, S, B, N/B)/lg(M/B)⌉ + 1) + 1) = (2N/BD)(2⌈lg min(R, S, B, N/B)/lg(M/B)⌉ + 3) ,

which completes the proof.

Total I/O count for general matrix transpose

We are now ready to total up the number of parallel I/Os for general matrix transpose.

Theorem 3.7 We can transpose any R × S matrix, for any dimensions R and S, using less than

    9 (RS/BD) ⌈lg min(R, S, B, RS/B)/lg(M/B)⌉ + (53/2)(RS/BD) + 11

parallel I/Os.

Proof: We counted the parallel I/Os for partitioning and reassembly above, so we only need to count the I/Os for the four submatrix transpose operations and then sum all the I/O counts together:

1. The matrix X_11 has dimensions R′ × S′, so by Theorem 3.6, computing X_11^T requires at most

    (2R′S′/BD)(2⌈lg min(R′, S′, B, R′S′/B)/lg(M/B)⌉ + 3) ≤ (2RS/BD)(2⌈lg min(R, S, B, RS/B)/lg(M/B)⌉ + 3)

parallel I/Os.

2. The matrix X̂_12 has dimensions R′ × (S − S′), so by Theorem 3.6, computing X̂_12^T requires at most

    (2R′(S − S′)/BD)(2⌈lg min(R′, S − S′, B, R′(S − S′)/B)/lg(M/B)⌉ + 3)
        < (RS/BD)(2⌈lg min(R, S, B, RS/B)/lg(M/B)⌉ + 3)

parallel I/Os.

3. The matrix X̂_21 has dimensions (R − R′) × S′, so by Theorem 3.6, computing X̂_21^T requires at most

    (2(R − R′)S′/BD)(2⌈lg min(R − R′, S′, B, (R − R′)S′/B)/lg(M/B)⌉ + 3)
        < (RS/BD)(2⌈lg min(R, S, B, RS/B)/lg(M/B)⌉ + 3)

parallel I/Os.

4. The matrix X̂_22 has dimensions (R − R′) × (S − S′), so by Theorem 3.6, computing X̂_22^T requires at most

    (2(R − R′)(S − S′)/BD)(2⌈lg min(R − R′, S − S′, B, (R − R′)(S − S′)/B)/lg(M/B)⌉ + 3)
        < (RS/2BD)(2⌈lg min(R, S, B, RS/B)/lg(M/B)⌉ + 3)

parallel I/Os.

Now we total up all the I/Os. Totaling up the coefficients, we see that the four submatrix transpose operations require less than

    (9/2)(RS/BD)(2⌈lg min(R, S, B, RS/B)/lg(M/B)⌉ + 3)

parallel I/Os. Adding in the I/Os for partitioning and reassembly, we get a total count of less than

    ((13/4)(RS/BD) + 2) + (9/2)(RS/BD)(2⌈lg min(R, S, B, RS/B)/lg(M/B)⌉ + 3) + ((41/4)(RS/BD) + 9)
        = 9 (RS/BD) ⌈lg min(R, S, B, RS/B)/lg(M/B)⌉ + (53/2)(RS/BD) + 11 ,

which completes the proof.

Although the I/O count of Theorem 3.7 has relatively large constant factors—9 and 53/2—this method is still likely to be practical. The constants are derived from taking upper bounds. For the partitioning and reassembly analyses, the tightness of some bounds depends on how close R and S are to their next lower powers of 2. We bounded R − R′ by R/2 and S − S′ by S/2. If R is just above a power of 2, then R − R′ is close to 0, and the same holds for S and S − S′. A more careful analysis would then yield lower constants. In a more extreme case, either R or S is a power of 2. Two of the submatrices would then drop out of the method altogether, and the constants would then be much lower.

Detecting general matrix transpose

General matrix transpose is much simpler to detect than BMMC permutations. We can determine from just one target address the number of columns S that the transposed N-element matrix must have. The number of rows R must then equal N/S. Given these dimensions, we can easily verify that each target address corresponds to where it belongs in a row-major ordering of the transpose.

If the target-address vector is X, the one target address we need is X_1. Why? Indexing rows and columns from 0, this element is the target address of matrix entry (0, 1). Thus it is the row-major mapping of entry (1, 0) in the transpose. If there are S columns in the transpose, entry (1, 0) maps to S.

Thus, we can detect a general matrix transpose using only ⌈N/BD⌉ parallel reads. In the first read, we determine candidate values of R and S and verify that all the target addresses of the track read are consistent with the transpose of an R × S matrix. The remaining ⌈N/BD⌉ − 1 parallel reads are used to verify the remaining target addresses.
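A minimal in-memory sketch of this detection check follows; it is not the VM-DP code, it ignores the track-by-track I/O structure, and it follows the convention above (S is the number of columns of the transposed matrix, read from the target address of source element 1). The list targets stands in for the target-address vector.

    # targets[x] is the target address of source element x.
    def detect_transpose(targets):
        N = len(targets)
        if N < 2 or targets[0] != 0:
            return None
        S = targets[1]                 # columns of the transposed matrix
        if S <= 0 or N % S != 0:
            return None
        R = N // S                     # rows of the transposed matrix
        for x in range(N):
            i, j = divmod(x, R)        # source entry (i, j); the source has R columns
            if targets[x] != j * S + i:
                return None
        return R, S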

3.6 Invoking special permutations

There are three ways for a system to handle special permutations such as those we examined in Sections 3.3–3.5:

1. Treat them as general permutations.

2. Detect them at run time and then call special routines to perform them.

3. Specify them in the source code and have the compiled code call special routines to perform them.

In this section, we argue in favor of the third option, specifying them in the source code, whenever possible. We shall do so by examining the advantages and disadvantages of each method.

Treating special permutations as general permutations

The only advantages of treating special permutations as general permutations are that no additional source language constructs or library functions are required and that there is no overhead for detection.

The disadvantage is obvious: general permutations take longer to perform than special permutations (except for the worst cases of BPC and BMMC permutations). This holds true even if we perform general permutations by sorting target addresses using the asymptotically optimal sorting algorithms of Nodine and Vitter [NV90, NV91, NV92] or Vitter and Shriver [VS90a, VS90b] rather than the easier-to-code external radix sort of Section 3.1.

Detecting special permutations at run time

As we saw in Sections 3.3–3.5, the overhead for detecting special permutations is typically quite low. In the worst case, for BMMC permutations, we had to spend ⌈N/BD⌉ + ⌈(lg(N/B) + 1)/D⌉ parallel I/Os to generate a hypothesized BMMC permutation and then verify it. The savings due to performing the permutation with special routines more than make up for the additional parallel I/Os for detection. Another advantage of run-time detection is that, like treating special permutations as general ones, no additional source language constructs or library functions are required.

There are two disadvantages to run-time detection. First, we pay an overhead penalty when the permutation we are given turns out to be none of the special ones we can detect and perform quickly. Second, although each detection scheme in Sections 3.3–3.5 is relatively inexpensive, when one adds up the costs of detecting all the different special permutations, the total cost might be substantial. We can defray up to ⌈N/BD⌉ parallel I/Os of the total cost, however, by performing the census count of external radix sort at the same time as we attempt to detect special permutations.

Specifying special permutations in the source code

There is only one reason not to specify special permutations in the source code: it requires additional language constructs or library functions. There are, however, four compelling arguments in favor of source-level specification. We save the best one for last, because we back it up with a case study.

First, source-level specification often allows a very compact representation of the permutation. We represent a general permutation on N elements by N lg N bits: N target addresses, each of which is lg N bits long. Mesh and torus permutations in d dimensions require only d integers to specify the offsets, one offset per dimension, for a total of d lg N bits. Matrix transpose requires only two (lg N)-bit integers. BPC and BMMC permutations require only lg² N + lg N bits to specify the characteristic matrix and complement vector. For about one billion records (N = 2^30) this amounts to only 930 bits compared to about 30 billion bits for general permutations. For about one trillion records (N = 2^40) the comparison is 1640 bits vs. about 40 trillion bits.

Second, when the representation is compact enough that it fits in the front-end machine, we can get higher RAM utilization in the parallel machine. Consider, for example, BPC permutations with data elements that are 4-byte integers. If we treat them as general permutations and sort them based on their 4-byte target addresses, at least half of the RAM is filled by target addresses rather than the data that we wish to move. On the other hand, the BPC algorithm of Chapter 2 uses RAM only for data. In essence, the RAM size is at least twice that available to a general-permutation routine.

Third, source-level specification often matches what the programmer has in mind. Only the most oblivious programmer would generate the target addresses for a mesh permutation without realizing that he wanted to perform a special type of permutation. The same goes for torus permutations and matrix transpose. It even applies to many BPC permutations, such as bit-reversal, vector-reversal, and hypercube permutations.

Fourth, the overhead of generating all N target addresses is often avoided by source-level specification. Generating the target addresses can take much longer than performing the permutation. This can be true even when we use the general permutation routine to perform it! The following case study shows that generating the target addresses for BPC permutations can take Θ((N/BD) lg N) parallel I/Os, which is significantly worse than even the general permutation I/O cost of Θ((N/BD)(lg N/lg(M/BD))).

A case study: BPC permutations in VM-DP

In the remainder of this section, we report empirical data from the VM-DP system on performing BPC permutations, drawing the following conclusions:

1. We can predict the performance of external radix sort and of the BPC algorithm of Chapter 2 very accurately.

2. If we are not clever in how we generate the target addresses, the cost of doing so is exorbitant. If we are clever, the cost is merely high. Target-address generation can be expensive and should be avoided.

This case study is small but conclusive. It is small because the simulations performed by the VM-DP system are limited by the address-space restrictions of the workstation and by the slow speed of the simulation. It is conclusive because it strongly supports both of the above conclusions for all problem sizes examined. The permutations have the following characteristics in common:

• Each element is a 4-byte integer.

• Each block is 128 bytes and thus can hold 32 4-byte integers.

• The RAM is 16384 bytes and thus can hold 4096 4-byte integers. We then have lg(M/B) = lg(4096/32) = 7.

• There are D = 4 disks and P = 16 processors.

• Each permutation is the transpose of an R × 16384 matrix, where the number of rows R varies but is always a power of 2.

We examine two ways to perform the transpose permutations, specified by a run-time flag that is checked by the VM-DP system. It can treat them as general permutations, or it can attempt to detect BPC permutations and perform them specially. As explained in Sections 2.9 and 4.6, when it performs a BPC permutation specially, the VM-DP system actually performs a slightly different BPC permutation than it is given. This discrepancy is because integer vectors are laid out within tracks in a slightly different fashion than the algorithms of Chapter 2 expect. The difference between the permutation the system is given and the one it performs is so small that their cross-ranks differ by at most two.

We also examine two ways to generate the target addresses. The less clever method alluded to above generates them one bit at a time. That is, from equations (2.1) and (2.2) (page 31), we set target-address bit y_{(j + lg R) mod lg N} to source-address bit x_j for j = 0, then for j = 1, and so on up to j = lg N − 1. Each bit position requires making one pass over the source-address and target-address vectors, for a total of Θ((N/BD) lg N) parallel I/Os. As N increases, this cost becomes exorbitant.

                       observed     observed radix
    N         R        BPC I/Os     sort I/Os
    32768     2        2317         6408
    65536     4        4621         14856
    131072    8        9230         29704
    262144    16       18446        59400
    524288    32       36878        135176
    1048576   64       73742        270344
    2097152   128      212999       540680
    4194304   256      425991       1212424

Table 3.1: Observed I/O counts for the BPC algorithm and external radix sort.

The more clever method of generating target addresses generates several bits at a time by using arithmetic on the number of rows and columns. A source address for row i and column j has the value x = iS + j in row-major ordering, and its corresponding target address has the row-major value y = jR + i. Since i = ⌊x/S⌋ and j = x mod S, we can compute the target addresses in a constant number of passes, or Θ(N/BD) parallel I/Os, by the following sequence of vector operations:

    i ← ⌊x/S⌋
    j ← x mod S
    y ← j ∗ R
    y ← y + i .
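As a hedged illustration, the sketch below carries out the same computation in ordinary Python; the thesis performs these steps as data-parallel vector operations within VM-DP, so this is only an in-memory analogue of the arithmetic above.

    # In-memory analogue of the "clever" target-address generation for the
    # transpose of an R x S matrix stored in row-major order.
    def transpose_targets(R, S):
        N = R * S
        targets = [0] * N
        for x in range(N):
            i, j = divmod(x, S)        # i <- floor(x/S), j <- x mod S
            targets[x] = j * R + i     # y <- j*R + i
        return targets

    # e.g., transpose_targets(2, 3) == [0, 2, 4, 1, 3, 5]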

We can predict the I/O counts of the BPC algorithm quite accurately. Because of aliasing within the Vcode interpreter, the VM-DP system always works with a copy of the vector. The predicted number of parallel reads is

• N/BD to make the copy,

• ⌈(lg(N/B) + 1)/D⌉ to form the candidate characteristic matrix and complement vector,

• N/BD to verify them, and

• (2⌈ρ/lg(M/B)⌉ + 1)(N/BD) for all passes of the algorithm.

[Figure 3.4 is a plot of parallel I/Os versus lg N for external radix sort and the BPC algorithm.]

Figure 3.4: A plot of Table 3.1.

The predicted number of parallel writes is N/BD to make the copy and (2⌈ρ/lg(M/B)⌉ + 1)(N/BD) for all passes of the algorithm. Here, lg(M/B) = 7. The predicted read count matches the observed read count exactly. The predicted and observed write counts differ by at most 10, this difference being due to writes of changed tracks when RAM is cleared out prior to running the BPC algorithm.

We can predict the I/O counts of external radix sort with equal accuracy. Because the target addresses and records are separate vectors, we perform twice as many I/Os per pass. Because of aliasing within the Vcode interpreter, the system works with a copy of the target-address vector and of the vector containing the records to be permuted. The predicted number of parallel reads is

• 2N/BD to make the copies,

• N/BD to form the census counts for buckets,

• 2⌈lg N/lg(M/4BD)⌉(N/BD) for all passes, and

• N/BD during compaction (only the records are compacted, not the keys).

                           other reads,   other reads/track,   other reads,   other reads/track/lg N,
    N         R   tracks   fast method    fast method          slow method    slow method
    32768     2   256      4873           19.035               60232          15.685
    65536     4   512      9737           19.018               128076         15.634
    131072    8   1024     19465          19.009               271441         15.593
    262144    16  2048     38921          19.004               573525         15.558
    524288    32  4096     77833          19.002               1208409        15.527
    1048576   64  8192     155657         19.001               2539614        15.501
    2097152   128 16384    311305         19.001
    4194304   256 32768    622601         19.000

Table 3.2: Read counts excluding the actual permuting for the faster and slower target-address generation methods. The faster method is proportional to N/BD, and the slower method is proportional to (N/BD) lg N. There are no results for the slower-method simulation with N ≥ 2097152 because the simulation took so long that it was killed.

The predicted number of parallel writes is the same, minus the N/BD for forming the census counts. The total number of parallel I/Os is thus (4⌈lg N/lg(M/4BD)⌉ + 7)(N/BD). Again, the predicted and observed read counts match exactly, and the write counts differ by at most 8.

As Table 3.1 and Figure 3.4 show, external radix sort uses many more I/Os than the specialized BPC algorithm. For these examples, external radix sort uses between 2.5 and 4 times as many parallel I/Os. Some of these additional I/Os are due to external radix sort having to read and write the target addresses, and the remainder are due to the BPC algorithm being asymptotically more efficient.

Table 3.2 and Figure 3.5 show the number of parallel reads performed other than in actually performing the permutation. Most of these reads occur during generation of the target addresses. Some of them, however, are due to system overhead and appear to be unavoidable. The parallel-write counts are comparable. Observe that with the faster target-address-generation method, the number of reads exceeds the number used to perform even external radix sort.

[Figure 3.5 is a plot of parallel reads versus lg N for the slow method, the fast method, and external radix sort.]

Figure 3.5: A plot of Table 3.2. Even with the faster target-address-generation method, the read counts slightly exceed those incurred during external radix sort.

The table shows that the ratio of the number of reads to the number of tracks (which is directly proportional to N with B and D fixed in this example) is just over 19. Thus the number of reads grows linearly in N. With the slower method, we see that the number of reads per track, divided by lg N, is just over 15.5; it grows as (N/BD) lg N. Here, the cost of generating the target addresses dwarfs all other costs, including the cost of permuting.

Recap

This case study provides strong evidence that generating target addresses, even when done cleverly, can be very costly. Why should the programmer go through the effort of writing code to generate target addresses just so the run-time system can go through the effort of detecting special permutations, when a compact, direct method is available? Source-level specification is clearly the method of choice.

3.7 Conclusions

In this section, we present some conclusions and discuss future work.

The principal lesson of this chapter is that in data-parallel systems with virtual memory, permuting is so expensive that any special permutations that can be performed faster than general permutations should be specified at the source level. Source-level specification allows for direct invocation of the special routines to perform the permutation, permits a more compact representation of the permutation and hence greater utilization of the available RAM, and avoids the high expense of target-address generation.

We saw how to perform and detect several such permutations: monotonic routes, mesh and torus permutations, BMMC and BPC permutations, and general matrix transpose. Unfortunately, Nesl, which is the source language for the VM-DP system, has no way to invoke special routines for permutations, with one exception. That exception is the pack function, a type of monotonic route in which every position of the target vector receives an element. This function, useful for load-balancing, has its own Vcode instruction and CVL functions, which is why we can invoke it specially. There is no Vcode instruction or CVL entry point for other special permutations. Modification of Nesl and Vcode to permit special library functions is highly desirable from the point of view of the VM-DP system.

VM-DP includes run-time detection of BPC permutations, and it performs them with special routines. We have yet to add code to detect and specially perform mesh, torus, BMMC, and general matrix-transpose permutations. (The BPC detection code needs only a nonsingularity check added to become BMMC detection code.) VM-DP includes a low-level monotonic route function, which is invoked not only by the pack function, but as a subroutine by other CVL functions.

There are two other directions for future work. One is to implement other sorting algorithms to perform general permutations. Obvious candidates are the most recent one by Nodine and Vitter [NV92] and the Vitter-Shriver randomized algorithm [VS90a, VS90b]. Another interesting candidate would be an external version of sample sort [BLM+91], such as Smith's implementation [Smi92]. The other direction for future work is to extend the catalog of special permutation classes that we can perform quickly. Ideally, we should be able to detect such special permutations efficiently as well, but if they are specified at the source level, detection becomes moot.

Chapter 4

Vector Layout

In a data-parallel computer with virtual memory, the way in which vectors are laid out on the disk system affects the performance of data-parallel operations. This chapter presents a general method of vector layout called banded layout, in which we divide a vector into bands of a number of consecutive vector elements laid out in column-major order. This chapter analyzes the effect of band size β on upper bounds for the major classes of data-parallel operations, deriving the following results:

• For permuting operations, the best upper bounds come from band sizes that are a track or smaller. There is no optimal band size for random permutations or BPC permutations. Regardless of the number of grid dimensions, we can perform mesh and torus permutations efficiently when all dimensions are powers of 2. Specifically, these permutations require at most 5N/BD parallel I/Os when the band size is at most the track size BD, and they require at most 9N/BD parallel I/Os otherwise. The smaller upper bound for β ≤ BD suggests that we should use small band sizes for permuting.

• For scan operations, the best upper bounds occur when the band size equals the size of the I/O buffer, and these sizes depend on several machine parameters. In particular, the best band size is a power of 2 that is near √(α · NBD · S/IO) and is between BD and M, where S is the time to perform a physical scan operation, IO is the time to perform a parallel I/O operation, and α is the fraction of tracks in the I/O buffer that are in the vector being scanned.

• The band size has no effect on the performance of reduce operations.


• When there are several different record sizes, band sizes for elementwise operations should be based on the largest record size, which has the smallest band size.

This chapter focuses on vectors that occupy several tracks. An N-record vector starts at the beginning of a track and occupies exactly ⌈N/BD⌉ consecutive tracks on the disk array. For simplicity in exposition and analysis we shall assume that N is an integer multiple of BD and hence of P. In this chapter, we relax the assumption of the machine model from Section 1.3 that vectors are laid out a track at a time. As we shall see, when the band size exceeds the size of a track, we do not have the first BD elements of a vector in track 0, the next BD in track 1, and so on. The track that a given element maps to will depend on the band size among other parameters.

This chapter is organized as follows. Section 4.1 presents how to lay out vectors in terms of a parameterized band size. Section 4.2 summarizes the three major classes of data-parallel operations considered in Sections 4.3–4.6. Section 4.3 discusses the effect of band size on the performance of several permutation classes. Section 4.4 presents an algorithm for the scan operation on a banded vector and gives a precise formula for the performance of the operation under a certain set of assumptions. This section also briefly analyzes reduce operations. Section 4.5 analyzes the scan formula to derive the optimal band size. Section 4.6 looks at the effect of band size on elementwise operations. Finally, Section 4.7 contains some concluding remarks.

4.1 Banded vector layout

This section presents banded vector layout, which is a general framework for laying out vectors on a parallel disk system for parallel processing. A banded layout is characterized by a parameter we call the band size. After defining banded layout in general, we shall show how some specific layout methods result from particular choices for the band size. Then we shall see the one-to-one mapping between a record's index in its vector and its location (track number, row within track, and processor number) in the banded layout.

    P0  P1  P2  P3  P4  P5  P6  P7
     0   4   8  12  16  20  24  28  \
     1   5   9  13  17  21  25  29   | band
     2   6  10  14  18  22  26  30   |
     3   7  11  15  19  23  27  31  /
    32  36  40  44  48  52  56  60  \
    33  37  41  45  49  53  57  61   | band
    34  38  42  46  50  54  58  62   |
    35  39  43  47  51  55  59  63  /

Figure 4.1: Banded layout of a vector, shown for N = 64 elements in the vector, P = 8 processors, and β = 32 elements per band. Each position shows the index of the element mapped to that position. If each band is also a track, it is a Vitter-Shriver layout as well.

Banded layout

In a banded layout, we divide a vector of length N into bands of β elements each. We restrict the band size β to be a power of 2 times the number of processors. Figure 4.1 shows an example of banded layout for P = 8 and β = 32. Each row in Figure 4.1 contains one element per processor. Element indices vary most rapidly within each processor, then among processors within a band, and they vary least rapidly from band to band. Within each band, elements are in column-major order. The mapping of elements to disk locations follows directly from this mapping of elements to processors according to the scheme of Figure 1.1. We use processor rather than disk mapping because, as we shall see, both processing and disk I/O time are criteria for comparing layout methods. The total number of rows is sometimes called the virtual processor ratio, and it equals N/P. The number of bands for a vector of length N is N/β. Each band contains β/P rows.

Particular banded layouts

The specific choice of the band parameter β determines the exact layout of a vector. Three particular choices yield familiar vector layouts: row-major, column-major, and the Vitter-Shriver scheme used in Chapter 2. We shall see in later sections that for some operations, the band size doesn't affect the performance but for others it does. In particular, the best layout is either the Vitter-Shriver one or a less common style that seems to be rarely used.

    P0  P1  P2  P3  P4  P5  P6  P7
     0   1   2   3   4   5   6   7   band
     8   9  10  11  12  13  14  15   band
    16  17  18  19  20  21  22  23   band
    24  25  26  27  28  29  30  31   band
    32  33  34  35  36  37  38  39   band
    40  41  42  43  44  45  46  47   band
    48  49  50  51  52  53  54  55   band
    56  57  58  59  60  61  62  63   band

Figure 4.2: Row-major layout of a vector, shown for P = 8 and N = 64 elements in the vector.

    P0  P1  P2  P3  P4  P5  P6  P7
     0   8  16  24  32  40  48  56
     1   9  17  25  33  41  49  57
     2  10  18  26  34  42  50  58
     3  11  19  27  35  43  51  59
     4  12  20  28  36  44  52  60
     5  13  21  29  37  45  53  61
     6  14  22  30  38  46  54  62
     7  15  23  31  39  47  55  63

Figure 4.3: Column-major layout of a vector, shown for P = 8 and N = 64 elements in the vector. The entire vector forms one band.

When β = P, we have row-major layout, shown in Figure 4.2. Each row is a band, and there are N/P bands. Element X_i is in track ⌊i/BD⌋, processor i mod P, and row ⌊(i mod BD)/P⌋ within its track. Row-major layout has two advantages. First, each track contains a contiguous subset of the vector, so that we can access the entire vector from beginning to end by reading it track by track. Second, the mapping of elements to processors depends only on the number of processors and the element index; it does not depend on any machine parameters or the vector length. As we shall see in Section 4.5, row-major order suffers from the disadvantage that it requires many physical scans during scan operations.

When β = N, we have column-major layout, shown in Figure 4.3. The entire vector forms one band. Element X_i is in track ⌊P(i mod (N/P))/BD⌋, processor ⌊iP/N⌋, and row i mod (BD/P) within its track. Column-major order is a good way to lay out vectors when all data

fits in RAM because, as we shall see in Section 4.5, it requires only one physical scan during a scan operation. It is a poor choice when data does not fit in RAM, because it can lead to additional I/O costs during scans.

The Vitter-Shriver layout uses β = BD. That is, each band is exactly one track, and there are N/BD bands. Figure 4.1 is a banded layout with a track size of BD = β = 32. Element X_i is in track ⌊i/BD⌋, processor ⌊(i mod BD)P/BD⌋, and row i mod (BD/P). Like row-major order, each track contains a contiguous subset of the vector. In addition, each disk block does, too.

Mapping indices to banded layout locations

For a given band size β and machine parameters P, B, and D, there is a one-to-one mapping between the index of each element and a location within the banded layout, specified by a track number, a row within the track, and a processor number. We just saw examples of these mappings for row-major, column-major, and Vitter-Shriver layouts. We now present a mapping scheme that applies to any band size. Although it is interesting in its own right, we shall use this scheme in Section 4.3 to prove that we can efficiently perform a mesh or torus permutation on any vector with a banded layout regardless of the dimension of the underlying grid, and that the efficiency is better for small band sizes.

Given the (lg N)-bit index i of an element, the following scheme determines the number of the element's track (between 0 and N/BD − 1), the number of its processor (between 0 and P − 1), and the number of its row within its track (between 0 and BD/P − 1). The scheme has two cases, depending on the relative sizes of the band size β and the track size BD.

Figure 4.4(a) shows the scheme for the case in which β ≤ BD:

• The track number is given by the most significant lg(N/BD) bits, i.e., bits lg BD, lg BD + 1, . . . , lg N − 1. Thus, the track number of the ith element is ⌊i/BD⌋.

• The processor number is given by the lg P bits lg(β/P), lg(β/P) + 1, . . . , lg β − 1. Thus, the processor number of the ith element is ⌊(i mod β)/(β/P)⌋ = ⌊(i mod β)P/β⌋.

• The row within the track is given by the lg(BD/P) bits formed by concatenating the lg(β/P) bits 0, 1, . . . , lg(β/P) − 1 and the lg(BD/β) bits lg β, lg β + 1, . . . , lg BD − 1. Thus, the row number of the ith element is (i mod (β/P)) + ⌊(i mod BD)/β⌋(β/P).


Figure 4.4: How element indices map to track numbers (bits labeled "t"), processor numbers ("p"), and row numbers within tracks ("r"). Least significant bits are on the right. (a) The scheme for β ≤ BD. (b) The scheme for β ≥ BD.

Figure 4.4(b) shows the scheme for the case in which β ≥ BD:

• The track number is given by the lg(N/BD) bits formed by concatenating the lg(β/BD) bits lg(BD/P), lg(BD/P) + 1, . . . , lg(β/P) − 1 and the lg(N/β) bits lg β, lg β + 1, . . . , lg N − 1. Thus, the track number of the ith element is ⌊(i mod (β/P))/(BD/P)⌋ + ⌊i/β⌋(β/BD) = ⌊(i mod (β/P))P/BD⌋ + ⌊i/β⌋(β/BD).

• The processor number is the same as for β ≤ BD. It is given by the lg P bits lg(β/P), lg(β/P) + 1, . . . , lg β − 1. Thus, the processor number of the ith element is ⌊(i mod β)/(β/P)⌋ = ⌊(i mod β)P/β⌋.

As one might expect, these two cases are equivalent for the Vitter-Shriver layout. That is, when β = BD, the least significant lg(BD/P ) bits give the row within the track, the next lg P bits give the processor number, and the most significant lg(N/BD) give the track number. Moreover, we can view the least significant lg BD bits in a slightly different way. The least significant lg B bits give the offset of each element within its disk block, and the next lg D bits identify the disk containing the element. Because of this simple partition of bits among 124 Chapter 4. Vector Layout offset, disk number, and track number, the Vitter-Shriver layout is assumed by the parallel-disk algorithms of Chapters 2 and 3, Vitter and Shriver [VS90a, VS90b], and Nodine and Vitter [NV90, NV91, NV92].

4.2 Data-parallel operations

This short section presents an overview of the three major classes of data-parallel operations: elementwise operations, permuting operations, and scans. (See Blelloch [Ble90] for more on these classes of operations.) Sections 4.3–4.6 study the effect of band size on each of these operation classes.

Elementwise operations

In an elementwise operation, we apply a function to one or more source vectors, or operands, to compute a target vector, or result. All vectors involved are of equal length, and the value of each element of the result depends only on the corresponding elements in the operands. A simple example is elementwise addition: Z ← X + Y. We set Z_i to the sum X_i + Y_i for each index i = 0, 1, . . . , N − 1. At first glance it might seem that the elementwise operations should be independent of the band size. As Section 4.6 shows, however, some elementwise operations take vectors with different record sizes. Such operations might affect the choice of band size.

Permuting operations

A permuting operation moves some or all of the elements from a source vector into a target vector according to a given mapping from the source-vector indices to the target-vector indices. Section 4.3 studies the effect of band size on permuting operations and concludes that band sizes less than or equal to the track size are best.

The mapping and the form in which it is specified may vary. In a general permutation, the mapping is a vector A of indices in the target vector. For source vector X and target vector Y, we set Y_{A_i} ← X_i for all indices i such that 0 ≤ A_i ≤ N − 1.

There are many classes of special permutations. Chapters 2 and 3 showed that many of them can be specified more compactly and performed faster than general permutations. Section 4.3 studies three classes of special permutations: BPC permutations, mesh permutations, and torus permutations.

Scans

A scan operation, also known as a parallel-prefix operation, takes a source vector and yields a result vector for which each element is the "sum" of all the prior elements¹ of the source vector. Here, "sum" refers to any associative operation, which we denote by ⊕. Typical operations are addition, multiplication, logical-and, inclusive-or, exclusive-or, minimum, and maximum. The source and target vectors are of equal length. For example, here is a source vector X and a vector Y that is the result of scanning X with addition:

    i      0   1   2   3   4   5   6   7   8   9
    X_i    5   7  −3   4  −9  −2   2   0  −1   6
    Y_i    0   5  12   9  13   4   2   4   4   3
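As a small illustration, not from the thesis, the following exclusive scan with addition reproduces the table above.

    # Exclusive scan: each output element is the "sum" of all prior inputs.
    def exclusive_scan(X, op=lambda a, b: a + b, identity=0):
        Y, running = [], identity
        for x in X:
            Y.append(running)
            running = op(running, x)
        return Y

    # exclusive_scan([5, 7, -3, 4, -9, -2, 2, 0, -1, 6])
    #   == [0, 5, 12, 9, 13, 4, 2, 4, 4, 3]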

Element Y_0 receives the identity value for the operation, which is 0 for addition. Parallel machines often provide hardware [BK82, LF80] to perform scan operations with one element per processor. We call such an operation a physical scan. Section 4.4 presents a method for performing scans that works well for all band sizes. Section 4.5 analyzes this method to determine the optimal band size for scans and also the optimal size of the I/O buffer used during scanning.

Related to scans are reduce operations, which apply an associative operator ⊕ to an operand vector, returning a single "sum" of all elements of the operand. Section 4.4 briefly studies reduce operations.

¹This type of scan operation, which includes only prior elements, is often called an exclusive scan, as opposed to an inclusive scan, which includes the element itself and all prior ones.

Segmented operations

Scans, reduces, and permuting operations can treat vectors as segmented, so that they are understood to be smaller vectors concatenated together. Segmented operations are typically implemented by performing an equivalent unsegmented operation. (See Blelloch [Ble90] for more information on the uses and implementations of segmented vector operations.) As such, they have no effect on vector-layout issues, and we shall not consider them in this chapter.

4.3 Permuting operations and banded layout

In this section, we examine the effect of band size on permuting operations. We look at random permutations, monotonic routes, BPC permutations, and mesh and torus permutations. We shall see that random permutations and BPC permutations have no optimal band size, and that mesh and torus permutations have upper bounds of 5N/BD parallel I/Os when β ≤ BD and 9N/BD parallel I/Os when β > BD. Because of mesh and torus permutations, we prefer band sizes no greater than the track size.

Random permutations

In a random permutation on N elements, all N! target orderings are equally likely. Each element of the source vector is equally likely to end up in each position of the target vector. For a random permutation, the band size clearly does not matter.

Monotonic routes

Section 3.3 discussed monotonic routes under the machine model of Section 1.3, in which the band size is a track or less. Here we consider all band sizes.

If the band size is less than or equal to half the RAM size—β ≤ M/2—we can perform a monotonic route in only one pass over the source and target vectors. We partition RAM into a "source half" and a "target half," and we read or write M/2 records at a time from the source and target vectors as necessary. Since no arrows cross, groups of M/2 records, or half memoryloads, are read and written in order. We read each half memoryload of the source vector once, and we read and write each half memoryload of the target vector once. (We have to read the target vector to avoid overwriting data in positions that are not routed to.) When β ≤ M/2, therefore, we can perform a monotonic route with a source vector of length N_s and a target vector of length N_t using at most N_s/BD + 2N_t/BD parallel I/Os.

If the band size exceeds half the RAM size, then we cannot perform monotonic routes in just one pass over the source and target vectors. We shall see later in this section that for mesh and torus permutations we prefer band sizes less than the track size, so for monotonic routes we need not concern ourselves with band sizes the size of RAM or greater. For permuting, we don't want band sizes that large.

BPC permutations

In a bit-permute/complement, or BPC permutation, we form the target index of source element X_i by permuting the bits of the (lg N)-bit binary representation of the index i according to a fixed permutation π: {0, 1, . . . , lg N − 1} → {0, 1, . . . , lg N − 1}. We then complement a fixed subset of the bits. Chapter 2 presents asymptotically optimal algorithms (with small constant factors) for BPC permutations on parallel disk systems.

The BPC algorithm assumes that the vector is stored in Vitter-Shriver layout, and the analysis depends on it. When the band size β is not equal to BD, Section 2.9 shows how to compute another bit permutation π̂: {0, 1, . . . , lg N − 1} → {0, 1, . . . , lg N − 1} such that the bit permutation actually performed by the BPC algorithm is π̂⁻¹ ∘ π ∘ π̂. (The algorithm can perform the bit permutation π if the vector is laid out with the Vitter-Shriver scheme.) The bit permutation π̂ depends on β and BD.

It turns out that no one band size is always best for BPC permutations. We show this property by exhibiting a bit permutation π̂ and two bit permutations π_1 and π_2 such that the BPC algorithm uses fewer disk I/Os for π̂⁻¹ ∘ π_1 ∘ π̂ than for π_1 but it uses more disk I/Os for π̂⁻¹ ∘ π_2 ∘ π̂ than for π_2. For these bit permutations, we let N = 2^10, B = 2^6, D = 2, P = 2^2, M = 2^8, and β = 2^3. Applying the scheme of Section 2.9, we have the following bit permutation π̂:²

    j       0  1  2  3  4  5  6  7  8  9
    π̂(j)    0  3  4  5  6  1  2  7  8  9

The bit permutation π_1 is the bit-reversal permutation defined in Section 2.1. We then have the following bit permutations:

    j                     0  1  2  3  4  5  6  7  8  9
    π_1(j)                9  8  7  6  5  4  3  2  1  0
    (π̂⁻¹ ∘ π_1 ∘ π̂)(j)    9  4  3  2  1  8  7  6  5  0

In this example, ρ(π_1) = 4 and ρ(π̂⁻¹ ∘ π_1 ∘ π̂) = 2. By Theorem 2.8 (page 44), the upper bound for the BPC algorithm on π_1 is 5 passes and for π̂⁻¹ ∘ π_1 ∘ π̂ it is only 3 passes.

The bit permutation π_2 defines the transpose of a 256 × 4 matrix, so that we have the following bit permutations:

    j                     0  1  2  3  4  5  6  7  8  9
    π_2(j)                8  9  0  1  2  3  4  5  6  7
    (π̂⁻¹ ∘ π_2 ∘ π̂)(j)    8  5  6  1  2  9  0  3  4  7

Here, ρ(π_2) = 2 and ρ(π̂⁻¹ ∘ π_2 ∘ π̂) = 3. By Theorem 2.8, the upper bound for the BPC algorithm on π_2 is 3 passes but for π̂⁻¹ ∘ π_2 ∘ π̂ it is 5 passes.

Thus, we cannot show that any one band size is always best for BPC permutations.
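The two examples above can be reproduced with the short sketch below, which is not from the thesis. It composes the listed permutations and computes cross-ranks, taking ρ as the larger of the (lg B)- and (lg M)-cross-ranks, an assumption consistent with equation (2.6) and with the values quoted above; lg B = 6 and lg M = 8 for these parameters.

    # A bit permutation is a list pi with pi[j] = target position of source bit j.
    def compose(f, g):                      # (f o g)(j) = f(g(j))
        return [f[g[j]] for j in range(len(g))]

    def inverse(f):
        inv = [0] * len(f)
        for j, t in enumerate(f):
            inv[t] = j
        return inv

    def cross_rank(k, pi):
        return sum(1 for p, q in enumerate(pi) if p >= k and q < k)

    def rho(pi, lgB=6, lgM=8):
        return max(cross_rank(lgB, pi), cross_rank(lgM, pi))

    pi_hat = [0, 3, 4, 5, 6, 1, 2, 7, 8, 9]
    pi_1 = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
    pi_2 = [8, 9, 0, 1, 2, 3, 4, 5, 6, 7]
    for pi in (pi_1, pi_2):
        conj = compose(inverse(pi_hat), compose(pi, pi_hat))
        print(rho(pi), rho(conj))           # prints 4 2, then 2 3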

Mesh and torus permutations

The final class of permutations we consider consists of mesh and torus permutations on grids, each of whose dimensions is a power of 2. Section 3.3 defined these permutations. We shall show that regardless of the number of grid dimensions, we can perform mesh and torus permutations efficiently. The performance is good no matter what the band size, but it is best for band sizes no greater than the track size, i.e., for β ≤ BD.

²The bit permutation π̂ here is the mapping π_β of Section 2.9.

We will show that in a mesh or torus permutation, for each track in the source vector, the elements of that track map to a small constant number of tracks in the target vector. This property is independent of the number of grid dimensions. The following lemma shows why this property is useful.

Lemma 4.1 If the elements of each source-vector track map to at most k target tracks for a given permutation and (k + 1)BD ≤ M, then the permutation can be performed with at most (2k + 1)N/BD parallel I/Os.

Proof: We perform the permutation as follows. Read into RAM a source-vector track and the k target-vector tracks it maps to, using k + 1 parallel reads. We can do so if we have room in RAM for these k + 1 tracks, i.e., as long as (k + 1)BD ≤ M. Move the source-vector elements into the appropriate locations in the track images of the target vector. Then write out the k target-vector tracks. We perform at most 2k + 1 parallel I/Os for this source-vector track. Repeat this process for each of the N/BD source-vector tracks, for a total of (2k + 1)N/BD parallel I/Os.
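The counting argument in this proof can be illustrated with the following hedged in-memory sketch, which simply tallies, for each source track, one read for the track itself plus one read and one write per distinct target track it touches; RAM capacity and the actual disk layout are not modeled.

    # targets[i] is the target index of source element i; BD is the track size.
    def count_track_ios(targets, BD):
        N = len(targets)
        ios = 0
        for t in range(N // BD):
            touched = {targets[i] // BD for i in range(t * BD, (t + 1) * BD)}
            ios += 1 + 2 * len(touched)
        return ios

    # If every source track touches at most k target tracks, count_track_ios
    # returns at most (2k + 1) * N / BD.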

We shall assume in the remainder of this section that each grid dimension is a power of 2 and that a vector representing a grid is stored in row-major order with 0-origin indexing in each dimension. Indices vary most rapidly in dimension d and least rapidly in dimension 1. For a 3-dimensional mesh, for example, the element in grid position (p_1, p_2, p_3) appears in index m_2 m_3 p_1 + m_3 p_2 + p_3 of the vector. Thus, we can partition the bits of each element's index as shown in Figure 4.5: the least significant lg m_d bits give the element's position in dimension d, the next most significant lg m_{d−1} give the position in dimension d − 1, and so on, up to the most significant lg m_1 bits, which give the position in dimension 1.

Performing a mesh or torus permutation entails adding an offset o_i to the original position p_i in each dimension i. We can place each offset o_i into a (lg m_i)-bit field of a (lg N)-bit offset "word" that we add to each source position to compute the corresponding target position.

Figure 4.5 shows that we treat the lg m_i bits of each dimension i independently.

                       dimension 1    dimension 2   dimension 3    dimension 4
                       (m_1 = 16)     (m_2 = 8)     (m_3 = 32)     (m_4 = 4)
    source position    1100 = 12      010 = 2       11100 = 28     11 = 3
    offset             1010 = 10      011 = 3       10110 = 22     01 = 1
    target position    0110 = 6       101 = 5       10010 = 18     00 = 0

Figure 4.5: Partitioning the bits of an element's index to determine its grid position. The least significant bits are on the right. This example shows d = 4 dimensions, with m_1 = 16, m_2 = 8, m_3 = 32, and m_4 = 4. For each dimension i, a group of lg m_i bits determines the position in dimension i. Adding the offset (10, 3, 22, 1) to the source position (12, 2, 28, 3) in a torus permutation yields the target position (6, 5, 18, 0). The bits of each dimension are treated independently.
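A minimal sketch of this per-dimension torus addition follows; it is not from the thesis. It decodes a row-major index into grid coordinates (dimension d varying fastest, as stated above), adds the offsets mod each dimension, and re-encodes.

    def torus_add(index, offsets, dims):
        coords = []
        for m in reversed(dims):              # decode, last dimension first
            coords.append(index % m)
            index //= m
        coords.reverse()
        coords = [(p + o) % m for p, o, m in zip(coords, offsets, dims)]
        out = 0
        for p, m in zip(coords, dims):        # re-encode in row-major order
            out = out * m + p
        return out

    # With dims (16, 8, 32, 4): position (12, 2, 28, 3) is index 12659, and adding
    # the offset (10, 3, 22, 1) yields index 6856, which is position (6, 5, 18, 0).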

We will use this way of partitioning index bits to prove that each source-vector track maps to few target-vector tracks. Before we do so, however, we need the following lemma, which will help us determine how adding an offset to a source position affects the bits corresponding to the track number in the resulting target position.

Lemma 4.2 Let r and s be any positive integers. Let a be any integer such that 0 ≤ a < 2^s, and let x_0 and y_0 be integers such that 0 ≤ x_0 ≤ y_0 < 2^r. Define x = a2^r + x_0 and y = a2^r + y_0. Consider any integer c such that 0 ≤ c < 2^{r+s}. Then either

    ⌊(x + c)/2^r⌋ mod 2^s = ⌊(y + c)/2^r⌋ mod 2^s ,    or
    ⌊(x + c)/2^r⌋ mod 2^s = (⌊(y + c)/2^r⌋ − 1) mod 2^s .

Proof: Let c = e2^r + c_0, where 0 ≤ e < 2^s and 0 ≤ c_0 < 2^r. Then

    ⌊(x + c)/2^r⌋ mod 2^s = ⌊((a2^r + x_0) + (e2^r + c_0))/2^r⌋ mod 2^s
                          = (a + e + ⌊(x_0 + c_0)/2^r⌋) mod 2^s

and

    ⌊(y + c)/2^r⌋ mod 2^s = ⌊((a2^r + y_0) + (e2^r + c_0))/2^r⌋ mod 2^s
                          = (a + e + ⌊(y_0 + c_0)/2^r⌋) mod 2^s .

Because 0 ≤ x_0 ≤ y_0 < 2^r, we have that 0 ≤ y_0 − x_0 < 2^r, which in turn implies that either

    ⌊(x_0 + c_0)/2^r⌋ = ⌊(y_0 + c_0)/2^r⌋    or    ⌊(x_0 + c_0)/2^r⌋ = ⌊(y_0 + c_0)/2^r⌋ − 1 .

Thus, we have either

    ⌊(x + c)/2^r⌋ mod 2^s = (a + e + ⌊(x_0 + c_0)/2^r⌋) mod 2^s
                          = (a + e + ⌊(y_0 + c_0)/2^r⌋) mod 2^s
                          = ⌊(y + c)/2^r⌋ mod 2^s

or

    ⌊(x + c)/2^r⌋ mod 2^s = (a + e + ⌊(x_0 + c_0)/2^r⌋) mod 2^s
                          = (a + e + ⌊(y_0 + c_0)/2^r⌋ − 1) mod 2^s
                          = (⌊(y + c)/2^r⌋ − 1) mod 2^s ,

which completes the proof.

Lemma 4.2 has the following interpretation. We treat x and y as (r + s)-bit integers whose most significant s bits are equal and whose least significant r bits may be unequal. Without loss of generality, we assume that x ≤ y. We add to both x and y the (r + s)-bit integer c, and we examine the most significant s bits of the results, viewed as s-bit integers. Then either these values are equal for x and y, or the value for y is one greater than the value for x.

We are now ready to prove that each source-vector track maps to few target-vector tracks.


Figure 4.6: Cases in the proof of Lemma 4.3. (a) When β ≤ BD, the track number is in bits lg BD through lg N − 1. By Lemma 4.2, no matter what offset o we add to the source index, the resulting track number is one of at most 2 values. (b) When β > BD, the track number is in bit positions lg(BD/P) through lg(β/P) − 1 and lg β through lg N − 1. By Lemma 4.2, no matter what offset o we add to the source index, the resulting value in each of these two fields is one of at most 2 values. The resulting track number is therefore one of at most 4 values.

Lemma 4.3 Consider any d-dimensional mesh or torus permutation on a vector laid out with a banded layout with band size β, where each dimension is a power of 2.

1. If β ≤ BD, then the elements of each source-vector track map to at most 2 target-vector tracks.

2. If β > BD, then the elements of each source-vector track map to at most 4 target-vector tracks.

Proof: The idea is to partition the bits of the index according to the schemes of Figures 4.4 and 4.5.

We first consider the case for β ≤ BD. As Figures 4.4(a) and 4.6(a) show, only the most significant lg(N/BD) bits give the track number. Consider two source indices x and y that are in the same source track, and without loss of generality, let x ≤ y. Since x and y are in the same source track, the most significant lg(N/BD) bits of x and y are equal; let us say that they give the binary representation of the integer a. In a mesh or torus permutation, we add the same offset, say o, to both x and y. We now apply Lemma 4.2, with r = lg BD, s = lg(N/BD), and c = o. By Lemma 4.2, if we examine the track-number bits of x + o and y + o, either they are equal or the track number for y + o is 1 greater than the track number for x + o. Because we chose x and y arbitrarily within the same track, we see that the target track numbers for any two source addresses within the same track differ by at most 1. Therefore, the elements of each source-vector track map to at most 2 target-vector tracks.

The case for β > BD is slightly more complicated. As Figures 4.4(b) and 4.6(b) show, the track-number field is split among two sets of bit positions: lg(BD/P) through lg(β/P) − 1 and lg β through lg N − 1. As we are about to see, the constant 2 for the β ≤ BD case becomes a 4 in the β > BD case because the track-number field is split.

As before, we consider two source indices x and y that are in the same source track, where without loss of generality we have x ≤ y. Because x and y are in the same source track, the lg(β/BD) bits in positions lg(BD/P) through lg(β/P) − 1 of x and y are equal; let us say that they give the binary representation of the integer a. Again we add the offset o to both x and y. To see the effect of this addition on the target-address bits in positions lg(BD/P) through lg(β/P) − 1, we apply Lemma 4.2, with r = lg(BD/P), s = lg(β/BD), and c = o mod (β/P). Again, we see that the target-address bits in positions lg(BD/P) through lg(β/P) − 1 are either equal or differ by at most 1. We can apply the same argument to the remaining track-number bits in positions lg β through lg N − 1 (here, r = lg β, s = lg(N/β), c = o, and a is the binary representation of source index bits lg β through lg N − 1) to conclude that the target-address bits in positions lg β through lg N − 1 are either equal or differ by at most 1. Relative to the target track of the smallest index in its source track, the target track mapped to by a given source index is therefore either the same or 1 greater in one or both track-number fields. Thus, each source index within a track maps to one of 4 possible target tracks.

The key to the proof of Lemma 4.3 is that the track-number bits in Figures 4.4 and 4.6 fall into just one (if β ≤ BD) or two (if β > BD) fields of bits. Each field of track bits doubles the number of target tracks that the elements of each source track can map to.

If the grid dimensions and band sizes match up just right, we can even lower the constants 2 and 4 in Lemma 4.3. In the β ≤ BD case, for example, suppose that the line between bit positions lg(N/BD) and lg(N/BD) − 1 is also the line between two dimensions. (That is, suppose that m_i m_{i+1} · · · m_d = BD for some dimension i.) Then the bits below position lg(N/BD) have no effect on the bits in positions lg(N/BD) through lg N − 1, and so any two source indices in the same source-vector track map to exactly the same target-vector track. In this case, we can reduce the constant 2 to just 1. Similarly, in the β > BD case, if a line between dimensions matches up with either the line between positions lg(BD/P) and lg(BD/P) − 1 or the line between positions lg β and lg β − 1, then we can reduce the constant 4 to just 2 or, if the right sides of both track-number fields match up with lines between dimensions, just 1.

Finally, we put the above lemmas together to conclude that small band sizes are better for mesh and torus permutations.

Theorem 4.4 Let N be a power of 2, and consider any d-dimensional mesh or torus permutation on an N-element vector laid out with a banded layout with band size β.

1. If β ≤ BD and 3BD ≤ M, then we can perform the permutation with at most 5N/BD parallel I/Os.

2. If β > BD and 5BD ≤ M, then we can perform the permutation with at most 9N/BD parallel I/Os.

Proof: The proof is a simple application of Lemmas 4.1 and 4.3. If β ≤ BD, we apply Lemma 4.1 with k = 2. If β > BD, we apply Lemma 4.1 with k = 4.

Thus, we can perform mesh and torus permutations efficiently regardless of the band size, but band sizes less than or equal to the track size are more efficient.

4.4 Scans and reduces with banded layout

This section presents an algorithm to perform scan operations on vectors with banded layout. It also derives a formula for the time to perform the scan. Section 4.5 analyzes this formula to determine the optimal band and I/O buffer sizes. This section also briefly looks at reduce operations, showing that the band size has no effect on their performance.


Figure 4.7: Processing a band when scanning. (a) The sum s of all prior bands is 100. (b) Reduce down the columns. (c) Perform a physical scan on the column sums and add s into the scan of the column sums. (d) Prepend the row produced in (c) to the band as an initial row, and then scan down the columns. (e) Compute a new sum s of bands for use by the next band.

The scan algorithm

We perform a scan operation on a banded vector band by band, making two passes over each band. The method we present is efficient for general band sizes, and it also yields efficient scan algorithms for the extreme cases of row-major and column-major layout. We assume that the parallel machine provides hardware to perform a physical scan operation in which each of the P processors contains one element. Figure 4.7 shows how we operate on an individual band. For each band except the first, we are given the sum s of all elements in prior bands, as shown in Figure 4.7(a). We do the following:

1. Figure 4.7(b): Reduce down the columns of the band. That is, step row by row through the values in each processor to produce the P column sums, one per processor.

2. Figure 4.7(c): Perform a physical scan on the column sums and, except for the first band, add to this scan result the sum s from the prior bands.

3. Figure 4.7(d): Prepend this set of P values to the band as an initial row and then scan down the columns.

4. Figure 4.7(e): For each band except the last, add the result in the last position of the band to the element that started there to compute a new sum s to pass to the next band.
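The following Python sketch carries out these four steps on a single band held in memory; the in-memory representation (a list of rows, one value per processor) and the modeling of the physical scan as an ordinary exclusive prefix over the column sums are our assumptions for illustration, not VM-DP's actual data structures. With the values of Figure 4.7 it reproduces the numbers shown there.

    def scan_band(band, s, op=lambda a, b: a + b, identity=0):
        """Exclusive scan of one band, given the sum s of all prior bands.

        `band` is a list of rows; each row holds one value per processor
        (column j belongs to processor j).  Returns (scanned_band, new_s).
        Illustrative sketch of the four steps in Figure 4.7.
        """
        P = len(band[0])

        # Step 1 (Figure 4.7(b)): reduce down the columns to get P column sums.
        col_sums = [identity] * P
        for row in band:
            col_sums = [op(cs, v) for cs, v in zip(col_sums, row)]

        # Step 2 (Figure 4.7(c)): physical (exclusive) scan of the column sums,
        # then add in the sum s of all prior bands.
        running, phys_scan = identity, []
        for cs in col_sums:
            phys_scan.append(op(s, running))
            running = op(running, cs)

        # Step 3 (Figure 4.7(d)): prepend the scanned row and scan down the columns.
        result, prev_row = [], phys_scan
        for row in band:
            result.append(prev_row)
            prev_row = [op(p, v) for p, v in zip(prev_row, row)]

        # Step 4 (Figure 4.7(e)): the new s is the value in the last position of
        # the band combined with the element that started there.
        new_s = op(result[-1][-1], band[-1][-1])
        return result, new_s

For example, scan_band([[1,5,9,13],[2,6,10,14],[3,7,11,15],[4,8,12,16]], 100) produces the rows shown in Figure 4.7(d) and the new sum 236 shown in Figure 4.7(e).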

Machine parameters

The formula we shall derive for the time to perform a scan operation will depend on three performance parameters for the machine:

1. A denotes the time to perform an arithmetic operation in parallel for all P processors.

2. S denotes the time to perform a physical scan operation across all P processors.

3. IO denotes the time to perform one parallel disk I/O operation.

For most machines, A ≪ S ≪ IO. Typical estimates for these parameters might be

A ≈ 40 ns ,
S ≈ 4 µs ,
IO ≈ 20 ms ,

so that S ≈ 100A and IO ≈ 5000S. Because the cost of performing disk I/O is so high, it is important to minimize the number of disk I/O operations.

Processing costs

To derive a formula for the scan time, we start by counting only the time for arithmetic and physical scans. Later, we shall include disk I/O time, which is more difficult to account for.

1. Because each band has β/P rows, each column sum requires β/P − 1 arithmetic operations. These operations are performed in parallel across the P processors. Multiplied by N/β bands, the column reduces total (N/P − N/β)A arithmetic time.

2. There is one physical scan per band, and one arithmetic operation per band (except for the first) to add the scan result to the sum s from prior bands. Multiplied by N/β bands, these operations take (N/β − 1)A + (N/β)S time.

3. Each scan down the columns takes β/P − 1 arithmetic operations per band. It is not β/P operations because we can form the first row of the band by copying the prepended row without performing any arithmetic. Multiplied by N/β bands, the column scans total (N/P − N/β)A arithmetic time.

4. We perform one arithmetic operation per band (except for the last) to create the new sum s, totaling (N/β − 1)A time.

Adding up these costs, we get a total processing time of

Tprocessing(β) = 2(N/P − 1)A + (N/β)S

for arithmetic and physical scan operations. In the absence of I/O costs, processing time strictly decreases as the band size β increases. When all data fits in RAM, therefore, column-major layout yields the fastest scan time since it uses the maximum band size.
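As a quick sanity check on this algebra, the following snippet (ours, with arbitrary sample parameter values) verifies that the four contributions listed above sum to the closed form:

    # Sanity check (ours) of the Tprocessing formula with arbitrary sample values.
    N, P, beta, A, S = 1 << 20, 64, 4096, 1.0, 2.0
    contributions = [
        (N / P - N / beta) * A,               # 1. column reduces
        (N / beta - 1) * A + (N / beta) * S,  # 2. physical scans and add-ins
        (N / P - N / beta) * A,               # 3. column scans
        (N / beta - 1) * A,                   # 4. new sums s
    ]
    assert abs(sum(contributions) - (2 * (N / P - 1) * A + (N / beta) * S)) < 1e-6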

Disk I/O costs

I/O costs generally dominate the cost of the scan computation, especially for large vectors. Before we can analyze the I/O costs, we need to examine how we organize RAM during the scan operation.

A disk read needs an area of RAM to read into, and a disk write needs an area of RAM to write from. We call this area the I/O buffer. It is large enough to hold F records, where F is a parameter whose value we will choose later to optimize performance. There are some restrictions on the I/O buffer size F. We require that F ≥ BD, so that the I/O buffer is at least one track in size, and that F ≤ M, so that the I/O buffer is no greater than the RAM size. These F records comprise F/BD track frames. Although the BD record locations in each track frame are consecutive, the F/BD track frames need not be.

As one can see from the above description of the scan algorithm, once data has been read into the I/O buffer, we can perform the column reduces and column scans entirely within the buffer. In other words, except for the physical scan and adding in the previous band sums (which involve only P records, independent of all other parameters), we can perform all the work of the scan within the I/O buffer. Therefore, we assume that the only portion of the vector that is in RAM at any one time is in the I/O buffer. If we need to hold more of the vector, we do so by increasing the size F of the buffer. This assumption may be a pessimistic one, since there might be tracks of the vector being scanned residing in the M − F records of RAM that are not allocated to the I/O buffer. A system could save the cost of some disk accesses by using the tracks already in RAM. To keep the analysis in Section 4.5 simple, however, we ignore such sections of the vector. Instead, we read each track into RAM each time it is needed and write each track out to disk once its scan values are computed.

If we are also running a demand paging system, allocation of the F-record I/O buffer may itself incur some I/O costs. These occur from two types of tracks in the RAM space allocated to the buffer:

1. Tracks in the buffer that have been changed since they were last brought into RAM. These tracks must be written out to disk before we read in tracks of the vector to be scanned.

2. Tracks in the buffer that are not needed for the scan but will be needed in a later operation. Losing these tracks from RAM incurs a cost later on.

We call either of these types of tracks penalty tracks. Unfortunately, we have no way to determine until run time how many penalty tracks there are. Generally speaking, the larger the I/O buffer, the more penalty tracks, because there are that many more chances for a given track image to be in the RAM space allocated to the I/O buffer.

We model the number of penalty tracks by a function φ(F), with the understanding that any such function provides only an approximation of reality. No matter what function φ(F) we use, any given scan operation may have more or fewer than φ(F) penalty tracks. We leave the exact form of φ(F) unspecified until we analyze the cost of the scan operation in Section 4.5. We will assume, however, that φ(F) is monotonically increasing in F. Whatever function φ(F) we use, allocating the I/O buffer incurs a cost of one disk access per penalty track. Accordingly, we set

Tbuffer(F ) = φ(F ) IO .

We are now ready to compute the I/O cost of a scan operation. The steps of the scan algorithm for each band incur the following I/O costs:

1. We read each track of each band into RAM once to perform the column reduces. There are β/BD tracks per band and N/β bands, for a total I/O time of (N/BD) IO. If β > F, we read the tracks of each band from back to front; we shall see in a moment why we do so. This order does not affect the computation of the column sums. If β ≤ F, it does not matter in what order we read each band's tracks.

2. The physical scan and addition of the prior band sums s require no additional I/O.

3. The number of I/Os per band for the scans down the columns depends on the relative sizes of β and F. If β ≤ F, then the entire band is already in RAM, and so no further reads are required. If β > F, then the first F/BD tracks of the band are in RAM because we read the tracks from back to front during the column reduces. The remaining β/BD − F/BD tracks are no longer in RAM and must be read again. In either case, all β/BD tracks are written out. The I/O time for the scans, summed over all N/β bands, is thus (N/BD) IO if β ≤ F and

   ((2β − F)/BD)(N/β) IO = (2 − F/β)(N/BD) IO > (N/BD) IO

if β > F .

4. Creating the new sum s requires no additional I/O.

Adding up these costs, we get a total I/O time of

   TI/O(β, F) = 2(N/BD) IO              if β ≤ F ,
   TI/O(β, F) = (3 − F/β)(N/BD) IO      if β > F ,

where P ≤ β ≤ N and BD ≤ F ≤ M.

We make some observations at this point. First, and perhaps most important, the value in the first case is never greater than the value in the second case, since β > F implies that 3 − F/β > 2. Second, when β = F, the two cases of this formula result in the same value, so we can say that the second case holds for β ≥ F, not just β > F. Third, when β ≤ F, the function TI/O(β, F) does not depend on the band size β. Fourth, when β > F, the value of TI/O(β, F) strictly increases with β and it strictly decreases with F.

Total scan costs

The total scan time is the sum of the individual costs:

Tscan(β, F ) = Tprocessing(β) + TI/O(β, F ) + Tbuffer(F ) . (4.1)

This cost function reflects the fact that we cannot begin processing a buffer of input values until it has been read into RAM, and we cannot begin writing out the scan results until they have been computed. Thus, we sum the costs Tprocessing(β) and TI/O(β, F ) rather than, say, taking the maximum of the two costs, which would model overlapping I/O and computation. Section 4.5 shows how to choose optimal values for the band size β and the I/O buffer size F under cost function (4.1).
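The cost function (4.1) is easy to tabulate. The sketch below (parameter names are ours) combines the processing, I/O, and buffer terms derived above, with the penalty-track function φ supplied as a parameter:

    def t_scan(beta, F, N, P, BD, M, A, S, IO, phi=lambda F: 0):
        """Total scan cost T_scan(beta, F) of equation (4.1), as a sketch.

        beta is the band size and F the I/O buffer size, both in records;
        phi(F) models the number of penalty tracks.  The formulas follow
        the derivation in the text; the names are illustrative.
        """
        assert P <= beta <= N and BD <= F <= M
        t_processing = 2 * (N / P - 1) * A + (N / beta) * S
        if beta <= F:
            t_io = 2 * (N / BD) * IO
        else:
            t_io = (3 - F / beta) * (N / BD) * IO
        t_buffer = phi(F) * IO
        return t_processing + t_io + t_buffer

With the parameter estimates given earlier (A ≈ 40 ns, S ≈ 4 µs, IO ≈ 20 ms) and no penalty tracks, this cost is smallest at β = F = M, which is what Theorem 4.5 in the next section proves.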

Reduce operations

The band size has no effect on the performance of reduce operations with banded layout. We can perform any reduce operation using an I/O buffer only one track in size.

When the operator ⊕ is commutative as well as associative, we can sum the elements in any order. Therefore, we can read them in any order, and so we can read the vector track by track. The band size does not matter.

If the operator ⊕ is not commutative, we have two cases. In either case, we compute the sum band by band, having added in the sum s of all previous bands. In the first case, the band size is less than or equal to the track size. Reading the vector track by track still allows us to compute the sums band by band, since each band fits within a track. In the second case, the band size exceeds the track size. To compute each band sum, we maintain P running sums, one for each column of the band. We keep these sums from track to track within the band, and we update them for each new track within the band. When we reach the end of a band, we sum the P column sums together to compute the new sum s of all bands and then initialize them to the identity for ⊕ before starting on the next band.

Thus we see that regardless of the band size and whether or not the operator ⊕ is commutative, we can perform any reduce operation in one read pass with the minimum I/O buffer size.
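A sketch of this one-pass reduce for a non-commutative (but associative) operator with band size larger than a track appears below. The track-by-track data layout shown is an assumption made for illustration, not VM-DP's actual in-RAM representation.

    from functools import reduce as fold

    def banded_reduce(bands, P, op, identity):
        """One-pass reduce of a banded vector for a non-commutative operator.

        `bands` yields the bands in order; each band is a sequence of tracks,
        and each track is a list of P chunks, chunk j holding the consecutive
        elements of column (processor) j that lie in that track.
        """
        s = identity
        for band in bands:
            col_sums = [identity] * P          # P running column sums
            for track in band:                 # one track in RAM at a time
                for j in range(P):
                    for x in track[j]:
                        col_sums[j] = op(col_sums[j], x)
            # Column j holds earlier elements than column j+1 under column-major
            # banded layout, so combine the column sums in order.
            band_sum = fold(op, col_sums, identity)
            s = op(s, band_sum)
        return s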

4.5 Choosing optimal band and I/O buffer sizes for scans

In this section, we analyze the cost function, equation (4.1), for the scan operation to determine the optimal band size β and I/O buffer size F. We shall study two different scenarios for the number of penalty tracks. In one, we assume that there are no penalty tracks; we shall conclude that optimal scan performance occurs when β = F = M, that is, when the band size and I/O buffer size are both equal to the RAM size. In the other scenario, we assume that the number of penalty tracks is proportional to the I/O buffer size; we shall conclude that optimal scan performance occurs when β = F. The exact values to which we should set β and F in this case depend on several of the other parameters. If a fraction α, where 0 ≤ α ≤ 1, of the tracks in the I/O buffer are penalty tracks, we should set β and F to a power of 2 that is near √(NBD S/(α IO)) and is between BD and M.


Figure 4.8: The domain of the band size β and the I/O buffer size F for the total scan cost Tscan(β, F ). The line β = F divides the two cases of the cost function.

Analysis for no penalty tracks

Under the assumption that there are no penalty tracks, we have φ(F ) = 0 for all I/O buffer sizes F . The following theorem shows that under the reasonable assumption that the time to perform a disk access exceeds the time to perform a physical scan, both the band size and I/O buffer size should equal the RAM size in the absence of penalty tracks.

Theorem 4.5 If there are no penalty tracks and IO > S, then Tscan(β, F ) is minimized when β = F = M.

Proof: When φ(F ) = 0, we have

   Tscan(β, F) = 2(N/P − 1)A + (N/β)S + 2(N/BD) IO              if β ≤ F ,
   Tscan(β, F) = 2(N/P − 1)A + (N/β)S + (3 − F/β)(N/BD) IO      if β ≥ F .

Figure 4.8 shows the domain of the function Tscan(β, F), including the line β = F, which divides the two cases. We determine the optimal values of β and F by applying standard calculus techniques. We treat the two regions in Figure 4.8, for β ≤ F and for β ≥ F, separately. If the interior of either one contains a local minimum, then both partial derivatives of Tscan (with respect to β and to F) must equal 0 at that point. We claim that neither region's interior contains a local minimum, because IO > S implies that ∂Tscan(β, F)/∂β is never 0 in the domain. We have

   ∂Tscan(β, F)/∂β = −NS/β²                        if β ≤ F ,
   ∂Tscan(β, F)/∂β = (N/β²)((F/BD) IO − S)         if β ≥ F .        (4.2)

Because β ≥ P > 0, this derivative is negative whenever β ≤ F. Moreover, because IO > S and F ≥ BD, the coefficient ((F/BD) IO − S) is positive, and so the derivative (4.2) is nonzero whenever β ≥ F as well.

Having ruled out the interior of either region for the minimum value, we now examine the region boundaries. We first look at the boundary for β ≤ F. In this case, Tscan(β, F) is a decreasing function of β and does not depend on F. Therefore, the cost is minimized by choosing the rightmost possible value of β on the boundary, namely β = F = M.

Now we look at the boundary for β ≥ F. Because ∂Tscan(β, F)/∂β is never 0, the minimum cannot be on either of the horizontal boundaries of the region except for the corners. Also, we have that ∂Tscan(β, F)/∂F = −(N/(βBD)) IO < 0, and so the minimum cannot be on the right boundary except for the corners. That leaves only the corners and the boundary marked by BD ≤ β = F ≤ M. We examine the corners and this boundary one by one.

• Tscan(M, M) = 2(N/P − 1)A + (N/M)S + 2(N/BD) IO. This value turns out to be the minimum.

• Tscan(N, M) = 2(N/P − 1)A + S + (3 − M/N)(N/BD) IO. Simple manipulation shows that this value is less than Tscan(M, M) if and only if IO/S < BD/M. Since BD ≤ M and IO > S, we conclude that Tscan(N, M) > Tscan(M, M).

• Tscan(N, BD) = 2(N/P − 1)A + S + (3N/BD − 1) IO. Simple manipulation shows that this value is less than Tscan(M, M) if and only if IO/S < (BD/M)((N − M)/(N − BD)). Because BD ≤ M, both of the fractions on the right side are at most 1, so we conclude that Tscan(N, BD) > Tscan(M, M).

• Tscan(BD, BD) = 2(N/P − 1)A + (N/BD)S + 2(N/BD) IO. Since BD ≤ M, this value is never less than Tscan(M, M).

• On the boundary marked by BD ≤ β = F ≤ M, we have Tscan(β, F) = 2(N/P − 1)A + (N/β)S + 2(N/BD) IO, which is greater than Tscan(M, M) everywhere except for β = M.

We conclude that the function Tscan(β, F ) is minimized at the point β = F = M.

It should come as no surprise that when there are no penalty tracks, we want large band and I/O buffer sizes. Because the number of physical scans decreases as the band size increases, we want a large band size. Because there is no cost to having a large I/O buffer size, we want a large one in order to accommodate a large band size. Theorem 4.5 serves to formalize these notions, and it also proves that increasing the band size beyond the buffer size incurs I/O costs that we do not recoup by reducing the number of physical scans.

Analysis for penalty tracks proportional to I/O buffer size

Now we look at a scenario in which the number of penalty tracks is proportional to the I/O buffer size. We set

   φ(F) = αF/BD ,

where α is a fraction in the range 0 ≤ α ≤ 1. This function models the situation in which each track frame originally in the RAM space occupied by the I/O buffer is a penalty track with probability α. The following theorem shows that in this case, we always want the band size to equal the I/O buffer size, but the exact value we want depends on several parameters.

Theorem 4.6 If IO > S and the number of penalty tracks is given by φ(F) = αF/BD, where 0 ≤ α ≤ 1, then Tscan(β, F) is minimized only if β = F. Furthermore, if we define

   t∗ = √(NBD S / (α IO)) ,

then the following hold:

1. If BD ≤ t∗ ≤ M, then Tscan(β, F) is minimized by choosing β and F to be t∗ if it is a power of 2 and otherwise choosing β and F to be one of the two closest powers of 2 to t∗.

2. If t∗ ≤ BD, then Tscan(β, F) is minimized by choosing β = F = BD.

3. If t∗ ≥ M, then Tscan(β, F) is minimized by choosing β = F = M.

Proof: When φ(F ) = αF/BD, we have

   Tscan(β, F) = 2(N/P − 1)A + (N/β)S + 2(N/BD) IO + (αF/BD) IO              if β ≤ F ,
   Tscan(β, F) = 2(N/P − 1)A + (N/β)S + (3 − F/β)(N/BD) IO + (αF/BD) IO      if β ≥ F .

As in the proof of Theorem 4.5, Figure 4.8 shows the domain of the function Tscan(β, F), including the line β = F, which divides the two cases. We again apply standard calculus techniques, treating the two regions separately.

Because φ(F) does not depend on β, the partial derivative ∂Tscan(β, F)/∂β is again given by equation (4.2). As in Theorem 4.5, we rule out the interior of both regions for the minimum value because this derivative is never 0 for positive values of β.

Again, we check the boundaries and corners of the two regions, starting with the region β ≤ F. As in Theorem 4.5, ∂Tscan(β, F)/∂β < 0 in this region, and so the minimum cannot be on either of the horizontal boundaries of the region except for the corners. For the left boundary, we have

∂Tscan(β, F)/∂F = (α/BD) IO, which equals 0 only if α = 0; in this case, Tscan(β, F) does not depend on F in the region β ≤ F, and because Tscan(β, F) is a decreasing function of β, the minimum cannot be on the left boundary. We are left with only the corners and the boundary marked by BD ≤ β = F ≤ M, which we examine one by one.

• Tscan(BD, BD) = 2(N/P − 1)A + (N/BD)S + 2(N/BD) IO + α IO.

• Tscan(M, M) = 2(N/P − 1)A + (N/M)S + 2(N/BD) IO + (αM/BD) IO. Simple manipulation shows that this value is less than Tscan(BD, BD) if and only if IO/S < N/αM.

• Tscan(P, BD) = 2(N/P − 1)A + (N/P)S + 2(N/BD) IO + α IO. Because BD ≥ P, this value is never less than Tscan(BD, BD).

• Tscan(P, M) = 2(N/P − 1)A + (N/P)S + 2(N/BD) IO + (αM/BD) IO. Because M ≥ P, this value is never less than Tscan(BD, BD).

• On the boundary marked by BD ≤ β = F ≤ M, we parameterize the function and simplify it as

   Tscan(t) = a/t + bt + c ,

where a = NS, b = (α/BD) IO, and c = 2(N/P − 1)A + 2(N/BD) IO. We take the derivative with respect to t:

   d Tscan(t)/dt = −a/t² + b .

If there is a minimum along this boundary, it occurs where d Tscan(t)/dt = 0, or at

   t∗ = √(a/b) .        (4.3)

This extreme point is in fact a local minimum of Tscan(t), since its second derivative is

   d² Tscan(t)/dt² = 2a/t³ ,

which is positive for all positive values of t. Observe that at the point t∗, the terms a/t and bt of Tscan(t) both equal √(ab), so they balance each other out. The value

   t∗ = √(NBD S / (α IO))

in the theorem statement comes from substituting a = NS and b = (α/BD) IO in equation (4.3).

If t∗ < BD or t∗ > M, then we cannot choose β = F = t∗. Similarly, if t∗ is not a power of 2, we cannot choose β = F = t∗. Observe, however, that because Tscan(t) has its only extreme point at t∗, it decreases for t < t∗ and increases for t > t∗. If t∗ < BD, therefore, we should choose β = F = BD. If t∗ > M, we should choose β = F = M. If BD ≤ t∗ ≤ M but t∗ is not a power of 2, we choose one of the powers of 2 immediately above or below t∗, whichever yields a smaller value of Tscan(t). This completes the analysis of the boundaries and corners of the region β ≤ F.

Now we analyze the boundaries and corners of β ≥ F. We have already analyzed the boundary and corners for which β = F, so we only have to show that the other three boundaries and two corners have higher costs. As before, we rule out the horizontal boundaries because

∂Tscan(β, F)/∂β is never 0 along them. For the right boundary, we observe that

   ∂Tscan(β, F)/∂F = ((α/BD) − (N/(βBD))) IO ,

which equals 0 if and only if α = 1 and β = N. In this case, however, we have Tscan(β, F) = 2(N/P − 1)A + S + 3(N/BD) IO. Simple manipulation shows that this value is less than Tscan(BD, BD) if and only if IO/S < (N − BD)/N, which is not true since IO > S and (N − BD)/N < 1. Thus, we rule out the right boundary. All that remain are the two rightmost corner points.

• Tscan(N, BD) = 2(N/P − 1)A + S + 3(N/BD) IO + (α − 1) IO. Simple manipulation shows that this value is less than Tscan(BD, BD) if and only if IO/S < 1, which is not true.

• Tscan(N, M) = 2(N/P − 1)A + S + 3(N/BD) IO + (α − 1)(M/BD) IO. Simple manipulation shows that this value is less than Tscan(M, M) if and only if IO/S < BD/M, which is not true since BD ≤ M.

By exhaustive analysis, therefore, we have proven the theorem.

Where does the strange formula for t∗ in the statement of Theorem 4.6 come from? It was a subtle point in the middle of the proof, but t∗ is the value for β and F at which the cost of the physical scans equals the cost of buffer allocation. Note that Theorem 4.5 is a special case of Theorem 4.6 in which α = 0. We then have t∗ = ∞ > M, and so Tscan(β, F) is minimized by choosing β = F = M.
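The following sketch chooses β = F as Theorem 4.6 prescribes. It reuses the t_scan function sketched after equation (4.1); the parameter names and the helper itself are ours, not part of VM-DP.

    import math

    def optimal_band_and_buffer(N, P, BD, M, A, S, IO, alpha):
        """Choose beta = F according to Theorem 4.6 (illustrative sketch).

        alpha is the assumed fraction of penalty tracks in the I/O buffer.
        Returns the power-of-2 value used for both the band size and the
        I/O buffer size.
        """
        if alpha == 0:
            return M                       # Theorem 4.5: no penalty tracks
        t_star = math.sqrt(N * BD * S / (alpha * IO))
        if t_star <= BD:
            return BD
        if t_star >= M:
            return M
        # Otherwise pick the power of 2 just below or just above t*, whichever
        # gives the smaller total scan cost.
        lo = max(2 ** int(math.floor(math.log2(t_star))), BD)
        hi = min(2 * lo, M)
        cost = lambda t: t_scan(t, t, N, P, BD, M, A, S, IO,
                                phi=lambda F: alpha * F / BD)
        return lo if cost(lo) <= cost(hi) else hi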

4.6 Band sizes for elementwise operations

Elementwise operations are often the most frequent operations in a data-parallel program. When all record sizes are the same, elementwise operations have no effect on our choice of band size. This section looks at elementwise operations when record sizes differ, arguing in favor of small band sizes.

Logical vs. physical sizes

So far, we have been purposely vague as to what constitutes a record, but now let us discuss records in more detail. A record consists of some number of bytes, and different vectors may have elements of different record sizes. In the VM-DP system, for example, the Vcode interpreter supports two record sizes: 4-byte integers and 8-byte double-precision floats. Sizes of disk blocks and RAM are actually measured in bytes, not records. When we say that RAM holds M records, we are dealing in an abstraction. If the capacity of RAM is M 8-byte floats, it is also 2M 4-byte integers. Similarly, a track that holds BD 8-byte floats is also capable of holding 2BD 4-byte integers. We call the size of a memory unit measured in bytes its physical size, and we call its size in terms of records of a given size its logical size.

Mapping elements to processors

Should a system with multiple record sizes also have multiple band sizes? The analyses of the previous sections suggest that we should choose band sizes based on the logical track size BD or the logical RAM size M. These logical sizes depend on their physical sizes and the record size. If we choose the band size to equal the physical track size, for example, we would choose one band size for 8-byte floats and a band size twice as large for 4-byte integers, since twice as many 4-byte records as 8-byte records fit into the same physical track size.

The band size of a vector determines how its elements are mapped to processors. If the band size is β, element Xi maps to processor P⌊(i mod β)P/β⌋. If the band size is determined by the record size, the mapping of elements to processors depends on the record size.


Figure 4.9: When differing record sizes cause band sizes to differ, elements with equal indices do not always map to the same processor. (a) The mapping of a 32-element vector of 8-byte values with B8 = 2 records per block and D = P = 8. The band size is β8 = 16. (b) The mapping of a 32-element vector of 4-byte values with B4 = 4 records per block and band size β4 = 32.


Figure 4.10: How the VM-DP system maps vectors to P = 8 processors with a physical track size of 128 bytes. (a) A 32-element vector of 8-byte floats. The band size is β = 128/8 = 16. (b) A 32-element vector of 4-byte integers, also mapped with band size β = 16. Corresponding elements of either type of vector map to the same processor.

The problem with this scheme is that we would like the mapping of element indices to processors to be independent of record size. To see why, consider an elementwise operation with vectors of different record sizes, such as that of converting a vector of 4-byte integers to a vector of 8-byte double-precision values. In an elementwise operation, for each index i, the operand elements with index i and the result element with index i must map to the same processor. This requirement is not always met when the mapping of elements to processors depends on the record size.

As an example, Figure 4.9 shows how two 32-element vectors of 4- and 8-byte elements map to processors when we base band sizes on physical sizes. Here, we have D = P = 8, and each physical block holds 16 bytes. For 8-byte elements the logical block size is B8 = 2 elements per block, and for 4-byte elements the logical block size is B4 = 4 elements per block. If we choose the band size to be the number of records that fill a physical track, we get different band sizes for the two element sizes: β8 = B8D = 16 and β4 = B4D = 32. The figure clearly shows that many indices do not map to the same processor for the two vectors. The first such index is 2, which maps to processor P0 in one vector and to P1 in the other.

Using the smallest band size for all vectors

Figure 4.10 shows the solution that we adopted in the VM-DP system. We use the smallest band size for all vectors. That is, we consider a vector whose elements have the largest record size over all vectors that we will use. We determine β such that if this vector has a band size of β, then each of its bands occupies one physical track. We use this band size β for all vectors, regardless of their record size. All bands are less than or equal to a physical track, and the mapping of elements to processors is independent of each vector’s record size.
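A few lines of Python make the contrast between Figures 4.9 and 4.10 concrete; the function below is simply the mapping formula ⌊(i mod β)P/β⌋ from the previous subsection, and the helper name is ours.

    def processor_of(i, beta, P):
        """Processor that holds element X_i under banded layout with band size beta."""
        return ((i % beta) * P) // beta

    P = 8
    beta_8byte, beta_4byte = 16, 32   # physical-track-based band sizes of Figure 4.9
    print(processor_of(2, beta_8byte, P), processor_of(2, beta_4byte, P))   # 1 0: mismatch
    beta_common = 16                  # smallest band size for all vectors, as in Figure 4.10
    print(processor_of(2, beta_common, P), processor_of(2, beta_common, P)) # 1 1: same processor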

4.7 Conclusions

We have seen that the band size used to lay out a vector on a parallel disk system in a data-parallel machine affects the performance of several data-parallel operations. There is no one best band size overall. For permuting, we prefer band sizes that are a track or less. For scans, the optimal band size can be anywhere from a track to the size of RAM. For reduces, the band size does not matter. For elementwise operations, the band size can depend on the record size and the physical sizes of the system.

How then should one choose band sizes in an actual data-parallel system? In the VM-DP system, we chose the band size to be the number of 8-byte floats that fit in a physical track, these being the largest record type. The other record type, 4-byte integers, has the same band size; hence each band of 4-byte integers occupies half a physical track. We chose one track for the band size because it yields the best performance for permuting and overall good (if not optimal) performance for scans. Moreover, it made the programming task easier to have each parallel I/O always read or write at least one entire band.

It is important to put this work into perspective. This chapter has been mostly about saving constant factors. We can perform mesh and torus permutations in Θ(N/BD) parallel I/Os regardless of the band size, but the constant factor is better for small band sizes. In a similar vein, the dominant performance difference for scans between good and poor choices of band and buffer sizes is less than N/BD I/Os. Although differences due to constant factors can add up to a lot of time for huge vectors, it is likely to be more fruitful for us to channel our efforts into algorithms for operations in which we can realize asymptotic savings. The best examples of such algorithms are those for permuting, as in Chapters 2 and 3.

Chapter 5

Paging and Addressing in the VM-DP System

Chapters 2 and 3 looked at algorithms to perform data-parallel operations in which the virtual- memory system choreographed the patterns of disk accesses. This chapter focuses on a different aspect of a virtual-memory system, the one that perhaps most often comes to mind: the demand-paging system. Specifically, this chapter presents the design of the demand-paging portion of the VM-DP system. We implemented three different paging schemes and collected empirical data on their I/O performance. One of the schemes was straight LRU (least-recently used) paging, in which all tracks are treated equally. The other two schemes treat tracks differently according to the sizes of vectors they contain. Our empirical tests yielded two somewhat surprising results:

• The observed I/O counts for our test suite were roughly equal under all three schemes. Moreover, as problem sizes increased, performance differences among the three schemes decreased.

• Overall, the best scheme of the three was the simplest one: straight LRU paging. The other two schemes, which attempt to account for vast differences in vector sizes, were not quite as good overall.

This chapter also discusses some of the addressing and implementation issues that arose in the VM-DP system.

This chapter represents joint work with Lars Bader.


Measuring sizes in bytes

In this chapter, we need to make a slight modification to the machine model that we defined in Section 1.3 and used in Chapters 1–4. Rather than measure block and RAM sizes in records, this chapter measures them in bytes. In this chapter, M denotes the number of bytes of RAM, and B denotes the number of bytes per disk block. Accordingly, the track size BD is also measured in bytes. We use bytes rather than records because Vcode, the stack-based, intermediate, data-parallel language that is interpreted by VM-DP, supports two different record sizes: 4-byte integers and 8-byte floats. Normalizing all sizes to bytes makes our descriptions independent of record sizes.

When does VM-DP use demand paging?

All operations that access vectors use VM-DP's demand-paging system. Operations other than the permuting operations use it for all their vector accesses. These operations include all elementwise operations, scans, and reduces. In addition, Vcode includes six different operations that permute according to target addresses. These permuting operations go through the paging system at least part of the time; they disable it only during external radix sort1 (see Section 3.1) and the BPC detection and permutation code (see Section 3.4). Demand paging is enabled during the pack operation, which is a type of monotonic route (see Section 3.3).

Outline of this chapter

Since the demand-paging system is used so heavily by VM-DP, its I/O behavior is crucial to overall system performance. To minimize the number of disk accesses, we explored three design alternatives. Section 5.1 presents these three designs and their I/O performance on our test suite. The remaining sections of this chapter focus on details of VM-DP that relate to its paging system. Section 5.2 examines how vectors are mapped to pages and the effect on address-space allocation. Section 5.3 discusses further implementation issues in the VM-DP paging system.

1Actually, the demand-paging system is enabled during the pass within external radix sort that creates a census of how many records will end up in each bucket in each pass.

5.1 Performance of paging schemes in VM-DP

A paging-system designer has many options. This section presents the three paging schemes we implemented for VM-DP and summarizes their performance on our test suite. One scheme is essentially straight LRU paging, and the other two partition RAM according to vector sizes. As we shall see, there is not much difference in I/O performance among the three schemes, and the scheme that had the best I/O performance overall was also the simplest: straight LRU paging. As we mentioned in Section 1.5, VM-DP is designed for amounts of data in excess of the RAM size, but we also want performance to be good when all the data fits in RAM. In other words, computations that do not really need VM should not pay a high price for its presence in the system.

Paging and LRU replacement

The VM-DP system uses a paging system similar to the approach of using both segmentation and paging, as described by Denning [Den70] and used in the Multics system [BCD72, DD68]. In the VM-DP system, vectors serve as the equivalent of segments in a system such as Multics. Just as Multics addresses consist of a segment number and an index within the segment, we can address individual vector elements by a vector address and an index within the vector. Because vectors, like Multics segments, can take on a range of sizes from very small to larger than RAM, it is infeasible to require that each segment is either entirely in RAM or entirely out of RAM, as is the case in a pure segmented virtual-memory system. Instead, we divide the address space into fixed-size pages, we partition the RAM into page frames of the same size, and we swap pages in and out of page frames as necessary.

We chose a track of BD bytes as our page size. This size is the minimum feasible size, since it represents the smallest unit of parallel I/O. We decided to use a size no larger because, as we shall see, smaller page sizes lead to less wasted space and more flexibility in the execution of the page-replacement policy. Because any given vector may be smaller than, the same size as, or larger than a page, any given page may contain several vectors, exactly one vector, or just a portion of a vector.

Having decided that we were going to use paging, the next question was what page-replacement strategy to use when performing demand paging. In demand paging, we remove a page from RAM only when we need to find a free page frame and all page frames are in use. A page fault occurs when we need to access a page that is not currently in RAM. Least-recently used, or LRU, replacement is a provably good strategy. In LRU replacement, we maintain the order in which page frames in RAM have been accessed for either reading or writing, and we always replace the frame whose most recent access (for either reading or writing) was the longest ago.

Sleator and Tarjan [ST85] proved that LRU is “competitive” with any on-line page-replacement policy, that is, any policy that foregoes knowledge about future accesses. They showed that in the worst case, no on-line replacement policy can guarantee a number of page faults within even a constant factor of the number produced by an optimal off-line policy (i.e., one that uses foreknowledge about the sequence of page accesses). Moreover, they also proved that overall, LRU is as good as any other on-line policy. There are particular page-access sequences for which another on-line policy may produce fewer page faults than LRU, but even in the worst case, LRU comes within a constant factor of the number of page faults made by an optimal off-line policy whose RAM is smaller by a constant factor.

We use a true LRU policy rather than an approximation such as a “use bit” [PH90, p. 436]. The overhead of using true LRU is spread over the many elements involved in a single access.
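The sketch below counts page faults under true LRU replacement. It is an illustration of the policy only, not the VM-DP implementation, which keeps the most-recently-used order in a doubly linked list of page frames (see Section 5.3).

    from collections import OrderedDict

    def count_page_faults(page_accesses, num_frames):
        """Count faults under true LRU replacement (illustrative sketch).

        `page_accesses` is a sequence of page identifiers (e.g., track
        addresses); `num_frames` is the number of page frames in RAM.
        """
        frames = OrderedDict()              # keys ordered from LRU to MRU
        faults = 0
        for page in page_accesses:
            if page in frames:
                frames.move_to_end(page)        # accessed: now most recently used
            else:
                faults += 1
                if len(frames) == num_frames:
                    frames.popitem(last=False)  # evict the least recently used page
                frames[page] = True
        return faults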

Small and large vectors

VM-DP classifies vectors according to size when allocating them to tracks. A small vector requires one track (BD bytes) or less, and a large vector requires more than one track. Each track is in one of three states: completely unallocated, containing small vectors, or containing part of a large vector. A track that contains one or more small vectors contains no part of any large vector, and a track that contains part of a large vector contains only that part of the large vector. In other words, no track contains both large and small vectors. This distinction can help our paging scheme and, as we shall see in Section 5.2, it also helps in vector allocation.

VM-DP paging schemes

We now present the three paging schemes that we investigated in VM-DP. Each scheme parti- tions RAM into one or two portions. Tracks are classified according to the sizes of vectors they contain. Each portion of RAM contains page frames for one class of tracks. Page replacement within each portion is performed on a strict LRU basis.

• In the one-space scheme, all of RAM is one portion, containing M/BD page frames, and there is only one classification of track. In other words, the one-space scheme is straight LRU paging. It is the simplest of the three schemes and, according to our empirical measurements, it has the best I/O performance overall.

• The two-space scheme partitions RAM into two equal-sized portions. The small-vector space contains M/2BD page frames and is dedicated only to tracks containing small vectors. The large-vector space also contains M/2BD page frames, and it is dedicated only to tracks containing parts of large vectors. Because no track contains both large and small vectors, each allocated track is eligible to belong to exactly one of these spaces.

• In the ω-RAM scheme, we partition RAM into two unequal portions. The ω-RAM space contains five page frames and is dedicated only to tracks containing parts of vectors that are at least M/4 bytes in size. The O-RAM space contains the remaining M/BD − 5 page frames and is dedicated only to small-vector tracks or to large-vector tracks containing parts of vectors that are less than M/4 bytes in size.2

2The names “ω-RAM” and “O-RAM” were chosen by analogy to asymptotic notation. A function g(n) is ω(f(n)) if 0 ≤ cf(n) < g(n) for every constant c > 0 and sufficiently large n. A function g(n) is O(f(n)) if 0 ≤ g(n) ≤ cf(n) for some constant c > 0 and sufficiently large n. In our discussions, we called a vector “ω-RAM” if its size is more than a constant fraction (1/4, as it turned out) of the RAM size and “O-RAM” if its size is at most this constant fraction of RAM.

There are other schemes that we could have investigated, but we chose the above schemes because they are fairly simple and also for the following reasons. The one-space scheme was an obvious candidate because it is straight LRU.

The ω-RAM scheme was an outgrowth of the following observation. Suppose we use straight LRU paging and we perform a simple elementwise operation, say addition of two vectors, on very large vectors, say over a third of the RAM size each. By the time we have the last track of each of the two source vectors and the one result vector in RAM, the first track of each has been thrown out of RAM. If we access these vectors again (and we are likely to access the result vector soon, if not also one of the operands), each track will produce a fault. Not only does straight LRU paging fail to improve the I/O behavior of accessing large vectors, but any small vectors that were in RAM prior to the first elementwise operation have been thrown out as well. Since LRU paging does not help in accessing very large vectors, by separating page frames for these vectors from those for smaller vectors and also limiting the number of page frames for very large vectors, we prevent operations on very large vectors from causing smaller vectors to be ejected from RAM for no gain. We chose to use five page frames for very large vectors because one CVL function that VM-DP supports uses five vectors and no CVL function uses more. We chose the size M/4 as the smallest ω-RAM vector because most CVL functions use at most four vectors, and so if all of them are paged in, they exhaust all of RAM.

The rationale behind the two-space scheme is similar to that of the ω-RAM scheme, but with a view toward preventing intermediate vectors (larger than a track but at most M/4 bytes) from interfering with small vectors. That is, if a computation processes small vectors, then intermediate vectors, and then small vectors, the processing of intermediate vectors should not cause all the small vectors to be thrown out of RAM.

Each partition in each scheme uses a straight LRU replacement policy for the reasons we mentioned above. An LRU policy does not really improve the I/O performance of the ω-RAM space in the ω-RAM scheme, but neither does it hurt, and it was easily implemented because all the other partitions we implemented already used it.
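As a summary of the three schemes, the sketch below classifies a track into the RAM partition whose LRU list it competes in; the labels, parameter names, and threshold test are ours, chosen to mirror the descriptions above.

    def ram_partition(scheme, track_vector_size, M, track_size):
        """Which RAM partition a track's page competes in, for each scheme.

        `track_vector_size` is the size in bytes of the vector with data on
        the track (no track mixes large and small vectors).
        """
        if scheme == "one-space":
            return "all of RAM"                  # M/BD frames, straight LRU
        if scheme == "two-space":
            return ("small-vector space"         # M/2BD frames each
                    if track_vector_size <= track_size
                    else "large-vector space")
        if scheme == "omega-RAM":
            return ("omega-RAM space"            # 5 frames
                    if track_vector_size >= M / 4
                    else "O-RAM space")          # M/BD - 5 frames
        raise ValueError("unknown scheme")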

Test suite

Our test suite consisted of five fairly small programs written in Nesl, which were provided to us by Blelloch's research group at Carnegie Mellon. Descriptions of three of the programs (convex-hull, primes, and qsort) appear in the Nesl documentation [Ble92].

• convex-hull is an implementation of the recursive quickhull algorithm (see [Ble90] and [PS85, pp. 106–108]) for computing the convex hull of a set of points in two dimensions. The goal is to find two polygonal chains whose concatenation forms the convex hull of the given set of points. The quickhull algorithm divides the input set into two sets of points, recursively finds a polygonal chain for each set, and then merges the chains together.

• linefit is a Nesl implementation of linear regression (see Press et al. [PFTV89, pp. 553–558]) for fitting points in two dimensions to a straight line. It is straight-line code with no recursion.

• order is a recursive function that finds the median or any order statistic of a given set of values. It is a deterministic version of the algorithm in [CLR90, pp. 187–189]. Each recursive invocation removes several values that cannot be the desired order statistic, and so the length of the vector in each recursive call strictly decreases over time.

• primes uses the sieve of Eratosthenes to find all the prime numbers less than n. It recursively finds all primes less than √n and then uses these prime numbers to sieve out candidate primes greater than or equal to √n.

• qsort is an implementation of quicksort (see [Ble90] and [CLR90, Chapter 8]). To sort a set of numbers, it chooses one of them as a “pivot,” partitions the numbers into those less than, equal to, and greater than the pivot, and recurses on these sets.

Empirical results

Our empirical tests of the three paging schemes on the five test programs were VM-DP simulations with P = 16 processors, D = 4 disks, B = 128 bytes per block, and M = 16384 bytes of RAM in all. These parameters may seem small, but space and time limitations prevented us from simulating with parameters significantly larger. Moreover, these parameters exercised the paging system to a greater degree than larger parameters would have. The simulations we ran with other parameter values yielded essentially the same results in comparing paging schemes as those we obtained with the above parameters.

     N    one-space   two-space    ω-RAM    worst/best
    32         41         306         82      7.463
   128        667         971        780      1.456
   512       3930        5694       4752      1.449
  2048      21140       22453      22390      1.062
  8192      89260       90104      89886      1.009

Table 5.1: I/O counts for convex running under the three paging schemes, along with the ratio for the worst scheme to the best scheme for each input size.

     N    one-space   two-space    ω-RAM    worst/best
   128          6           6          6      1.000
   512         73         188         99      2.575
  2048       1183        1217       1217      1.029
  8192       4836        4817       4817      1.004
 32768      19236       19217      19217      1.001

Table 5.2: I/O counts for linefit running under the three paging schemes.

• Table 5.1 shows the results for convex. The one-space scheme performed the fewest I/Os for each problem size, but for N = 2048 points, the best and worst schemes differed by only 6.2 percent, and for N = 8192 points, the best and worst schemes differed by only 0.9 percent.

• Table 5.2 shows the results for linefit. The one-space scheme performed the fewest I/Os for smaller problem sizes, but the two-space and ω-RAM schemes tied for the fewest I/Os for larger problem sizes. For N = 8192 points, the best and worst schemes differed by only 0.4 percent, and for N = 32768 points, the best and worst schemes differed by only 0.1 percent.

     N    one-space   two-space    ω-RAM    worst/best
    32          4           4          4      1.000
   128          6           6          6      1.000
   512          3          20          3      6.667
  2048        288         520        398      1.806
  8192       2174        2492       2397      1.147
 32768      10663       10676      10616      1.006

Table 5.3: I/O counts for order running under the three paging schemes. We do not know why the I/O counts for N = 512 should be less than those for N = 128 for the one-space and ω-RAM schemes.

     N    one-space   two-space    ω-RAM    worst/best
   128          8           8          8      1.000
   512        217         285        228      1.313
  2048       1762        1796       1839      1.044
  8192       7299        7475       7530      1.032
 32768      30683       30686      30737      1.002

Table 5.4: I/O counts for primes running under the three paging schemes. This is the only program in the test suite for which the two-space scheme ever outperformed the ω-RAM scheme.


• Table 5.3 shows the results for order. The one-space scheme performed the fewest I/Os for all but the largest problem size we ran, in which ω-RAM was the best. For N = 8192 numbers, the best and worst schemes differed by 14.7 percent, but for N = 32768 numbers, the best and worst schemes differed by only 0.6 percent.

• Table 5.4 shows the results for primes. The one-space scheme performed the fewest I/Os for all problem sizes. This program was the only one in our test suite for which the ω-RAM scheme was worse than the two-space scheme. For N = 8192 numbers, the best and worst schemes differed by 3.2 percent, and for N = 32768 numbers, the best and worst schemes differed by only 0.2 percent.

     N    one-space   two-space    ω-RAM    worst/best
    32         26          89         26      3.423
   128        139         316        208      2.273
   512      10808       11756      11169      1.088
  2048      56206       66372      65968      1.181
  8192     374489      383915     379426      1.025

Table 5.5: I/O counts for qsort running under the three paging schemes.

• Table 5.5 shows the results for qsort. The one-space scheme performed the fewest I/Os for all problem sizes. For N = 2048 numbers, the best and worst schemes differed by 18.1 percent, but for N = 8192 numbers, the best and worst schemes differed by 2.5 percent.

In general, especially for larger vector sizes, the differences among the three schemes were small, often under 1 percent. The one-space scheme, which is the simplest to implement, performed the fewest I/Os for all problem sizes we considered for convex, primes, and qsort. Moreover, for those problem sizes of linefit and order for which the one-space scheme was not the best, it was within 1.1 percent of the best scheme. According to our empirical study, one-space is the scheme of choice among these three.

5.2 Vector allocation

The classification of vectors into large and small sizes affects how the VM-DP system manages its vector address space. This section discusses how vectors are mapped to pages and the effect of the mapping on address-space allocation. Large vectors and small vectors are mapped according to different sets of rules.

The address space

The restriction that the initial element of each vector maps to the RAM of processor P0 affects how we view the address space of vectors. When a vector or portion thereof is in RAM, we can identify it by an address in the RAM of processor P0. The machine model associates the first BD/P bytes of each block on disk D0 with processor P0. The vector address space, therefore, is byte offsets into the first BD/P bytes of each block of disk D0. Because each page is a track and each track is a stripe across the parallel disk system, there are BD/P addresses per page.

Figure 5.1 shows the relationship among disks, blocks, and RAM. Each track, and hence each page, is comprised of D blocks of B bytes each. Each block on disk Di is divided into P/D units of BD/P bytes each, which map to the portion of a page frame in processors PiP/D, PiP/D+1, . . . , P(i+1)P/D−1. The addresses in the first BD/P bytes of each block on disk D0 identify vector addresses.

Because the VM-DP system supports the Vcode model, there are only two types of vectors: vectors of 4-byte integers and vectors of 8-byte (double precision) floats. All vector addresses are therefore multiples of 4.


Figure 5.1: The relationship among disks, blocks, and RAM. Each block on disk Di is divided into P/D units of BD/P bytes each, which map to the portion of a page frame in processors PiP/D, PiP/D+1, . . . , P(i+1)P/D−1. Equal shading indicates blocks on disk (top) and in RAM (bottom) that map to each other. Blocks, both on disk and in RAM, have heavy borders around them. Addresses in the first BD/P bytes of each block on D0 identify vector addresses.


Figure 5.2: How large and small vectors are mapped to the address space. Shaded positions contain elements, and white positions are empty. (a) A large vector starts at the beginning of a track, occupies a set of contiguous tracks, and uses the minimum VPR in its last track. The remainder of the last track is unused. Each track is drawn with a heavy border. (b) A small-vector track. Each small vector is aligned to start in processor P0, has the minimum VPR, and may not cross a track boundary. In this example, vectors X and Z have VPRs of 1, and vector Y has a VPR of 3.

Mapping of vectors to the address space

Figure 5.2(a) shows how large vectors are mapped. The address of any large vector is the address of the beginning of a track. In other words, large vectors always start at the beginning of a page. The vector occupies a contiguous set of tracks for as many tracks as it requires. All tracks except possibly the last one are full. The last track contains the remaining elements in a load-balanced fashion, by which we mean the following. For an assignment of elements to processors, we define the virtual processor ratio, or VPR, as the maximum number of elements assigned to any processor. If there are k elements in the last track, the VPR of the last track is ⌈k/P⌉, which is the minimum possible. In particular, we map the first ⌈k/P⌉ elements to processor P0, the next ⌈k/P⌉ elements to processor P1, and so on, until we run out of elements. Even if only partially full, the last track of a large vector contains only the elements of this vector; no other vector uses this portion of the address space. A track that contains part of a large vector therefore contains no small vectors. In the worst case (a vector that occupies one track plus one more element), this scheme wastes less than half of the address space, and on average it wastes much less.

Small vectors are mapped to avoid wasting space yet make accessing them simple. If a track contains small vectors, it contains only small vectors and no portions of large vectors. As Figure 5.2(b) shows, small vectors are placed into small-vector tracks with three restrictions. First, each vector begins in processor P0. Second, each vector is laid out with the minimum VPR, just like the last track of a large vector. That is, we reserve ⌈N/P⌉ elements per processor for an N-element vector. Third, small vectors cannot cross track boundaries. When a small-vector track is read into RAM, we know that we have its small vectors read into RAM in their entirety.
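A small sketch of the load-balanced rule for a partially full track; the helper names are ours, not VM-DP identifiers.

    import math

    def vpr(k, P):
        """Minimum virtual processor ratio for k elements on P processors."""
        return math.ceil(k / P)

    def processor_of_last_track_element(j, k, P):
        """Processor holding the j-th of the k elements in a partially full track:
        the first ceil(k/P) elements go to P0, the next ceil(k/P) to P1, and so on."""
        return j // vpr(k, P)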

Vector allocation and deallocation

The only part of the Nesl/Vcode/CVL package that we had to modify other than the CVL functions was the Vcode module that performs vector allocation and deallocation. As in the Vcode system provided by Blelloch's group, this code runs on the front-end machine attached to the data-parallel machine, and there it maintains a data structure of all allocated and unallocated sections of the address space.

The VM-DP allocation scheme takes advantage of the quantization of vector sizes. Vector sizes are quantized in two ways. For small vectors, because they must start in processor P0 and their vector addresses are multiples of 4, many vectors may occupy the same amount of address space. In Figure 5.2(b), for example, the vectors X and Z each occupy only one row across the processors. Although X is a vector of only one 4-byte integer and Z is a vector of P 4-byte integers, they take the same amount of address space. In general, two small vectors with N1 and N2 elements and R1 and R2 bytes per element, respectively, occupy the same amount of address space if R1⌈N1/P⌉ = R2⌈N2/P⌉. The quantization effect of dividing by P and taking the ceiling can cause vectors whose lengths differ by up to P − 1 to have the same address-space size.

For large vectors, because the last track is not shared with any other vector, we allocate them as if the last track were full. In other words, we round up large-vector sizes to the next full track. The BD-byte size of a track can be quite large, resulting in a considerable quantization effect.

Quantization helps because the VM-DP vector-allocation code uses a policy of “first exact fit but settle for first fit.” To allocate a new vector, we search through all unallocated sections of the address space, looking for an exact fit of the quantized size. For example, an unallocated section of three consecutive tracks is an exact fit for any vector that needs more than two but no more than three tracks. As we search through the unallocated sections, we keep the address of the first section that fits at all; if there is no exact fit, we use this first fit. This search takes place in the front-end machine, causing no disk accesses for vectors.

Why is this strategy good? It is guaranteed to use an exact fit if one exists, thus reducing fragmentation of the address space upon allocation. The quantization increases the likelihood of an exact fit occurring. And if there is no exact fit, this strategy has the advantage of first fit, which Knuth [Knu68, pp. 435–451] finds to be a good allocation method.

Vector deallocation is performed as simply as possible. When we deallocate a vector, we merge it with any of its unallocated neighboring sections of the address space to make a maximal unallocated section. The current implementation does not attempt to compact the remaining allocated vectors in the address space or otherwise relocate them. Even if an allocation request finds no unallocated section large enough, no compaction occurs. Instead, we allocate new space on the disk; the current implementation views disk space as a limitless resource.
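The allocation policy itself can be sketched in a few lines of C. The free-list structure and names here are illustrative assumptions rather than the actual front-end code of VM-DP; the point is the single pass that prefers an exact fit of the quantized size but remembers the first loose fit.

#include <stddef.h>

/* One unallocated section of the address space, as tracked on the front end. */
typedef struct section {
    long            addr;   /* starting vector address       */
    long            size;   /* quantized size of the section */
    struct section *next;
} section_t;

/* "First exact fit but settle for first fit": scan the free list once.
 * Return an exact fit if one exists; otherwise return the first section
 * large enough to hold the request, or NULL if none fits. */
static section_t *find_section(section_t *free_list, long request)
{
    section_t *first_fit = NULL;

    for (section_t *s = free_list; s != NULL; s = s->next) {
        if (s->size == request)
            return s;                      /* exact fit: use it immediately */
        if (s->size > request && first_fit == NULL)
            first_fit = s;                 /* remember the first loose fit  */
    }
    return first_fit;                      /* may be NULL: nothing fits     */
}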

5.3 Implementation issues in the VM-DP paging system

This section presents some of the implementation details of the VM-DP paging system. It is by no means a complete account of the implementation but rather touches on some of the techniques we use to access vectors quickly and save disk I/Os.

Data structures

The VM-DP paging system uses two primary data structures: a page-frame table and a hash table. The page-frame table has one entry for each of the M/BD page frames in RAM. Among the information contained in each entry is the following:

• The vector address of the beginning of the page in the frame if a page is present, or a special value if the frame is empty.

• A flag indicating whether the page has been changed in RAM since it was last read.

• Pointers to the next and previous page frames in a doubly linked list of frames in most recently used (MRU) order. There is one such list in the one-space scheme, and the two-space and ω-RAM schemes each use two disjoint lists, one for each portion of RAM. The tail of each list is its LRU page.

The hash table contains an entry for each large-vector track in RAM and for each small-vector track whether or not it is in RAM. Among the information contained in each hash-table entry is the following:

• The vector address of the beginning of the track, which is also the hash key.

• The number of the page frame containing the track if it is in RAM, or a special value if the track is not in RAM.

• A count of the number of vectors currently allocated in the track.

The hash table permits fast accesses and quick determination of whether a page fault occurs. If the first address of a large-vector track is not present in the hash table, then there is a fault. For small-vector accesses, it is simple to compute the first address of the page holding a particular small vector. This address is present in the hash table, and the number of its page frame determines whether there is a fault.

The count field is not really needed for large-vector tracks, since at most one large vector can occupy a track. For small-vector tracks, however, we increment the value of this field when a vector is allocated on the track, and we decrement it when a vector is deallocated. When the count goes down to zero, the track is empty.
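The two tables can be pictured with C structures along the following lines. The field names, the use of −1 as the “special value,” and the chained hash buckets are assumptions made for this sketch; the actual VM-DP declarations may differ.

/* One entry per page frame in RAM (M/BD entries in all). */
typedef struct frame_entry {
    long page_addr;   /* vector address of the page held here, or -1 if empty */
    int  dirty;       /* changed in RAM since it was last read from disk?     */
    int  mru_next;    /* next frame in the MRU-ordered doubly linked list     */
    int  mru_prev;    /* previous frame in that list (the tail is the LRU)    */
} frame_entry_t;

/* One entry per large-vector track in RAM, and per small-vector track
 * whether or not it is in RAM. */
typedef struct track_entry {
    long                track_addr;    /* address of the track start (hash key)    */
    int                 frame;         /* page frame holding the track, or -1      */
    int                 vector_count;  /* vectors currently allocated on the track */
    struct track_entry *next;          /* next entry in the same hash bucket       */
} track_entry_t;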

Reducing disk accesses

In order to reduce the number of disk accesses, we do not immediately remove a track from the page-frame and hash tables when it becomes unused. Instead, we move it to the tail of the MRU list so that it becomes a likely candidate to be swapped out upon the next page fault. If it turns out that a vector is soon to be allocated on this track, it is already in RAM and does not need to be read from disk.

An even more worthwhile optimization takes advantage of the property that large-vector tracks are not shared by multiple vectors. Suppose that we are writing a large vector from beginning to end, as occurs in elementwise and scan operations. Because each track of the large vector contains no other data and it is being written, there is no need to read it from disk first. Unfortunately, this optimization cannot be used for a small vector, since it may share its track with other small vectors. Also, the disk read must occur when only a single element of a vector is being written, in order to avoid corrupting the other elements in the track.
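The write-without-read rule reduces to a one-line test. This is only a sketch of the condition stated above, with hypothetical flags rather than the actual VM-DP bookkeeping:

/* A faulting track may skip the disk read only when it belongs to a single
 * large vector and the entire track is about to be overwritten; small-vector
 * tracks (possibly shared) and partial writes must be read first. */
static int must_read_before_write(int is_large_vector_track,
                                  int writing_whole_track)
{
    return !(is_large_vector_track && writing_whole_track);
}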

5.4 Conclusions

This chapter has presented the highlights of the design of the VM-DP paging system. In particular, we have seen that the one-space paging scheme is overall the simplest and best. We also examined how VM-DP maps vectors to pages and allocates vectors, along with some key implementation details. The remainder of this section discusses future work in paging systems for virtual memory for data-parallel computing.

Are there other paging schemes that would yield significantly better performance than the one-space scheme? We have considered two variations on the ω-RAM scheme. In one, we place an upper limit on the number of page frames for each track classification, but there are no lower limits. Moreover, the upper limits are not in force whenever there are unused page frames. The upper limits sum to more than the number of page frames in RAM. Such a scheme would allow for greater flexibility yet protect some smaller vectors from being thrown out of RAM by larger ones. The other variation is a hybrid of the ω-RAM and two-space schemes, which is essentially a three-space scheme. We allocate a small number of page frames for vectors that are more than a constant times the RAM size, and then split (perhaps unevenly) the remaining frames between small and intermediate vectors. This approach may take the best from these two schemes, or it may take the worst. To take the first variation a step further, can we improve I/O performance with a scheme that adapts according to the observed usage pattern?

We could have used other vector-allocation strategies as well. For example, rather than “first exact fit but settle for first fit,” we could use “first exact fit on a page already in RAM, settle for first exact fit, then settle for first fit on a page already in RAM, and finally settle for first fit.” Would such a strategy improve performance, or would it be more trouble than it was worth?

The VM-DP system does not attempt to relocate vectors upon allocation or deallocation. Perhaps it should. If so, when should we relocate vectors? Which vectors should be relocated? Where should they move to? The number of empirical studies one can perform on paging systems seems limitless.

Chapter 6

Conclusions

Jim Salem feels a little better now. He sees the promise that he won’t have to write explicit I/O calls for much longer. And he is relieved that he won’t have to become expert in how to perform vector operations when data resides on a parallel disk system, for he sees that there will be VM systems for data-parallel computing that handle these tasks for him.

This thesis has demonstrated theoretical and practical methods for performing permutations, shown how to lay out vectors for the best performance, and presented an empirical study of paging systems. Many of the ideas in this thesis affected the design and implementation of the VM-DP system, and they ought to prove valuable to anyone implementing a VM system for data-parallel computing.

But Jim Salem is not yet completely satisfied. There is much more to be done in order to make VM for data-parallel computing truly useful to the computing community at large. The conclusions sections of Chapters 2–5 posed open problems and suggested future work. Yet there is one important area that this thesis has not addressed. The future of VM for data-parallel computing lies in languages and compilers as much as in algorithms. Most of the application areas listed back in Section 1.1 were scientific. VM for data-parallel computing will likely yield the greatest benefit to the scientific community if it supports Fortran and other frequently used languages.1 A couple of examples show the deficiencies of an intermediate language such as Vcode or Paris, in which the instructions apply one operation to a fixed number of operand vectors.

1Jim Salem reports, “For finite element science work I’d probably want Fortran, for graphics I’d want C, for AI and prototyping I’d want Lisp.”


First, consider the dot product of two N-element vectors X and Y :

X · Y = Σ_{i=0}^{N−1} X_i Y_i .

In Vcode or Paris, this computation would require a temporary vector variable, and it would be expressed along the lines of

temp ← X ∗ Y    (elementwise multiplication)
result ← +-REDUCE(temp)

The problem here is that we would end up writing each track of the temporary vector temp in the elementwise multiplication operation, only to read it back during the reduce operation. Ideally, we would like to run code that reads in the operand vectors track by track and maintains the running sum without the need to write or read tracks of temp. Thus, we would save 2N/BD of the total 4N/BD parallel I/Os.
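For instance, a fused primitive for the dot product might look like the following C sketch. The track-reading helper and buffer management are assumptions made for illustration; the point is that the partial sums live in RAM and temp never touches the disk.

/* Track-by-track dot product: stream X and Y through RAM one track at a
 * time and accumulate the sum, so the N-element temporary is neither
 * written to nor read from disk.  read_track() is a hypothetical callback
 * that fills buf with track t of the named vector (track_elems elements,
 * fewer on the last track). */
double streamed_dot_product(long N, long track_elems,
                            void (*read_track)(const char *vec, long t,
                                               double *buf),
                            double *xbuf, double *ybuf)
{
    double sum = 0.0;
    long ntracks = (N + track_elems - 1) / track_elems;

    for (long t = 0; t < ntracks; t++) {
        long n = (t == ntracks - 1) ? N - t * track_elems : track_elems;
        read_track("X", t, xbuf);          /* one parallel read for X      */
        read_track("Y", t, ybuf);          /* one parallel read for Y      */
        for (long i = 0; i < n; i++)
            sum += xbuf[i] * ybuf[i];      /* multiply and accumulate here */
    }
    return sum;   /* 2N/BD parallel I/Os rather than 4N/BD */
}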

Another example takes this notion a step further. Consider the elementwise addition of several vectors:

Z ← A + B + C + · · · + W + X + Y .

Rather than writing and reading tracks of temporary variables for each addition operation, we would like to read in a track of A and a track of B, add them into a track’s worth of sums in RAM, read in a track of C, add it into the sums in RAM, and so on, writing out a track of Z after adding in a track of Y; we then repeat this process for the next track of each vector, and so on. If there are k operand vectors, using full temporary vector variables incurs a cost of 3(k − 1)N/BD parallel I/Os, whereas the above method of computing the sum track by track in RAM costs only (k + 1)N/BD parallel I/Os.

The point of these examples is that we need to define the right set of primitive operations and invoke them effectively at compile time in order to produce efficient data-parallel code in the presence of virtual memory. Chatterjee [Cha91] took a similar approach for shared-memory multiprocessing. For virtual memory in a data-parallel environment, the stakes are even higher. The price of inefficient code is the expense of disk accesses.

Bibliography

[ACS87] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Hierarchical memory with block transfer. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science, pages 204–216, October 1987.

[AV88] Alok Aggarwal and Jeffrey Scott Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, September 1988.

[BC90] Guy E. Blelloch and Siddhartha Chatterjee. Vcode: A data-parallel intermediate language. In Proceedings of the Third Symposium on the Frontiers of Massively Parallel Computation, pages 471–480, October 1990.

[BCD72] A. Bensoussan, C. T. Clingen, and R. C. Daley. The Multics virtual memory: Concepts and design. Communications of the ACM, 15(5):308–318, May 1972.

[BCK+92] Guy E. Blelloch, Siddhartha Chatterjee, Fritz Knabe, Jay Sipelstein, and Marco Zagha. Vcode Reference Manual (Version 1.3), February 1992.

[BCSZ91] Guy E. Blelloch, Siddhartha Chatterjee, Jay Sipelstein, and Marco Zagha. CVL: A C Vector Library, November 1991.

[BK82] Richard P. Brent and H. T. Kung. A regular layout for parallel adders. IEEE Transactions on Computers, C-31(3):260–264, March 1982.

[Ble90] Guy E. Blelloch. Vector Models for Data-Parallel Computing. The MIT Press, Cambridge, Massachusetts, 1990.

[Ble92] Guy E. Blelloch. Nesl: A nested data-parallel language. Technical Report CMU-CS-92-103, School of Computer Science, Carnegie Mellon University, January 1992.


[BLM+91] Guy E. Blelloch, Charles E. Leiserson, Bruce M. Maggs, C. Greg Plaxton, Stephen J. Smith, and Marco Zagha. A comparison of sorting algorithms for the Connection Machine CM-2. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 3–16, July 1991.

[BM76] J. A. Bondy and U. S. R. Murty. Graph Theory with Applications. American Elsevier, 1976.

[BME90] Jean-Philippe Brunet, Jill P. Mesirov, and Alan Edelman. An optimal hypercube direct N-body solver on the Connection Machine. In Proceedings of Supercomputing ’90, pages 748–752, November 1990.

[Bre69] Norman M. Brenner. Fast Fourier transform of externally stored data. IEEE Transactions on Audio and Electroacoustics, AU-17(2):128–132, June 1969.

[Can90] Charles R. Cantor. Orchestrating the human genome project. Science, 248:49–51, 1990.

[CC90] Moon Jung Chung and Yunmo Chung. Efficient parallel logic simulation techniques for the Connection Machine. In Proceedings of Supercomputing ’90, pages 606–614, November 1990.

[CGK+88] Peter Chen, Garth Gibson, Randy H. Katz, David A. Patterson, and Martin Schulze. Two papers on RAIDs. Technical Report UCB/CSD 88/479, Computer Science Division (EECS), University of California, Berkeley, December 1988.

[Cha91] Siddhartha Chatterjee. Compiling Data-Parallel Programs for Efficient Execution on Shared-Memory Multiprocessors. PhD thesis, School of Computer Science, Carnegie Mellon University, October 1991. Available as Technical Report CMU-CS-91-189.

[Che91] Robert Chervin. Climate modeling with parallel vector supercomputers. In Proceedings of Supercomputing ’91, page 677, November 1991.

[Chr91] A. T. Chronopoulos. Towards efficient parallel implementation of the CG method applied to a class of block tridiagonal linear systems. In Proceedings of Supercomputing ’91, pages 578–587, November 1991.

[CLR90] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts, 1990.

[Cor91] Daniel Corkill. Blackboard systems. AI Expert, pages 40–47, September 1991.

[Cor92] Thomas H. Cormen. Fast permuting in disk arrays. In Proceedings of the 1992 Brown/MIT Conference on Advanced Research in VLSI and Parallel Systems, pages 58–76, 1992. Conference version is an extended abstract; full paper to appear in Journal of Parallel and Distributed Computing.

[CP90] Peter M. Chen and David A. Patterson. Maximizing performance in a striped disk array. In Proceedings of the 17th Annual Symposium on Computer Architecture, pages 322–331, 1990.

[Dav92] Dan Davenport. Private communication, July 1992.

[DD68] Robert C. Daley and Jack B. Dennis. Virtual memory, processes, and sharing in MULTICS. Communications of the ACM, 11(5):306–312, May 1968.

[Dem88] Gary Demos. Generating movie-quality animated graphics with massively parallel computers. In Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, pages 15–20, October 1988.

[Den70] Peter J. Denning. Virtual memory. ACM Computing Surveys, 2(3):153–189, September 1970.

[EHJ92] Alan Edelman, Steve Heller, and S. Lennart Johnsson. Index transformation algorithms in a linear algebra framework. Technical Report LBL-31841, Lawrence Berkeley Laboratory, 1992.

[EHMN90] Matthew Evett, James Hendler, Ambujashka Mahanti, and Dana Nau. PRA*: A memory-limited heuristic search procedure for the Connection Machine. In Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, pages 145–149, October 1990.

[Flo72] Robert W. Floyd. Permuting information in idealized two-level storage. In Raymond E. Miller and James W. Thatcher, editors, Complexity of Computer Computations, pages 105–109. Plenum Press, 1972.

[Fra76] Donald Fraser. Array permutation by index-digit permutation. Journal of the ACM, 23(2):298–309, April 1976.

[GHK+89] Garth A. Gibson, Lisa Hellerstein, Richard M. Karp, Randy H. Katz, and David A. Patterson. Failure correction techniques for large disk arrays. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-III), pages 123–132, April 1989.

[Gib92] Garth A. Gibson. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. The MIT Press, Cambridge, Massachusetts, 1992. To appear. Also Technical Report UCB/CSD 91/613, Computer Science Division (EECS), University of California, Berkeley, May 1991.

[GK82] Harold N. Gabow and Oded Kariv. Algorithms for edge coloring bipartite graphs and multigraphs. SIAM Journal on Computing, 11(1):117–129, February 1982.

[Has87] Brosl Hasslacher. Discrete fluids. Los Alamos Science, Special Issue(15):175–217, 1987.

[Her75] I. N. Herstein. Topics in Algebra. Xerox College Publishing, second edition, 1975.

[Hig92] Peter Highnam. Private communication, March 1992.

[JH91] S. Lennart Johnsson and Ching-Tien Ho. Generalized shuffle permutations on boolean cubes. Technical Report TR-04-91, Harvard University Center for Research in Computing Technology, February 1991.

[Kau88] Tuomo Kauranne. An introduction to parallel processing in meteorology. In G.-R. Hoffmann and D. K. Maretis, editors, The Dawn of Massively Parallel Processing in Meteorology, pages 3–20. Springer-Verlag, 1988.

[KNPF87] Michelle Y. Kim, Anil Nigam, George Paul, and Robert H. Flynn. Disk interleaving and very large fast Fourier transforms. International Journal of Supercomputer Applications, 1(3):75–96, 1987.

[Knu68] Donald E. Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms. Addison-Wesley, Reading, MA, 1968.

[Knu73] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, MA, 1973.

[Koc90] Peter Kochevar. Too many cooks don’t spoil the broth: Light simulation on massively parallel computers. In Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, pages 100–109, October 1990.

[Lei85] Tom Leighton. Tight bounds on the complexity of parallel sorting. IEEE Transactions on Computers, C-34(4):344–354, April 1985.

[LF80] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831–838, October 1980.

[LK91] Edward K. Lee and Randy H. Katz. Performance consequences of parity placement in disk arrays. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), pages 190–199, April 1991.

[Mac92] Macworld, September 1992.

[Man92] Nashat Mansour. Private communication, July 1992.

[MC69] A. C. McKellar and E. G. Coffman, Jr. Organizing matrices and matrix operations for paged memory systems. Communications of the ACM, 12(3):153–165, March 1969.

[MF92] Nashat Mansour and Geoffrey C. Fox. Parallel physical optimization algorithms for allocating data to multicomputer nodes. Unpublished manuscript, 1992.

[MK91] Ethan L. Miller and Randy H. Katz. Input/output behavior of supercomputing applications. In Proceedings of Supercomputing ’91, pages 567–576, November 1991.

[MMF+91] Carlos R. Mechoso, Chung-Chun Ma, John D. Farrara, Joseph A. Spahr, and Reagan W. Moore. Distribution of a climate model across high-speed networks. In Proceedings of Supercomputing ’91, pages 253–260, November 1991.

[MS91] Jacek Myczkowski and Guy Steele. Seismic modeling at 14 gigaflops on the Connection Machine. In Proceedings of Supercomputing ’91, pages 316–326, November 1991.

[MW91] Tom Myers and Elizabeth Williams. Mass storage requirements in the intelligence community. In Proceedings of Supercomputing ’91, pages 878–889, November 1991.

[NGHG+91] Hugh Nicholas, Grace Giras, Vasiliki Hartonas-Garmhausen, Michael Kopko, Christopher Maher, and Alexander Ropelewski. Distributing the comparison of DNA and protein sequences across heterogeneous supercomputers. In Proceedings of Supercomputing ’91, pages 139–146, November 1991.

[NM87] Alan Norton and Evelyn Melton. A class of boolean linear transformations for conflict-free power-of-two stride access. In Proceedings of the 1987 International Conference on Parallel Processing, pages 247–254, August 1987.

[NS81] David Nassimi and Sartaj Sahni. A self-routing Benes network and parallel permutation algorithms. IEEE Transactions on Computers, C-30(5):332–340, May 1981.

[NS82] David Nassimi and Sartaj Sahni. Optimal BPC permutations on a cube connected SIMD computer. IEEE Transactions on Computers, C-31(4):338–341, April 1982.

[NV90] Mark H. Nodine and Jeffrey Scott Vitter. Greed sort: An optimal external sorting algorithm for multiple disks. Technical Report CS-90-04, Department of Computer Science, Brown University, February 1990.

[NV91] Mark H. Nodine and Jeffrey Scott Vitter. Large-scale sorting in parallel memories. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 29–39, July 1991.

[NV92] Mark H. Nodine and Jeffrey Scott Vitter. Optimal deterministic sorting on parallel disks. Technical Report CS-92-08, Department of Computer Science, Brown University, 1992.

[PF91] C. Pommerell and W. Fichtner. Pils: An iterative linear solver package for ill-conditioned systems. In Proceedings of Supercomputing ’91, pages 588–599, November 1991.

[PFTV89] William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical Recipes in Pascal: The Art of Scientific Computing. Cambridge University Press, 1989.

[PGK88] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In ACM International Conference on Management of Data (SIGMOD), pages 109–116, June 1988.

[PH90] David A. Patterson and John L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA, 1990.

[Pin82] Ron Yair Pinter. The Impact of Layer Assignment Methods on Layout Algorithms for Integrated Circuits. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, August 1982. Available as Technical Report MIT/LCS/TR-291.

[PMG91] Steve Plimpton, Gary Mastin, and Dennis Ghiglia. Synthetic aperture radar image processing on parallel supercomputers. In Proceedings of Supercomputing ’91, pages 446–452, November 1991.

[Pre92a] John K. Prentice. Private communication, July 1992.

[Pre92b] John K. Prentice. A Quantum Mechanical Theory for the Scattering of Low Energy Atoms From Incommensurate Crystal Surface Layers. PhD thesis, Department of Physics and Astronomy, University of New Mexico, July 1992.

[PS85] Franco P. Preparata and Michael Ian Shamos. Computational Geometry: An Introduction. Springer-Verlag, 1985.

[Sal92] Jim Salem. Private communication, January 1992.

[SBGM91] James A. Sethian, Jean-Philippe Brunet, Adam Greenberg, and Jill P. Mesirov. Computing turbulent flow in complex geometries on a massively parallel processor. In Proceedings of Supercomputing ’91, pages 230–241, November 1991.

[SDM91] R. D. Smith, J. K. Dukowicz, and R. C. Malone. Ocean modeling on the Connection Machine. In Proceedings of Supercomputing ’91, page 679, November 1991.

[Sea92] Sandra H. Seale. Private communication, July 1992.

[Sin67] Richard C. Singleton. A method for computing the fast Fourier transform with auxiliary memory and limited high-speed storage. IEEE Transactions on Audio and Electroacoustics, AU-15(2):91–98, June 1967.

[Smi92] Stephen J. Smith. Private communication, July 1992.

[ST85] Daniel D. Sleator and Robert E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202–208, February 1985.

[Thi89] Thinking Machines Corporation, Cambridge, Massachusetts. Paris Reference Manual, February 1989.

[Til90] James C. Tilton. Porting an iterative parallel region growing algorithm from the MPP to the MasPar MP-1. In Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, pages 170–173, October 1990.

[TMB91] Pablo Tamayo, Jill P. Mesirov, and Bruce M. Boghosian. Parallel approaches to short range molecular dynamics simulations. In Proceedings of Supercomputing ’91, pages 462–470, November 1991.

[VS90a] Jeffrey Scott Vitter and Elizabeth A. M. Shriver. Algorithms for parallel memory I: Two-level memories. Technical Report CS-90-21, Department of Computer Science, Brown University, September 1990.

[VS90b] Jeffrey Scott Vitter and Elizabeth A. M. Shriver. Optimal disk I/O with parallel block transfer. In Proceedings of the Twenty Second Annual ACM Symposium on Theory of Computing, pages 159–169, May 1990.

[Wat90] James D. Watson. The human genome project: Past, present, and future. Science, 248:44–49, 1990.

[Wil91] Dennis E. Willen. Exploration geophysics, parallel computing, and reality. In Proceedings of Supercomputing ’91, page 540, November 1991.

[YZ91] L. C. Young and S. E. Zarantonello. High performance vector processing in reservoir simulation. In Proceedings of Supercomputing ’91, pages 304–315, November 1991.