Reducing I/O Complexity by Simulating Coarse Grained Parallel Algorithms

Frank Dehne, David Hutchinson, Anil Maheshwari
School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6
{dehne, hutchins, maheshwa}@scs.carleton.ca

Wolfgang Dittrich
Bosch Telecom GmbH, UC-ON/ERS, Gerberstraße 33, 71522 Backnang, Germany

Abstract

Block-wise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a deterministic simulation technique which transforms parallel algorithms into (parallel) external memory algorithms. Specifically, we present a deterministic simulation technique which transforms Coarse Grained Multicomputer (CGM) algorithms into external memory algorithms for the Parallel Disk Model. Our technique optimizes block-wise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory.

We obtain new improved parallel external memory algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including 3D convex hulls (2D Voronoi diagrams), and various graph problems. All of the (parallel) external memory algorithms obtained via simulation are analyzed with respect to the computation time, communication time and the number of I/Os. Our results answer the challenge posed by the ACM working group on storage I/O for large-scale computing [8].

1 Introduction

Cormen and Goodrich [8] posed the challenge of combining BSP-like parallel algorithms with the requirements for parallel disk I/O. Solutions based on probabilistic methods were presented in [11] and [14]. In this paper, we present deterministic solutions which are based on a deterministic simulation of parallel algorithms for the Coarse Grained Multicomputer (CGM) model and answer that challenge. The analysis of the I/O complexity of our algorithms is done as in the PDM model. In addition, we analyze the running time and communication time.

The parameter space for EM problems which we are proposing in this paper is both practical and interesting. The logarithmic term in the I/O complexity of sorting is bounded by a constant c if N/B ≤ (M/B)^c, where M = N/v. Since this constraint involves the parameters v, B, N, c, we have a four-dimensional constraint space. For practical purposes, the parameter B can be fixed at about 10^3 for disk I/O [22]. This reduces the constraint to a surface N^(c−1) = v^c B^(c−1) in three dimensional space. Any point on or above the surface represents a valid set of parameters for the elimination of the logarithmic factor. For 100 processors or less, for instance, any problem size greater than about 10 mega-items is sufficient.

Due to page restrictions for the conference proceedings, some details are omitted in this paper. An extended version with additional figures and charts can be found at http://www.scs.carleton.ca/~dehne.

We first review some definitions for the BSP and CGM models; consult [11, 23, 13] for more details. A BSP algorithm A on a fixed problem instance P and computer configuration C can be characterized by the parameters (N, v, g, L, λ), where N is the problem size (in problem items), v is the number of processors, g is the time required for a communication operation, L is the minimum time required for the processors to synchronize, and λ is the number of BSP supersteps required by A on P and C. (The times are in number of processor cycles.) A CGM algorithm is a special case of a BSP algorithm where the communication part of each superstep consists of exactly one h-relation with h = N/v. Such a superstep is called a round. The notion of an h-relation is often used in the analysis of parallel algorithms based on BSP-like models. An h-relation is a communication superstep in which each of the v processors sends and receives at most h data items. An algorithm for a CGM with multiple disks attached to each processor is referred to as an EM-CGM algorithm.

The following outlines the results obtained.

1. We show that any v-processor CGM algorithm A with λ supersteps/rounds, local memory size μ, computation time β + λL, communication time gα + λL and message size Θ(N/v²) can be simulated, deterministically, as a p-processor EM-CGM algorithm A' with computation time (v/p)β + O(λ(v/p)μ) + λ(v/p)L, communication time (v/p)gα + λ(v/p)L, and I/O time G·O(λ(v/p)(μ/(DB))) + λ(v/p)L, for p ≤ v, M = Θ(μ), N = Ω(vDB), and B = O(N/v²). G is the time (in processor cycles) for a parallel I/O operation.

   • Let g(N), L(N), v(N) be increasing functions of N. If A is c-optimal (see [16, 11] for definition) on the CGM for g ≤ g(N), L ≤ L(N) and v ≤ v(N), then A' is a c-optimal EM-CGM algorithm for β = ω(λμ), g ≤ g(N), G = BD·o(β/(λμ)), and L ≤ (p/v)·L(N).

   • A' is work-optimal, communication-efficient, and I/O-efficient (see [11] for definitions) if β = Ω(λμ), g ≤ g(N), G = BD·O(β/(λμ)), and L ≤ (p/v)·L(N).

   • While our parameter space is constrained to a coarse grained scenario which is typical of the CGM constraints for parallel computation, we show that this parameter space is both interesting and appropriate for EM computation. Once the expression for the general lower bounds reported by [1, 26, 3] is specialized for the subset of the parameter space that we consider, it becomes fully compatible with our results. This answers questions of Cormen [7] and Vitter [27] on the apparent contradictions between the results of [11] and the previously stated lower bounds.

2. We obtain new, simple, parallel EM algorithms for sorting, permutation, and matrix transpose with I/O complexity O(N/(pDB)).

3. We obtain parallel EM algorithms with I/O complexity O(N/(pDB)) for the following computational geometry/GIS problems: (a) 3-dimensional convex hull and planar Voronoi diagram (these results are probabilistic since the underlying CGM algorithms are probabilistic), (b) lower envelope of line segments (here, N denotes the size of the input plus output), (c) area of union of rectangles, (d) 3D-maxima, (e) nearest neighbour problem for planar point set, (f) weighted dominance counting for planar point set, (g) uni-directional and multi-directional separability.

4. We obtain parallel EM algorithms with I/O complexity O((N log N)/(pDB)) for the following computational geometry/GIS and graph problems: (a) trapezoidal decomposition, (b) triangulation, (c) segment tree construction, (d) batched planar point location.

5. We obtain parallel EM algorithms with I/O complexity O((N log v)/(pDB)) for the following computational geometry/GIS and graph problems: (a) list ranking, (b) Euler tour of a tree, (c) connected components, (d) spanning forest, (e) lowest common ancestor in a tree, (f) tree contraction, (g) expression tree evaluation, (h) open ear decomposition, (i) biconnected components.

6. In contrast to previous work, all of our methods are also scalable with respect to the number of processors.

The results for Items (2), (3) and (4) are subject to the conditions N = Ω(vDB), N ≥ v²B + v(v−1)/2, and N/v > v^δ, where δ > 1 is a constant that depends on the problem. The latter constraint arises in the CGM algorithm which we simulate. For the problems examined in this paper, δ ≤ 3.

Our results show that the EM-CGM is a good generic programming model that facilitates the design of I/O-efficient algorithms in the presence of multiple processors and multiple disks. It has relatively few parameters, generalizes the PDM model, and answers the challenge of [8].

Note that our results apply to a wide spectrum of parallel algorithms. For example, several well known simulation techniques map parallel algorithms designed for the PRAM model to the BSP model. Furthermore, by generating programs for a single processor computer from coarse grained parallel algorithms, our approach can also be used to control cache memory faults. This supports a suggestion of Vishkin [24, 25].

2 Single Processor Target Machine

In this section we describe a deterministic simulation technique that permits a CGM algorithm to be simulated as an external memory algorithm on a single processor target machine. We first consider the simulation of a single compound superstep and, in particular, how the contexts and messages of the virtual processors can be stored on disk, and retrieved efficiently in the next superstep. The management of the contexts is straightforward. Since we know the size of the contexts of the processors, we can distribute the contexts deterministically.

The main issue is how to organize the generated messages on the D disks so that they can be accessed using blocked and fully parallel I/O operations. This task is simpler if the messages have a fixed length. Although a CGM algorithm has the property that Θ(N/v) data is deemed to be exchanged by each processor in every superstep, there is no guarantee on the size of individual messages. Algorithm BalancedRouting gives us a technique for achieving fixed size messages.

Algorithm 1: BalancedRouting (from [4])

Input: Each of the v processors has n/v elements, which are divided into v messages, each of arbitrary length. Let msg_ij denote the message to be sent from processor i to processor j, and let |msg_ij| be the length of such a message.

Output: The v messages in each processor are delivered to their final destinations in two balanced rounds of communication, and each processor then contains at most h data.

Superstep A: For i = 0 to v−1 in parallel
  Processor i allocates v local bins, one for each processor
  Step 1: For j = 0 to v−1
            For ℓ = 0 to |msg_ij| − 1
              Processor i allocates the ℓ-th word of msg_ij to local bin (i + j + ℓ) mod v
  Step 2: For j = 0 to v−1
            Processor i sends bin j to processor j

Superstep B: For j = 0 to v−1 in parallel
  Step 3: Processor j reorganizes the messages it received in Step 2 into bins according to each element's final destination
  Step 4: Processor j routes the contents of bin k to processor k, for 0 ≤ k ≤ v−1

Observation 1 If bin_min is the smallest bin created at a processor in Step (1) of Superstep A, then the other v−1 bins can contain at most 1 + 2 + ... + (v−1) = v(v−1)/2 more elements than does bin_min.

Theorem 1 We are given v processors, and n data items. Each processor has exactly n/v data to be redistributed among the processors, and no processor is to be the recipient of more than h data. The redistribution can be accomplished in two communication rounds of balanced communication: (A) Messages in the first round are at least n/v² − (v−1)/2, and at most n/v² + (v−1)/2 in size, and (B) Messages in the second round are at least h/v − (v−1)/2, and at most h/v + (v−1)/2 in size.

Proof. The proof of the maximum message sizes is given in [4]. In the following, we give a proof for the minimum message sizes. In Superstep A each processor initially has n/v data. At the end of Superstep B, each processor will have at most h data.

First, we consider the minimum message size in Superstep A. Due to the round robin allocation mechanism, a given bin after Step 1 will contain at most one more element of a message to processor j than does any other bin. Let us fix any processor i. Consider the bin sizes after all of the messages have been distributed among the bins by processor i. Clearly, all of the bins will contain at least as many elements as the smallest bin, bin_min. Let e_j be the number of extra elements (more than this minimum) in bin j at Step 2. The crucial observation is that if bin_min is the smallest bin, then the other v−1 bins can hold at most 1 + 2 + ... + (v−1) = v(v−1)/2 extra elements. Thus, n/v = Σ_j |bin_j| = v·|bin_min| + Σ_j e_j. Since Σ_j e_j ≤ v(v−1)/2, we have |bin_min| ≥ n/v² − (v−1)/2.

We now turn to the message sizes in Superstep B. The elements which arrive at processor j as a result of Step 2 are the contents of the j-th local bins formed in Step 1 at each of the v processors 0 through v−1. We can think of the j-th local bin of each of the v processors as a component of a single, global superbin, which is the union of the j-th local bins of all v processors. Consider only the messages destined for a fixed processor k which are held by each processor i, 0 ≤ i ≤ v−1, prior to Step 1. These are allocated among the superbins, starting with superbin (i + k) mod v. Superbin j now contains the message which is to be sent from processor j to processor k in Step 4.

In a similar manner to the analysis of Superstep A, let E_j be the number of extra elements in superbin j after Step 1. Let sbin_min be the superbin which contains the minimum number of elements after Step 1, and hence |sbin_min| represents the minimum message size in Step 4. When processor k is one of the processors which receives the maximum h data elements, we have h = Σ_j |sbin_j| = v·|sbin_min| + Σ_j E_j, and since Σ_j E_j ≤ v(v−1)/2, we have |sbin_min| ≥ h/v − (v−1)/2. □

The notion of an h-relation is often used in the analysis of parallel algorithms based on BSP-like models (e.g. BSP, BSP*, CGM). An h-relation is a communication superstep in which each of the v processors sends and receives at most h data items. It is typically used in bounding the communication complexity in an asymptotic analysis. Based on this usage of an h-relation, we have:

Corollary 1 An arbitrary h-relation can be replaced by two "balanced" relations whose message sizes are bounded below by h/v − (v−1)/2 and above by h/v + (v−1)/2.

Lemma 1 An arbitrary minimum message size b_min can be assured provided that

    N ≥ v²·(b_min + (v−1)/2),    (1)

where N is the total number of problem items summed over the v processors.

Proof. From Corollary 1, we can achieve a minimum message size b_min provided that b_min ≤ N/v² − (v−1)/2. □

Now we look at the deterministic simulation of CGM algorithms as EM-CGM algorithms. Not every CGM algorithm will require balancing, but Lemma 2 ensures that we can obtain balanced message sizes when necessary by increasing the number of supersteps by a factor of 2.

Lemma 2 Let A be a CGM algorithm with N data, v processors, and λ communication steps. The λ communication steps of A can be replaced by 2λ steps of balanced communication in which the minimum message size is Ω(B) and the maximum message size is 2N/v², provided that N ≥ v²(B + (v−1)/2).

Proof. The minimum and maximum message sizes follow from Corollary 1, with h = N/v, and the constraint that (v−1)/2 ≤ N/v². This is true if N ≥ v²(v−1)/2, which is absorbed by (1) from Lemma 1. □

We will now turn to the actual simulation results, which rely on a message size of kN/v², for a known constant k ≥ 1. As we have seen, this is guaranteed by Lemma 2. Not every CGM algorithm will require Lemma 2.

Lemma 3 A compound superstep of a v-processor CGM algorithm A with computation time τ + L, communication time g·O(N/v) + L, message size kN/v² for a known constant k ≥ 1, and local memory size μ can be simulated in a compound superstep of a single processor EM-CGM algorithm in computation time vτ + O(vμ) and I/O time G·O(vμ/(DB)), provided that h = Θ(N/v), M = Ω(μ), D = O(N/(vB)), and B = O(N/v²).

The proof is given below, following the presentation and discussion of Algorithm 2. This algorithm expects the input messages to the virtual processors in the current superstep to be organized (by destination) in a parallel format on the disks, and it also writes the messages generated in the current superstep to the disks in a parallel format. We use the phrase "a parallel format" to mean an arrangement of the data that permits fully parallel access to the disks, both for writing the messages, and for reading them back in a different order in the next superstep. Two instances of a parallel format are the consecutive and staggered formats.

Consecutive format: We say that a disk read/write operation on D blocks is consecutive when the q-th block, 0 ≤ q ≤ D−1, is read/written from/to disk (d + q) mod D on track T_0 + ⌊(d + q)/D⌋, where T_0 is the track used for the first of the D blocks to be read/written, and d is the disk offset (from disk 0) for the first of the D blocks to be read/written.

Staggered format: We say that a disk read/write operation on D blocks involving n' messages, each of size at most b' blocks, is staggered when the q-th block, 0 ≤ q ≤ b'−1, of the j-th message, 0 ≤ j ≤ n'−1, is read/written from/to disk (d + jS + q) mod D on track T_0 + ⌊(d + jS + q)/D⌋, where T_0 is the track used for the 0-th block of the message, d is the disk offset (from disk 0) for the first of the D blocks to be read/written, and ⌈S/D⌉ is the number of tracks by which consecutive messages are to be staggered (separated).

Algorithm 2 simulates a compound superstep of a v-processor CGM on a single processor EM-CGM with D disks.

Algorithm 2: SeqCompoundSuperstep

Input: For each i ∈ {0, ..., v−1} the blocks of the context of virtual processor i are stored on the disks in consecutive format, and the arriving messages of virtual processor i are spread over the D disks in consecutive format.

Output: (i) The (changed) contexts of the simulated processors are spread across the disks in consecutive format. (ii) The generated messages for each processor in the next superstep are stored in consecutive format on the disks.

For i = 0 to v−1:
(a) Read the context of virtual processor i from the disks into memory.
(b) Read the packets received by virtual processor i from the disks.
(c) Simulate the local computation of virtual processor i.
(d) Write the packets which were sent by virtual processor i to the D disks in staggered format.
(e) Write the changed context of virtual processor i back to the D disks (in consecutive format).

We use the results of Lemma 2. Since messages are at most 2N/v² in size, we can allocate fixed sized slots for them on the disks while preserving an O(vμ) disk space requirement. In many practical EM situations 2N/v² will be a significant overestimate of the maximum message size, as v²B << N. The assurance of minimum message size Ω(B) implies that I/O operations will be blocked. (Footnote: Although a CGM algorithm may occasionally use smaller messages than B, it is charged for an h-relation in each superstep as if every processor sent and received h = N/v data.)

Details of Algorithm 2

Details of Steps (a) and (e): Since we know the size of the contexts of the processors, we can distribute the contexts deterministically. We reserve an area of total size vμ on the disks, vμ/(DB) blocks on each disk, where we store the contexts. We split the context V_j of virtual processor j into blocks of size B and store the i-th block of V_j on disk (i + j⌈μ/B⌉) mod D using track ⌊(i + j⌈μ/B⌉)/D⌋. Since the context of each processor is now in striped format on the disks, we can easily read and write the contexts using D disks in parallel for every I/O operation.

Details of Step (b): The previous compound superstep guaranteed that the blocks which contain the messages destined for the current processor are stored in consecutive format on the disks. Therefore, we can use a similar technique to fetch the messages as we used to fetch the contexts.

Details of Step (d): After the Computation Phase, all messages sent by the current virtual processor have been generated and stored in internal memory. The coarse-grained nature of the underlying BSP-like algorithm results in large messages (see Lemma 2) which are as long or longer than the block size B. We cut the messages into blocks of size B. Each block inherits the destination address from its original message. In rounds, we write the blocks out to the disks, as described in detail below.

Let b represent the maximum message size, and let b' represent the maximum number of disk blocks per message. Hence, b' = ⌈b/B⌉. Let msg_ij represent the message sent from processor i to processor j in one communication superstep. We will store the messages destined for a fixed processor j in standard consecutive format, beginning with msg_0j and ending with msg_{v−1,j}. We ensure that the first block of msg_{i+1,j} is assigned to disk (b_0 + b') mod D, for 0 ≤ i ≤ v−2, where b_0 is the disk number of the first block for msg_ij. In other words, the starting block positions for messages to consecutive processors are appropriately staggered on the disks to ensure that we can write blocks of messages to consecutively numbered processors in a single parallel I/O when b' mod D ≠ 0. Let T_j = ⌈jvb'/D⌉ be the track offset for msg_0j (the first such message destined for processor j). Let d_j = jvb' mod D be the disk offset (from disk 0) for the first block of msg_0j. The q-th block of msg_ij is assigned to disk (d_j + ib' + q) mod D on track T_j + ⌊(d_j + ib' + q)/D⌋. This storage scheme maintains what we will call the messaging matrix across the parallel disks. The messages destined to a particular virtual processor are stored in a band, or stripe, of consecutive parallel tracks.

Outgoing message blocks are placed in a FIFO queue for servicing by procedure DiskWrite. DiskWrite removes at most D blocks from the queue in each write cycle and writes them to the disks in a single write operation. Blocks are serviced strictly in FIFO order. Blocks will be added to the current write cycle and removed from the queue until a block is encountered whose disk number conflicts with that of an earlier block in the current write cycle.

Since the messages destined for any given processor are stored in consecutive format on the disks, we can read the messages received by a virtual processor using D disks in parallel for every I/O operation. Except possibly for the last, each parallel read performed by the simulation of processor j will obtain D message blocks. By staggering the first message blocks for consecutive virtual processors across the disks, we can achieve fully parallel writes to the disks.

The scheme just described requires two copies of the messaging matrix because the messages generated by virtual processor i in compound superstep k must be stored on disk before virtual processor i+1 can read the messages generated for it in compound superstep k−1. We can avoid this extra space requirement, however, as follows.

Observation 2 By alternating between {consecutive reads, staggered writes} and {staggered reads, consecutive writes} in successive compound supersteps, the simulation can achieve fully parallel I/O on all message blocks with a single copy of the messaging matrix.

We now begin with the proof of Lemma 3.

Algorithm SeqCompoundSuperstep loads each virtual processor into the real memory, requiring that M = Ω(μ). Since the messages sent or received in a superstep by a virtual processor are h = Θ(N/v) in total size, we require that DB = O(N/v) to ensure that our I/O scheme for messages is efficient. This means that D = O(N/(vB)).

Disk Space: The disk space needed by the simulation is the total context size O(vμ), which includes space for messages. At most one track is wasted for each virtual processor. The space used on each disk is O(vμ/(DB)), since μ = Ω(DB).

Computation Time: Steps (a) and (e) of algorithm SeqCompoundSuperstep require computation time O(vμ). In Steps (b) and (d), O(N/v) message data is routed for each virtual processor. Over all v processors, this adds O(N) computation time overall, which can be ignored. Step (c) consumes vτ computation time.

I/O Time: Steps (a) and (e) consume O(vμ/(DB))·G time, and Steps (b) and (d) consume O(N/(DB))·G time. Since Θ(N/v) message data is sent in each superstep, and μ = Ω(N/v), we have time O(vμ/(DB))·G due to I/O overall.

Thus, overall, the computation time is vτ + O(vμ) and the I/O time is O(vμ/(DB))·G. This concludes the proof of Lemma 3. □

Theorem 2 A v-processor CGM algorithm A with λ supersteps, local memory size μ, running time β + g·O(λN/v) + λL, and message size Θ(N/v²) can be simulated as a single processor EM-CGM algorithm A' with time vβ + O(λvμ) + G·O(λvμ/(DB)) for M = Θ(μ), B = O(N/v²), and N = Ω(vDB). In particular, algorithm A' is c-optimal if A is c-optimal, β = ω(λμ), and G = DB·o(β/(λμ)). Furthermore, algorithm A' is work-optimal and I/O-efficient if A is work-optimal and communication-efficient, β = Ω(λμ), and G = DB·O(β/(λμ)).

Proof Sketch. We use the results of Lemma 3. The computation time required to simulate the computation steps of A is vβ. The computational overhead associated with the I/O steps (Steps (a),(b),(d),(e)) is O(λvμ) + O(λN). Since vμ > N, the total computation time is bounded by vβ + O(λvμ). When c-optimality is required, we therefore need O(λvμ) = o(vβ), or β = ω(λμ). Note that when μ = Θ(N/v), we can substitute β = ω(λN/v) for β = ω(λμ). For work-optimality, we require that O(λvμ) = O(vβ), or β = Ω(λμ).

The I/O time (Steps (a),(b),(d),(e)) is G·[O(λvμ/(DB)) + O(λN/(DB))], which is bounded by G·O(λvμ/(DB)). For c-optimality, we require the I/O time to be in o(vβ), which means that G = DB·o(β/(λμ)). For I/O-efficiency, we require the I/O time to be in O(vβ), which means that G = DB·O(β/(λμ)). □
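The block placements defined by the consecutive and staggered formats can be stated as a pair of small functions (an illustrative Python sketch; the names consecutive_place and staggered_place are ours, not the paper's). Each maps a block to a (disk, track) pair following the definitions above:

```python
def consecutive_place(d, T0, q, D):
    """Block q (0 <= q <= D-1) of a consecutive read/write:
    disk (d+q) mod D, track T0 + (d+q) // D."""
    return ((d + q) % D, T0 + (d + q) // D)

def staggered_place(d, T0, j, q, S, D):
    """Block q of message j in a staggered read/write:
    disk (d + j*S + q) mod D, track T0 + (d + j*S + q) // D."""
    off = d + j * S + q
    return (off % D, T0 + off // D)
```

With D = 4 and stagger S = 1, a consecutive operation on D blocks touches all D disks, and the 0-th blocks of D consecutive messages also fall on D distinct disks, which is what makes the fully parallel writes of Step (d) possible.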

3 Multiple Processor Target Machine

For the case of p ≥ 1 processors on the EM-CGM machine we simulate a compound superstep of a CGM algorithm A using the algorithm ParCompoundSuperstep, shown below. Unlike in the case of a single real processor, we are now forced to perform real communication between the real processors of the target machine. Each real processor i, 0 ≤ i ≤ p−1, executes algorithm ParCompoundSuperstep in parallel. For ease of exposition, we assume that p divides v.

Algorithm 3: ParCompoundSuperstep

Objective: Simulation of a compound superstep of a v-processor CGM on a p-processor EM-CGM.

Input: The message and context blocks of the virtual processors are divided among the real processors and their local disks. Each real processor i, 0 ≤ i ≤ p−1, holds O(N/(pB)) blocks of messages and vμ/(pB) blocks of context, and each local disk contains O(N/(pDB)) blocks of messages and vμ/(pDB) blocks of context.

Output: The changed contexts and generated messages distributed as required for the next compound superstep.

For j = 0 to v/p − 1 do:
(a) Read the context for virtual processor i·(v/p) + j from the local disks.
(b) Read any message blocks addressed to virtual processor i·(v/p) + j from the local disks.
(c) Simulate the computation supersteps of virtual processor i·(v/p) + j, collecting all generated messages in the local internal memory.
(d) Send all generated messages to the required (real) destination processor. Upon arrival, the messages are arranged within the internal memory of the real destination processor and then written to its disks as in the single processor simulation; see Algorithm 2.
(e) Write the contexts for virtual processor i·(v/p) + j back to the local disks; see Algorithm 2.

Lemma 4 A compound superstep of a v-processor CGM algorithm A with computation time τ + L, communication time g·O(N/v) + L, message size Θ(N/v²), and local memory size μ can be simulated as v/p compound supersteps of a p-processor EM-CGM algorithm A' in parallel computation time (v/p)τ + O((v/p)μ) + (v/p)L and I/O time G·O((v/p)(μ/(DB))) + (v/p)L, for p ≤ v, N = Ω(vDB), and B = O(N/v²).

Proof. We have p ≤ v real processors, so the time to simulate Step (c) is (v/p)τ. Computational overhead is contributed by O((v/p)μ) + O(N/p), due to swapping of contexts (Steps (a),(e)) and messaging I/O (Steps (b),(d)). As before, the computational overhead is dominated by the cost of swapping contexts. The original compound superstep is replaced by v/p compound supersteps on the target machine.

The I/O time is determined by the cost of swapping contexts plus the cost of simulating the original messaging via I/O. These costs are G·O((v/p)(μ/(DB))) and G·O(N/(pDB)) respectively, and the total is dominated by G·O((v/p)(μ/(DB))). □

Theorem 3 A v-processor CGM algorithm A with λ supersteps, computation time β + λL, communication time gα + λL, local memory size μ, and message size Θ(N/v²) can be simulated as a p-processor EM-CGM algorithm A' with computation time (v/p)β + O(λ(v/p)μ) + λ(v/p)L, communication time (v/p)gα + λ(v/p)L, and I/O time G·O(λ(v/p)(μ/(DB))) + λ(v/p)L, for p ≤ v, M = Θ(μ), N = Ω(vDB), and B = O(N/v²).

Let g(N), L(N), and v(N) be increasing functions of N. If A is c-optimal on the CGM for g ≤ g(N), L ≤ L(N), and v ≤ v(N), then A' is a c-optimal EM-CGM algorithm for β = ω(λμ), g ≤ g(N), G = BD·o(β/(λμ)), and L ≤ (p/v)·L(N). A' is work-optimal, communication-efficient, and I/O-efficient if A is work-optimal and communication-efficient, β = Ω(λμ), g ≤ g(N), G = BD·O(β/(λμ)), and L ≤ (p/v)·L(N).

Proof Sketch. We use the results of Lemma 4. The computation time required to simulate the computation steps of A is (v/p)β. The computational overhead associated with the I/O and communication steps (Steps (a),(b),(d),(e)) is O(λ((v/p)μ + N/p)). Since μ = Ω(N/v), the total computation time is bounded by (v/p)β + O(λ(v/p)μ). When c-optimality is required, we need β = ω(λμ). Note that in many cases μ = Θ(N/v). Also, when only work-optimality is required, β = Ω(λμ) suffices.

The I/O time (Steps (a),(b),(d),(e)) is G·[O(λ(v/p)(μ/(DB)) + λN/(pDB))], which is bounded by G·O(λ(v/p)(μ/(DB))). For c-optimality, we require the I/O time to be in o((v/p)β), which means that G = DB·o(β/(λμ)). For I/O-efficiency we need only that G = DB·O(β/(λμ)). Since the number of supersteps increases by a factor of v/p, we require that L ≤ (p/v)·L(N). □

4 Conclusions and Future Work

The key result of this paper is a deterministic simulation theorem that maps algorithms for coarse grained parallel computing models to external memory algorithms for the parallel disk model with multiple processors. As a corollary we obtain external memory algorithms for a spectrum of problems, and all of these algorithms are scalable in terms of the number of disks and processors.

Our recent work involves replacing Algorithm 1 by another algorithm for achieving fixed size messages in such a way that the slackness condition for the problem size improves to N ≥ v^(1+ε)·B, where ε is a constant, 0 < ε ≤ 1. Moreover, this provides a better bound for M: M ≥ N^(1/(1+ε)). However, the number of supersteps increases by a factor of at most 2⌈1/ε⌉.

References

[1] AGGARWAL, A., AND VITTER, J. S. The Input/Output complexity of sorting and related problems. Communications of the ACM 31, 9 (1988), 1116–1127.

[2] ARGE, L. Efficient External-Memory Data Structures and Applications. PhD thesis, University of Aarhus, February/August 1996.

[3] ARGE, L., KNUDSEN, M., AND LARSEN, K. A general lower bound on the I/O-complexity of comparison-based algorithms. In Proc. Workshop on Algorithms and Data Structures, LNCS 709 (1993), pp. 83–94.

[4] BADER, D., HELMAN, D., AND JAJA, J. Practical parallel algorithms for personalized communication and integer sorting. Journal of Experimental Algorithmics 1 (1996). http://www.jea.acm.org/1996/BaderPersonalized/.

[5] CACERES, E., DEHNE, F., FERREIRA, A., FLOCCHINI, P., RIEPING, I., SANTORO, N., AND SONG, S. Efficient parallel graph algorithms for coarse grained multicomputers and BSP. In Proc. Int. Colloquium on Algorithms, Languages and Programming, LNCS 1256 (1997), pp. 390–400.

[6] CHIANG, Y.-J., GOODRICH, M. T., GROVE, E. F., TAMASSIA, R., VENGROFF, D. E., AND VITTER, J. S. External-memory graph algorithms. In Proc. ACM-SIAM Symp. on Discrete Algorithms (1995), pp. 139–149.

[7] CORMEN, T. H. Personal communication, 1997.

[8] CORMEN, T. H., AND GOODRICH, M. T. Position Statement, ACM Workshop on Strategic Directions in Computing Research: Working Group on Storage I/O for Large-Scale Computing. ACM Computing Surveys 28A(4) (Dec. 1996).

[9] CORMEN, T. H., SUNDQUIST, T., AND WISNIEWSKI, L. F. Asymptotically tight bounds for performing BMMC permutations on parallel disk systems. Tech. Rep. PCS-TR94-223, Dartmouth College Dept. of Computer Science, July 1994. To appear in SIAM J. on Computing.

[10] DEHNE, F., DENG, X., DYMOND, P., FABRI, A., AND KHOKHAR, A. A randomized parallel 3d convex hull algorithm for coarse grained multicomputers. In Proc. ACM Symp. on Parallel Algorithms and Architectures (1995), pp. 27–33.

[11] DEHNE, F., DITTRICH, W., AND HUTCHINSON, D. Efficient external memory algorithms by simulating coarse-grained parallel algorithms. In Proc. ACM Symp. on Parallel Algorithms and Architectures (1997), pp. 106–115.

[12] DEHNE, F., DITTRICH, W., HUTCHINSON, D., AND MAHESHWARI, A. Parallel virtual memory. 2-page abstract to appear in Proc. 10th ACM-SIAM Symp. on Discrete Algorithms (1999), Baltimore, USA.

[13] DEHNE, F., FABRI, A., AND RAU-CHAPLIN, A. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. ACM Annual Conference on Computational Geometry (1993), pp. 298–307.

[14] DITTRICH, W., HUTCHINSON, D., AND MAHESHWARI, A. Blocking in parallel multisearch problems. In Proc. ACM Symp. on Parallel Algorithms and Architectures (1998). (To appear).

[15] FLOYD, R. W. Permuting information in idealized two-level storage. In Complexity of Computer Calculations (1972), pp. 105–109. R. Miller and J. Thatcher, Eds. Plenum, New York.

[16] GERBESSIOTIS, A. V., AND VALIANT, L. G. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing (1994).

[17] GIBSON, G. A., VITTER, J. S., AND WILKES, J. Strategic directions in storage I/O issues in large-scale computing. ACM Computing Surveys 28(4) (Dec. 1996), 779–793.

[18] GOODRICH, M. Communication efficient parallel sorting. In Proc. ACM Symp. on Theory of Computation (1996).

[19] MEHLHORN, K. Personal communication, Dagstuhl seminar on data structures, 1998.

[20] NODINE, M. H., AND VITTER, J. S. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proc. ACM Symp. on Parallel Algorithms and Architectures (1993), pp. 120–129.

[21] SIBEYN, J., AND KAUFMANN, M. BSP-like external-memory computation. In Proc. 3rd Italian Conference on Algorithms and Complexity (1997).

[22] STEVENS, R. W. Advanced Programming in the Unix Environment. Addison-Wesley, Don Mills, Ont., Canada, 1995.

[23] VALIANT, L. G. A bridging model for parallel computation. Communications of the ACM 33, 8 (August 1990), 103–111.

[24] VISHKIN, U. Can parallel algorithms enhance serial implementations? Communications of the ACM 39(9) (1996), 88–91.

[25] VISHKIN, U. From algorithm parallelism to instruction-level parallelism: An encode-decode chain using prefix-sum. In Proc. ACM Symp. on Parallel Algorithms and Architectures (June 1997), pp. 260–270.

[26] VITTER, J. S. External memory algorithms. In Proc. ACM Symp. on Principles of Database Systems (1998), 119–128.

[27] VITTER, J. S. Personal communication, 1998.

[28] VITTER, J. S., AND SHRIVER, E. A. M. Algorithms for parallel memory, I: Two-level memories. Algorithmica 12, 2–3 (1994), 110–147.

[29] VITTER, J. S., AND SHRIVER, E. A. M. Algorithms for parallel memory, II: Hierarchical multilevel memories. Algorithmica 12, 2–3 (1994), 148–169.