Gossamer: A Lightweight Programming Framework for Multicore Machines∗

Joseph A. Roback and Gregory R. Andrews
Department of Computer Science, The University of Arizona, Tucson

∗This work was supported in part by NSF Grants CNS-0410918 and CNS-0615347.

1 Introduction

The key to performance improvements in the multicore era is for software to utilize the available concurrency. A recent paper [3] summarizes the challenges and describes twelve so-called dwarfs—types of computing and communication patterns that occur in parallel programs. One of the key points in the paper is that a general programming model has to be able to accommodate all of the patterns defined by the dwarfs, singly or in combination. The challenge is to do so both simply and efficiently.

Parallel programming can be supported by providing a threads library [14, 15, 2]; by modifying compilers to extract parallelism or use optimistic code-execution techniques [4, 6, 5]; by using concurrency features of existing languages such as Java or C#; by designing new programming languages such as Erlang [16], Fortress [24], X10 [22], and ZPL [7]; or by annotating sequential programs with directives that specify concurrency and synchronization, as in Cilk [11], Cilk++ [8], OpenMP [20], and others [25, 19, 1, 12, 17, 21].

All of these approaches are valuable and are producing useful results, but the last approach—annotating programs—has, in our opinion, the most potential to be simultaneously simple, general, and efficient. In particular, annotations are easier to use than libraries because they hide lots of bookkeeping details, and they are simpler to learn than an entire new programming language. Annotation-based approaches also have efficient implementations. However, no existing approach is general enough to support all the computational patterns (dwarfs) defined in [3].

This paper describes Gossamer, an annotation-based approach that is general as well as simple and efficient. Gossamer has three components: (1) a set of high-level annotations that one adds to a sequential program (C in our case) in order to specify concurrency and synchronization; (2) a source-to-source translator that takes an annotated sequential program and produces an optimized program that uses our threading library; and (3) a run-time system that provides efficient fine-grained threads and high-level synchronization constructs.

As will be seen, the Gossamer annotations are as simple to use as those of Cilk++, and Gossamer's performance is better than OpenMP's. What sets Gossamer apart is a more extensive set of annotations that enable solving a greater variety of applications. In addition to iterative and recursive parallelism, Gossamer supports pipelined computations by a general ordering primitive, domain decomposition by means of replicated code patterns, and MapReduce [10, 26] computations by means of an efficient associative memory type.

The paper is organized as follows. Section 2 introduces our programming model and annotations by means of numerous examples. Section 3 summarizes the Gossamer translator and run-time system. Section 4 gives experimental results. Section 5 discusses related work.

2 Annotations

Gossamer provides 15 simple annotations, as listed in Table 1: ten to specify concurrency and synchronization and five to program MapReduce computations. The fork annotation supports task and recursive parallelism. The parallel annotation supports iterative parallelism that occurs when all iterations of a for loop are independent. The divide/replicate annotation supports data parallelism that occurs when shared data can be decomposed into independent regions, and the same code is to be executed on each region.

    Use              Annotations
    Concurrency      fork, parallel, divide/replicate
    Synchronization  atomic, barrier, buffered, copy, join, ordered, shared
    MapReduce        mr_space, mr_list, mr_put, mr_getkey, mr_getvalue

Table 1: Gossamer Annotations

For synchronization, join is used to wait for forked threads to complete; copy is used when an array needs to be passed by value; barrier provides barrier synchronization within replicated code blocks; ordered{code} delays the execution of code until the predecessor sibling thread has itself finished executing code; buffered{write calls} buffers the writes in a local buffer, which is flushed when the thread terminates; buffered(ordered) causes thread buffers to be flushed in sibling order; and atomic{code} causes code to be executed atomically.

Gossamer supports the MapReduce programming model by means of three operations: mr_put, which deposits a (key,value) pair into an associative memory having type mr_space; mr_getkey, which returns from an mr_space a key and a list (of type mr_list) of all values having that key; and mr_getvalue, which returns the next value from an mr_list.

Below we illustrate the use of the annotations by means of several examples, in which the annotations are highlighted in boldface.

    void qsort(int *begin, int *end) {
        if (begin != end) {
            int *middle;
            end--;
            middle = partition(begin, end, *end);
            swap(end, middle);
            fork qsort(begin, middle);
            fork qsort(++middle, ++end);
            join;
        }
    }

Figure 1: Quicksort

    int solutions = 0;

    void putqueen(char **board, int row) {
        int j;
        if (row == n) {
            atomic { solutions++; }
            return;
        }
        for (j = 0; j < n; j++) {
            if (isOK(board, row, j)) {
                board[row][j] = 'Q';
                fork putqueen(copy board[n][n], row+1);
                board[row][j] = '-';
            }
        }
        join;
    }

Figure 2: N-Queens Problem

    int main(int argc, char **argv) {
        ...
        while (!feof(infp)) {
            insize = fread(in, 1, BLKSIZE, infp);
            fork compressBlk(copy in[insize], insize);
        }
        join;
        ...
    }

    void compressBlk(char *in, int insize) {
        ...
        BZ2_bzCompress(in, insize, out, &outsize);
        ordered { fwrite(out, 1, outsize, outfp); }
        ...
    }

Figure 3: Bzip2

Quicksort is a classic divide-and-conquer algorithm that divides a list into two sub-lists and then recursively sorts each sub-list. Since the sub-lists are independent, they can be sorted in parallel. An annotated version of quicksort is shown in Figure 1.

N-Queens is a depth-first backtracking algorithm [9]. It tries to solve the problem of placing N chess queens on an NxN chessboard such that no one queen can capture any other. An annotated version of a recursive N-Queens algorithm is shown in Figure 2. Each attempt at placing a queen on the board is forked and checked in parallel using the recursive putqueen() function.

Two issues arise from parallelizing putqueen(). First, the global variable solutions is incremented every time a solution is found, so an atomic annotation is added to ensure that updates are atomic. Second, the board array is passed by reference on each call to putqueen(), which would cause the board to be shared among the putqueen() threads. A copy annotation is used to give each thread its own copy of the board.

Bzip2 [23] compression uses the Burrows-Wheeler algorithm, followed by a move-to-front transform, and then Huffman coding. Compression is performed on independent blocks of data, which lends itself to block-level parallelism. A parallel version of bzip2 using Gossamer annotations is shown in Figure 3. The main() function reads blocks of data and forks compressBlk() to compress each block in parallel; it then waits for all threads to complete. The ordered annotation is used within compressBlk() to ensure that compressed blocks are output in the same order that uncompressed blocks are read from input.

Matrix Multiplication is an example of iterative data parallelism. The program in Figure 4 gives a version of matrix multiplication that uses cache efficiently by transposing the inner loops of the usual algorithm. All iterations of the outer loop are independent because they work on different rows of the result matrix, so the outer loop is prefixed by the parallel annotation. This results in n tasks.

    double **A, **B, **C;
    // initialize the arrays
    parallel for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            for (j = 0; j < n; j++)
                C[i][j] += A[i][k] * B[k][j];

Figure 4: Matrix Multiplication

Jacobi Iteration is a simple method for approximating the solution of a partial differential equation such as Laplace's equation in two dimensions: ∇²Φ = 0. Given boundary values for a region, the solution is the steady values of interior points. These values can be approximated by discretizing the region using a grid of equally spaced points, and repeatedly updating each grid point by setting it to the average of its four neighbors from the previous iteration.

Figure 5 shows an annotated program. Here we use the divide/replicate annotation because we want to execute the update algorithm on sub-regions of the old and new grids. Here, divide partitions the shared grids along the first dimension, so each processor gets a strip of rows; Gossamer takes care of the bookkeeping required to redefine the loop bounds in each instance of the computation. Each computation first computes new values using the old grid points. A barrier annotation is used to ensure that all threads have completed this step before continuing. Then the computation computes old values using the new grid points. Again a barrier annotation is used to ensure that all threads complete this step before continuing. The update is repeated MAXITERS times.

    double **old, **new;
    int i, j, it, n, m;
    ...
    old++; new++; n -= 2;
    divide old[n][], new[n][] replicate {
        for (it = 0; it < MAXITERS; it += 2) {
            for (i = 0; i < n; i++)
                for (j = 1; j < m-1; j++)
                    new[i][j] = (old[i-1][j] + old[i+1][j] +
                                 old[i][j-1] + old[i][j+1]) * 0.25;
            barrier;
            for (i = 0; i < n; i++)
                for (j = 1; j < m-1; j++)
                    old[i][j] = (new[i-1][j] + new[i+1][j] +
                                 new[i][j-1] + new[i][j+1]) * 0.25;
            barrier;
        }
    }

Figure 5: Jacobi Iteration

Run Length Encoding (RLE) compresses data by converting sequences of the same value to (value,count) pairs. Figure 6 shows an RLE implementation that scans an array byte-by-byte, recording each run and writing the (value,count) pairs to output. The divide annotation partitions data into equal-sized chunks and assigns one to each processor. The original sequential code inside the replicate annotation is executed concurrently on each chunk. Note that here the replicated code is a while loop over the input.

The where annotation adjusts chunk boundaries to the right as necessary to ensure that data is not split at points that would break up runs of identical values. (Without the where clause, the output would be a legal encoding, but it might not be the same as the output of the original sequential program.) The buffered annotation is used to buffer fwrite() calls in local storage, which greatly improves efficiency. The ordered option on the buffered annotation ensures that output is written in the correct order.

    FILE *out_fp;
    char *data;
    int size, run, val;

    divide data[size]
        where data[divide_left] != data[divide_right]
    replicate {
        while (size > 0) {
            val = *data++; size--; run = 1;
            while (val == *data && size > 0) {
                run++; data++; size--;
                if (run == RUNMAX) { break; }
            }
            buffered (ordered) {
                fwrite(&val, sizeof(int), 1, out_fp);
                fwrite(&run, sizeof(int), 1, out_fp);
            }
        }
    }

Figure 6: Run Length Encoding

Word Count calculates the number of occurrences of each word in a set of data files. This problem can be solved using the MapReduce style. The map phase finds all the words in the files; the reduce phase counts the number of instances of each word. Figure 7 gives a Gossamer implementation that uses the fork and join annotations for concurrency and the MapReduce annotations to implement intermediate storage.

The map phase forks a thread for each input file. Each thread reads the data in a file and emits words to the mr_space wordcount, which is an associative memory of (key,value) pairs. The reduce phase forks a thread for each word in wordcount. Each reduce thread extracts all of its words from wordcount, counts them, and writes its count. When the program terminates, wordcount is empty.

    mr_space wordcount(char *, int);

    main(void) {
        char *key;
        mr_list values;
        for (i = 0; i < n; i++)
            fork map(file[i]);
        join;
        while (mr_getkey(wordcount, &key, &values))
            fork reduce(key, values);
        join;
    }

    void map(char *file) {
        char *word;
        while ((word = getnextword(file)) != NULL)
            mr_put(wordcount, word, 1);
    }

    void reduce(char *key, mr_list values) {
        int val, count = 0;
        while (mr_getvalue(wordcount, values, &val))
            count += val;
        printf("word: %s, count: %d\n", key, count);
    }

Figure 7: Word Count using MapReduce

3 Translator and Run-time System

The Gossamer translator turns an annotated C program into an executable program. First it parses the annotated program and performs various analyses and transformations. Then the intermediate code is transformed back into C code that uses the Gossamer run-time library to implement the annotations. Finally the transformed source code is transparently piped into the GNU C Compiler for compilation and linking into an executable.

The Gossamer run-time system is implemented using the POSIX threads library (Pthreads) [14]. In particular, Gossamer uses Pthreads to create one server thread per processor on the underlying hardware platform.

Application-level threads in Gossamer are called filaments [18]. Each filament is a lightweight structure represented by a function pointer and its arguments. Filaments do not have private stacks; instead each filament uses the stack of the server thread on which the filament is executed. A filament cannot be preempted; once started, it executes until it terminates. This is necessary since filaments do not have private stacks, and it avoids the overhead of requiring machine-dependent context-switching code.

A fork annotation creates a new filament for the annotated function call and its arguments. The filament is enqueued with the run-time system using a round-robin algorithm to promote load balancing. Recursive function calls annotated with fork are executed sequentially if a pruning threshold is reached. (This threshold can be set by the user; its default value is twice the number of processors.) For a parallel for loop, one filament is created for each iteration of the loop, with the loop iteration variable as its argument. These filaments are enqueued on server threads in groups of n/P, where n is the number of iterations and P is the number of processors. This is done to ensure good memory and cache performance, since most loops operate on large data sets and exhibit spatial locality. The divide/replicate annotation creates a filament for the replicate block on each processor and evenly divides the specified arrays among them. Constraints on division specified by the where annotation are evaluated concurrently by each filament. Each barrier annotation is implemented using a dissemination barrier [13] for scalability.
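The paper does not show the barrier code itself, so the following is a minimal sketch of a reusable, sense-reversing dissemination barrier in the spirit of [13], written here against C11 atomics rather than Gossamer's internal interfaces; the names (gos_barrier, flags, nthreads, and so on) are illustrative assumptions, not Gossamer's actual run-time API.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_THREADS 64
    #define MAX_ROUNDS  6      /* ceil(log2(MAX_THREADS)) */

    static int nthreads;       /* number of server threads, P */
    /* flags[bank][thread][round]; both banks start out all-false */
    static atomic_bool flags[2][MAX_THREADS][MAX_ROUNDS];

    typedef struct {           /* per-thread barrier state */
        int  bank;             /* which flag bank this episode uses     */
        bool sense;            /* flag value written/expected this time */
    } gos_barrier_state;

    void gos_barrier_init(gos_barrier_state *st) {
        st->bank  = 0;
        st->sense = true;
    }

    /* Called by every server thread with its own id in 0..nthreads-1. */
    void gos_barrier(gos_barrier_state *st, int self) {
        for (int round = 0, dist = 1; dist < nthreads; round++, dist <<= 1) {
            int partner = (self + dist) % nthreads;
            /* signal the partner for this round ... */
            atomic_store(&flags[st->bank][partner][round], st->sense);
            /* ... and spin until this thread has been signalled in turn */
            while (atomic_load(&flags[st->bank][self][round]) != st->sense)
                ;
        }
        /* Alternate banks; flip the sense each time a bank is about to be
           reused, so stale flags from two episodes ago cannot satisfy a wait. */
        if (st->bank == 1)
            st->sense = !st->sense;
        st->bank = 1 - st->bank;
    }

Each thread performs only ceil(log2 P) signal/wait rounds per barrier episode, so the cost grows logarithmically with the number of server threads, which is presumably the reason for preferring a dissemination barrier over a central counter.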
For an ordered annotation, code is placed before the ordered block to check if the filament's previous sibling has completed its corresponding ordered block. If so, the filament continues executing; otherwise the filament waits. Code is also placed after the ordered block to indicate completion and to signal the next filament. A buffered annotation is implemented by generating wrapper functions that buffer output, and by replacing all write() and fwrite() system calls with calls to these functions. The atomic annotation wraps the code block with either mutex locks or architecture-independent spinlocks; when the only statement(s) inside the block are assignments, and each statement is associative and commutative, then the assignments are performed as reduction operations instead of using locks.

MapReduce functions are replaced by type-specific implementations determined by the types specified in the mr_space on which they operate. MapReduce keys and values can be any C primitive type or null-terminated strings. For efficiency, keys are hashed into per-processor linked lists of key/value arrays; the list is used to chain key collisions on the same processor. Insertion thus does not require locking; only key (but not value) removal requires locking.
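To make the two lowering strategies for atomic described above concrete, here is a hand-written sketch of what a translator could emit for the atomic { solutions++; } block of Figure 2. It is only an illustration of the idea: the emitted names are hypothetical, the locked path uses plain Pthreads, and the reduction path uses a GCC atomic builtin (an actual reduction could equally well accumulate per-filament partial counts that are merged at the join).

    #include <pthread.h>

    int solutions = 0;

    /* General case: wrap the block body in a lock.  The lock would be
       created once at startup with
       pthread_spin_init(&solutions_lock, PTHREAD_PROCESS_PRIVATE); */
    static pthread_spinlock_t solutions_lock;

    static void atomic_block_locked(void) {
        pthread_spin_lock(&solutions_lock);
        solutions++;                          /* original block body */
        pthread_spin_unlock(&solutions_lock);
    }

    /* Special case: the block is a single associative, commutative
       update, so it can be performed as a lock-free reduction. */
    static void atomic_block_reduction(void) {
        __sync_fetch_and_add(&solutions, 1);  /* GCC atomic builtin */
    }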

4 Experimental Results

The Gossamer package has been tested on the seven applications shown earlier plus n-body and multigrid. These cover a range of programming styles and computation/communication models. Timing tests were run on an Intel Xeon 8-core multiprocessor system having two quad-core processors running at 2.0 GHz with 12 Mbyte of L2 cache and 8 Gbyte of memory; the operating system was Ubuntu Linux 9.10. The applications were compiled with the GNU C compiler version 4.4.1 using -O3 optimization. No hand-tuning was done, so the applications do not get the best possible performance.

Table 2 gives execution times and speedups for the algorithmic portion of each of the 9 applications tested. All timing tests were run in single-user mode. The execution times are the averages of ten test runs, as reported by gettimeofday(), rounded to the nearest hundredth of a second. The reported speedups are the ratios of the sequential program times to the other times.

                                              Execution Time (seconds)                 Speedup (relative to sequential)
    Application            Parameters         Sequential   1-CPU   2-CPU   4-CPU   8-CPU    1-CPU   2-CPU   4-CPU   8-CPU
    Quicksort              n = 100,000,000         17.86   17.41    9.04    6.45    5.65    1.026    1.98    2.77    3.16
    N-Queens               n = 14                   9.29   10.41    5.01    2.42    1.31     0.89    1.86    3.85    7.11
    Bzip2                  file = 256.0 MB         46.06   46.37   23.88   12.58    6.41     0.99    1.93    3.66    7.19
    Matrix Multiplication  n = 4096               198.51  193.06   96.18   47.42   23.74     1.03    2.06    4.19    8.36
    Jacobi Iteration       grid = 1024x1024,      277.99  278.04  124.89   67.99   38.94     1.00    2.23    4.09    7.14
                           iterations = 65536
    Run Length Encoding    file = 4.0 GB            6.23    6.48    3.23    1.62    0.84     0.96    1.93    3.85    7.44
    Word Count             files = 1024,          124.06  125.08   61.32   35.48   19.05     0.99    2.02    3.50    6.51
                           file size = 2.0 MB
    Multigrid              grid = 1024x1024,        0.15    0.16    0.08    0.04    0.02     0.97    1.99    3.78    6.29
                           iterations = 512
    N-Body                 bodies = 32768,        193.10  193.05   97.07   49.15   25.36     1.00    1.99    3.93    7.61
                           iterations = 16384,
                           tree builds = 16

Table 2: Performance Results
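As a concrete reading of the table, a speedup entry is the sequential time divided by the corresponding parallel time; for Bzip2 on 8 CPUs it follows from

    46.06 s / 6.41 s ≈ 7.19

and, similarly, 17.86 s / 5.65 s ≈ 3.16 for Quicksort on 8 CPUs.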

Gossamer demonstrates excellent speedups on most tests. The super-linear speedup achieved for Jacobi iteration and matrix multiplication is due to cache effects, because both algorithms can make effective use of the doubling of shared L2 cache that is available to the run-time when using both processor sockets. The performance of multigrid falls off somewhat on 8 processors because it uses smaller grids and has more barrier synchronization points. Quicksort has the worst performance on 4 and 8 processors because array partitioning is a sequential bottleneck, and the function has to be forked several times before all processors have work.

Measurements of the total execution time show very similar numbers. This is because most programs do relatively little initialization—or the initialization is part of the timed portion of the algorithm. Run length encoding is the one program that has poor speedup numbers for overall execution time—1.35 on 4 cores and 1.42 on 8 cores—because the entire 4.0 GB file is read before parallel execution begins.

5 Related Work

Gossamer is an annotation-based approach, so the most closely related work is Cilk++ [8] and OpenMP [20]. Cilk++ provides three keywords to express parallelism and synchronization: cilk_spawn and cilk_sync for task and recursive parallelism, and cilk_for for loop parallelism. Cilk++ also provides library support for generic mutex locks and C++ templates called hyperobjects that allow Cilk strands to coordinate when updating a shared variable or performing a reduction operation. An output stream hyperobject can be used to achieve the same behavior as Gossamer's buffered annotation. Cilk++ cannot (at least not directly) support all computational dwarfs, because it lacks annotations equivalent to Gossamer's divide/replicate, barrier, copy, and MapReduce.

OpenMP supports shared-memory programming in C, C++, and Fortran on many architectures. OpenMP consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. The OpenMP execution model consists of a master thread that forks worker threads and divides tasks among them. A parallel code section is marked with a preprocessor directive that causes threads to be created before the section is executed. Task parallelism and data parallelism can be achieved using work-sharing constructs to divide a task among the threads. OpenMP also supports reductions on scalar variables. OpenMP lacks annotations equivalent to Gossamer's buffered and MapReduce. Also, OpenMP's ordered construct is restricted to use only inside parallel for loops. Timing tests on recursive programs like quicksort and n-queens show that Gossamer is orders of magnitude faster due to the overhead created by OpenMP's inability to automatically prune parallel recursive calls.
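For readers unfamiliar with OpenMP's directive style, a representative work-sharing loop with a scalar reduction looks like the following; this is standard OpenMP (compiled with, e.g., gcc -fopenmp), not Gossamer syntax, and the dot-product function is purely illustrative.

    /* Dot product: the master thread forks a team, loop iterations are
       divided among the threads, and each thread's partial sum is
       combined by the reduction clause. */
    double dot(const double *a, const double *b, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

For loop-level parallelism this is close in spirit to Gossamer's parallel annotation in Figure 4; the differences discussed above show up mainly for recursive, pipelined, and MapReduce-style computations.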
Other annotation-based approaches include UPC [25], Intel Ct [12], ATI Stream SDK [1], Nvidia CUDA [19], OpenMP to GPGPU [17], and RapidMind [21]. Each of these targets specific applications and hardware, most commonly stream computations executed on general-purpose graphics processing units. This yields excellent performance, but over a limited range of applications.

Acknowledgement

The referees provided helpful feedback that improved the paper.

References

[1] AMD Corporation. ATI Stream SDK v2.0. http://developer.amd.com/gpu/atistreamsdk/pages/default.aspx, 2009.

[2] Apple Inc. Grand Central Dispatch (GCD) reference, 2009.

[3] Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wessel, D., and Yelick, K. A view of the parallel computing landscape. Commun. ACM 52, 10 (2009), 56–67.

[4] Benkner, S. VFC: The Vienna Fortran Compiler. Sci. Program. 7, 1 (1999), 67–81.

[5] Bik, A., Girkar, M., Grey, P., and Tian, X. Efficient exploitation of parallelism on Pentium III and Pentium 4 processor-based systems, 2001.

[6] Blume, W., Eigenmann, R., Faigin, K., Grout, J., Hoeflinger, J., Padua, D., Petersen, P., Pottenger, W., Rauchwerger, L., Tu, P., and Weatherford, S. Polaris: Improving the effectiveness of parallelizing compilers. In Languages and Compilers for Parallel Computing (1994), Springer-Verlag, pp. 141–154.

[7] Chamberlain, B. L., Choi, S.-E., Lewis, E. C., Snyder, L., Weathersby, W. D., and Lin, C. The case for high-level parallel programming in ZPL. IEEE Comput. Sci. Eng. 5, 3 (1998), 76–86.

[8] Cilk Arts Inc. Cilk++ technical specifications and data sheet. Data sheet, Cilk Arts Inc., September 2009. Available online, http://software.intel.com/en-us/articles/intel-cilk/.

[9] Dahl, O. J., Dijkstra, E. W., and Hoare, C. A. R. Structured Programming. Academic Press Ltd., London, UK, 1972.

[10] Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.

[11] Frigo, M., Leiserson, C. E., and Randall, K. H. The implementation of the Cilk-5 multithreaded language. In PLDI '98: Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation (New York, NY, USA, 1998), ACM, pp. 212–223.

[12] Ghuloum, A., Sprangle, E., Fang, J., Wu, G., and Zhou, X. Ct: A flexible parallel programming model for tera-scale architectures. http://techresearch.intel.com/UserFiles/en-us/File/terascale/Whitepaper-Ct.pdf, October 2007.

[13] Hensgen, D., Finkel, R., and Manber, U. Two algorithms for barrier synchronization. International Journal of Parallel Programming 17, 1 (1988), 1–17.

[14] IEEE. IEEE Std 1003.1-2001, Standard for Information Technology — Portable Operating System Interface (POSIX) System Interfaces, Issue 6, 2001. Revision of IEEE Std 1003.1-1996 and IEEE Std 1003.2-1992; Open Group Technical Standard Base Specifications, Issue 6.

[15] Intel Corporation. Intel Threading Building Blocks. http://www.intel.com/cd/software/products/asmo-na/eng/294797.htm, 2006.

[16] Larson, J. Erlang for concurrent programming. Commun. ACM 52, 3 (2009), 48–56.

[17] Lee, S., Min, S.-J., and Eigenmann, R. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2009), ACM, pp. 101–110.

[18] Lowenthal, D. K., Freeh, V. W., and Andrews, G. R. Efficient support for fine-grain parallelism on shared-memory machines. Concurrency: Practice and Experience 10, 3 (1998), 157–173. http://citeseer.ist.psu.edu/lowenthal99efficient.html.

[19] Nvidia Corporation. Nvidia Compute Unified Device Architecture (CUDA). http://www.nvidia.com/object/cuda_home.html, 2009.

[20] OpenMP Architecture Review Board. OpenMP specifications, v3.0. http://openmp.org/wp/openmp-specifications/, May 2008.

[21] RapidMind Inc. RapidMind Multicore Development Platform. http://www.rapidmind.net/, 2009.

[22] Saraswat, V. A., Sarkar, V., and von Praun, C. X10: Concurrent programming for modern architectures. In PPoPP '07: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2007), ACM, pp. 271–271.

[23] Seward, J. Bzip2, 2008. Available online, http://www.bzip.org/.

[24] Sun Microsystems Inc. The Fortress language specification, v1.0. Tech. rep., Sun Microsystems Inc., March 2008.

[25] UPC Consortium. UPC language specifications, v1.2. Tech. rep., UPC Consortium, 2005.

[26] Yoo, R., Romano, A., and Kozyrakis, C. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Workload Characterization, 2009 (IISWC 2009), IEEE International Symposium on (Oct. 2009), pp. 198–207.
