Feature: Parallel Programming

Parallelizing Bzip2: A Case Study in Multicore Software Engineering

Victor Pankratius, Ali Jannesari, and Walter F. Tichy, University of Karlsruhe

As multicore computers become mainstream, developers need to know which approaches to parallelism work. Four teams competitively parallelized the Bzip2 compression algorithm. The authors report lessons learned.

Multicore chips integrate several processors on a single die, and they're quickly becoming widespread. Being affordable, they make it possible for every PC user to own a truly parallel computer, but they also make parallel programming a concern for more software developers than ever before. Not only is parallel programming considered difficult, but experience with parallel software is limited to a few areas, such as scientific computing, operating systems, and databases. Now that parallelism is within reach for new application classes, new software engineering questions arise.

In the young field of multicore software engineering, many fundamental questions are still open, such as what language constructs are useful, which parallelization strategies work best, and how existing sequential applications can be reengineered for parallelism. At this point, the only way to answer these questions is to try various approaches and evaluate their effectiveness. Previous empirical studies focused on either numeric applications or computers with distributed memory,1–3 but the resulting observations don't necessarily carry over to nonnumeric applications and shared-memory multicore computers.

We conducted a case study of parallelizing a real program for multicore computers using currently available libraries and tools. We selected the sequential Bzip2 compression program for the study because it's a computing-intensive, widely used, and relevant application in everyday life. Its source code is available, and its algorithm is well documented (see the sidebar "Bzip Compression Fundamentals"). In addition, the algorithm is nontrivial, but, with 8,000 LOC, the application is small enough to manage in a course.

The study occurred during the last three weeks of a multicore software engineering course. Eight graduate computer science students participated, working in independent teams of two to parallelize Bzip2 in a team competition. The winning team received a special certificate of achievement.

Competing Team Strategies

Prior to the study, all students had three months' extensive training in parallelization with Posix Threads (PThreads) and OpenMP (see the sidebar "Parallel Programming with PThreads and OpenMP") and in profiling strategies and tools. The teams received no hints for the Bzip2 parallelization task. They could try anything, as long as they preserved compatibility with the sequential version. They could reuse any code, even from existing Bzip2 parallel implementations,4–6 although these implementations were based on older versions of the sequential program and weren't fully compatible with the current version.

Bzip Compression Fundamentals

Bzip uses a combination of techniques to compress data in a lossless way. It divides an input file into fixed-sized blocks that are compressed independently. It feeds each block into a pipeline of algorithms, as depicted in Figure A. An output file stores the compressed blocks at the pipeline's end in the original order. All transformations are reversible, and the stages are passed in the opposite direction for decompression.

■ Pipeline stage 1. A Burrows-Wheeler transformation (BWT) reorders the characters in a block in such a way that similar characters have a higher probability of being closer to one another.1 BWT changes neither the length of the block nor the characters.
■ Pipeline stage 2. A move-to-front (MTF) coding applies a locally adaptive algorithm to assign low integer values to symbols that reappear more frequently.2 The resulting vector can be compressed efficiently.
■ Pipeline stage 3. The well-known Huffman compression technique is applied to the vector obtained in the previous stage.

Julian Seward developed the open source implementation of Bzip2 that we used in our case study.3 It lets block sizes vary in a range of 100 to 900 kbytes. A low-level library comprises functions that compress and decompress data in main memory. The sorting algorithm that's part of the BWT includes a sophisticated fallback mechanism to improve performance. The high-level interface provides wrappers for the low-level functions and adds functionality for dealing with I/O.

References
1. M. Burrows and D.J. Wheeler, A Block-Sorting Lossless Data Compression Algorithm, tech. report 124, Digital Equipment Corp., 10 May 1994.
2. J.L. Bentley et al., "A Locally Adaptive Data Compression Scheme," Comm. ACM, vol. 29, no. 4, 1986, pp. 320–330.
3. J. Seward, Bzip2 v. 1.0.4, 20 Dec. 2006; www.bzip.org.

Figure A. The Bzip2 stages. The input file is divided into fixed-sized blocks that are compressed independently in a pipeline of techniques: input file → Stage 1, Burrows-Wheeler transformation (BWT) → Stage 2, move-to-front (MTF) coding → Stage 3, Huffman compression → compressed output file.
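As a concrete illustration of the low-level library mentioned in the sidebar, the sketch below compresses a single in-memory block with libbz2's BZ2_bzBuffToBuffCompress function. It is our own minimal example, not code from the case study; the buffer sizing follows the library's documented guideline of allowing about 1 percent plus 600 bytes of growth.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <bzlib.h>

/* Compress one in-memory block with the libbz2 low-level API.
   blockSize100k selects the block size: 1 = 100 kbytes ... 9 = 900 kbytes. */
static int compress_block(const char *src, unsigned int srcLen,
                          char **dst, unsigned int *dstLen)
{
    *dstLen = srcLen + srcLen / 100 + 600;   /* worst-case output size */
    *dst = malloc(*dstLen);
    if (*dst == NULL)
        return BZ_MEM_ERROR;

    return BZ2_bzBuffToBuffCompress(*dst, dstLen, (char *)src, srcLen,
                                    9,   /* blockSize100k */
                                    0,   /* verbosity */
                                    0);  /* workFactor: library default */
}

int main(void)
{
    const char *text = "an example input block";
    char *out = NULL;
    unsigned int outLen = 0;

    if (compress_block(text, (unsigned int)strlen(text), &out, &outLen) == BZ_OK)
        printf("compressed %u bytes into %u bytes\n",
               (unsigned int)strlen(text), outLen);

    free(out);
    return 0;
}

Compile with -lbz2. Because each block is compressed independently, calls like this one can in principle run in separate threads, which is exactly the property the teams exploited.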

We asked the teams to document their work from the beginning, including their initial strategies and expectations, the difficulties they encountered during parallelization, their approach, and their effort. In addition to these reports, we collected evidence from personal observations, the submitted code, the final presentations, and interviews with the students after their presentations.7

Because of space limitations, we omit a number of details here, but more information (including threats to validity) is available elsewhere.8

Team 1
The first team tried several strategies. They started with a low-level approach, using a mixture of OpenMP and PThreads. Then they restructured the code by introducing classes. As the submission deadline approached, they reverted to an earlier snapshot and applied some ideas from the Bzip2SMP parallelization.5

Team 1's plan was to understand the code base (one week), parallelize it (one week), and test and debug the parallel version (one week). Actual work quickly diverged from the original plan. At the beginning, the team invested two hours to get a code overview and find the files that were relevant for parallelization. They spent another three to four hours to create execution profiles with gprof (www.gnu.org/software/binutils), KProf (http://kprof.sourceforge.net), and Valgrind (http://valgrind.org).

The team realized that they had to choose input data carefully to find the critical path and keep the data sizes manageable. They invested another two hours in understanding code along the critical path. Understanding the code generally and studying the algorithm took another six hours.9 Thereafter, they decided that parallel processing of data blocks was the most promising approach, but they had problems unraveling existing data dependencies.

The team continued with a parallelization at a low abstraction level, taking about 12 hours. In particular, they parallelized frequently called code fragments with OpenMP and exchanged a sorting routine for a parallel Quicksort implementation using PThreads. However, the speedup was disappointing.

The team decided to refactor the code and improve its readability by introducing classes.

Parallel Programming with PThreads and OpenMP

PThreads and OpenMP add parallelism to C in two different ways. PThreads is a library, while OpenMP extends the language with compiler directives, for example:

#pragma omp parallel for   // OpenMP annotation
for (i = 0; i < n; i++) {
    /* loop body; iterations run in parallel */
}
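The sidebar's OpenMP fragment has no PThreads counterpart in this extract, so the following is our own sketch of how the same kind of loop looks when parallelized with explicit threads; the arrays, their size, and the thread count are assumptions, not part of the original sidebar.

#include <pthread.h>

#define N 1000000
#define NTHREADS 4

static double a[N], b[N];

struct range { int lo, hi; };          /* index range assigned to one thread */

static void *worker(void *arg)
{
    struct range *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        b[i] = 2.0 * a[i];             /* same kind of loop body as the OpenMP variant */
    return NULL;
}

void scale_pthreads(void)
{
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = t * chunk;
        r[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &r[t]);   /* explicit thread creation */
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);                     /* wait for all workers */
}

With PThreads, the programmer partitions the index range and manages thread creation and joining by hand; the OpenMP annotation delegates both to the compiler and its runtime.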

After eight hours of work, the execution times didn't differ much from the previous version, but the code was now easier to understand. The restructured code also made it easier to implement parallel-data-block processing, which took about 12 hours. Only a few lines had to be changed to introduce parallelism, but the team found it difficult to assess the impact of those changes.

Although the refactoring approach was promising, the group ran out of time as the deadline approached and decided to abandon this strategy. The team reverted to the version without classes and began to integrate some parallelization ideas from Bzip2SMP (two hours).5 Additionally, they spent three hours making the files ready for submission.

In the end, Team 1 reported that fine-grained parallelization was inappropriate. It would have required too much effort to restructure the code for higher speedups.

Team 2
The second team focused on extensive restructuring of the sequential program before starting with the parallelization. Their plan was to analyze and profile the code (one week), refactor it (one week), and parallelize it (one week). The team spent about 2 × 50 hours of work in total.

The first week went mostly to analyzing code and profiling the sequential Bzip2 with Valgrind and gprof. In the remaining time, they concentrated on restructuring and preparing the code for parallelization. Two days before submission, they were still refactoring. They performed the actual parallelization on the last day.

The team rewrote the entire Bzip2 library as well as the I/O routines using a producer-consumer pattern. Thereafter, they used PThreads to introduce parallelism. They realized early on that a fine-grained parallelization wouldn't yield sufficient speedup, so they tried to achieve parallelism on higher abstraction levels. Massive refactorings were indispensable to resolve data dependencies and enable blockwise compression in their producer-consumer approach.

Although the team identified several other hotspots for parallelization, they didn't have enough time to tackle them. For example, they had no time left for fine-tuning the parallel version or for a plan to improve throughput with pipelining.

Team 2 reported that it drastically underestimated the time needed for refactoring. They found refactoring to be frustrating because it took a long time before an executable parallel version was available. Nevertheless, the team knew that drastic restructurings were indispensable.
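The article doesn't reproduce Team 2's code, so the skeleton below is only our sketch of the kind of producer-consumer structure described above: a bounded buffer protected by a PThreads mutex and two condition variables, filled by an I/O thread and drained by compression workers. The names, the struct layout, and the buffer size are assumptions.

#include <pthread.h>
#include <stdbool.h>

#define SLOTS 16                            /* capacity of the bounded buffer */

struct block { char *data; unsigned int len; long seq; };

static struct block queue[SLOTS];
static int head, tail, count;
static bool done;                           /* set by the producer after the last block */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Producer: called once per uncompressed block read from the input file. */
void put_block(struct block b)
{
    pthread_mutex_lock(&lock);
    while (count == SLOTS)
        pthread_cond_wait(&not_full, &lock);
    queue[tail] = b;
    tail = (tail + 1) % SLOTS;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Consumer (compression worker): returns false once the input is exhausted. */
bool get_block(struct block *b)
{
    pthread_mutex_lock(&lock);
    while (count == 0 && !done)
        pthread_cond_wait(&not_empty, &lock);
    if (count == 0) {                       /* producer finished and queue drained */
        pthread_mutex_unlock(&lock);
        return false;
    }
    *b = queue[head];
    head = (head + 1) % SLOTS;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return true;
}

/* Producer calls this after the last block so that waiting workers can exit. */
void end_of_input(void)
{
    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&lock);
}

Each worker would repeatedly call get_block, compress the block in memory, and hand the result to an output stage; the refactoring effort the team describes went into making the library and I/O code fit such a block-at-a-time interface.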

Team 3
The third team started with a fine-grained parallelization strategy using OpenMP and abandoned it later in favor of a master-worker approach using PThreads.

The team initially planned to begin with program and algorithm understanding, followed by parallelization with OpenMP. They reported working six or more hours a day. During the first 10 days, they spent two to three hours a day on understanding code and algorithms and trying out OpenMP directives. They profiled the sequential code with gprof to find performance bottlenecks.

After trying different ways of fine-grained parallelization with different OpenMP directives (for example, parallel for), the team realized that the speedups weren't promising and that changes would have to be much more invasive. However, they didn't want to make large modifications to the Bzip2 library, so they decided to focus on parallelism at a higher abstraction level and implemented a master-worker approach in which they compressed different file blocks independently. In this design, the master fills a buffer with blocks, while workers take blocks from the buffer to compress them.

The team had difficulties with the thread-synchronization mechanism between master and workers. They used sequence diagrams to design the mechanism and condition variables and locks to implement it. Another difficulty was the file output, which required a sequence adjustment of the compressed blocks to obtain the original order.

Unfortunately, Team 3 didn't finish the parallel version by the deadline, so they were excluded from the final competition. The main reason was a trivial bug in an I/O routine, but the team said they were too tired to find and fix it. However, they submitted a working version one week after the deadline, which we used for benchmarking.
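The file-output difficulty mentioned above, where compressed blocks finish in arbitrary order but must be written in their original order, is commonly solved with a sequence counter guarded by a condition variable. The sketch below illustrates that idea under assumed names; it is not Team 3's code.

#include <stdio.h>
#include <pthread.h>

static long next_to_write;                 /* sequence number of the next block to emit */
static pthread_mutex_t out_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  my_turn  = PTHREAD_COND_INITIALIZER;

/* Called by a worker after it has compressed block number seq. The worker
   blocks until all earlier blocks have been written, so the output file
   keeps the original block order. */
void write_in_order(FILE *out, long seq, const char *data, unsigned int len)
{
    pthread_mutex_lock(&out_lock);
    while (seq != next_to_write)
        pthread_cond_wait(&my_turn, &out_lock);

    fwrite(data, 1, len, out);

    next_to_write++;
    pthread_cond_broadcast(&my_turn);      /* wake whichever worker holds the next block */
    pthread_mutex_unlock(&out_lock);
}

A drawback of this simple scheme is that a worker holding a late block stalls until its turn comes; buffering finished blocks in the master instead avoids the stall at the cost of memory.
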
Team 4
This team used a trial-and-error approach for parallelization, working from the bottom up. Their plan was to create execution profiles of the sequential code with gprof and KProf, find the critical path, and parallelize the code along this path. They chose OpenMP as a means for parallelization, which they considered to be simpler and superior to PThreads.

Team 4 reported that its actual work was dominated by trying out spontaneous ideas, which was why they didn't accurately log their effort in terms of person-hours. During the post-competition interview, they estimated that they had spent about 70 percent of their time implementing and debugging ideas and only 30 percent actually reading and understanding the sequential code. They perceived program understanding as one of the most difficult tasks. The team misunderstood large parts of the code during their first parallelization attempts, and they failed to gain a thorough understanding of the compression algorithm.

Another difficulty was that many parts of the sequential code weren't parallelizable right away, due to data dependencies, function-call side effects, and sequential-execution optimizations. In addition, the sequential version implemented many loops in a while(true){...} style that didn't permit enclosing them with the parallel loops of OpenMP.

Consequently, the team started to refactor the loops. They focused on loops with no function calls, thus avoiding side effects in the parallel case. They unraveled data dependencies, which led to code that could be wrapped by parallel OpenMP loops. Unfortunately, this effort resulted in only minor speedups.

Team 4 explained that they thought OpenMP would be a good, scalable approach for parallelization in general. However, parallelizing Bzip2 would have required a much more fine-grained synchronization between individual threads to preserve data dependencies. The use of OpenMP required massive refactorings to make the sequential code parallelizable. This work was difficult to complete within the given time. Given the opportunity to start over, they said they would have resorted to PThreads instead.

Quantitative Comparisons

Quantitative comparisons of the parallel Bzip2 code produced by the four teams reveal some interesting points. Table 1 shows the total LOC without blank lines and comment lines, along with the number of lines containing parallel constructs, such as pthread_create, pthread_mutex_lock, and #pragma omp. Compared to sequential Bzip2, the LOC of the parallel versions vary about ±15 percent. Only a few lines, less than 2 percent, express parallelism.

Although the total LOC doesn't vary widely, the number of modified lines is quite high in the case of Team 1 (49 percent). Teams 2 and 3 modified 12 and 17 percent of the original code, also a significant refactoring effort. Team 4 modified about 3 percent of the original code, but failed to produce a speedup.

Team 2 won the competition by obtaining an impressive 10.3 speedup using 51 threads on a Sun Niagara T1 (8 processors, 32 hardware threads in total).

Table 1. LOC comparisons for sequential Bzip2 and four team efforts to parallelize its code. Total-LOC percentages are relative to the sequential 8,090 LOC; modified, added, and removed percentages are relative to the sequential 5,102 LOC without comments; parallelism-construct percentages are relative to each version's own LOC without comments.

Program | Total LOC | LOC without comments or blank lines | LOC with parallelism constructs | LOC modified | LOC added | LOC removed | Total effort (person-hours)
Sequential Bzip2 | 8,090 (100%) | 5,102 | 0 | – | – | – | –
Team 1 | 7,030 (87%) | 4,228 | 49 (1.2%) | 2,476 (49%) | 801 (15.7%) | 1,675 (32.8%) | ~2 × 50†
Team 2 | 8,515 (105%) | 5,356 | 48 (0.9%) | 600 (12%) | 427 (8.4%) | 173 (3.4%) | ~2 × 50†
Team 3 | 9,270 (115%) | 5,915 | 82 (1.4%) | 861 (17%) | 837 (16.4%) | 24 (0.5%) | ~2 × 30†
Team 4 | 8,207 (101%) | 5,170 | 8 (0.2%) | 156 (3%) | 112 (2.2%) | 44 (0.9%) | N/A

† ~2 × x represents the effort of two people working for x hours each.

Speedups greater than the number of cores on the Niagara are possible, as the Sun T1 processor provides four hardware threads per core and the machine switches to a different hardware thread if an active thread waits for data. In general, it can pay off to have many more software than hardware threads (that is, more than 32 threads) to keep the processors busy.

Figure 1 compares performance results for the four teams. We compiled the benchmarked programs with the Gnu Compiler Collection (GCC) 4.2, using a 900-kbyte block size and eclipse-java-europa-fall2-linux-gtk (79 Mbytes) as an input file. We executed each algorithm five times and averaged the results. The figure shows that Team 4 achieved a speedup of less than 1. Their execution time was about 15 percent slower than the sequential program. Moreover, their program's design allowed only 1, 2, 4, 8, or 16 parallel threads.

Lessons Learned

We offer eight lessons learned from our experience.

Don't Despair While Refactoring
Parallelizing a sequential program might not be possible right away. It might require massive refactorings to prepare the code for parallelization. Refactorings improve code modularity, eliminate function-call side effects, and remove unnecessary data dependencies.

Team 1 refactored almost half the code, while Teams 2 and 3 refactored 12 and 17 percent, respectively (see Table 1). Team 4 refactored only 3 percent, but compared to a mere 0.2 percent parallelization constructs in its code, the refactoring is still significant.

Although Teams 1, 3, and 4 favored OpenMP at first, they soon realized that it required more refactoring than PThreads. Teams 1 and 3 turned to PThreads instead, trading more explicit (and potentially more error-prone) thread programming for less refactoring. The teams chose a suboptimal parallelization strategy because of the high refactoring cost that they would incur to do it right.

Team 2 reported that refactoring for parallelization was frustrating because it took a long time to see the results. However, refactoring was important for winning the competition. Team 1 also felt frustrated by refactoring and stopped doing it under time pressure.

Obviously, developers need some tools to help prepare sequential programs for parallelization. Automated support tools could increase productivity and reduce error rates. At the same time, tools could relieve stress and help developers focus on parallelization issues. Further research must define the typical refactoring tasks for parallelization and how to automate them.

Incremental Parallelization Doesn't Work
A purported strength of OpenMP is its enabling of incremental parallelization, which means you can start with sequential code and parallelize it by simply adding pragmas, one by one. This might be possible for simple cases, but we have yet to encounter a real program where the incremental approach works. Real programs like Bzip are full of side effects, data dependencies, optimizations for sequential execution, and tricky code. Significant refactoring is necessary.
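To make the point concrete, consider the while(true)-style loops mentioned earlier for Team 4: OpenMP's parallel for needs a counted loop with independent iterations, so such code must be refactored first. The fragment below is a simplified, hypothetical illustration of that refactoring, not code taken from Bzip2.

#include <stdio.h>

/* Hypothetical per-block worker; assumed to touch only its own block. */
static void process_block(int i) { printf("block %d done\n", i); }

/* Before: an open-ended loop driven by hidden shared state cannot be
   annotated with #pragma omp parallel for.

   while (1) {
       int i = next_block_index();   // shared cursor, side effects
       if (i < 0) break;
       process_block(i);
   }
*/

/* After: a counted loop whose iterations are independent, so one OpenMP
   annotation distributes the blocks over the available threads. */
void process_all_blocks(int num_blocks)
{
    int i;
    #pragma omp parallel for schedule(dynamic)
    for (i = 0; i < num_blocks; i++)
        process_block(i);
}

The refactoring itself, removing the shared cursor and any side effects hiding in the loop body, is where the real effort goes.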

[Figure 1: two charts plotting (a) execution time in seconds and (b) speedup against the number of threads (1 to 121) for the sequential version and Teams 1 through 4.]

Figure 1. Performance comparisons for the four teams: (a) execution times and (b) speedups for the parallelized Bzip2 programs. All data points are averages of five total executions including I/O.

PThreads might require less refactoring than OpenMP because it supports more explicit parallel programming, but the effort can still be significant.

On the basis of this and other case studies,10 we believe the idea that sequential code can be parallelized without significant modification is a myth.

Look beyond the Critical Path
Some parallel programming textbooks advocate using a profiler to find the critical execution path and parallelizing along that path.11 This advice sounds reasonable, but with hindsight, we don't think that it's a good approach. Parallelizing the critical path worked for none of the four teams in this study, nor for our other case studies.10

Profiling sequential code certainly provides information about performance bottlenecks. However, these bottlenecks might be closely related to the way the sequential version implements the specification. A sequential implementation typically involves design choices that preclude alternatives representing degrees of freedom that effective parallelization might require. It's not enough to study the sequential implementation. You must also study the specification and consider alternative, parallel algorithms for the parallel version. This is why we gave the students an article that included Bzip's specification.9

Fine-Grained Parallelization Isn't the Only Choice
Parallelizing loops, even if they're on the critical execution path, yielded only minor speedups (see the results for Teams 1 and 4). Successful parallelizations constructed larger tasks for parallel execution.

Replacing a sequential sorting routine with a parallel version didn't significantly improve performance either, as Team 1's experience showed. We don't have enough data at this time, but we suspect that replacing standard library routines with parallel versions might not generate significant speedups for some real applications. If this is true, developing parallel libraries might remain an interesting exercise with little effect on practice.

Evaluate High-Level Parallelism's Potential
The most successful teams tackled parallelization on high abstraction levels by introducing a producer-consumer pattern (Team 2) or master-worker pattern (Team 3).

You might assume that high-level parallelization would be easier because it would involve less detail. In our experience, the opposite is true. The original program had nonlocal data dependencies to unravel, and the functions had side effects to understand before any changes could be made. Team 4 modified the code at loop level, which was actually easier but less successful. High-level improvements affected more parts of the program and were much more difficult to realize, as documented by Team 2's refactoring effort.

We need further research on how to give developers parallelization support at appropriate abstraction levels. For example, we need to identify useful parallel patterns, as Timothy Mattson and his colleagues have done,12 and make these patterns configurable. We also need techniques to simplify the integration of such patterns into existing code.

Trial and Error Is Risky
Team 3 started with trial and error in OpenMP and quickly realized that it wasn't the way to go. Team 4 didn't extricate themselves from this mode. It became difficult to measure progress or decide whether to abandon certain choices. In short, this approach failed.

It's easy to be seduced into a trial-and-error mode when working with code, but it's especially dangerous when parallelizing because you face many new choices and unknowns. Instead, you need a plan of attack. First, figure out the "big picture." Then develop some hypotheses about where parallelization might yield speedups. Then start identifying the most promising choices and eliminating poor ones. Some back-of-the-envelope calculations might help.

Parallel Programming Isn't a Black Art
Most people still think of parallel programming as a black art. However, our graduate students had only one semester's experience with parallel programming and didn't find it overwhelming.

Dealing with parallelism requires new concepts, new algorithms, and special attention to new types of defects. However, we believe that parallel system design and parallel programming aren't black arts that only the elite can master.
We're hopeful that, with appropriate concepts, languages, tools, and training, competent programmers will handle multicore system challenges just fine. Given some time, the professional community will identify the engineering principles to support general-purpose, parallel software and the ways to teach and apply these principles successfully.

Practical Training
None of the students experienced difficulties in creating small parallel programs from scratch. However, things changed when they had to parallelize Bzip2. Despite intensive training in parallelization, the students reported a big difference between parallelizing small "toy" programs and a real-world application. They couldn't readily apply concepts such as loop parallelization or data partitioning. Bzip2 was not only more complex but also heavily optimized for sequential performance, making parallelization a difficult task.

It's well known that computer science students need exposure to real systems, and multicore in no way diminishes this need. Fortunately, many students are excited about the current shift to parallel computation and eager to learn and explore. They are further motivated by the prospect of a virtual job guarantee from being able to work with multicore/manycore software.

Other Issues

Beyond this study, the following advice might be helpful.

Consider Multiple Levels for Parallelization
Jon Louis Bentley identified different levels to consider when writing efficient, sequential programs:13 "In most systems there are a number of design levels at which we can increase efficiency, and we should work at the proper level to avoid being penny-wise and pound-foolish." This is excellent advice for parallelization as well.

If You Start from Scratch ...
When you're writing a new sequential program from scratch, ask whether it might be parallelized in the future. If so, write it in a way that saves future refactoring work. For example, keep computational tasks modular and free of side effects so that individual threads can execute them. You should parameterize functions right away to simplify the division of work. This controls the parts of data to which the functions apply and enables domain decomposition. It's also important to write functions in a thread-safe manner that allows many instances to run simultaneously.

It might help to think in terms of parallel design patterns when writing a multicore program from scratch.12 For example, the master-worker pattern manages independently running, parallel computations, while the pipeline and producer-consumer patterns break computations into phases that can be overlapped. You can combine parallel patterns (for instance, the master-worker pattern combines with the domain decomposition pattern), and you can split individual worker computations in pipeline fashion. Ideally, you would design all computational steps so that they can be packaged into threads. We recommend trying out prototypes early to check feasibility and performance. Prototypes are useful because you might not know cache effects or overhead costs of various building blocks (such as thread creation, communication, or I/O) ahead of time. These effects and costs might also deviate from what you expect.
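As a small illustration of that advice (ours, not the authors'), the function below is handed the exact slice of data it may touch and keeps no hidden state, so any number of threads can run it concurrently on disjoint ranges, which is what domain decomposition requires.

#include <stddef.h>

/* Thread-safe by construction: the caller decides which part of the data the
   function may read, and all output goes to a caller-supplied location. */
void checksum_range(const unsigned char *data, size_t begin, size_t end,
                    unsigned long *result)
{
    unsigned long sum = 0;
    for (size_t i = begin; i < end; i++)
        sum = sum * 31 + data[i];
    *result = sum;                         /* each thread writes to its own slot */
}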

Multicore software isn't merely a matter of programming in the small. Many aspects of classical software engineering apply and must be adapted, such as design patterns, performance modeling, prototyping, refactoring, and tool support. Multicore software engineering is an area that seeks to build a core of systematic, empirically validated engineering principles for general-purpose, parallel software. We hope this limited study helps open the door to this research.

Acknowledgments
We thank our student assistant Kai-Bin Bao and the course students for their support. We also appreciate the support of the excellence initiative at the University of Karlsruhe.

About the Authors

Victor Pankratius is the head of the young investigator group on multicore software engineering at the University of Karlsruhe. In multicore software engineering, his interests include autotuning, language extensions, debugging, empirical studies, and making multicore programming accessible to many developers. Pankratius has a Dr.rer.pol. from the University of Karlsruhe. He's a member of the IEEE Computer Society, the ACM, and the German Computer Science Society. Contact him at [email protected].

Ali Jannesari is a PhD student in the University of Karlsruhe's Multicore Software Engineering group. His interests include debugging parallel programs. He received a master's degree in computer science from the University of Stuttgart. Contact him at [email protected].

Walter F. Tichy is a professor of computer science at the University of Karlsruhe. He's also director of the Forschungszentrum Informatik, a technology transfer institute. His primary research interests are software engineering and parallelism. Tichy has a PhD in computer science from Carnegie Mellon University. He's a cofounder of ParTec, a company specializing in cluster computing, and a member of the IEEE Computer Society, the ACM, and the German Computer Science Society. Contact him at [email protected].

References
1. D. Szafron and J. Schaeffer, "An Experiment to Measure the Usability of Parallel Programming Systems," Concurrency: Practice and Experience, vol. 8, no. 2, 1996, pp. 147–166.
2. L. Hochstein et al., "Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers," Proc. 2005 ACM/IEEE Conf. Supercomputing, IEEE CS Press, 2005, p. 35.
3. L. Hochstein and V.R. Basili, "The ASC-Alliance Projects: A Case Study of Large-Scale Parallel Scientific Code Development," Computer, vol. 41, no. 3, 2008, pp. 50–58.
4. N. Werensteijn, Smpbzip2, May 2003; http://home.student.utwente.nl/n.werensteijn/smpbzip2.
5. Bzip2smp v. 1.0, Dec. 2005; http://bzip2smp.sourceforge.net.
6. J. Gilchrist, Parallel Bzip2 v. 1.0.2, 25 July 2007; http://compression.ca/pbzip2.
7. R.K. Yin, Case Study Research: Design and Methods, 3rd ed., Sage Publications, 2002.
8. V. Pankratius et al., "Parallelizing Bzip2: A Case Study in Multicore Software Engineering," tech. report, IPD Inst., Univ. of Karlsruhe, Apr. 2008; www.multicore-systems.org/research.
9. M. Tamm, "Data Compression with the BWT Algorithm," C't, vol. 16, 2000, pp. 194–201 (in German).
10. V. Pankratius et al., "Software Engineering for Multicore Systems: An Experience Report," Proc. 1st Int'l Workshop Multicore Software Eng. (IWMSE 08), ACM Press, 2008, pp. 53–60.
11. S. Akhter and J. Roberts, Multi-Core Programming, Intel Press, 2006.
12. T.G. Mattson et al., Patterns for Parallel Programming, Addison-Wesley, 2004.
13. J.L. Bentley, Writing Efficient Programs, Prentice Hall, 1982.

