Feature: Parallel Programming

Parallelizing Bzip2: A Case Study in Multicore Software Engineering

Victor Pankratius, Ali Jannesari, and Walter F. Tichy, University of Karlsruhe

As multicore computers become mainstream, developers need to know which approaches to parallelism work. Four teams competitively parallelized the Bzip2 compression algorithm. The authors report lessons learned.

Multicore chips integrate several processors on a single die, and they're quickly becoming widespread. Being affordable, they make it possible for every PC user to own a truly parallel computer, but they also make parallel programming a concern for more software developers than ever before. Not only is parallel programming considered difficult, but experience with parallel software is limited to a few areas, such as scientific computing, operating systems, and databases. Now that parallelism is within reach for new application classes, new software engineering questions arise.

In the young field of multicore software engineering, many fundamental questions are still open, such as what language constructs are useful, which parallelization strategies work best, and how existing sequential applications can be reengineered for parallelism. At this point, the only way to answer these questions is to try various approaches and evaluate their effectiveness. Previous empirical studies focused on either numeric applications or computers with distributed memory,1–3 but the resulting observations don't necessarily carry over to nonnumeric applications and shared-memory multicore computers.

We conducted a case study of parallelizing a real program for multicore computers using currently available libraries and tools. We selected the sequential Bzip2 compression program for the study because it's a computing-intensive, widely used, and relevant application in everyday life. Its source code is available, and its algorithm is well documented (see the sidebar "Bzip Compression Fundamentals"). In addition, the algorithm is nontrivial, but, with 8,000 LOC, the application is small enough to manage in a course.

The study occurred during the last three weeks of a multicore software engineering course. Eight graduate computer science students participated, working in independent teams of two to parallelize Bzip2 in a team competition. The winning team received a special certificate of achievement.

Competing Team Strategies

Prior to the study, all students had three months' extensive training in parallelization with Posix Threads (PThreads) and OpenMP (see the sidebar "Parallel Programming with PThreads and OpenMP") and in profiling strategies and tools. The teams received no hints for the Bzip2 parallelization task. They could try anything, as long as they preserved compatibility with the sequential version. They could reuse any code, even from existing Bzip2 parallel implementations,4–6 although these implementations were based on older versions of the sequential program and weren't fully compatible with the current version.

Bzip Compression Fundamentals

Bzip uses a combination of techniques to compress data in a lossless way. It divides an input file into fixed-sized blocks that are compressed independently. It feeds each block into a pipeline of algorithms, as depicted in Figure A. An output file stores the compressed blocks at the pipeline's end in the original order. All transformations are reversible, and the stages are passed in the opposite direction for decompression.

■ Pipeline stage 1. A Burrows-Wheeler transformation (BWT) reorders the characters in a block in such a way that similar characters have a higher probability of being closer to one another.1 BWT changes neither the length of the block nor the characters.
■ Pipeline stage 2. A move-to-front (MTF) coding applies a locally adaptive algorithm to assign low integer values to symbols that reappear more frequently.2 The resulting vector can be compressed efficiently.
■ Pipeline stage 3. The well-known Huffman compression technique is applied to the vector obtained in the previous stage.

Julian Seward developed the open source implementation of Bzip2 that we used in our case study.3 It lets block sizes vary in a range of 100 to 900 kbytes. A low-level library comprises functions that compress and decompress data in main memory. The sorting algorithm that's part of the BWT includes a sophisticated fallback mechanism to improve performance. The high-level interface provides wrappers for the low-level functions and adds functionality for dealing with I/O.

References
1. M. Burrows and D.J. Wheeler, A Block-Sorting Lossless Data Compression Algorithm, tech. report 124, Digital Equipment Corp., 10 May 1994.
2. J.L. Bentley et al., "A Locally Adaptive Data Compression Scheme," Comm. ACM, vol. 29, no. 4, 1986, pp. 320–330.
3. J. Seward, Bzip2 v. 1.0.4, 20 Dec. 2006; www.bzip.org.

Figure A. The Bzip2 stages. The input file is divided into fixed-sized blocks that are compressed independently in a pipeline of techniques: input file → Stage 1, Burrows-Wheeler transformation (BWT) → Stage 2, move-to-front (MTF) coding → Stage 3, Huffman compression → compressed output file.
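As a concrete illustration of the low-level library mentioned in the sidebar, the sketch below compresses a single in-memory block with libbz2's BZ2_bzBuffToBuffCompress function. It is our own minimal example, not code from the case study; the buffer sizing follows the library's documented guideline of allowing about 1 percent plus 600 bytes of growth.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <bzlib.h>

/* Compress one in-memory block with the libbz2 low-level API.
   blockSize100k selects the block size: 1 = 100 kbytes ... 9 = 900 kbytes. */
static int compress_block(const char *src, unsigned int srcLen,
                          char **dst, unsigned int *dstLen)
{
    *dstLen = srcLen + srcLen / 100 + 600;   /* worst-case output size */
    *dst = malloc(*dstLen);
    if (*dst == NULL)
        return BZ_MEM_ERROR;

    return BZ2_bzBuffToBuffCompress(*dst, dstLen, (char *)src, srcLen,
                                    9,   /* blockSize100k */
                                    0,   /* verbosity */
                                    0);  /* workFactor: library default */
}

int main(void)
{
    const char *text = "an example input block";
    char *out = NULL;
    unsigned int outLen = 0;

    if (compress_block(text, (unsigned int)strlen(text), &out, &outLen) == BZ_OK)
        printf("compressed %u bytes into %u bytes\n",
               (unsigned int)strlen(text), outLen);

    free(out);
    return 0;
}

Compile with -lbz2. Because each block is compressed independently, calls like this one can in principle run in separate threads, which is exactly the property the teams exploited.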

We asked the teams to document their work from the beginning, including their initial strategies and expectations, the difficulties they encountered during parallelization, their approach, and their effort. In addition to these reports, we collected evidence from personal observations, the submitted code, the final presentations, and interviews with the students after their presentations.7

Because of space limitations, we omit a number of details here, but more information (including threats to validity) is available elsewhere.8

Team 1
The first team tried several strategies. They started with a low-level approach, using a mixture of OpenMP and PThreads. Then they restructured the code by introducing classes. As the submission deadline approached, they reverted to an earlier snapshot and applied some ideas from the Bzip2SMP parallelization.5

Team 1's plan was to understand the code base (one week), parallelize it (one week), and test and debug the parallel version (one week). Actual work quickly diverged from the original plan. At the beginning, the team invested two hours to get a code overview and find the files that were relevant for parallelization. They spent another three to four hours to create execution profiles with gprof (www.gnu.org/software/binutils), KProf (http://kprof.sourceforge.net), and Valgrind (http://valgrind.org).

The team realized that they had to choose input data carefully to find the critical path and keep the data sizes manageable. They invested another two hours in understanding code along the critical path. Understanding the code generally and studying the algorithm took another six hours.9 Thereafter, they decided that parallel processing of data blocks was the most promising approach, but they had problems unraveling existing data dependencies.

The team continued with a parallelization at a low abstraction level, taking about 12 hours. In particular, they parallelized frequently called code fragments with OpenMP and exchanged a sorting routine for a parallel Quicksort implementation using PThreads. However, the speedup was disappointing.

The team decided to refactor the code and improve its readability by introducing classes.

Parallel Programming with PThreads and OpenMP

PThreads and OpenMP add parallelism to C in two different ways. PThreads is a library, while OpenMP extends the language with compiler directives, for example:

#pragma omp parallel for   // OpenMP annotation
for (i = 0; i < n; i++) {
    /* loop body; iterations run in parallel */
}
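The sidebar's OpenMP fragment has no PThreads counterpart in this extract, so the following is our own sketch of how the same kind of loop looks when parallelized with explicit threads; the arrays, their size, and the thread count are assumptions, not part of the original sidebar.

#include <pthread.h>

#define N 1000000
#define NTHREADS 4

static double a[N], b[N];

struct range { int lo, hi; };          /* index range assigned to one thread */

static void *worker(void *arg)
{
    struct range *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        b[i] = 2.0 * a[i];             /* same kind of loop body as the OpenMP variant */
    return NULL;
}

void scale_pthreads(void)
{
    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        r[t].lo = t * chunk;
        r[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &r[t]);   /* explicit thread creation */
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);                     /* wait for all workers */
}

With PThreads, the programmer partitions the index range and manages thread creation and joining by hand; the OpenMP annotation delegates both to the compiler and its runtime.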

After eight hours of work, the execution times didn't differ much from the previous version, but the code was now easier to understand. The restructured code also made it easier to implement parallel-data-block processing, which took about 12 hours. Only a few lines had to be changed to introduce parallelism, but the team found it difficult to assess the impact of those changes.

Although the refactoring approach was promising, the group ran out of time as the deadline approached and decided to abandon this strategy. The team reverted to the version without classes and began to integrate some parallelization ideas from Bzip2SMP (two hours).5 Additionally, they spent three hours making the files ready for submission.

In the end, Team 1 reported that fine-grained parallelization was inappropriate. It would have required too much effort to restructure the code for higher speedups.

Team 2
The second team focused on extensive restructuring of the sequential program before starting with the parallelization. Their plan was to analyze and profile the code (one week), refactor it (one week), and parallelize it (one week). The team spent about 2 × 50 hours of work in total.

The first week went mostly to analyzing code and profiling the sequential Bzip2 with Valgrind and gprof. In the remaining time, they concentrated on restructuring and preparing the code for parallelization. Two days before submission, they were still refactoring. They performed the actual parallelization on the last day.

The team rewrote the entire Bzip2 library as well as the I/O routines using a producer-consumer pattern. Thereafter, they used PThreads to introduce parallelism. They realized early on that a fine-grained parallelization wouldn't yield sufficient speedup, so they tried to achieve parallelism on higher abstraction levels. Massive refactorings were indispensable to resolve data dependencies and enable blockwise compression in their producer-consumer approach.

Although the team identified several other hotspots for parallelization, they didn't have enough time to tackle them. For example, they had no time left for fine-tuning the parallel version or for a plan to improve throughput with pipelining.

Team 2 reported that it drastically underestimated the time needed for refactoring. They found refactoring to be frustrating because it took a long time before an executable parallel version was available. Nevertheless, the team knew that drastic restructurings were indispensable.
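The article doesn't reproduce Team 2's code, so the skeleton below is only our sketch of the kind of producer-consumer structure described above: a bounded buffer protected by a PThreads mutex and two condition variables, filled by an I/O thread and drained by compression workers. The names, the struct layout, and the buffer size are assumptions.

#include <pthread.h>
#include <stdbool.h>

#define SLOTS 16                            /* capacity of the bounded buffer */

struct block { char *data; unsigned int len; long seq; };

static struct block queue[SLOTS];
static int head, tail, count;
static bool done;                           /* set by the producer after the last block */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Producer: called once per uncompressed block read from the input file. */
void put_block(struct block b)
{
    pthread_mutex_lock(&lock);
    while (count == SLOTS)
        pthread_cond_wait(&not_full, &lock);
    queue[tail] = b;
    tail = (tail + 1) % SLOTS;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Consumer (compression worker): returns false once the input is exhausted. */
bool get_block(struct block *b)
{
    pthread_mutex_lock(&lock);
    while (count == 0 && !done)
        pthread_cond_wait(&not_empty, &lock);
    if (count == 0) {                       /* producer finished and queue drained */
        pthread_mutex_unlock(&lock);
        return false;
    }
    *b = queue[head];
    head = (head + 1) % SLOTS;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return true;
}

/* Producer calls this after the last block so that waiting workers can exit. */
void end_of_input(void)
{
    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&lock);
}

Each worker would repeatedly call get_block, compress the block in memory, and hand the result to an output stage; the refactoring effort the team describes went into making the library and I/O code fit such a block-at-a-time interface.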

Team 3
The third team started with a fine-grained parallelization strategy using OpenMP and abandoned it later in favor of a master-worker approach using PThreads.

The team initially planned to begin with program and algorithm understanding, followed by parallelization with OpenMP. They reported working six or more hours a day. During the first 10 days, they spent two to three hours a day on understanding code and algorithms and trying out OpenMP directives. They profiled the sequential code with gprof to find performance bottlenecks.

After trying different ways of fine-grained parallelization with different OpenMP directives (for example, parallel for), the team realized that the speedups weren't promising and that changes would have to be much more invasive. However, they didn't want to make large modifications to the Bzip2 library, so they decided to focus on parallelism at a higher abstraction level and implemented a master-worker approach in which they compressed different file blocks independently. In this design, the master fills a buffer with blocks, while workers take blocks from the buffer to compress them.

The team had difficulties with the thread-synchronization mechanism between master and workers. They used sequence diagrams to design the mechanism and condition variables and locks to implement it. Another difficulty was the file output, which required a sequence adjustment of the compressed blocks to obtain the original order.

Unfortunately, Team 3 didn't finish the parallel version by the deadline, so they were excluded from the final competition. The main reason was a trivial bug in an I/O routine, but the team said they were too tired to find and fix it. However, they submitted a working version one week after the deadline, which we used for benchmarking.
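The file-output difficulty mentioned above, where compressed blocks finish in arbitrary order but must be written in their original order, is commonly solved with a sequence counter guarded by a condition variable. The sketch below illustrates that idea under assumed names; it is not Team 3's code.

#include <stdio.h>
#include <pthread.h>

static long next_to_write;                 /* sequence number of the next block to emit */
static pthread_mutex_t out_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  my_turn  = PTHREAD_COND_INITIALIZER;

/* Called by a worker after it has compressed block number seq. The worker
   blocks until all earlier blocks have been written, so the output file
   keeps the original block order. */
void write_in_order(FILE *out, long seq, const char *data, unsigned int len)
{
    pthread_mutex_lock(&out_lock);
    while (seq != next_to_write)
        pthread_cond_wait(&my_turn, &out_lock);

    fwrite(data, 1, len, out);

    next_to_write++;
    pthread_cond_broadcast(&my_turn);      /* wake whichever worker holds the next block */
    pthread_mutex_unlock(&out_lock);
}

A drawback of this simple scheme is that a worker holding a late block stalls until its turn comes; buffering finished blocks in the master instead avoids the stall at the cost of memory.
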
Team 4
This team used a trial-and-error approach for parallelization, working from the bottom up. Their plan was to create execution profiles of the sequential code with gprof and KProf, find the critical path, and parallelize the code along this path. They chose OpenMP as a means for parallelization, which they considered to be simpler and superior to PThreads.

Team 4 reported that its actual work was dominated by trying out spontaneous ideas, which was why they didn't accurately log their effort in terms of person-hours. During the post-competition interview, they estimated that they had spent about 70 percent of their time implementing and debugging ideas and only 30 percent actually reading and understanding the sequential code. They perceived program understanding as one of the most difficult tasks. The team misunderstood large parts of the code during their first parallelization attempts, and they failed to gain a thorough understanding of the compression algorithm.

Another difficulty was that many parts of the sequential code weren't parallelizable right away, due to data dependencies, function-call side effects, and sequential-execution optimizations. In addition, the sequential version implemented many loops in a while(true){...} style that didn't permit enclosing them with the parallel loops of OpenMP.

Consequently, the team started to refactor the loops. They focused on loops with no function calls, thus avoiding side effects in the parallel case. They unraveled data dependencies, which led to code that could be wrapped by parallel OpenMP loops. Unfortunately, this effort resulted in only minor speedups.

Team 4 explained that they thought OpenMP would be a good, scalable approach for parallelization in general. However, parallelizing Bzip2 would have required a much more fine-grained synchronization between individual threads to preserve data dependencies. The use of OpenMP required massive refactorings to make the sequential code parallelizable. This work was difficult to complete within the given time. Given the opportunity to start over, they said they would have resorted to PThreads instead.

Quantitative Comparisons

Quantitative comparisons of the parallel Bzip2 code produced by the four teams reveal some interesting points. Table 1 shows the total LOC without blank lines and comment lines, along with the number of lines containing parallel constructs, such as pthread_create, pthread_mutex_lock, and #pragma omp. Compared to sequential Bzip2, the LOC of the parallel versions vary about ±15 percent. Only a few lines, less than 2 percent, express parallelism.

Although the total LOC doesn't vary widely, the number of modified lines is quite high in the case of Team 1 (49 percent). Teams 2 and 3 modified 12 and 17 percent of the original code, also a significant refactoring effort. Team 4 modified about 3 percent of the original code, but failed to produce a speedup.

Team 2 won the competition by obtaining an impressive 10.3 speedup using 51 threads on a Sun Niagara T1 (8 processors, 32 hardware threads in total).

Table 1. LOC comparisons for sequential Bzip2 and four team efforts to parallelize its code. Total-LOC percentages are relative to the sequential 8,090 LOC; modified, added, and removed percentages are relative to the sequential 5,102 LOC without comments; parallelism-construct percentages are relative to each version's own LOC without comments.

Program | Total LOC | LOC without comments or blank lines | LOC with parallelism constructs | LOC modified | LOC added | LOC removed | Total effort (person-hours)
Sequential Bzip2 | 8,090 (100%) | 5,102 | 0 | – | – | – | –
Team 1 | 7,030 (87%) | 4,228 | 49 (1.2%) | 2,476 (49%) | 801 (15.7%) | 1,675 (32.8%) | ~2 × 50†
Team 2 | 8,515 (105%) | 5,356 | 48 (0.9%) | 600 (12%) | 427 (8.4%) | 173 (3.4%) | ~2 × 50†
Team 3 | 9,270 (115%) | 5,915 | 82 (1.4%) | 861 (17%) | 837 (16.4%) | 24 (0.5%) | ~2 × 30†
Team 4 | 8,207 (101%) | 5,170 | 8 (0.2%) | 156 (3%) | 112 (2.2%) | 44 (0.9%) | N/A

† ~2 × x represents the effort of two people working for x hours each.

Speedups greater than the number of cores on the Niagara are possible, as the Sun T1 processor provides four hardware threads per core and the machine switches to a different hardware thread if an active thread waits for data. In general, it can pay off to have many more software than hardware threads (that is, more than 32 threads) to keep the processors busy.

Figure 1 compares performance results for the four teams. We compiled the benchmarked programs with the Gnu Compiler Collection (GCC) 4.2, using a 900-kbyte block size and eclipse-java-europa-fall2-linux-gtk (79 Mbytes) as an input file. We executed each algorithm five times and averaged the results. The figure shows that Team 4 achieved a speedup of less than 1. Their execution time was about 15 percent slower than the sequential program. Moreover, their program's design allowed only 1, 2, 4, 8, or 16 parallel threads.

Lessons Learned

We offer eight lessons learned from our experience.

Don't Despair While Refactoring
Parallelizing a sequential program might not be possible right away. It might require massive refactorings to prepare the code for parallelization. Refactorings improve code modularity, eliminate function-call side effects, and remove unnecessary data dependencies.

Team 1 refactored almost half the code, while Teams 2 and 3 refactored 12 and 17 percent, respectively (see Table 1). Team 4 refactored only 3 percent, but compared to a mere 0.2 percent parallelization constructs in its code, the refactoring is still significant.

Although Teams 1, 3, and 4 favored OpenMP at first, they soon realized that it required more refactoring than PThreads. Teams 1 and 3 turned to PThreads instead, trading more explicit (and potentially more error-prone) thread programming for less refactoring. The teams chose a suboptimal parallelization strategy because of the high refactoring cost that they would incur to do it right.

Team 2 reported that refactoring for parallelization was frustrating because it took a long time to see the results. However, refactoring was important for winning the competition. Team 1 also felt frustrated by refactoring and stopped doing it under time pressure.

Obviously, developers need some tools to help prepare sequential programs for parallelization. Automated support tools could increase productivity and reduce error rates. At the same time, tools could relieve stress and help developers focus on parallelization issues. Further research must define the typical refactoring tasks for parallelization and how to automate them.

Incremental Parallelization Doesn't Work
A purported strength of OpenMP is its enabling of incremental parallelization, which means you can start with sequential code and parallelize it by simply adding pragmas, one by one. This might be possible for simple cases, but we have yet to encounter a real program where the incremental approach works. Real programs like Bzip are full of side effects, data dependencies, optimizations for sequential execution, and tricky code. Significant refactoring is necessary.
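To make the point concrete, consider the while(true)-style loops mentioned earlier for Team 4: OpenMP's parallel for needs a counted loop with independent iterations, so such code must be refactored first. The fragment below is a simplified, hypothetical illustration of that refactoring, not code taken from Bzip2.

#include <stdio.h>

/* Hypothetical per-block worker; assumed to touch only its own block. */
static void process_block(int i) { printf("block %d done\n", i); }

/* Before: an open-ended loop driven by hidden shared state cannot be
   annotated with #pragma omp parallel for.

   while (1) {
       int i = next_block_index();   // shared cursor, side effects
       if (i < 0) break;
       process_block(i);
   }
*/

/* After: a counted loop whose iterations are independent, so one OpenMP
   annotation distributes the blocks over the available threads. */
void process_all_blocks(int num_blocks)
{
    int i;
    #pragma omp parallel for schedule(dynamic)
    for (i = 0; i < num_blocks; i++)
        process_block(i);
}

The refactoring itself, removing the shared cursor and any side effects hiding in the loop body, is where the real effort goes.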

[Figure 1: two charts plotting (a) execution time in seconds and (b) speedup against the number of threads (1 to 121) for the sequential version and Teams 1 through 4.]

Figure 1. Performance comparisons for the four teams: (a) execution times and (b) speedups for the parallelized Bzip2 programs. All data points are averages of five total executions including I/O.

PThreads might require less refactoring than OpenMP because it supports more explicit parallel programming, but the effort can still be significant.

On the basis of this and other case studies,10 we believe the idea that sequential code can be parallelized without significant modification is a myth.

Look beyond the Critical Path
Some parallel programming textbooks advocate using a profiler to find the critical execution path and parallelizing along that path.11 This advice sounds reasonable, but with hindsight, we don't think that it's a good approach. Parallelizing the critical path worked for none of the four teams in this study, nor for our other case studies.10

Profiling sequential code certainly provides information about performance bottlenecks. However, these bottlenecks might be closely related to the way the sequential version implements the specification. A sequential implementation typically involves design choices that preclude alternatives representing degrees of freedom that effective parallelization might require. It's not enough to study the sequential implementation. You must also study the specification and consider alternative, parallel algorithms for the parallel version. This is why we gave the students an article that included Bzip's specification.9

Fine-Grained Parallelization Isn't the Only Choice
Parallelizing loops, even if they're on the critical execution path, yielded only minor speedups (see the results for Teams 1 and 4). Successful parallelizations constructed larger tasks for parallel execution.

Replacing a sequential sorting routine with a parallel version didn't significantly improve performance either, as Team 1's experience showed. We don't have enough data at this time, but we suspect that replacing standard library routines with parallel versions might not generate significant speedups for some real applications. If this is true, developing parallel libraries might remain an interesting exercise with little effect on practice.

Evaluate High-Level Parallelism's Potential
The most successful teams tackled parallelization on high abstraction levels by introducing a producer-consumer pattern (Team 2) or master-worker pattern (Team 3).

You might assume that high-level parallelization would be easier because it would involve less detail. In our experience, the opposite is true. The original program had nonlocal data dependencies to unravel, and the functions had side effects to understand before any changes could be made. Team 4 modified the code at loop level, which was actually easier but less successful. High-level improvements affected more parts of the program and were much more difficult to realize, as documented by Team 2's refactoring effort.

We need further research on how to give developers parallelization support at appropriate abstraction levels. For example, we need to identify useful parallel patterns, as Timothy Mattson and his colleagues have done,12 and make these patterns configurable. We also need techniques to simplify the integration of such patterns into existing code.

Trial and Error Is Risky
Team 3 started with trial and error in OpenMP and quickly realized that it wasn't the way to go. Team 4 didn't extricate themselves from this mode. It became difficult to measure progress or decide whether to abandon certain choices. In short, this approach failed.

It's easy to be seduced into a trial-and-error mode when working with code, but it's especially dangerous when parallelizing because you face many new choices and unknowns. Instead, you need a plan of attack. First, figure out the "big picture." Then develop some hypotheses about where parallelization might yield speedups. Then start identifying the most promising choices and eliminating poor ones. Some back-of-the-envelope calculations might help.

Parallel Programming Isn't a Black Art
Most people still think of parallel programming as a black art. However, our graduate students had only one semester's experience with parallel programming and didn't find it overwhelming.

Dealing with parallelism requires new concepts, new algorithms, and special attention to new types of defects. However, we believe that parallel system design and parallel programming aren't black arts that only the elite can master.
We're hopeful that, with appropriate concepts, languages, tools, and training, competent programmers will handle multicore system challenges just fine. Given some time, the professional community will identify the engineering principles to support general-purpose, parallel software and the ways to teach and apply these principles successfully.

Practical Training
None of the students experienced difficulties in creating small parallel programs from scratch. However, things changed when they had to parallelize Bzip2. Despite intensive training in parallelization, the students reported a big difference between parallelizing small "toy" programs and a real-world application. They couldn't readily apply concepts such as loop parallelization or data partitioning. Bzip2 was not only more complex but also heavily optimized for sequential performance, making parallelization a difficult task.

It's well known that computer science students need exposure to real systems, and multicore in no way diminishes this need. Fortunately, many students are excited about the current shift to parallel computation and eager to learn and explore. They are further motivated by the prospect of a virtual job guarantee from being able to work with multicore/manycore software.

Other Issues

Beyond this study, the following advice might be helpful.

Consider Multiple Levels for Parallelization
Jon Louis Bentley identified different levels to consider when writing efficient, sequential programs:13 "In most systems there are a number of design levels at which we can increase efficiency, and we should work at the proper level to avoid being penny-wise and pound-foolish." This is excellent advice for parallelization as well.

If You Start from Scratch ...
When you're writing a new sequential program from scratch, ask whether it might be parallelized in the future. If so, write it in a way that saves future refactoring work. For example, keep computational tasks modular and free of side effects so that individual threads can execute them. You should parameterize functions right away to simplify the division of work. This controls the parts of data to which the functions apply and enables domain decomposition. It's also important to write functions in a thread-safe manner that allows many instances to run simultaneously.

It might help to think in terms of parallel design patterns when writing a multicore program from scratch.12 For example, the master-worker pattern manages independently running, parallel computations, while the pipeline and producer-consumer patterns break computations into phases that can be overlapped. You can combine parallel patterns (for instance, the master-worker pattern combines with the domain decomposition pattern), and you can split individual worker computations in pipeline fashion. Ideally, you would design all computational steps so that they can be packaged into threads. We recommend trying out prototypes early to check feasibility and performance. Prototypes are useful because you might not know cache effects or overhead costs of various building blocks (such as thread creation, communication, or I/O) ahead of time. These effects and costs might also deviate from what you expect.
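As a small illustration of that advice (ours, not the authors'), the function below is handed the exact slice of data it may touch and keeps no hidden state, so any number of threads can run it concurrently on disjoint ranges, which is what domain decomposition requires.

#include <stddef.h>

/* Thread-safe by construction: the caller decides which part of the data the
   function may read, and all output goes to a caller-supplied location. */
void checksum_range(const unsigned char *data, size_t begin, size_t end,
                    unsigned long *result)
{
    unsigned long sum = 0;
    for (size_t i = begin; i < end; i++)
        sum = sum * 31 + data[i];
    *result = sum;                         /* each thread writes to its own slot */
}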

Multicore software isn't merely a matter of programming in the small. Many aspects of classical software engineering apply and must be adapted, such as design patterns, performance modeling, prototyping, refactoring, and tool support. Multicore software engineering is an area that seeks to build a core of systematic, empirically validated engineering principles for general-purpose, parallel software. We hope this limited study helps open the door to this research.

Acknowledgments
We thank our student assistant Kai-Bin Bao and the course students for their support. We also appreciate the support of the excellence initiative at the University of Karlsruhe.

About the Authors

Victor Pankratius is the head of the young investigator group on multicore software engineering at the University of Karlsruhe. In multicore software engineering, his interests include autotuning, language extensions, debugging, empirical studies, and making multicore programming accessible to many developers. Pankratius has a Dr.rer.pol. from the University of Karlsruhe. He's a member of the IEEE Computer Society, the ACM, and the German Computer Science Society. Contact him at [email protected].

Ali Jannesari is a PhD student in the University of Karlsruhe's Multicore Software Engineering group. His interests include debugging parallel programs. He received a master's degree in computer science from the University of Stuttgart. Contact him at [email protected].

Walter F. Tichy is a professor of computer science at the University of Karlsruhe. He's also director of the Forschungszentrum Informatik, a technology transfer institute. His primary research interests are software engineering and parallelism. Tichy has a PhD in computer science from Carnegie Mellon University. He's a cofounder of ParTec, a company specializing in cluster computing, and a member of the IEEE Computer Society, the ACM, and the German Computer Science Society. Contact him at [email protected].

References
1. D. Szafron and J. Schaeffer, "An Experiment to Measure the Usability of Parallel Programming Systems," Concurrency: Practice and Experience, vol. 8, no. 2, 1996, pp. 147–166.
2. L. Hochstein et al., "Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers," Proc. 2005 ACM/IEEE Conf. Supercomputing, IEEE CS Press, 2005, p. 35.
3. L. Hochstein and V.R. Basili, "The ASC-Alliance Projects: A Case Study of Large-Scale Parallel Scientific Code Development," Computer, vol. 41, no. 3, 2008, pp. 50–58.
4. N. Werensteijn, Smpbzip2, May 2003; http://home.student.utwente.nl/n.werensteijn/smpbzip2.
5. Bzip2smp v. 1.0, Dec. 2005; http://bzip2smp.sourceforge.net.
6. J. Gilchrist, Parallel Bzip2 v. 1.0.2, 25 July 2007; http://compression.ca/pbzip2.
7. R.K. Yin, Case Study Research: Design and Methods, 3rd ed., Sage Publications, 2002.
8. V. Pankratius et al., "Parallelizing Bzip2: A Case Study in Multicore Software Engineering," tech. report, IPD Inst., Univ. of Karlsruhe, Apr. 2008; www.multicore-systems.org/research.
9. M. Tamm, "Data Compression with the BWT Algorithm," C't, vol. 16, 2000, pp. 194–201 (in German).
10. V. Pankratius et al., "Software Engineering for Multicore Systems: An Experience Report," Proc. 1st Int'l Workshop Multicore Software Eng. (IWMSE 08), ACM Press, 2008, pp. 53–60.
11. S. Akhter and J. Roberts, Multi-Core Programming, Intel Press, 2006.
12. T.G. Mattson et al., Patterns for Parallel Programming, Addison-Wesley, 2004.
13. J.L. Bentley, Writing Efficient Programs, Prentice Hall, 1982.

