SPES

Software Plattform Embedded Systems 2020

- Description of the Case Study "Multi-core and Many-core Evaluation" -

Version: 1.0

Project designation: SPES 2020
Responsible: Richard Membarth
QA responsible: Mario Körner, Frank Hannig
Created on: 18.06.2010
Last modified: 18.06.2010 16:09
Release status: project-public
Processing state: submitted

Created by: Richard Membarth

Contributors: Frank Hannig, Mario Körner, Wieland Eckert

Change Log

No.  Date      Version  Changed chapters  Description of change
1    22.06.10  1.0      All               Final report creation

Review Log

The following table gives an overview of all reviews of this document, both self-reviews and reviews by independent quality assurance.

Date    Reviewed version    New product state    Remarks    Reviewer

Contents

1 Evaluation Application and Criteria
  1.1 2D/3D Image Registration
  1.2 Checklist
  1.3 Profiling
  1.4 Parallelization Approaches

2 Multi-Core Frameworks
  2.1 OpenMP
  2.2 Cilk++
  2.3 Threading Building Blocks
  2.4 RapidMind
  2.5 OpenCL
  2.6 Discussion

3 Many-Core Frameworks
  3.1 RapidMind
  3.2 PGI Accelerator
  3.3 OpenCL
  3.4 CUDA
  3.5 Ct
  3.6 Larrabee
  3.7 Related Frameworks
    3.7.1 Bulk-Synchronous GPU Programming
    3.7.2 HMPP Workbench
    3.7.3 Goose
    3.7.4 YDEL for CUDA
  3.8 Discussion

4 Conclusion

Bibliography


Abstract

In this study, different parallelization frameworks for standard shared memory multi-core processors as well as parallelization frameworks for many-core processors like graphics cards are evaluated. To evaluate the frameworks, not only performance numbers are considered, but also other criteria like scalability, productivity, and further techniques supported by the framework, such as pipelining. For the evaluation, a computationally intensive application from medical imaging, namely 2D/3D image registration, is considered. In 2D/3D image registration, a preoperatively acquired volume is registered with an X-ray image. A 2D projection of the volume is generated and aligned with the X-ray image by translating the volume along and rotating it around the three coordinate axes. This alignment step is repeated to find the best match between the projected 2D image and the X-ray image. To evaluate the parallelization frameworks, two parallelization strategies are considered: on the one hand, fine-grained data parallelism, and on the other hand, coarse-grained task parallelism. We show that for most multi-core frameworks both strategies are applicable, whereas many-core frameworks support only fine-grained data parallelism. We compare relevant and widely used frameworks like OpenMP, Cilk++, Threading Building Blocks, RapidMind, and OpenCL for shared memory multi-core architectures. These include open source as well as commercially available solutions. Similarly, for many-core architectures like graphics cards, we consider RapidMind, PGI Accelerator, OpenCL, and CUDA. The frameworks take different approaches to provide parallelization support for the programmer, ranging from library solutions and directive-based compiler extensions to language extensions and completely new languages.


1 Evaluation Application and Criteria

Felix, qui potuit rerum cognoscere causas.

(Vergil)

In this chapter, the medical application chosen for the evaluation, namely the 2D/3D image registration, is introduced, as well as the criteria for the evaluation of the different frameworks. At the end, the implications of profiling the reference implementation for the evaluation are given, together with a description of the employed parallelization approaches.

1.1 2D/3D Image Registration

In medical settings, images of the same modality or of different modalities are often needed in order to provide precise diagnoses. However, a meaningful usage of different images is only possible if the images are correctly aligned beforehand. Therefore, an image registration algorithm is deployed. In the investigated 2D/3D image registration, a previously stored volume is registered with an X-ray image [KDF+08, WPD+97]. The X-ray image results from the attenuation of the X-rays through an object on their way from the source to the detector. The goal of the registration is to align the volume and the image. To this end, the volume can be translated along and rotated around the three coordinate axes. For such a transformation, an artificial X-ray image is generated by iterating through the transformed volume and calculating the attenuated intensity for each pixel. In order to evaluate the quality of the registration, the reconstructed X-ray image is compared with the original X-ray image using various similarity measures. To obtain the best alignment of the volume with the X-ray image, the parameters of the transformation are optimized until no further improvement is achieved. In this optimization step, one artificial X-ray image is generated for each evaluated set of transformation parameters and compared with the original image. The similarity measures include sequential algorithms like the summation of values over the whole image and have in parts random memory access patterns, for example for histogram generation.

Figure 1.1: Work flow of the 2D/3D image registration. [figure]

The work flow of the complete 2D/3D image registration, as shown in Figure 1.1, consists of two major computationally intensive parts. Firstly, a digitally reconstructed radiograph (DRR) is generated according to the parameter vector x = (tx, ty, tz, rx, ry, rz), describing the translation in mm along the axes and the rotation according to the Rodrigues vector [TV98]. Ray casting is used to generate the radiograph from the volume, casting one ray for each pixel through the volume. On its way through the volume, the intensity of the ray is attenuated depending on the material it passes. A detailed description with mathematical formulas on how the attenuation is calculated can be found in [KDF+08]. Secondly, intensity-based similarity measures are calculated in order to evaluate how well the digitally reconstructed radiograph and the X-ray image match. We consider three similarity measures, namely the sum of squared differences (SSD), normalized cross correlation (NCC), and mutual information (MI). These similarity measures are weighted to assess the quality of the current parameter vector x. To align the two images best, optimization techniques are used to find the best parameter vector. We use two different optimization techniques. In a first step, local search is used to find the best parameter vector by evaluating randomly changed parameter vectors. The changes to the parameter vector have to be within a predefined range, so that only parameter vectors similar to the input parameter vector are evaluated. The best of these parameter vectors is used in the second optimization technique, hill climbing. In hill climbing, always one of the parameters in the vector is changed: once increased by a fixed value and once decreased by the same value. This is done for all parameters, and the best parameter vector is taken afterwards for the next evaluation, now using a smaller value to change the parameters. This is repeated until no further improvement is found.
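The hill-climbing step can be sketched as follows. This is only an illustrative sketch: the ParamVec type, the evaluate() stand-in, and the halving of the step width are assumptions for the example, not the actual implementation or parameter schedule used in this study.

    #include <array>
    #include <cstddef>

    using ParamVec = std::array<double, 6>;   // (tx, ty, tz, rx, ry, rz)

    // Dummy stand-in so the sketch is self-contained; the real function generates
    // the DRR for p and returns the weighted SSD/NCC/MI similarity (higher is better).
    double evaluate(const ParamVec& p) {
        double s = 0.0;
        for (double v : p) s -= v * v;        // pretend the optimum is at the origin
        return s;
    }

    ParamVec hill_climbing(ParamVec best, double step, double min_step) {
        double best_q = evaluate(best);
        while (step > min_step) {
            ParamVec round_best = best;
            double round_q = best_q;
            // change each parameter once by +step and once by -step
            for (std::size_t i = 0; i < best.size(); ++i) {
                for (double delta : {+step, -step}) {
                    ParamVec cand = best;
                    cand[i] += delta;
                    double q = evaluate(cand);
                    if (q > round_q) { round_q = q; round_best = cand; }
                }
            }
            if (round_q <= best_q) break;     // no further improvement: stop
            best = round_best; best_q = round_q;
            step *= 0.5;                      // next round uses a smaller step (assumed schedule)
        }
        return best;
    }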

1.2 Checklist

The following criteria have been chosen as being relevant for the evaluation of the multi-core and many-core frameworks. The criteria should be evaluated for each of the frameworks and, afterwards, a summary of the evaluation of all frameworks should be given.

• Time required to get familiar with a particular framework: This includes the time to understand how the framework works and how the framework can be utilized for parallelization, as well as experimenting with some small examples.

• Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added: This item describes the effort that has to be invested in order to map the reference code to the new framework. This corresponds mainly to the required time to rewrite parts of the source code, but also the effort to express sequential algorithms in parallel counterparts.

• Support of parallelization by the framework: To what extent does the framework support the programmer in parallelizing a program, for example, abstraction of available resources (cores), management of those resources, or support of auto-parallelization.

• Support of data partitioning by the framework: Data is often tiled and processed by different threads and cores, respectively. Does the framework support automatic partitioning of data, or does this have to be done by hand?

• Applicability for different algorithm classes and target platforms: Which types of parallel algorithms can be expressed? We consider in particular the support of task parallelism and data parallelism.

• Advantages, drawbacks, and difficulties of a particular framework: Here, the unique properties of the framework that distinguish it from other frameworks are listed, but also problems encountered while employing the framework.

• Effort to get a certain result, for example, performance or throughput: How much effort has to be spent in order to achieve a given result? For instance, this may be a speedup of 3x on a system with four cores.


• Scalability of the framework with respect to architecture family, new hardware: How well does the framework abstract from the hardware, so that programs can be executed on different hardware of the same architecture family?

• Scalability in terms of problem size: How does the execution time scale with the problem size? Does the framework scale well even for smaller problem sizes?

• Resource awareness of framework (run-time system): Typically a program tries to use as many resources as available. However, this does not mean that the resources are idle and can be occupied exclusively. Does the run-time system of the framework detect such situations and adapt its resources accordingly?

• Resource management for multiple sub-algorithms (image pipeline): Pipelining is a common concept in image processing, where the output of one algorithm is directly passed to the next algorithm. Does the framework provide support for this type of processing? Are the resources for this type of processing treated in a special way?

• Support of streaming by the framework: In many domains a continuous stream of data has to be processed, for example a sequence of images in image processing. Does the framework provide support to handle such a streaming concept, for example, to prepare and load the next image while the current image is processed?

1.3 Profiling

For profiling, tools like gprof, cachegrind, callgrind, and likwid (http://code.google.com/p/likwid) are used to evaluate the computationally intensive parts of the 2D/3D image registration as well as to identify the expected behavior when changing the resolution of the volume and the image. Cachegrind and callgrind simulate the instruction and data accesses of a given program compiled with debugging support. They model level one and level two caches and give detailed information like cache misses for instructions. This information can be annotated to the source code for further evaluation. Similar data can be obtained with likwid, which reads and interprets the corresponding hardware counters. In contrast, gprof profiles an application at run time by sampling and gives a detailed breakdown of the time spent in the different parts of an application. The output of gprof for a volume of 256 × 256 × 189 voxels and an image resolution of 320 × 240 pixels is as follows:



    Each sample counts as 0.01 seconds.
      %   cumulative     self                self     total
     time    seconds   seconds       calls  s/call   s/call  name
    72.02      44.16     44.16  2845058374    0.00     0.00  getNN
    21.08      57.09     12.92         161    0.08     0.38  get_drr
     3.08      58.98      1.89    11978400    0.00     0.00  intersect
     2.61      60.58      1.60    23945362    0.00     0.00  norm
     0.38      60.81      0.23           1    0.23     0.23  load_volume
     0.31      61.00      0.19         160    0.00     0.00  gc
     0.29      61.18      0.18    11972434    0.00     0.00  m_mul_v
     0.13      61.26      0.08           1    0.08     0.08  reduce_volume
     0.05      61.29      0.03                                main
     0.03      61.31      0.02         160    0.00     0.00  mi
     0.02      61.32      0.01         644    0.00     0.00  m_init
     0.02      61.33      0.01         160    0.00     0.00  quality
     0.02      61.34      0.01           1    0.01     0.01  load_image
     0.00      61.34      0.00         483    0.00     0.00  m_mul_m
     0.00      61.34      0.00         161    0.00     0.00  applyTransform
     0.00      61.34      0.00         161    0.00     0.00  mReg
     0.00      61.34      0.00          31    0.00     0.00  vec_div
     0.00      61.34      0.00           3    0.00     0.00  reduce_image
     0.00      61.34      0.00           2    0.00     0.00  write_image
     0.00      61.34      0.00           1    0.00    22.73  hill_climbing
     0.00      61.34      0.00           1    0.00    37.88  local_search

It can be seen that most of the time is spent in the functions that generate the digitally reconstructed radiograph (getNN, get_drr, intersect, and norm), while the time for the quality measures is insignificant (gc, mi; ssd is not even listed). The most time-intensive function (getNN) accesses the volume with a pattern that does not utilize the cache.

For the representation of the image and the volume, we use single-precision floating point throughout the program; only the calculation of the transformation according to the vector x uses double-precision floating point. The accuracy of floats is sufficient for the generation of the radiograph and the quality measures and allows a fair comparison of the results on the different architectures, since most graphics cards support only single-precision floating point. Even the values of the volume are stored as single-precision floating point numbers, although the more compact short integer representation would be sufficient. This saves value conversions, but requires more memory for storage and more memory bandwidth.

For the evaluation of the scalability in terms of problem size, the three volume and image resolutions listed in Table 1.1 are considered. When moving from one resolution to the next, the computational intensity of the quality measures increases by a factor of 4, since the number of pixels in the image increases by this factor. However, for the generation of the radiograph, an additional factor of 2 has to be considered, since the number of voxels passed by each ray cast through the volume increases by this factor. Thus, the execution time should increase roughly by a factor of 8 when the resolution of the volume and image doubles. Using likwid to count the instructions when moving to a different resolution, the number of instructions increases by a factor of 7.3 (7.6) when moving from the small volume to the middle volume (from the middle volume to the large volume). On the main system used for benchmarking, the execution time increases by a factor of 8.6 (8.5) when moving from the small volume to the middle volume (from the middle volume to the large volume). The execution time should also increase roughly by the same factor when using one of the frameworks.
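Written out for the step from the middle to the large resolution (cf. Table 1.1), the expected increase is the product of the growth in image pixels and in voxels traversed per ray:

\[
\underbrace{\frac{640 \times 480}{320 \times 240}}_{\text{pixels: } 4\times} \cdot \underbrace{\frac{378}{189}}_{\text{voxels per ray: } 2\times} = 8,
\]

which is close to the measured factors of 7.6 (instructions) and 8.5 (execution time) for this step.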

Table 1.1: Different volume and image sizes considered for evaluation.

Volume resolution    Image resolution    Volume size (MB)
128 × 128 × 94       160 × 120             6
256 × 256 × 189      320 × 240            49
512 × 512 × 378      640 × 480           396
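The volume sizes in Table 1.1 are consistent with the single-precision storage (4 bytes per voxel) described above, for example:

\[
256 \times 256 \times 189 \times 4\,\text{B} \approx 49.5\,\text{MB}, \qquad
512 \times 512 \times 378 \times 4\,\text{B} \approx 396\,\text{MB}.
\]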

1.4 Parallelization Approaches

For parallelization on the software side, there exist mainly two prominent strategies, namely fine-grained data-parallel and coarse-grained task-parallel parallelization. Both approaches are applicable to the 2D/3D image registration and are described in detail here.

Fine-grained Parallelization describes parallel processing that is focused on data elements, that is, typically one thread operates per data element. For the 2D/3D image registration, this means that one thread per pixel is used to generate the radiograph in parallel or to calculate the quality measures in parallel. The threads are lightweight and do only little work before they finish, compared to coarse-grained parallelization. More precisely, each iteration of a loop processing an image is executed by one thread. This type of parallelization is typically used on massively parallel architectures like graphics cards.

Coarse-grained Parallelization describes the parallel execution of different tasks. Compared to fine-grained data parallelism, each thread does not execute only a few operations on single data elements, but typically performs an operation on the whole data set. For the 2D/3D image registration, this means that one thread performs the evaluation of one parameter vector x. Different parameter vectors are evaluated in parallel by different threads. This type of parallelization is typically used on standard multi-core processors.

These two parallelization approaches are used to evaluate the different frameworks in Chapter 2 and Chapter 3. While for most frameworks for standard shared memory multi-core architectures both approaches are applicable, for many-core graphics cards only the fine-grained approach can be used.


In order to evaluate the overhead of the two parallelization approaches on standard multi-core processors, we estimate the sequential fraction of the parallelized part of the 2D/3D image registration using curve fitting. For this, we use the formula of Amdahl's law [Amd67] given in Equation (1.1). The sequential fraction of the application is denoted by α and the number of processors by n. Using the measured speedup S(n), α can be estimated by curve fitting.

\[
S(n) = \frac{1}{\alpha + \frac{1-\alpha}{n}} \tag{1.1}
\]
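As an illustration of how α can be obtained, the following sketch fits Equation (1.1) to measured speedups with a simple least-squares grid search. The grid search is an illustrative substitute for the curve-fitting tool actually used, and the speedup values in main() are made up.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Amdahl's law: predicted speedup on n cores for a sequential fraction alpha.
    static double amdahl(double alpha, int n) {
        return 1.0 / (alpha + (1.0 - alpha) / n);
    }

    // Least-squares grid search for alpha; speedup[i] is the measured speedup on i+1 cores.
    static double fit_alpha(const std::vector<double>& speedup) {
        double best_alpha = 0.0, best_err = 1e300;
        for (double a = 0.0; a <= 1.0; a += 1e-4) {
            double err = 0.0;
            for (std::size_t i = 0; i < speedup.size(); ++i) {
                double d = speedup[i] - amdahl(a, static_cast<int>(i) + 1);
                err += d * d;
            }
            if (err < best_err) { best_err = err; best_alpha = a; }
        }
        return best_alpha;
    }

    int main() {
        std::vector<double> s = {1.00, 1.92, 2.78, 3.55};   // made-up measurements for 1..4 cores
        std::printf("fitted alpha: %.2f %%\n", 100.0 * fit_alpha(s));
        return 0;
    }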

Histogram Generation is part of one of the similarity measures, namely mutual information, and has been identified as one of the most challenging algorithms for parallelization. While the sequential implementation on a single core, as shown in Listing 1.1, is straightforward, the parallelization of this code raises challenges for the programmer. Firstly, the access pattern to the histogram is not regular, but depends on the value of the current pixel. Secondly, updating the same bin from different threads in parallel may lead to race conditions. Therefore, the histogram generation will be investigated in detail for each parallelization framework.

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            hist[img[y*width + x]]++;
        }
    }

Listing 1.1: Sequential histogram generation.


2 Multi-Core Frameworks

In this chapter, frameworks for programming standard multi-core systems found in today's desktop computers are evaluated. The frameworks we consider are to a large extent those that are relevant to and used in industry. For the evaluation, a system consisting of four Intel Xeon Dunnington CPUs is used, each hosting six cores running at 2.67 GHz. This allows up to 24 cores to be used. Only for OpenCL, a different system had to be chosen due to an incompatibility of the OpenCL library with the glibc version on the Dunnington system. Therefore, we use a system consisting of two Intel Nehalem-EP CPUs, each hosting four cores running at 2.66 GHz. These CPUs support hyperthreading, that is, simultaneous multithreading, the ability to run two threads per core, and, hence, up to 16 threads in total. Table 2.1 lists the compiler and framework versions used for each of the investigated frameworks.

Table 2.1: Compiler version and framework version used for evaluation.

Framework    Version
OpenMP       gcc/4.4.2, OpenMP 3.0
Cilk++       gcc/4.2.4, Cilk++ 1.10
TBB          icc/11.1, TBB 3.0
RapidMind    gcc/4.4.2, RapidMind 4.0.1
OpenCL       gcc/4.4.2, OpenCL 1.0, Stream 2.0.1/2.1

2.1 OpenMP

Framework

Open Multi-Processing (OpenMP) is a standard that defines an application programming interface (API) to specify shared memory parallelism in C, C++, and Fortran programs [Ope09]. The OpenMP specification [Ope08] is implemented by all major compilers like Microsoft's compiler for Visual C++ (MSVC), the GNU Compiler Collection (gcc), or Intel's C++ compiler (icc). OpenMP provides preprocessor directives, so-called pragmas, to express parallelism. These pragmas specify which parts of the code should be executed in parallel and how data should be shared. The basic idea behind OpenMP is a fork-join model, where one master thread executes throughout the whole program and forks off threads to process parts of the program in parallel. OpenMP provides different work-sharing constructs which allow two types of parallelism to be expressed, namely task parallelism and data parallelism. The main source of data parallelism are loop programs iterating over a data set, performing the same operation on each element. This can be expressed in OpenMP using a parallel for loop:

    #pragma omp parallel for
    // parallel loop

For task parallelism, OpenMP 3.0 introduced the task concept to create independent tasks, which are processed by different threads:

    #pragma omp task
    // call to first function executed as task

    #pragma omp task
    // call to second function executed as task

Only these work-sharing constructs need to be specified in order to execute the annotated code fragments in parallel. The actual parallelization is done by the compiler, mapping loop iterations or tasks to the underlying threading concept like POSIX threads (Pthreads) on GNU Linux. Executing parallel code on shared memory machines may lead to race conditions when multiple threads access the same data. Therefore, OpenMP provides clauses to specify how data is shared between threads; for example, data can be shared or private to a thread. In addition, it is possible to synchronize access to variables, for instance using the critical directive. OpenMP also provides a library with user-level functions to obtain information about OpenMP-related system and program properties. Some of these properties, like the number of threads to be used, can be adjusted as well; alternatively, environment variables can be used to adjust these properties.

Listing 2.1 shows how histogram generation can be parallelized using OpenMP. The pragma specifies that the outer loop should be executed in parallel. The width and height parameters are also private, but initialized with the values the variables hold before entering the loop. The image and the histogram are shared between the threads, while the loop iterators are private to each thread. Since multiple threads may update the same bin of the histogram, the access to the bins has to be synchronized. This is done using the omp critical pragma. However, every time a bin counter is increased, a critical section is entered and only one thread can increase bin counters at a time. This reduces the parallelism and involves some overhead to synchronize the access to the bin counters. Therefore, we use a second alternative, shown in Listing 2.2, where each thread creates and works on a temporary histogram. These thread-local histograms are eventually merged into the final histogram using a critical section. This approach has less synchronization overhead and is faster than the sequential implementation, while the implementation from Listing 2.1 takes longer than the sequential implementation.


    #pragma omp parallel for default(none) shared(img, hist) private(x, y) firstprivate(height, width)
    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            #pragma omp critical
            hist[img[y*width + x]]++;
        }
    }

Listing 2.1: Histogram generation in OpenMP using a critical section.

    int hist_tmp[MAX_INTENSITY];
    memset(hist_tmp, 0x0, MAX_INTENSITY*sizeof(int));

    #pragma omp parallel default(none) shared(img, hist) private(x, y) firstprivate(height, width, hist_tmp)
    {
        #pragma omp for
        for (y = 0; y < height; y++) {
            for (x = 0; x < width; x++) {
                hist_tmp[img[y*width + x]]++;
            }
        }
        // merge the thread-local histogram into the final histogram
        #pragma omp critical
        for (int i = 0; i < MAX_INTENSITY; i++) {
            hist[i] += hist_tmp[i];
        }
    }

Listing 2.2: Histogram generation in OpenMP using thread-local histograms merged in a critical section.
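While Listings 2.1 and 2.2 show the fine-grained, data-parallel variant, the coarse-grained approach maps naturally onto the OpenMP 3.0 task construct: every candidate parameter vector of the local search is evaluated as an independent task. The following is only a sketch; Params and evaluate_candidate() are hypothetical placeholders for the registration code.

    #include <cstddef>
    #include <vector>

    struct Params { double t[3], r[3]; };     // translation and rotation parameters

    // Placeholder: generates the DRR for p and returns the weighted similarity.
    double evaluate_candidate(const Params& p) { (void)p; return 0.0; }

    void evaluate_all(const std::vector<Params>& candidates, std::vector<double>& quality) {
        quality.resize(candidates.size());
        #pragma omp parallel
        {
            #pragma omp single
            for (std::size_t i = 0; i < candidates.size(); ++i) {
                // candidates and quality are shared by default; i becomes firstprivate
                #pragma omp task firstprivate(i)
                quality[i] = evaluate_candidate(candidates[i]);
            }
        }   // implicit barrier: all tasks have finished here
    }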


Evaluation

Time required to get familiar with a particular framework:
OpenMP requires little time to get familiar with. After one day, the main features and the concept behind OpenMP are thoroughly understood and the application can be parallelized using OpenMP.

Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added:
Existing code can be parallelized with almost no changes to the original code; only pragmas have to be inserted to indicate to the compiler which parts of the program are to be parallelized. A few days are required for the whole application. Only the optimized histogram generation took a serious amount of time to come up with a proper solution and realize it using OpenMP. For the coarse-grained approach, some restructuring of the code is necessary.

Support of parallelization by the framework:
The parallelization is completely done by the framework; only the code fragments to be parallelized have to be annotated. In addition, hints for data sharing and synchronization are required.

Support of data partitioning by the framework:
The main data partitioning is done by the user, that is, the loops iterating over the data set are not altered. The assignment of loop iterations to threads, however, can be specified with the schedule clause. For example, schedule(static, 1) tells the compiler to use static scheduling and to always assign chunks of one iteration in a round-robin fashion to the threads. Alternatively, dynamic scheduling can be used, where threads request new chunks after they have finished processing their current chunk (a short example is sketched below).

Applicability for different algorithm classes and target platforms:
OpenMP supports fine-grained data parallelism as well as coarse-grained task parallelism. Support for real task parallelism was added in OpenMP 3.0.

Advantages, drawbacks, and difficulties of a particular framework:
The biggest advantage of OpenMP is that it is relatively easy to parallelize the code. No new syntax has to be learned; only hints for the compiler have to be given. One major drawback of OpenMP is that no support is provided to detect errors specific to parallel execution, like race conditions. Initially, static memory management was used for the coarse-grained approach to assign each thread its memory from a preallocated memory pool. However, in some cases the results differed from our reference implementation. The thread ID was mapped to a certain memory region that a particular thread could work on. Sometimes the iterations scheduled to the cores did not take the same amount of time; this way, a newly assigned memory region was still in use by a different thread when the thread ID changed. Presumably the thread ID assigned for the next iteration was still in use by a different thread in the current iteration. Using a barrier synchronization solved this problem. Since the static memory management has no advantages over dynamic memory management, it was not considered further.
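As a small illustration of the schedule clause mentioned under the data-partitioning item above (process_row() and height are generic placeholders, not part of the registration code):

    void process_row(int y);                       // placeholder for the per-row work

    void schedule_examples(int height) {
        // static scheduling, chunk size 1: iterations are assigned round-robin to the threads
        #pragma omp parallel for schedule(static, 1)
        for (int y = 0; y < height; ++y)
            process_row(y);

        // dynamic scheduling: a thread requests the next chunk of 4 iterations on demand
        #pragma omp parallel for schedule(dynamic, 4)
        for (int y = 0; y < height; ++y)
            process_row(y);
    }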


Effort to get a certain result, for example, performance or throughput:
The effort to get a reasonable speedup is relatively small, since only minor changes to the code have to be made.

Scalability of the framework with respect to architecture family, new hardware:
Figures 2.1, 2.2, and 2.3 show the execution times of the fine-grained and coarse-grained OpenMP implementations on up to 24 cores for different volume and image resolutions. The exact times are additionally listed in Table 2.2 for the fine-grained implementation and in Table 2.3 for the coarse-grained implementation. The graphs show that the fine-grained implementation requires five cores to compensate the parallelization overhead and catch up with the reference implementation. In contrast, the coarse-grained implementation runs as fast as the baseline implementation already on one core. The saturation point, where adding further cores yields no improvement, is also reached earlier for the coarse-grained implementation. While saturation is reached at about 3.5 s for the coarse-grained implementation, only about 5.6 s can be obtained using the fine-grained implementation for the middle resolution.
Plotting the speedup of both implementations for the different volume and image resolutions in Figures 2.4, 2.5, and 2.6 shows that the speedups obtained by the fine-grained implementation are far below those of the coarse-grained implementation, although they improve with the problem size. Using Amdahl's law to determine the sequential part of each implementation results in an alpha of roughly 2 % (2.55, 1.70, and 1.94) for the coarse-grained implementation, while the alpha for the fine-grained implementation is much higher and fluctuates more (24.55, 10.14, and 7.29). This shows that the parallelization overhead of OpenMP is much higher for the fine-grained implementation and can only be compensated by larger data sets.


Table 2.2: Execution times in seconds using the fine-grained OpenMP implementation on up to 24 cores for different volume resolutions.

# cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
 1        33.98            269.08            2,158.20
 2        16.64            130.28            1,040.89
 3        10.74             82.23              662.75
 4         7.77             59.54              475.15
 5         5.98             44.59              358.76
 6         4.83             35.77              282.91
 7         4.26             29.43              231.31
 8         3.39             23.78              186.58
 9         3.11             20.23              160.94
10         2.58             17.04              135.79
11         2.38             15.20              117.38
12         2.11             13.06              103.09
13         2.15             12.06               90.52
14         1.95             11.32               83.50
15         1.72              9.72               73.27
16         1.76              8.77               65.80
17         1.81              8.96               62.35
18         1.61              8.31               56.93
19         1.63              7.53               55.01
20         1.50              6.87               49.59
21         1.52              6.79               46.66
22         1.51              6.24               44.46
23         1.55              6.31               42.23
24         1.40              5.63               39.68


Table 2.3: Execution times in seconds using the coarse-grained OpenMP implementation on up to 24 cores for different volume resolutions.

# cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
 1         6.43             55.56              472.89
 2         3.23             27.79              237.50
 3         2.19             18.81              160.35
 4         1.63             13.97              119.03
 5         1.43             12.21              104.36
 6         1.11              9.42               80.91
 7         1.03              8.72               74.88
 8         0.95              8.02               69.23
 9         0.91              7.71               66.30
10         0.84              7.03               60.26
11         1.14              7.02               60.31
12         0.60              4.93               63.34
13         0.57              4.54               40.53
14         0.56              4.59               40.12
15         0.52              4.22               36.62
16         0.72              4.23               36.47
17         0.49              3.83               33.58
18         0.49              3.82               33.61
19         0.49              3.89               34.15
20         0.45              3.52               30.73
21         0.46              3.53               30.72
22         0.46              3.56               30.68
23         0.46              3.56               30.46
24         0.46              3.53               31.20


Figure 2.1: Execution times of fine-grained and coarse-grained OpenMP parallelizations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.2: Execution times of fine-grained and coarse-grained OpenMP parallelizations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.3: Execution times of fine-grained and coarse-grained OpenMP parallelizations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.4: Speedup of the fine-grained and coarse-grained OpenMP implementations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation (fitted alpha: 2.55 % coarse-grained, 24.55 % fine-grained). [plot: speedup vs. cores]


Figure 2.5: Speedup of the fine-grained and coarse-grained OpenMP implementations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation (fitted alpha: 1.70 % coarse-grained, 10.14 % fine-grained). [plot: speedup vs. cores]


Figure 2.6: Speedup of the fine-grained and coarse-grained OpenMP implementations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation (fitted alpha: 1.94 % coarse-grained, 7.29 % fine-grained). [plot: speedup vs. cores]


Scalability in terms of problem size:
As seen in Figures 2.4, 2.5, and 2.6, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in OpenMP. For the small volume, 24.55 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 7.29 % for the large volume. For the coarse-grained implementation, only insignificant differences can be seen. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. While this is the case for the coarse-grained implementation, as seen in Figure 2.8, the ratio drops with an increasing number of cores for the fine-grained implementation (see Figure 2.7). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of OpenMP is compensated by more cores.
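Using the fine-grained times from Table 2.2 for the 128 → 256 step as a concrete example:

\[
\frac{269.08\,\mathrm{s}}{33.98\,\mathrm{s}} \approx 7.9 \;\text{on one core}, \qquad
\frac{5.63\,\mathrm{s}}{1.40\,\mathrm{s}} \approx 4.0 \;\text{on 24 cores},
\]

that is, the ratio starts close to the expected factor of 8 and drops as more cores absorb the parallelization overhead.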

Figure 2.7: Scalability of the fine-grained OpenMP parallelization: shown is the execution time ratio when moving to a different volume resolution (128 → 256 and 256 → 512). [plot: execution time ratio vs. cores]

Resource awareness of framework (run-time system):
OpenMP is, to our knowledge, not resource aware.

Resource management for multiple sub-algorithms (image pipeline):
OpenMP does not provide support for resource management for pipelining.


Figure 2.8: Scalability of the coarse-grained OpenMP parallelization: shown is the execution time ratio when moving to a different volume resolution (128 → 256 and 256 → 512). [plot: execution time ratio vs. cores]

Support of streaming by the framework: OpenMP does not provide streaming support for series of images.

2.2 Cilk++

Framework

Cilk++ [Lei09] is a commercial version of the Cilk [BJK+95] language developed at MIT for multithreaded parallel programming. Cilk++ was recently acquired by Intel and has since been available from them. It is an extension to the C++ language adding three basic keywords to process loops in parallel, launch new tasks, and synchronize between tasks. These keywords allow task as well as data parallelism to be expressed. In addition, Cilk++ provides hyperobjects, constructs that solve data race problems created by parallel access to global variables without locks. For code generation, Cilk++ provides two compilers, one based on MSVC for Windows platforms and one based on gcc for GNU Linux. Cilk++ also provides tools to detect race conditions and to estimate the achievable speedup and inherent parallelism of cilkified programs. Its run-time system implements an efficient work-stealing algorithm that distributes the workload to idle processors.


    cilk::mutex c_mutex;

    cilk_for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            c_mutex.lock();
            hist[img[y*width + x]]++;
            c_mutex.unlock();
        }
    }

Listing 2.3: Histogram generation in Cilk++ using a mutex.

The three basic keywords are cilk_for to execute loops in parallel, cilk_spawn to launch new tasks, and cilk_sync to synchronize between different tasks previously launched. To cope with race conditions, Cilk++ also provides low-level synchronization directives like mutexes in addition to hyperobjects. Calculating a histogram in parallel only requires the programmer to replace the for keyword by the Cilk++ cilk_for keyword and to synchronize the access to the histogram bins using a mutex, as shown in Listing 2.3.

To avoid the synchronization overhead on each bin counter update, we again try to use thread-local histograms. However, compared to OpenMP, there is no parallel closure enclosing the parallelized loop that could be used to create a thread-local histogram. Instead, thread-local variables have to be declared inside the cilk_for loop and would be recreated for each iteration. Therefore, we add an additional outermost loop with an iteration count equal to the number of threads available. The iteration count of the loop iterating over the image is adjusted accordingly, as shown in Listing 2.4. After the thread-local histograms have been created, they are merged into the final histogram, synchronizing the access to the bin counters. This way, the synchronization overhead is minimized and a significant speedup is achieved compared to the implementation of Listing 2.3.

Furthermore, in some cases the calculations produced wrong results. Declaring only those functions as Cilk++ functions which use Cilk++ keywords, and all other functions explicitly as C functions, solved these problems. To detect race conditions, Cilk++ provides the Cilkscreen Race Detector. This tool runs the parallel application on one core and reports any location in the program that may result in race conditions, that is, writes to the same memory location. To estimate the achievable speedup of an application, the Cilkscreen Parallel Performance Analyzer counts the instructions of the executed application and gives information about the potential inherent parallelism of the application (see below).


    cilk::mutex c_mutex;

    int num_workers = cilk::current_worker_count();
    int nheight = (int) ceil((float)(height)/(float)num_workers);

    #pragma cilk_grainsize = 1
    cilk_for (int k = 0; k < num_workers; ++k) {
        int hist_tmp[MAX_INTENSITY] = { 0 };              // thread-local histogram
        for (int y = k*nheight; y < (k+1)*nheight; ++y) {
            if (y >= height) break;
            for (int x = 0; x < width; ++x) {
                hist_tmp[img[y*width + x]]++;
            }
        }
        // merge the local histogram into the final histogram
        c_mutex.lock();
        for (int i = 0; i < MAX_INTENSITY; ++i) {
            hist[i] += hist_tmp[i];
        }
        c_mutex.unlock();
    }

Listing 2.4: Histogram generation in Cilk++ using thread-local histograms.
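Listings 2.3 and 2.4 only use cilk_for; the coarse-grained approach builds on the task keywords instead, spawning one evaluation per candidate parameter vector. The following is only a sketch: Params and evaluate_candidate() are hypothetical placeholders, and the file would have to be compiled as a Cilk++ source.

    #include <cstddef>
    #include <vector>

    struct Params { double t[3], r[3]; };     // translation and rotation parameters

    // Placeholder: generates the DRR for p and returns the weighted similarity.
    double evaluate_candidate(const Params& p) { (void)p; return 0.0; }

    void evaluate_all(const std::vector<Params>& candidates, std::vector<double>& quality)
    {
        quality.resize(candidates.size());
        for (std::size_t i = 0; i < candidates.size(); ++i)
            quality[i] = cilk_spawn evaluate_candidate(candidates[i]);   // one task per candidate
        cilk_sync;                                                       // wait for all evaluations
    }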


Evaluation

Time required to get familiar with a particular framework:
Cilk++ requires little time to get familiar with. After one day, the main features and the concept behind Cilk++ are thoroughly understood and the application can be parallelized using Cilk++.

Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added:
Existing code can be parallelized with almost no changes to the original code; only the three basic keywords have to be inserted to indicate to the compiler which parts of the program are to be parallelized. However, when porting an application to Cilk++, the parallelized functions use a different (internal) calling convention and have to be declared as extern "Cilk++"; hence, all cilkified functions and their calling functions have to be declared that way. Only a few days are required to parallelize the whole application. Only the optimized histogram generation took a serious amount of time to come up with a proper solution and realize it using Cilk++. For the coarse-grained approach, some restructuring of the code is necessary.

Support of parallelization by the framework:
The parallelization is completely done by the framework; only the three basic keywords have to be used to express the parallelism.

Support of data partitioning by the framework:
The main data partitioning is done by the user, that is, the loops iterating over the data set are not altered. The assignment of loop iterations to threads, however, can be specified with the cilk_grainsize pragma, assigning n consecutive loop iterations to the same thread.

Applicability for different algorithm classes and target platforms:
Cilk++ supports fine-grained data parallelism as well as coarse-grained task parallelism.

Advantages, drawbacks, and difficulties of a particular framework:
Similarly to OpenMP, it is relatively easy to parallelize existing code; only minor changes to the source code are required. To convert an existing application to a cilkified version, the Cilk++ Programmer's Guide suggests to "rename the Cilk++ source files, replacing the .cpp extension with .cilk" before compiling the program. However, this introduced additional overhead, and the execution took about 30 % longer compared to the OpenMP implementation. Cilkifying only the parallelized files and functions removes this overhead again. The biggest advantages of Cilk++ are the additional tools it provides. To avoid race conditions, the compiler emits warnings where race conditions may arise (which need not actually be the case):

    histogram.cilk:101: warning: writes to 'hist' in loop body may race

Race conditions encountered during the execution of a program are reported by the Cilkscreen Race Detector. To this end, the program is executed by only one thread, and reads/writes to each memory location are analyzed for real race conditions:


    Race condition on location 0x6b9bd8
      write access at 0x40953d: (histogram.cilk:244, __cilk_loop_d_004+0xd5)
      read  access at 0x409520: (histogram.cilk:244, __cilk_loop_d_004+0xb8)
        called by 0x7f6acc5a3379: (_Z15cilkscreen_loopIPFvPvmmEmEQbT_S0_T0_+0xa9)
        called by 0x7f6acc5a390e: (_ZN4cilk13cilk_for_loopEQvPFvPvmmES0_mm+0x15e)
        called by 0x40bb68: (main.cilk:144, _Z9cilk_mainQiiPPc+0xbd4)
        called by 0x40d2e3: (_ZN4cilk9main_wrapEQiPv+0x53)
        called by 0x7f6acc5a25d1: (_Z20cilk_run_wrapper_intQvPv+0x51)
        called by 0x7f6acc5a470a: (__cilkrts_init_helper+0x5)

The second tool, the Cilkscreen Scalability Analyzer, provides a profile of the parallelism and parallelization overhead of an executed program, including speedup estimates for different core counts. For example, the best neighbor search with up to 12 neighbors investigated in parallel gives the output below. The work reported by the tool is the total work to be executed by all threads, and the span describes the work to be executed on the critical path within the application. Using these numbers, the maximal inherent parallelism of the current implementation can be evaluated. For a high span, the parallelization approach may have to be reconsidered.

    1) Parallelism Profile
       Work:                                   15196207411 instructions
       Span:                                   1266809092 instructions
       Burdened span:                          1266809092 instructions
       Parallelism:                            12.00
       Burdened parallelism:                   12.00
       Number of spawns/syncs:                 11
       Average instructions / strand:          446947276
       Strands along span:                     3
       Average instructions / strand on span:  422269697
       Total number of atomic instructions:    0
       Frame count:                            25
    2) Speedup Estimate
        2 procs:  1.75 - 2.00
        4 procs:  2.81 - 4.00
        8 procs:  4.02 - 8.00
       16 procs:  5.12 - 12.00
       32 procs:  5.93 - 12.00
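The reported parallelism is simply the ratio of total work to span, which is consistent with the twelve neighbors investigated in parallel:

\[
\text{parallelism} = \frac{\text{work}}{\text{span}} = \frac{15\,196\,207\,411}{1\,266\,809\,092} \approx 12.0 .
\]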

Effort to get a certain result, for example, performance or throughput:
The effort to get a reasonable speedup is relatively small, since only minor changes to the code have to be made.

Scalability of the framework with respect to architecture family, new hardware:
Figures 2.9, 2.10, and 2.11 show the execution times of the fine-grained and coarse-grained Cilk++ implementations on up to 24 cores for different volume and image resolutions. The exact times are additionally listed in Table 2.4 for the fine-grained implementation and in Table 2.5 for the coarse-grained implementation. The graphs show that the fine-grained implementation requires two cores to compensate the parallelization overhead and catch up with the reference implementation. In contrast, the coarse-grained implementation runs as fast as the baseline implementation already on one core. The saturation point, where adding further cores yields no improvement, is reached at roughly the same core count for both implementations. While saturation is reached at about 3.6 s for the coarse-grained implementation, only about 8.7 s can be obtained using the fine-grained implementation for the middle resolution.

Plotting the speedup of both implementations for the different volume and image resolutions in Figures 2.12, 2.13, and 2.14 shows that the speedups obtained by the fine-grained implementation are far below those of the coarse-grained implementation, although they improve with the problem size. Using Amdahl's law to determine the sequential part of each implementation results in an alpha of roughly 2 % (2.12, 1.87, and 2.32) for the coarse-grained implementation, while the alpha for the fine-grained implementation is much higher and fluctuates more (30.97, 13.68, and 8.79). This shows that the parallelization overhead of Cilk++ is much higher for the fine-grained implementation and can only be compensated by larger data sets.

Figure 2.9: Execution times of fine-grained and coarse-grained Cilk++ parallelizations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Table 2.4: Execution times in seconds using the fine-grained Cilk++ implementation on up to 24 cores for different volume resolutions.

# cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
 1         7.87             89.26              538.22
 2         4.74             42.79              295.20
 3         6.48             53.01              438.89
 4         4.85             31.50              300.80
 5         3.33             31.88              189.28
 6         2.96             18.04              137.86
 7         2.99             16.42              130.14
 8         2.67             21.09              143.92
 9         2.58             16.67              110.09
10         2.41             15.25              112.32
11         2.40             11.27               92.26
12         2.19             12.59              100.78
13         2.45             12.07               72.37
14         2.40             12.18               73.38
15         2.26             11.16               75.01
16         2.10             10.76               79.82
17         2.10              9.77               73.37
18         2.15             10.05               69.77
19         2.14             10.17               65.14
20         2.08              9.36               68.12
21         2.14              8.60               51.84
22         2.19              8.69               54.83
23         2.22              8.69               50.28
24         2.21              8.87               47.34


Table 2.5: Execution times in seconds using the coarse-grained Cilk++ implementation on up to 24 cores for different volume resolutions.

# cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
 1         6.67             57.01              487.14
 2         3.35             33.18              244.52
 3         2.31             19.67              167.74
 4         1.69             15.71              122.39
 5         1.51             12.62              108.02
 6         1.18             10.05               85.64
 7         1.06              8.89               79.09
 8         0.97              8.44               69.88
 9         0.92              7.75               66.30
10         0.88              7.30               62.76
11         0.84              7.01               59.46
12         0.60              4.99               43.60
13         0.56              5.84               50.58
14         0.55              4.61               40.58
15         0.52              4.31               37.76
16         0.51              4.24               38.34
17         0.47              3.93               36.42
18         0.47              3.99               35.65
19         0.47              3.90               34.66
20         0.47              3.61               31.53
21         0.43              3.55               32.23
22         0.43              3.58               33.10
23         0.43              3.60               35.12
24         0.43              3.58               32.37


Figure 2.10: Execution times of fine-grained and coarse-grained Cilk++ parallelizations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.11: Execution times of fine-grained and coarse-grained Cilk++ parallelizations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.12: Speedup of the fine-grained and coarse-grained Cilk++ implementations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation (fitted alpha: 2.12 % coarse-grained, 30.97 % fine-grained). [plot: speedup vs. cores]


Figure 2.13: Speedup of the fine-grained and coarse-grained Cilk++ implementations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation (fitted alpha: 1.87 % coarse-grained, 13.68 % fine-grained). [plot: speedup vs. cores]


Figure 2.14: Speedup of the fine-grained and coarse-grained Cilk++ implementations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation (fitted alpha: 2.32 % coarse-grained, 8.79 % fine-grained). [plot: speedup vs. cores]


Scalability in terms of problem size:
As seen in Figures 2.12, 2.13, and 2.14, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in Cilk++. For the small volume, 30.97 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 8.79 % for the large volume. For the coarse-grained implementation, only insignificant differences can be seen. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. While this is the case for the coarse-grained implementation, as seen in Figure 2.16, the ratio drops with an increasing number of cores for the fine-grained implementation (see Figure 2.15). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of Cilk++ is compensated by more cores.

Figure 2.15: Scalability of the fine-grained Cilk++ parallelization: shown is the execution time ratio when moving to a different volume resolution (128 → 256 and 256 → 512). [plot: execution time ratio vs. cores]

Resource awareness of framework (run-time system):
Cilk++ is, to our knowledge, not resource aware.

Resource management for multiple sub-algorithms (image pipeline):
Cilk++ does not provide support for resource management for pipelining.


Figure 2.16: Scalability of the coarse-grained Cilk++ parallelization: shown is the execution time ratio when moving to a different volume resolution (128 → 256 and 256 → 512). [plot: execution time ratio vs. cores]

Support of streaming by the framework: Cilk++ does not provide streaming support for series of images.

2.3 Threading Building Blocks

Framework

Threading Building Blocks (TBB) [Rei07] is a template library for C++ developed by Intel to parallelize programs; it is available as an open source version as well as a commercial version providing further support. Parallel code is encapsulated in special classes and invoked from the program. TBB allows task and data parallelism to be expressed. All major compilers can be used to generate binaries from TBB programs. Instead of encapsulating the parallel code in classes, lambda functions can be used to express parallelism in a more compact way. There are, however, only a few compilers supporting lambda functions of the upcoming C++0x standard. TBB also provides concurrent container classes for hash maps, vectors, or queues, as well as its own mutex and lock implementations. The run-time system of TBB schedules the tasks using a work-stealing algorithm similar to Cilk++.


    spin_mutex hist_mutex;

    parallel_for(int(0), int(height), int(1), [&](int y) {
        for (int x = 0; x < width; ++x) {
            spin_mutex::scoped_lock lock(hist_mutex);
            hist[img[y*width + x]]++;
        }
    });

Listing 2.5: Histogram generation in TBB using a lambda function and a mutex.

Parallel activities are initiated by special keywords like parallel_for, which executes an instance of a special class implementing the interface for a parallel for loop. For some keywords, like parallel_for, a lambda function can be used instead of a separate class. Calculating a histogram in parallel using a lambda function is shown in Listing 2.5. The first two parameters to the parallel_for function are the lower and upper iteration limits of the loop, followed by the loop counter increment. The last argument is the loop body to be parallelized, given as a lambda function (the outer loop itself is absent, since it is executed by TBB). To synchronize the access to the bin counters, a mutex is used.

For a more efficient implementation, thread-local memory is used to store temporary histogram results. To this end, the parallel_reduce keyword is used. For parallel_reduce, several functions have to be defined, and, hence, no lambda function can be used here. Besides the operator function for the operations executed in parallel, a split constructor and a join function are required. While the former is used to initialize data when additional threads are created, the latter is responsible for merging the results when threads finish execution. Listing 2.6 shows this class and how parallel_reduce is used to calculate the histogram.
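For the core-count measurements reported below, the number of TBB worker threads can be fixed explicitly via task_scheduler_init. The following minimal sketch uses the same compact parallel_for form as Listing 2.5; process_row() is a hypothetical placeholder for the per-row work.

    #include "tbb/parallel_for.h"
    #include "tbb/task_scheduler_init.h"

    void process_row(int y) { (void)y; /* placeholder */ }

    void run_with_threads(int nthreads, int height)
    {
        tbb::task_scheduler_init init(nthreads);    // limit the size of the worker thread pool
        tbb::parallel_for(0, height, 1, [](int y) {
            process_row(y);
        });
    }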

Evaluation

Time required to get familiar with a particular framework:
TBB requires little time to get familiar with. After one day, the main features and the concept behind TBB are thoroughly understood and the application can be parallelized using TBB.

Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added:
In order to parallelize existing code, major restructuring is required. The functionality has to be encapsulated in a separate class, which means that the code moves to a different place within the source file, or even to a different file. This separates the invocation and the implementation of parallel functionality. Almost all source code of the function to be parallelized had to be modified and separated into these two parts.


    class CalcHist {
        int *my_img;
        int my_width, my_height;
    public:
        int *my_hist;

        void operator()(const blocked_range<int>& r) {
            int width  = my_width;
            int height = my_height;
            int *img   = my_img;

            for (int y = r.begin(); y != r.end(); ++y) {
                for (int x = 0; x < width; ++x) {
                    my_hist[img[y*width + x]]++;
                }
            }
        }

        // split constructor: gives each additional thread its own temporary histogram
        CalcHist(CalcHist& other, split)
            : my_img(other.my_img), my_width(other.my_width), my_height(other.my_height),
              my_hist(new int[MAX_INTENSITY]()) {}

        // join: merge the temporary histogram of a finished range into this one
        void join(const CalcHist& other) {
            for (int i = 0; i < MAX_INTENSITY; ++i) my_hist[i] += other.my_hist[i];
        }

        CalcHist(int *img, int width, int height)
            : my_img(img), my_width(width), my_height(height),
              my_hist(new int[MAX_INTENSITY]()) {}
    };

    // ...
    CalcHist ch(img, width, height);
    parallel_reduce(blocked_range<int>(0, height), ch, auto_partitioner());
    hist = ch.my_hist;

Listing 2.6: Histogram generation in TBB using temporary thread-local histograms.

Accordingly, more time was required to parallelize the whole application using TBB compared to OpenMP and Cilk++; about twice the time was needed. Using parallel_reduce to calculate the histogram with thread-local memory was, however, more straightforward compared to the other frameworks.

Support of parallelization by the framework:
The parallelization is completely done by the framework; only the parallel functionality has to be encapsulated in a separate class.

Support of data partitioning by the framework:
The main data partitioning is done by the user, that is, the loops iterating over the data set are not altered. The assignment of loop iterations to threads, however, can be specified by the user. The parallel_for and parallel_reduce functions expect a partitioner as last argument. Instead of the auto_partitioner used in the examples, the simple_partitioner or the affinity_partitioner can also be used. The simple_partitioner can be used to specify a grainsize similar to Cilk++, and the affinity_partitioner tries to optimize the grainsize such that the cache is utilized best.

Applicability for different algorithm classes and target platforms:
TBB supports fine-grained data parallelism as well as coarse-grained task parallelism.

Advantages, drawbacks, and difficulties of a particular framework:
Compared to OpenMP and Cilk++, TBB is a library approach that integrates smoothly into existing object-oriented code. This comes, however, at the cost of additional source code. Encapsulating the parallel functionality in a class comes along with a lot of boilerplate code for initializing class-private data in the constructor and assigning these variables again to local variables in the operator of the class. For the parallel_for construct, this overhead can be reduced using lambda functions, which are, however, only supported by some compilers. One lead developer of TBB at Intel acknowledged that the syntax overhead of TBB is huge compared to other frameworks like Cilk++ and that the source code does not look neat, even with the new lambda syntax (A. Kukanov, personal communication).

Effort to get a certain result, for example, performance or throughput:
The effort to get a reasonable speedup is moderate. Although the code has to be encapsulated in a class, the way a function works is preserved.

Scalability of the framework with respect to architecture family, new hardware:
Figures 2.17, 2.18, and 2.19 show the execution times of the fine-grained and coarse-grained TBB implementations on up to 24 cores for different volume and image resolutions. The exact times are additionally listed in Table 2.6 for the fine-grained implementation and in Table 2.7 for the coarse-grained implementation. The graphs show that for both implementations the execution time using one core is only slightly slower compared to the baseline implementation. The saturation point, where adding further cores yields no improvement, is reached at roughly the same core count for both implementations. While saturation is reached at about 3.9 s for the coarse-grained implementation, only about 5.7 s can be obtained using the fine-grained implementation for the middle resolution.


Plotting the speedup of both implementations for the different volume and image resolutions in Figures 2.20, 2.21, and 2.22 shows that the speedups obtained by the fine-grained implementation are far below those of the coarse-grained implementation, although they improve with the problem size. Using Amdahl's law to determine the sequential part of each implementation results in an alpha of roughly 3 % (2.85, 2.58, and 3.90) for the coarse-grained implementation, while the alpha for the fine-grained implementation is much higher and fluctuates more (17.93, 6.48, and 4.04). This shows that the parallelization overhead of TBB is much higher for the fine-grained implementation and can only be compensated by larger data sets.

Figure 2.17: Execution times of fine-grained and coarse-grained TBB parallelizations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Table 2.6: Execution times in seconds using the fine-grained TBB implementation on up to 24 cores for different volume resolutions.

  # cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  1         7.23             63.28             561.25
  2         4.24             33.71             297.89
  3         3.13             22.84             198.09
  4         2.61             17.80             152.47
  5         2.32             14.62             122.29
  6         2.17             13.00             106.97
  7         1.95             11.50             92.08
  8         1.85             12.38             81.44
  9         2.00             9.89              76.65
  10        1.66             9.07              67.28
  11        1.61             8.53              65.50
  12        1.57             8.27              61.29
  13        1.54             7.85              57.39
  14        1.49             7.38              53.35
  15        1.46             7.04              50.71
  16        1.47             6.78              47.03
  17        1.44             6.64              46.36
  18        1.43             6.53              45.44
  19        1.42             6.30              43.14
  20        1.42             6.16              40.77
  21        1.47             5.98              39.72
  22        1.40             5.87              38.56
  23        1.41             5.76              37.62
  24        1.39             5.70              37.19


Table 2.7: Execution times in seconds using the coarse-grained TBB implementation on up to 24 cores for different volume resolutions.

  # cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  1         7.05             62.73             559.35
  2         3.53             31.49             281.66
  3         2.39             23.67             200.06
  4         1.78             15.75             142.72
  5         1.59             15.60             139.08
  6         1.24             11.00             100.21
  7         1.11             9.77              88.46
  8         1.03             8.96              81.29
  9         0.98             8.51              76.18
  10        0.97             8.10              69.23
  11        0.88             7.71              68.51
  12        0.63             5.43              51.16
  13        0.73             5.42              51.10
  14        0.59             5.08              51.02
  15        0.55             4.63              51.46
  16        0.55             4.66              43.91
  17        0.52             4.27              51.56
  18        0.52             4.63              44.40
  19        0.51             4.26              41.91
  20        0.52             4.25              40.66
  21        0.48             3.87              40.62
  22        0.47             3.90              37.00
  23        0.48             3.87              37.94
  24        0.46             3.87              37.00


Figure 2.18: Execution times of fine-grained and coarse-grained TBB parallelizations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.19: Execution times of fine-grained and coarse-grained TBB parallelizations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Figure 2.20: Speedup of fine-grained and coarse-grained TBB implementation on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.21: Speedup of fine-grained and coarse-grained TBB implementation on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.22: Speedup of fine-grained and coarse-grained TBB implementation on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Scalability in terms of problem size: As seen in Figures 2.20, 2.21, and 2.22, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in TBB. For the small volume, 17.93 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 4.04 % for the large volume. For the coarse-grained implementation only insignificant differences can be seen. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. While this is the case for the coarse-grained implementation, as seen in Figure 2.24, the ratio drops with increasing number of cores for the fine-grained implementation (see Figure 2.23). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of TBB is compensated by more cores.

Figure 2.23: Scalability of fine-grained TBB parallelization: Shown is the execution time ratio when moving to a different volume resolution.

Resource awareness of framework (run-time system): TBB is to our knowledge not resource aware. Resource management for multiple sub-algorithms (image pipeline): TBB does provide support for resource management for pipelining. It provides a


Figure 2.24: Scalability of coarse-grained TBB parallelization: Shown is the execution time ratio when moving to a different volume resolution.

pipeline and a filter class. Each pipeline stage is represented by one filter, and for each stage one filter is added to the pipeline; the order in which the filters are added corresponds to the order of the pipeline stages. Non-linear pipelines are not supported and dependencies between pipeline stages have to be resolved by the programmer, that is, parallel pipeline stages have to be added to the pipeline class in a sequentialized order. When a pipeline is executed, each filter can be executed in parallel on different chunks of the input data, according to the corresponding filter configuration. The number of data chunks being processed by the pipeline at the same time can be limited. In addition to parallel processing of data chunks in the same pipeline stage, different pipeline stages can be active at the same time, too.
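A minimal sketch of this pipeline and filter usage (not taken from the evaluated code base; the two stages, the line count of 16, and the token limit of 4 are illustrative assumptions, and the classic tbb::pipeline interface of TBB 2.x is assumed):

#include <cstdio>
#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>

// first stage (serial, in order): hands out one image line index per token
class LineSource : public tbb::filter {
    int next_line;
    int num_lines;
public:
    LineSource(int lines)
        : tbb::filter(tbb::filter::serial_in_order), next_line(0), num_lines(lines) {}
    void *operator()(void *) {
        if (next_line >= num_lines) return NULL;   // NULL ends the pipeline
        ++next_line;
        return reinterpret_cast<void *>(static_cast<size_t>(next_line));
    }
};

// second stage (parallel): several lines can be processed at the same time
class LineWorker : public tbb::filter {
public:
    LineWorker() : tbb::filter(tbb::filter::parallel) {}
    void *operator()(void *item) {
        int line = static_cast<int>(reinterpret_cast<size_t>(item));
        std::printf("processing line %d\n", line);   // placeholder for real work
        return NULL;
    }
};

int main() {
    tbb::task_scheduler_init init;
    tbb::pipeline pipe;
    LineSource source(16);
    LineWorker worker;
    pipe.add_filter(source);   // filters are added in pipeline-stage order
    pipe.add_filter(worker);
    pipe.run(4);               // at most 4 data chunks (tokens) in flight
    pipe.clear();
    return 0;
}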

Support of streaming by the framework: TBB does not provide streaming support for series of images.


2.4 RapidMind

Framework
RapidMind [Rap09] is a commercial solution that emerged from a high-level programming language for graphics cards called Sh [MDT04] and was recently acquired by Intel. While Sh originally targeted only graphics cards, RapidMind takes a data-parallel approach that maps well onto many-core hardware as well as onto standard shared memory multi-core architectures and the Cell Broadband Engine. RapidMind programs follow the single program, multiple data (SPMD) paradigm, where the same function is applied in a data-parallel fashion to all elements of a large data set. These programs use their own language and data types, are compiled at run-time, and are called from standard C/C++ code. Through dynamic compilation, the code can be optimized for the underlying hardware at run-time, and the same code can be executed on different back ends like standard multi-core processors and graphics cards. All this functionality is provided by libraries and works with the standard compilers on Windows, GNU Linux, and Mac OS X. RapidMind programs are by design free of deadlocks and race conditions. This is possible since changes to the output variables of a RapidMind program are only visible after the program has finished and only regular writes to the output data are allowed.
Inasmuch as RapidMind was originally targeted at graphics cards, the functions executed in parallel are small computational kernels called Programs. A RapidMind program contains the operations that are applied to all elements of large data sets, that is, the program contains no loop iterating over the iteration space, but applies the operations automatically to each element. Variables that are passed to a RapidMind program are declared by the In keyword, output variables correspondingly by the Out keyword. Inside a RapidMind program and in the normal program code, RapidMind data types have to be used, like Value1i for a single integer. Also the control flow within a RapidMind program is defined by dedicated keywords like FOR and IF; the end of a control flow section is marked by the corresponding keyword like ENDFOR or ENDIF. A simple program to square all data elements of a set is defined as follows:

Program square = BEGIN {
    In<Value1i> in_data;
    Out<Value1i> out_data;

    out_data = in_data * in_data;
} END;

To apply a RapidMind program to a set of data, the data has to be stored in RapidMind Arrays in the C/C++ part of the source code. To access the elements of arrays, RapidMind provides functions that return a pointer to the memory associated with the array. Alternatively, a RapidMind array can be bound to existing memory. The array can afterwards be passed as parameter to the square RapidMind program defined above, and the output of the RapidMind program is stored to a second array.

If the size of an array is not specified, it is determined at run-time by the RapidMind run-time environment:

Array<1, Value1i> input(10000);
Array<1, Value1i> output;

// get access to the data contained in input
int *input_data = input.write_data();

// initialize input data
...
output = square(input);

For reductions, RapidMind provides predefined collective operations, for example to calculate the sum of all elements in a set or to get the maximum of all elements.
Implementing regular data-parallel programs in RapidMind is straightforward, since only the operation on one single data element has to be defined; the rest is handled by the RapidMind run-time environment. However, irregular problems require a different approach in RapidMind. For example, calculating the histogram is not regular, since the bin depends on the pixel value and is, hence, random. Also, the number of input and output elements has to match. Therefore, the histogram resolution determines the degree of parallelism: we use one thread per histogram bin and iterate for each bin sequentially over the complete image to calculate the bin count. Listing 2.7 shows the implementation of this approach in RapidMind. The grid function provides a virtual array of contiguous integers, which in our case corresponds to the image intensity for the histogram calculation. This implementation, however, suffers from low data reuse since the complete image has to be read to calculate a single histogram bin value. To increase data reuse, we use thread-local histograms in the RapidMind program in order to read the image only once. As output of the RapidMind program a complete array is used. Therefore, the output array is partitioned into sub-arrays using the dice function. To use more than one thread, one histogram per image line is calculated. In a second step, these histograms are merged into the final histogram. The implementation of this approach in RapidMind is shown in Listing 2.8.

Evaluation
Time required to get familiar with a particular framework: RapidMind requires little time to get familiar with, although it provides its own language and a different work flow than the previously investigated frameworks. After a few days the main features and the concept behind RapidMind are well understood and the application can be parallelized using RapidMind. However, some essential functions are missing from the RapidMind documentation, for example, how to read the value of one single RapidMind data type. Also, there is no support available through an online community as for the other frameworks. Even though


Value1i rm_height = height;
Value1i rm_width = width;
Array<1, Value1i> hist(MAX_INTENSITY);
Array<2, Value1i> rm_img(width, height);

Program rm_hist = BEGIN {
    In<Value1i> intensity;
    Out<Value1i> out;
    Value1i count = 0;

    // iterate over the whole image and count the pixels matching this bin's intensity
    FOR (Value1i y = 0, y < rm_height, y++)
        FOR (Value1i x = 0, x < rm_width, x++)
            IF (rm_img[Value2i(x, y)] == intensity)
                count += 1;
            ENDIF
        ENDFOR
    ENDFOR
    out = count;
} END;

// grid() supplies the contiguous intensity values 0 .. MAX_INTENSITY-1
hist = rm_hist(grid(MAX_INTENSITY));

Listing 2.7: Histogram calculation in RapidMind using one thread per histogram bin.

our RapidMind license includes technical support, the responsiveness of the support team leaves a lot to be desired.
Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added: In order to parallelize existing code, almost the complete source code has to be restructured and rewritten. The kernels have to be written in a separate language, and RapidMind-specific data types have to be used within RapidMind programs. Similar to TBB, the invocation and implementation of parallel functionality are separated. However, in contrast to TBB, RapidMind programs define only the operation on a single element, which means that the code in a RapidMind program also differs from the original source code. The guarantee of deadlock- and race-condition-free RapidMind programs comes at the cost of only being able to express regular problems in RapidMind. Solving irregular problems, like the histogram example, requires considerable effort to come up with a proper solution. Mapping the complete application to RapidMind took more than one week, and getting correct results as well as acceptable execution times took a second week.
Support of parallelization by the framework: The parallelization is completely done by the framework; only the operations that should be applied to each data element in a set have to be defined.
Support of data partitioning by the framework: The data partitioning is done by the user, that is, the operations within a RapidMind program define the data partition that is assigned to one core. This is typically one data element.


/* temporary array storing one histogram per image line */
Array<1, Value1i> hist_tmp(MAX_INTENSITY * height);
Array<1, ArrayAccessor<1, Value1i> > hist_tmp_accessor = dice(hist_tmp, MAX_INTENSITY);
Array<1, Value1i> hist(MAX_INTENSITY);
Array<2, Value1i> rm_img(width, height);

/* calculate one histogram per image line */
Program rm_hist_diced = BEGIN {
    In<Value1i> line;
    Out<Array<1, Value1i> > res(MAX_INTENSITY);

    FOR (Value1i i = 0, i < MAX_INTENSITY, i++)
        res[i] = 0;
    ENDFOR
    // accumulate the pixels of this image line into the line-local histogram
    FOR (Value1i x = 0, x < rm_width, x++)
        res[rm_img[Value2i(x, line)]] += 1;
    ENDFOR
} END;

/* merge the per-line histograms into the final histogram */
Program rm_hist_merge = BEGIN {
    In<Value1i> pos;
    Out<Value1i> res;

    res = 0;
    FOR (Value1i i = 0, i < rm_height, i++)
        res += hist_tmp_accessor[i][pos];
    ENDFOR
} END;

hist_tmp_accessor = rm_hist_diced(grid(height));
hist = rm_hist_merge(grid(MAX_INTENSITY));

Listing 2.8: Histogram calculation in RapidMind using thread-local histograms, one per image line.

However, the assignment of single or multiple data partitions to the cores is completely handled by RapidMind. The user has no influence on the granularity, apart from adjusting it by hand within a RapidMind program.
Applicability for different algorithm classes and target platforms: RapidMind supports fine-grained data parallelism, but no coarse-grained task parallelism. Our attempt to implement a coarse-grained version failed because the RapidMind program got too big to remain maintainable and debuggable.
Advantages, drawbacks, and difficulties of a particular framework: Like TBB, RapidMind is a library approach; no additional compiler is required. Hence, RapidMind integrates smoothly into existing frameworks. However, compilation takes considerably longer compared to normal source compilation: linking and compiling the 2D/3D image registration against the RapidMind libraries takes more than 22 seconds, while normal compilation takes less than 2 seconds. This disrupts the development process to some extent. Even more disturbing is the fact that not all errors are discovered when compiling the C/C++ sources. Some errors are only detected when the RapidMind programs are compiled just-in-time by the RapidMind run-time environment and executed. For these errors, neither the line where the error occurred, nor the file, nor the name of the RapidMind program was reported. RapidMind programs are just-in-time compiled every time they are used, unless an instance of that program is kept over the different evaluation steps of the 2D/3D image registration. Besides the additional time required for recompiling the RapidMind programs for every evaluation, the generated code did not yield the same results for every iteration. Hence, one instance of every RapidMind program was generated once and kept afterwards for the entire 2D/3D image registration.
The way RapidMind programs are written and called matches precisely the needs of image processing. The RapidMind source code is concise, easy to read, and comprehensible. Compared to normal C/C++, little to no overhead in terms of lines of code is needed to express the functionality of kernels operating on images. Border handling for a 2D array containing an image is done by setting the boundary mode of the array; there are several predefined boundary modes like clamp, repeat, or constant [Rap09]. Also, for accessing neighboring pixels within a RapidMind program, accessors are available to determine the neighbor relative to the current pixel. No complex index calculations are required.
Effort to get a certain result, for example, performance or throughput: The effort to get a reasonable speedup is huge. Reasonable speedups are only obtained when the RapidMind programs are just-in-time compiled only once, during program initialization. For example, there were some problems passing the pointer of a variable from one RapidMind program to another RapidMind program creating the 2D projection of the volume. Using a pointer, that function took about 43 s to execute and the output was wrong; using a reference instead of the pointer, it took only 0.6 s to execute and the output was correct.
Scalability of the framework with respect to architecture family, new hardware:

Figures 2.25, 2.26, and 2.27 show the execution times of the fine-grained RapidMind implementation on up to 24 cores for different volume and image resolutions. The exact times are in addition listed in Table 2.8. The graphs show that for the fine-grained implementation nine cores are required for the small volume and three cores for the other volumes to compensate the parallelization overhead and to catch up with the reference implementation. The saturation point, where adding further cores yields no improvement, is reached at about 12.5 s for the middle resolution. Plotting the speedup for the different volume and image resolutions in Figures 2.28, 2.29, and 2.30 shows that the speedups obtained by the fine-grained implementation are far below the optimal speedup, but improve with the problem size. Using Amdahl's law to determine the sequential part of the implementation results in a high alpha that fluctuates (105.66 %, 22.31 %, and 8.82 %). This shows that RapidMind has the highest parallelization overhead so far for the fine-grained implementation.

Figure 2.25: Execution times of fine-grained RapidMind parallelizations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Table 2.8: Execution times in seconds using the fine-grained RapidMind implementation on up to 24 cores for different volume resolutions.

  # cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  1         24.06            141.54            968.07
  2         15.32            74.43             502.86
  3         11.66            51.51             338.07
  4         9.91             39.90             256.36
  5         8.18             34.30             210.10
  6         7.50             28.49             172.36
  7         7.07             25.45             152.35
  8         6.71             22.20             130.13
  9         6.30             20.35             117.13
  10        6.13             18.75             107.20
  11        6.33             17.75             96.41
  12        6.18             17.35             89.54
  13        6.12             16.21             84.12
  14        6.11             15.60             78.86
  15        5.77             15.61             74.04
  16        6.12             14.54             68.49
  17        6.07             14.26             76.17
  18        5.82             13.47             62.60
  19        6.48             13.43             60.49
  20        6.21             12.96             57.06
  21        6.52             13.75             63.89
  22        6.75             12.35             53.35
  23        7.46             12.28             51.79
  24        7.73             13.21             50.25


Figure 2.26: Execution times of fine-grained RapidMind parallelizations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.27: Execution times of fine-grained RapidMind parallelizations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Figure 2.28: Speedup of fine-grained RapidMind implementation on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.29: Speedup of fine-grained RapidMind implementation on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.30: Speedup of fine-grained RapidMind implementation on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Scalability in terms of problem size: As seen in Figures 2.28, 2.29, and 2.30, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in RapidMind. For the small volume, 105.66 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 8.82 % for the large volume. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Here, the ratio is smaller and drops further with increasing number of cores for the fine-grained implementation (see Figure 2.31). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of RapidMind is compensated by more cores.

Figure 2.31: Scalability of fine-grained RapidMind parallelization: Shown is the execution time ratio when moving to a different volume resolution.

Resource awareness of framework (run-time system): RapidMind for multi-core processors is to our knowledge not resource aware.
Resource management for multiple sub-algorithms (image pipeline): RapidMind does provide support for resource management for pipelining. It allows the results of one RapidMind program to be pipelined directly into the next RapidMind program. For example, to execute three RapidMind programs, program_1, program_2, and program_3 one after the other, the following syntax can be used:

output = (program_3 << program_2 << program_1)(input);

However, there is no parallel execution of different pipeline stages. Only after one pipeline stage has completely finished are the data passed to the next pipeline stage.
Support of streaming by the framework: RapidMind does not provide streaming support for series of images.

2.5 OpenCL

Framework
The Open Computing Language (OpenCL) is a standard for programming heterogeneous parallel platforms [Mun09]. OpenCL was initiated by Apple and is created and maintained by the Khronos Group. OpenCL is a platform-independent specification for parallel programming, like OpenGL is for graphics programming. OpenCL currently supports programming of standard multi-core processors, graphics cards, and the Cell Broadband Engine, with planned support for other accelerators like DSPs. OpenCL allows data and task parallelism to be expressed. The functionality is provided by a library, and OpenCL programs are just-in-time compiled by the run-time environment as in RapidMind. The kernels are stored in strings, just like in OpenGL. It is also possible to share resources between OpenCL and OpenGL. The OpenCL standard is implemented by hardware vendors like AMD, IBM, and NVIDIA, but also by operating systems like Apple's Mac OS X. All major compilers can be used to link against the OpenCL library.
As in RapidMind, kernels are written for one data element of a large data set, and this kernel is applied to each element of the data set by the OpenCL run-time environment. Kernels are defined by the __kernel keyword in OpenCL, and different memory spaces are distinguished: global memory visible to all threads is annotated by the __global keyword, local memory visible only to the currently executing processor by the __local keyword. Unlike in RapidMind, the index of the current thread has to be calculated explicitly, and also the data has to be retrieved and stored manually. Therefore, built-in commands like get_global_id are available to retrieve the index of the current thread within the current iteration space.
To calculate a histogram in OpenCL, we use the same approach as in RapidMind: one histogram is calculated per image line and these histograms are afterwards merged into the final histogram. The implementation of the OpenCL kernels is shown in Listing 2.9. This source code is stored in a string and compiled at run-time. Before the code can be compiled, a device has to be chosen for execution. This can be done by the clGetDeviceIDs command, using CL_DEVICE_TYPE_CPU as parameter to select a CPU as target platform. Similarly, a graphics card can be requested as device using CL_DEVICE_TYPE_GPU:


__kernel void ocl_hist(__global int *hist, __global float *img, int width) {
    int i = get_global_id(0);   /* histogram bin (intensity value) */
    int y = get_global_id(1);   /* image line */

    int count = 0;
    for (int k = 0; k < width; k++) {
        if ((int) img[k + y * width] == i) count++;
    }
    hist[y * MAX_INTENSITY + i] = count;
}

__kernel void ocl_hist_merge(__global int *hist, __global int *tmp_hist, int height) {
    int i = get_global_id(0);

    int count = 0;
    for (int k = 0; k < height; k++) {
        count += tmp_hist[k * MAX_INTENSITY + i];
    }
    hist[i] = count;
}

Listing 2.9: OpenCL kernels for the histogram calculation (one histogram per image line, merged in a second step).

// get CPU device
clGetDeviceIDs(cl_platform_id, CL_DEVICE_TYPE_CPU, 1, &data.cl_device, NULL);
...
// create OpenCL kernel for ocl_hist
cl_kernel_hist = clCreateKernel(cl_program, "ocl_hist", &cl_err);

After selecting a device, further initialization has to be done, like context creation, command queue creation, OpenCL program object generation from the source code string, compilation of the OpenCL kernels, and, eventually, creation of each OpenCL kernel. Before executing an OpenCL kernel, each parameter of the kernel has to be set using the clSetKernelArg function. The OpenCL kernel passed as argument to this function is associated with the kernel name in the source code string. For example, in Listing 2.10 the OpenCL kernel object cl_kernel_hist refers to the kernel ocl_hist from Listing 2.9. The execution of the kernel is initiated by the clEnqueueNDRangeKernel command, and the iteration space is defined by global_work_size. In this case a two-dimensional workspace of MAX_INTENSITY × height is used.
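Before looking at the kernel launch in Listing 2.10, the one-time setup steps listed above might look as follows; this is a minimal sketch, not the code of the evaluated implementation, and the helper name setup_hist_kernel as well as the assumption that kernel_source already holds the kernel string of Listing 2.9 are illustrative:

#include <CL/cl.h>

/* Hypothetical helper: performs the one-time setup steps described above. */
static cl_kernel setup_hist_kernel(const char *kernel_source,
                                   cl_context *context_out,
                                   cl_command_queue *queue_out)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    /* context and command queue for the selected device */
    *context_out = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    *queue_out   = clCreateCommandQueue(*context_out, device, 0, &err);

    /* program object from the source string, just-in-time compilation, kernel object */
    cl_program program = clCreateProgramWithSource(*context_out, 1, &kernel_source, NULL, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    return clCreateKernel(program, "ocl_hist", &err);
}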


size_t global_work_size[2];
global_work_size[0] = MAX_INTENSITY;
global_work_size[1] = height;

// set parameters for histogram calculation
clSetKernelArg(cl_kernel_hist, 0, sizeof(cl_mem), &cl_tmp_hist);
clSetKernelArg(cl_kernel_hist, 1, sizeof(cl_mem), &cl_img);
clSetKernelArg(cl_kernel_hist, 2, sizeof(int), &width);

// execute kernel over entire range of the data set
clEnqueueNDRangeKernel(cl_queue, cl_kernel_hist, 2, NULL, global_work_size, NULL, 0, NULL, NULL);

/* merge histogram */
global_work_size[1] = 1;

/* set parameters */
clSetKernelArg(cl_kernel_merge_hist, 0, sizeof(cl_mem), &cl_hist);
clSetKernelArg(cl_kernel_merge_hist, 1, sizeof(cl_mem), &cl_tmp_hist);
clSetKernelArg(cl_kernel_merge_hist, 2, sizeof(int), &height);

/* execute kernel */
clEnqueueNDRangeKernel(cl_queue, cl_kernel_merge_hist, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

Listing 2.10: OpenCL kernel launch for histogram calculation.

Evaluation

Time required to get familiar with a particular framework: OpenCL requires some time to get familiar with the terminology used and to understand the way the OpenCL platform works. For developers who already know CUDA, however, getting familiar with OpenCL is not difficult, since mainly the terminology changes. For multi-core hardware, currently two vendors provide OpenCL, namely Apple and AMD. Getting a simple hello world example to work takes almost the same time as the complete parallelization in OpenMP, because a smörgåsbord of different initialization steps is required before the actual OpenCL code can be executed, and each vendor's implementation has its own entry barriers.
Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added: In order to parallelize existing code, almost the complete source code has to be restructured and rewritten. The kernels are written in a language based on C99 and have to be stored in strings. Since using strings to store source code is not practical, boilerplate code has to be added in order to load the OpenCL source code from a file. Similar to RapidMind, the invocation and implementation of parallel functionality are separated, and the kernel code operates only on one data element. The time to map the complete application to OpenCL was about two weeks, due to the bugs and errors in the OpenCL implementation provided by AMD.


Support of parallelization by the framework: The parallelization is completely done by the framework; only the operations that should be applied to each data element in a set have to be defined.
Support of data partitioning by the framework: The data partitioning is done by the user, that is, the operations within an OpenCL kernel define the data partition that is assigned to one core. This is typically one data element. The tiling of the data set into partitions is also done by the framework, but can be influenced by the user as well by specifying the one- or two-dimensional size of the tiles.
Applicability for different algorithm classes and target platforms: OpenCL supports fine-grained data parallelism as well as coarse-grained task parallelism. However, the attempt to implement a coarse-grained version failed due to the unsound OpenCL implementation provided by AMD: the coarse-grained 2D/3D image registration hangs in the first kernel invocation if debugging is disabled.
Advantages, drawbacks, and difficulties of a particular framework: The biggest advantage of OpenCL is its potential to provide support for heterogeneous parallel platforms ranging from standard multi-core processors to graphics cards. The OpenCL standard is maintained by an independent consortium and receives a fair amount of interest from many sides. However, the current implementation provided by AMD for standard multi-core systems is far from being usable in deployment systems. The developer has to fight with an immature OpenCL implementation (ATI Stream SDK 2.0.1). This crops up in the form of random segmentation faults of the internal just-in-time compiler and in incorrect code generated by the just-in-time compiler. The latter issue was solved for our 2D/3D implementation with the latest update from AMD, but the former problem still exists. A further issue with the current OpenCL framework (ATI Stream SDK 2.1) is that the number and the types of the parameters of an OpenCL kernel are not checked by the compiler. This may lead to errors only at run-time, although not necessarily. More details on the evaluation of OpenCL on many-core graphics cards and its implications are given in Section 3.3.
Effort to get a certain result, for example, performance or throughput: The effort to get a reasonable speedup on the CPU is surprisingly low, once a working OpenCL implementation and boilerplate code for initialization are at hand. Even without further device-specific optimizations, a naïve implementation results in appealing speedups.
Scalability of the framework with respect to architecture family, new hardware: Figures 2.32, 2.33, and 2.34 show the execution times of the fine-grained OpenCL implementation on up to 16 cores for different volume and image resolutions. The exact times are in addition listed in Table 2.9. The graphs show that for the smallest volume no speedup can be achieved. For larger volumes, however, the fine-grained implementation running even on only a single core is faster than the baseline. This may be due to the optimizations performed by the just-in-time compiler of OpenCL.


The saturation point, where adding further cores yields no improvement, is reached at about 9.5 s for the middle resolution. Plotting the speedup for the different volume and image resolutions in Figures 2.35, 2.36, and 2.37 shows that the speedups obtained by the fine-grained implementation are far below the optimal speedup, but improve with the problem size. Using Amdahl's law to determine the sequential part of the implementation results in a high alpha that fluctuates (89.88 %, 15.92 %, and 3.51 %). Also for OpenCL, the fine-grained implementation has a high parallelization overhead.

Figure 2.32: Execution times of fine-grained OpenCL parallelizations on up to 16 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.33: Execution times of fine-grained OpenCL parallelizations on up to 16 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.34: Execution times of fine-grained OpenCL parallelizations on up to 16 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Table 2.9: Execution times in seconds using the fine-grained OpenCL implementation on up to 16 cores for different volume resolutions.

  # cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  1         8.11             43.92             312.71
  2         7.49             38.68             276.10
  3         5.91             27.12             189.26
  4         5.00             20.83             141.34
  5         5.42             18.15             115.41
  6         5.42             15.68             96.13
  7         5.60             13.99             83.70
  8         5.17             12.70             73.86
  9         6.30             13.41             67.71
  10        6.01             13.25             61.53
  11        6.31             12.81             57.40
  12        4.80             10.63             50.56
  13        5.00             10.55             47.60
  14        5.16             9.95              44.37
  15        4.71             9.58              41.78
  16        4.68             9.38              38.82


Figure 2.35: Speedup of fine-grained OpenCL implementation on up to 16 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.36: Speedup of fine-grained OpenCL implementation on up to 16 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.37: Speedup of fine-grained OpenCL implementation on up to 16 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Scalability in terms of problem size: As seen in Figures 2.35, 2.36, and 2.37, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in OpenCL. For the small volume, 89.88 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 3.51 % for the large volume. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Here, the ratio is smaller and drops further with increasing number of cores for the fine-grained implementation (see Figure 2.38). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of OpenCL is compensated by more cores.

Figure 2.38: Scalability of fine-grained OpenCL parallelization: Shown is the execution time ratio when moving to a different volume resolution.

Resource awareness of framework (run-time system): OpenCL for multi-core processors is to our knowledge not resource aware. Resource management for multiple sub-algorithms (image pipeline): OpenCL does not provide support for resource management for pipelining. Support of streaming by the framework: OpenCL does not provide streaming support for series of images.


2.6 Discussion

To summarize the results of the frameworks, the achieved performance of the multi-core frameworks is compared.
Figures 2.39, 2.40, and 2.41 show the execution times of the fine-grained implementation of each framework on up to 24 cores for different volume and image resolutions. The times of the OpenCL implementation have been aligned to the baseline execution times of the other frameworks. It can be seen that in particular OpenMP and RapidMind have an enormous initialization overhead for the small volume. For a single multi-core processor with four cores, even the execution time of the baseline implementation is not met for the small volume. With increasing number of cores, however, these frameworks scale well and catch up with the performance of the other frameworks, notably OpenMP. The framework that scales best and achieves the best results for all problem sizes is TBB. Cilk++ scales just as well, but takes slightly longer. OpenCL also shows good performance, though only for the middle and big volumes.

Figure 2.39: Execution times of all parallelization frameworks for the fine-grained implementation on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.40: Execution times of all parallelization frameworks for the fine-grained implementation on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.41: Execution times of all parallelization frameworks for the fine-grained implementation on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Figures 2.42, 2.43, and 2.44 show the execution times of the coarse-grained implementation of each framework on up to 24 cores for different volume and image resolutions. There are no implementations for RapidMind and OpenCL, because it was too time-consuming and troublesome to get a correct implementation that runs. The graphs show that there are no major differences between OpenMP, Cilk++, and TBB for the coarse-grained implementation. All frameworks scale well and achieve good performance.

Figure 2.42: Execution times of all parallelization frameworks for the coarse-grained implementation on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.43: Execution times of all parallelization frameworks for the coarse-grained implementation on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.44: Execution times of all parallelization frameworks for the coarse-grained implementation on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


In summary, the best scaling framework for all volume sizes is TBB, followed by Cilk++ and OpenMP. While TBB requires parallel functionality to be encapsulated in special classes, Cilk++ and OpenMP require almost no change to existing source code. OpenCL has the potential to catch up with these frameworks once the OpenCL library is mature and coarse-grained implementations can be realized without the problems of the current implementation. One major advantage of Cilk++ is the two provided tools, the race detector and the parallel performance analyzer. RapidMind programs are concise, easy to read, and comprehensible; unfortunately, the achieved performance scales only for large problems.

3 Many-Core Frameworks

In this chapter, frameworks for programming many-core accelerators such as graphics cards are evaluated. The frameworks we considered are to a large extent those frameworks relevant to and used in industry.
For the evaluation several graphics cards are used. On the one hand, NVIDIA graphics cards are used for most parts, since they are supported by most of the frameworks: of the current generation a high-end Quadro FX 5800 with 240 streaming processors is used, as well as a high-end Tesla C 2050 with 448 streaming processors of the follow-up Fermi architecture. On the other hand, a Radeon HD 5870 with 1600 streaming processors from ATI is used. Table 3.1 lists the compiler and framework versions used for each of the investigated frameworks.

Table 3.1: Compiler version and framework version used for evaluation.

  Framework         Version
  RapidMind         gcc/4.4.2, RapidMind 4.0.1
  PGI Accelerator   pgcc/10.4
  OpenCL            gcc/4.4.2, OpenCL 1.0, Stream 2.0.1/2.1, CUDA 3.0
  CUDA              gcc/4.3.4, CUDA 3.0

3.1 RapidMind

Framework
A detailed characterization and elaborate description of RapidMind is given in Section 2.4. Only the differences when using RapidMind for graphics cards are covered here.
RapidMind provides several back ends to generate code for. These are, for example, debug for debugging, cuda to generate code for CUDA-enabled devices, or glsl for OpenGL-enabled devices. By default, RapidMind chooses the most eligible back end. To choose a particular back end, the set_backend command can be used, for example to select the CUDA back end. The following command is sufficient to generate code for graphics cards from the same RapidMind program as in Section 2.4, using the CUDA back end:

set_backend("cuda");


Table 3.2: Execution times in seconds using the CUDA back end of RapidMind on the Quadro FX 5800 and Tesla C 2050.

  # cores        128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  240 (Quadro)   5.82             13.90             51.08
  448 (Tesla)    7.22             9.69              22.63

Evaluation
Since RapidMind was already evaluated in Section 2.4, only those criteria are listed that differ from the previous evaluation.
Advantages, drawbacks, and difficulties of a particular framework:

After changing the back end to CUDA, the application crashed with a segmentation fault while executing a RapidMind program. As it turned out, some RapidMind variables were not initialized by default in the CUDA back end, while they had been initialized with zero by the previously used back end. Afterwards, the program ran without errors. Using the OpenGL back end, no local arrays are allowed and, hence, the optimized histogram implementation described in Section 2.4 is not supported. Using the implementation without local arrays yielded incorrect results on the NVIDIA card, while the program hung on an ATI card. In summary, the CUDA back end worked well, while the OpenGL back end puts more restrictions on the programs and did not work as expected.
Scalability of the framework with respect to architecture family, new hardware:

Table 3.2 shows the execution times of the fine-grained implementation using the CUDA back end of RapidMind on the Quadro FX 5800 and Tesla C 2050 for different volume and image resolutions, and Figure 3.1 shows the corresponding speedup compared to the baseline implementation. It can be seen that the CUDA back end of RapidMind does not scale well on the new Tesla C 2050 for small problem sizes and even takes longer to execute. For big problem sizes, the implementation scales disproportionately well with the number of cores on the new architecture, without any change to the source code.
Scalability in terms of problem size: As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Figure 3.2 shows the execution time ratio when moving to a bigger volume resolution. It can be seen that for both graphics cards the ratio is far below 8. Hence, the implementation scales well for larger problem sizes. However, this is mainly due to the high degree of parallelism required to utilize the graphics hardware best.
Resource awareness of framework (run-time system): RapidMind for many-core processors is not resource aware. Kernels executed on graphics hardware occupy the processors exclusively.


Figure 3.1: Speedup of the RapidMind implementation on the CUDA back end compared to the baseline implementation for different volume resolutions.


Figure 3.2: Scalability of fine-grained implementation using the CUDA back end of RapidMind: Shown is the execution time ratio when moving to a different volume resolution.

3.2 PGI Accelerator

Framework

The PGI Accelerator model from the Portland Group is a directive-based high-level programming model for accelerators such as graphics cards [Wol10]. Its design is similar to OpenMP, and it is supported by the C and Fortran compilers from the Portland Group. These compilers are commercially available for Windows, GNU Linux, and Mac OS X and can be obtained temporarily for evaluation. The PGI Accelerator preprocessor directives define accelerator regions that are mapped and executed in parallel on the graphics card. The compiler automatically generates code for the graphics card from loops within accelerator regions. This allows data-parallel execution, but no task parallelism. Further directives allow the programmer to influence the way a loop is executed in parallel and which data have to be copied to and from the graphics card. The compilation is feedback based, that is, the programmer gets feedback on whether a loop is executed in parallel or, if not, why the loop cannot be parallelized. To declare an accelerator region, the acc region pragma is used:


#pragma acc region
{
    // for loops
    ...
}

Data directives are used to tell the compiler which parts of an array are required on the device for computation. Data regions can be used to embed several accelerator regions that work on the same data:

#pragma acc data region copyin(img[0:width*height]) copyout(result[0:width*height])
{
    #pragma acc region
    {
        ...
    }
    #pragma acc region
    {
        ...
    }
}

To compile the code for the graphics card, the target architecture has to be specified as a compiler flag. This is done by the -ta=nvidia flag. It is also possible to compile a unified binary containing two versions of the code, one for the accelerator and one for the host, by specifying host as an additional target architecture. In case no graphics card is available, the code is then executed on the host.
In order to implement the histogram using the PGI Accelerator model, a similar approach has to be used as for RapidMind. The compiler does not allow writes to random locations, since that would lead to race conditions; the index on the left side of an assignment may only depend on the index of the enclosing loop. In a first step, one histogram per image line is generated and stored to temporary memory that is allocated only on the graphics card using the local data clause. For each bin we iterate over the image line, count the frequency, and store it to the temporary memory. The compiler, however, complains that the loop cannot be parallelized: "accelerator restriction: scalar variable live-out from loop: count". Telling the compiler that the count variable is private to each thread, the different threads do not interfere and the loop can be processed in parallel. In the second step, the histograms of each image line are merged into the final histogram, as seen in Listing 3.1.
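As an illustration of such an accelerator region, the following is a minimal sketch (not taken from the evaluated code base) of the sum-of-squared-differences computation whose compiler feedback is shown further below; the array names img_drr and img_fll and the variable sum are taken from that feedback, while the function signature and loop structure are assumptions:

/* sum of squared differences between two images; the reduction on sum
   is detected automatically by the compiler */
float ssd(const float *img_drr, const float *img_fll, int width, int height) {
    float sum = 0.0f;

    #pragma acc region copyin(img_drr[0:width*height]) copyin(img_fll[0:width*height])
    {
        for (int i = 0; i < width * height; i++) {
            float diff = img_drr[i] - img_fll[i];
            sum += diff * diff;
        }
    }
    return sum;
}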

Evaluation
Time required to get familiar with a particular framework: PGI Accelerator requires little time to get familiar with. After one day the main features and the concept behind PGI Accelerator are thoroughly understood and the application can be parallelized using the PGI Accelerator model.


#pragma acc region copyin(img[0:width*height]) local(tmp[0:height-1][0:MAX_INTENSITY]) copyout(hist[0:MAX_INTENSITY])
{
    int count;

    /* first step: calculate one histogram per image line */
    #pragma acc for private(count)
    for (int y = 0; y < height; y++) {
        for (int i = 0; i < MAX_INTENSITY; i++) {
            count = 0;
            for (int x = 0; x < width; x++) {
                if ((int) img[x + y * width] == i) count++;
            }
            tmp[y][i] = count;
        }
    }

    /* second step: merge the per-line histograms into the final histogram */
    #pragma acc for private(count)
    for (int i = 0; i < MAX_INTENSITY; i++) {
        count = 0;
        for (int y = 0; y < height; y++) {
            count += tmp[y][i];
        }
        hist[i] = count;
    }
}

Listing 3.1: Histogram calculation using the PGI Accelerator model.


Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added: Existing C or Fortran code can be parallelized with almost no change to the original code, using only pragmas to express the parallelism. However, C++ features like classes, or even C structs, are not allowed within accelerator regions. The code has to be restructured in order to conform to the ANSI C99 standard. Also, function calls within accelerator regions are not allowed and have to be inlined manually. Most of the time is spent getting the code ANSI C99 conformant and free of structs. In particular when a class is used for matrix and vector operations, a lot of effort has to be invested to rewrite and inline those operations. While only a few days are required to annotate the code with pragmas, almost two weeks were required to rewrite the code base to conform to the ANSI C99 standard.
Support of parallelization by the framework: The parallelization is completely done by the framework; only the code fragments to be parallelized have to be annotated. In addition, hints for data transfers and the data parallelization strategy are required.
Support of data partitioning by the framework: The main data partitioning is done by the user, that is, the loops iterating over the data set are not altered. The user, however, can influence the parallelization strategy of loops. For example, using the kernel directive, the body of the corresponding loop is chosen as computational kernel. The parallel and vector directives can be used to influence the mapping to blocks and threads, respectively, of the underlying CUDA architecture. In addition there are directives to execute a loop sequentially (seq), to unroll a loop, or to execute the loop on the host. Most of the directives have an optional width parameter to specify the number of iterations to be considered for the corresponding scheduling clause.
Applicability for different algorithm classes and target platforms: PGI Accelerator supports fine-grained data parallelism, but no coarse-grained task parallelism. The sources of data parallelism are loops iterating over big data sets.
Advantages, drawbacks, and difficulties of a particular framework: The biggest advantage of the PGI Accelerator model, as is the case with OpenMP, is that it is relatively easy to parallelize the code. No new syntax has to be learned; only hints for the compiler have to be given. All the code to manage resources on the graphics card and to transfer data to and back from the graphics memory is completely handled by the framework. The feedback-based compilation also helps the programmer to get code that maps well to the hardware. The feedback when compiling the sum of squared differences loop shows which data are automatically copied to the graphics card, which loop is used to generate the kernel, which vector width is used for execution, and that a reduction is automatically generated for the variable sum:

ssd:
     24, Generating copyin(img_drr[:height*width])
         Generating copyin(img_fll[:height*width])
     27, Loop is parallelizable
         Accelerator kernel generated
         27, #pragma acc for parallel, vector(256)
         31, Sum reduction generated for sum
     28, Loop is parallelizable


Some feedback is, however, not easy to understand without background knowledge in compiler technology and transformations:

mi:
     42, Accelerator restriction: scalar variable live-out from loop: count

The PGI Accelerator compilers also provide support to time kernels automatically. Therefore, the time option has to be added when specifying the target architecture: -ta=nvidia,time. When the program finishes execution, the statistics for each accelerator region are printed, showing what time is spent on data transfers and kernel execution:

quality_measures.c
  ssd
    24: region entered 160 times
        time(us): total=293282 init=86 region=293196
                  kernels=73109 data=220087
        w/o init: total=293196 max=2060 min=1801 avg=1832
    27: kernel launched 320 times
        grid: [1-2]  block: [256]
        time(us): total=73109 max=454 min=7 avg=228

For multi-device support, the PGI Accelerator model can be used in combination with OpenMP: for each OpenMP thread, a context is created and associated with one graphics card accelerator (a sketch is given at the end of this section).
The most obvious drawback of the PGI Accelerator model is the lack of support for C++. The code base has to be rewritten to conform to ANSI C99. Furthermore, there is currently no support for device-permanent data, that is, data cannot be kept on the device across accelerator data regions. Since data regions are limited to function scope, this also limits the lifetime of data in graphics memory. Constant data like the volume in the 2D/3D image registration has to be copied to device memory each time. For future versions of the Accelerator model, support is planned for generating code for standard shared memory multi-core processors from the accelerator annotations and for defining device-permanent data, but not for supporting graphics cards from ATI (D. Miles, personal communication).
Effort to get a certain result, for example, performance or throughput: The effort to get a reasonable speedup is huge. As long as device-permanent data is not supported, most time is spent on data transfers, unless the program is completely rewritten to have all computations in one huge function.
Scalability of the framework with respect to architecture family, new hardware:


Table 3.3: Execution times in seconds using PGI Accelerator on the Quadro FX 5800 and Tesla C 2050.

  # cores        128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  240 (Quadro)   9.13             30.17             141.33
  448 (Tesla)    15.07            39.91             134.13

Table 3.3 shows the execution times of the fine-grained implementation using PGI Accelerator on the Quadro FX 5800 and Tesla C 2050 for different volume and image resolutions, and Figure 3.3 shows the corresponding speedup compared to the baseline implementation. It can be seen that the PGI Accelerator model does not scale well on the new Tesla C 2050 for small and medium problem sizes and even takes longer to execute. For big problem sizes, the implementation does not scale proportionally with the number of cores and is only marginally faster compared to the Quadro FX 5800.

Figure 3.3: Speedup of the PGI Accelerator implementation compared to the baseline implementation for different volume resolutions.

Scalability in terms of problem size: As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Figure 3.4 shows the execution time ratio when moving to a bigger volume resolution.

It can be seen that for both graphics cards the ratio is far below 8. Hence, the implementation scales well for larger problem sizes. However, this is mainly due to the high degree of parallelism required to utilize the graphics hardware best.

Figure 3.4: Scalability of fine-grained implementation using PGI Accelerator: Shown is the execution time ratio when moving to a different volume resolution.

Resource awareness of framework (run-time system): PGI Accelerator is not resource aware. Kernels executed on graphics hardware occupy the processors exclusively.

Resource management for multiple sub-algorithms (image pipeline): PGI Accelerator does not provide support for resource management for pipelining.

Support of streaming by the framework: PGI Accelerator does not provide streaming support for series of images.


3.3 OpenCL

Framework
A detailed characterization and elaborate description of OpenCL is given in Section 2.5. Only the differences when using OpenCL for graphics cards are covered here. To run an OpenCL kernel on a graphics card, only a different device has to be chosen as target platform: CL_DEVICE_TYPE_GPU selects a graphics card, while CL_DEVICE_TYPE_CPU selects a standard shared memory multi-core processor. The histogram implementation on the host from Section 2.5 does not utilize the local memory of the graphics processors. This is, however, essential to get good performance. Therefore, OpenCL introduces the keyword __local, which specifies that data is stored in the fast on-chip memory of the graphics processors. In Listing 3.2 this memory is used to calculate one histogram per image line in local memory before storing it back to global memory. To synchronize the accesses of different threads to the local memory, the barrier command is provided. To avoid the race condition when several threads update the same bin, an atomic increment function provided by current graphics cards is used. To enable support for atomic functions, an OpenCL compiler directive is required. The code to call the kernels changes slightly, since the number of threads that calculate the histogram of one image line is now fixed to the number of bins. This is specified by the additional local_work_size parameter to clEnqueueNDRangeKernel in Listing 3.3. If the local work size is specified, the global work size has to be a multiple of the local work size and has to be adjusted accordingly.
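As a minimal sketch (not code from the study; error handling is omitted), switching between CPU and GPU only changes the device type requested from the OpenCL runtime:

#include <CL/cl.h>

cl_device_id select_device(int use_gpu) {
    cl_platform_id platform;
    cl_device_id device;
    cl_device_type type = use_gpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU;

    /* take the first platform and the first device of the requested type */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, type, 1, &device, NULL);
    return device;
}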

Evaluation
Since OpenCL was already evaluated in Section 2.5, only those criteria are listed that differ from the previous evaluation.
Advantages, drawbacks, and difficulties of a particular framework:
While it is enough to change one parameter to generate code for a different target platform, the code has to be adapted to the underlying hardware architecture to get good performance. For graphics cards, the faster on-chip local memory has to be used to store intermediate results, and atomic functions are used for computations free of race conditions. This means that different code has to be written for each target architecture; depending on the available devices, the appropriate implementation is chosen. For different problem sizes, the number of pixels per line differs and, hence, also the number of work-items that have to calculate the histogram. For large images (e.g., an image width greater than 512 pixels in one direction), the maximal number of elements that can be specified for the local work size is exceeded. That is, a different approach is needed, like processing multiple pixels per thread (a variant is sketched after Listing 3.2). OpenCL libraries are provided by the two major graphics hardware manufacturers, NVIDIA and ATI.


#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable

__kernel void ocl_hist_lmem(__global int *hist, __global float *img, int width) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int tid = get_local_id(0);

    __local unsigned int tmp_hist[MAX_INTENSITY];

    /* initialize local memory */
    if (tid < MAX_INTENSITY) {
        tmp_hist[tid] = 0;
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    /* create histogram for one image line */
    if (i < width) {
        int img_val = (int) img[i + j*width];
        atom_inc(&tmp_hist[img_val]);
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    /* write histogram of the image line to global memory */
    if (tid < MAX_INTENSITY) {
        hist[(get_group_id(0) + get_num_groups(0)*get_group_id(1)) * MAX_INTENSITY + tid] = tmp_hist[tid];
    }
}

__kernel void ocl_hist_merge(__global int *hist, __global int *tmp_hist, int height) {
    int i = get_global_id(0);

    int count = 0;
    /* sum the per-line sub-histograms; the loop body was truncated in the
       original and is reconstructed here */
    for (int k = 0; k < height; k++) {
        count += tmp_hist[i + k*MAX_INTENSITY];
    }
    hist[i] = count;
}
Listing 3.2: OpenCL kernels for the histogram calculation using local memory.
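For image lines wider than the maximum local work size, each work-item can process several pixels of a line, as mentioned in the evaluation above. A minimal sketch of such a variant (not code from the study; kernel name and indexing are assumptions, one work-group per image line):

__kernel void ocl_hist_lmem_wide(__global int *hist, __global float *img, int width) {
    int j   = get_global_id(1);
    int tid = get_local_id(0);
    int lsz = get_local_size(0);

    __local unsigned int tmp_hist[MAX_INTENSITY];

    if (tid < MAX_INTENSITY) tmp_hist[tid] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* each work-item handles pixels tid, tid+lsz, tid+2*lsz, ... of line j */
    for (int i = tid; i < width; i += lsz) {
        int img_val = (int) img[i + j * width];
        atom_inc(&tmp_hist[img_val]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* one work-group per line, so the sub-histogram index is simply j */
    if (tid < MAX_INTENSITY) {
        hist[j * MAX_INTENSITY + tid] = tmp_hist[tid];
    }
}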


size_t global_work_size[2], local_work_size[2];
local_work_size[0] = MAX_INTENSITY;
local_work_size[1] = 1;
global_work_size[0] = (int) ceil((float)(width)/local_work_size[0]) * local_work_size[0];
global_work_size[1] = height;

// set parameters for histogram calculation
clSetKernelArg(cl_kernel_hist, 0, sizeof(cl_mem), &cl_tmp_hist);
clSetKernelArg(cl_kernel_hist, 1, sizeof(cl_mem), &cl_img);
clSetKernelArg(cl_kernel_hist, 2, sizeof(int), &width);

// execute kernel over the entire range of the data set
clEnqueueNDRangeKernel(cl_queue, cl_kernel_hist, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);

/* merge histograms */
global_work_size[0] = MAX_INTENSITY;
global_work_size[1] = 1;
int num_subhist = height * (int) ceil((float)(width)/local_work_size[0]);

/* set parameters */
clSetKernelArg(cl_kernel_merge_hist, 0, sizeof(cl_mem), &cl_hist);
clSetKernelArg(cl_kernel_merge_hist, 1, sizeof(cl_mem), &cl_tmp_hist);
clSetKernelArg(cl_kernel_merge_hist, 2, sizeof(int), &num_subhist);
/* execute kernel */
clEnqueueNDRangeKernel(cl_queue, cl_kernel_merge_hist, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
Listing 3.3: OpenCL kernel launch for histogram calculation.

While the core functionality of the OpenCL libraries is the same, there are a few differences when developing with the different frameworks. Switching from one OpenCL framework to the other also implies switching between the convenience functionality provided by each framework to load source files, etc. The core implementation differs as well: while NVIDIA implemented almost the complete OpenCL specification, some features like support for various image formats are still missing in the ATI implementation. There are also major differences in the quality and correctness of the just-in-time compilers. While the NVIDIA just-in-time compiler (CUDA toolkit 3.0) worked without errors and provided correct results, the ATI just-in-time compiler (ATI Stream SDK 2.0.1) generated incorrect code when local memory was used and crashed with internal segmentation faults while compiling some source files. With the latest updates from ATI (ATI Stream SDK 2.1), the compiler generated correct code, but still had problems with some input files; renaming variables or changing the amount of whitespace randomly resolved the compilation issues. For the largest volume size, the ATI run-time did not allow more than 256 MB to be allocated for one single array and quit with a CL_INVALID_BUFFER_SIZE error. Furthermore, to increase the locality when accessing voxels in the volume, 3D textures were used. However, this was only possible using the OpenCL implementation from NVIDIA, since the required image format is not yet implemented in ATI's OpenCL implementation. Moreover, compiling the implementation presented here, which utilizes local memory, for the standard multi-core platform yields no errors, but the obtained results are incorrect. Extensions (e.g., atomic functions) are only provided by some hardware devices; the programmer has to take care of this and, hence, provide different implementations.
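Whether a device provides a required extension can be queried at run time before the corresponding kernel variant is selected. A minimal sketch (not code from the study):

#include <CL/cl.h>
#include <string.h>

/* returns non-zero if the device advertises the given extension,
   e.g. "cl_khr_local_int32_base_atomics" as used in Listing 3.2 */
int device_has_extension(cl_device_id device, const char *extension) {
    char extensions[4096];
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
    return strstr(extensions, extension) != NULL;
}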

Effort to get a certain result, for example, performance or throughput:
For good performance, the program has to be adjusted to utilize the fast on-chip local memory of the graphics processors. This is an additional factor that has to be considered compared to programming standard multi-core processors using OpenCL. The execution times obtained with OpenCL are, however, worse than those obtained with CUDA on the same graphics card.

Scalability of the framework with respect to architecture family, new hardware:
Table 3.4 shows the execution times of the fine-grained implementation using the GPU back end of OpenCL on the Quadro FX 5800, Tesla C 2050, and Radeon HD 5870 for different volume and image resolutions, and Figure 3.5 shows the corresponding speedup compared to the baseline implementation. It can be seen that the GPU back end of OpenCL scales well on the new Tesla C 2050, even for small problem sizes. The execution times on the Radeon HD 5870 fall far short of the NVIDIA cards, even though it has the highest single precision peak performance (2.7 TFLOPS vs. 933.12 GFLOPS and 1.288 TFLOPS). For big problem sizes, the implementation scales disproportionately high with the number of cores and the new architecture without any change to the source code.

Table 3.4: Execution times in seconds using OpenCL on the Quadro FX 5800, Tesla C 2050, and Radeon HD 5870.

  240 cores (Quadro)      128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             5.92             46.45            391.59
  + Local Memory                    0.47              1.97             13.12
  + 3D Texture                      0.43              1.71             10.98

  448 cores (Tesla)       128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             7.60             60.86            481.12
  + Local Memory                    0.36              1.33              8.23
  + 3D Texture                      0.37              1.39              8.35

  1600 cores (Radeon HD)  128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             8.18             18.59                 –
  + Local Memory                    1.37              3.08                 –
  + 3D Texture                         –                 –                 –


Figure 3.5: Speedup of the OpenCL implementation compared to the baseline implementation for different volume resolutions (128 × 128 × 94: Quadro FX 5800 14.92, Tesla C 2050 17.65, Radeon HD 5870 4.72; 256 × 256 × 189: 32.61, 40.17, 18.11; 512 × 512 × 378: 43.23, 56.82).

Scalability in terms of problem size:
As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Figure 3.6 shows the execution time ratio when moving to a bigger volume resolution. It can be seen that the ratio is far below 8 for both graphics cards when moving from the small to the middle volume resolution. Moving from the middle to the high resolution, the ratio is at a factor of 6. Hence, the implementation scales better for small problem sizes. However, this is mainly due to the high degree of parallelism required to utilize the graphics hardware best.

Resource awareness of framework (run-time system):
OpenCL for many-core processors is not resource aware. Kernels executed on the graphics hardware occupy the processors exclusively.


Figure 3.6: Scalability of the fine-grained implementation using the GPU back end of OpenCL: shown is the execution time ratio when moving to a different volume resolution (128 → 256: Quadro FX 5800 3.94, Tesla C 2050 3.79, Radeon HD 5870 2.25; 256 → 512: 6.42, 6.02).

3.4 CUDA

Framework

The Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by NVIDIA [LNOM08]. CUDA provides an application programming interface that allows harnessing the processing power of NVIDIA's graphics cards for data parallel non-graphics computations. CUDA extends the C language with some keywords to launch programs on the graphics card, to get unique identifiers of the threads executing on the graphics hardware, and to synchronize the execution between threads. CUDA provides a low-level driver API as well as a high-level runtime API with different levels of detail. The latest graphics architecture from NVIDIA also supports C++ on the graphics card [NVI09]. NVIDIA provides its own compiler nvcc for Windows, GNU Linux, and Mac OS X to compile CUDA source files that can later be linked with standard C/C++ code.


A compiler for Fortran CUDA is provided by the Portland Group. After the success of CUDA, OpenCL was created to define a similar hardware and vendor independent API, and NVIDIA also provided an OpenCL implementation. The original CUDA implementation has since then been called C for CUDA. OpenCL and CUDA code on the device are to a large extent the same, and only the keywords used differ. Therefore, the same approach is used to calculate the histogram in CUDA as was used in Section 3.3 for OpenCL. Listing 3.4 shows the corresponding kernels: __kernel and __local are replaced by __global__ and __shared__, __syncthreads() is used as barrier function, and the index is calculated using different variables, but the functionality is exactly the same. The CUDA SDK provides commands to initialize the graphics card, create the corresponding context, etc.:

// initialize GPU
CUT_DEVICE_INIT(argc, argv);

Unlike OpenCL, the launch of a kernel is done by a call to the kernel. The parameters are passed like in a normal function call, and the execution configuration for the kernel is specified between <<<...>>> brackets as seen in Listing 3.5.
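A minimal sketch of this host-side pattern (not code from the study; the kernel and all names are illustrative) combines allocation, transfers, and a launch:

#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void scale_on_gpu(float *host_data, int n) {
    float *dev_data;
    cudaMalloc((void **)&dev_data, n * sizeof(float));
    cudaMemcpy(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 threads(256);
    dim3 grid((n + threads.x - 1) / threads.x);
    scale<<<grid, threads>>>(dev_data, 2.0f, n);   /* the kernel call is the launch */

    cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}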

Evaluation
Time required to get familiar with a particular framework:
CUDA requires some time to get familiar with the terminology used and to understand the way the CUDA platform works. Starting with some simple examples is relatively easy since NVIDIA ships many examples with its SDK and the technology has matured in the meantime.
Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added:
In order to parallelize existing code, almost the complete source code has to be restructured and rewritten. The kernels are written in a language based on C/C++ and are compiled by NVIDIA's nvcc compiler. Similar to OpenCL, the invocation and implementation of parallel functionality are separated, and the kernel code operates only on one data element. The kernels are almost identical to the OpenCL ones apart from some keywords. The host code to manage the graphics card, launch kernels, etc., however, is much more compact compared to OpenCL. The time required for parallelization is less than for OpenCL.
Support of parallelization by the framework:
The parallelization is completely done by the framework; only the operations that should be applied to each data element in a set have to be defined.
Support of data partitioning by the framework:
The data partitioning is done by the user, that is, the operations within a CUDA kernel define the data partition that is assigned to one core. This is typically one data element. The tiling of the data set into partitions is done by the user by specifying the one- or two-dimensional size of the tile.


__global__ void cu_hist_smem(int *hist, float *img, int width) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    const int j = blockDim.y * blockIdx.y + threadIdx.y;
    int tid = threadIdx.x;

    __shared__ int tmp_hist[MAX_INTENSITY];

    /* initialize memory */
    if (tid < MAX_INTENSITY) {
        tmp_hist[tid] = 0;
    }

    __syncthreads();

    /* create histogram for one image line */
    if (i < width) {
        int img_val = (int) img[i + j*width];
        atomicAdd(&tmp_hist[img_val], 1);
    }

    __syncthreads();

    /* write histogram of the image line to global memory */
    if (tid < MAX_INTENSITY) {
        hist[(blockIdx.x + gridDim.x*blockIdx.y) * MAX_INTENSITY + tid] = tmp_hist[tid];
    }
}

__global__ void cu_hist_merge(int *hist, int *tmp_hist, int height) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;

    int count = 0;
    /* sum the per-line sub-histograms; the loop body was truncated in the
       original and is reconstructed here */
    for (int k = 0; k < height; k++) {
        count += tmp_hist[i + k*MAX_INTENSITY];
    }
    hist[i] = count;
}
Listing 3.4: CUDA kernels for the histogram calculation using shared memory.


dim3 threads(256, 1);
dim3 grid((int) ceil((float)width/threads.x), (int) ceil((float)height/threads.y));

/* calculate histogram */
cu_hist_smem<<<grid, threads>>>(tmp_hist, img, width);

/* merge histograms */
unsigned int num_subhist = height * grid.x;
grid.x = 1;
grid.y = 1;
threads.x = MAX_INTENSITY;
cu_hist_merge<<<grid, threads>>>(hist, tmp_hist, num_subhist);
Listing 3.5: CUDA kernel launch for histogram calculation.

Applicability for different algorithm classes and target platforms:
CUDA supports fine-grained data parallelism, but no coarse-grained task parallelism.
Advantages, drawbacks, and difficulties of a particular framework:
CUDA is currently the de facto standard for programming graphics cards for non-graphics problems. Its runtime API is easy to program and less error-prone compared to the low-level API of OpenCL. CUDA supports, however, only graphics cards from NVIDIA. Then again, NVIDIA favors CUDA over its own OpenCL implementation and is able to expose the latest features of its graphics cards in CUDA first. The same code also performs more efficiently in CUDA.
Effort to get a certain result, for example, performance or throughput:
For good performance, the program has to be adjusted to utilize the fast on-chip local memory of the graphics processors. This is an additional factor that has to be considered compared to programming standard multi-core processors. Similar to OpenCL, the logic of the program also depends on the problem size when shared memory is utilized.
Scalability of the framework with respect to architecture family, new hardware:
Table 3.5 shows the execution times of the fine-grained implementation using CUDA on the Quadro FX 5800 and Tesla C 2050 for different volume and image resolutions, and Figure 3.7 shows the corresponding speedup compared to the baseline implementation. It can be seen that CUDA scales well on the new Tesla C 2050, even for small problem sizes.


Table 3.5: Execution times in seconds using CUDA on the Quadro FX 5800 and Tesla C 2050.

  240 cores (Quadro)   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             –                 –                 –
  + Local Memory                 0.28              1.01              8.27
  + 3D Texture                   0.26              0.84              5.27

  448 cores (Tesla)    128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             –                 –                 –
  + Local Memory                 0.25              0.74              4.68
  + 3D Texture                   0.26              0.66              3.32

Scalability in terms of problem size:
As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Figure 3.8 shows the execution time ratio when moving to a bigger volume resolution. It can be seen that the ratio is far below 8 for both graphics cards when moving from the small to the middle volume resolution. Moving from the middle to the high resolution, the ratio is at a factor of 6. Hence, the implementation scales better for small problem sizes. However, this is mainly due to the high degree of parallelism required to utilize the graphics hardware best.

Resource awareness of framework (run-time system):
CUDA is not resource aware. Kernels executed on the graphics hardware occupy the processors exclusively.

Resource management for multiple sub-algorithms (image pipeline):
CUDA does not provide support for resource management for pipelining.

Support of streaming by the framework:
CUDA does not provide streaming support for series of images.


Figure 3.7: Speedup of the CUDA implementation compared to the baseline implementation for different volume resolutions (128 × 128 × 94: Quadro FX 5800 24.7, Tesla C 2050 25.36; 256 × 256 × 189: 66.03, 84.09; 512 × 512 × 378: 90.04, 142.93).


Figure 3.8: Scalability of the fine-grained implementation using CUDA: shown is the execution time ratio when moving to a different volume resolution (128 → 256: Quadro FX 5800 3.31, Tesla C 2050 2.53; 256 → 512: 6.24, 5.01).


3.5 Ct

Intel announced a framework for data parallel programming on its own platforms called C for Throughput Computing (Intel Ct) [GSF+07]. Ct is similar to the approach taken by RapidMind: custom data types are introduced, and the Ct programs are dynamically compiled at run-time for the underlying hardware. Since Intel acquired RapidMind, it is merging RapidMind into Ct and will discontinue RapidMind afterwards. However, as mentioned in a Ct presentation by one of the developers working on Ct, Intel will not provide support for hardware from manufacturers other than Intel itself. Moreover, the focus of Ct will not be on speed, but on productivity and portability, and, hence, a performance difference of one order of magnitude compared to hand-tuned implementations should be expected (M. Klemm, personal communication). The first public beta of Intel Ct will be made available in Q3 2010. Therefore, Ct could not be evaluated as part of this study.

3.6 Larrabee

Larrabee is a many-core architecture based on simple x86 cores for visual computing, announced in 2008 by Intel [SCS+08]. Compared to graphics hardware from NVIDIA or ATI, Larrabee is more flexible: the programmer can influence the scheduling of tasks and threads. Larrabee also hosts L1 and L2 caches, similar to the new Fermi architecture from NVIDIA. To program Larrabee, either Intel Ct could be used or a low-level programming interface, also called Larrabee. The latter is similar to the programming of the Cell Broadband Engine, where assembly instructions inlined in C/C++ are used to program the vector units. However, Intel canceled the first generation of Larrabee graphics chips at the end of 2009. In the meantime, Intel discontinued Larrabee as a many-core architecture for visual computing and instead announced a successor architecture called Many Integrated Core (MIC), this time targeting only the HPC market. Therefore, Larrabee could not be evaluated as part of this study.

3.7 Related Frameworks

In this section, some other frameworks are listed that were not evaluated in this study, but that either harness the computational performance of graphics cards or provide services for development on graphics cards.

3.7.1 Bulk-Synchronous GPU Programming
Bulk-Synchronous GPU Programming (BSGP) is a programming language for general purpose computation on graphics cards [HZG08]. BSGP introduces a few keywords that integrate into the existing source code and define the number of threads to be spawned on the graphics card as well as synchronization points. BSGP was developed at Zhejiang University, and the BSGP compiler is freely available for Windows.

3.7.2 HMPP Workbench
The Heterogeneous Multi-core Parallel Programming (HMPP) Workbench is a compiler developed by CAPS that translates annotated source code for multi-core and many-core platforms [BMD+09]. The annotations are directives like in the PGI Accelerator model and can generate code for different back ends like standard shared memory multi-core processors, CUDA for NVIDIA graphics cards, as well as OpenCL for graphics cards from NVIDIA and ATI.

3.7.3 Goose
Goose is a compiler developed by K&F that translates annotated source code for multi-core and many-core platforms [goo]. The annotations are directives like in the PGI Accelerator model and can generate code for different accelerators like graphics cards from NVIDIA and ATI, as well as the GRAPE-DR. Support for further targets like OpenCL, Intel SSE Technology, and GRAPE-7 is planned.

3.7.4 YDEL for CUDA
Fixstars provides a GNU Linux solution targeted at productive CUDA systems and deployments. To this end, Fixstars provides Yellow Dog Enterprise Linux, which is optimized for GPU computing. That is, they provide packages for CUDA and allow seamless switching between different CUDA versions. Furthermore, they adapted the GNU Linux kernel for best performance on such systems. This comes at the expense of $400 per server and year. Their YDEL provides a consistent system environment for the deployment of solutions using CUDA.

3.8 Discussion

To summarize the results of the frameworks, the achieved performance of the many-core frameworks is compared. Figures 3.9, 3.10, and 3.11 show the execution times of the fine-grained implementation of each framework on different graphics cards for different volume and image resolutions. The graphs show that RapidMind and in particular PGI Accelerator achieve poor performance. While these frameworks take a high-level approach and abstract from the details of the underlying hardware, they suffer from poor performance. In contrast, OpenCL and CUDA are tied very closely to the underlying hardware and achieve about one order of magnitude better performance. The OpenCL implementation is only half as fast as CUDA, but this discrepancy is likely to diminish with future OpenCL versions.


In summary, the best scaling framework for all volume sizes is CUDA, followed by OpenCL. OpenCL has the further advantage of supporting multi-core processors. All frameworks but PGI Accelerator require the source code to be ported to a different language and adapted to the new hardware. The feedback of the PGI compilers is helpful, but some key features are missing in the current version, like support for device permanent data. RapidMind programs are concise, easy to read, and comprehensible. Unfortunately, the achieved performance is far from the performance of CUDA and OpenCL.

Figure 3.9: Execution times in seconds of the fine-grained implementations on the different graphics cards for the 128 × 128 × 94 volume (RapidMind: 7.22 and 5.82; PGI Accelerator: 9.13 (Quadro FX 5800) and 15.07 (Tesla C 2050); OpenCL: 0.43 (Quadro), 0.37 (Tesla), and 1.37 (Radeon HD 5870); CUDA: 0.26 on both NVIDIA cards).


Figure 3.10: Execution times in seconds of the fine-grained implementations on the different graphics cards for the 256 × 256 × 189 volume (RapidMind: 13.9 and 9.69; PGI Accelerator: 30.17 (Quadro FX 5800) and 39.91 (Tesla C 2050); OpenCL: 1.71 (Quadro), 1.39 (Tesla), and 3.08 (Radeon HD 5870); CUDA: 0.84 (Quadro) and 0.66 (Tesla)).


Figure 3.11: Execution times in seconds of the fine-grained implementations on the different graphics cards for the 512 × 512 × 378 volume (RapidMind: 51.08 and 22.63; PGI Accelerator: 141.33 (Quadro FX 5800) and 134.13 (Tesla C 2050); OpenCL: 10.98 (Quadro) and 8.35 (Tesla); CUDA: 5.27 (Quadro) and 3.32 (Tesla)).

4 Conclusion

In this study, different parallelization frameworks for standard shared memory multi-core processors as well as parallelization frameworks for many-core processors like graphics cards have been evaluated. The evaluation criteria were not limited to pure performance numbers; other aspects like the scalability or the productivity of a framework were considered as well. For evaluation, a computationally intensive application from medical imaging, namely 2D/3D image registration, has been chosen. In 2D/3D image registration, a preoperatively acquired volume is registered with an X-ray image. A 2D projection from the volume is generated and aligned with the X-ray image by means of translating and rotating the volume according to the three coordinate axes. This alignment step is repeated to find the best match between the projected 2D image and the X-ray image.

For multi-core platforms, two parallelization strategies were evaluated, namely fine-grained data parallelism and coarse-grained task parallelism. From the considered frameworks, only OpenMP, Cilk++, and Threading Building Blocks support both approaches, while only the fine-grained data parallelism could be realized using RapidMind and OpenCL. While for the coarse-grained approach all frameworks yield equally good results, the overhead of the fine-grained approach differs a lot. OpenMP and RapidMind come along with a lot of overhead for a small number of cores, while the other frameworks have only little overhead here. The best results for the fine-grained approach were obtained using TBB and OpenMP, followed by Cilk++. OpenCL was outstanding only for large problem sizes, though. In terms of productivity, OpenMP and Cilk++ required only minor modifications of the source code, while Threading Building Blocks required major restructuring of the source code. In contrast, RapidMind and OpenCL required a complete restructuring of the source code, and also the algorithm had to be expressed differently.

On many-core platforms, only fine-grained data parallelism is suitable and was, hence, investigated. All investigated frameworks, namely RapidMind, PGI Accelerator, OpenCL, and CUDA, support graphics cards from NVIDIA and mainly target their CUDA interface. Only RapidMind and OpenCL also support hardware from other manufacturers like graphics cards from ATI. While PGI Accelerator requires only minor modifications to the source code for parallelization and mapping to the graphics card, all other frameworks required a complete restructuring of the source code and a different way to express the algorithm. OpenCL and CUDA are very close to the underlying hardware and achieve the best results, whereas the performance of RapidMind and PGI Accelerator was far off. The CUDA implementation was still twice as fast as the OpenCL implementation.

In summary, the most promising framework is certainly OpenCL, targeting multi-core as well as many-core platforms with remarkable performance. While the implementation is not yet as mature as other frameworks like CUDA, this situation will change with time. Another benefit is that a relatively smooth transition from CUDA to OpenCL with almost no change to the computational kernels is possible. This makes OpenCL in the long term a serious alternative to CUDA as a parallelization framework for graphics hardware. In the multi-core world, there exist multiple popular alternatives besides OpenCL, like OpenMP and Threading Building Blocks. All of them have different strengths, and the framework choice depends strongly on the requirements and the existing environment.

Acknowledgment

We are indebted to the RRZE (Regional Computing Center Erlangen) and their HPC team for granting computational resources and providing access to preproduction hardware.


Bibliography

[Amd67] G.M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proceedings of the AFIPS Spring Joint Computing Conference, pages 483–485. ACM, 1967.

[BJK+95] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. ACM SIGPLAN Notices, 30(8):207–216, 1995.

[BMD+09] S. Bihan, G.E. Moulard, R. Dolbeau, H. Calandra, and R. Abdelkhalek. Directive-Based Heterogeneous Programming: A GPU-Accelerated RTM Use Case. CCCT, 2009.

[goo] Goose: Domain-Specific Compiler. http://www.kfcr.jp/goose-e.html.

[GSF+07] A. Ghuloum, E. Sprangle, J. Fang, G. Wu, and X. Zhou. Intel Whitepaper: Ct: A Flexible Parallel Programming Model for Tera-Scale Architectures. http://techresearch.intel.com/UserFiles/en-us/File/terascale/Whitepaper-Ct.pdf, Oct 2007.

[HZG08] Q. Hou, K. Zhou, and B. Guo. BSGP: Bulk-Synchronous GPU Programming. In ACM SIGGRAPH 2008 Papers, page 19. ACM, 2008.

[KDF+08] A. Kubias, F. Deinzer, T. Feldmann, S. Paulus, D. Paulus, B. Schreiber, and T. Brunner. 2D/3D Image Registration on the GPU. International Journal of Pattern Recognition and Image Analysis, 18(3):381–389, 2008.

[Lei09] C.E. Leiserson. The Cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference, pages 522–527. ACM, 2009.

[LNOM08] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39–55, 2008.

[MDT04] M. McCool and S. Du Toit. Metaprogramming GPUs with Sh. AK Peters, Ltd., 2004.


[Mun09] A. Munshi. The OpenCL Specification. Khronos OpenCL Working Group, 2009.

[NVI09] NVIDIA Corporation. NVIDIA Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, October 2009.

[Ope08] OpenMP Architecture Review Board. OpenMP Application Program Interface. OpenMP Architecture Review Board, May 2008.

[Ope09] OpenMP Architecture Review Board. Open Multi-Processing. http://openmp.org/, Oct 2009. Visited 23/10/2009.

[Rap09] RapidMind. RapidMind Development Platform Documentation. RapidMind Inc., June 2009.

[Rei07] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, Inc., 2007.

[SCS+08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. In ACM SIGGRAPH 2008 Papers, page 18. ACM, 2008.

[TV98] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, New Jersey, 1998.

[Wol10] M. Wolfe. Implementing the PGI Accelerator Model. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 43–50. ACM, 2010.

[WPD+97] J. Weese, G.P. Penney, P. Desmedt, T.M. Buzug, D.L.G. Hill, and D.J. Hawkes. Voxel-Based 2-D/3-D Registration of Fluoroscopy Images and CT Scans for Image-Guided Surgery. IEEE Transactions on Information Technology in Biomedicine, 1(4):284–293, 1997.
