SPES

Software Plattform Embedded Systems 2020

- Description of the Case Study "Multi-core and Many-core Evaluation" -

Version: 1.0

Project designation: SPES 2020
Responsible: Richard Membarth
QA responsible: Mario Körner, Frank Hannig
Created on: 18.06.2010
Last modified: 18.06.2010 16:09
Release status: project-public
Processing state: submitted

Created by: Richard Membarth

Contributors: Frank Hannig, Mario Körner, Wieland Eckert

Change Log

No.  Date      Version  Changed chapters  Description of change
1    22.06.10  1.0      All               Final report creation

Review Log

The following table gives an overview of all reviews of this document, both self-reviews and reviews by independent quality assurance.

Date    Reviewed version    New product state    Remarks    Reviewer

Contents

1 Evaluation Application and Criteria
  1.1 2D/3D Image Registration
  1.2 Checklist
  1.3 Profiling
  1.4 Parallelization Approaches

2 Multi-Core Frameworks
  2.1 OpenMP
  2.2 Cilk++
  2.3 Threading Building Blocks
  2.4 RapidMind
  2.5 OpenCL
  2.6 Discussion

3 Many-Core Frameworks
  3.1 RapidMind
  3.2 PGI Accelerator
  3.3 OpenCL
  3.4 CUDA
  3.5 Ct
  3.6 Larrabee
  3.7 Related Frameworks
    3.7.1 Bulk-Synchronous GPU Programming
    3.7.2 HMPP Workbench
    3.7.3 Goose
    3.7.4 YDEL for CUDA
  3.8 Discussion

4 Conclusion

Bibliography


Abstract

In this study, different parallelization frameworks for standard shared memory multi-core processors as well as parallelization frameworks for many-core processors like graphics cards are evaluated. To evaluate the frameworks, not only performance numbers are considered, but also other criteria like scalability, productivity, and further techniques supported by the framework, such as pipelining. For the evaluation, a computationally intensive application from medical imaging, namely 2D/3D image registration, is considered. In 2D/3D image registration, a preoperatively acquired volume is registered with an X-ray image. A 2D projection of the volume is generated and aligned with the X-ray image by translating the volume along and rotating it around the three coordinate axes. This alignment step is repeated to find the best match between the projected 2D image and the X-ray image. To evaluate the parallelization frameworks, two parallelization strategies are considered: on the one hand, fine-grained data parallelism, and on the other hand, coarse-grained task parallelism. We show that for most multi-core frameworks both strategies are applicable, whereas many-core frameworks support only fine-grained data parallelism. We compare relevant and widely used frameworks like OpenMP, Cilk++, Threading Building Blocks, RapidMind, and OpenCL for shared memory multi-core architectures. These include open source as well as commercially available solutions. Similarly, for many-core architectures like graphics cards, we consider RapidMind, PGI Accelerator, OpenCL, and CUDA. The frameworks take different approaches to provide parallelization support for the programmer, ranging from library solutions and directive-based compiler extensions to language extensions and completely new languages.


1 Evaluation Application and Criteria

Felix, qui potuit rerum cognoscere causas.

(Vergil)

In this chapter, the medical application chosen for the evaluation, namely the 2D/3D image registration, is introduced, as well as the criteria for the evaluation of the different frameworks. At the end, the implications of profiling the reference implementation for the evaluation are given, together with a description of the employed parallelization approaches.

1.1 2D/3D Image Registration

In medical settings, images of the same modality or of different modalities are often needed in order to provide precise diagnoses. However, a meaningful usage of different images is only possible if the images are correctly aligned beforehand. Therefore, an image registration algorithm is deployed. In the investigated 2D/3D image registration, a previously stored volume is registered with an X-ray image [KDF+08, WPD+97]. The X-ray image results from the attenuation of the X-rays through an object on their way from the source to the detector. The goal of the registration is to align the volume and the image. To this end, the volume can be translated along and rotated around the three coordinate axes. For such a transformation, an artificial X-ray image is generated by iterating through the transformed volume and calculating the attenuated intensity for each pixel. In order to evaluate the quality of the registration, the reconstructed X-ray image is compared with the original X-ray image using various similarity measures. To obtain the best alignment of the volume with the X-ray image, the parameters of the transformation are optimized until no further improvement is achieved. In this optimization step, one artificial X-ray image is generated for each evaluated set of transformation parameters and compared with the original image. The similarity measures include sequential algorithms like the summation of values over the whole image and have in parts random memory access patterns, for example for histogram generation.

Figure 1.1: Work flow of the 2D/3D image registration. [figure]

The work flow of the complete 2D/3D image registration, as shown in Figure 1.1, consists of two major computationally intensive parts. Firstly, a digitally reconstructed radiograph (DRR) is generated according to the parameter vector x = (tx, ty, tz, rx, ry, rz), describing the translation in mm along the axes and the rotation according to the Rodrigues vector [TV98]. Ray casting is used to generate the radiograph from the volume, casting one ray for each pixel through the volume. On its way through the volume, the intensity of the ray is attenuated depending on the material it passes. A detailed description with mathematical formulas on how the attenuation is calculated can be found in [KDF+08]. Secondly, intensity-based similarity measures are calculated in order to evaluate how well the digitally reconstructed radiograph and the X-ray image match. We consider three similarity measures, namely the sum of squared differences (SSD), normalized cross correlation (NCC), and mutual information (MI). These similarity measures are weighted to assess the quality of the current parameter vector x. To align the two images best, optimization techniques are used to find the best parameter vector. We use two different optimization techniques. In a first step, local search is used to find the best parameter vector by evaluating randomly changed parameter vectors. The changes to the parameter vector have to be within a predefined range, so that only parameter vectors similar to the input parameter vector are evaluated. The best of these parameter vectors is used in the second optimization technique, hill climbing. In hill climbing, always one of the parameters in the vector is changed: once increased by a fixed value and once decreased by the same value. This is done for all parameters, and the best parameter vector is taken afterwards for the next evaluation, now using a smaller value to change the parameters. This is repeated until no further improvement is found.
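The hill-climbing step can be sketched as follows. This is only an illustrative sketch: the ParamVec type, the evaluate() stand-in, and the halving of the step width are assumptions for the example, not the actual implementation or parameter schedule used in this study.

    #include <array>
    #include <cstddef>

    using ParamVec = std::array<double, 6>;   // (tx, ty, tz, rx, ry, rz)

    // Dummy stand-in so the sketch is self-contained; the real function generates
    // the DRR for p and returns the weighted SSD/NCC/MI similarity (higher is better).
    double evaluate(const ParamVec& p) {
        double s = 0.0;
        for (double v : p) s -= v * v;        // pretend the optimum is at the origin
        return s;
    }

    ParamVec hill_climbing(ParamVec best, double step, double min_step) {
        double best_q = evaluate(best);
        while (step > min_step) {
            ParamVec round_best = best;
            double round_q = best_q;
            // change each parameter once by +step and once by -step
            for (std::size_t i = 0; i < best.size(); ++i) {
                for (double delta : {+step, -step}) {
                    ParamVec cand = best;
                    cand[i] += delta;
                    double q = evaluate(cand);
                    if (q > round_q) { round_q = q; round_best = cand; }
                }
            }
            if (round_q <= best_q) break;     // no further improvement: stop
            best = round_best; best_q = round_q;
            step *= 0.5;                      // next round uses a smaller step (assumed schedule)
        }
        return best;
    }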

1.2 Checklist

The following criteria have been chosen as being relevant for the evaluation of the multi-core and many-core frameworks. The criteria should be evaluated for each of the frameworks and, afterwards, a summary of the evaluation of all frameworks should be given.

• Time required to get familiar with a particular framework: This includes the time to understand how the framework works and how the framework can be utilized for parallelization, as well as experimenting with some small examples.

• Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added: This item describes the effort that has to be invested in order to map the reference code to the new framework. This corresponds mainly to the required time to rewrite parts of the source code, but also the effort to express sequential algorithms in parallel counterparts.

• Support of parallelization by the framework: To what extent does the framework support the programmer in parallelizing a program, for example, abstraction of available resources (cores), management of those resources, or support of auto-parallelization.

• Support of data partitioning by the framework: Data is often tiled and processed by different threads and cores, respectively. Does the framework support automatic partitioning of data, or does this have to be done by hand?

• Applicability for different algorithm classes and target platforms: Which types of parallel algorithms can be expressed? We consider in particular the support of task parallelism and data parallelism.

• Advantages, drawbacks, and difficulties of a particular framework: Here, the unique properties of the framework that distinguish it from other frameworks are listed, but also problems encountered while employing the framework.

• Effort to get a certain result, for example, performance or throughput: How much effort has to be spent in order to achieve a given result? For instance, this may be a speedup of 3x on a system with four cores.


• Scalability of the framework with respect to architecture family, new hardware: How well does the framework abstract from the hardware, so that programs can be executed on different hardware of the same architecture family?

• Scalability in terms of problem size: How does the execution time scale with the problem size? Does the framework scale well even for smaller problem sizes?

• Resource awareness of framework (run-time system): Typically a program tries to use as many resources as available. However, this does not mean that the resources are idle and can be occupied exclusively. Does the run-time system of the framework detect such situations and adapt its resources accordingly?

• Resource management for multiple sub-algorithms (image pipeline): Pipelining is a common concept in image processing, where the output of one algorithm is directly passed to the next algorithm. Does the framework provide support for this type of processing? Are the resources for this type of processing treated in a special way?

• Support of streaming by the framework: In many domains a continuous stream of data has to be processed, for example a sequence of images in image processing. Does the framework provide support to handle such a streaming concept, for example, to prepare and load the next image while the current image is processed?

1.3 Profiling

For profiling, tools like gprof, cachegrind, callgrind, and likwid (http://code.google.com/p/likwid) are used to evaluate the computationally intensive parts of the 2D/3D image registration as well as to identify the expected behavior when changing the resolution of the volume and the image. Cachegrind and callgrind simulate the instruction and data accesses of a given program compiled with debugging support. They model level one and level two caches and give detailed information like cache misses for instructions. This information can be annotated to the source code for further evaluation. Similar data can be obtained with likwid, which reads and interprets the corresponding hardware counters. In contrast, gprof profiles an application at run time by sampling and gives a detailed breakdown of the time spent in the different parts of an application. The output of gprof for a volume of 256 × 256 × 189 voxels and an image resolution of 320 × 240 pixels is as follows:



    Each sample counts as 0.01 seconds.
      %   cumulative     self                self     total
     time    seconds   seconds       calls  s/call   s/call  name
    72.02      44.16     44.16  2845058374    0.00     0.00  getNN
    21.08      57.09     12.92         161    0.08     0.38  get_drr
     3.08      58.98      1.89    11978400    0.00     0.00  intersect
     2.61      60.58      1.60    23945362    0.00     0.00  norm
     0.38      60.81      0.23           1    0.23     0.23  load_volume
     0.31      61.00      0.19         160    0.00     0.00  gc
     0.29      61.18      0.18    11972434    0.00     0.00  m_mul_v
     0.13      61.26      0.08           1    0.08     0.08  reduce_volume
     0.05      61.29      0.03                                main
     0.03      61.31      0.02         160    0.00     0.00  mi
     0.02      61.32      0.01         644    0.00     0.00  m_init
     0.02      61.33      0.01         160    0.00     0.00  quality
     0.02      61.34      0.01           1    0.01     0.01  load_image
     0.00      61.34      0.00         483    0.00     0.00  m_mul_m
     0.00      61.34      0.00         161    0.00     0.00  applyTransform
     0.00      61.34      0.00         161    0.00     0.00  mReg
     0.00      61.34      0.00          31    0.00     0.00  vec_div
     0.00      61.34      0.00           3    0.00     0.00  reduce_image
     0.00      61.34      0.00           2    0.00     0.00  write_image
     0.00      61.34      0.00           1    0.00    22.73  hill_climbing
     0.00      61.34      0.00           1    0.00    37.88  local_search

It can be seen that most of the time is spent in the functions that generate the digitally reconstructed radiograph (getNN, get_drr, intersect, and norm), while the time for the quality measures is insignificant (gc, mi; ssd is not even listed). The most time-intensive function (getNN) accesses the volume with a pattern that does not utilize the cache.

For the representation of the image and the volume, we use single-precision floating point throughout the program; only the calculation of the transformation according to the vector x uses double-precision floating point. The accuracy of floats is sufficient for the generation of the radiograph and the quality measures and allows a fair comparison of the results on the different architectures, since most graphics cards support only single-precision floating point. Even the values of the volume are stored as single-precision floating point numbers, although the more compact short integer representation would be sufficient. This saves value conversions, but requires more memory for storage and more memory bandwidth.

For the evaluation of the scalability in terms of problem size, the three volume and image resolutions listed in Table 1.1 are considered. When moving from one resolution to the next, the computational intensity of the quality measures increases by a factor of 4, since the number of pixels in the image increases by this factor. However, for the generation of the radiograph, an additional factor of 2 has to be considered, since the number of voxels passed by each ray cast through the volume increases by this factor. Thus, the execution time should increase roughly by a factor of 8 when the resolution of the volume and image doubles. Using likwid to count the instructions when moving to a different resolution, the number of instructions increases by a factor of 7.3 (7.6) when moving from the small volume to the middle volume (from the middle volume to the large volume). On the main system used for benchmarking, the execution time increases by a factor of 8.6 (8.5) when moving from the small volume to the middle volume (from the middle volume to the large volume). The execution time should also increase roughly by the same factor when using one of the frameworks.
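Written out for the step from the middle to the large resolution (cf. Table 1.1), the expected increase is the product of the growth in image pixels and in voxels traversed per ray:

\[
\underbrace{\frac{640 \times 480}{320 \times 240}}_{\text{pixels: } 4\times} \cdot \underbrace{\frac{378}{189}}_{\text{voxels per ray: } 2\times} = 8,
\]

which is close to the measured factors of 7.6 (instructions) and 8.5 (execution time) for this step.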

Table 1.1: Different volume and image sizes considered for evaluation.

Volume resolution    Image resolution    Volume size (MB)
128 × 128 × 94       160 × 120             6
256 × 256 × 189      320 × 240            49
512 × 512 × 378      640 × 480           396
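The volume sizes in Table 1.1 are consistent with the single-precision storage (4 bytes per voxel) described above, for example:

\[
256 \times 256 \times 189 \times 4\,\text{B} \approx 49.5\,\text{MB}, \qquad
512 \times 512 \times 378 \times 4\,\text{B} \approx 396\,\text{MB}.
\]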

1.4 Parallelization Approaches

For parallelization on the software side, there exist mainly two prominent strategies, namely fine-grained data-parallel and coarse-grained task-parallel parallelization. Both approaches are applicable to the 2D/3D image registration and are described in detail here.

Fine-grained Parallelization describes parallel processing that is focused on data elements, that is, typically one thread operates per data element. For the 2D/3D image registration, this means that one thread per pixel is used to generate the radiograph in parallel or to calculate the quality measures in parallel. The threads are lightweight and do only little work before they finish, compared to coarse-grained parallelization. More precisely, each iteration of a loop processing an image is executed by one thread. This type of parallelization is typically used on massively parallel architectures like graphics cards.

Coarse-grained Parallelization describes the parallel execution of different tasks. Compared to fine-grained data parallelism, each thread does not execute only a few operations on single data elements, but typically performs an operation on the whole data set. For the 2D/3D image registration, this means that one thread performs the evaluation of one parameter vector x. Different parameter vectors are evaluated in parallel by different threads. This type of parallelization is typically used on standard multi-core processors.

These two parallelization approaches are used to evaluate the different frameworks in Chapter 2 and Chapter 3. While for most frameworks for standard shared memory multi-core architectures both approaches are applicable, for many-core graphics cards only the fine-grained approach can be used.


In order to evaluate the overhead of the two parallelization approaches on standard multi-core processors, we estimate the sequential fraction of the parallelized part of the 2D/3D image registration using curve fitting. For this, we use the formula of Amdahl's law [Amd67] given in Equation (1.1). The sequential fraction of the application is denoted by α and the number of processors by n. Using the measured speedup S(n), α can be estimated by curve fitting.

\[
S(n) = \frac{1}{\alpha + \frac{1-\alpha}{n}} \tag{1.1}
\]
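As an illustration of how α can be obtained, the following sketch fits Equation (1.1) to measured speedups with a simple least-squares grid search. The grid search is an illustrative substitute for the curve-fitting tool actually used, and the speedup values in main() are made up.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Amdahl's law: predicted speedup on n cores for a sequential fraction alpha.
    static double amdahl(double alpha, int n) {
        return 1.0 / (alpha + (1.0 - alpha) / n);
    }

    // Least-squares grid search for alpha; speedup[i] is the measured speedup on i+1 cores.
    static double fit_alpha(const std::vector<double>& speedup) {
        double best_alpha = 0.0, best_err = 1e300;
        for (double a = 0.0; a <= 1.0; a += 1e-4) {
            double err = 0.0;
            for (std::size_t i = 0; i < speedup.size(); ++i) {
                double d = speedup[i] - amdahl(a, static_cast<int>(i) + 1);
                err += d * d;
            }
            if (err < best_err) { best_err = err; best_alpha = a; }
        }
        return best_alpha;
    }

    int main() {
        std::vector<double> s = {1.00, 1.92, 2.78, 3.55};   // made-up measurements for 1..4 cores
        std::printf("fitted alpha: %.2f %%\n", 100.0 * fit_alpha(s));
        return 0;
    }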

Histogram Generation is part of one of the similarity measures, namely mutual information, and has been identified as one of the most challenging algorithms for parallelization. While the sequential implementation on a single core, as shown in Listing 1.1, is straightforward, the parallelization of this code raises challenges for the programmer. Firstly, the access pattern to the histogram is not regular, but depends on the value of the current pixel. Secondly, updating the same bin from different threads in parallel may lead to race conditions. Therefore, the histogram generation will be investigated in detail for each parallelization framework.

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            hist[img[y*width + x]]++;
        }
    }

Listing 1.1: Sequential histogram generation.


2 Multi-Core Frameworks

In this chapter, frameworks for programming standard multi-core systems found in today's desktop computers are evaluated. The frameworks we consider are to a large extent those that are relevant to and used in industry. For the evaluation, a system consisting of four Intel Xeon Dunnington CPUs is used, each hosting six cores running at 2.67 GHz. This allows up to 24 cores to be used. Only for OpenCL, a different system had to be chosen due to an incompatibility of the OpenCL library with the glibc version on the Dunnington system. Therefore, we use a system consisting of two Intel Nehalem-EP CPUs, each hosting four cores running at 2.66 GHz. These CPUs support hyperthreading, that is, simultaneous multithreading, the ability to run two threads per core, and, hence, up to 16 threads in total. Table 2.1 lists the compiler and framework versions used for each of the investigated frameworks.

Table 2.1: Compiler version and framework version used for evaluation.

Framework    Version
OpenMP       gcc/4.4.2, OpenMP 3.0
Cilk++       gcc/4.2.4, Cilk++ 1.10
TBB          icc/11.1, TBB 3.0
RapidMind    gcc/4.4.2, RapidMind 4.0.1
OpenCL       gcc/4.4.2, OpenCL 1.0, Stream 2.0.1/2.1

2.1 OpenMP

Framework

Open Multi-Processing (OpenMP) is a standard that defines an application programming interface (API) to specify shared memory parallelism in C, C++, and Fortran programs [Ope09]. The OpenMP specification [Ope08] is implemented by all major compilers like Microsoft's compiler for Visual C++ (MSVC), the GNU Compiler Collection (gcc), or Intel's C++ compiler (icc). OpenMP provides preprocessor directives, so-called pragmas, to express parallelism. These pragmas specify which parts of the code should be executed in parallel and how data should be shared. The basic idea behind OpenMP is a fork-join model, where one master thread executes throughout the whole program and forks off threads to process parts of the program in parallel. OpenMP provides different work-sharing constructs which allow two types of parallelism to be expressed, namely task parallelism and data parallelism. The main source of data parallelism are loop programs iterating over a data set, performing the same operation on each element. This can be expressed in OpenMP using a parallel for loop:

    #pragma omp parallel for
    // parallel loop

For task parallelism, OpenMP 3.0 introduced the task concept to create independent tasks, which are processed by different threads:

    #pragma omp task
    // call to first function executed as task

    #pragma omp task
    // call to second function executed as task

Only these work-sharing constructs need to be specified in order to execute the annotated code fragments in parallel. The actual parallelization is done by the compiler, mapping loop iterations or tasks to the underlying threading concept like POSIX threads (Pthreads) on GNU Linux. Executing parallel code on shared memory machines may lead to race conditions when multiple threads access the same data. Therefore, OpenMP provides clauses to specify how data is shared between threads; for example, data can be shared or private to a thread. In addition, it is possible to synchronize access to variables, for instance using the critical directive. OpenMP also provides a library with user-level functions to obtain information about OpenMP-related system and program properties. Some of these properties, like the number of threads to be used, can be adjusted as well; alternatively, environment variables can be used to adjust these properties.

Listing 2.1 shows how histogram generation can be parallelized using OpenMP. The pragma specifies that the outer loop should be executed in parallel. The width and height parameters are also private, but initialized with the values the variables hold before entering the loop. The image and the histogram are shared between the threads, while the loop iterators are private to each thread. Since multiple threads may update the same bin of the histogram, the access to the bins has to be synchronized. This is done using the omp critical pragma. However, every time a bin counter is increased, a critical section is entered and only one thread can increase bin counters at a time. This reduces the parallelism and involves some overhead to synchronize the access to the bin counters. Therefore, we use a second alternative, shown in Listing 2.2, where each thread creates and works on a temporary histogram. These thread-local histograms are eventually merged into the final histogram using a critical section. This approach has less synchronization overhead and is faster than the sequential implementation, while the implementation from Listing 2.1 takes longer than the sequential implementation.


    #pragma omp parallel for default(none) shared(img, hist) private(x, y) firstprivate(height, width)
    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            #pragma omp critical
            hist[img[y*width + x]]++;
        }
    }

Listing 2.1: Histogram generation in OpenMP using a critical section.

    int hist_tmp[MAX_INTENSITY];
    memset(hist_tmp, 0x0, MAX_INTENSITY*sizeof(int));

    #pragma omp parallel default(none) shared(img, hist) private(x, y) firstprivate(height, width, hist_tmp)
    {
        #pragma omp for
        for (y = 0; y < height; y++) {
            for (x = 0; x < width; x++) {
                hist_tmp[img[y*width + x]]++;
            }
        }
        // merge the thread-local histogram into the final histogram
        #pragma omp critical
        for (int i = 0; i < MAX_INTENSITY; i++) {
            hist[i] += hist_tmp[i];
        }
    }

Listing 2.2: Histogram generation in OpenMP using thread-local histograms merged in a critical section.
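While Listings 2.1 and 2.2 show the fine-grained, data-parallel variant, the coarse-grained approach maps naturally onto the OpenMP 3.0 task construct: every candidate parameter vector of the local search is evaluated as an independent task. The following is only a sketch; Params and evaluate_candidate() are hypothetical placeholders for the registration code.

    #include <cstddef>
    #include <vector>

    struct Params { double t[3], r[3]; };     // translation and rotation parameters

    // Placeholder: generates the DRR for p and returns the weighted similarity.
    double evaluate_candidate(const Params& p) { (void)p; return 0.0; }

    void evaluate_all(const std::vector<Params>& candidates, std::vector<double>& quality) {
        quality.resize(candidates.size());
        #pragma omp parallel
        {
            #pragma omp single
            for (std::size_t i = 0; i < candidates.size(); ++i) {
                // candidates and quality are shared by default; i becomes firstprivate
                #pragma omp task firstprivate(i)
                quality[i] = evaluate_candidate(candidates[i]);
            }
        }   // implicit barrier: all tasks have finished here
    }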


Evaluation

Time required to get familiar with a particular framework:
OpenMP requires little time to get familiar with. After one day, the main features and the concept behind OpenMP are thoroughly understood and the application can be parallelized using OpenMP.

Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added:
Existing code can be parallelized with almost no changes to the original code; only pragmas have to be inserted to indicate to the compiler which parts of the program are to be parallelized. A few days are required for the whole application. Only the optimized histogram generation took a serious amount of time to come up with a proper solution and realize it using OpenMP. For the coarse-grained approach, some restructuring of the code is necessary.

Support of parallelization by the framework:
The parallelization is completely done by the framework; only the code fragments to be parallelized have to be annotated. In addition, hints for data sharing and synchronization are required.

Support of data partitioning by the framework:
The main data partitioning is done by the user, that is, the loops iterating over the data set are not altered. The assignment of loop iterations to threads, however, can be specified with the schedule clause. For example, schedule(static, 1) tells the compiler to use static scheduling and to always assign chunks of one iteration in a round-robin fashion to the threads. Alternatively, dynamic scheduling can be used, where threads request new chunks after they have finished processing their current chunk (a short example is sketched below).

Applicability for different algorithm classes and target platforms:
OpenMP supports fine-grained data parallelism as well as coarse-grained task parallelism. Support for real task parallelism was added in OpenMP 3.0.

Advantages, drawbacks, and difficulties of a particular framework:
The biggest advantage of OpenMP is that it is relatively easy to parallelize the code. No new syntax has to be learned; only hints for the compiler have to be given. One major drawback of OpenMP is that no support is provided to detect errors specific to parallel execution, like race conditions. Initially, static memory management was used for the coarse-grained approach to assign each thread its memory from a preallocated memory pool. However, in some cases the results differed from our reference implementation. The thread ID was mapped to a certain memory region that a particular thread could work on. Sometimes the iterations scheduled to the cores did not take the same amount of time; this way, a newly assigned memory region was still in use by a different thread when the thread ID changed. Presumably the thread ID assigned for the next iteration was still in use by a different thread in the current iteration. Using a barrier synchronization solved this problem. Since the static memory management has no advantages over dynamic memory management, it was not considered further.
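As a small illustration of the schedule clause mentioned under the data-partitioning item above (process_row() and height are generic placeholders, not part of the registration code):

    void process_row(int y);                       // placeholder for the per-row work

    void schedule_examples(int height) {
        // static scheduling, chunk size 1: iterations are assigned round-robin to the threads
        #pragma omp parallel for schedule(static, 1)
        for (int y = 0; y < height; ++y)
            process_row(y);

        // dynamic scheduling: a thread requests the next chunk of 4 iterations on demand
        #pragma omp parallel for schedule(dynamic, 4)
        for (int y = 0; y < height; ++y)
            process_row(y);
    }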


Effort to get a certain result, for example, performance or throughput:
The effort to get a reasonable speedup is relatively small, since only minor changes to the code have to be made.

Scalability of the framework with respect to architecture family, new hardware:
Figures 2.1, 2.2, and 2.3 show the execution times of the fine-grained and coarse-grained OpenMP implementations on up to 24 cores for different volume and image resolutions. The exact times are additionally listed in Table 2.2 for the fine-grained implementation and in Table 2.3 for the coarse-grained implementation. The graphs show that the fine-grained implementation requires five cores to compensate the parallelization overhead and catch up with the reference implementation. In contrast, the coarse-grained implementation runs as fast as the baseline implementation already on one core. The saturation point, where adding further cores yields no improvement, is also reached earlier for the coarse-grained implementation. While saturation is reached at about 3.5 s for the coarse-grained implementation, only about 5.6 s can be obtained using the fine-grained implementation for the middle resolution.
Plotting the speedup of both implementations for the different volume and image resolutions in Figures 2.4, 2.5, and 2.6 shows that the speedups obtained by the fine-grained implementation are far below those of the coarse-grained implementation, although they improve with the problem size. Using Amdahl's law to determine the sequential part of each implementation results in an alpha of roughly 2 % (2.55, 1.70, and 1.94) for the coarse-grained implementation, while the alpha for the fine-grained implementation is much higher and fluctuates more (24.55, 10.14, and 7.29). This shows that the parallelization overhead of OpenMP is much higher for the fine-grained implementation and can only be compensated by larger data sets.


Table 2.2: Execution times in seconds using the fine-grained OpenMP implementation on up to 24 cores for different volume resolutions.

# cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
 1        33.98            269.08            2,158.20
 2        16.64            130.28            1,040.89
 3        10.74             82.23              662.75
 4         7.77             59.54              475.15
 5         5.98             44.59              358.76
 6         4.83             35.77              282.91
 7         4.26             29.43              231.31
 8         3.39             23.78              186.58
 9         3.11             20.23              160.94
10         2.58             17.04              135.79
11         2.38             15.20              117.38
12         2.11             13.06              103.09
13         2.15             12.06               90.52
14         1.95             11.32               83.50
15         1.72              9.72               73.27
16         1.76              8.77               65.80
17         1.81              8.96               62.35
18         1.61              8.31               56.93
19         1.63              7.53               55.01
20         1.50              6.87               49.59
21         1.52              6.79               46.66
22         1.51              6.24               44.46
23         1.55              6.31               42.23
24         1.40              5.63               39.68


Table 2.3: Execution times in seconds using the coarse-grained OpenMP implementation on up to 24 cores for different volume resolutions.

# cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
 1         6.43             55.56              472.89
 2         3.23             27.79              237.50
 3         2.19             18.81              160.35
 4         1.63             13.97              119.03
 5         1.43             12.21              104.36
 6         1.11              9.42               80.91
 7         1.03              8.72               74.88
 8         0.95              8.02               69.23
 9         0.91              7.71               66.30
10         0.84              7.03               60.26
11         1.14              7.02               60.31
12         0.60              4.93               63.34
13         0.57              4.54               40.53
14         0.56              4.59               40.12
15         0.52              4.22               36.62
16         0.72              4.23               36.47
17         0.49              3.83               33.58
18         0.49              3.82               33.61
19         0.49              3.89               34.15
20         0.45              3.52               30.73
21         0.46              3.53               30.72
22         0.46              3.56               30.68
23         0.46              3.56               30.46
24         0.46              3.53               31.20


Figure 2.1: Execution times of fine-grained and coarse-grained OpenMP parallelizations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.2: Execution times of fine-grained and coarse-grained OpenMP parallelizations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.3: Execution times of fine-grained and coarse-grained OpenMP parallelizations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.4: Speedup of the fine-grained and coarse-grained OpenMP implementations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation (fitted alpha: 2.55 % coarse-grained, 24.55 % fine-grained). [plot: speedup vs. cores]


Figure 2.5: Speedup of the fine-grained and coarse-grained OpenMP implementations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation (fitted alpha: 1.70 % coarse-grained, 10.14 % fine-grained). [plot: speedup vs. cores]


Figure 2.6: Speedup of the fine-grained and coarse-grained OpenMP implementations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation (fitted alpha: 1.94 % coarse-grained, 7.29 % fine-grained). [plot: speedup vs. cores]


Scalability in terms of problem size:
As seen in Figures 2.4, 2.5, and 2.6, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in OpenMP. For the small volume, 24.55 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 7.29 % for the large volume. For the coarse-grained implementation, only insignificant differences can be seen. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. While this is the case for the coarse-grained implementation, as seen in Figure 2.8, the ratio drops with an increasing number of cores for the fine-grained implementation (see Figure 2.7). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of OpenMP is compensated by more cores.
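Using the fine-grained times from Table 2.2 for the 128 → 256 step as a concrete example:

\[
\frac{269.08\,\mathrm{s}}{33.98\,\mathrm{s}} \approx 7.9 \;\text{on one core}, \qquad
\frac{5.63\,\mathrm{s}}{1.40\,\mathrm{s}} \approx 4.0 \;\text{on 24 cores},
\]

that is, the ratio starts close to the expected factor of 8 and drops as more cores absorb the parallelization overhead.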

Figure 2.7: Scalability of the fine-grained OpenMP parallelization: shown is the execution time ratio when moving to a different volume resolution (128 → 256 and 256 → 512). [plot: execution time ratio vs. cores]

Resource awareness of framework (run-time system):
OpenMP is, to our knowledge, not resource aware.

Resource management for multiple sub-algorithms (image pipeline):
OpenMP does not provide support for resource management for pipelining.


Figure 2.8: Scalability of the coarse-grained OpenMP parallelization: shown is the execution time ratio when moving to a different volume resolution (128 → 256 and 256 → 512). [plot: execution time ratio vs. cores]

Support of streaming by the framework: OpenMP does not provide streaming support for series of images.

2.2 Cilk++

Framework

Cilk++ [Lei09] is a commercial version of the Cilk [BJK+95] language developed at MIT for multithreaded parallel programming. Cilk++ was recently acquired by Intel and has since been available from them. It is an extension to the C++ language adding three basic keywords to process loops in parallel, launch new tasks, and synchronize between tasks. These keywords allow task as well as data parallelism to be expressed. In addition, Cilk++ provides hyperobjects, constructs that solve data race problems created by parallel access to global variables without locks. For code generation, Cilk++ provides two compilers, one based on MSVC for Windows platforms and one based on gcc for GNU Linux. Cilk++ also provides tools to detect race conditions and to estimate the achievable speedup and inherent parallelism of cilkified programs. Its run-time system implements an efficient work-stealing algorithm that distributes the workload to idle processors.


    cilk::mutex c_mutex;

    cilk_for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            c_mutex.lock();
            hist[img[y*width + x]]++;
            c_mutex.unlock();
        }
    }

Listing 2.3: Histogram generation in Cilk++ using a mutex.

The three basic keywords are cilk_for to execute loops in parallel, cilk_spawn to launch new tasks, and cilk_sync to synchronize between different tasks previously launched. To cope with race conditions, Cilk++ also provides low-level synchronization directives like mutexes in addition to hyperobjects. Calculating a histogram in parallel only requires the programmer to replace the for keyword by the Cilk++ cilk_for keyword and to synchronize the access to the histogram bins using a mutex, as shown in Listing 2.3.

To avoid the synchronization overhead on each bin counter update, we again try to use thread-local histograms. However, compared to OpenMP, there is no parallel closure enclosing the parallelized loop that could be used to create a thread-local histogram. Instead, thread-local variables have to be declared inside the cilk_for loop and would be recreated for each iteration. Therefore, we add an additional outermost loop with an iteration count equal to the number of threads available. The iteration count of the loop iterating over the image is adjusted accordingly, as shown in Listing 2.4. After the thread-local histograms have been created, they are merged into the final histogram, synchronizing the access to the bin counters. This way, the synchronization overhead is minimized and a significant speedup is achieved compared to the implementation of Listing 2.3.

Furthermore, in some cases the calculations produced wrong results. Declaring only those functions as Cilk++ functions which use Cilk++ keywords, and all other functions explicitly as C functions, solved these problems. To detect race conditions, Cilk++ provides the Cilkscreen Race Detector. This tool runs the parallel application on one core and reports any location in the program that may result in race conditions, that is, writes to the same memory location. To estimate the achievable speedup of an application, the Cilkscreen Parallel Performance Analyzer counts the instructions of the executed application and gives information about the potential inherent parallelism of the application (see below).


    cilk::mutex c_mutex;

    int num_workers = cilk::current_worker_count();
    int nheight = (int) ceil((float)(height)/(float)num_workers);

    #pragma cilk_grainsize = 1
    cilk_for (int k = 0; k < num_workers; ++k) {
        int hist_tmp[MAX_INTENSITY] = { 0 };              // thread-local histogram
        for (int y = k*nheight; y < (k+1)*nheight; ++y) {
            if (y >= height) break;
            for (int x = 0; x < width; ++x) {
                hist_tmp[img[y*width + x]]++;
            }
        }
        // merge the local histogram into the final histogram
        c_mutex.lock();
        for (int i = 0; i < MAX_INTENSITY; ++i) {
            hist[i] += hist_tmp[i];
        }
        c_mutex.unlock();
    }

Listing 2.4: Histogram generation in Cilk++ using thread-local histograms.
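Listings 2.3 and 2.4 only use cilk_for; the coarse-grained approach builds on the task keywords instead, spawning one evaluation per candidate parameter vector. The following is only a sketch: Params and evaluate_candidate() are hypothetical placeholders, and the file would have to be compiled as a Cilk++ source.

    #include <cstddef>
    #include <vector>

    struct Params { double t[3], r[3]; };     // translation and rotation parameters

    // Placeholder: generates the DRR for p and returns the weighted similarity.
    double evaluate_candidate(const Params& p) { (void)p; return 0.0; }

    void evaluate_all(const std::vector<Params>& candidates, std::vector<double>& quality)
    {
        quality.resize(candidates.size());
        for (std::size_t i = 0; i < candidates.size(); ++i)
            quality[i] = cilk_spawn evaluate_candidate(candidates[i]);   // one task per candidate
        cilk_sync;                                                       // wait for all evaluations
    }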


Evaluation

Time required to get familiar with a particular framework:
Cilk++ requires little time to get familiar with. After one day, the main features and the concept behind Cilk++ are thoroughly understood and the application can be parallelized using Cilk++.

Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added:
Existing code can be parallelized with almost no changes to the original code; only the three basic keywords have to be inserted to indicate to the compiler which parts of the program are to be parallelized. However, when porting an application to Cilk++, the parallelized functions use a different (internal) calling convention and have to be declared as extern "Cilk++"; hence, all cilkified functions and their calling functions have to be declared that way. Only a few days are required to parallelize the whole application. Only the optimized histogram generation took a serious amount of time to come up with a proper solution and realize it using Cilk++. For the coarse-grained approach, some restructuring of the code is necessary.

Support of parallelization by the framework:
The parallelization is completely done by the framework; only the three basic keywords have to be used to express the parallelism.

Support of data partitioning by the framework:
The main data partitioning is done by the user, that is, the loops iterating over the data set are not altered. The assignment of loop iterations to threads, however, can be specified with the cilk_grainsize pragma, assigning n consecutive loop iterations to the same thread.

Applicability for different algorithm classes and target platforms:
Cilk++ supports fine-grained data parallelism as well as coarse-grained task parallelism.

Advantages, drawbacks, and difficulties of a particular framework:
Similarly to OpenMP, it is relatively easy to parallelize existing code; only minor changes to the source code are required. To convert an existing application to a cilkified version, the Cilk++ Programmer's Guide suggests to "rename the Cilk++ source files, replacing the .cpp extension with .cilk" before compiling the program. However, this introduced additional overhead, and the execution took about 30 % longer compared to the OpenMP implementation. Cilkifying only the parallelized files and functions removes this overhead again. The biggest advantages of Cilk++ are the additional tools it provides. To avoid race conditions, the compiler emits warnings where race conditions may arise (which need not actually be the case):

    histogram.cilk:101: warning: writes to 'hist' in loop body may race

Race conditions encountered during the execution of a program are reported by the Cilkscreen Race Detector. To this end, the program is executed by only one thread, and reads/writes to each memory location are analyzed for real race conditions:


    Race condition on location 0x6b9bd8
      write access at 0x40953d: (histogram.cilk:244, __cilk_loop_d_004+0xd5)
      read  access at 0x409520: (histogram.cilk:244, __cilk_loop_d_004+0xb8)
        called by 0x7f6acc5a3379: (_Z15cilkscreen_loopIPFvPvmmEmEQbT_S0_T0_+0xa9)
        called by 0x7f6acc5a390e: (_ZN4cilk13cilk_for_loopEQvPFvPvmmES0_mm+0x15e)
        called by 0x40bb68: (main.cilk:144, _Z9cilk_mainQiiPPc+0xbd4)
        called by 0x40d2e3: (_ZN4cilk9main_wrapEQiPv+0x53)
        called by 0x7f6acc5a25d1: (_Z20cilk_run_wrapper_intQvPv+0x51)
        called by 0x7f6acc5a470a: (__cilkrts_init_helper+0x5)

The second tool, the Cilkscreen Scalability Analyzer, provides a profile of the parallelism and parallelization overhead of an executed program, including speedup estimates for different core counts. For example, the best neighbor search with up to 12 neighbors investigated in parallel gives the output below. The work reported by the tool is the total work to be executed by all threads, and the span describes the work to be executed on the critical path within the application. Using these numbers, the maximal inherent parallelism of the current implementation can be evaluated. For a high span, the parallelization approach may have to be reconsidered.

    1) Parallelism Profile
       Work:                                   15196207411 instructions
       Span:                                   1266809092 instructions
       Burdened span:                          1266809092 instructions
       Parallelism:                            12.00
       Burdened parallelism:                   12.00
       Number of spawns/syncs:                 11
       Average instructions / strand:          446947276
       Strands along span:                     3
       Average instructions / strand on span:  422269697
       Total number of atomic instructions:    0
       Frame count:                            25
    2) Speedup Estimate
        2 procs:  1.75 - 2.00
        4 procs:  2.81 - 4.00
        8 procs:  4.02 - 8.00
       16 procs:  5.12 - 12.00
       32 procs:  5.93 - 12.00
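The reported parallelism is simply the ratio of total work to span, which is consistent with the twelve neighbors investigated in parallel:

\[
\text{parallelism} = \frac{\text{work}}{\text{span}} = \frac{15\,196\,207\,411}{1\,266\,809\,092} \approx 12.0 .
\]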

Effort to get a certain result, for example, performance or throughput:
The effort to get a reasonable speedup is relatively small, since only minor changes to the code have to be made.

Scalability of the framework with respect to architecture family, new hardware:
Figures 2.9, 2.10, and 2.11 show the execution times of the fine-grained and coarse-grained Cilk++ implementations on up to 24 cores for different volume and image resolutions. The exact times are additionally listed in Table 2.4 for the fine-grained implementation and in Table 2.5 for the coarse-grained implementation. The graphs show that the fine-grained implementation requires two cores to compensate the parallelization overhead and catch up with the reference implementation. In contrast, the coarse-grained implementation runs as fast as the baseline implementation already on one core. The saturation point, where adding further cores yields no improvement, is reached at roughly the same core count for both implementations. While saturation is reached at about 3.6 s for the coarse-grained implementation, only about 8.7 s can be obtained using the fine-grained implementation for the middle resolution.

Plotting the speedup of both implementations for the different volume and image resolutions in Figures 2.12, 2.13, and 2.14 shows that the speedups obtained by the fine-grained implementation are far below those of the coarse-grained implementation, although they improve with the problem size. Using Amdahl's law to determine the sequential part of each implementation results in an alpha of roughly 2 % (2.12, 1.87, and 2.32) for the coarse-grained implementation, while the alpha for the fine-grained implementation is much higher and fluctuates more (30.97, 13.68, and 8.79). This shows that the parallelization overhead of Cilk++ is much higher for the fine-grained implementation and can only be compensated by larger data sets.

Figure 2.9: Execution times of fine-grained and coarse-grained Cilk++ parallelizations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Table 2.4: Execution times in seconds using the fine-grained Cilk++ implementation on up to 24 cores for different volume resolutions.

# cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
 1         7.87             89.26              538.22
 2         4.74             42.79              295.20
 3         6.48             53.01              438.89
 4         4.85             31.50              300.80
 5         3.33             31.88              189.28
 6         2.96             18.04              137.86
 7         2.99             16.42              130.14
 8         2.67             21.09              143.92
 9         2.58             16.67              110.09
10         2.41             15.25              112.32
11         2.40             11.27               92.26
12         2.19             12.59              100.78
13         2.45             12.07               72.37
14         2.40             12.18               73.38
15         2.26             11.16               75.01
16         2.10             10.76               79.82
17         2.10              9.77               73.37
18         2.15             10.05               69.77
19         2.14             10.17               65.14
20         2.08              9.36               68.12
21         2.14              8.60               51.84
22         2.19              8.69               54.83
23         2.22              8.69               50.28
24         2.21              8.87               47.34


Table 2.5: Execution times in seconds using the coarse-grained Cilk++ implementation on up to 24 cores for different volume resolutions.

# cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
 1         6.67             57.01              487.14
 2         3.35             33.18              244.52
 3         2.31             19.67              167.74
 4         1.69             15.71              122.39
 5         1.51             12.62              108.02
 6         1.18             10.05               85.64
 7         1.06              8.89               79.09
 8         0.97              8.44               69.88
 9         0.92              7.75               66.30
10         0.88              7.30               62.76
11         0.84              7.01               59.46
12         0.60              4.99               43.60
13         0.56              5.84               50.58
14         0.55              4.61               40.58
15         0.52              4.31               37.76
16         0.51              4.24               38.34
17         0.47              3.93               36.42
18         0.47              3.99               35.65
19         0.47              3.90               34.66
20         0.47              3.61               31.53
21         0.43              3.55               32.23
22         0.43              3.58               33.10
23         0.43              3.60               35.12
24         0.43              3.58               32.37


Figure 2.10: Execution times of fine-grained and coarse-grained Cilk++ parallelizations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.11: Execution times of fine-grained and coarse-grained Cilk++ parallelizations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation. [plot: time (s) vs. cores]


Figure 2.12: Speedup of the fine-grained and coarse-grained Cilk++ implementations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation (fitted alpha: 2.12 % coarse-grained, 30.97 % fine-grained). [plot: speedup vs. cores]


Figure 2.13: Speedup of the fine-grained and coarse-grained Cilk++ implementations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation (fitted alpha: 1.87 % coarse-grained, 13.68 % fine-grained). [plot: speedup vs. cores]


Figure 2.14: Speedup of the fine-grained and coarse-grained Cilk++ implementations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation (fitted alpha: 2.32 % coarse-grained, 8.79 % fine-grained). [plot: speedup vs. cores]


Scalability in terms of problem size:
As seen in Figures 2.12, 2.13, and 2.14, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in Cilk++. For the small volume, 30.97 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 8.79 % for the large volume. For the coarse-grained implementation, only insignificant differences can be seen. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. While this is the case for the coarse-grained implementation, as seen in Figure 2.16, the ratio drops with an increasing number of cores for the fine-grained implementation (see Figure 2.15). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of Cilk++ is compensated by more cores.

Figure 2.15: Scalability of the fine-grained Cilk++ parallelization: shown is the execution time ratio when moving to a different volume resolution (128 → 256 and 256 → 512). [plot: execution time ratio vs. cores]

Resource awareness of framework (run-time system):
Cilk++ is, to our knowledge, not resource aware.

Resource management for multiple sub-algorithms (image pipeline):
Cilk++ does not provide support for resource management for pipelining.


Figure 2.16: Scalability of the coarse-grained Cilk++ parallelization: shown is the execution time ratio when moving to a different volume resolution (128 → 256 and 256 → 512). [plot: execution time ratio vs. cores]

Support of streaming by the framework: Cilk++ does not provide streaming support for series of images.

2.3 Threading Building Blocks

Framework

Threading Building Blocks (TBB) [Rei07] is a template library for C++ developed by Intel to parallelize programs; it is available as an open source version as well as a commercial version providing further support. Parallel code is encapsulated in special classes and invoked from the program. TBB allows task and data parallelism to be expressed. All major compilers can be used to generate binaries from TBB programs. Instead of encapsulating the parallel code in classes, lambda functions can be used to express parallelism in a more compact way. There are, however, only a few compilers supporting lambda functions of the upcoming C++0x standard. TBB also provides concurrent container classes for hash maps, vectors, or queues, as well as its own mutex and lock implementations. The run-time system of TBB schedules the tasks using a work-stealing algorithm similar to Cilk++.


    spin_mutex hist_mutex;

    parallel_for(int(0), int(height), int(1), [&](int y) {
        for (int x = 0; x < width; ++x) {
            spin_mutex::scoped_lock lock(hist_mutex);
            hist[img[y*width + x]]++;
        }
    });

Listing 2.5: Histogram generation in TBB using a lambda function and a mutex.

Parallel activities are initiated by special keywords like parallel_for, which executes an instance of a special class implementing the interface for a parallel for loop. For some keywords, like parallel_for, a lambda function can be used instead of a separate class. Calculating a histogram in parallel using a lambda function is shown in Listing 2.5. The first two parameters to the parallel_for function are the lower and upper iteration limits of the loop, followed by the loop counter increment. The last argument is the loop body to be parallelized, given as a lambda function (the outer loop itself is absent, since it is executed by TBB). To synchronize the access to the bin counters, a mutex is used.

For a more efficient implementation, thread-local memory is used to store temporary histogram results. To this end, the parallel_reduce keyword is used. For parallel_reduce, several functions have to be defined, and, hence, no lambda function can be used here. Besides the operator function for the operations executed in parallel, a split constructor and a join function are required. While the former is used to initialize data when additional threads are created, the latter is responsible for merging the results when threads finish execution. Listing 2.6 shows this class and how parallel_reduce is used to calculate the histogram.
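For the core-count measurements reported below, the number of TBB worker threads can be fixed explicitly via task_scheduler_init. The following minimal sketch uses the same compact parallel_for form as Listing 2.5; process_row() is a hypothetical placeholder for the per-row work.

    #include "tbb/parallel_for.h"
    #include "tbb/task_scheduler_init.h"

    void process_row(int y) { (void)y; /* placeholder */ }

    void run_with_threads(int nthreads, int height)
    {
        tbb::task_scheduler_init init(nthreads);    // limit the size of the worker thread pool
        tbb::parallel_for(0, height, 1, [](int y) {
            process_row(y);
        });
    }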

Evaluation

Time required to get familiar with a particular framework:
TBB requires little time to get familiar with. After one day, the main features and the concept behind TBB are thoroughly understood and the application can be parallelized using TBB.

Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added:
In order to parallelize existing code, major restructuring is required. The functionality has to be encapsulated in a separate class, which means that the code moves to a different place within the source file, or even to a different file. This separates the invocation and the implementation of parallel functionality. Almost all source code of the function to be parallelized had to be modified and separated into these two parts.


    class CalcHist {
        int *my_img;
        int my_width, my_height;
    public:
        int *my_hist;

        void operator()(const blocked_range<int>& r) {
            int width  = my_width;
            int height = my_height;
            int *img   = my_img;

            for (int y = r.begin(); y != r.end(); ++y) {
                for (int x = 0; x < width; ++x) {
                    my_hist[img[y*width + x]]++;
                }
            }
        }

        // split constructor: gives each additional thread its own temporary histogram
        CalcHist(CalcHist& other, split)
            : my_img(other.my_img), my_width(other.my_width), my_height(other.my_height),
              my_hist(new int[MAX_INTENSITY]()) {}

        // join: merge the temporary histogram of a finished range into this one
        void join(const CalcHist& other) {
            for (int i = 0; i < MAX_INTENSITY; ++i) my_hist[i] += other.my_hist[i];
        }

        CalcHist(int *img, int width, int height)
            : my_img(img), my_width(width), my_height(height),
              my_hist(new int[MAX_INTENSITY]()) {}
    };

    // ...
    CalcHist ch(img, width, height);
    parallel_reduce(blocked_range<int>(0, height), ch, auto_partitioner());
    hist = ch.my_hist;

Listing 2.6: Histogram generation in TBB using temporary thread-local histograms.

Accordingly, more time was required to parallelize the whole application using TBB compared to OpenMP and Cilk++; about twice the time was needed. Using parallel_reduce to calculate the histogram with thread-local memory was, however, more straightforward compared to the other frameworks.

Support of parallelization by the framework:
The parallelization is completely done by the framework; only the parallel functionality has to be encapsulated in a separate class.

Support of data partitioning by the framework:
The main data partitioning is done by the user, that is, the loops iterating over the data set are not altered. The assignment of loop iterations to threads, however, can be specified by the user. The parallel_for and parallel_reduce functions expect a partitioner as last argument. Instead of the auto_partitioner used in the examples, the simple_partitioner or the affinity_partitioner can also be used. The simple_partitioner can be used to specify a grainsize similar to Cilk++, and the affinity_partitioner tries to optimize the grainsize such that the cache is utilized best.

Applicability for different algorithm classes and target platforms:
TBB supports fine-grained data parallelism as well as coarse-grained task parallelism.

Advantages, drawbacks, and difficulties of a particular framework:
Compared to OpenMP and Cilk++, TBB is a library approach that integrates smoothly into existing object-oriented code. This comes, however, at the cost of additional source code. Encapsulating the parallel functionality in a class comes along with a lot of boilerplate code for initializing class-private data in the constructor and assigning these variables again to local variables in the operator of the class. For the parallel_for construct, this overhead can be reduced using lambda functions, which are, however, only supported by some compilers. One lead developer of TBB at Intel acknowledged that the syntax overhead of TBB is huge compared to other frameworks like Cilk++ and that the source code does not look neat, even with the new lambda syntax (A. Kukanov, personal communication).

Effort to get a certain result, for example, performance or throughput:
The effort to get a reasonable speedup is moderate. Although the code has to be encapsulated in a class, the way a function works is preserved.

Scalability of the framework with respect to architecture family, new hardware:
Figures 2.17, 2.18, and 2.19 show the execution times of the fine-grained and coarse-grained TBB implementations on up to 24 cores for different volume and image resolutions. The exact times are additionally listed in Table 2.6 for the fine-grained implementation and in Table 2.7 for the coarse-grained implementation. The graphs show that for both implementations the execution time using one core is only slightly slower compared to the baseline implementation. The saturation point, where adding further cores yields no improvement, is reached at roughly the same core count for both implementations. While saturation is reached at about 3.9 s for the coarse-grained implementation, only about 5.7 s can be obtained using the fine-grained implementation for the middle resolution.


Plotting the speedup of both implementations for the different volume and image resolutions in Figures 2.20, 2.21, and 2.22 shows that the speedups obtained by the fine-grained implementation are far below those of the coarse-grained implementation, although they improve with the problem size. Using Amdahl's law to determine the sequential part of each implementation results in an alpha of roughly 3 % (2.85, 2.58, and 3.90) for the coarse-grained implementation, while the alpha for the fine-grained implementation is much higher and fluctuates more (17.93, 6.48, and 4.04). This shows that the parallelization overhead of TBB is much higher for the fine-grained implementation and can only be compensated by larger data sets.

Figure 2.17: Execution times of fine-grained and coarse-grained TBB parallelizations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Table 2.6: Execution times in seconds using the fine-grained TBB implementation on up to 24 cores for different volume resolutions.

  # cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  1         7.23             63.28             561.25
  2         4.24             33.71             297.89
  3         3.13             22.84             198.09
  4         2.61             17.80             152.47
  5         2.32             14.62             122.29
  6         2.17             13.00             106.97
  7         1.95             11.50             92.08
  8         1.85             12.38             81.44
  9         2.00             9.89              76.65
  10        1.66             9.07              67.28
  11        1.61             8.53              65.50
  12        1.57             8.27              61.29
  13        1.54             7.85              57.39
  14        1.49             7.38              53.35
  15        1.46             7.04              50.71
  16        1.47             6.78              47.03
  17        1.44             6.64              46.36
  18        1.43             6.53              45.44
  19        1.42             6.30              43.14
  20        1.42             6.16              40.77
  21        1.47             5.98              39.72
  22        1.40             5.87              38.56
  23        1.41             5.76              37.62
  24        1.39             5.70              37.19


Table 2.7: Execution times in seconds using the coarse-grained TBB implementation on up to 24 cores for different volume resolutions.

  # cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  1         7.05             62.73             559.35
  2         3.53             31.49             281.66
  3         2.39             23.67             200.06
  4         1.78             15.75             142.72
  5         1.59             15.60             139.08
  6         1.24             11.00             100.21
  7         1.11             9.77              88.46
  8         1.03             8.96              81.29
  9         0.98             8.51              76.18
  10        0.97             8.10              69.23
  11        0.88             7.71              68.51
  12        0.63             5.43              51.16
  13        0.73             5.42              51.10
  14        0.59             5.08              51.02
  15        0.55             4.63              51.46
  16        0.55             4.66              43.91
  17        0.52             4.27              51.56
  18        0.52             4.63              44.40
  19        0.51             4.26              41.91
  20        0.52             4.25              40.66
  21        0.48             3.87              40.62
  22        0.47             3.90              37.00
  23        0.48             3.87              37.94
  24        0.46             3.87              37.00


Figure 2.18: Execution times of fine-grained and coarse-grained TBB parallelizations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.19: Execution times of fine-grained and coarse-grained TBB parallelizations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Figure 2.20: Speedup of fine-grained and coarse-grained TBB implementation on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.21: Speedup of fine-grained and coarse-grained TBB implementation on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.22: Speedup of fine-grained and coarse-grained TBB implementation on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Scalability in terms of problem size: As seen in Figures 2.20, 2.21, and 2.22, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in TBB. For the small volume, 17.93 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 4.04 % for the large volume. For the coarse-grained implementation only insignificant differences can be seen. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. While this is the case for the coarse-grained implementation, as seen in Figure 2.24, the ratio drops with increasing number of cores for the fine-grained implementation (see Figure 2.23). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of TBB is compensated by more cores.

Figure 2.23: Scalability of fine-grained TBB parallelization: Shown is the execution time ratio when moving to a different volume resolution.

Resource awareness of framework (run-time system): TBB is to our knowledge not resource aware. Resource management for multiple sub-algorithms (image pipeline): TBB does provide support for resource management for pipelining. It provides a


Figure 2.24: Scalability of coarse-grained TBB parallelization: Shown is the execution time ratio when moving to a different volume resolution.

pipeline and a filter class. Each pipeline stage is represented by one filter, and for each stage one filter is added to the pipeline; the order in which the filters are added corresponds to the order of the pipeline stages. Non-linear pipelines are not supported and dependencies between pipeline stages have to be resolved by the programmer, that is, parallel pipeline stages have to be added to the pipeline class in a sequentialized order. When a pipeline is executed, each filter can be executed in parallel on different chunks of the input data, according to the corresponding filter configuration. The number of data chunks being processed by the pipeline at the same time can be limited. In addition to parallel processing of data chunks in the same pipeline stage, different pipeline stages can be active at the same time, too.
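A minimal sketch of this pipeline and filter usage (not taken from the evaluated code base; the two stages, the line count of 16, and the token limit of 4 are illustrative assumptions, and the classic tbb::pipeline interface of TBB 2.x is assumed):

#include <cstdio>
#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>

// first stage (serial, in order): hands out one image line index per token
class LineSource : public tbb::filter {
    int next_line;
    int num_lines;
public:
    LineSource(int lines)
        : tbb::filter(tbb::filter::serial_in_order), next_line(0), num_lines(lines) {}
    void *operator()(void *) {
        if (next_line >= num_lines) return NULL;   // NULL ends the pipeline
        ++next_line;
        return reinterpret_cast<void *>(static_cast<size_t>(next_line));
    }
};

// second stage (parallel): several lines can be processed at the same time
class LineWorker : public tbb::filter {
public:
    LineWorker() : tbb::filter(tbb::filter::parallel) {}
    void *operator()(void *item) {
        int line = static_cast<int>(reinterpret_cast<size_t>(item));
        std::printf("processing line %d\n", line);   // placeholder for real work
        return NULL;
    }
};

int main() {
    tbb::task_scheduler_init init;
    tbb::pipeline pipe;
    LineSource source(16);
    LineWorker worker;
    pipe.add_filter(source);   // filters are added in pipeline-stage order
    pipe.add_filter(worker);
    pipe.run(4);               // at most 4 data chunks (tokens) in flight
    pipe.clear();
    return 0;
}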

Support of streaming by the framework: TBB does not provide streaming support for series of images.


2.4 RapidMind

Framework
RapidMind [Rap09] is a commercial solution that emerged from a high-level programming language for graphics cards called Sh [MDT04] and was recently acquired by Intel. While Sh originally targeted only graphics cards, RapidMind takes a data-parallel approach that maps well onto many-core hardware as well as onto standard shared memory multi-core architectures and the Cell Broadband Engine. RapidMind programs follow the single program, multiple data (SPMD) paradigm, where the same function is applied in a data-parallel fashion to all elements of a large data set. These programs use their own language and data types, are compiled at run-time, and are called from standard C/C++ code. Through dynamic compilation, the code can be optimized for the underlying hardware at run-time, and the same code can be executed on different back ends like standard multi-core processors and graphics cards. All this functionality is provided by libraries and works with the standard compilers on Windows, GNU Linux, and Mac OS X. RapidMind programs are by design free of deadlocks and race conditions. This is possible since changes to the output variables of a RapidMind program are only visible after the program has finished and only regular writes to the output data are allowed.
Inasmuch as RapidMind was originally targeted at graphics cards, the functions executed in parallel are small computational kernels called Programs. A RapidMind program contains the operations that are applied to all elements of large data sets, that is, the program contains no loop iterating over the iteration space, but applies the operations automatically to each element. Variables that are passed to a RapidMind program are declared by the In keyword, output variables correspondingly by the Out keyword. Inside a RapidMind program and in the normal program code, RapidMind data types have to be used, like Value1i for a single integer. Also the control flow within a RapidMind program is defined by dedicated keywords like FOR and IF; the end of a control flow section is marked by the corresponding keyword like ENDFOR or ENDIF. A simple program to square all data elements of a set is defined as follows:

Program square = BEGIN {
    In<Value1i> in_data;
    Out<Value1i> out_data;

    out_data = in_data * in_data;
} END;

To apply a RapidMind program to a set of data, the data has to be stored in RapidMind Arrays in the C/C++ part of the source code. To access the elements of arrays, RapidMind provides functions that return a pointer to the memory associated with the array. Alternatively, a RapidMind array can be bound to existing memory. The array can afterwards be passed as parameter to the square RapidMind program defined above, and the output of the RapidMind program is stored to a second array.

If the size of an array is not specified, it is determined at run-time by the RapidMind run-time environment:

Array<1, Value1i> input(10000);
Array<1, Value1i> output;

// get access to the data contained in input
int *input_data = input.write_data();

// initialize input data
...
output = square(input);

For reductions, RapidMind provides predefined collective operations, for example to calculate the sum of all elements in a set or to get the maximum of all elements.
Implementing regular data-parallel programs in RapidMind is straightforward, since only the operation on one single data element has to be defined; the rest is handled by the RapidMind run-time environment. However, irregular problems require a different approach in RapidMind. For example, calculating the histogram is not regular, since the bin depends on the pixel value and is, hence, random. Also, the number of input and output elements has to match. Therefore, the histogram resolution determines the degree of parallelism: we use one thread per histogram bin and iterate for each bin sequentially over the complete image to calculate the bin count. Listing 2.7 shows the implementation of this approach in RapidMind. The grid function provides a virtual array of contiguous integers, which in our case corresponds to the image intensity for the histogram calculation. This implementation, however, suffers from low data reuse since the complete image has to be read to calculate a single histogram bin value. To increase data reuse, we use thread-local histograms in the RapidMind program in order to read the image only once. As output of the RapidMind program a complete array is used. Therefore, the output array is partitioned into sub-arrays using the dice function. To use more than one thread, one histogram per image line is calculated. In a second step, these histograms are merged into the final histogram. The implementation of this approach in RapidMind is shown in Listing 2.8.

Evaluation
Time required to get familiar with a particular framework: RapidMind requires little time to get familiar with, although it provides its own language and a different work flow than the previously investigated frameworks. After a few days the main features and the concept behind RapidMind are well understood and the application can be parallelized using RapidMind. However, some essential functions are missing from the RapidMind documentation, for example, how to read the value of one single RapidMind data type. Also, there is no support available through an online community as for the other frameworks. Even though


Value1i rm_height = height;
Value1i rm_width = width;
Array<1, Value1i> hist(MAX_INTENSITY);
Array<2, Value1i> rm_img(width, height);

Program rm_hist = BEGIN {
    In<Value1i> intensity;
    Out<Value1i> out;
    Value1i count = 0;

    // iterate over the whole image and count the pixels matching this bin's intensity
    FOR (Value1i y = 0, y < rm_height, y++)
        FOR (Value1i x = 0, x < rm_width, x++)
            IF (rm_img[Value2i(x, y)] == intensity)
                count += 1;
            ENDIF
        ENDFOR
    ENDFOR
    out = count;
} END;

// grid() supplies the contiguous intensity values 0 .. MAX_INTENSITY-1
hist = rm_hist(grid(MAX_INTENSITY));

Listing 2.7: Histogram calculation in RapidMind using one thread per histogram bin.

our RapidMind license includes technical support, the responsiveness of the support team leaves a lot to be desired.
Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added: In order to parallelize existing code, almost the complete source code has to be restructured and rewritten. The kernels have to be written in a separate language, and RapidMind-specific data types have to be used within RapidMind programs. Similar to TBB, the invocation and implementation of parallel functionality are separated. However, in contrast to TBB, RapidMind programs define only the operation on a single element, which means that the code in a RapidMind program also differs from the original source code. The guarantee of deadlock- and race-condition-free RapidMind programs comes at the cost of only being able to express regular problems in RapidMind. Solving irregular problems, like the histogram example, requires considerable effort to come up with a proper solution. Mapping the complete application to RapidMind took more than one week, and getting correct results as well as acceptable execution times took a second week.
Support of parallelization by the framework: The parallelization is completely done by the framework; only the operations that should be applied to each data element in a set have to be defined.
Support of data partitioning by the framework: The data partitioning is done by the user, that is, the operations within a RapidMind program define the data partition that is assigned to one core. This is typically one data element.


/* temporary array storing one histogram per image line */
Array<1, Value1i> hist_tmp(MAX_INTENSITY * height);
Array<1, ArrayAccessor<1, Value1i> > hist_tmp_accessor = dice(hist_tmp, MAX_INTENSITY);
Array<1, Value1i> hist(MAX_INTENSITY);
Array<2, Value1i> rm_img(width, height);

/* calculate one histogram per image line */
Program rm_hist_diced = BEGIN {
    In<Value1i> line;
    Out<Array<1, Value1i> > res(MAX_INTENSITY);

    FOR (Value1i i = 0, i < MAX_INTENSITY, i++)
        res[i] = 0;
    ENDFOR
    // accumulate the pixels of this image line into the line-local histogram
    FOR (Value1i x = 0, x < rm_width, x++)
        res[rm_img[Value2i(x, line)]] += 1;
    ENDFOR
} END;

/* merge the per-line histograms into the final histogram */
Program rm_hist_merge = BEGIN {
    In<Value1i> pos;
    Out<Value1i> res;

    res = 0;
    FOR (Value1i i = 0, i < rm_height, i++)
        res += hist_tmp_accessor[i][pos];
    ENDFOR
} END;

hist_tmp_accessor = rm_hist_diced(grid(height));
hist = rm_hist_merge(grid(MAX_INTENSITY));

Listing 2.8: Histogram calculation in RapidMind using thread-local histograms, one per image line.

However, the assignment of single or multiple data partitions to the cores is completely handled by RapidMind. The user has no influence on the granularity, apart from adjusting it by hand within a RapidMind program.
Applicability for different algorithm classes and target platforms: RapidMind supports fine-grained data parallelism, but no coarse-grained task parallelism. Our attempt to implement a coarse-grained version failed because the RapidMind program got too big to remain maintainable and debuggable.
Advantages, drawbacks, and difficulties of a particular framework: Like TBB, RapidMind is a library approach; no additional compiler is required. Hence, RapidMind integrates smoothly into existing frameworks. However, compilation takes considerably longer compared to normal source compilation: linking and compiling the 2D/3D image registration against the RapidMind libraries takes more than 22 seconds, while normal compilation takes less than 2 seconds. This disrupts the development process to some extent. Even more disturbing is the fact that not all errors are discovered when compiling the C/C++ sources. Some errors are only detected when the RapidMind programs are compiled just-in-time by the RapidMind run-time environment and executed. For these errors, neither the line where the error occurred, nor the file, nor the name of the RapidMind program was reported. RapidMind programs are just-in-time compiled every time they are used, unless an instance of that program is kept over the different evaluation steps of the 2D/3D image registration. Besides the additional time required for recompiling the RapidMind programs for every evaluation, the generated code did not yield the same results for every iteration. Hence, one instance of every RapidMind program was generated once and kept afterwards for the entire 2D/3D image registration.
The way RapidMind programs are written and called matches precisely the needs of image processing. The RapidMind source code is concise, easy to read, and comprehensible. Compared to normal C/C++, little to no overhead in terms of lines of code is needed to express the functionality of kernels operating on images. Border handling for a 2D array containing an image is done by setting the boundary mode of the array; there are several predefined boundary modes like clamp, repeat, or constant [Rap09]. Also, for accessing neighboring pixels within a RapidMind program, accessors are available to determine the neighbor relative to the current pixel. No complex index calculations are required.
Effort to get a certain result, for example, performance or throughput: The effort to get a reasonable speedup is huge. Reasonable speedups are only obtained when the RapidMind programs are just-in-time compiled only once, during program initialization. For example, there were some problems passing the pointer of a variable from one RapidMind program to another RapidMind program creating the 2D projection of the volume. Using a pointer, that function took about 43 s to execute and the output was wrong; using a reference instead of the pointer, it took only 0.6 s to execute and the output was correct.
Scalability of the framework with respect to architecture family, new hardware:

Figures 2.25, 2.26, and 2.27 show the execution times of the fine-grained RapidMind implementation on up to 24 cores for different volume and image resolutions. The exact times are in addition listed in Table 2.8. The graphs show that for the fine-grained implementation nine cores are required for the small volume and three cores for the other volumes to compensate the parallelization overhead and to catch up with the reference implementation. The saturation point, where adding further cores yields no improvement, is reached at about 12.5 s for the middle resolution. Plotting the speedup for the different volume and image resolutions in Figures 2.28, 2.29, and 2.30 shows that the speedups obtained by the fine-grained implementation are far below the optimal speedup, but improve with the problem size. Using Amdahl's law to determine the sequential part of the implementation results in a high alpha that fluctuates (105.66 %, 22.31 %, and 8.82 %). This shows that RapidMind has the highest parallelization overhead so far for the fine-grained implementation.

Figure 2.25: Execution times of fine-grained RapidMind parallelizations on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Table 2.8: Execution times in seconds using the fine-grained RapidMind implementation on up to 24 cores for different volume resolutions.

  # cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  1         24.06            141.54            968.07
  2         15.32            74.43             502.86
  3         11.66            51.51             338.07
  4         9.91             39.90             256.36
  5         8.18             34.30             210.10
  6         7.50             28.49             172.36
  7         7.07             25.45             152.35
  8         6.71             22.20             130.13
  9         6.30             20.35             117.13
  10        6.13             18.75             107.20
  11        6.33             17.75             96.41
  12        6.18             17.35             89.54
  13        6.12             16.21             84.12
  14        6.11             15.60             78.86
  15        5.77             15.61             74.04
  16        6.12             14.54             68.49
  17        6.07             14.26             76.17
  18        5.82             13.47             62.60
  19        6.48             13.43             60.49
  20        6.21             12.96             57.06
  21        6.52             13.75             63.89
  22        6.75             12.35             53.35
  23        7.46             12.28             51.79
  24        7.73             13.21             50.25


Figure 2.26: Execution times of fine-grained RapidMind parallelizations on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.27: Execution times of fine-grained RapidMind parallelizations on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Figure 2.28: Speedup of fine-grained RapidMind implementation on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.29: Speedup of fine-grained RapidMind implementation on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.30: Speedup of fine-grained RapidMind implementation on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Scalability in terms of problem size: As seen in Figures 2.28, 2.29, and 2.30, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in RapidMind. For the small volume, 105.66 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 8.82 % for the large volume. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Here, the ratio is smaller and drops further with increasing number of cores for the fine-grained implementation (see Figure 2.31). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of RapidMind is compensated by more cores.

Figure 2.31: Scalability of fine-grained RapidMind parallelization: Shown is the execution time ratio when moving to a different volume resolution.

Resource awareness of framework (run-time system): RapidMind for multi-core processors is to our knowledge not resource aware.
Resource management for multiple sub-algorithms (image pipeline): RapidMind does provide support for resource management for pipelining. It allows the results of one RapidMind program to be pipelined directly into the next RapidMind program. For example, to execute three RapidMind programs, program_1, program_2, and program_3 one after the other, the following syntax can be used:

output = (program_3 << program_2 << program_1)(input);

However, there is no parallel execution of different pipeline stages. Only after one pipeline stage has completely finished are the data passed to the next pipeline stage.
Support of streaming by the framework: RapidMind does not provide streaming support for series of images.

2.5 OpenCL

Framework
The Open Computing Language (OpenCL) is a standard for programming heterogeneous parallel platforms [Mun09]. OpenCL was initiated by Apple and is created and maintained by the Khronos Group. OpenCL is a platform-independent specification for parallel programming, like OpenGL is for graphics programming. OpenCL currently supports programming of standard multi-core processors, graphics cards, and the Cell Broadband Engine, with planned support for other accelerators like DSPs. OpenCL allows data and task parallelism to be expressed. The functionality is provided by a library, and OpenCL programs are just-in-time compiled by the run-time environment as in RapidMind. The kernels are stored in strings, just like in OpenGL. It is also possible to share resources between OpenCL and OpenGL. The OpenCL standard is implemented by hardware vendors like AMD, IBM, and NVIDIA, but also by operating systems like Apple's Mac OS X. All major compilers can be used to link against the OpenCL library.
As in RapidMind, kernels are written for one data element of a large data set, and this kernel is applied to each element of the data set by the OpenCL run-time environment. Kernels are defined by the __kernel keyword in OpenCL, and different memory spaces are distinguished: global memory visible to all threads is annotated by the __global keyword, local memory visible only to the currently executing processor by the __local keyword. Unlike in RapidMind, the index of the current thread has to be calculated explicitly, and also the data has to be retrieved and stored manually. Therefore, built-in commands like get_global_id are available to retrieve the index of the current thread within the current iteration space.
To calculate a histogram in OpenCL, we use the same approach as in RapidMind: one histogram is calculated per image line and these histograms are afterwards merged into the final histogram. The implementation of the OpenCL kernels is shown in Listing 2.9. This source code is stored in a string and compiled at run-time. Before the code can be compiled, a device has to be chosen for execution. This can be done by the clGetDeviceIDs command, using CL_DEVICE_TYPE_CPU as parameter to select a CPU as target platform. Similarly, a graphics card can be requested as device using CL_DEVICE_TYPE_GPU:


__kernel void ocl_hist(__global int *hist, __global float *img, int width) {
    int i = get_global_id(0);   /* histogram bin (intensity value) */
    int y = get_global_id(1);   /* image line */

    int count = 0;
    for (int k = 0; k < width; k++) {
        if ((int) img[k + y * width] == i) count++;
    }
    hist[y * MAX_INTENSITY + i] = count;
}

__kernel void ocl_hist_merge(__global int *hist, __global int *tmp_hist, int height) {
    int i = get_global_id(0);

    int count = 0;
    for (int k = 0; k < height; k++) {
        count += tmp_hist[k * MAX_INTENSITY + i];
    }
    hist[i] = count;
}

Listing 2.9: OpenCL kernels for the histogram calculation (one histogram per image line, merged in a second step).

// get CPU device
clGetDeviceIDs(cl_platform_id, CL_DEVICE_TYPE_CPU, 1, &data.cl_device, NULL);
...
// create OpenCL kernel for ocl_hist
cl_kernel_hist = clCreateKernel(cl_program, "ocl_hist", &cl_err);

After selecting a device, further initialization has to be done, like context creation, command queue creation, OpenCL program object generation from the source code string, compilation of the OpenCL kernels, and, eventually, creation of each OpenCL kernel. Before executing an OpenCL kernel, each parameter of the kernel has to be set using the clSetKernelArg function. The OpenCL kernel passed as argument to this function is associated with the kernel name in the source code string. For example, in Listing 2.10 the OpenCL kernel object cl_kernel_hist refers to the kernel ocl_hist from Listing 2.9. The execution of the kernel is initiated by the clEnqueueNDRangeKernel command, and the iteration space is defined by global_work_size. In this case a two-dimensional workspace of MAX_INTENSITY × height is used.
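Before looking at the kernel launch in Listing 2.10, the one-time setup steps listed above might look as follows; this is a minimal sketch, not the code of the evaluated implementation, and the helper name setup_hist_kernel as well as the assumption that kernel_source already holds the kernel string of Listing 2.9 are illustrative:

#include <CL/cl.h>

/* Hypothetical helper: performs the one-time setup steps described above. */
static cl_kernel setup_hist_kernel(const char *kernel_source,
                                   cl_context *context_out,
                                   cl_command_queue *queue_out)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    /* context and command queue for the selected device */
    *context_out = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    *queue_out   = clCreateCommandQueue(*context_out, device, 0, &err);

    /* program object from the source string, just-in-time compilation, kernel object */
    cl_program program = clCreateProgramWithSource(*context_out, 1, &kernel_source, NULL, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    return clCreateKernel(program, "ocl_hist", &err);
}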


size_t global_work_size[2];
global_work_size[0] = MAX_INTENSITY;
global_work_size[1] = height;

// set parameters for histogram calculation
clSetKernelArg(cl_kernel_hist, 0, sizeof(cl_mem), &cl_tmp_hist);
clSetKernelArg(cl_kernel_hist, 1, sizeof(cl_mem), &cl_img);
clSetKernelArg(cl_kernel_hist, 2, sizeof(int), &width);

// execute kernel over entire range of the data set
clEnqueueNDRangeKernel(cl_queue, cl_kernel_hist, 2, NULL, global_work_size, NULL, 0, NULL, NULL);

/* merge histogram */
global_work_size[1] = 1;

/* set parameters */
clSetKernelArg(cl_kernel_merge_hist, 0, sizeof(cl_mem), &cl_hist);
clSetKernelArg(cl_kernel_merge_hist, 1, sizeof(cl_mem), &cl_tmp_hist);
clSetKernelArg(cl_kernel_merge_hist, 2, sizeof(int), &height);

/* execute kernel */
clEnqueueNDRangeKernel(cl_queue, cl_kernel_merge_hist, 1, NULL, global_work_size, NULL, 0, NULL, NULL);

Listing 2.10: OpenCL kernel launch for histogram calculation.

Evaluation

Time required to get familiar with a particular framework: OpenCL requires some time to get familiar with the terminology used and to understand the way the OpenCL platform works. For developers who already know CUDA, however, getting familiar with OpenCL is not difficult, since mainly the terminology changes. For multi-core hardware, currently two vendors provide OpenCL, namely Apple and AMD. Getting a simple hello world example to work takes almost the same time as the complete parallelization in OpenMP, because a smörgåsbord of different initialization steps is required before the actual OpenCL code can be executed, and each vendor's implementation has its own entry barriers.
Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added: In order to parallelize existing code, almost the complete source code has to be restructured and rewritten. The kernels are written in a language based on C99 and have to be stored in strings. Since using strings to store source code is not practical, boilerplate code has to be added in order to load the OpenCL source code from a file. Similar to RapidMind, the invocation and implementation of parallel functionality are separated, and the kernel code operates only on one data element. The time to map the complete application to OpenCL was about two weeks, due to the bugs and errors in the OpenCL implementation provided by AMD.


Support of parallelization by the framework: The parallelization is completely done by the framework; only the operations that should be applied to each data element in a set have to be defined.
Support of data partitioning by the framework: The data partitioning is done by the user, that is, the operations within an OpenCL kernel define the data partition that is assigned to one core. This is typically one data element. The tiling of the data set into partitions is also done by the framework, but can be influenced by the user as well by specifying the one- or two-dimensional size of the tiles.
Applicability for different algorithm classes and target platforms: OpenCL supports fine-grained data parallelism as well as coarse-grained task parallelism. However, the attempt to implement a coarse-grained version failed due to the unsound OpenCL implementation provided by AMD: the coarse-grained 2D/3D image registration hangs in the first kernel invocation if debugging is disabled.
Advantages, drawbacks, and difficulties of a particular framework: The biggest advantage of OpenCL is its potential to provide support for heterogeneous parallel platforms ranging from standard multi-core processors to graphics cards. The OpenCL standard is maintained by an independent consortium and receives a fair amount of interest from many sides. However, the current implementation provided by AMD for standard multi-core systems is far from being usable in deployment systems. The developer has to fight with an immature OpenCL implementation (ATI Stream SDK 2.0.1). This crops up in the form of random segmentation faults of the internal just-in-time compiler and in incorrect code generated by the just-in-time compiler. The latter issue was solved for our 2D/3D implementation with the latest update from AMD, but the former problem still exists. A further issue with the current OpenCL framework (ATI Stream SDK 2.1) is that the number and the types of the parameters of an OpenCL kernel are not checked by the compiler. This may lead to errors only at run-time, although not necessarily. More details on the evaluation of OpenCL on many-core graphics cards and its implications are given in Section 3.3.
Effort to get a certain result, for example, performance or throughput: The effort to get a reasonable speedup on the CPU is surprisingly low, once a working OpenCL implementation and boilerplate code for initialization are at hand. Even without further device-specific optimizations, a naïve implementation results in appealing speedups.
Scalability of the framework with respect to architecture family, new hardware: Figures 2.32, 2.33, and 2.34 show the execution times of the fine-grained OpenCL implementation on up to 16 cores for different volume and image resolutions. The exact times are in addition listed in Table 2.9. The graphs show that for the smallest volume no speedup can be achieved. For larger volumes, however, the fine-grained implementation running even on only a single core is faster than the baseline. This may be due to the optimizations performed by the just-in-time compiler of OpenCL.


The saturation point, where adding further cores yields no improvement, is reached at about 9.5 s for the middle resolution. Plotting the speedup for the different volume and image resolutions in Figures 2.35, 2.36, and 2.37 shows that the speedups obtained by the fine-grained implementation are far below the optimal speedup, but improve with the problem size. Using Amdahl's law to determine the sequential part of the implementation results in a high alpha that fluctuates (89.88 %, 15.92 %, and 3.51 %). Also for OpenCL, the fine-grained implementation has a high parallelization overhead.

Figure 2.32: Execution times of fine-grained OpenCL parallelizations on up to 16 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.33: Execution times of fine-grained OpenCL parallelizations on up to 16 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.34: Execution times of fine-grained OpenCL parallelizations on up to 16 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Table 2.9: Execution times in seconds using the fine-grained OpenCL implementation on up to 16 cores for different volume resolutions.

  # cores   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  1         8.11             43.92             312.71
  2         7.49             38.68             276.10
  3         5.91             27.12             189.26
  4         5.00             20.83             141.34
  5         5.42             18.15             115.41
  6         5.42             15.68             96.13
  7         5.60             13.99             83.70
  8         5.17             12.70             73.86
  9         6.30             13.41             67.71
  10        6.01             13.25             61.53
  11        6.31             12.81             57.40
  12        4.80             10.63             50.56
  13        5.00             10.55             47.60
  14        5.16             9.95              44.37
  15        4.71             9.58              41.78
  16        4.68             9.38              38.82


Figure 2.35: Speedup of fine-grained OpenCL implementation on up to 16 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.36: Speedup of fine-grained OpenCL implementation on up to 16 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.37: Speedup of fine-grained OpenCL implementation on up to 16 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Scalability in terms of problem size: As seen in Figures 2.35, 2.36, and 2.37, the sequential portion (i.e., the overhead that comes with the parallelization) of the fine-grained implementation is huge in OpenCL. For the small volume, 89.88 % of the parallelized part of the 2D/3D image registration is still sequential. This decreases to 3.51 % for the large volume. The same can be seen when plotting the execution time ratio of two volume resolutions. As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Here, the ratio is smaller and drops further with increasing number of cores for the fine-grained implementation (see Figure 2.38). Hence, the fine-grained implementation scales well for larger problem sizes with an increasing number of cores. However, this is mainly because the large parallelization overhead of OpenCL is compensated by more cores.

Figure 2.38: Scalability of fine-grained OpenCL parallelization: Shown is the execution time ratio when moving to a different volume resolution.

Resource awareness of framework (run-time system): OpenCL for multi-core processors is to our knowledge not resource aware. Resource management for multiple sub-algorithms (image pipeline): OpenCL does not provide support for resource management for pipelining. Support of streaming by the framework: OpenCL does not provide streaming support for series of images.


2.6 Discussion

To summarize the results of the frameworks, the achieved performance of the multi-core frameworks is compared.
Figures 2.39, 2.40, and 2.41 show the execution times of the fine-grained implementation of each framework on up to 24 cores for different volume and image resolutions. The times of the OpenCL implementation have been aligned to the baseline execution times of the other frameworks. It can be seen that in particular OpenMP and RapidMind have an enormous initialization overhead for the small volume. For a single multi-core processor with four cores, even the execution time of the baseline implementation is not met for the small volume. With increasing number of cores, however, these frameworks scale well and catch up with the performance of the other frameworks, notably OpenMP. The framework that scales best and achieves the best results for all problem sizes is TBB. Cilk++ scales just as well, but takes slightly longer. OpenCL also shows good performance, though only for the middle and big volumes.

Figure 2.39: Execution times of all parallelization frameworks for the fine-grained implementation on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.40: Execution times of all parallelization frameworks for the fine-grained implementation on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.41: Execution times of all parallelization frameworks for the fine-grained implementation on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


Figures 2.42, 2.43, and 2.44 show the execution times of the coarse-grained implementation of each framework on up to 24 cores for different volume and image resolutions. There are no implementations for RapidMind and OpenCL, because it was too time-consuming and troublesome to get a correct implementation that runs. The graphs show that there are no major differences between OpenMP, Cilk++, and TBB for the coarse-grained implementation. All frameworks scale well and achieve good performance.

Figure 2.42: Execution times of all parallelization frameworks for the coarse-grained implementation on up to 24 cores for the 128 × 128 × 94 volume compared with the naïve implementation.


Figure 2.43: Execution times of all parallelization frameworks for the coarse-grained implementation on up to 24 cores for the 256 × 256 × 189 volume compared with the naïve implementation.


Figure 2.44: Execution times of all parallelization frameworks for the coarse-grained implementation on up to 24 cores for the 512 × 512 × 378 volume compared with the naïve implementation.


In summary, the best scaling framework for all volume sizes is TBB, followed by Cilk++ and OpenMP. While TBB requires parallel functionality to be encapsulated in special classes, Cilk++ and OpenMP require almost no change to existing source code. OpenCL has the potential to catch up with these frameworks once the OpenCL library is mature and coarse-grained implementations can be realized without the problems of the current implementation. One major advantage of Cilk++ is the two provided tools, the race detector and the parallel performance analyzer. RapidMind programs are concise, easy to read, and comprehensible; unfortunately, the achieved performance scales only for large problems.

3 Many-Core Frameworks

In this chapter, frameworks for programming many-core accelerators such as graphics cards are evaluated. The frameworks we considered are to a large extent those frameworks relevant to and used in industry.
For the evaluation several graphics cards are used. On the one hand, NVIDIA graphics cards are used for most parts, since they are supported by most of the frameworks: of the current generation a high-end Quadro FX 5800 with 240 streaming processors is used, as well as a high-end Tesla C 2050 with 448 streaming processors of the follow-up Fermi architecture. On the other hand, a Radeon HD 5870 with 1600 streaming processors from ATI is used. Table 3.1 lists the compiler and framework versions used for each of the investigated frameworks.

Table 3.1: Compiler version and framework version used for evaluation.

  Framework         Version
  RapidMind         gcc/4.4.2, RapidMind 4.0.1
  PGI Accelerator   pgcc/10.4
  OpenCL            gcc/4.4.2, OpenCL 1.0, Stream 2.0.1/2.1, CUDA 3.0
  CUDA              gcc/4.3.4, CUDA 3.0

3.1 RapidMind

Framework
A detailed characterization and elaborate description of RapidMind is given in Section 2.4. Only the differences when using RapidMind for graphics cards are covered here.
RapidMind provides several back ends to generate code for. These are, for example, debug for debugging, cuda to generate code for CUDA-enabled devices, or glsl for OpenGL-enabled devices. By default, RapidMind chooses the most eligible back end. To choose a particular back end, the set_backend command can be used, for example to select the CUDA back end. The following command is sufficient to generate code for graphics cards from the same RapidMind program as in Section 2.4, using the CUDA back end:

set_backend("cuda");


Table 3.2: Execution times in seconds using the CUDA back end of RapidMind on the Quadro FX 5800 and Tesla C 2050.

  # cores        128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  240 (Quadro)   5.82             13.90             51.08
  448 (Tesla)    7.22             9.69              22.63

Evaluation
Since RapidMind was already evaluated in Section 2.4, only those criteria are listed that differ from the previous evaluation.
Advantages, drawbacks, and difficulties of a particular framework:

After changing the back end to CUDA, the application crashed with a segmentation fault while executing a RapidMind program. As it turned out, some RapidMind variables were not initialized by default in the CUDA back end, while they had been initialized with zero by the previously used back end. Afterwards, the program ran without errors. Using the OpenGL back end, no local arrays are allowed and, hence, the optimized histogram implementation described in Section 2.4 is not supported. Using the implementation without local arrays yielded incorrect results on the NVIDIA card, while the program hung on an ATI card. In summary, the CUDA back end worked well, while the OpenGL back end puts more restrictions on the programs and did not work as expected.
Scalability of the framework with respect to architecture family, new hardware:

Table 3.2 shows the execution times of the fine-grained implementation using the CUDA back end of RapidMind on the Quadro FX 5800 and Tesla C 2050 for different volume and image resolutions, and Figure 3.1 shows the corresponding speedup compared to the baseline implementation. It can be seen that the CUDA back end of RapidMind does not scale well on the new Tesla C 2050 for small problem sizes and even takes longer to execute. For big problem sizes, the implementation scales disproportionately well with the number of cores on the new architecture, without any change to the source code.
Scalability in terms of problem size: As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Figure 3.2 shows the execution time ratio when moving to a bigger volume resolution. It can be seen that for both graphics cards the ratio is far below 8. Hence, the implementation scales well for larger problem sizes. However, this is mainly due to the high degree of parallelism required to utilize the graphics hardware best.
Resource awareness of framework (run-time system): RapidMind for many-core processors is not resource aware. Kernels executed on graphics hardware occupy the processors exclusively.


Figure 3.1: Speedup of the RapidMind implementation on the CUDA back end compared to the baseline implementation for different volume resolutions.


Figure 3.2: Scalability of fine-grained implementation using the CUDA back end of RapidMind: Shown is the execution time ratio when moving to a different volume resolution.

3.2 PGI Accelerator

Framework

The PGI Accelerator model from the Portland Group is a directive-based high-level programming model for accelerators such as graphics cards [Wol10]. Its design is similar to OpenMP, and it is supported by the C and Fortran compilers from the Portland Group. These compilers are commercially available for Windows, GNU Linux, and Mac OS X and can be obtained temporarily for evaluation. The PGI Accelerator preprocessor directives define accelerator regions that are mapped and executed in parallel on the graphics card. The compiler automatically generates code for the graphics card from loops within accelerator regions. This allows data-parallel execution, but no task parallelism. Further directives allow the programmer to influence the way a loop is executed in parallel and which data have to be copied to and from the graphics card. The compilation is feedback based, that is, the programmer gets feedback on whether a loop is executed in parallel or, if not, why the loop cannot be parallelized. To declare an accelerator region, the acc region pragma is used:


#pragma acc region
{
    // for loops
    ...
}

Data directives are used to tell the compiler which parts of an array are required on the device for computation. Data regions can be used to embed several accelerator regions that work on the same data:

#pragma acc data region copyin(img[0:width*height]) copyout(result[0:width*height])
{
    #pragma acc region
    {
        ...
    }
    #pragma acc region
    {
        ...
    }
}

To compile the code for the graphics card, the target architecture has to be specified as a compiler flag. This is done by the -ta=nvidia flag. It is also possible to compile a unified binary containing two versions of the code, one for the accelerator and one for the host, by specifying host as an additional target architecture. In case no graphics card is available, the code is then executed on the host.
In order to implement the histogram using the PGI Accelerator model, a similar approach has to be used as for RapidMind. The compiler does not allow writes to random locations, since that would lead to race conditions; the index on the left side of an assignment may only depend on the index of the enclosing loop. In a first step, one histogram per image line is generated and stored to temporary memory that is allocated only on the graphics card using the local data clause. For each bin we iterate over the image line, count the frequency, and store it to the temporary memory. The compiler, however, complains that the loop cannot be parallelized: "accelerator restriction: scalar variable live-out from loop: count". Telling the compiler that the count variable is private to each thread, the different threads do not interfere and the loop can be processed in parallel. In the second step, the histograms of each image line are merged into the final histogram, as seen in Listing 3.1.
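As an illustration of such an accelerator region, the following is a minimal sketch (not taken from the evaluated code base) of the sum-of-squared-differences computation whose compiler feedback is shown further below; the array names img_drr and img_fll and the variable sum are taken from that feedback, while the function signature and loop structure are assumptions:

/* sum of squared differences between two images; the reduction on sum
   is detected automatically by the compiler */
float ssd(const float *img_drr, const float *img_fll, int width, int height) {
    float sum = 0.0f;

    #pragma acc region copyin(img_drr[0:width*height]) copyin(img_fll[0:width*height])
    {
        for (int i = 0; i < width * height; i++) {
            float diff = img_drr[i] - img_fll[i];
            sum += diff * diff;
        }
    }
    return sum;
}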

Evaluation
Time required to get familiar with a particular framework: PGI Accelerator requires little time to get familiar with. After one day the main features and the concept behind PGI Accelerator are thoroughly understood and the application can be parallelized using the PGI Accelerator model.


#pragma acc region copyin(img[0:width*height]) local(tmp[0:height-1][0:MAX_INTENSITY]) copyout(hist[0:MAX_INTENSITY])
{
    int count;

    /* first step: calculate one histogram per image line */
    #pragma acc for private(count)
    for (int y = 0; y < height; y++) {
        for (int i = 0; i < MAX_INTENSITY; i++) {
            count = 0;
            for (int x = 0; x < width; x++) {
                if ((int) img[x + y * width] == i) count++;
            }
            tmp[y][i] = count;
        }
    }

    /* second step: merge the per-line histograms into the final histogram */
    #pragma acc for private(count)
    for (int i = 0; i < MAX_INTENSITY; i++) {
        count = 0;
        for (int y = 0; y < height; y++) {
            count += tmp[y][i];
        }
        hist[i] = count;
    }
}

Listing 3.1: Histogram calculation using the PGI Accelerator model.


Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added: Existing C or Fortran code can be parallelized with almost no change to the original code, using only pragmas to express the parallelism. However, C++ features like classes, or even C structs, are not allowed within accelerator regions. The code has to be restructured in order to conform to the ANSI C99 standard. Also, function calls within accelerator regions are not allowed and have to be inlined manually. Most of the time is spent getting the code ANSI C99 conformant and free of structs. In particular when a class is used for matrix and vector operations, a lot of effort has to be invested to rewrite and inline those operations. While only a few days are required to annotate the code with pragmas, almost two weeks were required to rewrite the code base to conform to the ANSI C99 standard.
Support of parallelization by the framework: The parallelization is completely done by the framework; only the code fragments to be parallelized have to be annotated. In addition, hints for data transfers and the data parallelization strategy are required.
Support of data partitioning by the framework: The main data partitioning is done by the user, that is, the loops iterating over the data set are not altered. The user, however, can influence the parallelization strategy of loops. For example, using the kernel directive, the body of the corresponding loop is chosen as computational kernel. The parallel and vector directives can be used to influence the mapping to blocks and threads, respectively, of the underlying CUDA architecture. In addition there are directives to execute a loop sequentially (seq), to unroll a loop, or to execute the loop on the host. Most of the directives have an optional width parameter to specify the number of iterations to be considered for the corresponding scheduling clause.
Applicability for different algorithm classes and target platforms: PGI Accelerator supports fine-grained data parallelism, but no coarse-grained task parallelism. The sources of data parallelism are loops iterating over big data sets.
Advantages, drawbacks, and difficulties of a particular framework: The biggest advantage of the PGI Accelerator model, as is the case with OpenMP, is that it is relatively easy to parallelize the code. No new syntax has to be learned; only hints for the compiler have to be given. All the code to manage resources on the graphics card and to transfer data to and back from the graphics memory is completely handled by the framework. The feedback-based compilation also helps the programmer to get code that maps well to the hardware. The feedback when compiling the sum of squared differences loop shows which data are automatically copied to the graphics card, which loop is used to generate the kernel, which vector width is used for execution, and that a reduction is automatically generated for the variable sum:

ssd:
     24, Generating copyin(img_drr[:height*width])
         Generating copyin(img_fll[:height*width])
     27, Loop is parallelizable
         Accelerator kernel generated
         27, #pragma acc for parallel, vector(256)
         31, Sum reduction generated for sum
     28, Loop is parallelizable


Some feedback is, however, not easy to understand without background knowledge in compiler technology and transformations:

mi:
     42, Accelerator restriction: scalar variable live-out from loop: count

The PGI Accelerator compilers also provide support to time kernels automatically. Therefore, the time option has to be added when specifying the target architecture: -ta=nvidia,time. When the program finishes execution, the statistics for each accelerator region are printed, showing what time is spent on data transfers and kernel execution:

quality_measures.c
  ssd
    24: region entered 160 times
        time(us): total=293282 init=86 region=293196
                  kernels=73109 data=220087
        w/o init: total=293196 max=2060 min=1801 avg=1832
    27: kernel launched 320 times
        grid: [1-2]  block: [256]
        time(us): total=73109 max=454 min=7 avg=228

For multi-device support, the PGI Accelerator model can be used in combination with OpenMP: for each OpenMP thread, a context is created and associated with one graphics card accelerator (a sketch is given at the end of this section).
The most obvious drawback of the PGI Accelerator model is the lack of support for C++. The code base has to be rewritten to conform to ANSI C99. Furthermore, there is currently no support for device-permanent data, that is, data cannot be kept on the device across accelerator data regions. Since data regions are limited to function scope, this also limits the lifetime of data in graphics memory. Constant data like the volume in the 2D/3D image registration has to be copied to device memory each time. For future versions of the Accelerator model, support is planned for generating code for standard shared memory multi-core processors from the accelerator annotations and for defining device-permanent data, but not for supporting graphics cards from ATI (D. Miles, personal communication).
Effort to get a certain result, for example, performance or throughput: The effort to get a reasonable speedup is huge. As long as device-permanent data is not supported, most time is spent on data transfers, unless the program is completely rewritten to have all computations in one huge function.
Scalability of the framework with respect to architecture family, new hardware:


Table 3.3: Execution times in seconds using PGI Accelerator on the Quadro FX 5800 and Tesla C 2050.

  # cores        128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  240 (Quadro)   9.13             30.17             141.33
  448 (Tesla)    15.07            39.91             134.13

Table 3.3 shows the execution times of the fine-grained implementation using PGI Accelerator on the Quadro FX 5800 and Tesla C 2050 for different volume and image resolutions, and Figure 3.3 shows the corresponding speedup compared to the baseline implementation. It can be seen that the PGI Accelerator model does not scale well on the new Tesla C 2050 for small and medium problem sizes and even takes longer to execute. For big problem sizes, the implementation does not scale proportionally with the number of cores and is only marginally faster compared to the Quadro FX 5800.

Figure 3.3: Speedup of the PGI Accelerator implementation compared to the baseline implementation for different volume resolutions.

Scalability in terms of problem size: As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Figure 3.4 shows the execution time ratio when moving to a bigger volume resolution.

It can be seen that for both graphics cards the ratio is far below 8. Hence, the implementation scales well for larger problem sizes. However, this is mainly due to the high degree of parallelism required to utilize the graphics hardware best.

Figure 3.4: Scalability of fine-grained implementation using PGI Accelerator: Shown is the execution time ratio when moving to a different volume resolution.

Resource awareness of framework (run-time system): PGI Accelerator is not resource aware. Kernels executed on graphics hardware occupy the processors exclusively.

Resource management for multiple sub-algorithms (image pipeline): PGI Accelerator does not provide support for resource management for pipelining.

Support of streaming by the framework: PGI Accelerator does not provide streaming support for series of images.


3.3 OpenCL

Framework
A detailed characterization and elaborate description of OpenCL is given in Section 2.5. Only the differences when using OpenCL for graphics cards are covered here. To run an OpenCL kernel on a graphics card, only a different device has to be chosen as target platform: CL_DEVICE_TYPE_GPU selects a graphics card, while CL_DEVICE_TYPE_CPU selects a standard shared memory multi-core processor. The histogram implementation on the host from Section 2.5 does not utilize the local memory of the graphics processors. This is, however, essential to get good performance. Therefore, OpenCL introduces the keyword __local, which specifies that data is stored in the fast on-chip memory of the graphics processors. In Listing 3.2 this memory is used to calculate one histogram per image line in local memory before storing it back to global memory. To synchronize the accesses of different threads to the local memory, the barrier command is provided. To avoid the race condition when several threads update the same bin, an atomic increment function provided by current graphics cards is used. To enable support for atomic functions, an OpenCL compiler directive is required. The code to call the kernels changes slightly, since the number of threads that calculate the histogram of one image line is now fixed to the number of bins. This is specified by the additional local_work_size parameter to clEnqueueNDRangeKernel in Listing 3.3. If the local work size is specified, the global work size has to be a multiple of the local work size and has to be adjusted accordingly.
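As a minimal sketch (not code from the study; error handling is omitted), switching between CPU and GPU only changes the device type requested from the OpenCL runtime:

#include <CL/cl.h>

cl_device_id select_device(int use_gpu) {
    cl_platform_id platform;
    cl_device_id device;
    cl_device_type type = use_gpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU;

    /* take the first platform and the first device of the requested type */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, type, 1, &device, NULL);
    return device;
}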

Evaluation
Since OpenCL was already evaluated in Section 2.5, only those criteria are listed that differ from the previous evaluation.
Advantages, drawbacks, and difficulties of a particular framework:
While it is enough to change one parameter to generate code for a different target platform, the code has to be adapted to the underlying hardware architecture to get good performance. For graphics cards, the faster on-chip local memory has to be used to store intermediate results, and atomic functions are used for computations free of race conditions. This means that different code has to be written for each target architecture; depending on the available devices, the appropriate implementation is chosen. For different problem sizes, the number of pixels per line differs and, hence, also the number of work-items that have to calculate the histogram. For large images (e.g., an image width greater than 512 pixels in one direction), the maximal number of elements that can be specified for the local work size is exceeded. That is, a different approach is needed, like processing multiple pixels per thread (a variant is sketched after Listing 3.2). OpenCL libraries are provided by the two major graphics hardware manufacturers, NVIDIA and ATI.


#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable

__kernel void ocl_hist_lmem(__global int *hist, __global float *img, int width) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int tid = get_local_id(0);

    __local unsigned int tmp_hist[MAX_INTENSITY];

    /* initialize local memory */
    if (tid < MAX_INTENSITY) {
        tmp_hist[tid] = 0;
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    /* create histogram for one image line */
    if (i < width) {
        int img_val = (int) img[i + j*width];
        atom_inc(&tmp_hist[img_val]);
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    /* write histogram of the image line to global memory */
    if (tid < MAX_INTENSITY) {
        hist[(get_group_id(0) + get_num_groups(0)*get_group_id(1)) * MAX_INTENSITY + tid] = tmp_hist[tid];
    }
}

__kernel void ocl_hist_merge(__global int *hist, __global int *tmp_hist, int height) {
    int i = get_global_id(0);

    int count = 0;
    /* sum the per-line sub-histograms; the loop body was truncated in the
       original and is reconstructed here */
    for (int k = 0; k < height; k++) {
        count += tmp_hist[i + k*MAX_INTENSITY];
    }
    hist[i] = count;
}
Listing 3.2: OpenCL kernels for the histogram calculation using local memory.
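For image lines wider than the maximum local work size, each work-item can process several pixels of a line, as mentioned in the evaluation above. A minimal sketch of such a variant (not code from the study; kernel name and indexing are assumptions, one work-group per image line):

__kernel void ocl_hist_lmem_wide(__global int *hist, __global float *img, int width) {
    int j   = get_global_id(1);
    int tid = get_local_id(0);
    int lsz = get_local_size(0);

    __local unsigned int tmp_hist[MAX_INTENSITY];

    if (tid < MAX_INTENSITY) tmp_hist[tid] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* each work-item handles pixels tid, tid+lsz, tid+2*lsz, ... of line j */
    for (int i = tid; i < width; i += lsz) {
        int img_val = (int) img[i + j * width];
        atom_inc(&tmp_hist[img_val]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* one work-group per line, so the sub-histogram index is simply j */
    if (tid < MAX_INTENSITY) {
        hist[j * MAX_INTENSITY + tid] = tmp_hist[tid];
    }
}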


size_t global_work_size[2], local_work_size[2];
local_work_size[0] = MAX_INTENSITY;
local_work_size[1] = 1;
global_work_size[0] = (int) ceil((float)(width)/local_work_size[0]) * local_work_size[0];
global_work_size[1] = height;

// set parameters for histogram calculation
clSetKernelArg(cl_kernel_hist, 0, sizeof(cl_mem), &cl_tmp_hist);
clSetKernelArg(cl_kernel_hist, 1, sizeof(cl_mem), &cl_img);
clSetKernelArg(cl_kernel_hist, 2, sizeof(int), &width);

// execute kernel over the entire range of the data set
clEnqueueNDRangeKernel(cl_queue, cl_kernel_hist, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);

/* merge histograms */
global_work_size[0] = MAX_INTENSITY;
global_work_size[1] = 1;
int num_subhist = height * (int) ceil((float)(width)/local_work_size[0]);

/* set parameters */
clSetKernelArg(cl_kernel_merge_hist, 0, sizeof(cl_mem), &cl_hist);
clSetKernelArg(cl_kernel_merge_hist, 1, sizeof(cl_mem), &cl_tmp_hist);
clSetKernelArg(cl_kernel_merge_hist, 2, sizeof(int), &num_subhist);
/* execute kernel */
clEnqueueNDRangeKernel(cl_queue, cl_kernel_merge_hist, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
Listing 3.3: OpenCL kernel launch for histogram calculation.

While the core functionality of the OpenCL libraries is the same, there are a few differences when developing with the different frameworks. Switching from one OpenCL framework to the other also implies switching between the convenience functionality provided by each framework to load source files, etc. The core implementation differs as well: while NVIDIA implemented almost the complete OpenCL specification, some features like support for various image formats are still missing in the ATI implementation. There are also major differences in the quality and correctness of the just-in-time compilers. While the NVIDIA just-in-time compiler (CUDA toolkit 3.0) worked without errors and provided correct results, the ATI just-in-time compiler (ATI Stream SDK 2.0.1) generated incorrect code when local memory was used and crashed with internal segmentation faults while compiling some source files. With the latest updates from ATI (ATI Stream SDK 2.1), the compiler generated correct code, but still had problems with some input files; renaming variables or changing the amount of whitespace randomly resolved the compilation issues. For the largest volume size, the ATI run-time did not allow more than 256 MB to be allocated for one single array and quit with a CL_INVALID_BUFFER_SIZE error. Furthermore, to increase the locality when accessing voxels in the volume, 3D textures were used. However, this was only possible using the OpenCL implementation from NVIDIA, since the required image format is not yet implemented in ATI's OpenCL implementation. Moreover, compiling the implementation presented here, which utilizes local memory, for the standard multi-core platform yields no errors, but the obtained results are incorrect. Extensions (e.g., atomic functions) are only provided by some hardware devices; the programmer has to take care of this and, hence, provide different implementations.
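Whether a device provides a required extension can be queried at run time before the corresponding kernel variant is selected. A minimal sketch (not code from the study):

#include <CL/cl.h>
#include <string.h>

/* returns non-zero if the device advertises the given extension,
   e.g. "cl_khr_local_int32_base_atomics" as used in Listing 3.2 */
int device_has_extension(cl_device_id device, const char *extension) {
    char extensions[4096];
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
    return strstr(extensions, extension) != NULL;
}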

Effort to get a certain result, for example, performance or throughput:
For good performance, the program has to be adjusted to utilize the fast on-chip local memory of the graphics processors. This is an additional factor that has to be considered compared to programming standard multi-core processors using OpenCL. The execution times obtained with OpenCL are, however, worse than those obtained with CUDA on the same graphics card.

Scalability of the framework with respect to architecture family, new hardware:
Table 3.4 shows the execution times of the fine-grained implementation using the GPU back end of OpenCL on the Quadro FX 5800, Tesla C 2050, and Radeon HD 5870 for different volume and image resolutions, and Figure 3.5 shows the corresponding speedup compared to the baseline implementation. It can be seen that the GPU back end of OpenCL scales well on the new Tesla C 2050, even for small problem sizes. The execution times on the Radeon HD 5870 fall far short of the NVIDIA cards, even though it has the highest single precision peak performance (2.7 TFLOPS vs. 933.12 GFLOPS and 1.288 TFLOPS). For big problem sizes, the implementation scales disproportionately high with the number of cores and the new architecture without any change to the source code.

Table 3.4: Execution times in seconds using OpenCL on the Quadro FX 5800, Tesla C 2050, and Radeon HD 5870.

  240 cores (Quadro)      128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             5.92             46.45            391.59
  + Local Memory                    0.47              1.97             13.12
  + 3D Texture                      0.43              1.71             10.98

  448 cores (Tesla)       128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             7.60             60.86            481.12
  + Local Memory                    0.36              1.33              8.23
  + 3D Texture                      0.37              1.39              8.35

  1600 cores (Radeon HD)  128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             8.18             18.59                 –
  + Local Memory                    1.37              3.08                 –
  + 3D Texture                         –                 –                 –


Figure 3.5: Speedup of the OpenCL implementation compared to the baseline implementation for different volume resolutions (128 × 128 × 94: Quadro FX 5800 14.92, Tesla C 2050 17.65, Radeon HD 5870 4.72; 256 × 256 × 189: 32.61, 40.17, 18.11; 512 × 512 × 378: 43.23, 56.82).

Scalability in terms of problem size:
As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Figure 3.6 shows the execution time ratio when moving to a bigger volume resolution. It can be seen that the ratio is far below 8 for both graphics cards when moving from the small to the middle volume resolution. Moving from the middle to the high resolution, the ratio is at a factor of 6. Hence, the implementation scales better for small problem sizes. However, this is mainly due to the high degree of parallelism required to utilize the graphics hardware best.

Resource awareness of framework (run-time system):
OpenCL for many-core processors is not resource aware. Kernels executed on the graphics hardware occupy the processors exclusively.


Figure 3.6: Scalability of the fine-grained implementation using the GPU back end of OpenCL: shown is the execution time ratio when moving to a different volume resolution (128 → 256: Quadro FX 5800 3.94, Tesla C 2050 3.79, Radeon HD 5870 2.25; 256 → 512: 6.42, 6.02).

3.4 CUDA

Framework

The Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by NVIDIA [LNOM08]. CUDA provides an application programming interface that allows harnessing the processing power of NVIDIA's graphics cards for data parallel non-graphics computations. CUDA extends the C language with some keywords to launch programs on the graphics card, to get unique identifiers of the threads executing on the graphics hardware, and to synchronize the execution between threads. CUDA provides a low-level driver API as well as a high-level runtime API with different levels of detail. The latest graphics architecture from NVIDIA also supports C++ on the graphics card [NVI09]. NVIDIA provides its own compiler nvcc for Windows, GNU Linux, and Mac OS X to compile CUDA source files that can later be linked with standard C/C++ code.


A compiler for Fortran CUDA is provided by the Portland Group. After the success of CUDA, OpenCL was created to define a similar hardware and vendor independent API, and NVIDIA also provided an OpenCL implementation. The original CUDA implementation has since then been called C for CUDA. OpenCL and CUDA code on the device are to a large extent the same, and only the keywords used differ. Therefore, the same approach is used to calculate the histogram in CUDA as was used in Section 3.3 for OpenCL. Listing 3.4 shows the corresponding kernels: __kernel and __local are replaced by __global__ and __shared__, __syncthreads() is used as barrier function, and the index is calculated using different variables, but the functionality is exactly the same. The CUDA SDK provides commands to initialize the graphics card, create the corresponding context, etc.:

// initialize GPU
CUT_DEVICE_INIT(argc, argv);

Unlike OpenCL, the launch of a kernel is done by a call to the kernel. The parameters are passed like in a normal function call, and the execution configuration for the kernel is specified between <<<...>>> brackets as seen in Listing 3.5.
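A minimal sketch of this host-side pattern (not code from the study; the kernel and all names are illustrative) combines allocation, transfers, and a launch:

#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void scale_on_gpu(float *host_data, int n) {
    float *dev_data;
    cudaMalloc((void **)&dev_data, n * sizeof(float));
    cudaMemcpy(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 threads(256);
    dim3 grid((n + threads.x - 1) / threads.x);
    scale<<<grid, threads>>>(dev_data, 2.0f, n);   /* the kernel call is the launch */

    cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}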

Evaluation
Time required to get familiar with a particular framework:
CUDA requires some time to get familiar with the terminology used and to understand the way the CUDA platform works. Starting with some simple examples is relatively easy since NVIDIA ships many examples with its SDK and the technology has matured in the meantime.
Effort of time to map reference code to a particular framework, how much code (percent) has to be changed or added:
In order to parallelize existing code, almost the complete source code has to be restructured and rewritten. The kernels are written in a language based on C/C++ and are compiled by NVIDIA's nvcc compiler. Similar to OpenCL, the invocation and implementation of parallel functionality are separated, and the kernel code operates only on one data element. The kernels are almost identical to the OpenCL ones apart from some keywords. The host code to manage the graphics card, launch kernels, etc., however, is much more compact compared to OpenCL. The time required for parallelization is less than for OpenCL.
Support of parallelization by the framework:
The parallelization is completely done by the framework; only the operations that should be applied to each data element in a set have to be defined.
Support of data partitioning by the framework:
The data partitioning is done by the user, that is, the operations within a CUDA kernel define the data partition that is assigned to one core. This is typically one data element. The tiling of the data set into partitions is done by the user by specifying the one- or two-dimensional size of the tile.


__global__ void cu_hist_smem(int *hist, float *img, int width) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    const int j = blockDim.y * blockIdx.y + threadIdx.y;
    int tid = threadIdx.x;

    __shared__ int tmp_hist[MAX_INTENSITY];

    /* initialize memory */
    if (tid < MAX_INTENSITY) {
        tmp_hist[tid] = 0;
    }

    __syncthreads();

    /* create histogram for one image line */
    if (i < width) {
        int img_val = (int) img[i + j*width];
        atomicAdd(&tmp_hist[img_val], 1);
    }

    __syncthreads();

    /* write histogram of the image line to global memory */
    if (tid < MAX_INTENSITY) {
        hist[(blockIdx.x + gridDim.x*blockIdx.y) * MAX_INTENSITY + tid] = tmp_hist[tid];
    }
}

__global__ void cu_hist_merge(int *hist, int *tmp_hist, int height) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;

    int count = 0;
    /* sum the per-line sub-histograms; the loop body was truncated in the
       original and is reconstructed here */
    for (int k = 0; k < height; k++) {
        count += tmp_hist[i + k*MAX_INTENSITY];
    }
    hist[i] = count;
}
Listing 3.4: CUDA kernels for the histogram calculation using shared memory.


dim3 threads(256, 1);
dim3 grid((int) ceil((float)width/threads.x), (int) ceil((float)height/threads.y));

/* calculate histogram */
cu_hist_smem<<<grid, threads>>>(tmp_hist, img, width);

/* merge histograms */
unsigned int num_subhist = height * grid.x;
grid.x = 1;
grid.y = 1;
threads.x = MAX_INTENSITY;
cu_hist_merge<<<grid, threads>>>(hist, tmp_hist, num_subhist);
Listing 3.5: CUDA kernel launch for histogram calculation.

Applicability for different algorithm classes and target platforms:
CUDA supports fine-grained data parallelism, but no coarse-grained task parallelism.
Advantages, drawbacks, and difficulties of a particular framework:
CUDA is currently the de facto standard for programming graphics cards for non-graphics problems. Its runtime API is easy to program and less error-prone compared to the low-level API of OpenCL. CUDA supports, however, only graphics cards from NVIDIA. Then again, NVIDIA favors CUDA over its own OpenCL implementation and is able to expose the latest features of its graphics cards in CUDA first. The same code also performs more efficiently in CUDA.
Effort to get a certain result, for example, performance or throughput:
For good performance, the program has to be adjusted to utilize the fast on-chip local memory of the graphics processors. This is an additional factor that has to be considered compared to programming standard multi-core processors. Similar to OpenCL, the logic of the program also depends on the problem size when shared memory is utilized.
Scalability of the framework with respect to architecture family, new hardware:
Table 3.5 shows the execution times of the fine-grained implementation using CUDA on the Quadro FX 5800 and Tesla C 2050 for different volume and image resolutions, and Figure 3.7 shows the corresponding speedup compared to the baseline implementation. It can be seen that CUDA scales well on the new Tesla C 2050, even for small problem sizes.


Table 3.5: Execution times in seconds using CUDA on the Quadro FX 5800 and Tesla C 2050.

  240 cores (Quadro)   128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             –                 –                 –
  + Local Memory                 0.28              1.01              8.27
  + 3D Texture                   0.26              0.84              5.27

  448 cores (Tesla)    128 × 128 × 94   256 × 256 × 189   512 × 512 × 378
  Naïve                             –                 –                 –
  + Local Memory                 0.25              0.74              4.68
  + 3D Texture                   0.26              0.66              3.32

Scalability in terms of problem size:
As discussed in Section 1.3, the number of instructions and the execution time increase roughly by a factor of 8. Figure 3.8 shows the execution time ratio when moving to a bigger volume resolution. It can be seen that the ratio is far below 8 for both graphics cards when moving from the small to the middle volume resolution. Moving from the middle to the high resolution, the ratio is at a factor of 6. Hence, the implementation scales better for small problem sizes. However, this is mainly due to the high degree of parallelism required to utilize the graphics hardware best.

Resource awareness of framework (run-time system):
CUDA is not resource aware. Kernels executed on the graphics hardware occupy the processors exclusively.

Resource management for multiple sub-algorithms (image pipeline):
CUDA does not provide support for resource management for pipelining.

Support of streaming by the framework:
CUDA does not provide streaming support for series of images.


Figure 3.7: Speedup of the CUDA implementation compared to the baseline implementation for different volume resolutions (128 × 128 × 94: Quadro FX 5800 24.7, Tesla C 2050 25.36; 256 × 256 × 189: 66.03, 84.09; 512 × 512 × 378: 90.04, 142.93).


Figure 3.8: Scalability of the fine-grained implementation using CUDA: shown is the execution time ratio when moving to a different volume resolution (128 → 256: Quadro FX 5800 3.31, Tesla C 2050 2.53; 256 → 512: 6.24, 5.01).


3.5 Ct

Intel announced a framework for data parallel programming on its own platforms called C for Throughput Computing (Intel Ct) [GSF+07]. Ct is similar to the approach taken by RapidMind: custom data types are introduced, and the Ct programs are dynamically compiled at run-time for the underlying hardware. Since Intel acquired RapidMind, it is merging RapidMind into Ct and will discontinue RapidMind afterwards. However, as mentioned in a Ct presentation by one of the developers working on Ct, Intel will not provide support for hardware from manufacturers other than Intel itself. Moreover, the focus of Ct will not be on speed, but on productivity and portability, and, hence, a performance difference of one order of magnitude compared to hand-tuned implementations should be expected (M. Klemm, personal communication). The first public beta of Intel Ct will be made available in Q3 2010. Therefore, Ct could not be evaluated as part of this study.

3.6 Larrabee

Larrabee is a many-core architecture based on simple x86 cores for visual computing, announced in 2008 by Intel [SCS+08]. Compared to graphics hardware from NVIDIA or ATI, Larrabee is more flexible: the programmer can influence the scheduling of tasks and threads. Larrabee also hosts L1 and L2 caches, similar to the new Fermi architecture from NVIDIA. To program Larrabee, either Intel Ct could be used or a low-level programming interface, also called Larrabee. The latter is similar to the programming of the Cell Broadband Engine, where assembly instructions inlined in C/C++ are used to program the vector units. However, Intel canceled the first generation of Larrabee graphics chips at the end of 2009. In the meantime, Intel discontinued Larrabee as a many-core architecture for visual computing and instead announced a successor architecture called Many Integrated Core (MIC), this time targeting only the HPC market. Therefore, Larrabee could not be evaluated as part of this study.

3.7 Related Frameworks

In this section, some other frameworks are listed that were not evaluated in this study, but that either harness the computational performance of graphics cards or provide services for development on graphics cards.

3.7.1 Bulk-Synchronous GPU Programming
Bulk-Synchronous GPU Programming (BSGP) is a programming language for general purpose computation on graphics cards [HZG08]. BSGP introduces a few keywords that integrate into the existing source code and define the number of threads to be spawned on the graphics card as well as synchronization points. BSGP was developed at Zhejiang University, and the BSGP compiler is freely available for Windows.

3.7.2 HMPP Workbench
The Heterogeneous Multi-core Parallel Programming (HMPP) Workbench is a compiler developed by CAPS that translates annotated source code for multi-core and many-core platforms [BMD+09]. The annotations are directives like in the PGI Accelerator model and can generate code for different back ends like standard shared memory multi-core processors, CUDA for NVIDIA graphics cards, as well as OpenCL for graphics cards from NVIDIA and ATI.

3.7.3 Goose
Goose is a compiler developed by K&F that translates annotated source code for multi-core and many-core platforms [goo]. The annotations are directives like in the PGI Accelerator model and can generate code for different accelerators like graphics cards from NVIDIA and ATI, as well as the GRAPE-DR. Support for further targets like OpenCL, Intel SSE Technology, and GRAPE-7 is planned.

3.7.4 YDEL for CUDA
Fixstars provides a GNU Linux solution targeted at productive CUDA systems and deployments. To this end, Fixstars provides Yellow Dog Enterprise Linux, which is optimized for GPU computing. That is, they provide packages for CUDA and allow seamless switching between different CUDA versions. Furthermore, they adapted the GNU Linux kernel for best performance on such systems. This comes at the expense of $400 per server and year. Their YDEL provides a consistent system environment for the deployment of solutions using CUDA.

3.8 Discussion

To summarize the results of the frameworks, the achieved performance of the many-core frameworks is compared. Figures 3.9, 3.10, and 3.11 show the execution times of the fine-grained implementation of each framework on different graphics cards for different volume and image resolutions. The graphs show that RapidMind and in particular PGI Accelerator achieve poor performance. While these frameworks take a high-level approach and abstract from the details of the underlying hardware, they suffer from poor performance. In contrast, OpenCL and CUDA are tied very closely to the underlying hardware and achieve about one order of magnitude better performance. The OpenCL implementation is only half as fast as CUDA, but this discrepancy is likely to diminish with future OpenCL versions.


In summary, the best scaling framework for all volume sizes is CUDA, followed by OpenCL. OpenCL has the further advantage of supporting multi-core processors. All frameworks but PGI Accelerator require the source code to be ported to a different language and adapted to the new hardware. The feedback of the PGI compilers is helpful, but some key features are missing in the current version, like support for device permanent data. RapidMind programs are concise, easy to read, and comprehensible. Unfortunately, the achieved performance is far from the performance of CUDA and OpenCL.

Figure 3.9: Execution times in seconds of the fine-grained implementations on the different graphics cards for the 128 × 128 × 94 volume (RapidMind: 7.22 and 5.82; PGI Accelerator: 9.13 (Quadro FX 5800) and 15.07 (Tesla C 2050); OpenCL: 0.43 (Quadro), 0.37 (Tesla), and 1.37 (Radeon HD 5870); CUDA: 0.26 on both NVIDIA cards).


Figure 3.10: Execution times in seconds of the fine-grained implementations on the different graphics cards for the 256 × 256 × 189 volume (RapidMind: 13.9 and 9.69; PGI Accelerator: 30.17 (Quadro FX 5800) and 39.91 (Tesla C 2050); OpenCL: 1.71 (Quadro), 1.39 (Tesla), and 3.08 (Radeon HD 5870); CUDA: 0.84 (Quadro) and 0.66 (Tesla)).


Figure 3.11: Execution times in seconds of the fine-grained implementations on the different graphics cards for the 512 × 512 × 378 volume (RapidMind: 51.08 and 22.63; PGI Accelerator: 141.33 (Quadro FX 5800) and 134.13 (Tesla C 2050); OpenCL: 10.98 (Quadro) and 8.35 (Tesla); CUDA: 5.27 (Quadro) and 3.32 (Tesla)).

4 Conclusion

In this study, different parallelization frameworks for standard shared memory multi-core processors as well as parallelization frameworks for many-core processors like graphics cards have been evaluated. The evaluation criteria were not limited to pure performance numbers; other aspects like the scalability or the productivity of a framework were considered as well. For evaluation, a computationally intensive application from medical imaging, namely 2D/3D image registration, has been chosen. In 2D/3D image registration, a preoperatively acquired volume is registered with an X-ray image. A 2D projection from the volume is generated and aligned with the X-ray image by means of translating and rotating the volume according to the three coordinate axes. This alignment step is repeated to find the best match between the projected 2D image and the X-ray image.

For multi-core platforms, two parallelization strategies were evaluated, namely fine-grained data parallelism and coarse-grained task parallelism. From the considered frameworks, only OpenMP, Cilk++, and Threading Building Blocks support both approaches, while only the fine-grained data parallelism could be realized using RapidMind and OpenCL. While for the coarse-grained approach all frameworks yield equally good results, the overhead of the fine-grained approach differs a lot. OpenMP and RapidMind come along with a lot of overhead for a small number of cores, while the other frameworks have only little overhead here. The best results for the fine-grained approach were obtained using TBB and OpenMP, followed by Cilk++. OpenCL was outstanding only for large problem sizes, though. In terms of productivity, OpenMP and Cilk++ required only minor modifications of the source code, while Threading Building Blocks required major restructuring of the source code. In contrast, RapidMind and OpenCL required a complete restructuring of the source code, and also the algorithm had to be expressed differently.

On many-core platforms, only fine-grained data parallelism is suitable and was, hence, investigated. All investigated frameworks, namely RapidMind, PGI Accelerator, OpenCL, and CUDA, support graphics cards from NVIDIA and mainly target their CUDA interface. Only RapidMind and OpenCL also support hardware from other manufacturers like graphics cards from ATI. While PGI Accelerator requires only minor modifications to the source code for parallelization and mapping to the graphics card, all other frameworks required a complete restructuring of the source code and a different way to express the algorithm. OpenCL and CUDA are very close to the underlying hardware and achieve the best results, whereas the performance of RapidMind and PGI Accelerator was far off. The CUDA implementation was still twice as fast as the OpenCL implementation.

In summary, the most promising framework is certainly OpenCL, targeting multi-core as well as many-core platforms with remarkable performance. While the implementation is not yet as mature as other frameworks like CUDA, this situation will change with time. Another benefit is that a relatively smooth transition from CUDA to OpenCL with almost no change to the computational kernels is possible. This makes OpenCL in the long term a serious alternative to CUDA as a parallelization framework for graphics hardware. In the multi-core world, there exist multiple popular alternatives besides OpenCL, like OpenMP and Threading Building Blocks. All of them have different strengths, and the framework choice depends strongly on the requirements and the existing environment.

Acknowledgment

We are indebted to the RRZE (Regional Computing Center Erlangen) and their HPC team for granting computational resources and providing access to preproduction hardware.


Bibliography

[Amd67] G.M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proceedings of the AFIPS Spring Joint Computing Conference, pages 483–485. ACM, 1967.

[BJK+95] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. ACM SIGPLAN Notices, 30(8):207–216, 1995.

[BMD+09] S. Bihan, G.E. Moulard, R. Dolbeau, H. Calandra, and R. Abdelkhalek. Directive-Based Heterogeneous Programming: A GPU-Accelerated RTM Use Case. CCCT, 2009.

[goo] Goose: Domain-Specific Compiler. http://www.kfcr.jp/goose-e.html.

[GSF+07] A. Ghuloum, E. Sprangle, J. Fang, G. Wu, and X. Zhou. Intel Whitepaper: Ct: A Flexible Parallel Programming Model for Tera-Scale Architectures. http://techresearch.intel.com/UserFiles/en-us/File/terascale/Whitepaper-Ct.pdf, Oct 2007.

[HZG08] Q. Hou, K. Zhou, and B. Guo. BSGP: Bulk-Synchronous GPU Programming. In ACM SIGGRAPH 2008 Papers, page 19. ACM, 2008.

[KDF+08] A. Kubias, F. Deinzer, T. Feldmann, S. Paulus, D. Paulus, B. Schreiber, and T. Brunner. 2D/3D Image Registration on the GPU. International Journal of Pattern Recognition and Image Analysis, 18(3):381–389, 2008.

[Lei09] C.E. Leiserson. The Cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference, pages 522–527. ACM, 2009.

[LNOM08] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39–55, 2008.

[MDT04] M. McCool and S. Du Toit. Metaprogramming GPUs with Sh. AK Peters, Ltd., 2004.


[Mun09] A. Munshi. The OpenCL Specification. Khronos OpenCL Working Group, 2009.

[NVI09] NVIDIA Corporation. NVIDIA Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, October 2009.

[Ope08] OpenMP Architecture Review Board. OpenMP Application Program Interface. OpenMP Architecture Review Board, May 2008.

[Ope09] OpenMP Architecture Review Board. Open Multi-Processing. http://openmp.org/, Oct 2009. Visited 23/10/2009.

[Rap09] RapidMind. RapidMind Development Platform Documentation. RapidMind Inc., June 2009.

[Rei07] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, Inc., 2007.

[SCS+08] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. In ACM SIGGRAPH 2008 Papers, page 18. ACM, 2008.

[TV98] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, New Jersey, 1998.

[Wol10] M. Wolfe. Implementing the PGI Accelerator Model. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 43–50. ACM, 2010.

[WPD+97] J. Weese, G.P. Penney, P. Desmedt, T.M. Buzug, D.L.G. Hill, and D.J. Hawkes. Voxel-Based 2-D/3-D Registration of Fluoroscopy Images and CT Scans for Image-Guided Surgery. IEEE Transactions on Information Technology in Biomedicine, 1(4):284–293, 1997.
