DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND LEVEL STOCKHOLM, SWEDEN 2015

Analysis of Automatic Parallelization Methods for Multicore Embedded Systems

FREDRIK FRANTZEN

KTH ROYAL INSTITUTE OF TECHNOLOGY INFORMATION AND COMMUNICATION TECHNOLOGY

Analysis of Automatic Parallelization Methods for Multicore Embedded Systems

Fredrik Frantzen

2015-01-06

Master’s Thesis

Examiner Mats Brorsson

Academic adviser Detlef Scholle

KTH Royal Institute of Technology School of Information and Communication Technology (ICT) Department of Communication Systems SE-100 44 Stockholm, Sweden

Acknowledgement

I want to thank my examiner Mats Brorsson and my two supervisors Detlef Scholle and Cheuk Wing Leung for their helpful advice and for making this report possible. I also want to thank the other two thesis workers, Andreas Hammar and Anton Hou, who have made the time at Alten really enjoyable.

Abstract

There is a demand for reducing the cost of porting legacy code to different embedded platforms. One such platform is the multicore system, which allows higher performance with lower energy consumption and is a popular solution in embedded systems. In this report, I have evaluated a number of open source tools that support the parallelization effort. The evaluation is made using a set of small, highly parallel programs and two complex face recognition applications that show the current advantages and disadvantages of different parallelization methods. The results show that the parallelization tools are not able to parallelize code automatically without substantial human involvement, and that it is therefore more profitable to parallelize by hand. The outcome of the study is a number of guidelines on how developers can parallelize their programs, and a set of requirements that serves as a basis for designing an automatic parallelization tool for embedded systems.

Sammanfattning

Det finns ett behov av att minska kostnaderna för portning av legacykod till olika inbyggda system. Ett sådant system är de flerkärniga systemen som möjliggör högre prestanda med lägre energiförbrukning och är en populär lösning i inbyggda system. I denna rapport har jag utfört en utvärdering av ett antal open source-verktyg som hjälper till med arbetet att parallellisera kod. Detta görs med hjälp av små paralleliserbara program och två komplexa ansiktsigenkännings-applikationer som visar vilka för- och nackdelar de olika parallelliseringsmetoderna för närvarande har. Resultaten visar att parallelliseringsverktygen inte klarar av att parallellisera automatiskt utan avsevärd mänsklig inblandning. Detta medför att det är lönsammare att parallellisera för hand. Utfallet av denna studie är ett antal riktlinjer för hur man ska göra för att parallellisera sin kod, samt ett antal krav som agerar som bas till att designa ett automatiskt parallelliseringsverktyg för inbyggda system.

Contents

Acknowledgement 1
List of Tables 7
List of Figures 8
Abbreviations 10

1 Introduction 11
  1.1 Background 11
  1.2 Problem statement 12
  1.3 Team goal 12
  1.4 Approach 12
  1.5 Delimitations 12
  1.6 Outline 13

2 Parallel software 14
  2.1 Programming Parallel Software 14
    2.1.1 Where to parallelize 15
    2.1.2 Using OpenMP for parallelism 17
    2.1.3 Using MPI for parallelism 19
    2.1.4 Using vector instructions for spatially close data 20
    2.1.5 Offloading to accelerators 20
  2.2 To code for different architectures 21
    2.2.1 Use of hybrid shared and distributed memory 21
    2.2.2 Tests on accelerator offloading 22
  2.3 Conclusion 22

3 Parallelizing methods 24
  3.1 Using dependency analysis to find parallel loops 24
    3.1.1 Static dependency analysis 25
    3.1.2 Dynamic dependency analysis 26
  3.2 Profiling 26
  3.3 Transforming code to remove dependencies 26
    3.3.1 Privatization of variables to remove dependencies 26
    3.3.2 Reduction recognition 27
    3.3.3 Induction variable substitution 27
    3.3.4 Alias analysis 28
  3.4 Parallelization methods 29
    3.4.1 Traditional parallelization methods 29
    3.4.2 Polyhedral model 29
    3.4.3 Speculative threading 30
  3.5 Auto-tuning 30
  3.6 Conclusion 31

4 Automatic parallelization tools 32
  4.1 Parallelizers 32
    4.1.1 PoCC and Pluto 32
    4.1.2 PIPS-Par4all 33
    4.1.3 LLVM-Polly 33
    4.1.4 LLVM-Aesop 33
    4.1.5 GCC-Graphite 34
    4.1.6 Cetus 34
    4.1.7 Parallware 34
    4.1.8 CAPS 34
  4.2 Translators 35
    4.2.1 OpenMP2HMPP 35
    4.2.2 Step 35
  4.3 Assistance 35
    4.3.1 Pareon 35
  4.4 Comparison of tools and reflection 36
    4.4.1 Polyhedral optimizers and performance 36
    4.4.2 Auto-tuning incorporation and performance 36
    4.4.3 Functional differences 37
  4.5 Conclusion 38

5 Programming guidelines for automatic parallelizers 39
  5.1 How to structure loop headers and bounds 39
  5.2 Static control parts 40
  5.3 Loop bodies 41
  5.4 Array accesses and allocation 42
  5.5 Variable scope 43
  5.6 Function calls and stubbing 43
  5.7 Function pointers 44
  5.8 Alias analysis problems: Pointer arithmetic and type casts 45
  5.9 Reductions 45
  5.10 Conclusion 46

6 Implementation 47
  6.1 Implementation approach 47
  6.2 Requirements 48

7 The applications to parallelize 50
  7.1 Face recognition applications 50
    7.1.1 Training application 50
    7.1.2 Detector application 52
  7.2 PolyBench benchmark applications 53

8 Results from evaluating the tools 54
  8.1 Compilation flags 54
  8.2 PolyBench results 54
  8.3 Parallelization results on the face recognition applications 59
  8.4 Discussion 61

9 Requirements fulfilled by automatic parallelizers 63
  9.1 Code handling and parsing 63
  9.2 Reliability and exposing parallelism 63
  9.3 Maintenance and portability 63
  9.4 Parallelism performance and tool efficiency 64

10 Conclusions 65
  10.1 Limitations of parallelization tools 65
  10.2 Manual versus Automatic parallelization 65
  10.3 Future work 66

References 67

List of Tables

4.1 Functional differences in the tools. 37
4.2 A rough overview of what the investigated tools take as input and what they can output. 38
6.1 The list of requirements for an automatic parallelization tool. 48
8.1 Compilation flags for the individual tools. 54
8.2 Refactoring time and validity of parallelized training application. 60
8.3 Refactoring time and validity of parallelized classification application. 60

List of Figures

2.1 Two parallel tasks are in separate critical sections, each holding a resource; when each requests the other's resource, a deadlock is created. 15
2.2 Parallelism in a loop. 16
2.3 A false sharing situation. 16
2.4 A sequential program split up into pipeline stages. 17
2.5 Pipeline parallelism, displaying different balancing of the stages. 17
2.6 Thread creation and deletion in OpenMP. [1] 17
2.7 A subset of OpenMP pragma directives. 18
2.8 Dynamic and static scheduling side by side. Forking and joining is done only once. 19
2.9 Example of a SIMD instruction. 20
2.10 An overview of different architectures. 21
3.1 Example of data dependencies, revealed after unrolling the loop once. 25
3.2 GCD test on the above code segment yields that there is an independence. 25
3.3 Example of a more difficult loop. 26
3.4 An example of a variable and an array that are only live within the scope of one iteration. 27
3.5 A reduction recognition example using OpenMP. 28
3.6 A simple example of induction variable substitution. 28
3.7 A simple example of a pointer aliasing an array. 29
3.8 Example code to illustrate dependence vectors. 29
3.9 A loop nest that has been transformed to be parallelizable. 30
5.1 Allowed loop bounds. 40
5.2 Disallowed loop bounds. 40
5.3 A loop that does not satisfy as a static control part because of the unpredictable branch. 40
5.4 A loop that satisfies as a static control part. 41
5.5 Critical region within the loop. 41
5.6 Critical region fissioned out of the loop. 42
5.7 Move private dynamic allocation inside the loop scope. 43
5.8 A is classified as shared, even though it is private in theory. 43
5.9 A is in a scope where it cannot be shared between the iterations over i, thus is private. 43
5.10 Function pointers should be avoided. 44
5.11 Two examples on how to complicate alias analysis. 45
5.12 Fission out the reduction. 45
7.1 Training application for face recognition. 51
7.2 Detector application for face recognition. 52
8.1 Results from Polybench benchmarks (part 1). Y axis is speed-up. 56
8.2 Results from Polybench benchmarks (part 2). Y axis is speed-up. 57
8.3 Results from Polybench benchmarks (part 3). Y axis is speed-up. 58
8.4 Speed-up on different numbers of cores on the training application after parallelization using the different tools. 60
8.5 Speed-up on different numbers of cores on the classification application after parallelization using the different tools. 61

Abbreviations

Abbreviation          Definition
CPU                   Central Processing Unit
CUDA                  Compute Unified Device Architecture, a platform for Nvidia devices
DSP                   Digital Signal Processor
GPU                   Graphical Processing Unit
Heterogeneous system  System containing different computing units
Homogeneous system    System containing multiple identical cores
HMPP                  Hybrid Multicore Parallel Programming, a standard for writing programs for heterogeneous systems
IP core               Intellectual Property core
MCAPI                 Multicore Communications API, a standard for communication between cores on-chip or on-board in embedded systems
MPI                   Message Passing Interface, a standard defining library routines for writing portable message passing applications
OpenACC               Standard for writing programs for heterogeneous systems
OpenCL                Standard for writing programs for heterogeneous systems
OpenMP                Standard for writing parallel programs for shared memory
SMP                   Symmetric Multiprocessor (homogeneous system using shared memory)

Chapter 1

Introduction

1.1 Background

The demand for high performance in embedded systems is increasing, but at the same time the systems need to be power efficient. A way to increase performance is to add cores to the system and decrease the frequency. Power consumption can then remain constant, but to get the performance it is important to utilize the cores. Today, applications are still written for single-core execution. To utilize the processing power of a many-core system, developers have to modify their software. This can be very time consuming and difficult, and the complexity is worsened if the software has grown large, with thousands of lines of code. The state of the art report [2] by the ITEA2/MANY [3] project concludes that there is no single architecture that will provide the best performance for all kinds of applications, only for a set of applications with known complexity and required resources. It also predicts that future embedded systems will consist of hundreds of heterogeneous Intellectual Property cores, which will execute one parallel application or even several applications running in parallel. Developing for these architectures will get more complex, and this makes it necessary to create tools that close the gap between hardware and software. One big gap is the parallelism that exists in hardware but not in software. Tools that help developers create parallel software are needed. One such tool is a compiler that analyses the code and automatically parallelizes it. This allows developers to reuse their existing code and continue developing software without having to think about the hardware architecture. There is also a wide range of tools that a developer can use together with a compiler to get a more optimized application or more knowledge of their application.
The study reported here was done as a master's degree project. It was conducted at Alten Sverige [4], an engineering and IT consulting firm whose main customers belong to the energy, telecommunication, manufacturing and automotive industries. The degree project is part of the ITEA2/MANY project, which is putting together a development environment that will allow more code reuse to lower the time-to-market for embedded systems development. To some extent it is also part of the ARTEMIS/Crafters [5] project, which is developing a framework for developing applications on many-core embedded systems.

1.2 Problem statement

Parallelization of code can be a complex task depending on the legacy application, and it can take a lot of time to move to a parallel platform. It is therefore necessary to investigate whether there are cost-effective alternatives to parallelizing code by hand, such as using automatic parallelization tools. Several automatic parallelization tools are available, but in the context of the MANY and Crafters projects it was unknown how to draw benefits from them. It was also of interest how the tools can be improved to increase the benefits of using them. The goal was to give a model for how automatic parallelization can be used in production. This report can be seen as a package containing guidelines and a knowledge base for using automatic parallelization tools. This will hopefully lead to a decrease in the amount of resources needed to port legacy code and serve as a basis for future improvements in automatic parallelization tools.

1.3 Team goal

During this degree project, a sub-project together with two other thesis workers was carried out. Each thesis worker studied a separate subject, and the goal was to combine the knowledge gained from these studies to design and develop a face recognition use case that makes use of the automatic parallelization tools investigated in this thesis and middleware components supplied by the other two workers. The other two technologies are run-time adaptive code using self-managing methods, and high-performance interprocess communication. The implementation was conducted on Linux on an x86 multi-core system. The use case application was used to validate the efficiency of the automatic parallelizers.

1.4 Approach

In this master's thesis, I have made an academic study of the state of the art in automatic parallelization of software. I investigated what methods there are to create parallel code, both manually and automatically, and looked at the current technologies and methods used in different automatic parallelizing tools. This includes material on compilers, parallel theory and scientific articles on parallelization. Different methods for parallelizing software were investigated and analyzed, but the focus was on automatic parallelization. The second half of the work consists of an evaluation of the automatic parallelization tools to get an insight into their usability. The result of the study is a comparison of existing automatic parallelizing tools that distinguishes the differences between the tools in terms of what parallelizing methods they use and their efficiency. An analysis was then carried out, using the results from the evaluation and the findings of the study, of how these tools can be improved and what technology should be incorporated in an automatic parallelization tool for embedded systems.

1.5 Delimitations

This report considered only the parallelization of sequential C code, since C is a widely used language in developing embedded systems. Furthermore, thread level parallelism for SMP systems was the main focus, but findings and discussions on how to target other systems are presented as well. Both tools that automatically parallelize code and tools that can improve the workflow when parallelizing by hand were investigated.

1.6 Outline

This report is divided into nine chapters excluding this introduction chapter. Chapter 2 describes the concept of parallelizing a program. This includes a description of how developers can parallelize their programs using libraries and compiler directives, concepts that developers should keep in mind when parallelizing, and an overview of the different system architectures that a developer can decide to target with a parallel program. Chapter 3 presents common techniques that are used in automatic parallelization compilers; a summary concluding the chapter reflects on their strengths and weaknesses and why one method might be more favorable than another. Chapter 4 gives the reader a summary of existing tools that can perform automatic parallelization or assist the developer in making a program perform better on a parallel system. Some of the tools do the same things as others and some do entirely unique things; a detailed map depicting how the tools differ is presented, together with data on why one tool is more favorable than another. Chapter 5 presents the refactoring steps required to take advantage of parallelization tools. Chapter 6 discusses the implementation approach needed for creating an automatic parallelizer that works efficiently on general problems. Chapter 7 presents the applications that were used in the evaluation of the selected automatic parallelization tools, and chapter 8 presents the results from the evaluation. In chapter 9, the selected tools are compared against the requirements identified in chapter 6, to get a basis for what improvements are necessary to make them useful. Chapter 10 presents the conclusions that can be drawn from the work carried out during this thesis.

Chapter 2

Parallel software

Parallelizing software had been a research topic for decades before the multi-core revolution, but now it is more relevant than ever. Since the CPU frequency increase in computer systems has begun to stall at about 3.5 GHz, it has become more interesting to add more cores. This allows computer systems to perform better using less power. If the frequency (f) of a system is halved, the voltage (V) can be lowered, which leads to the power consumption (P) becoming an eighth of that of the original system (see Equation 2.3). By adding an additional core, the theoretical performance is about the same as that of the original system, while only a fourth of the power is consumed. Adding two more cores, the performance could be double that of the single core system while only consuming half of the power.

P = C · V² · f    (2.1)
V = a · f    (2.2)
P = C · a² · f³ / 2³    (2.3)

The performance in practice is however a different thing. Software has yet to be written to utilize multi-core architectures efficiently. Most of the software today is written for single thread execution and has to be rewritten for the new architectures to achieve this performance. Software is also limited by Amdahl's law, which states that the speed-up (S) of a program is limited by the proportion of the program that is not parallelizable (1 − P) (see Equation 2.4). This can be seen in the formula: as the number of cores (N) increases, the second term in the denominator moves towards zero and no longer affects the performance of the program significantly.

S = 1 / ((1 − P) + P/N)    (2.4)

Therefore the important question is how much of the code can be parallelized, and thus methods for finding parallelism are needed. This chapter gives the reader an introduction to these methods and an understanding of how one can go about parallelizing a program or porting a program to a multi-core architecture.

2.1 Programming Parallel Software

There are several programming languages in existence today, and a large subset of them also supports parallel programming in different ways. To mention a few, there are C, Ada, Java, Haskell and Erlang.

Figure 2.1: Two parallel tasks are in separate critical sections, each holding a resource; when each requests the other's resource, a deadlock is created.

These languages are interesting from a parallel programming perspective in different ways. This thesis will only look at C, which is one of the most popular languages, especially for embedded systems. C is a very low level language compared to the others mentioned, and has little abstraction for parallel programming without extensions. As of the C11 standard, however, it is possible to use a standard thread library that does not require a POSIX based system.
To write parallel programs in C, a thread library such as the standard one or the POSIX threads library (pthreads) can be used. This gives the developer full support for programming tasks that execute in parallel. But programming for parallel systems is not trivial. When the developer wants tasks to share a resource, several problems can occur. The parallel tasks cannot read and write the resource in whichever way they like, because this will result in race conditions. A race condition can occur if a task writes to a shared resource and plans to read it in the near future: another task that is also using the resource can write to it before the first task reads it. This means that the program will be non-deterministic at run-time. To the developer, it can look like there is no problem with the program, especially when programming on a single core system, where tasks run concurrently but not in parallel. To prevent this, the developer can use protection mechanisms, e.g. locks or semaphores, to surround a critical section, so that a task that enters this section is guaranteed that the shared resource is not modified while it is in use. Using locks can however impose other problems. A deadlock occurs when two tasks each want a resource the other task is holding. See Figure 2.1 for a visual explanation: Task1 holds R1 and wants to get R2, but R2 is already held by Task2, and Task2 wants R1. What happens is that both tasks end up in a waiting state, and execution cannot proceed.
An alternative to programming threads in this error prone way is to use an annotation based approach with OpenMP [1]. The annotations are pragma statements in C code that are handled by the compiler. When the compiler parses the statements, it inserts low level code that implements the intended functionality. This method is less verbose and maps better to the parallel paradigm. The benefit of using this method is that it abstracts much of the required synchronization, meaning that the problems of race conditions and deadlocks can be avoided to some degree; however, it does not make the programmer completely safe. As a side note, to be able to program with OpenMP the developer needs to use a compiler that supports the OpenMP pragmas.
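As a minimal sketch of the race condition and its fix (the counter variable and the iteration count are illustrative, not taken from any particular application), the following C fragment loses increments if the unprotected update is used, while the OpenMP critical section keeps every increment:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        long counter = 0;
        long i;

        #pragma omp parallel for
        for (i = 0; i < 1000000; i++) {
            /* Unprotected: counter++; two threads can read the same old
               value and one of the increments is lost (a race condition). */

            /* Protected: only one thread at a time enters the critical
               section, so the final value is always 1000000. */
            #pragma omp critical
            counter++;
        }

        printf("counter = %ld\n", counter);
        return 0;
    }

Compiled with an OpenMP-capable compiler (e.g. gcc -fopenmp), removing the critical directive typically makes the printed value fall short of 1000000.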

2.1.1 Where to parallelize

There can be several places in the software that can be parallelized. Loops are the most common target for refactoring when parallelizing a program, because a lot of the execution time is spent in these sections. The potential speed-up of a loop is equal to the number of iterations. Figure 2.2 shows how a loop can be split up into workers that execute a number of iterations each.

Figure 2.2: Parallelism in a loop.

When parallelizing loops there are several problems that need to be avoided. Loops typically process a data set where elements are stored spatially close in memory. On a symmetric multiprocessing unit it is common to have some form of cache coherence. Cache coherence makes sure that cores that are working on the same memory always have the latest version of a memory block. If one core writes to a memory address, the data will be written to a cache line first, since it may be written or read again in the near future. When the core has written to its cache, the line has to be invalidated in all the other caches used by the neighboring cores, since their copy of the data is no longer the latest version. In a cache coherent system, false sharing can occur. It means that a cache line in a neighboring cache is invalidated by accident. When a memory address is read, the elements that are spatially close to it are loaded into the same cache line. Figure 2.3 shows a simple example of a cache line containing two data blocks. When core 1 writes to A, the cache line for core 2 is invalidated even though core 2 is only interested in B.

Figure 2.3: A false sharing situation.
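The situation in Figure 2.3 can be reproduced with per-thread counters stored next to each other in one array. The sketch below is illustrative (the 64-byte cache line size is a common but architecture-dependent assumption); padding each counter to its own cache line removes the accidental invalidations:

    #include <omp.h>

    #define NUM_THREADS 4
    #define CACHE_LINE  64   /* assumed cache line size in bytes */

    /* Adjacent counters share a cache line: every update by one thread
       invalidates that line in the other threads' caches (false sharing). */
    long hits[NUM_THREADS];

    /* One counter per cache line: updates no longer disturb other threads. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };
    struct padded_counter hits_padded[NUM_THREADS];

    void count_events(long n)
    {
        #pragma omp parallel num_threads(NUM_THREADS)
        {
            int id = omp_get_thread_num();
            long k;
            for (k = 0; k < n; k++)
                hits_padded[id].value++;  /* use hits[id]++ to see the slowdown */
        }
    }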

Another pattern to look out for in a sequential program is the pipeline (see Figure 2.4). This pattern assumes that there is a batch of data that needs to be processed. A pipeline processes one data element at a time, in several stages. When the first message is received at the second stage, a second message can be handled at the first stage simultaneously. By splitting up a program into pipeline stages, the potential speed-up is equal to the number of pipeline stages, but in reality the load balancing of the pipeline stages has to be perfect to get that performance. The balancing in terms of execution time is shown in Figure 2.5: A shows a perfectly balanced pipeline, and B illustrates the stalling pipeline problem. Replicating a pipeline stage may give a non-stalling pipeline, as seen in C, but performance is still lost because threads have to go idle. This pattern is harder to handle automatically and has to make use of techniques similar to that of Tournavitis et al. [6], where automatic tuning is used to find the optimal load balance.

Figure 2.4: A sequential program split up into pipeline stages.

Figure 2.5: Pipeline parallelism, displaying different balancing of the stages.

2.1.2 Using OpenMP for shared memory parallelism

OpenMP [1] is a standard that provides an extension to C. It provides a simple and flexible interface for handling threads. It is an API that consists of a set of compiler directives, library routines and environment variables that influence run-time behaviour. The compiler directives hide the complicated parts such as synchronization of threads, resource sharing and thread scheduling. This report will return to OpenMP and its directives repeatedly, since it is a popular output format for the automatic parallelizing compilers presented in chapter 4.

Figure 2.6: Thread creation and deletion in OpenMP. [1]

The OpenMP parallel directive is used to fork a number of threads, defined either by the developer or by an environment variable, that execute the region following the directive in parallel. The forked threads execute until they reach a synchronization clause such as the barrier directive, where they wait until all threads have finished executing. If this happens at the end of a parallel region (where barriers are implicitly inserted), the threads are joined with the master thread and the application continues running on a single thread. Otherwise the threads continue executing until the next synchronization clause is reached. Figure 2.6 displays the fork and join model. OpenMP also supports a tasking model, where directives in the code put tasks on a work queue. Each thread can put more work on the queue, and when all tasks are finished the threads continue. This model can be used for recursive algorithms. This study will mainly look at the work sharing constructs defined by OpenMP.

#pragma omp parallel
{
    // Code executed by every thread in the team
}

#pragma omp for schedule((static)|(dynamic)|(guided), chunk_size)
for (i = 0; i < N; i++) {
    // Iterations are divided among the threads of the enclosing parallel region
}

Figure 2.7: A subset of OpenMP pragma directives.

The work sharing constructs in OpenMP are directives that define which thread is going to execute which part of the region. By using the parallel for directive (followed by a for loop), you can execute the iterations of a loop in parallel, assigning each thread a number of iterations to execute. Iterations can be scheduled either at compile time or at run-time, as defined by the developer using the OpenMP schedule clause. The directives mentioned so far can be found in Figure 2.7. There are three scheduling kinds in OpenMP: static, dynamic and guided. The static schedule determines at compile time which iterations are going to be executed by which thread; the work of the loop is split up into chunks of iterations, where the size of a chunk is determined in the schedule clause. In the dynamic schedule, the iterations are scheduled at run-time. The benefit of using a dynamic schedule over a static one is that the work load will be better balanced over all threads, as shown in Figure 2.8: when a thread is out of work, it can request more work from the scheduler. In contrast, with the static schedule, a thread that has executed all its iterations has to go idle until the other threads have finished. The drawback of the dynamic schedule is the additional overhead of assigning chunks to threads at run-time. The guided schedule is in principle the same as the dynamic schedule; the difference is that the chunk sizes it hands out to threads vary.
Like pthreads, OpenMP does not make the developer safe from race conditions. A variable can be shared between threads using the shared clause. This can be useful if a variable is only going to be read by several threads. But if writes are going to be performed on the variable, it is up to the developer to either specify it as private, to let each thread have its own private copy of the variable, or to insert a critical section clause.

Figure 2.8: Dynamic and static scheduling side by side. Forking and joining is done only once.
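As an illustrative sketch of the clauses described above (N, the work function heavy_work() and the chunk size 4 are placeholders, not taken from the thesis applications), the fragment below combines a dynamic schedule with a private temporary and a critical update of a shared variable:

    double total = 0.0;
    double tmp;
    int i;

    /* Chunks of 4 iterations are handed out at run-time (dynamic schedule);
       tmp is private, so each thread has its own copy, while the update of
       the shared variable total is protected by a critical section. */
    #pragma omp parallel for schedule(dynamic, 4) private(tmp) shared(total)
    for (i = 0; i < N; i++) {
        tmp = heavy_work(i);
        #pragma omp critical
        total += tmp;
    }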

Forking and joining threads may lead to significant overhead if they are placed inside nested loop statements. Creating threads is an expensive operation [7]. If the creation of threads is made inside a loop, the threads are created as many times as there are iterations of the outer loop. This can create significant overhead if the region that executes in parallel is short, making the program slower. A better approach is to move the creation of threads out to the outer loop, since this results in only one instance of forking threads.
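A minimal sketch of this hoisting, assuming a doubly nested loop over placeholder arrays A and B and a work function f(): in the first version a thread team is forked and joined M times, in the second it is forked only once while the inner loop is still work-shared.

    /* Costly: threads are forked and joined on every outer iteration. */
    for (i = 0; i < M; i++) {
        #pragma omp parallel for
        for (j = 0; j < N; j++)
            B[i][j] = f(A[i][j]);
    }

    /* Cheaper: one parallel region around the outer loop; each thread runs
       the outer loop redundantly (i must be private) and the inner loop is
       split among the threads by the omp for directive. */
    #pragma omp parallel private(i)
    for (i = 0; i < M; i++) {
        #pragma omp for
        for (j = 0; j < N; j++)
            B[i][j] = f(A[i][j]);
    }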

2.1.3 Using MPI for distributed memory parallelism

MPI [8], the Message Passing Interface, is a standard that defines an API for interprocess communication. MPI is a useful abstraction for coarse-grain parallelism, i.e. when the work can be divided into separate tasks that communicate with each other to a small degree. The advantage that MPI has over OpenMP is that nothing is assumed about the underlying architecture, which means that the application can be deployed on any system, while OpenMP only executes in parallel on SMP architectures (although version 4 of the standard [9] adds support for heterogeneous systems with shared memory). An MPI application can also be distributed over several systems simultaneously, since the links connecting two tasks of a program hide the location of a task. Several implementations of the MPI API exist. A full MPI implementation is too big to be included in most embedded systems, but there are libraries that implement a subset of functions with similar functionality to that of the MPI API, such as MCAPI [10]. The disadvantage of using MPI is that there will be additional overhead when tasks that are running on the same SMP unit communicate. When processes are running on the same node, sending data with MPI will use shared memory. Although the memory is shared, there will be a copy from the send buffer into the shared memory and then a copy from the shared memory to the receive buffer. Therefore, MPI is better suited for applications that do not have to send a lot of data around, or that are not running on a cache coherent system.
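As a minimal sketch of the message passing style (the payload and tag values are arbitrary), the program below sends one integer from rank 0 to rank 1 and would be started with an MPI launcher such as mpirun with two processes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Send one int to rank 1, message tag 0. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Blocking receive of the matching message from rank 0. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }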

2.1.4 Using vector instructions for spatially close data

The finest grain of parallelism is when each data element in an array can be processed in parallel. Vectorization, or SIMD (Single Instruction Multiple Data), is a technique that makes it possible to execute one instruction on several elements that are spatially close in memory (see Figure 2.9). This requires that the hardware has support for these instructions. The width, i.e. the number of elements that can be processed at a time, varies. Many popular compilers, e.g. GCC, have support for automatically inserting vector instructions in trivial cases. A trivial case can be a loop that contains a chain of binary operations (addition, multiplication, etc.) performed on each data element in an array. These instructions can be replaced with vector instructions, lowering the iteration count of the loop.

Figure 2.9: Example of a SIMD instruction.
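A loop of the trivial kind described above is sketched below (the function name and signature are illustrative). The C99 restrict qualifiers tell the compiler that the arrays do not alias, which makes it easier for GCC to vectorize the loop at -O3, where -ftree-vectorize is enabled:

    /* Independent element-wise adds: groups of scalar additions can be
       replaced by one SIMD addition each. */
    void add_arrays(float * restrict c, const float * restrict a,
                    const float * restrict b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }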

2.1.5 Offloading to accelerators

Many architectures are heterogeneous, i.e. a CPU is combined with an accelerator of some kind. PCs typically have a Graphical Processing Unit (GPU), and embedded systems have a whole range of different accelerators such as Digital Signal Processing units (DSP). Accelerators are highly specialized for a particular type of problem, which often means that they can execute it faster than the CPU. Lately it has become very common to make use of accelerators, especially GPUs, for general purpose applications. To program accelerators there are several libraries such as OpenCL [11] or CUDA [12]. CUDA is used for programming Nvidia GPUs and OpenCL is designed for programming accelerators in general. Similar to pthreads, programming with these languages can become complex, and they are quite verbose compared to OpenMP pragmas. Annotation based approaches similar to OpenMP are also available, such as OpenACC [13] and OpenHMPP [14]. They provide compiler directives that make it possible to offload pieces of the execution onto an accelerator. Which accelerator to offload to is specified in the directive, and the compiler is then responsible for inserting accelerator code adapted to the specified accelerator. The two annotation languages are very similar; OpenACC is heavily influenced by the OpenHMPP directives. OpenHMPP was first developed by CAPS for their own compiler, but later several companies working with accelerators created a committee that together developed the OpenACC standard [13]. Currently OpenHMPP has more directives than the OpenACC standard, but it has not gained popularity. OpenACC has slowly been adopted by GCC, but it is currently only able to target the host (CPU) and not accelerators. In the latest OpenMP version (4.0), support for accelerators has been added. It is not implemented in any compiler yet, although some preliminary implementations have been made [15]. The next version of GCC will support OpenMP 4.0, but just as for OpenACC the only possible target will be the host [16].
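As a sketch of what such an annotation looks like (the arrays, their length n and the loop body are placeholders), an OpenACC directive that offloads a simple loop could be written as follows; the copyin/copyout clauses describe the data movement between host and accelerator memory:

    /* Copy a and b to the accelerator, execute the loop there in parallel,
       and copy the result c back to the host. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];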

2.2 To code for different architectures

There exist several different hardware architectures. The OpenMP, MPI and OpenACC standards solve problems in different domains, and it therefore makes sense to combine them. Each standard is good within its domain, and they can complement each other.

Figure 2.10: An overview of different architectures.

In Figure 2.10 an overview of different architectures is given. A) depicts a system with four threads using OpenMP to take advantage of shared memory, and B) depicts an MPI implementation on a system using distributed memory. C) and D) display two different setups of a heterogeneous system: C) displays one thread that is offloading tasks to a number of threads on a connected accelerator, and D) displays a heterogeneous system where accelerator threads and the CPU share memory. The coding standards for accelerators cannot be used to program multiple CPU threads. E) shows the combination of using MPI together with OpenMP. F) is a system that is able to take full advantage of the parallelism in the hardware; it requires the developer to program with the previously mentioned standards to achieve this mapping in software, which can become very complex depending on how the system is set up. To give some examples of architectures: ARM, which has a big market share in mobile units, has released an SMP processor called Cortex-A53 [17], which corresponds to the system depicted in A). This processor supports being connected to one additional ARM processor, which creates a system similar to what is depicted in E). The system depicted in F) is similarly complex to Adapteva's Parallella board [18], which has an ARM Cortex-A9 dual-core [19] together with a Zynq-7000 series FPGA from Xilinx [20] with the capability to use shared memory. It also has a co-processor developed by Adapteva called Epiphany IV [21], which consists of 64 accelerator cores.

2.2.1 Use of hybrid shared and distributed memory

A common approach is to create a distributed program with MPI where each component internally uses OpenMP to benefit from shared memory. This is called hybrid programming. Comparisons of the performance of hybrid MPI/OpenMP programming against a purely distributed MPI implementation have been made by Jin et al. [22] and Rabenseifner et al. [23]. In summary, they show that using the hybrid combination instead of a pure distributed implementation does not always increase performance.

Rabenseifner et al. [23] have tested different set-ups of distributed models. They compared a pure MPI implementation with a hybrid MPI/OpenMP implementation, and in their results the hybrid implementation outperforms the pure MPI implementation. There are, however, multiple issues with combining the standards, as acknowledged by Rabenseifner et al. [23]: the OpenMP parallel region either has to join into the master thread for MPI communication, or communication has to be overlapped with the computations. The former has the disadvantage of having to keep threads idle during communication. Jin et al. have shown that a pure MPI implementation performs better on big clusters, and that the limitation of the hybrid version is the data locality in the implementations used.
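A minimal sketch of the hybrid style (the data decomposition and names are illustrative, and the MPI library is assumed to have been initialized with MPI_Init_thread and at least MPI_THREAD_FUNNELED support): each MPI process reduces its own chunk with OpenMP threads in shared memory, and the per-process results are then combined across nodes with MPI.

    #include <mpi.h>

    /* Sum the local chunk with OpenMP threads, then combine the partial
       sums of all MPI processes. */
    double hybrid_sum(const double *chunk, long n)
    {
        double local_sum = 0.0, global_sum = 0.0;
        long i;

        #pragma omp parallel for reduction(+:local_sum)
        for (i = 0; i < n; i++)
            local_sum += chunk[i];

        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        return global_sum;
    }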

2.2.2 Tests on accelerator offloading

Liao et al. [15] compared a preliminary implementation of the OpenMP 4.0 standard in the Rose [24] compiler infrastructure with the PGI [25] and HMPP (CAPS) [26] compilers, where the code was annotated with OpenACC. There was also an OpenMP implementation that did not use accelerators but made use of the 16 cores of the Xeon processor. The accelerator code was run on a Nvidia Tesla K20c. The comparison shows a significant speed-up on matrix multiplication compared to the 16 core OpenMP implementation, visible for matrices bigger than 1024x1024 elements. The paper also shows how much time is spent on preparing the kernel for the matrix multiplication: when the kernels were small (matrices smaller than 512x512 elements), the preparation accounted for 50% or more of the execution time. In the second test, the computation is not heavy and the communication cost overshadows the computation cost; here the accelerator code does not outperform the sequential version until the vector size reaches 100M floats. The third test is computationally heavy, but the threads on the GPU are not used optimally. Adding the collapse clause in OpenACC, which the HMPP compiler supports, showed that it can outperform the 16 core OpenMP implementation, although this is only visible for matrices of size 1024x1024 and above; it is at this point that the sequential version gets outperformed by the other accelerator implementations as well.
This shows that a disadvantage of using accelerators is that copying is needed between the memory of the accelerator and the main thread, since they typically do not share memory. It means that accelerators should be used when there is a lot of data to be processed in computationally heavy kernels, so that the cost of copying is outweighed by the speed-up due to parallelism. Another alternative is to use architectures where accelerators and CPUs share memory, like the hybrid core developed by Convey Computer [27], which combines a CPU with an FPGA accelerator, or AMD's [28] APU, which combines the CPU with the GPU; this type of system is depicted in Figure 2.10.D. Another problem with accelerators is that they are very target dependent. The portability versus performance of accelerator programming has been discussed by Saà-Garriga et al. [29] and Dolbeau et al. [30], who confirm this point.

2.3 Conclusion

This chapter has presented several techniques for creating parallel software, such as OpenMP, MPI, OpenACC and vectorization. These simplify the work for a developer who parallelizes by hand, but there are still multiple problems that the developer has to solve, such as load balancing, inefficient use of caches and overheads. It is not obvious how to parallelize code for optimal performance, and thus tools that can parallelize code automatically become a valid alternative to manual coding. Potentially, a general version of the source code could be ported and optimized for multiple platforms without the involvement of the developer.
OpenMP has well defined library functions for dealing with SMP systems. It is also an active standard that will soon incorporate heterogeneous systems. Furthermore, compilers have the functionality to deal with load balancing using the dynamic scheduling clause. Thus OpenMP will be the standard that is the main focus in this report.

Chapter 3

Parallelizing methods

The popular compilers already incorporate several optimization transforms and parallelization techniques; automatically inserting vector instructions is one example. New techniques have been researched and tested on research compilers and many are still in an experimental state, but some of these techniques may be ready for production compilers. This chapter gives the reader an overview of the different techniques used in compilers today, both in products and in research.
The compiler's main task is to take source code and produce binaries that can be run on a target hardware architecture. However, the compiler's responsibility is more than that. The compiler is able to detect syntactical errors and some semantic errors in the source code. The compiler is also able to reorganize code so that the processor is kept busy while fetching data from memory. Currently the main memory is the big bottleneck in computers, and fetching from it slows the execution; keeping a value that will be used again soon in a register is one way of reducing the number of accesses to memory, and this is one particular decision a compiler is able to make. This thesis is not about the common optimization techniques compilers use to speed up a program; it is instead focused on the particular methods that are useful for parallelizing a program.

3.1 Using dependency analysis to find parallel loops

Analysis of the program is needed to find parallelism. There are two kinds of analyses a compiler can do: static dependency analysis and dynamic dependency analysis. Dependency analysis is an important step in order to find parallel regions. A parallel program does not impose a specific order in which the threads will execute; therefore the data the threads handle has to be independent. The dependencies come from the use of shared memory between the threads. There are three types of dependencies: read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW) dependencies. In Figure 3.1, a simple loop is unrolled once to display the three different dependencies mentioned. A RAW dependency in a loop means that the iteration that does the read access has to come after the iteration that does the write access; if the read comes before the write, the read access will read the old data that was stored at that memory address. Similarly, the WAR dependency means that a write has to occur after a read, or the read access would read the overwritten data. A WAW dependency needs the writes to be in order, or a future read access will read the wrong data. Static dependency analysis relies on looking at the source as it is (in an intermediate representation form) and finds memory dependencies by using complex algorithms. In contrast, dynamic dependency analysis looks at how the program executes and finds dependencies by looking at which memory addresses are accessed at run-time.

for (i = 1; i < N - 1; i++) {
    A[i] = A[i-1] + 1;
    x    = B[i+1];
    B[i] = x * 2;
}

/* The loop body unrolled once (iterations i and i+1): */
A[i]   = A[i-1] + 1;
x      = B[i+1];
B[i]   = x * 2;
A[i+1] = A[i]   + 1;   /* RAW: A[i] was written in the previous iteration */
x      = B[i+2];       /* WAW: x is written again in every iteration      */
B[i+1] = x * 2;        /* WAR: B[i+1] was read in the previous iteration  */

Figure 3.1: Example of data dependencies, revealed after unrolling the loop once.

3.1.1 Static dependency analysis

Loop dependencies are a hot topic in automatic parallelization research, as loops are the regions where a program can potentially run in as many threads as there are iterations. But if there are dependencies between iterations, the loop has to be executed serially. To find out whether a dependency exists (when it is non-trivial), data dependence tests can be used. Finding data dependencies exactly is an NP-hard problem, meaning that it cannot in general be solved in reasonable time. Instead, approximations have to be made; these algorithms are called data dependency tests. Such a test can prove the absence of a dependence for a subset of the problems in polynomial time. A dependency has to be assumed whenever independence cannot be proven; this guarantees that the program will execute correctly. In Figure 3.2, a simple dependency test called the greatest common divisor test (GCD test) is illustrated. The test states that, given two array accesses A[a · i + b] and A[c · i + d], a dependence can exist only if GCD(a, c) divides (d − b); if it does not, the accesses are independent.

for (i = 1; i < N; i++) {
    A[2*i] = A[2*i + 1] + B[i];
}
/* GCD(2, 2) = 2 does not divide (1 - 0) = 1, so the write A[2*i] and the
   read A[2*i + 1] can never touch the same element: independence is proven. */

Figure 3.2: GCD test on the above code segment yields that there is an independence.

These tests do not say when a dependency occurs, only that one may occur. The loop in Figure 3.3.A contains a non-linear term in the subscript of A, which is not handled by some dependency test algorithms, so a dependency will be assumed here. In this case there is a WAR dependency between the first two iterations, but the rest are dependency free. This means that the loop could have been parallelized if the first iteration had been hoisted out (see B), or if each thread had executed a chunk of two iterations at a time (see C). Popular data dependency tests today are the GCD test, I-test, Omega test, Banerjee-Wolfe test and range test. Kyriakopoulos et al. [31] have compared the performance of these dependency tests, together with a new test of their own. Their results show that each test is good in a different way: the I-test is fast and resolves many cases, while the Omega test is slower but resolves some cases that the I-test cannot.

/* A) The subscript i*i is non-linear; the only actual dependency is a WAR
      between iterations 0 and 1 (iteration 0 reads A[1], iteration 1 writes A[1]). */
for (i = 0; i < N; i++) {
    A[i*i] = A[i*i + 1];
}

/* B) With the first iteration hoisted out, the remaining loop is parallel. */
A[0] = A[1];
#pragma omp parallel for
for (i = 1; i < N; i++) {
    A[i*i] = A[i*i + 1];
}

/* C) Alternatively, chunks of two iterations keep the dependent pair
      of iterations on the same thread. */
#pragma omp parallel for schedule(static, 2)
for (i = 0; i < N; i++) {
    A[i*i] = A[i*i + 1];
}

Figure 3.3: Example of a more difficult loop.

3.1.2 Dynamic dependency analysis

In contrast to static dependency analysis, dynamic dependency analysis looks at the actual accesses to memory. If two iterations of a loop read and write the same memory address during execution, there is a dependency. With this method it is possible to find more parallelizable loops than with static dependency analysis: loops that may contain a dependency according to a static dependency test can be shown not to exercise that dependency within the loop range. The problem with dynamic dependency analysis is that it can be overly optimistic and classify loops that contain a dependency as parallel, because the given input data did not produce the memory access pattern that might occur in other situations. This can be a huge problem, and to work around it the test input has to cover all memory access patterns to guarantee that a race condition cannot occur in the future. Another drawback of dynamic analysis is that running the program to gather the data can be very time consuming, depending on the program, and is not something one wants to do on every compilation.

3.2 Profiling

Profiling is a dynamic way to get more knowledge of how a program executes in terms of control flow and memory usage. It is done at run-time, where different methods can be used, such as instrumenting the code or sampling the program counter at regular intervals. The methods can, for example, count the number of jumps to a function (edge counters), the branch prediction miss rate and the cache miss rate. Execution times can also be monitored. With this functionality, the developer can see after an execution where a program spends most of its execution time, which can be used as a hint to where he or she should try to optimize. It can also be used by a compiler to decide whether parallelization is beneficial or not when optimizing the program.

3.3 Transforming code to remove dependencies

Some recurring dependencies can be removed trivially, which may make the loop parallel. The following sections present some of them.

3.3.1 Privatization of variables to remove dependencies

If a loop region is to be executed in parallel, some of the dependencies that were introduced only for sequential efficiency can be removed. An array can, for example, be allocated outside of a loop region and then be reused during execution; this reduces the memory usage. When running in parallel, this results in dependencies, because the same memory is written and read in every iteration. Using liveness analysis, it might be detected that the variable or array is live only during one iteration at a time. This particular example can be seen in Figure 3.4.

float A[N];
float s = 0;
for (i = 0; i < M; i++) {
    float t = f(i);
    for (j = 0; j < N; j++)
        A[j] = t * g(i, j);
    for (j = 0; j < N; j++)
        s += A[j];
}

/* A and t are only live within one iteration of the outer loop,
   so they can be moved into the loop scope (privatized): */
float s = 0;
for (i = 0; i < M; i++) {
    float A[N];
    float t = f(i);
    for (j = 0; j < N; j++)
        A[j] = t * g(i, j);
    for (j = 0; j < N; j++)
        s += A[j];
}

Figure 3.4: An example of a variable and an array that are only live within the scope of one iteration.
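A parallelizer (or the developer) can express the privatized version with OpenMP clauses. The sketch below is illustrative (f(), g(), M and N are placeholders) and also turns the accumulation into a reduction, anticipating the next section, so that s does not need to be updated unprotected:

    float s = 0.0f;
    int i;

    #pragma omp parallel for reduction(+:s)
    for (i = 0; i < M; i++) {
        float A[N];      /* private: each thread/iteration gets its own buffer */
        float t = f(i);  /* private scalar */
        int j;
        for (j = 0; j < N; j++)
            A[j] = t * g(i, j);
        for (j = 0; j < N; j++)
            s += A[j];
    }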

3.3.2 Reduction recognition

A reduction statement in a loop introduces a dependency between iterations. An example of a reduction statement is when a sum is calculated. Because the add operation is associative, i.e. (a + b) + c = a + (b + c), the order in which the terms are combined does not matter. This is not always true, however: reductions on floats can give different results depending on the order of execution because of rounding errors. OpenMP implements a clause that handles simple reductions. It works by letting each thread have its own private reduction variable, which is then reduced into a globally shared variable in a succeeding critical region, similar to what is seen in Figure 3.5.

3.3.3 Induction variable substitution

Another common dependency is the one created by induction variables. An induction variable is a variable that is changed in the same manner in every iteration of a loop, meaning that its value N iterations ahead can be predicted. Such variables can often be substituted with the induction formula that defines them, as shown in Figure 3.6. The benefit is that the loop dependency is removed, but evaluating the induction formula can be costly, meaning that the transform is not always beneficial.

/* Using the OpenMP reduction clause: */
#pragma omp parallel for reduction(+:q) private(l)
for (i = 0; i < N; i++) {
    l = f(i);    /* heavy execution */
    q += l;
}

/* Roughly how the reduction is implemented: each thread accumulates into
   a private copy, which is then added to the shared variable inside a
   critical region. */
#pragma omp parallel private(q1, l)
{
    q1 = 0;
    #pragma omp for
    for (i = 0; i < N; i++) {
        l = f(i);    /* heavy execution */
        q1 += l;
    }
    #pragma omp critical
    q += q1;
}

Figure 3.5: A reduction recognition example using OpenMP.

c = 10;
for (i = 0; i < 10; i++) {
    /* c is incremented by 5 for each loop iteration */
    c = c + 5;
    A[i] = c;
}

for (i = 0; i < 10; i++) {
    /* the dependency on the previous iteration is now removed */
    c = 10 + 5*(i+1);
    A[i] = c;
}

Figure 3.6: A simple example of induction variable substitution.

3.3.4 Alias analysis

When a pointer refers to the same memory address as another pointer (or array), they are said to alias each other. In parallel programs, if threads have pointers that alias the same memory location, problems such as race conditions can arise when the address is both read and written. The behavior of the program can become non-deterministic, in that different runs of the same program on the same data may produce different results. Thus it is often important to impose an explicit ordering between threads which hold aliases. In Figure 3.7, a pointer aliases an array and both variables are accessed within the loop. By replacing all aliases with the original variable name, the loop becomes simpler to analyze for dependencies.

/* A pointer aliasing the array A: */
p = &A[3];
for (i = 0; i < 10; i++) {
    *p = y;
    A[i] = x;
    p++;
}

/* The alias replaced with an index into A: */
p = 3;
for (i = 0; i < 10; i++) {
    A[p] = y;
    A[i] = x;
    p++;
}

Figure 3.7: A simple example of a pointer aliasing an array.
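One standard way to make alias analysis easier in C99, independent of any particular parallelization tool, is the restrict qualifier: it promises the compiler that the pointers do not alias, so a loop like the one below can be analyzed (and vectorized or parallelized) without a conservative dependency being assumed. The function is an illustrative sketch:

    /* Without restrict the compiler must assume dst and src may overlap. */
    void scale(float * restrict dst, const float * restrict src,
               float factor, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = factor * src[i];
    }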

3.4 Parallelization methods

The following methods are optimization techniques that can make a program parallel. They have been categorized into three categories: traditional parallelization methods, polyhedral parallelization methods and thread level speculation. These methods can also be combined with the pattern recognition techniques presented in the previous section, such as reductions and privatization of variables.

3.4.1 Traditional parallelization methods

The traditional way of parallelizing programs is to gather the dependencies within the loop and identify in which iteration space a loop is parallelizable. This is done using dependence vectors. Given a loop nest as seen in Figure 3.8, the dependence vector is equal to [g1 − f1, g2 − f2]. The statements are loop independent if the distance vector contains only zeros; otherwise there is a loop carried dependency. Traditional methods typically do not change the iteration space like the polyhedral methods do; only loop interchange is performed, so that the loop with independent iterations becomes the outermost loop. The traditional methods are much simpler and faster than the polyhedral parallelization technique, but can only handle simple cases.

/* f1, f2, g1 and g2 are constant subscript offsets. */
for (i1 = 0; i1 < N; i1++) {
    for (i2 = 0; i2 < M; i2++) {
        A[i1 + g1][i2 + g2] = B[i1][i2];     /* write to A  */
        C[i1][i2] = A[i1 + f1][i2 + f2];     /* read from A */
    }
}

Figure 3.8: Example code to illustrate dependence vectors.

3.4.2 Polyhedral model

The polyhedral model [32] is a mathematical model that is used for loop nest optimizations. It can be used to optimize data locality through tiling based on cache sizes and levels; tiling is the process of grouping data accesses into chunks that fit in cache, to reduce cache misses. The polyhedral method can also optimize the tiles for parallelism. Figure 3.9 shows the iteration space of a loop nest with an outer loop iterating over j and an inner loop iterating over i. The polyhedral optimization method is able to transform this loop nest into a skewed loop, creating an outer loop that iterates over t and an inner loop iterating over P. The inner loop has now become parallelizable with two threads. In Figure 3.9, the circles represent a statement and the smaller arrows represent the dependencies of the statement. In the skewed loop there are two statements in the same iteration of t which are independent and can therefore be run in parallel. The long arrows show how the statements can be distributed over two threads.

Figure 3.9: A loop nest that has been transformed to be parallelizable.
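As a concrete sketch of loop skewing (this is a generic wavefront example, not the exact loop nest of Figure 3.9; A, N and M are placeholders): every element below depends on its upper and left neighbours, so neither original loop is parallel, but after skewing all points on one anti-diagonal t = i + j are independent.

    /* Original nest: no loop is parallel as written. */
    for (i = 1; i < N; i++)
        for (j = 1; j < M; j++)
            A[i][j] = A[i-1][j] + A[i][j-1];

    /* Skewed nest: the inner loop walks along one anti-diagonal, whose
       points only depend on the previous diagonal and can run in parallel. */
    for (t = 2; t <= N + M - 2; t++) {
        int lo = (t - (M - 1) > 1) ? t - (M - 1) : 1;
        int hi = (t - 1 < N - 1) ? t - 1 : N - 1;
        #pragma omp parallel for
        for (i = lo; i <= hi; i++) {
            int j = t - i;
            A[i][j] = A[i-1][j] + A[i][j-1];
        }
    }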

3.4.3 Speculative threading

Speculative threading is a wholly different approach to parallelization. Instead of analyzing dependencies statically or dynamically, speculative reads and writes can be used, which keep track of which memory blocks are being accessed. This way, all loops can be assumed to be parallel. If two iterations running in parallel at run-time violate a dependency through a read or write access, the loop is clearly not parallel and the affected iterations are restarted. González-Escribano et al. [33] have presented a proposal for support of speculative threading in OpenMP. They suggest that read and write operations are replaced with function calls that check whether a dependency violation has been made before reading or writing. From the programmer's point of view, a worksharing directive would just contain an additional clause defining whether the variables should be speculatively checked. This requires a speculative scheduler that is able to restart iterations that have failed due to dependency violations. Automatic speculative threading has been shown to work by Bhattacharyya et al. [34]. They use Polly [35] to implement heuristics for parallelizing regions that have been classified as maybe containing a dependency (regions that would otherwise not have been parallelized). They show that the heuristics manage to achieve speed-ups comparable to normally parallelized regions.

3.5 Auto-tuning

Auto-tuning is the process of automatically tuning a program for optimal performance. It is not always profitable to parallelize a loop, for example if the cache locality becomes poor. Different heuristics can be created and used to approach the typical problems of parallelization. An auto-tuning step can look at dynamic information, such as cache misses, and decide whether the loop benefits from executing in parallel. The auto-tuner could in this case try a loop interchange (the inner loop and outer loop switch places) and measure the execution time after the change. Without dynamic data, heuristics can base the decision on the number of instructions in a loop, or decide that parallelizing a particular loop is not profitable because its iteration count is too small.

There is work done on finding pipeline parallelism by Tournavitis et al. [6] which uses auto-tuning. This is highly relevant for embedded systems that process streaming data. By identifying pipeline stages, it is possible to measure how long each stage takes to execute. The stage with the highest load is then targeted for extraction of additional pipeline stages. The algorithm used targets the bottleneck stages and divides them into smaller pipeline stages, in order to get better load balancing of the system. However, this technique has not been researched to the same degree as parallelizing independent loops and has not yet been popularized in parallelization tools, so it will not be investigated further in this report.

Another interesting auto-tuning technique is implemented by Wang et al. [36]. Their implementation uses dynamic dependency analysis to determine the parallelism of the loop. It also gathers run-time information, together with heuristics they developed, to determine the profitability of parallelizing the loop. Their tests show that in many of the benchmarks their automatic parallelizer is able to achieve performance equal to hand-optimized code. Further evaluation of their implementation shows the strength of their tuning heuristics: the optimized programs never give worse performance than the original.

3.6 Conclusion

In this chapter a summary of different methods that are commonly used to parallelize and optimize a parallel program has been presented. Many other optimization methods exist, but they generally apply not only to parallel programs but to sequential programs as well, such as inlining functions at call sites or unrolling loops to reduce the number of jumps.

According to Bae et al. [37], the pass that by itself enabled the most speedup is the variable privatization pass, due to the parallelism it creates. Reduction recognition showed an impact as well in two programs. In the benchmark used, the induction variables were already substituted, so induction variable substitution did not show any impact. Bae et al. [37] acknowledge that earlier work, independent of their research, found that these three passes have significant impact, which is confirmed by their results. In a future benchmark, it will be interesting to see how efficient parallelizing tools are at recognizing these dependencies and how they deal with them. Many loops contain these types of dependencies, and handling them can be an essential part of parallelizing a program.

Additionally, static dependency analysis is preferable to dynamic dependency analysis, since both safe code and a tolerable compilation time are preferred. Speculative threading is similar to dynamic dependency analysis, but can be used for loops where static dependency analysis is unsure of the parallelism, and it does not affect the compilation time. This makes dynamic dependency analysis redundant for automatic parallelization. Dynamic information about the loop, such as execution time, is still something to consider when an auto-tuner deduces the profitability of a parallelization.

The parallelization methods sound effective and could potentially remove the burden on the programmer to parallelize complex loops. The polyhedral method together with speculative threading sounds like the most effective parallelization technique. Auto-tuning is another technique that sounds useful for getting the most performance out of the program. It does, however, appear to be a complicated subject, and finding the best heuristics based on the target platform and other parameters such as code size requires a lot of work, perhaps even a step that uses machine learning methods.

Chapter 4

Automatic parallelization tools

There are many tools that can help a developer parallelize their programs. In this chapter a subset of them is presented, followed by a comparison of their functionality. They were selected based on popularity: they are commonly referenced in the studied material. The tools found have been categorized into Parallelizers, Translators and Assistance tools. Parallelizers are tools that analyze the code, find parallel regions and parallelize them. Translators are capable of taking an already parallelized application and generating code for a different architecture. Assistance tools are applications that can help a programmer parallelize an application by hand, or gain more knowledge about their program.

4.1 Parallelizers

The following parallelization tools transform sequential code to run in parallel. They all have in common that they operate on loops. What differentiates them are the methods they use and where the parallelization is visible. The methods they use do not differ much; two categories can be seen: traditional parallelization and polyhedral parallelization. Some tools transform source code into parallel source code, meaning that the source code is visibly modified. Other tools transform the code but do not write parallel source code; instead the parallelization is only visible in the executable.

4.1.1 PoCC and Pluto

The polyhedral compiler collection (PoCC [38]) is a chain of tools that is able to perform polyhedral optimizations on annotated loops. Each tool has its own responsibility in the chain, such as parsing, dependency analysis, optimization and code generation. The actual parallelizer in the tool chain is the tool called Pluto [39]. The code generation is capable of outputting OpenMP annotations and vector instructions. It can also generate a hybrid parallel solution using OpenMP and MPI. To use this tool it is required that the loops in the source code are annotated with a pragma that isolates the loop. This loop is the only thing that will get parsed and converted to the polyhedral representation, provided that the loop fulfills all the properties required to be considered a static control part. The output will then be either a sequential but optimized loop, or a parallel optimized loop if it was found to be parallel.
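As an illustration of the required annotation, a loop handed to Pluto is typically enclosed in scop pragmas like in the sketch below; the computation itself is an arbitrary example of mine, not code from the thesis applications.

    #define N 512

    void scale_and_add(double alpha, double beta, double A[N][N], double C[N][N])
    {
        int i, j;
        /* Only the region between the pragmas is parsed into the polyhedral
           representation, provided it forms a static control part. */
    #pragma scop
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                C[i][j] = beta * C[i][j] + alpha * A[i][j];
    #pragma endscop
    }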

4.1.2 PIPS-Par4all

PIPS [40] is another polyhedral optimizer, but it also has other passes to optimize the code. It is not a compiler like PoCC, but a compiler workbench that can be used to build compilers on. It converts source code to optimized source code. It is also very extensible; you can for example replace the internal polyhedral optimizer with PoCC [41]. PIPS has several transformation and analysis steps, some examples being variable privatization and loop interchange. PIPS can also generate code for heterogeneous systems by generating CUDA or OpenCL. Before generating the code, a calculation of data transfer vs. computational intensity is made to determine whether it is worth offloading the code. In PIPS this is done by estimating the execution time and by convex array region analysis [40]. Par4all [42] is a compiler that is built using PIPS. Par4all defines which passes will be used and automatically performs them on the given code. The user can also specify with flags whether the parallelized loops should be annotated with OpenMP, or whether CUDA or OpenCL code should be generated for the loops.

4.1.3 LLVM-Polly

Polly [35] is a plugin to the LLVM environment that can do polyhedral optimizations. It is heavily influenced by the PoCC tool and can, like PIPS, export the polyhedral representation to PoCC and then import the optimized version [43]. This was how Polly was designed at first, but it has since added its own polyhedral optimizer based on the same algorithms used in Pluto. What differentiates Polly from PoCC and PIPS is that it does not perform polyhedral optimization on source code. Instead Polly optimizes the LLVM intermediate representation. Performing optimizations on this level means that any language that can be parsed to LLVM IR is able to make use of the optimization, because all code is decomposed into basic blocks and branches when parsed into LLVM IR. Before Polly makes any optimizations it makes use of preparing transforms that are developed for LLVM. Basic alias analysis, memory-to-register promotion and loop simplification are among the transformations that are used [43]. This exposes loop patterns which Polly can operate on. Polly is able to detect trivially vectorizable loops and insert SIMD instructions. It is also able to hoist loop-invariant variables out of the loop [43]. Currently it does not handle reductions or variable privatization.

4.1.4 LLVM-Aesop

Aesop [44] is an automatic parallelizer targeting shared memory machines, implemented in LLVM. It targets parallelism in dense array based code with affine analysis, using traditional methods with dependence vectors. First a couple of transformations are made to simplify the serial code, just like in Polly. After this the code is analyzed for loop carried dependences using alias analysis and a dependence vector module that uses two dependency tests (Banerjee and delta). This generates dependence vectors that are then used for parallelization. Given a loop iterating over i, j and k and a vector v = (0, 0, 1), the decision algorithm used by Aesop can see that the iterations along i and j are independent and decide that the outermost loop can be parallelized. Aesop is also able to resolve some dependencies such as reductions and privatization of variables.
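As a concrete example of such a dependence vector (my own illustration, not taken from [44]), the loop nest below carries a dependency only along k, giving the vector (0, 0, 1), so the i and j loops can run in parallel:

    #define NI 64
    #define NJ 64
    #define NK 64

    void prefix_sweep(double a[NI][NJ][NK])
    {
        /* a[i][j][k] reads a[i][j][k-1], so iterations only depend on earlier
           iterations of k: the dependence vector is (0, 0, 1) and the i and j
           loops are independent and parallelizable. */
        for (int i = 0; i < NI; i++)
            for (int j = 0; j < NJ; j++)
                for (int k = 1; k < NK; k++)
                    a[i][j][k] += a[i][j][k - 1];
    }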

4.1.5 GCC-Graphite

GCC has an automatic parallelization step called Graphite [45] that performs polyhedral optimization on the IR of GCC (Gimple). The current maintainer is also a maintainer of the Polly plugin, and it is on Polly that most effort is currently put. Graphite can be seen as a light version of Polly that works for the GCC compiler, in contrast to the Clang compiler in the LLVM environment. Graphite is capable of inserting OpenMP function calls and vectorization instructions just like Polly.

4.1.6 Cetus

Cetus [46] is a research source to source compiler that uses traditional methods. It performs several other optimizations such as induction variable substitution, reduction recognition and variable privatization. To parallelize loops it looks for dependencies; for scalar variables it uses information from the reduction and privatization transforms, and for array variables the data dependency graph is investigated [37]. Cetus has two profitability tests. Its model-based profitability test estimates a loop's workload to determine if the loop should be parallelized. If this cannot be determined at compile time, the decision is deferred to run time using an OpenMP if clause. The other profitability test uses profile information to compare the sequential version with the parallel version. The compiler also has a tuning method called window-based empirical tuning, which searches all optimization techniques of the compiler to find a combination that performs best at run time. This method was also used to show which optimization technique had the most impact on performance; this has been summarized in section 3.6.
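A minimal sketch of such a run-time test is shown below; the iteration-count threshold is arbitrary and is not Cetus' actual heuristic.

    void scale(double *a, int n, double c)
    {
        /* The loop only runs in parallel when the iteration count is large
           enough for the threading overhead to pay off; otherwise the OpenMP
           if clause makes it execute sequentially. */
        #pragma omp parallel for if(n > 10000)
        for (int i = 0; i < n; i++)
            a[i] *= c;
    }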

4.1.7 Parallware

Parallware [47] is a proprietary source to source compiler that uses the domain independent kernel method presented by Andión et al. [48]. It is able to insert OpenMP or OpenACC annotations on the parallelized loops that it detects. It was shown that the polyhedral optimization tools Pluto and Graphite were ineffective at parallelizing a number of applications that the domain independent kernel method was able to parallelize [48]. A possible reason for the poor efficiency could be the syntactic sensitivity of Pluto, which the authors acknowledge in the report.

4.1.8 CAPS

The CAPS compiler [26] is a proprietary source to source compiler that can convert sequential C code into accelerator code using OpenHMPP and OpenACC directives. It allows building portable applications for many-core platforms, such as Nvidia GPUs, AMD GPUs and Intel MIC. The CAPS compiler tries to partition the code into standalone pieces that can run in parallel, called codelets. A codelet is a function that does not contain any global variables, except if these have been declared by the HMPP directive "resident". It does not contain any function calls with an invisible body (that cannot be inlined); this includes the use of libraries and system functions such as malloc and printf. Every function call must refer to a static pure function (no function pointers) [49]. The function should not return anything; this way the codelet can be executed remotely and asynchronously. By using codelets, it becomes easier to classify regions as parallel. The CAPS compiler generates CUDA or OpenCL code depending on the target specified for a code region. The compiler also provides auto-tuning techniques that optimize the trade-off between performance and portability.

4.2 Translators

This category contains tools that do not parallelize code themselves, but help convert source code from one parallel paradigm to another. The parallel paradigms were touched upon in section 2.2. The following tools have in common that they translate C code that uses OpenMP worksharing construct pragmas for SMP platforms.

4.2.1 OpenMP2HMPP

OpenMP2HMPP [29] is a source to source translator that takes OpenMP code that is annotated with the added directive clause target CUDA and creates the corresponding OpenHMPP version of the code. This code can then be compiled with a HMPP compatible compiler such as CAPS, to get an automatically generated CUDA version of the code.

4.2.2 Step

STEP [50] is a source to source translator built with the PIPS workbench. STEP is able to convert code with OpenMP worksharing constructs into MPI processes. This enables the program to run on distributed processors with distributed memory. It also supports combining OpenMP with MPI (hybrid programming), where the outer worksharing constructs are turned into MPI processes while the inner worksharing constructs are kept intact. STEP works in the following way, as described by Millot et al. [51]. First a parallel loop with an OpenMP worksharing construct is found, and the contents of this loop are extracted to a separate procedure. This procedure takes as input the start and end indices, and the arrays and variables that are used in the loop region. The second task is to determine what data needs to be updated after the parallel section has been executed for each process in the distributed system. This means that the updated data has to be passed around between the processes. After this, the code is generated. When starting the resulting application, the user specifies the number of processes to execute with, so that it can be determined which iterations each process will handle.
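The sketch below shows the general idea of such a translation on a trivial loop. It is a simplified illustration of mine, not the code STEP generates; the real tool also handles uneven chunk sizes and performs the data-update analysis described above. MPI_Init and MPI_Finalize are assumed to be called elsewhere.

    #include <mpi.h>

    #define N 1024

    /* Shared-memory original:
           #pragma omp parallel for
           for (int i = 0; i < N; i++) a[i] = f(i);
       Distributed version: every rank computes its own chunk of the iteration
       space and the chunks are then exchanged so that all ranks see the whole
       array (N is assumed divisible by the number of processes). */
    void compute_distributed(double a[N], double (*f)(int))
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;
        int start = rank * chunk;

        for (int i = start; i < start + chunk; i++)
            a[i] = f(i);

        /* Corresponds to STEP's update step: make the data written by each
           process visible to all the others. */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      a, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }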

4.3 Assistance

Assistance tools do not parallelize code at all, but analyze code and give the programmer additional knowledge about their code. Typically a debugger or a profiler is a valuable tool for gaining knowledge about how your code executes.

4.3.1 Pareon

Pareon [52] is a semi-automatic tool that can analyze a program for parallelism. The workflow of Pareon is to have the developer implement the parallel code; the assumption is that the developer wants to keep control over their code. Pareon compiles a program into an executable for a virtual target and records the memory accesses performed by the executable. If the host computer is single threaded, Pareon will still be able to find the parallelism in the program since the host is not the assumed target. After the profiling of the executable has been made, the result can be displayed in the Pareon graphical interface. This interface gives the user an overview of how the program executes in its current version. The user can then explore the program and tell Pareon to parallelize loops. Pareon will then calculate the parallel schedule for the loops and display the schedule and the potential speedup of the program. After a parallelization is made, Pareon will suggest how the user can implement the parallel modifications herself. If there are dependencies between iterations, Pareon will not be able to parallelize the loop, but will provide suggestions on how to remove the dependencies.

4.4 Comparison of tools and reflection

In this chapter short introductions to a number of tools have been presented. Many of these tools cover the same domain. This section will compare them, since there is overlap.

4.4.1 Polyhedral optimizers and performance

The polyhedral optimization tools PoCC, Par4all, Polly and Graphite overlap in their functionality, and it is unclear which of them performs best. Previous evaluations have been made using Polybench 2.0 by Amini et al. [53] and Grosser et al. [43]. These tests have been performed under different conditions and cannot be compared directly. The tests were performed on an older version of Polly, when it still used PoCC for its parallelization optimization; currently it implements its own optimization algorithm based on the algorithm used in Pluto. It is shown that the speedup is significant on these benchmarks, although Polly struggles with two of the benchmarks, which resulted in performance degradation. This confirms that the tools cannot be trusted blindly. The performance of the parallel code was, however, better than the sequential version on average.

The other tools that have been tested with Polybench 2.0 are the Par4all tool and the CAPS compiler. In these tests, it is shown that the accelerator code yields significant speedups compared to the OpenMP implementation. Not all tests were performed in the same way as the tests on Polly. Both the CAPS version and the Par4all CUDA version struggle on some of the benchmarks, because copying the memory to the GPU takes more time than executing the program sequentially. Overall the tools perform well, and it is worth noting that Polly makes use of PoCC, so indirectly it is a benchmark of PoCC as well.

Since these tests were carried out, the tools have evolved, so it is not clear whether the efficiency is the same today. It is not clear how these tools would perform on a real application either, thus a new test has to be carried out. As shown by Andión et al. [48], the polyhedral optimizers Pluto and Graphite struggled with finding parallelism in several tests, such as reduction recognition, which is a common construct in real applications.

4.4.2 Auto-tuning incorporation and performance

Cetus is an interesting alternative since it does not use the polyhedral model for its optimizations, and it has tuning capabilities such as a profitability test to determine whether the parallelized loop is beneficial over the sequential loop. This means that the program will never perform worse than the sequential version of the program. The most interesting feature of Cetus is the tuning approach that is able to explore an optimization space using a method called window-based empirical tuning, where the tuning algorithm selects a combination of optimization techniques and deduces which one performs best, but only considers a small window of the program since searching the whole program is exhaustive. It is not clear how this implementation performs against a polyhedral optimizer, so further tests have to be carried out. Cetus has been evaluated by Bae et al. [37] with the NAS parallel benchmark, showing significant performance on a number of the benchmarks, comparable to hand-optimized code.

Par4All also incorporates a profitability test, available in PIPS, before it generates code for accelerators. It uses heuristics to check whether the time for data transfers exceeds the processing time. As shown by Amini et al. [53], both CAPS and Par4all are efficient on a number of the Polybench 2.0 tests. Both CAPS and Par4all are said to have a profitability check [40] [26], but looking at these tests it is clear that it either does not work in all cases or has not been used. Overall Par4all performs better than CAPS in this test, but it is stated by Amini et al. [53] that only simple annotations have been inserted by hand. CAPS is capable of finding regions that can be parallelized using different wizards, such as CodeletFinder and the HMPP wizard [26]; it is not clear to what degree these have been used in the test.

Table 4.1: Functional differences in the tools.

Tool      Reductions   Variable privatization   Interprocedural analysis   Multiple files   Parallelization method
Aesop     Yes          Yes                      Inlined functions          Yes (1)          Traditional
Cetus     Yes          Yes                      Inlined functions          No               Traditional
Par4All   Yes          No                       Yes                        Yes              Polyhedral
Pluto     No           No                       No                         No               Polyhedral
Polly     No           No                       Inlined functions          Yes (1)          Polyhedral

(1) If optimization is performed after linking multiple files.

4.4.3 Functional differences

Table 4.1 shows some functional differences that are essential for the effectiveness of the parallelization. Interprocedural analysis is one such feature. It makes it possible to parallelize loops that contain function calls, given that the called function does not affect global state in any way. For these tools, however, the interprocedural analysis is limited in different ways. Aesop and Polly, for instance, do not perform interprocedural analysis directly; instead, smaller functions are inlined during the preparation phase. Another feature is whether the tool can resolve simple dependencies in order to make a loop parallel, such as reductions or variable privatization.

In section 2.2, an introduction was given on how to code for different types of systems. Table 4.2 shows the different input and output languages for a number of tools that help the developer deploy their code to different architectures. Pluto and PoCC need someone or something to find static control parts (SCoPs) for them before they can perform their optimizations, and they are therefore much more limited for real usage. A SCoP is a region which is static, meaning that the control flow can be predicted at compile time. This means that the control statements, i.e. if statements, contain conditions which do not depend on values calculated during runtime. The Step tool cannot parallelize by itself, but given parallelized code from any of the other tools it is capable of generating MPI code. Par4all can generate code both for SMP systems and for accelerators.

Table 4.2: A rough overview of what the investigated tools take as input and what they can output.

Tool      Input languages               Output languages
Aesop     LLVM IR (1)                   LLVM IR instrumented with OSAL thread calls
Cetus     C                             C annotated with OpenMP
Par4All   C, Fortran                    C, Fortran annotated with OpenMP, C with OpenCL or CUDA
Pluto     C annotated with SCoP         C annotated with OpenMP and/or MPI (2)
PoCC      C annotated with SCoP         C annotated with OpenMP
Polly     LLVM IR (1)                   LLVM IR instrumented with OpenMP (libgomp) calls
Step      C annotated with OpenMP       C with MPI and/or OpenMP annotations

(1) Takes any language that can compile to LLVM IR; examples are Assembly, C, C++, Fortran and Ada.
(2) Pluto has an experimental version that can create distributed programs.

4.5 Conclusion

In this chapter, a comparison of a number of tools has been presented to get a better overview of how these tools can be used to help the programmer parallelize her code. The focus of this report is to make use of parallelization tools, and therefore the tools that are interesting to investigate and evaluate further are: Aesop, Cetus, Par4All, Pluto and Polly. Cetus, Par4All and Pluto are tools that are frequently referenced in previous work on automatic parallelization, while Aesop and Polly are new in the area. They are all freely available (open source) and incorporate the state-of-the-art methods presented in the previous chapters. No tool incorporates speculative threading, and the level of auto-tuning is minimal, which is unfortunate.

The portability of the code they generate is limited to platforms that support OpenCL, CUDA or OpenMP. A typical embedded platform can consist of multiple heterogeneous cores, and ideally all processing units should be kept busy. Currently, SMP platforms and one core that offloads to accelerators using OpenCL or CUDA are the two types of architectures that can be handled by the parallelization tools. Additionally, clusters of SMP platforms can be handled by generating MPI code. However, since all of the tools are able to handle SMP platforms, this is what will be the main focus of the report. This study will not look any further into translator tools, since they assume already parallelized code. Pareon will be looked into further, since it can be a reliable alternative to automatic parallelization.

Chapter 5

Programming guidelines for automatic parallelizers

Automatic parallelizers are currently very sensitive to how the code is structured. The reason for this is the complexity of analyzing code, and the tools need all the help they can get. To make safe and reliable parallelizations, the problem has to be reduced to a specific type of problem to avoid complexity. This chapter explains the key steps that need to be taken to make efficient use of today's parallelizers. However, how to write the code depends on which parallelization tool is being used; therefore this guide tries to collect the superset of the constraints on the source code.

5.1 How to structure loop headers and bounds

Parallelization tools in general require that the loop bounds are defined with a deterministic loop exit. This means that the loop will always perform a fixed number of iterations that depends on the upper bound. This upper bound has to be loop invariant, meaning that it is the same for each iteration. The stepping value has to be linear (i.e. addition or subtraction), since this makes it easier to calculate the induction variable. The induction variable i will then be calculable at compile time and the iteration count can be derived. The body should not include statements that can make the loop exit prematurely, such as a break statement, unless it is the only loop exit. Figure 5.1 displays a couple of loop examples with predictable iteration counts. The first one is a so-called canonical loop, to which many parallelizers transform all loops in order to generalize the analysis and parallelization. Other types of loop structures, such as while loops, are also handled by some parallelizers as long as the iteration count depends on a predictable induction variable. The induction variable should not be updated except in the loop header, or once in the body in the case of while loops. Functions within the loop header are forbidden, except for a few cases. The function has to be pure, and the tool has to have interprocedural analysis to check for this accurately. The function also needs to be loop invariant. None of the investigated tools can check all of these properties for a function in the loop header, so such functions have to be avoided. A few functions are allowed, namely max(), min() and abs(). These functions should, however, not take the induction variable as a parameter; only parameters that are loop invariant are allowed.

for(i = 0; i < N; ++i){
    //body
}
for(i = -12;; i+=5){
    if(i>12)
        break; //the only loop exit
}
i = 0;
while(i < N){
    //body
    ++i;
}

Figure 5.1: Allowed loop bounds.

for(i = -12; i != 412; i+=5){
    if(i == -2)
        i = f(n);
    //body
}
for(i = 0; i < N; i*=5){
    //body
}

Figure 5.2: Disallowed loop bounds.

5.2 Static control parts

Static control parts (SCoPs) are the type of loop structure required for polyhedral analysis. A SCoP is a region with one loop entry and one loop exit where the iteration space is predictable at compile time. Inside the static control part, conditional statements and control flow have to be predictable. Figure 5.3 shows an example of this: loops may only contain static control flow, and this example has a conditional that has to be evaluated at runtime. Hiding unpredictable branches within function calls is a possibility (see Figure 5.4). The polyhedral automatic parallelizer is, however, not able to analyze whether functions are side-effect free, so the programmer has to ensure that a function called within a SCoP is in fact free from side effects; otherwise the programmer will end up with a program with unpredictable behavior. With the unpredictable branch hidden, the for loop can still be parallelized, but the analysis methods are limited.

//Potential SCoP
for(i = 0; i < num_features; i += 1) {
    //omitted body

    dot_w_err1 = 0.0;
    dot_w_err2 = 0.0;
    for(j = 0; j < num_images; j += 1) {
        dot_w_err1 += weights[j]*err1[j];
        dot_w_err2 += weights[j]*err2[j];
    }

    //unpredictable control flow
    if (dot_w_err1 < dot_w_err2) {
        //assignments to ps[i] and error[i] omitted
    } else {
        //...
    }
}

Figure 5.3: A loop that does not qualify as a static control part because of the unpredictable branch.

//Potential SCoP
for(i = 0; i < num_features; i += 1) {
    //omitted body

    dot_w_err1 = 0.0;
    dot_w_err2 = 0.0;
    for(j = 0; j < num_images; j += 1) {
        dot_w_err1 += weights[j]*err1[j];
        dot_w_err2 += weights[j]*err2[j];
    }
    struct MyFuncResult res = myFunc(dot_w_err1,dot_w_err2);
    ps[i] = res.ps;
    error[i] = res.error;
}

Figure 5.4: A loop that qualifies as a static control part.

5.3 Loop bodies

Within loop bodies it is important not to introduce loop carried dependencies. These are created by induction variables, reductions, global variables and array accesses. Parallelizing tools are capable of handling certain types of loop carried dependencies, but avoiding them as much as possible increases the chance of parallelizing the loop. The parallelization tools investigated cannot create critical regions around code that reads and writes a shared variable. Figure 5.5 shows an example of a shared array and scalar (count) that block a parallelizable loop. Writing the code differently makes it possible to parallelize it. The trick is to keep in mind that loop iterations are predictable: for example, if a maximum number of iterations can be calculated, then calculate it, as in Figure 5.6. In this example the critical region has been moved out of the loop.

rectangles = malloc(capacity*sizeof(Rectangle));
count = 0; // <- unpredictable counter
//Not parallel
for(j=0;j<=height-L;j++){
    for(i=0;i<=width-L;i++){
        double score = ApplyDetector(classifier,&sub_integral_image);

        if( score > threshold ){
            //Start critical region
            rectangles[count] = (Rectangle){i*step,j*step,L,L};
            count++;
            if(count >= capacity){
                capacity += 1000;
                rectangles = realloc(rectangles, capacity*sizeof(Rectangle));
            }
            //End critical region
        }
    }
}

Figure 5.5: Critical region within the loop.

rectangles = malloc(element_capacity*sizeof(Rectangle));
score = malloc(element_capacity*sizeof(double));
//Parallel
for(j=0;j<=height-L;j+=step){
    for(i=0;i<=width-L;i+=step){
        score[j][i] = ApplyDetector(classifier,&sub_integral_image);
        rectangles[j][i] = (Rectangle){i*step,j*step,L,L};
    }
}
detected_rectangles = malloc(capacity*sizeof(Rectangle));
count = 0;
//Not parallel
for(i=0;i<element_capacity;i++){
    if(score[i] > threshold){
        detected_rectangles[count] = rectangles[i];
        count++;
        if(count >= capacity){
            capacity += 1000;
            detected_rectangles = realloc(detected_rectangles, capacity*sizeof(Rectangle));
        }
    }
}

Figure 5.6: Critical region fissioned out of the loop.

5.4 Array accesses and allocation

Array accesses are the biggest hurdle when analyzing programs for parallelism, so some assumptions are made about them to simplify the analysis. A two dimensional array is easier to analyze than a linearized array where the two dimensions are accessed like a vector, using a stride equal to the width of the matrix (A[i*width+j] vs. A[i][j]). A parallelizer may have the functionality to delinearize the one dimensional array into a multi dimensional array, but this should not be taken for granted. The array subscripts should preferably only contain linear expressions using loop induction variables or loop invariant variables.

Allocation of an array can be made either on the stack or on the heap. Automatic parallelizers are not able to privatize heap allocated objects, since OpenMP clauses are only able to specify stack allocated data as private. Setting a heap allocated object as private would mean that the pointer gets privatized and not the data. Ideally the parallelizers would move heap allocation calls (malloc) into the scope in which they can be privatized, since each thread then has a private pointer. Currently this has to be performed by the programmer (see Figure 5.7).

(1) Before

int * a = malloc(50*sizeof(int));
//Not parallel since a will
//not be declared private
for(int i = 0; i < N; i++){
    for(int j = 0; j < 50; j++){
        a[j] = MyFunc(i,j);
    }
    //...
}
free(a);

(2) After

//Parallel since a can
//be declared private
for(int i = 0; i < N; i++){
    int * a = malloc(50*sizeof(int));
    for(int j = 0; j < 50; j++){
        a[j] = MyFunc(i,j);
    }
    //...
    free(a);
}

Figure 5.7: Move private dynamic allocation inside the loop scope.

5.5 Variable scope

Where a variable is defined matters for parallelization tools. Even though a privatization pass exists in some parallelization tools, it is not guaranteed that it will be able to classify the variable as private. By defining variables as deep in the loop nest as possible, the variables can be removed from the list of variables that need to be analyzed for variable privatization, and there is less chance that the parallelizer will by mistake classify a private variable as shared, as in Figure 5.8. To resolve this, refer to Figure 5.9.

double A[N];
double x[N];
for(i = 0; i < N; ++i){
    for(j = 0; j < N; ++j){
        A[j] = 4;
    }
    for(j = 0; j < N; ++j){
        x[i] = A[j];
    }
}

Figure 5.8: A is classified as shared, even though it is private in theory.

double x[N];
for(i = 0; i < N; ++i){
    double A[N];
    for(j = 0; j < N; ++j){
        A[j] = 4;
    }
    for(j = 0; j < N; ++j){
        x[i] = A[j];
    }
}

Figure 5.9: A is in a scope where it cannot be shared between the iterations over i, and is thus private.

5.6 Function calls and stubbing

Many of the parallelization tools do not have any form of interprocedural analysis. This means that the programmer has to write code in place of the function call. This is obviously not the optimal way of writing C code, since readability decreases and code cannot be reused without copying and pasting. Some tools are able to inline functions, but are usually configured to calculate a cost and inline a function depending on that cost. How this cost is calculated can vary, but code size is usually a factor. Inlining a function removes a call and a return instruction, but the bigger code size can increase cache misses. An additional benefit of inlining a function call is that more optimization opportunities can emerge when the boundary between the caller and the callee is removed.

Par4all is able to reliably analyze functions using interprocedural analysis, but it has limitations; the following rules apply only to that tool. Inaccessible function definitions cannot be reasoned about by automatic parallelizers. It is therefore necessary to either hide the function call by commenting it out, or to write stub code for the function so that the parallelizer has a definition it can reason about. The most common C standard library functions are already dealt with by Par4all and do not need to be stubbed. Function calls in general need to be side-effect free; this means that they are not allowed to write to global variables or to memory locations provided as arguments. As is the case for Par4all, recursive functions are difficult for the tool to analyze, and parallelization of these should not be expected. In theory an optimization pass may be able to rewrite a recursive function into an iterative loop, which is simpler to parallelize; an example is the tail recursion elimination pass that is a common transform used in LLVM.

Automatic parallelizers cannot reason about already parallelized applications. Just like inaccessible function calls, the parallel constructs must be hidden from the automatic parallelizer, either by stubbing the parallel regions with a sequential region or by commenting out parallel OpenMP statements or thread library functions.
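As a hypothetical example of such a stub (log_progress stands in for any library function whose body the tool cannot see), the real definition is temporarily replaced by a side-effect-free one so that the loop calling it can still be analyzed:

    /* Stub for an externally defined logging function. It has no side effects,
       so a parallelizer can safely analyze loops that call it; the real
       implementation is linked back in after the parallelization step. */
    void log_progress(int iteration)
    {
        (void)iteration;   /* intentionally does nothing */
    }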

5.7 Function pointers

Programmers that have been in contact with functional languages may feel the urge to use function pointers extensively when programming in C. This was actually the case for the face recognition application used in this thesis, as shown in Figure 5.10. Function pointers are, however, not supported by any of the automatic parallelization tools, so they need to be avoided. As a side note: in the particular case in Figure 5.10, zipWith is only called with three different functions (Add(), Sub() and Mul()). Ideally, a compiler could generate specialized functions by identifying which functions are sent as parameters, in order to remove the function pointer.

void zipWith( double f(double,double), //function pointer
              double* t1,              //array term1
              double* t2,              //array term2
              double* r,               //store result here
              uint32_t size ){         //Size of t1, t2 and r
    int i;
    for(i=0;i<size;i++){
        r[i] = f(t1[i],t2[i]);
    }
}

Figure 5.10: Function pointers should be avoided.
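For illustration, a specialized version generated for the Add case could look like the sketch below; zipWithAdd is a name invented for the example and not something any of the tools produce today.

    #include <stdint.h>

    static double Add(double a, double b) { return a + b; }

    /* Specialized zipWith for the Add operation: the indirect call through the
       function pointer is gone, so the loop body is a plain element-wise
       addition that an automatic parallelizer can analyze and inline. */
    void zipWithAdd(double *t1, double *t2, double *r, uint32_t size)
    {
        for (uint32_t i = 0; i < size; i++)
            r[i] = Add(t1[i], t2[i]);
    }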

5.8 Alias analysis problems: Pointer arithmetic and type casts

Pointer arithmetic creates complexity for the alias analysis component in the parallelizers. Avoiding it increases the chance that the parallelizers perform correctly, and oftentimes the code becomes more readable after the change. Type casts add another kind of complexity for parallelizers, since they reinterpret one memory layout as another. Figure 5.11 shows the two different cases. The last example shows a commonly occurring statement where c functions as a shorthand for faster typing and, in some cases, readability; this statement complicates alias analysis as well.

int x;
int y;
double a[N-1];

//This type cast captures both x and y
//When incremented it also points at a[]
double *z = (double*) &x;

for(i=0;i<N-1;i++){
    double *c = &a[i];  //shorthand pointer into a[]
    //...
}

Figure 5.11: Two examples of how to complicate alias analysis.

5.9 Reductions

Reductions come in many forms, but only a subset of them are handled by automatic parallelizers. For instance, the OpenMP reduction clause only handles primitive operations that are associative and commutative. The reason for this is that with these properties, the order in which the reduction is executed does not matter. Threads may then perform their own reduction on a subset of the data, and the partial results may afterwards be further reduced into a common reduction variable. Examples of operations that are associative and commutative are addition, multiplication and the boolean operators. The division operator, for example, is not commutative and is therefore not handled.

The polyhedral automatic parallelizers currently do not handle reductions within loops. To get around this problem the programmer can fission the reduction out of the loop and let it execute separately from the parallel loop, in its own sequential loop (see Figure 5.12).

//Original loop: independent work and a reduction in the same loop
for(i = 0; i < N; ++i){
    //independent work storing a result per iteration
    //reduction of the result into a shared scalar
}

//After fission: the independent work can run in parallel,
//and the reduction runs afterwards in its own sequential loop
for(i = 0; i < N; ++i){
    //independent work storing a result per iteration
}
for(i = 0; i < N; ++i){
    //reduction of the stored results into a shared scalar
}

Figure 5.12: Fission out the reduction.
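When the target is plain OpenMP rather than a polyhedral tool, the same kind of loop can instead keep the reduction and annotate it explicitly; the dot product below is a generic sketch of mine, not code from the thesis applications.

    double dot_product(const double *x, const double *y, int n)
    {
        double sum = 0.0;
        /* Each thread accumulates a private partial sum; the partial sums are
           combined into sum when the parallel loop ends. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }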

5.10 Conclusion

In this chapter, a description was given of how to write and structure legacy code so that it is accepted by automatic parallelizers. Several examples have been given that are connected to the legacy face recognition application used in this study. Function analysis is a feature that is necessary for reliable parallelization, but only one of the tools is capable of it. Additionally, variable privatization and reduction handling are not supported by all of the tools, which adds to the constraints on how the code must be structured. The occurrence of the necessary changes in legacy projects has not been quantified, so the benefit of automating any of these refactoring steps cannot be deduced. It is, however, clear that many of these coding patterns occur in general applications, and this was also the case for the applications the tools were evaluated on.

Chapter 6

Implementation

The study phase resulted in a knowledge base covering a large number of different parallelization methods. I also composed a set of guidelines that can be used to create applications that can make use of automatic parallelization tools. The objective of the implementation phase was to evaluate how well the tools work and whether it is beneficial to use them. The second objective was to identify the improvements that can be made to the automatic parallelization tools, using the findings from the study together with the results and knowledge gained from the implementation. In this chapter the implementation approach is presented, together with a set of requirements that an automatic parallelizer should fulfill in order to be beneficial for the programmer.

6.1 Implementation approach

Before the implementation phase, I ported a face recognition application from Matlab code to C code by hand. The resulting C code is in sequential form, and this application is what I consider the legacy application to parallelize. The application in this state was not suitable to run through any of the parallelization tools, which meant it was necessary to refactor it. The implementation consisted of doing these refactoring steps on the C code by hand, using the guidelines from the study phase, in order to get better results with the parallelizing tools. This process is similar to parallelizing the code by hand. Because of this, it is interesting to see whether refactoring the code by hand to make use of parallelizing tools is worthwhile compared to parallelizing the code by hand. Both methods were timed with a rough estimate.

The implementation phase also consisted of a comparison of the efficiency and performance of the parallelization tools. The efficiency is defined as how well the parallelization tool is able to parallelize the sequential program. The performance is defined as the speedup of the parallelized program compared to the sequential one. This resulted in an analysis of why some tools perform better than others, which can be used to identify the key features of a parallelization tool. The efficiency and performance were measured using the PolyBench benchmark, which consists of small highly parallel programs in sequential form. The efficiency of the tools was also measured using the legacy application. Lastly, each tool is checked against a set of requirements to determine whether the tool is beneficial for a programmer to use to parallelize her code. This list is also used to identify how the tools can be improved to be more beneficial.

The tools that have been selected for the evaluation are Aesop [44], Polly [35], Pluto [39], Par4all [42] and lastly Cetus [46], since these are free open source alternatives. They are also easy to use, which was the major argument for selecting these tools. Additionally, Pareon [52] has been used to analyze the legacy application for parallelism and helped out when parallelizing the application.

6.2 Requirements

The following requirements (listed in Table 6.1) have been identified as useful traits of an automatic parallelizer that ports legacy code to different architectures for embedded systems. It is interesting to know which tools fulfill the most of these requirements, in order to find where future effort on developing a parallelization tool for embedded systems should be put.

Table 6.1: The list of requirements for an automatic parallelization tool.

Tag        Requirement
SWREQ01    Handle serial C code
SWREQ02    Handle interprocedural C code
SWREQ03    Handle serial code for other languages
SWREQ04    Handle interprocedural code with multiple languages
SWREQ05    Handle already parallelized code
SWREQ06    Parallelization must not result in race conditions
SWREQ07    Parallelization must not result in deadlocks
SWREQ08    Reduction dependencies shall be handled
SWREQ09    Privatizable variables shall be privatized and not introduce a dependency
SWREQ10    Code shall be maintainable with one version
SWREQ11    Code generation shall be configurable for different target architectures
SWREQ12    Code generation shall be able to target symmetric multi-processors
SWREQ13    Code generation shall be able to target distributed systems
SWREQ14    Code generation shall be able to target hybrid distributed systems
SWREQ15    Code generation shall be able to target heterogeneous systems
SWREQ16    Code generation shall be able to make use of vector instructions
SWREQ17    Code generated for accelerators shall be asynchronous
SWREQ18    Parallel code shall have a load balanced schedule
SWREQ19    Parallelized programs shall give a performance increase
SWREQ20    Less time consuming than to parallelize by hand
SWREQ21    Find parallelism automatically

SWREQ01 - SWREQ04 An automatic parallelization tool has to handle at least one language; C is a popular language for embedded systems and is a reasonable first target. There are, however, several C standards to select from, and ideally all of them shall be supported. Additionally, interprocedural C code is a necessity for efficient dependency analysis, since it is likely that legacy code contains function calls within parallel regions. Embedded software is written in several other languages, such as Assembly and C++ to name two, so it is reasonable that a parallelization tool is able to handle these languages as well. Ideally, programs that are composed of several modules in different languages shall be supported.

SWREQ05 - SWREQ07 Legacy code may contain functions that already make use of parallelism. An automatic parallelizer shall therefore be able to handle such code in different ways. The parallelizer could ignore the region and trust that it is fine, or it could replace it with its own parallelization. The biggest problem here is to analyze the existing parallelism. Additionally, parallelization shall not introduce faults in the program such as race conditions. If the variables in the region are private, then they shall be privatized. If the region has variables that are shared, a critical region shall be created when reasonable, so that the loop can be parallelized. If critical regions are inserted, they shall not create deadlocks or other types of starvation, as this results in a non-functional program.

SWREQ08, SWREQ09 Many loop carried dependencies in applications are reductions, or the result of reusing a variable scoped outside the loop that could be scoped inside it. An automatic parallelizer shall be able to analyze the dependencies, identify these two cases and handle them accordingly, for a higher chance of successful parallelization. Only loops that have had all of their dependencies removed shall be parallelized.

SWREQ10 - SWREQ17 For simplicity, a single version of the legacy application shall be portable to several targets; maintaining several ported versions requires extensive work whenever a change is added to the main version. Multicore embedded systems come in several different forms, thus different code generation targets are required. Multicore processors with shared memory are an initial target requirement, as this case is heavily standardized by the OpenMP committee. Distributed systems and hybrid distributed systems need means to communicate with threads that are spawned on remote processors. An automatic parallelizer can deal with this use case by inserting message passing instructions; there are several message passing implementations that suit embedded systems and that can be used instead of the bigger MPI libraries. Heterogeneous systems are a common design for embedded systems; specialized cores can accelerate the program and are a very attractive addition to a system. These cores are not standardized, however, and generating specialized code is not always possible. Vendor specific compilers may be required as well. Code that will be run on an accelerator shall execute asynchronously so that the CPU is not idling. An automatic parallelizer shall also have support for vector instructions, as they have become a common element in modern processors.

SWREQ18 - SWREQ21 Load balance is an important aspect of a high performing program. By making use of the OpenMP standard dynamic schedules, this is taken care of for SMP systems. Load balancing becomes a much more complex problem when dealing with distributed memory, however, as there will be added communication overhead between processors when fetching more work or copying memory between processors. A parallelized program shall execute faster than the sequential version to be an acceptable port. A slower program is a useless port, and small gains can demotivate parallelizing the application as well; the cores are better off processing something else. The work required for porting the application using automatic parallelization tools shall be significantly less than parallelizing by hand. A human can generally analyze applications more efficiently than a tool, and having to rewrite the application to suit parallelization tools gives the programmer greater insight into the application, which means he could easily parallelize the application himself.
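As an example of the load-balanced schedule referred to in SWREQ18, a generic OpenMP sketch could look as follows; process_window is a hypothetical work function with uneven per-iteration cost.

    void process_window(int w);   /* hypothetical work function, cost varies per window */

    void process_all_windows(int num_windows)
    {
        /* A dynamic schedule hands out small batches of iterations to idle
           threads instead of using a fixed static split, which balances the
           load when iterations have very different execution times. */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int w = 0; w < num_windows; w++)
            process_window(w);
    }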

Chapter 7

The applications to parallelize

To evaluate the efficiency of the parallelization tools, it was decided to run two complex face recognition applications and a number of small applications that do heavy parallel processing of data. In this chapter the applications are presented briefly.

7.1 Face recognition applications

I have written two applications that will be considered legacy applications that we want to run on a parallel architecture. These applications have been written to run on a single core, since this was the previous target platform. At first glance it can appear complex to parallelize these applications by hand, so they are ideal for showing the benefit of using automatic parallelizers. There are several face recognition applications out there already making use of parallel hardware; in fact, the biggest computer vision library, OpenCV [54], is already optimized with vector instructions and coarse grained parallelism using OpenMP. Therefore it was decided to create the two applications from scratch, only making use of some essential OpenCV library calls for loading and resizing images. These particular applications are based on the Viola-Jones object detection framework [55] using an AdaBoost learning algorithm [56], which picks the best features out of a big set of features. Each feature on its own is for the most part very weak at classifying a face, but by combining several features a strong classifier can be created. This work is not about the actual face recognition algorithm used, but a short explanation of the two applications is presented in this section, as it is necessary for understanding the parallelism of the applications.

7.1.1 Training application

The training application is used to find features that can be used to distinguish a set of related pictures from unrelated pictures. In my implementation the trainer will find the features that distinguish human faces from other pictures. In Figure 7.1 a rough control flow of the application is shown.

Figure 7.1: Training application for face recognition.

The execution starts with loading a number of training images of size 19x19 pixels using OpenCV. To speed up processing of these images later, the integral image is calculated for each one. This is a parallel task.

The features that are going to be used are then enumerated. A feature checks for differences in intensity in a region of a picture; these detect, for example, edges and corners. The enumeration of features creates all sizes and positions which are possible within the 19x19 region. This part contains a counter that is non-trivial to remove, and is therefore not parallel.

After the features have been created, the learning algorithm starts. The algorithm is built using an outer loop and a parallel inner loop that iterates over features. It starts with setting up weights which describe how much impact a correct classification of an image will have. If an image is incorrectly classified, the weight will be updated so that in the next iteration the image will have a bigger impact on the selection of the feature. In the parallel inner loop, a feature is applied to each training image. For each image a value is calculated which determines how significant this feature was in that particular image. By adjusting a threshold that is compared with the feature significance, it is possible to minimize the number of misclassified images. Each inner loop iteration is independent of the others and is therefore parallelizable. After the inner loop is complete, the feature that had the least number of misclassified images is selected and saved. Using this selected feature, the weights associated with each image are updated. Updating the weights creates a loop carried dependency, and the outer loop is because of this not parallelizable. When an arbitrary number of features have been found, which is specified by the user, the execution is complete and the features, which together make up one strong classifier, can be saved to a file for later use by the detector application.

This application is highly parallel since it spends most of the execution time in the parallel inner loop. It is thus a very good test to see how well the tools work on a more complex application.
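In rough C-like form, the structure of the learning phase looks like the sketch below. The function and variable names are invented for the illustration and the real application differs in detail; the point is that only the inner loop over features is parallel.

    #include <float.h>

    double evaluate_feature(int f, const double *weights, int num_images);    /* hypothetical */
    void   update_weights(double *weights, int best_feature, int num_images); /* hypothetical */

    void train(double *weights, int num_images, int num_features, int rounds)
    {
        for (int round = 0; round < rounds; round++) {        /* sequential outer loop */
            double best_error   = DBL_MAX;
            int    best_feature = -1;

            /* The feature evaluations are independent of each other and can run
               in parallel; only the selection of the minimum must be protected. */
            #pragma omp parallel for
            for (int f = 0; f < num_features; f++) {
                double err = evaluate_feature(f, weights, num_images);
                #pragma omp critical
                if (err < best_error) { best_error = err; best_feature = f; }
            }

            /* The weight update uses the feature just selected, which is the
               loop carried dependency that keeps the outer loop sequential. */
            update_weights(weights, best_feature, num_images);
        }
    }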

7.1.2 Detector application

The detector application is used to classify whether a picture contains a face or not. The training images used are 19x19 pixels, while real pictures can be up to several millions of pixels. The trained classifiers are only able to determine whether a picture contains a face if they are applied on the same picture size they were trained for. Therefore, it is necessary to split a picture into subwindows, where each window has to be checked for a face. Additionally, a face can appear at a different scale in the picture; thus scaling the picture to different sizes before applying the classifiers becomes a requirement.

Figure 7.2: Detector application for face recognition.

In Figure 7.2 the flow of the detector application is shown. First it loads a classifier from a file. After this it starts to read input frames one by one. When a frame is fetched it is first converted to gray scale, and after this the program enters a function for detecting faces. This function consists of a nested loop of four levels. The outermost loop iterates over different scale factors that will be applied to the frame. The next two loops move a sub window of size 19x19 horizontally and vertically over the frame. For each sub window the strong classifier is applied, which is a list of features that are applied to the sub window. Each feature gives a score that is summed and compared against a threshold. If the score is greater than the threshold, the sub window is classified as containing a face. Each positively classified sub window results in a rectangle drawn on the output frame of the application. All the loops mentioned here are parallelizable. The number of sub windows is, however, (in general) larger than the number of scaling factors, thus the sub window loops are the ones that will benefit the execution time the most if parallelized, as the processor count grows.
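A rough sketch of that loop nest, with the two sub window loops merged into one parallel iteration space, is shown below; the names and the scaling rule are invented for the illustration and do not match the application exactly.

    int  apply_classifier(const unsigned char *frame, int x, int y, int step); /* hypothetical */
    void mark_face(int x, int y, int step);                                    /* hypothetical */

    void detect_faces(const unsigned char *frame, int width, int height, int num_scales)
    {
        for (int s = 0; s < num_scales; s++) {        /* few iterations, left sequential */
            int step  = s + 1;                        /* made-up scaling rule */
            int y_max = height - 19 * step;
            int x_max = width  - 19 * step;

            /* The sub window loops dominate the work; collapse(2) merges them
               into a single parallel iteration space so all cores stay busy. */
            #pragma omp parallel for collapse(2)
            for (int y = 0; y <= y_max; y += step)
                for (int x = 0; x <= x_max; x += step)
                    if (apply_classifier(frame, x, y, step))
                        mark_face(x, y, step);
        }
    }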

Like the training application, this application also has a highly parallel region, and it will be interesting to see how the parallelization tools perform on it.

7.2 PolyBench benchmark applications

Simple applications are used to benchmark the tools, to gain an insight into how the tools compare in terms of efficiency. PolyBench-3.2 [57] has been selected as the benchmark, since Pluto, Par4all, Polly and Aesop are able to parse and run it. The benchmark is written in C, whereas the NAS parallel benchmark [58], which was also considered, is mainly written in Fortran. Each application contains around 10 to 30 lines of code, with complex loop nests performing specialized mathematical calculations. PolyBench provides predefined array sizes that can be selected at compile time. In this benchmark the standard dataset and the large dataset have been selected, because smaller datasets are not going to show significant performance differences compared to the reference value. Furthermore, the smallest (mini) dataset is used to verify that the parallel versions produce the same output as the sequential version.

The PolyBench kernels contain several issues regarding the algorithms used in the kernels, as identified by Yuki [59]. This benchmark is only used as a verification that the automatic parallelizers are able to parallelize some applications and to get an overview of how the parallelizers compare.

The PolyBench kernels can be categorized into these categories: data mining problems, linear algebra, linear algebra solvers and stencil problems. They have all been designed to be parallelizable. The linear algebra kernels are vector and matrix operations in different constellations; the benchmark kernels that fall into this category are: 2mm, 3mm, atax, bicg, doitgen, gemm, gemver, gesummv, mvt, symm, syr2k, syrk and trmm. The linear algebra solvers solve systems of linear equations using different algorithms; the benchmark kernels that fall into this category are: cholesky, durbin, dynprog, gramschmidt, ludcmp and trisolv. The data mining problems, correlation and covariance, calculate statistical measures of how linearly related two variables are. Stencil problems involve updating a grid of data iteratively; the benchmark kernels that fall into this category are: adi, fdtd-2d, fdtd-apml, jacobi-1d-imper, jacobi-2d-imper and seidel-2d. The following two applications do not fit into any of the categories above: floyd-warshall and reg-detect. Floyd-warshall computes the shortest paths between each pair of nodes in a graph. It is unknown what reg-detect is processing [59].

Chapter 8

Results from evaluating the tools

In this chapter the results from the benchmarks on the selected automatic parallelization tools are presented. Additionally, the results from parallelizing the face recognition applications are presented and analyzed. The evaluation was made on a machine with an Intel(R) Xeon(R) CPU E5-2440 @ 2.40 GHz with 8 cores.

8.1 Compilation flags

Table 8.1: Compilation flags for the individual tools.

Tool      Version   Flags
Aesop1    3.4       -lgomp -losal_thread_barrier -lpthreads -lruntime_pthreads_create
Par4All   1.4.5     –
Pluto     0.10.0    --tile --parallel
Polly1    3.4       -O3 -polly-ignore-aliasing -polly -enable-polly-openmp
GCC2      4.9       -fopenmp -O3

1 Flags are used by the LLVM environment.
2 Source code generated from Pluto and Par4all was compiled using GCC.

To run the parallelization steps on the applications, the compilation flags listed in Table 8.1 were used. The only flag that needs further explanation is -polly-ignore-aliasing, which was required to work around a bug in the tool at the time of the benchmark. The flag makes Polly assume that no aliasing occurs in the source code, which is not a safe assumption in real legacy applications; an illustrative example follows below.
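A small illustrative example (not taken from the benchmarks) of why ignoring aliasing is unsafe:

    /* If a and b alias (for example b == a + 1), the iterations are not
     * independent, so an optimizer that ignores aliasing may parallelize
     * or reorder this loop incorrectly. */
    void shift(double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + 1.0;
    }

    /* A caller can legally create the overlap: */
    void caller(double *buf, int n)
    {
        shift(buf, buf + 1, n - 1);   /* a and b overlap */
    }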

8.2 PolyBench results

As described in section 6.1, five tools were to be benchmarked using the PolyBench benchmark. Running PolyBench with Cetus would have required additional work and it was therefore left out of this evaluation, but it is expected that it would perform close to Aesop, since it builds on the same concepts. Figure 8.1, Figure 8.2 and Figure 8.3 show the speedup over the sequential version of each benchmark for each tool that was tested. Each application was run four times: first with the smaller data set using 4 and then 8 cores, and then with the bigger data set using 4 and 8 cores respectively. These different runs show whether the performance scales with the number of processor cores, and the bigger data size shows the effect of intelligently organizing reads and writes through tiling.

There were several issues when compiling or running some of the benchmarks, which meant that they could not be included in the results. Whenever this occurred the result was set to 0, which appears in the figures as a missing bar. Why these issues occurred is hard to say without analyzing and debugging each failing combination thoroughly. The missing bars for application 13 are due to the large amount of data used in that application, which the sequential version could not handle. A run is considered a failure when the application does not run after being processed by a parallelization tool, or when the optimized application is slower than the sequential version.

Overall there is a benefit of using parallelization tools on the PolyBench applications; for the most part the tools do not cause a slowdown. The applications are, however, very small in code size and written in a format that is easily analyzed by the automatic parallelizers. In spite of this, there are some applications that the tools struggle with, which adds to the skepticism of using parallelization tools in real situations. All tools failed to parallelize applications 11 and 24; in these cases it would have been better not to parallelize at all. Additionally, Polly fails on applications 10 and 20, Pluto fails on 7 and 10, and Aesop fails on applications 6, 18, 29 and 30. In other words, more than 10% of the parallelized applications resulted in a failure.
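The speedup reported in the figures is the sequential execution time divided by the parallel execution time for the same kernel and data set. A minimal illustration of such a measurement is given below; it is not the harness used in the evaluation, the kernel is an arbitrary stand-in, the single-threaded run approximates the sequential version, and the thread count 8 simply matches the evaluation machine.

    #include <stdio.h>
    #include <omp.h>

    #define N 4096
    static double x[N], y[N];

    static void kernel(void)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i] += 0.5 * x[(i + j) % N];
    }

    int main(void)
    {
        omp_set_num_threads(1);     /* approximate the sequential version */
        double t0 = omp_get_wtime();
        kernel();
        double t_seq = omp_get_wtime() - t0;

        omp_set_num_threads(8);     /* 8 cores, as on the evaluation machine */
        t0 = omp_get_wtime();
        kernel();
        double t_par = omp_get_wtime() - t0;

        printf("speedup = %.2f\n", t_seq / t_par);
        return 0;
    }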

Figure 8.1: Results from the PolyBench benchmarks (part 1). The y axis shows speedup.

Figure 8.2: Results from the PolyBench benchmarks (part 2). The y axis shows speedup.

Figure 8.3: Results from the PolyBench benchmarks (part 3). The y axis shows speedup.

The performance gain that the polyhedral loop optimizers are capable of producing shows that there is a big benefit of using automatic parallelizers on some types of problems. The linear algebra applications 1, 2, 4, 5, 9, 15, 16, 17, 22, 26, 27, 28 and 30 are easy to parallelize, but generally favor the parallelizers that use polyhedral optimizations with tiling, namely Pluto and Polly, which reach remarkable speedups. It is however not guaranteed that the generated loop schedule is the most favorable for performance, as can be seen in 27 and 28 when comparing Pluto with Polly. Pluto and Polly also perform well on the two data mining problems, 7 and 8, for the same reason. The linear algebra solvers 6, 10, 11, 18, 21 and 29 were more difficult to parallelize. Par4All managed to get good speedups on half of these applications, and performance close to the sequential version on the other half. It is unknown why this is, since the problems look similar to the linear algebra applications. The stencil problems 3, 12, 13, 10, 20 and 25 favor Par4All and Aesop. These problems are simple to parallelize, but the polyhedral method appears to have difficulty finding good schedules for these kinds of problems, which contain statements of the form A[i] = A[i-1] + A[i] + A[i+1]. Andión et al. [48] identify the same limitation using their own tests.
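For example, a Jacobi-style one-dimensional stencil of this form is trivially parallel over the spatial loop while the time loop stays sequential. The sketch below is illustrative only, not the PolyBench code:

    #define TSTEPS 100
    #define N      10000
    static double A[N], B[N];

    void jacobi_1d_like(void)
    {
        for (int t = 0; t < TSTEPS; t++) {
            /* Each i reads only old values of A and writes a distinct
             * element of B, so the spatial loop is embarrassingly parallel;
             * finding a good schedule across the time loop is what the
             * polyhedral tools attempt, with mixed results here. */
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                B[i] = (A[i - 1] + A[i] + A[i + 1]) / 3.0;

            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                A[i] = B[i];
        }
    }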

8.3 Parallelization results on the face recognition applications

The more interesting test is whether the parallelizers are able to parallelize a bigger application, such as the two described in section 7.1. The source-to-source compilers often come with a coding guide that must be followed in order to utilize the automatic parallelizers. These guides were not considered when developing the applications, so it was not possible to use all the tools right away. When the applications were finished, they were refactored according to the coding guidelines presented in this report. The time it took to get the training application accepted by each tool is shown in Table 8.2; the same procedure was carried out for the classifier application and is shown in Table 8.3. Both applications are about the same code size, approximately 2000 lines of code. In contrast to the small PolyBench applications, these two applications are made up of several functions, but their loop nests are simpler.

The original applications had several regions where writes to shared memory would be needed if parallelized. These regions had to be rewritten so that writes only occur to private memory, and this is what took the longest time to refactor. Apart from this, Pluto and Cetus are more limited: Cetus requires that the whole application is visible in one function, while Pluto requires that if statements with unpredictable control flow are hidden away in a function, and that the loops to be checked for parallelism are annotated with a special pragma (illustrated after this paragraph). To simplify the analysis further, and since only Par4all has interprocedural analysis, the functions that were called from within the parallel part of the training program were inlined when running Cetus, Aesop and Polly.

Getting accepted by the automatic parallelizer is only the first step; another is to get a valid optimized application. For the training application Aesop did not output a valid application. Neither did Par4All, but its mistake was easily fixed manually in the parallel source code. The classifier application was more difficult to parallelize: only Par4all and Pluto managed to output an application that compiled, Cetus could not analyze the application well enough to move on to optimization, and both Aesop and Polly generated programs that resulted in segmentation faults. In Figure 8.4 the resulting speedup from parallelizing the training application can be seen. The application was run with 1, 2, 4 and 8 threads respectively, configured using the OMP_NUM_THREADS environment variable before executing the application.
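As an illustration of the Pluto annotation mentioned above, the candidate loop nest is delimited with Pluto's scop pragmas. The function below is illustrative code, not taken from the applications:

    /* Pluto only analyzes the region between the scop pragmas; everything
     * inside must be a static control part (affine loop bounds and array
     * accesses, no unpredictable control flow or side-effecting calls). */
    void scale_rows(double A[1024][1024], const double w[1024])
    {
        #pragma scop
        for (int i = 0; i < 1024; i++)
            for (int j = 0; j < 1024; j++)
                A[i][j] = w[i] * A[i][j];
        #pragma endscop
    }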

Table 8.2: Refactoring time and validity of parallelized training application.

Tool      Refactoring time   Valid program
Aesop     None               No
Cetus     2 days             Yes
Par4all   1 day              No1
Pluto     2 days             Yes
Polly     None               Yes

1 An error in reduction recognition occurred, which invalidated the program. It was however trivial to remove this parallelization by hand.

Table 8.3: Refactoring time and validity of parallelized classification application.

Tool      Refactoring time   Valid program
Aesop     None               No
Cetus     2 days             No
Par4all   1 day              Yes
Pluto     2 days             Yes
Polly     None               No

Figure 8.4: Speedup for different numbers of cores on the training application after parallelization with the different tools.

Cetus, Par4all and Pluto are the three tools that manage to parallelize the training application with a significant performance gain. All three managed to parallelize the loop that iterates over each feature, where the application spends most of its execution time, hence the good speedup. Polly performs weakly on this particular application, for reasons unknown. For the second application only Par4All and Pluto were able to build a runnable program; the performance can be seen in Figure 8.5. Surprisingly, Pluto generated a schedule for the sub window iterating loop which only utilized half of the cores, while Par4All parallelized the same loop much more simply, by assigning each column (the outer loop) to a thread.

As can be seen from the results of this evaluation, the complex application was too difficult for the selected tools to parallelize. It is hard to claim a winner in this test: all the tools performed poorly when considering the level of autonomy, the refactoring time and the validity of the resulting application. The tool that performed best was Par4all, since it had a high level of autonomy and its output had good performance, disregarding the minor error in the training application.
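The dominant training loop has roughly the following shape: every feature is scored independently, and only the selection of the best (lowest error) feature needs synchronization. The sketch below is a simplified, hypothetical version with dummy data, not the actual training code:

    #include <float.h>
    #include <stdio.h>

    #define NUM_FEATURES 2000
    #define NUM_SAMPLES  500

    /* Dummy stand-in for evaluating one feature over all training samples. */
    static double feature_error(int f, const double *weights)
    {
        double err = 0.0;
        for (int s = 0; s < NUM_SAMPLES; s++)
            err += weights[s] * ((f + s) % 7 == 0);
        return err;
    }

    int main(void)
    {
        static double weights[NUM_SAMPLES];
        for (int s = 0; s < NUM_SAMPLES; s++)
            weights[s] = 1.0 / NUM_SAMPLES;

        double best_error = DBL_MAX;
        int best_feature = -1;

        /* Each feature is evaluated independently; only the final
         * "keep the minimum" step needs synchronization. */
        #pragma omp parallel for
        for (int f = 0; f < NUM_FEATURES; f++) {
            double err = feature_error(f, weights);
            #pragma omp critical
            if (err < best_error) {
                best_error = err;
                best_feature = f;
            }
        }

        printf("best feature %d with error %f\n", best_feature, best_error);
        return 0;
    }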

Figure 8.5: Speedup for different numbers of cores on the classification application after parallelization with the different tools.

Pluto also managed to handle both applications, but it has to be taken into account that after refactoring, the parallel regions contain function calls that Pluto assumes are side effect free, which is generally an unsafe assumption. Additionally, the programmer has to annotate the loop region that should be parallelized, so the autonomy is really low. Aesop and Polly, which are both very young tools, do not require the same level of refactoring; unfortunately they were unable to handle these two programs. The resulting program after optimization is also much harder to analyze because of the LLVM IR format, so it is difficult to identify and correct the errors they introduced.

8.4 Discussion

The two applications could have been rewritten more extensively to give the tools a better chance to parallelize them, but that defeats the purpose of the parallelization tools, since in reality it is as easy for a programmer to parallelize these applications by hand using OpenMP pragmas as it is to refactor the code to enable the use of parallelization tools. Interprocedural analysis is a key method for increasing the autonomy of the parallelization tools, since inlining functions was one of the most time consuming tasks. Assuming that a function is side effect free is unsafe, and interprocedural analysis is necessary to automatically identify the properties of functions called within the parallel region.

For porting legacy code, it was also clear that variable privatization plays a huge role when parallelizing the training application. This is similar to what Bae et al. [37] show with their empirical tuning when evaluating Cetus. These features are therefore a necessity in an automatic parallelizer. OpenMP pragma clauses privatize data arrays stored on the stack, but to privatize data stored on the heap it is required that the allocation is performed within the parallel region through a privatized pointer. Ideally, an automatic step could move dynamic allocations and free statements from outside a loop into the loop, so that the data becomes privatized; this would remove additional loop dependencies. Reduction detection also plays an important part, since it can remove loop carried dependencies that may prevent a loop from being parallelized. A sketch of such a rewrite is shown below.
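The sketch below illustrates the kind of rewrite this implies (illustrative code, not the thesis implementation): the scratch buffer is allocated inside the loop body so that each iteration, and thus each thread, gets its own copy, and the accumulated result is expressed as an OpenMP reduction.

    #include <stdlib.h>

    double process_all(const double *in, int n, int m)
    {
        double sum = 0.0;

        /* 'tmp' allocated inside the loop body is private to each iteration,
         * removing the dependence that a single shared buffer would create;
         * 'sum' is a (+) reduction (error handling omitted for brevity). */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            double *tmp = malloc((size_t)m * sizeof *tmp);
            for (int j = 0; j < m; j++)
                tmp[j] = in[i] * j;
            for (int j = 0; j < m; j++)
                sum += tmp[j];
            free(tmp);
        }
        return sum;
    }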

In summary, it can be concluded from these results that automatic parallelizers are not yet ready for large legacy code projects. Big projects take significant time to refactor, and refactoring the code for automatic parallelizers is both time consuming and gives no guarantee that the application is significantly improved after a pass through the parallelization tools.

To avoid this time consuming refactoring there are three paths. One path is to extend the current research source-to-source compilers to work more autonomously using interprocedural analysis and to handle more special cases in the source language. The second path is to not look at source code at all and move to a lower level representation, as in production compilers such as LLVM/Clang: as long as the source compiles, it can be analyzed for parallelism. The IR is, however, different in the investigated source-to-source compilers compared to GCC and Clang. Production compilers use an IR with a low abstraction level, where C code has been lowered towards hardware instructions and functions and loops have become several basic blocks interconnected with branches. Parallelizing at this level needs other models than those used by source-to-source parallelizers; it requires that the low level code is transformed back into general structures that are easily analyzed, and LLVM provides passes for doing this which both Aesop and Polly use.

Lastly, one could choose not to use automatic parallelization tools at all and just parallelize the code by hand with the help of a profiling tool that finds parallel parts. This path is more reliable and secure, which is an important aspect when working with embedded systems. Pareon was able to identify where the program would benefit the most from being parallelized. The implementation hints that Pareon provides are, however, vague and only identify which variables create loop dependencies; the programmer has to analyze the loop region more thoroughly to identify how it should be rewritten. Analyzing the program with a tool like Pareon took less than an hour, and doing the parallelization by hand using the hints should take a similar amount of time as refactoring the code to work with the parallelization tools.

Chapter 9

Requirements fulfilled by automatic parallelizers

In order for automatic parallelizers to be a usable tool in product development, and specifically in the area of embedded systems, there are a number of requirements that need to be fulfilled. These requirements relate to performance, reliability, portability and development costs. In this chapter each tool investigated in this study is evaluated against the requirements presented in chapter 6.

9.1 Code handling and parsing

All tools are able to handle serial C code (SWREQ01), but only Par4all handles interprocedural C code, and only with some limitations (SWREQ02), such as requiring stubs for functions in linked libraries. Polly and Aesop handle other languages (SWREQ03) and can optimize the code after the linking phase, which means that they could potentially handle interprocedural code for both C and multiple other languages (SWREQ04). None of the tools is able to detect already parallelized code (SWREQ05), which results in over-parallelizing an already parallel region.

9.2 Reliability and exposing parallelism

All the tools are able to parallelize without introducing race conditions (SWREQ06) or deadlocks (SWREQ07). Par4All, Cetus and Aesop have an edge over Pluto and Polly, as they are capable of detecting reductions (SWREQ08) and private variables (SWREQ09), thus exposing more parallel regions; an example of such a loop is shown below. However, Pluto and Polly can achieve bigger speedups for certain parallel regions by optimizing for cache locality.
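For example, a loop of the following shape (illustrative code, not from the evaluated applications) is only recognized as parallel if the tool privatizes t and treats sum as a reduction:

    /* Without privatizing 't' and recognizing 'sum' as a reduction, both
     * variables carry dependences between iterations and the loop cannot
     * be parallelized safely. */
    double dot(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        double t;
        for (int i = 0; i < n; i++) {
            t = a[i] * b[i];   /* t must be privatized */
            sum += t;          /* sum is a (+) reduction */
        }
        return sum;
    }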

9.3 Maintenance and portability

The code can be maintained as a single version (SWREQ10), since with the tools it is possible to keep the sequential version and generate different parallel versions from it. All tools can generate an SMP parallel version (SWREQ12). Additionally, Pluto can generate a version for distributed systems (SWREQ13) and hybrid distributed systems (SWREQ14). Par4All can also generate code for GPUs (SWREQ15), although not for any other heterogeneous system. The generated GPU code is, however, not asynchronous (SWREQ17), meaning that the CPU will be idle while the GPU executes.

9.4 Parallelism performance and tool efficiency

The SMP code generated by all the tools can make use of vector instructions (SWREQ16) in the trivial cases handled by GCC or Clang. The parallel loops in the generated code can be given load balanced schedules using the dynamic scheduling clause in OpenMP (SWREQ18), as illustrated below. However, none of the tools has heuristics for finding the most beneficial region to parallelize in the case of nested parallelism. Additionally, heuristics for guaranteeing a performance increase (SWREQ19) are missing in all tools except Cetus. It is, however, less time consuming to parallelize by hand than to make use of any of the investigated tools (SWREQ20), although the tools are capable of finding parallelism automatically once the input code has been prepared (SWREQ21).
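For illustration, load balancing with the dynamic scheduling clause looks as follows in OpenMP code (a generic sketch, not output from any of the tools):

    /* Iterations whose cost grows with i are handed out in small chunks at
     * run time, so the threads stay balanced instead of the last thread
     * receiving all the expensive iterations. */
    void run_unbalanced(double *out, int n)
    {
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = 0; k <= i; k++)      /* work grows with i */
                s += (double)k / (i + 1);
            out[i] = s;
        }
    }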

Chapter 10

Conclusions

In this report, I have presented a collection of different concepts within the subject of automatic parallelization. Additionally, I have investigated and compared tools with respect to which of these concepts they use. Their efficiency was compared using the PolyBench benchmark, consisting of small programs, and two complex face recognition applications. In this chapter I present the conclusions that can be drawn from the study and implementation as a whole.

10.1 Limitations of parallelization tools

I carried out benchmark experiments to get an understanding of how well the parallelization tools are able to port legacy code. The results show that the investigated tools are not reliable enough for larger code projects. Extensive human involvement is necessary to refactor the code before feeding it to the current parallelization tools in order to draw any benefit from them. The reason for this is that legacy code generally does not follow the coding standard supported by the parallelization tool. Pointer arithmetic, type casts, recursive functions and function pointers are some constructs that may hinder parallelization. Stubs for functions that are inaccessible for analysis, and stubs for already parallelized code, are also required. The tools can also be sensitive to where variables are declared. The LLVM-based parallelization tools are less sensitive to this, since they operate on a low level intermediate representation. They are also less restricted in the code optimizations they can perform, since readable source code is not expected from them, and in addition several different source languages can be used.

10.2 Manual versus Automatic parallelization

I have categorized the tools used for parallelizing legacy code into two groups. The first category of tools, such as Pareon, gives assistance when parallelizing legacy code by hand: they help identify parallel sections and data dependencies, and can potentially speed up the process of parallelizing legacy code manually. The second category of tools parallelizes the code automatically. The approach recommended by this report is to use tools in the first category and to parallelize by hand. The reason is the limitations of the automatic tools: making up for those limitations by refactoring the code is as costly as parallelizing the code by hand. The potential of automatic parallelization, however, still exists. Par4All, the most efficient automatic parallelization tool in this study, nearly handled the two legacy face recognition applications.

If the efficiency of this tool could be migrated to a production compiler, it would not be long until automatic parallelization for coarse-grain parallelism became standard during compilation; fine-grained parallelism, such as SIMD instructions, has already become standard in both GCC and LLVM. There are some critical issues that have to be dealt with, but the complexity of fixing them is uncertain. Apart from minor bugs in analysis and optimization, there is a need for profitability heuristics to determine whether an optimization will result in a speedup.

Portability is also an area where automatic parallelization may have an edge over hand-parallelized code. Maintaining several versions that target different platforms is difficult: if a new feature is to be implemented, all versions have to be updated with the new feature and preferably optimized for each target. With automatic parallelization there could instead be a single sequential version that is maintained, and the binaries for each target are recompiled from it. This, however, assumes that the vendors of the different platforms put in a lot of work so that compilers are able to generate optimized code for their platforms.

10.3 Future work

In this report, I have evaluated the tools' efficiency at parallelizing legacy software using only two complex applications. Ideally, additional programs should be used to evaluate the tools, to validate that the automatic parallelization tools are stable and mature. The functionality of Par4All could be migrated to a production compiler to reduce its sensitivity to how the code is written. Extending the polyhedral optimizers with functionality to resolve common dependencies, such as reductions and privatization of scalars and arrays, would allow further identification of coarse-grain parallelism with a possibility of better parallel schedules, and thus better performance than the traditional parallelizers. Analysis of side effects in functions is also a necessity for identifying parallel regions. Other parallelization methods could be evaluated and compared to those in this thesis, such as thread level speculation or automatic generation of the OpenMP task model, to name two examples. Embedded systems come in many shapes and forms; being able to target several different parallel platforms from one code base would decrease the maintenance time of the code, and therefore an investigation of how to transparently deploy optimized parallel code on different architectures should be carried out. Finally, developing a heuristic that can determine whether it is profitable to parallelize a loop or not would help make such compilers reliable enough for product development for embedded systems.

List of References

[1] The OpenMP® API specification for parallel programming. http://www.openmp.org/. [Online: accessed 20140328].
[2] ITEA2/MANY. State of Art, 2013.
[3] Many-core Programming and Resource Management for High-Performance Embedded Systems. http://www.eurekamany.org. [Online: accessed 20140328].
[4] Alten Sverige AB. http://www.alten.se/en/about-alten/. [Online: accessed 20140328].
[5] ConstRaint and Application driven Framework for Tailoring Embedded Real-time Systems. http://www.crafters-project.org/. [Online: accessed 20140328].
[6] Georgios Tournavitis and Björn Franke. Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 377–388, New York, NY, USA, 2010. ACM.
[7] Michael Süß and Claudia Leopold. Common mistakes in openmp and how to avoid them. In OpenMP Shared Memory Parallel Programming, pages 312–323. Springer, 2008.
[8] Message Passing Interface Forum. http://www.mpi-forum.org/. [Online: accessed 20140328].
[9] Tim Mattson. Patterns for parallel programming. http://parlang.pbworks.com/f/programmability.pdf.

[10] Multicore Communications API Working Group (MCAPI®). http://www.multicore-association.org/workgroup/mcapi.php. [Online: accessed 20140328].
[11] The open standard for parallel programming of heterogeneous systems. https://www.khronos.org/opencl/. [Online: accessed 20140328].
[12] CUDA Zone. https://developer.nvidia.com/cuda-zone. [Online: accessed 20140328].
[13] OpenACC, Directives for Accelerators. http://www.openacc-standard.org/. [Online: accessed 20140328].
[14] OpenHMPP directives. http://www.caps-entreprise.com/openhmpp-directives/. [Online: accessed 20140328].
[15] Chunhua Liao, Yonghong Yan, Bronis R de Supinski, Daniel J Quinlan, and Barbara Chapman. Early experiences with the openmp accelerator model. In OpenMP in the Era of Low Power Devices and Accelerators, pages 84–98. Springer, 2013.
[16] GCC 4.9 Release Series Changes, New Features, and Fixes. http://gcc.gnu.org/gcc-4.9/changes.html. [Online: accessed 20140328].

[17] ARM Cortex-A53. http://www.arm.com/products/processors/cortex-a/cortex-a53-processor.php. [Online: accessed 20140716].
[18] The Parallella board. http://www.parallella.org/board/. [Online: accessed 20140716].
[19] ARM Cortex A9. http://www.arm.com/products/processors/cortex-a/cortex-a9.php. [Online: accessed 20140716].
[20] Xilinx Zynq-7000 series. http://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html. [Online: accessed 20140716].
[21] Epiphany IV. http://www.adapteva.com/epiphanyiv/. [Online: accessed 20140716].
[22] Haoqiang Jin, Dennis Jespersen, Piyush Mehrotra, Rupak Biswas, Lei Huang, and Barbara Chapman. High performance computing using MPI and openmp on multi-core parallel systems. Parallel Computing, 37(9):562–575, 2011. Emerging Programming Paradigms for Large-Scale Scientific Computing.
[23] R. Rabenseifner, G. Hager, and G. Jost. Hybrid mpi/openmp parallel programming on clusters of multi-core smp nodes. In Parallel, Distributed and Network-based Processing, 2009 17th Euromicro International Conference on, pages 427–436, Feb 2009.
[24] ROSE Compiler Infrastructure. http://rosecompiler.org/. [Online: accessed 20140328].
[25] The Portland Group. http://www.pgroup.com/index.htm. [Online: accessed 20140328].
[26] The Caps Compilers. http://www.caps-entreprise.com/products/caps-compilers/. [Online: accessed 20140328].
[27] Convey Computer. http://www.conveycomputer.com. [Online: accessed 20140328].
[28] AMD A-Series APU Processors. http://www.amd.com/en-us/products/processors/desktop/a-series-apu. [Online: accessed 20140328].
[29] Albert Saà-Garriga, David Castells-Rufas, and Jordi Carrabina. Omp2hmpp: Hmpp source code generation from programs with pragma extensions. In High Performance Energy Efficient Embedded Systems (HIP3ES 2014), Jan 2014.
[30] Romain Dolbeau, François Bodin, and Guillaume Colin de Verdiere. One to rule them all? In Multi-/Many-core Computing Systems (MuCoCoS), 2013 IEEE 6th International Workshop on, pages 1–6. IEEE, 2013.
[31] K. Kyriakopoulos and K. Psarris. Nonlinear symbolic analysis for advanced program parallelization. Parallel and Distributed Systems, IEEE Transactions on, 20(5):623–640, May 2009.
[32] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2008.
[33] Arturo González-Escribano and Diego R. Llanos. Speculative parallelization. Computer, 39(12):126–128, December 2006. ISSN 0018-9162.
[34] Arnamoy Bhattacharyya and José Nelson Amaral. Automatic speculative parallelization of loops using polyhedral dependence analysis. In Proceedings of the First International Workshop on Code OptimiSation for MultI and many Cores, page 1. ACM, 2013.
[35] Polly - LLVM Framework for High-Level Loop and Data-Locality Optimizations. http://polly.llvm.org/. [Online: accessed 20140328].

[36] Georgios Tournavitis, Zheng Wang, Björn Franke, and Michael FP O'Boyle. Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping. 44(6):177–187, 2009.
[37] Hansang Bae, Dheya Mustafa, Jae-Woo Lee, Aurangzeb, Hao Lin, Chirag Dave, Rudolf Eigenmann, and Samuel P. Midkiff. The cetus source-to-source compiler infrastructure: Overview and evaluation. International Journal of Parallel Programming, 41(6):753–767, 2013.

[38] PoCC: the Polyhedral Compiler Collection. http://www.cs.ucla.edu/~pouchet/software/pocc/. [Online: accessed 20140328].
[39] PLUTO - An automatic parallelizer and locality optimizer for multicores. http://pluto-compiler.sourceforge.net/. [Online: accessed 20140328].
[40] Mehdi Amini, Corinne Ancourt, Fabien Coelho, François Irigoin, Pierre Jouvelot, Ronan Keryell, Pierre Villalon, Béatrice Creusillet, and Serge Guelton. Pips is not (just) polyhedral software. In International Workshop on Polyhedral Compilation Techniques (IMPACT'11), Chamonix, France, 2011.
[41] Dounia Khaldi, Corinne Ancourt, and François Irigoin. Towards automatic c programs optimization and parallelization using the pips-pocc integration. PDF from http://www.rocq.inria.fr/~pouchet/software/pocc/doc/htmldoc/htmldoc/index.html, 2011.
[42] The Par4all Compiler - An automatic parallelizing and . http://www.par4all.org/. [Online: accessed 20140328].
[43] Tobias Grosser, Armin Größlinger, and Christian Lengauer. Polly—performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(04), 2012.
[44] A. Koth, T. Creech, and R. Barua. Aesop: The autoparallelizing compiler for shared memory computers. Technical report, Department of Electrical and Computer Engineering, University of Maryland, College Park, April 2013.
[45] Graphite: Gimple Represented as Polyhedra. http://gcc.gnu.org/wiki/Graphite. [Online: accessed 20140328].
[46] The Cetus Project. http://cetus.ecn.purdue.edu/. [Online: accessed 20140328].
[47] Parallware, Automatic Parallelization of Sequential Codes. http://www.appentra.com. [Online: accessed 20140328].
[48] José M. Andión, Manuel Arenaz, Gabriel Rodríguez, and Juan Touriño. A novel compiler support for automatic parallelization on multicore systems. Parallel Computing, 39(9):442–460, 2013. Novel On-Chip Parallel Architectures and Software Support.
[49] OpenHMPP (HMPP for Hybrid Multicore Parallel Programming). http://en.wikipedia.org/wiki/OpenHMPP. [Online: accessed 20140328].
[50] STEP: Système de Transformation pour l'Exécution Parallèle. http://picoforge.int-evry.fr/projects/svn/step/index.html. [Online: accessed 20140328].
[51] Daniel Millot, Alain Muller, Christian Parrot, and Frédérique Silber-Chaussumier. Step: A distributed openmp for coarse-grain parallelism tool. In Rudolf Eigenmann and Bronis R. Supinski, editors, OpenMP in a New Era of Parallelism, volume 5004 of Lecture Notes in Computer Science, pages 83–99. Springer Berlin Heidelberg, 2008.

[52] Vector Fabrics, Improving software performance. http://www.vectorfabrics.com. [Online: accessed 20140328].
[53] Mehdi Amini, Fabien Coelho, François Irigoin, and Ronan Keryell. Static compilation analysis for host-accelerator communication optimization. In Languages and Compilers for Parallel Computing, pages 237–251. Springer, 2013.
[54] Open Source Computer Vision. http://opencv.org/. [Online: accessed 20140328].
[55] Viola–Jones object detection framework. http://en.wikipedia.org/wiki/Viola-Jones_object_detection_framework. [Online: accessed 20140528].
[56] AdaBoost, short for Adaptive Boosting. http://en.wikipedia.org/wiki/AdaBoost. [Online: accessed 20140528].
[57] PolyBench/C the Polyhedral Benchmark suite. http://www.cse.ohio-state.edu/~pouchet/software/polybench/. [Online: accessed 20140528].
[58] NAS Parallel Benchmark. http://www.nas.nasa.gov/publications/npb.html. [Online: accessed 20140328].
[59] Tomofumi Yuki. Polybench kernels. 2013.


TRITA-ICT-EX-2014:153

www.kth.se