DEGREE PROJECT IN INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND LEVEL STOCKHOLM, SWEDEN 2015

Analysis of Automatic Parallelization Methods for Multicore Embedded Systems

FREDRIK FRANTZEN

KTH ROYAL INSTITUTE OF TECHNOLOGY INFORMATION AND COMMUNICATION TECHNOLOGY

Analysis of Automatic Parallelization Methods for Multicore Embedded Systems

Fredrik Frantzen

2015-01-06

Master’s Thesis

Examiner Mats Brorsson

Academic adviser Detlef Scholle

KTH Royal Institute of Technology School of Information and Communication Technology (ICT) Department of Communication Systems SE-100 44 Stockholm, Sweden

Acknowledgement

I want to thank my examiner Mats Brorsson and my two supervisors Detlef Scholle and Cheuk Wing Leung for their helpful advice and for making this report possible. I also want to thank the other two thesis workers, Andreas Hammar and Anton Hou, who have made the time at Alten really enjoyable.

Abstract

There is a demand for reducing the cost of porting legacy code to different embedded platforms. One such platform is the multicore system, which allows higher performance with lower energy consumption and is a popular solution in embedded systems. In this report, I have evaluated a number of open source tools that support the parallelization effort. The evaluation is made using a set of small, highly parallel programs and two complex face recognition applications that show the current advantages and disadvantages of different parallelization methods. The results show that the parallelization tools are not able to parallelize code automatically without substantial human involvement, and that it is therefore more profitable to parallelize by hand. The outcome of the study is a number of guidelines on how developers can parallelize their programs, and a set of requirements that serves as a basis for designing an automatic parallelization tool for embedded systems.

Sammanfattning

Det finns ett behov av att minska kostnaderna för portning av legacykod till olika inbyggda system. Ett sådant system är de flerkärniga systemen som möjliggör högre prestanda med lägre energiförbrukning och är en populär lösning i inbyggda system. I denna rapport har jag utfört en utvärdering av ett antal open source-verktyg som hjälper till med arbetet att parallellisera kod. Detta görs med hjälp av små paralleliserbara program och två komplexa ansiktsigenkännings-applikationer som visar vilka för- och nackdelar de olika parallelliseringsmetoderna för närvarande har. Resultaten visar att parallelliseringsverktygen inte klarar av att parallellisera automatiskt utan avsevärd mänsklig inblandning. Detta medför att det är lönsammare att parallellisera för hand. Utfallet av denna studie är ett antal riktlinjer för hur man ska göra för att parallellisera sin kod, samt ett antal krav som agerar som bas till att designa ett automatiskt parallelliseringsverktyg för inbyggda system.

Contents

Acknowledgement 1
List of Tables 7
List of Figures 8
Abbreviations 10

1 Introduction 11
  1.1 Background 11
  1.2 Problem statement 12
  1.3 Team goal 12
  1.4 Approach 12
  1.5 Delimitations 12
  1.6 Outline 13

2 Parallel software 14
  2.1 Programming Parallel Software 14
    2.1.1 Where to parallelize 15
    2.1.2 Using OpenMP for parallelism 17
    2.1.3 Using MPI for parallelism 19
    2.1.4 Using vector instructions for spatially close data 20
    2.1.5 Offloading to accelerators 20
  2.2 To code for different architectures 21
    2.2.1 Use of hybrid shared and distributed memory 21
    2.2.2 Tests on accelerator offloading 22
  2.3 Conclusion 22

3 Parallelizing methods 24
  3.1 Using dependency analysis to find parallel loops 24
    3.1.1 Static dependency analysis 25
    3.1.2 Dynamic dependency analysis 26
  3.2 Profiling 26
  3.3 Transforming code to remove dependencies 26
    3.3.1 Privatization of variables to remove dependencies 26
    3.3.2 Reduction recognition 27
    3.3.3 Induction variable substitution 27
    3.3.4 Alias analysis 28
  3.4 Parallelization methods 29
    3.4.1 Traditional parallelization methods 29
    3.4.2 Polyhedral model 29
    3.4.3 Speculative threading 30
  3.5 Auto-tuning 30
  3.6 Conclusion 31

4 Automatic parallelization tools 32
  4.1 Parallelizers 32
    4.1.1 PoCC and Pluto 32
    4.1.2 PIPS-Par4all 33
    4.1.3 LLVM-Polly 33
    4.1.4 LLVM-Aesop 33
    4.1.5 GCC-Graphite 34
    4.1.6 Cetus 34
    4.1.7 Parallware 34
    4.1.8 CAPS 34
  4.2 Translators 35
    4.2.1 OpenMP2HMPP 35
    4.2.2 Step 35
  4.3 Assistance 35
    4.3.1 Pareon 35
  4.4 Comparison of tools and reflection 36
    4.4.1 Polyhedral optimizers and performance 36
    4.4.2 Auto-tuning incorporation and performance 36
    4.4.3 Functional differences 37
  4.5 Conclusion 38

5 Programming guidelines for automatic parallelizers 39
  5.1 How to structure loop headers and bounds 39
  5.2 Static control parts 40
  5.3 Loop bodies 41
  5.4 Array accesses and allocation 42
  5.5 Variable scope 43
  5.6 Function calls and stubbing 43
  5.7 Function pointers 44
  5.8 Alias analysis problems: Pointer arithmetic and type casts 45
  5.9 Reductions 45
  5.10 Conclusion 46

6 Implementation 47
  6.1 Implementation approach 47
  6.2 Requirements 48

7 The applications to parallelize 50
  7.1 Face recognition applications 50
    7.1.1 Training application 50
    7.1.2 Detector application 52
  7.2 PolyBench benchmark applications 53

8 Results from evaluating the tools 54
  8.1 Compilation flags 54
  8.2 PolyBench results 54
  8.3 Parallelization results on the face recognition applications 59
  8.4 Discussion 61

9 Requirements fulfilled by automatic parallelizers 63
  9.1 Code handling and parsing 63
  9.2 Reliability and exposing parallelism 63
  9.3 Maintenance and portability 63
  9.4 Parallelism performance and tool efficiency 64

10 Conclusions 65
  10.1 Limitations of parallelization tools 65
  10.2 Manual versus Automatic parallelization 65
  10.3 Future work 66

References 67

List of Tables

4.1 Functional differences in the tools. 37
4.2 A rough overview of what the investigated tools take as input and what they can output. 38
6.1 The list of requirements for an automatic parallelization tool. 48
8.1 Compilation flags for the individual tools. 54
8.2 Refactoring time and validity of parallelized training application. 60
8.3 Refactoring time and validity of parallelized classification application. 60

List of Figures

2.1 Two parallel tasks are in separate critical sections, each holding a resource; when each requests the other's resource, a deadlock is created. 15
2.2 Parallelism in a loop. 16
2.3 A false sharing situation. 16
2.4 A sequential program split up into pipeline stages. 17
2.5 Pipeline parallelism, displaying different balancing of the stages. 17
2.6 Thread creation and deletion in OpenMP. [1] 17
2.7 A subset of OpenMP pragma directives. 18
2.8 Dynamic and static scheduling side by side. Forking and joining is done only once. 19
2.9 Example of a SIMD instruction. 20
2.10 An overview of different architectures. 21
3.1 Example of data dependencies, revealed after unrolling the loop once. 25
3.2 GCD test on the above code segment yields that there is an independence. 25
3.3 Example of a more difficult loop. 26
3.4 An example of a variable and an array that are only live within the scope of one iteration. 27
3.5 A reduction recognition example using OpenMP. 28
3.6 A simple example of induction variable substitution. 28
3.7 A simple example of a pointer aliasing an array. 29
3.8 Example code to illustrate dependence vectors. 29
3.9 A loop nest that has been transformed to be parallelizable. 30
5.1 Allowed loop bounds. 40
5.2 Disallowed loop bounds. 40
5.3 A loop that does not satisfy as a static control part because of the unpredictable branch. 40
5.4 A loop that satisfies as a static control part. 41
5.5 Critical region within the loop. 41
5.6 Critical region fissioned out of the loop. 42
5.7 Move private dynamic allocation inside the loop scope. 43
5.8 A is classified as shared, even though it is private in theory. 43
5.9 A is in a scope where it cannot be shared between the iterations over i, thus is private. 43
5.10 Function pointers should be avoided. 44
5.11 Two examples on how to complicate alias analysis. 45
5.12 Fission out the reduction. 45
7.1 Training application for face recognition. 51
7.2 Detector application for face recognition. 52
8.1 Results from Polybench benchmarks (part 1). Y axis is speed-up. 56
8.2 Results from Polybench benchmarks (part 2). Y axis is speed-up. 57
8.3 Results from Polybench benchmarks (part 3). Y axis is speed-up. 58
8.4 Speed-up on different numbers of cores on the training application after parallelization using the different tools. 60
8.5 Speed-up on different numbers of cores on the classification application after parallelization using the different tools. 61

Abbreviations

Abbreviation          Definition
CPU                   Central Processing Unit
CUDA                  Compute Unified Device Architecture, a platform for Nvidia devices
DSP                   Digital Signal Processor
GPU                   Graphical Processing Unit
Heterogeneous system  System containing different computing units
Homogeneous system    System containing multiple identical cores
HMPP                  Hybrid Multicore Parallel Programming, a standard for writing programs for heterogeneous systems
IP core               Intellectual Property core
MCAPI                 Multicore Communications API, a standard for communication between cores on-chip or on-board in embedded systems
MPI                   Message Passing Interface, a standard defining library routines for writing portable message passing applications
OpenACC               Standard for writing programs for heterogeneous systems
OpenCL                Standard for writing programs for heterogeneous systems
OpenMP                Standard for writing parallel programs for shared memory
SMP                   Symmetric Multiprocessor (homogeneous system using shared memory)

Chapter 1

Introduction

1.1 Background

The demand for high performance in embedded systems is increasing, but at the same time the systems need to be power efficient. A way to increase performance is to add cores to the system and decrease the frequency. Power consumption can then remain constant, but to get the performance it is important to utilize the cores. Today, applications are still written for single-core execution. To utilize the processing power of a many-core system, developers have to modify their software. This can be very time consuming and difficult, and the complexity is worsened if the software has grown large, with thousands of lines of code. The state of the art report [2] by the ITEA2/MANY [3] project concludes that there is no single architecture that will provide the best performance for all kinds of applications, only for a set of applications with known complexity and required resources. It also predicts that future embedded systems will consist of hundreds of heterogeneous Intellectual Property cores, which will execute one parallel application or even several applications running in parallel. Developing for these architectures will get more complex, and this makes it necessary to create tools that close the gap between hardware and software. One big gap is the parallelism that exists in hardware but not in software. Tools that help developers create parallel software are needed. One such tool is a compiler that analyses the code and automatically parallelizes it. This allows developers to reuse their existing code and continue developing software without having to think about the hardware architecture. There is also a wide range of tools that a developer can use together with a compiler to get a more optimized application or more knowledge of their application.
The study reported here was done as a master's degree project. It was conducted at Alten Sverige [4], an engineering and IT consulting firm whose main customers belong to the energy, telecommunication, manufacturing and automotive industries. The degree project is part of the ITEA2/MANY project, which is putting together a development environment that will allow more code reuse to lower the time-to-market for embedded systems development. To some extent it is also part of the ARTEMIS/Crafters [5] project, which is developing a framework for developing applications on many-core embedded systems.

1.2 Problem statement

Parallelization of code can be a complex task depending on the legacy application, and it can take a lot of time to move to a parallel platform. It is therefore necessary to investigate whether there are cost-effective alternatives to parallelizing code by hand, such as using automatic parallelization tools. Several automatic parallelization tools are available, but in the context of the MANY and Crafters projects it was unknown how to draw benefits from them. It was also of interest how the tools can be improved to increase the benefits of using them. The goal was to give a model for how automatic parallelization can be used in production. This report can be seen as a package containing guidelines and a knowledge base for using automatic parallelization tools. This will hopefully lead to a decrease in the amount of resources needed to port legacy code and serve as a basis for future improvements in automatic parallelization tools.

1.3 Team goal

During this degree project, a sub-project together with two other thesis workers was carried out. Each thesis worker studied a separate subject, and the goal was to combine the knowledge gained from these studies to design and develop a face recognition use case that makes use of the automatic parallelization tools investigated in this thesis and middleware components supplied by the other two workers. The other two technologies are run-time adaptive code using self-managing methods, and high-performance interprocess communication. The implementation was conducted on Linux on an x86 multi-core system. The use case application was used to validate the efficiency of the automatic parallelizers.

1.4 Approach

In this master's thesis, I have made an academic study of the state of the art in automatic parallelization of software. I investigated what methods there are to create parallel code, both manually and automatically, and looked at the current technologies and methods used in different automatic parallelizing tools. This includes material on compilers, parallel theory and scientific articles on parallelization. Different methods for parallelizing software were investigated and analyzed, but the focus was on automatic parallelization. The second half of the work consists of an evaluation of the automatic parallelization tools to get an insight into their usability. The result of the study is a comparison of existing automatic parallelizing tools that distinguishes the differences between the tools in terms of what parallelizing methods they use and their efficiency. An analysis was then carried out, using the results from the evaluation and the findings of the study, of how these tools can be improved and what technology should be incorporated in an automatic parallelization tool for embedded systems.

1.5 Delimitations

This report considered only the parallelization of sequential C code, since C is a widely used language in developing embedded systems. Furthermore, thread level parallelism for SMP systems was the main focus, but findings and discussions on how to target other systems are presented as well. Both tools that automatically parallelize code and tools that can improve the workflow when parallelizing by hand were investigated.

1.6 Outline

This report is divided into nine chapters excluding this introduction chapter. Chapter 2 describes the concept of parallelizing a program. This includes a description of how developers can parallelize their programs using libraries and compiler directives, concepts that developers should keep in mind when parallelizing, and an overview of the different system architectures that a developer can decide to target with a parallel program. Chapter 3 presents common techniques that are used in automatic parallelization compilers; a summary concluding the chapter reflects on their strengths and weaknesses and why one method might be more favorable than another. Chapter 4 gives the reader a summary of existing tools that can perform automatic parallelization or assist the developer in making a program perform better on a parallel system. Some of the tools do the same things as others and some do entirely unique things; a detailed map depicting how the tools differ is presented, together with data on why one tool is more favorable than another. Chapter 5 presents the refactoring steps required to take advantage of parallelization tools. Chapter 6 discusses the implementation approach needed for creating an automatic parallelizer that works efficiently on general problems. Chapter 7 presents the applications that were used in the evaluation of the selected automatic parallelization tools, and chapter 8 presents the results from the evaluation. In chapter 9, the selected tools are compared against the requirements identified in chapter 6, to get a basis for what improvements are necessary to make them useful. Chapter 10 presents the conclusions that can be drawn from the work carried out during this thesis.

Chapter 2

Parallel software

Parallelizing software had been a research topic for decades before the multi-core revolution, but now it is more relevant than ever. Since the CPU frequency increase in computer systems has begun to stall at about 3.5 GHz, it has become more interesting to add more cores. This allows computer systems to perform better using less power. If the frequency (f) of a system is halved, the voltage (V) can be lowered, which leads to the power consumption (P) becoming an eighth of that of the original system (see Equation 2.3). By adding an additional core, the theoretical performance is about the same as that of the original system, while only a fourth of the power is consumed. Adding two more cores, the performance could be double that of the single core system while only consuming half of the power.

P = C · V² · f    (2.1)
V = a · f    (2.2)
P = C · a² · f³ / 2³    (2.3)

The performance in practice is however a different thing. Software has yet to be written to utilize multi-core architectures efficiently. Most of the software today is written for single thread execution and has to be rewritten for the new architectures to achieve this performance. Software is also limited by Amdahl's law, which states that the speed-up (S) of a program is limited by the proportion of the program that is not parallelizable (1 − P) (see Equation 2.4). This can be seen in the formula: as the number of cores (N) increases, the second term in the denominator moves towards zero and no longer affects the performance of the program significantly.

S = 1 / ((1 − P) + P/N)    (2.4)

Therefore the important question is how much of the code can be parallelized, and thus methods for finding parallelism are needed. This chapter gives the reader an introduction to these methods and an understanding of how one can go about parallelizing a program or porting a program to a multi-core architecture.

2.1 Programming Parallel Software

There are several programming languages in existence today, and a large subset of them also supports parallel programming in different ways. To mention a few, there are C, Ada, Java, Haskell and Erlang.

Figure 2.1: Two parallel tasks are in separate critical sections, each holding a resource; when each requests the other's resource, a deadlock is created.

These languages are interesting from a parallel programming perspective in different ways. This thesis will only look at C, which is one of the most popular languages, especially for embedded systems. C is a very low level language compared to the others mentioned, and has little abstraction for parallel programming without extensions. As of the C11 standard, however, it is possible to use a standard thread library that does not require a POSIX based system.
To write parallel programs in C, a thread library such as the standard one or the POSIX threads library (pthreads) can be used. This gives the developer full support for programming tasks that execute in parallel. But programming for parallel systems is not trivial. When the developer wants tasks to share a resource, several problems can occur. The parallel tasks cannot read and write the resource in whichever way they like, because this will result in race conditions. A race condition can occur if a task writes to a shared resource and plans to read it in the near future: another task that is also using the resource can write to it before the first task reads it. This means that the program will be non-deterministic at run-time. To the developer, it can look like there is no problem with the program, especially when programming on a single core system, where tasks run concurrently but not in parallel. To prevent this, the developer can use protection mechanisms, e.g. locks or semaphores, to surround a critical section, so that a task that enters this section is guaranteed that the shared resource is not modified while it is in use. Using locks can however impose other problems. A deadlock occurs when two tasks each want a resource the other task is holding. See Figure 2.1 for a visual explanation: Task1 holds R1 and wants to get R2, but R2 is already held by Task2, and Task2 wants R1. What happens is that both tasks end up in a waiting state, and execution cannot proceed.
An alternative to programming threads in this error prone way is to use an annotation based approach with OpenMP [1]. The annotations are pragma statements in C code that are handled by the compiler. When the compiler parses the statements, it inserts low level code that implements the intended functionality. This method is less verbose and maps better to the parallel paradigm. The benefit of using this method is that it abstracts much of the required synchronization, meaning that the problems of race conditions and deadlocks can be avoided to some degree; however, it does not make the programmer completely safe. As a side note, to be able to program with OpenMP the developer needs to use a compiler that supports the OpenMP pragmas.
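As a minimal sketch of the race condition and its fix (the counter variable and the iteration count are illustrative, not taken from any particular application), the following C fragment loses increments if the unprotected update is used, while the OpenMP critical section keeps every increment:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        long counter = 0;
        long i;

        #pragma omp parallel for
        for (i = 0; i < 1000000; i++) {
            /* Unprotected: counter++; two threads can read the same old
               value and one of the increments is lost (a race condition). */

            /* Protected: only one thread at a time enters the critical
               section, so the final value is always 1000000. */
            #pragma omp critical
            counter++;
        }

        printf("counter = %ld\n", counter);
        return 0;
    }

Compiled with an OpenMP-capable compiler (e.g. gcc -fopenmp), removing the critical directive typically makes the printed value fall short of 1000000.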

2.1.1 Where to parallelize

There can be several places in the software that can be parallelized. Loops are the most common target for refactoring when parallelizing a program, because a lot of the execution time is spent in these sections. The potential speed-up of a loop is equal to the number of iterations. Figure 2.2 shows how a loop can be split up into workers that execute a number of iterations each.

Figure 2.2: Parallelism in a loop.

When parallelizing loops there are several problems that need to be avoided. Loops typically process a data set where elements are stored spatially close in memory. On a symmetric multiprocessing unit it is common to have some form of cache coherence. Cache coherence makes sure that cores that are working on the same memory always have the latest version of a memory block. If one core writes to a memory address, the data will be written to a cache line first, since it may be written or read again in the near future. When the core has written to its cache, the line has to be invalidated in all the other caches used by the neighboring cores, since their copy of the data is no longer the latest version. In a cache coherent system, false sharing can occur. It means that a cache line in a neighboring cache is invalidated by accident. When a memory address is read, the elements that are spatially close to it are loaded into the same cache line. Figure 2.3 shows a simple example of a cache line containing two data blocks. When core 1 writes to A, the cache line for core 2 is invalidated even though core 2 is only interested in B.

Figure 2.3: A false sharing situation.
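The situation in Figure 2.3 can be reproduced with per-thread counters stored next to each other in one array. The sketch below is illustrative (the 64-byte cache line size is a common but architecture-dependent assumption); padding each counter to its own cache line removes the accidental invalidations:

    #include <omp.h>

    #define NUM_THREADS 4
    #define CACHE_LINE  64   /* assumed cache line size in bytes */

    /* Adjacent counters share a cache line: every update by one thread
       invalidates that line in the other threads' caches (false sharing). */
    long hits[NUM_THREADS];

    /* One counter per cache line: updates no longer disturb other threads. */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };
    struct padded_counter hits_padded[NUM_THREADS];

    void count_events(long n)
    {
        #pragma omp parallel num_threads(NUM_THREADS)
        {
            int id = omp_get_thread_num();
            long k;
            for (k = 0; k < n; k++)
                hits_padded[id].value++;  /* use hits[id]++ to see the slowdown */
        }
    }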

Another pattern to look out for in a sequential program is the pipeline (see Figure 2.4). This pattern assumes that there is a batch of data that needs to be processed. A pipeline processes one data element at a time, in several stages. When the first message is received at the second stage, a second message can be handled at the first stage simultaneously. By splitting up a program into pipeline stages, the potential speed-up is equal to the number of pipeline stages, but in reality the load balancing of the pipeline stages has to be perfect to get that performance. The balancing in terms of execution time is shown in Figure 2.5: A shows a perfectly balanced pipeline, and B illustrates the stalling pipeline problem. Replicating a pipeline stage may give a non-stalling pipeline, as seen in C, but performance is still lost because threads have to go idle. This pattern is harder to handle automatically and has to make use of techniques similar to that of Tournavitis et al. [6], where automatic tuning is used to find the optimal load balance.

Figure 2.4: A sequential program split up into pipeline stages.

Figure 2.5: Pipeline parallelism, displaying different balancing of the stages.

2.1.2 Using OpenMP for shared memory parallelism

OpenMP [1] is a standard that provides an extension to C. It provides a simple and flexible interface for handling threads. It is an API that consists of a set of compiler directives, library routines and environment variables that influence run-time behaviour. The compiler directives hide the complicated parts such as synchronization of threads, resource sharing and thread scheduling. This report will return to OpenMP and its directives repeatedly, since it is a popular output format for the automatic parallelizing compilers presented in chapter 4.

Figure 2.6: Thread creation and deletion in OpenMP. [1]

The OpenMP parallel directive is used to fork a number of threads, defined either by the developer or by an environment variable, that execute the region following the directive in parallel. The forked threads execute until they reach a synchronization clause such as the barrier directive, where they wait until all threads have finished executing. If this happens at the end of a parallel region (where barriers are implicitly inserted), the threads are joined with the master thread and the application continues running on a single thread. Otherwise the threads continue executing until the next synchronization clause is reached. Figure 2.6 displays the fork and join model. OpenMP also supports a tasking model, where directives in the code put tasks on a work queue. Each thread can put more work on the queue, and when all tasks are finished the threads continue. This model can be used for recursive algorithms. This study will mainly look at the work sharing constructs defined by OpenMP.

#pragma omp parallel
{
    // Code executed by every thread in the team
}

#pragma omp for schedule((static)|(dynamic)|(guided), chunk_size)
for (i = 0; i < N; i++) {
    // Iterations are divided among the threads of the enclosing parallel region
}

Figure 2.7: A subset of OpenMP pragma directives.

The work sharing constructs in OpenMP are directives that define which thread is going to execute which part of the region. By using the parallel for directive (followed by a for loop), you can execute the iterations of a loop in parallel, assigning each thread a number of iterations to execute. Iterations can be scheduled either at compile time or at run-time, as defined by the developer using the OpenMP schedule clause. The directives mentioned so far can be found in Figure 2.7. There are three scheduling kinds in OpenMP: static, dynamic and guided. The static schedule determines at compile time which iterations are going to be executed by which thread; the work of the loop is split up into chunks of iterations, where the size of a chunk is determined in the schedule clause. In the dynamic schedule, the iterations are scheduled at run-time. The benefit of using a dynamic schedule over a static one is that the work load will be better balanced over all threads, as shown in Figure 2.8: when a thread is out of work, it can request more work from the scheduler. In contrast, with the static schedule, a thread that has executed all its iterations has to go idle until the other threads have finished. The drawback of the dynamic schedule is the additional overhead of assigning chunks to threads at run-time. The guided schedule is in principle the same as the dynamic schedule; the difference is that the chunk sizes it hands out to threads vary.
Like pthreads, OpenMP does not make the developer safe from race conditions. A variable can be shared between threads using the shared clause. This can be useful if a variable is only going to be read by several threads. But if writes are going to be performed on the variable, it is up to the developer to either specify it as private, to let each thread have its own private copy of the variable, or to insert a critical section clause.

Figure 2.8: Dynamic and static scheduling side by side. Forking and joining is done only once.
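As an illustrative sketch of the clauses described above (N, the work function heavy_work() and the chunk size 4 are placeholders, not taken from the thesis applications), the fragment below combines a dynamic schedule with a private temporary and a critical update of a shared variable:

    double total = 0.0;
    double tmp;
    int i;

    /* Chunks of 4 iterations are handed out at run-time (dynamic schedule);
       tmp is private, so each thread has its own copy, while the update of
       the shared variable total is protected by a critical section. */
    #pragma omp parallel for schedule(dynamic, 4) private(tmp) shared(total)
    for (i = 0; i < N; i++) {
        tmp = heavy_work(i);
        #pragma omp critical
        total += tmp;
    }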

Forking and joining threads may lead to significant overhead if they are placed inside nested loop statements. Creating threads is an expensive operation [7]. If the creation of threads is made inside a loop, the threads are created as many times as there are iterations of the outer loop. This can create significant overhead if the region that executes in parallel is short, making the program slower. A better approach is to move the creation of threads out to the outer loop, since this results in only one instance of forking threads.
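A minimal sketch of this hoisting, assuming a doubly nested loop over placeholder arrays A and B and a work function f(): in the first version a thread team is forked and joined M times, in the second it is forked only once while the inner loop is still work-shared.

    /* Costly: threads are forked and joined on every outer iteration. */
    for (i = 0; i < M; i++) {
        #pragma omp parallel for
        for (j = 0; j < N; j++)
            B[i][j] = f(A[i][j]);
    }

    /* Cheaper: one parallel region around the outer loop; each thread runs
       the outer loop redundantly (i must be private) and the inner loop is
       split among the threads by the omp for directive. */
    #pragma omp parallel private(i)
    for (i = 0; i < M; i++) {
        #pragma omp for
        for (j = 0; j < N; j++)
            B[i][j] = f(A[i][j]);
    }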

2.1.3 Using MPI for distributed memory parallelism

MPI [8], the Message Passing Interface, is a standard that defines an API for interprocess communication. MPI is a useful abstraction for coarse-grain parallelism, i.e. when the work can be divided into separate tasks that communicate with each other to a small degree. The advantage that MPI has over OpenMP is that nothing is assumed about the underlying architecture, which means that the application can be deployed on any system, while OpenMP only executes in parallel on SMP architectures (although version 4 of the standard [9] adds support for heterogeneous systems with shared memory). An MPI application can also be distributed over several systems simultaneously, since the links connecting two tasks of a program hide the location of a task. Several implementations of the MPI API exist. A full MPI implementation is too big to be included in most embedded systems, but there are libraries that implement a subset of functions with similar functionality to that of the MPI API, such as MCAPI [10]. The disadvantage of using MPI is that there will be additional overhead when tasks that are running on the same SMP unit communicate. When processes are running on the same node, sending data with MPI will use shared memory. Although the memory is shared, there will be a copy from the send buffer into the shared memory and then a copy from the shared memory to the receive buffer. Therefore, MPI is better suited for applications that do not have to send a lot of data around, or that are not running on a cache coherent system.
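As a minimal sketch of the message passing style (the payload and tag values are arbitrary), the program below sends one integer from rank 0 to rank 1 and would be started with an MPI launcher such as mpirun with two processes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Send one int to rank 1, message tag 0. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Blocking receive of the matching message from rank 0. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }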

2.1.4 Using vector instructions for spatially close data

The finest grain of parallelism is when each data element in an array can be processed in parallel. Vectorization, or SIMD (Single Instruction Multiple Data), is a technique that makes it possible to execute one instruction on several elements that are spatially close in memory (see Figure 2.9). This requires that the hardware has support for these instructions. The width, i.e. the number of elements that can be processed at a time, varies. Many popular compilers, e.g. GCC, have support for automatically inserting vector instructions in trivial cases. A trivial case can be a loop that contains a chain of binary operations (addition, multiplication, etc.) performed on each data element in an array. These instructions can be replaced with vector instructions, lowering the iteration count of the loop.

Figure 2.9: Example of a SIMD instruction.
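A loop of the trivial kind described above is sketched below (the function name and signature are illustrative). The C99 restrict qualifiers tell the compiler that the arrays do not alias, which makes it easier for GCC to vectorize the loop at -O3, where -ftree-vectorize is enabled:

    /* Independent element-wise adds: groups of scalar additions can be
       replaced by one SIMD addition each. */
    void add_arrays(float * restrict c, const float * restrict a,
                    const float * restrict b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }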

2.1.5 Offloading to accelerators

Many architectures are heterogeneous, i.e. a CPU is combined with an accelerator of some kind. PCs typically have a Graphical Processing Unit (GPU), and embedded systems have a whole range of different accelerators such as Digital Signal Processing units (DSP). Accelerators are highly specialized for a particular type of problem, which often means that they can execute it faster than the CPU. Lately it has become very common to make use of accelerators, especially GPUs, for general purpose applications. To program accelerators there are several libraries such as OpenCL [11] or CUDA [12]. CUDA is used for programming Nvidia GPUs and OpenCL is designed for programming accelerators in general. Similar to pthreads, programming with these languages can become complex, and they are quite verbose compared to OpenMP pragmas. Annotation based approaches similar to OpenMP are also available, such as OpenACC [13] and OpenHMPP [14]. They provide compiler directives that make it possible to offload pieces of the execution onto an accelerator. Which accelerator to offload to is specified in the directive, and the compiler is then responsible for inserting accelerator code adapted to the specified accelerator. The two annotation languages are very similar; OpenACC is heavily influenced by the OpenHMPP directives. OpenHMPP was first developed by CAPS for their own compiler, but later several companies working with accelerators created a committee that together developed the OpenACC standard [13]. Currently OpenHMPP has more directives than the OpenACC standard, but it has not gained popularity. OpenACC has slowly been adopted by GCC, but it is currently only able to target the host (CPU) and not accelerators. In the latest OpenMP version (4.0), support for accelerators has been added. It is not implemented in any compiler yet, although some preliminary implementations have been made [15]. The next version of GCC will support OpenMP 4.0, but just as for OpenACC the only possible target will be the host [16].
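As a sketch of what such an annotation looks like (the arrays, their length n and the loop body are placeholders), an OpenACC directive that offloads a simple loop could be written as follows; the copyin/copyout clauses describe the data movement between host and accelerator memory:

    /* Copy a and b to the accelerator, execute the loop there in parallel,
       and copy the result c back to the host. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];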

2.2 To code for different architectures

There exist several different hardware architectures. The OpenMP, MPI and OpenACC standards solve problems in different domains, and it therefore makes sense to combine them. Each standard is good within its domain, and they can complement each other.

Figure 2.10: An overview of different architectures.

In Figure 2.10 an overview of different architectures is given. A) depicts a system with four threads using OpenMP to take advantage of shared memory, and B) depicts an MPI implementation on a system using distributed memory. C) and D) display two different setups of a heterogeneous system: C) displays one thread that is offloading tasks to a number of threads on a connected accelerator, and D) displays a heterogeneous system where accelerator threads and the CPU share memory. The coding standards for accelerators cannot be used to program multiple CPU threads. E) shows the combination of using MPI together with OpenMP. F) is a system that is able to take full advantage of the parallelism in the hardware; it requires the developer to program with the previously mentioned standards to achieve this mapping in software, which can become very complex depending on how the system is set up. To give some examples of architectures: ARM, which has a big market share in mobile units, has released an SMP processor called Cortex-A53 [17], which corresponds to the system depicted in A). This processor supports being connected to one additional ARM processor, which creates a system similar to what is depicted in E). The system depicted in F) is similarly complex to Adapteva's Parallella board [18], which has an ARM Cortex-A9 dual-core [19] together with a Zynq-7000 series FPGA from Xilinx [20] with the capability to use shared memory. It also has a co-processor developed by Adapteva called Epiphany IV [21], which consists of 64 accelerator cores.

2.2.1 Use of hybrid shared and distributed memory

A common approach is to create a distributed program with MPI where each component internally uses OpenMP to benefit from shared memory. This is called hybrid programming. Comparisons of the performance of hybrid MPI/OpenMP programming against a purely distributed MPI implementation have been made by Jin et al. [22] and Rabenseifner et al. [23]. In summary, they show that using the hybrid combination instead of a pure distributed implementation does not always increase performance.

Rabenseifner et al. [23] have tested different set-ups of distributed models. They compared a pure MPI implementation with a hybrid MPI/OpenMP implementation, and in their results the hybrid implementation outperforms the pure MPI implementation. There are, however, multiple issues with combining the standards, as acknowledged by Rabenseifner et al. [23]: the OpenMP parallel region either has to join into the master thread for MPI communication, or communication has to be overlapped with the computations. The former has the disadvantage of having to keep threads idle during communication. Jin et al. have shown that a pure MPI implementation performs better on big clusters, and that the limitation of the hybrid version is the data locality in the implementations used.
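A minimal sketch of the hybrid style (the data decomposition and names are illustrative, and the MPI library is assumed to have been initialized with MPI_Init_thread and at least MPI_THREAD_FUNNELED support): each MPI process reduces its own chunk with OpenMP threads in shared memory, and the per-process results are then combined across nodes with MPI.

    #include <mpi.h>

    /* Sum the local chunk with OpenMP threads, then combine the partial
       sums of all MPI processes. */
    double hybrid_sum(const double *chunk, long n)
    {
        double local_sum = 0.0, global_sum = 0.0;
        long i;

        #pragma omp parallel for reduction(+:local_sum)
        for (i = 0; i < n; i++)
            local_sum += chunk[i];

        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        return global_sum;
    }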

2.2.2 Tests on accelerator offloading

Liao et al. [15] compared a preliminary implementation of the OpenMP 4.0 standard in the Rose [24] compiler infrastructure with the PGI [25] and HMPP (CAPS) [26] compilers, where the code was annotated with OpenACC. There was also an OpenMP implementation that did not use accelerators but made use of the 16 cores of the Xeon processor. The accelerator code was run on a Nvidia Tesla K20c. The comparison shows a significant speed-up on matrix multiplication compared to the 16 core OpenMP implementation, visible for matrices bigger than 1024x1024 elements. The paper also shows how much time is spent on preparing the kernel for the matrix multiplication: when the kernels were small (matrices smaller than 512x512 elements), the preparation accounted for 50% or more of the execution time. In the second test, the computation is not heavy and the communication cost overshadows the computation cost; here the accelerator code does not outperform the sequential version until the vector size reaches 100M floats. The third test is computationally heavy, but the threads on the GPU are not used optimally. Adding the collapse clause in OpenACC, which the HMPP compiler supports, showed that it can outperform the 16 core OpenMP implementation, although this is only visible for matrices of size 1024x1024 and above; it is at this point that the sequential version gets outperformed by the other accelerator implementations as well.
This shows that a disadvantage of using accelerators is that copying is needed between the memory of the accelerator and the main thread, since they typically do not share memory. It means that accelerators should be used when there is a lot of data to be processed in computationally heavy kernels, so that the cost of copying is outweighed by the speed-up due to parallelism. Another alternative is to use architectures where accelerators and CPUs share memory, like the hybrid core developed by Convey Computer [27], which combines a CPU with an FPGA accelerator, or AMD's [28] APU, which combines the CPU with the GPU; this type of system is depicted in Figure 2.10.D. Another problem with accelerators is that they are very target dependent. The portability versus performance of accelerator programming has been discussed by Saà-Garriga et al. [29] and Dolbeau et al. [30], who confirm this point.

2.3 Conclusion

This chapter has presented several techniques for creating parallel software, such as OpenMP, MPI, OpenACC and vectorization. These simplify the work for a developer who parallelizes by hand, but there are still multiple problems that the developer has to solve, such as load balancing, inefficient use of caches and overheads. It is not obvious how to parallelize code for optimal performance, and thus tools that can parallelize code automatically become a valid alternative to manual coding. Potentially, a general version of the source code could be ported and optimized for multiple platforms without the involvement of the developer.
OpenMP has well defined library functions for dealing with SMP systems. It is also an active standard that will soon incorporate heterogeneous systems. Furthermore, compilers have the functionality to deal with load balancing using the dynamic scheduling clause. Thus OpenMP will be the standard that is the main focus in this report.

Chapter 3

Parallelizing methods

The popular compilers already incorporate several optimization transforms and parallelization techniques; automatically inserting vector instructions is one example. New techniques have been researched and tested on research compilers and many are still in an experimental state, but some of these techniques may be ready for production compilers. This chapter gives the reader an overview of the different techniques used in compilers today, both in products and in research.
The compiler's main task is to take source code and produce binaries that can be run on a target hardware architecture. However, the compiler's responsibility is more than that. The compiler is able to detect syntactical errors and some semantic errors in the source code. The compiler is also able to reorganize code so that the processor is kept busy while fetching data from memory. Currently the main memory is the big bottleneck in computers, and fetching from it slows the execution; keeping a value that will be used again soon in a register is one way of reducing the number of accesses to memory, and this is one particular decision a compiler is able to make. This thesis is not about the common optimization techniques compilers use to speed up a program; it is instead focused on the particular methods that are useful for parallelizing a program.

3.1 Using dependency analysis to find parallel loops

Analysis of the program is needed to find parallelism. There are two kinds of analyses a compiler can do: static dependency analysis and dynamic dependency analysis. Dependency analysis is an important step in order to find parallel regions. A parallel program does not impose a specific order in which the threads will execute; therefore the data the threads handle has to be independent. The dependencies come from the use of shared memory between the threads. There are three types of dependencies: read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW) dependencies. In Figure 3.1, a simple loop is unrolled once to display the three different dependencies mentioned. A RAW dependency in a loop means that the iteration that does the read access has to come after the iteration that does the write access; if the read comes before the write, the read access will read the old data that was stored at that memory address. Similarly, the WAR dependency means that a write has to occur after a read, or the read access would read the overwritten data. A WAW dependency needs the writes to be in order, or a future read access will read the wrong data. Static dependency analysis relies on looking at the source as it is (in an intermediate representation form) and finds memory dependencies by using complex algorithms. In contrast, dynamic dependency analysis looks at how the program executes and finds dependencies by looking at which memory addresses are accessed at run-time.

for (i = 1; i < N - 1; i++) {
    A[i] = A[i-1] + 1;
    x    = B[i+1];
    B[i] = x * 2;
}

/* The loop body unrolled once (iterations i and i+1): */
A[i]   = A[i-1] + 1;
x      = B[i+1];
B[i]   = x * 2;
A[i+1] = A[i]   + 1;   /* RAW: A[i] was written in the previous iteration */
x      = B[i+2];       /* WAW: x is written again in every iteration      */
B[i+1] = x * 2;        /* WAR: B[i+1] was read in the previous iteration  */

Figure 3.1: Example of data dependencies, revealed after unrolling the loop once.

3.1.1 Static dependency analysis

Loop dependencies are a hot topic in automatic parallelization research, as loops are the regions where a program can potentially run in as many threads as there are iterations. But if there are dependencies between iterations, the loop has to be executed serially. To find out whether a dependency exists (when it is non-trivial), data dependence tests can be used. Finding data dependencies exactly is an NP-hard problem, meaning that it cannot in general be solved in reasonable time. Instead, approximations have to be made; these algorithms are called data dependency tests. Such a test can prove the absence of a dependence for a subset of the problems in polynomial time. A dependency has to be assumed whenever independence cannot be proven; this guarantees that the program will execute correctly. In Figure 3.2, a simple dependency test called the greatest common divisor test (GCD test) is illustrated. The test states that, given two array accesses A[a · i + b] and A[c · i + d], a dependence can exist only if GCD(a, c) divides (d − b); if it does not, the accesses are independent.

for (i = 1; i < N; i++) {
    A[2*i] = A[2*i + 1] + B[i];
}
/* GCD(2, 2) = 2 does not divide (1 - 0) = 1, so the write A[2*i] and the
   read A[2*i + 1] can never touch the same element: independence is proven. */

Figure 3.2: GCD test on the above code segment yields that there is an independence.

These tests do not say when a dependency occurs, only that one may occur. The loop in Figure 3.3.A contains a non-linear term in the subscript of A, which is not handled by some dependency test algorithms, so a dependency will be assumed here. In this case there is a WAR dependency between the first two iterations, but the rest are dependency free. This means that the loop could have been parallelized if the first iteration had been hoisted out (see B), or if each thread had executed a chunk of two iterations at a time (see C). Popular data dependency tests today are the GCD test, I-test, Omega test, Banerjee-Wolfe test and range test. Kyriakopoulos et al. [31] have compared the performance of these dependency tests, together with a new test of their own. Their results show that each test is good in a different way: the I-test is fast and resolves many cases, while the Omega test is slower but resolves some cases that the I-test cannot.

/* A) The subscript i*i is non-linear; the only actual dependency is a WAR
      between iterations 0 and 1 (iteration 0 reads A[1], iteration 1 writes A[1]). */
for (i = 0; i < N; i++) {
    A[i*i] = A[i*i + 1];
}

/* B) With the first iteration hoisted out, the remaining loop is parallel. */
A[0] = A[1];
#pragma omp parallel for
for (i = 1; i < N; i++) {
    A[i*i] = A[i*i + 1];
}

/* C) Alternatively, chunks of two iterations keep the dependent pair
      of iterations on the same thread. */
#pragma omp parallel for schedule(static, 2)
for (i = 0; i < N; i++) {
    A[i*i] = A[i*i + 1];
}

Figure 3.3: Example of a more difficult loop.

3.1.2 Dynamic dependency analysis

In contrast to static dependency analysis, dynamic dependency analysis looks at the actual accesses to memory. If two iterations of a loop read and write the same memory address during execution, there is a dependency. With this method it is possible to find more parallelizable loops than with static dependency analysis: loops that may contain a dependency according to a static dependency test can be shown not to exercise that dependency within the loop range. The problem with dynamic dependency analysis is that it can be overly optimistic and classify loops that contain a dependency as parallel, because the given input data did not produce the memory access pattern that might occur in other situations. This can be a huge problem, and to work around it the test input has to cover all memory access patterns to guarantee that a race condition cannot occur in the future. Another drawback of dynamic analysis is that running the program to gather the data can be very time consuming, depending on the program, and is not something one wants to do on every compilation.

3.2 Profiling

Profiling is a dynamic way to get more knowledge of how a program executes in terms of control flow and memory usage. It is done at run-time, where different methods can be used, such as instrumenting the code or sampling the program counter at regular intervals. The methods can, for example, count the number of jumps to a function (edge counters), the branch prediction miss rate and the cache miss rate. Execution times can also be monitored. With this functionality, the developer can see after an execution where a program spends most of its execution time, which can be used as a hint to where he or she should try to optimize. It can also be used by a compiler to decide whether parallelization is beneficial or not when optimizing the program.

3.3 Transforming code to remove dependencies

Some recurring dependencies can be removed trivially, which may make the loop parallel. The following sections present some of them.

3.3.1 Privatization of variables to remove dependencies

If a loop region is to be executed in parallel, some of the dependencies that were introduced only for sequential efficiency can be removed. An array can, for example, be allocated outside of a loop region and then be reused during execution; this reduces the memory usage. When running in parallel, this results in dependencies, because the same memory is written and read in every iteration. Using liveness analysis, it might be detected that the variable or array is live only during one iteration at a time. This particular example can be seen in Figure 3.4.

float A[N];
float s = 0;
for (i = 0; i < M; i++) {
    float t = f(i);
    for (j = 0; j < N; j++)
        A[j] = t * g(i, j);
    for (j = 0; j < N; j++)
        s += A[j];
}

/* A and t are only live within one iteration of the outer loop,
   so they can be moved into the loop scope (privatized): */
float s = 0;
for (i = 0; i < M; i++) {
    float A[N];
    float t = f(i);
    for (j = 0; j < N; j++)
        A[j] = t * g(i, j);
    for (j = 0; j < N; j++)
        s += A[j];
}

Figure 3.4: An example of a variable and an array that are only live within the scope of one iteration.
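A parallelizer (or the developer) can express the privatized version with OpenMP clauses. The sketch below is illustrative (f(), g(), M and N are placeholders) and also turns the accumulation into a reduction, anticipating the next section, so that s does not need to be updated unprotected:

    float s = 0.0f;
    int i;

    #pragma omp parallel for reduction(+:s)
    for (i = 0; i < M; i++) {
        float A[N];      /* private: each thread/iteration gets its own buffer */
        float t = f(i);  /* private scalar */
        int j;
        for (j = 0; j < N; j++)
            A[j] = t * g(i, j);
        for (j = 0; j < N; j++)
            s += A[j];
    }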

3.3.2 Reduction recognition

A reduction statement in a loop introduces a dependency between iterations. An example of a reduction statement is when a sum is calculated. Because the add operation is associative, i.e. (a + b) + c = a + (b + c), the order in which the terms are combined does not matter. This is not always true, however: reductions on floats can give different results depending on the order of execution because of rounding errors. OpenMP implements a clause that handles simple reductions. It works by letting each thread have its own private reduction variable, which is then reduced into a globally shared variable in a succeeding critical region, similar to what is seen in Figure 3.5.

3.3.3 Induction variable substitution

Another common dependency is the one created by induction variables. An induction variable is a variable that is changed in the same manner in every iteration of a loop, meaning that its value N iterations ahead can be predicted. Such variables can often be substituted with the induction formula that defines them, as shown in Figure 3.6. The benefit is that the loop dependency is removed, but evaluating the induction formula can be costly, meaning that the transform is not always beneficial.

/* Using the OpenMP reduction clause: */
#pragma omp parallel for reduction(+:q) private(l)
for (i = 0; i < N; i++) {
    l = f(i);    /* heavy execution */
    q += l;
}

/* Roughly how the reduction is implemented: each thread accumulates into
   a private copy, which is then added to the shared variable inside a
   critical region. */
#pragma omp parallel private(q1, l)
{
    q1 = 0;
    #pragma omp for
    for (i = 0; i < N; i++) {
        l = f(i);    /* heavy execution */
        q1 += l;
    }
    #pragma omp critical
    q += q1;
}

Figure 3.5: A reduction recognition example using OpenMP.

c = 10;
for (i = 0; i < 10; i++) {
    /* c is incremented by 5 for each loop iteration */
    c = c + 5;
    A[i] = c;
}

for (i = 0; i < 10; i++) {
    /* the dependency on the previous iteration is now removed */
    c = 10 + 5*(i+1);
    A[i] = c;
}

Figure 3.6: A simple example of induction variable substitution.

3.3.4 Alias analysis

When a pointer refers to the same memory address as another pointer (or array), they are said to alias each other. In parallel programs, if threads have pointers that alias the same memory location, problems such as race conditions can arise when the address is both read and written. The behavior of the program can become non-deterministic, in that different runs of the same program on the same data may produce different results. Thus it is often important to impose an explicit ordering between threads which hold aliases. In Figure 3.7, a pointer aliases an array and both variables are accessed within the loop. By replacing all aliases with the original variable name, the loop becomes simpler to analyze for dependencies.

/* A pointer aliasing the array A: */
p = &A[3];
for (i = 0; i < 10; i++) {
    *p = y;
    A[i] = x;
    p++;
}

/* The alias replaced with an index into A: */
p = 3;
for (i = 0; i < 10; i++) {
    A[p] = y;
    A[i] = x;
    p++;
}

Figure 3.7: A simple example of a pointer aliasing an array.
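One standard way to make alias analysis easier in C99, independent of any particular parallelization tool, is the restrict qualifier: it promises the compiler that the pointers do not alias, so a loop like the one below can be analyzed (and vectorized or parallelized) without a conservative dependency being assumed. The function is an illustrative sketch:

    /* Without restrict the compiler must assume dst and src may overlap. */
    void scale(float * restrict dst, const float * restrict src,
               float factor, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = factor * src[i];
    }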

3.4 Parallelization methods

The following methods are optimization techniques that can make a program parallel. They have been categorized into three categories: traditional parallelization methods, polyhedral parallelization methods and thread level speculation. These methods can also be combined with the pattern recognition techniques presented in the previous section, such as reductions and privatization of variables.

3.4.1 Traditional parallelization methods

The traditional way of parallelizing programs is to gather the dependencies within the loop and identify in which iteration space a loop is parallelizable. This is done using dependence vectors. Given a loop nest as seen in Figure 3.8, the dependence vector is equal to [g1 − f1, g2 − f2]. The statements are loop independent if the distance vector contains only zeros; otherwise there is a loop carried dependency. Traditional methods typically do not change the iteration space like the polyhedral methods do; only loop interchange is performed, so that the loop with independent iterations becomes the outermost loop. The traditional methods are much simpler and faster than the polyhedral parallelization technique, but can only handle simple cases.

/* f1, f2, g1 and g2 are constant subscript offsets. */
for (i1 = 0; i1 < N; i1++) {
    for (i2 = 0; i2 < M; i2++) {
        A[i1 + g1][i2 + g2] = B[i1][i2];     /* write to A  */
        C[i1][i2] = A[i1 + f1][i2 + f2];     /* read from A */
    }
}

Figure 3.8: Example code to illustrate dependence vectors.

3.4.2 Polyhedral model

The polyhedral model [32] is a mathematical model that is used for loop nest optimizations. It can be used to optimize data locality through tiling based on cache sizes and levels; tiling is the process of grouping data accesses into chunks that fit in cache, to reduce cache misses. The polyhedral method can also optimize the tiles for parallelism. Figure 3.9 shows the iteration space of a loop nest with an outer loop iterating over j and an inner loop iterating over i. The polyhedral optimization method is able to transform this loop nest into a skewed loop, creating an outer loop that iterates over t and an inner loop iterating over P. The inner loop has now become parallelizable with two threads. In Figure 3.9, the circles represent a statement and the smaller arrows represent the dependencies of the statement. In the skewed loop there are two statements in the same iteration of t which are independent and can therefore be run in parallel. The long arrows show how the statements can be distributed over two threads.

Figure 3.9: A loop nest that has been transformed to be parallelizable.
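As a concrete sketch of loop skewing (this is a generic wavefront example, not the exact loop nest of Figure 3.9; A, N and M are placeholders): every element below depends on its upper and left neighbours, so neither original loop is parallel, but after skewing all points on one anti-diagonal t = i + j are independent.

    /* Original nest: no loop is parallel as written. */
    for (i = 1; i < N; i++)
        for (j = 1; j < M; j++)
            A[i][j] = A[i-1][j] + A[i][j-1];

    /* Skewed nest: the inner loop walks along one anti-diagonal, whose
       points only depend on the previous diagonal and can run in parallel. */
    for (t = 2; t <= N + M - 2; t++) {
        int lo = (t - (M - 1) > 1) ? t - (M - 1) : 1;
        int hi = (t - 1 < N - 1) ? t - 1 : N - 1;
        #pragma omp parallel for
        for (i = lo; i <= hi; i++) {
            int j = t - i;
            A[i][j] = A[i-1][j] + A[i][j-1];
        }
    }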

3.4.3 Speculative threading

Speculative threading is a wholly different approach to parallelization. Instead of analyzing dependencies statically or dynamically, speculative reads and writes can be used, which keep track of which memory blocks are being accessed. This way, all loops can be assumed to be parallel. If two iterations running in parallel at run-time violate a dependency through a read or write access, the loop is clearly not parallel and the affected iterations are restarted. González-Escribano et al. [33] have presented a proposal for support of speculative threading in OpenMP. They suggest that read and write operations are replaced with function calls that check whether a dependency violation has been made before reading or writing. From the programmer's point of view, a worksharing directive would just contain an additional clause defining whether the variables should be speculatively checked. This requires a speculative scheduler that is able to restart iterations that have failed due to dependency violations. Automatic speculative threading has been shown to work by Bhattacharyya et al. [34]. They use Polly [35] to implement heuristics for parallelizing regions that have been classified as maybe containing a dependency (regions that would otherwise not have been parallelized). They show that the heuristics manage to achieve speed-ups comparable to normally parallelized regions.

3.5 Auto-tuning

Auto-tuning is the process of automatically tuning a program for optimal performance. It is not always profitable to parallelize a loop, for example if the cache locality becomes poor. Different heuristics can be created and used to approach the typical problems of parallelization. An auto-tuning step can look at dynamic information, such as cache misses, and decide whether the loop benefits from executing in parallel. The auto-tuner could in this case try a loop interchange (the inner loop and outer loop switch places) and measure the execution time after the change. Without dynamic data, heuristics can base the decision on the number of instructions in a loop, or decide that parallelizing a particular loop is not profitable because its iteration count is too small.

There is work done on finding pipeline parallelism by Tournavitis et al. [6] which uses auto-tuning. This is highly relevant for embedded systems that process streaming data. By identifying pipeline stages, it is possible to measure how long each stage takes to execute. The stage with the highest load is then targeted for extraction of additional pipeline stages. The algorithm used targets the bottleneck stages and divides them into smaller pipeline stages, in order to get better load balancing of the system. However, this technique has not been researched to the same degree as parallelizing independent loops and has not yet been popularized in parallelization tools, so it will not be investigated further in this report.

Another interesting auto-tuning technique is implemented by Wang et al. [36]. Their implementation uses dynamic dependency analysis to determine the parallelism of the loop. It also gathers run-time information, together with heuristics they developed, to determine the profitability of parallelizing the loop. Their tests show that in many of the benchmarks their automatic parallelizer is able to achieve performance equal to hand-optimized code. Further evaluation of their implementation shows the strength of their tuning heuristics: the optimized programs never give worse performance than the original.

3.6 Conclusion

In this chapter a summary of different methods that are commonly used to parallelize and optimize a parallel program has been presented. Many other optimization methods exist, but they generally apply not only to parallel programs but to sequential programs as well, such as inlining functions at call sites or unrolling loops to reduce the number of jumps.

According to Bae et al. [37], the pass that by itself enabled the most speedup is the variable privatization pass, due to the parallelism it creates. Reduction recognition showed an impact as well in two programs. In the benchmark used, the induction variables were already substituted, so induction variable substitution did not show any impact. Bae et al. [37] acknowledge that earlier work, independent of their research, found that these three passes have significant impact, which is confirmed by their results. In a future benchmark, it will be interesting to see how efficient parallelizing tools are at recognizing these dependencies and how they deal with them. Many loops contain these types of dependencies, and handling them can be an essential part of parallelizing a program.

Additionally, static dependency analysis is preferable to dynamic dependency analysis, since both safe code and a tolerable compilation time are preferred. Speculative threading is similar to dynamic dependency analysis, but can be used for loops where static dependency analysis is unsure of the parallelism, and it does not affect the compilation time. This makes dynamic dependency analysis redundant for automatic parallelization. Dynamic information about the loop, such as execution time, is still something to consider when an auto-tuner deduces the profitability of a parallelization.

The parallelization methods sound effective and could potentially remove the burden on the programmer to parallelize complex loops. The polyhedral method together with speculative threading sounds like the most effective parallelization technique. Auto-tuning is another technique that sounds useful for getting the most performance out of the program. It does, however, appear to be a complicated subject, and finding the best heuristics based on the target platform and other parameters such as code size requires a lot of work, perhaps even a step that uses machine learning methods.

Chapter 4

Automatic parallelization tools

There are many tools that can help a developer parallelize their programs. In this chapter a subset of them is presented, followed by a comparison of their functionality. They were selected based on popularity: they are commonly referenced in the studied material. The tools found have been categorized into Parallelizers, Translators and Assistance tools. Parallelizers are tools that analyze the code, find parallel regions and parallelize them. Translators are capable of taking an already parallelized application and generating code for a different architecture. Assistance tools are applications that can help a programmer parallelize an application by hand, or gain more knowledge about their program.

4.1 Parallelizers

The following parallelization tools transform sequential code to run in parallel. They all have in common that they operate on loops. What differentiates them are the methods they use and where the parallelization is visible. The methods they use do not differ much; two categories can be seen: traditional parallelization and polyhedral parallelization. Some tools transform source code into parallel source code, meaning that the source code is visibly modified. Other tools transform the code but do not write parallel source code; instead the parallelization is only visible in the executable.

4.1.1 PoCC and Pluto

The polyhedral compiler collection (PoCC [38]) is a chain of tools that is able to perform polyhedral optimizations on annotated loops. Each tool has its own responsibility in the chain, such as parsing, dependency analysis, optimization and code generation. The actual parallelizer in the tool chain is the tool called Pluto [39]. The code generation is capable of outputting OpenMP annotations and vector instructions. It can also generate a hybrid parallel solution using OpenMP and MPI. To use this tool it is required that the loops in the source code are annotated with a pragma that isolates the loop. This loop is the only thing that will get parsed and converted to the polyhedral representation, provided that the loop fulfills all the properties required to be considered a static control part. The output will then be either a sequential but optimized loop, or a parallel optimized loop if it was found to be parallel.
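As an illustration of the required annotation, a loop handed to Pluto is typically enclosed in scop pragmas like in the sketch below; the computation itself is an arbitrary example of mine, not code from the thesis applications.

    #define N 512

    void scale_and_add(double alpha, double beta, double A[N][N], double C[N][N])
    {
        int i, j;
        /* Only the region between the pragmas is parsed into the polyhedral
           representation, provided it forms a static control part. */
    #pragma scop
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                C[i][j] = beta * C[i][j] + alpha * A[i][j];
    #pragma endscop
    }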

4.1.2 PIPS-Par4all

PIPS [40] is another polyhedral optimizer, but it also has other passes to optimize the code. It is not a compiler like PoCC, but a compiler workbench that can be used to build compilers on. It converts source code to optimized source code. It is also very extensible; you can for example replace the internal polyhedral optimizer with PoCC [41]. PIPS has several transformation and analysis steps, some examples being variable privatization and loop interchange. PIPS can also generate code for heterogeneous systems by generating CUDA or OpenCL. Before generating the code, a calculation of data transfer vs. computational intensity is made to determine whether it is worth offloading the code. In PIPS this is done by estimating the execution time and by convex array region analysis [40]. Par4all [42] is a compiler that is built using PIPS. Par4all defines which passes will be used and automatically performs them on the given code. The user can also specify with flags whether the parallelized loops should be annotated with OpenMP, or whether CUDA or OpenCL code should be generated for the loops.

4.1.3 LLVM-Polly

Polly [35] is a plugin to the LLVM environment that can do polyhedral optimizations. It is heavily influenced by the PoCC tool and can, like PIPS, export the polyhedral representation to PoCC and then import the optimized version [43]. This was how Polly was designed at first, but it has since added its own polyhedral optimizer based on the same algorithms used in Pluto. What differentiates Polly from PoCC and PIPS is that it does not perform polyhedral optimization on source code. Instead Polly optimizes the LLVM intermediate representation. Performing optimizations on this level means that any language that can be parsed to LLVM IR is able to make use of the optimization, because all code is decomposed into basic blocks and branches when parsed into LLVM IR. Before Polly makes any optimizations it makes use of preparing transforms that are developed for LLVM. Basic alias analysis, memory-to-register promotion and loop simplification are among the transformations that are used [43]. This exposes loop patterns which Polly can operate on. Polly is able to detect trivially vectorizable loops and insert SIMD instructions. It is also able to hoist loop-invariant variables out of the loop [43]. Currently it does not handle reductions or variable privatization.

4.1.4 LLVM-Aesop

Aesop [44] is an automatic parallelizer targeting shared memory machines, implemented in LLVM. It targets parallelism in dense array based code with affine analysis, using traditional methods with dependence vectors. First a couple of transformations are made to simplify the serial code, just like in Polly. After this the code is analyzed for loop carried dependences using alias analysis and a dependence vector module that uses two dependency tests (Banerjee and delta). This generates dependence vectors that are then used for parallelization. Given a loop iterating over i, j and k and a vector v = (0, 0, 1), the decision algorithm used by Aesop can see that the iterations along i and j are independent and decide that the outermost loop can be parallelized. Aesop is also able to resolve some dependencies such as reductions and privatization of variables.
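As a concrete example of such a dependence vector (my own illustration, not taken from [44]), the loop nest below carries a dependency only along k, giving the vector (0, 0, 1), so the i and j loops can run in parallel:

    #define NI 64
    #define NJ 64
    #define NK 64

    void prefix_sweep(double a[NI][NJ][NK])
    {
        /* a[i][j][k] reads a[i][j][k-1], so iterations only depend on earlier
           iterations of k: the dependence vector is (0, 0, 1) and the i and j
           loops are independent and parallelizable. */
        for (int i = 0; i < NI; i++)
            for (int j = 0; j < NJ; j++)
                for (int k = 1; k < NK; k++)
                    a[i][j][k] += a[i][j][k - 1];
    }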

4.1.5 GCC-Graphite

GCC has an automatic parallelization step called Graphite [45] that performs polyhedral optimization on the IR of GCC (Gimple). The current maintainer is also a maintainer of the Polly plugin, and it is on Polly that most effort is currently put. Graphite can be seen as a light version of Polly that works for the GCC compiler, in contrast to the Clang compiler in the LLVM environment. Graphite is capable of inserting OpenMP function calls and vectorization instructions just like Polly.

4.1.6 Cetus

Cetus [46] is a research source to source compiler that uses traditional methods. It performs several other optimizations such as induction variable substitution, reduction recognition and variable privatization. To parallelize loops it looks for dependencies; for scalar variables it uses information from the reduction and privatization transforms, and for array variables the data dependency graph is investigated [37]. Cetus has two profitability tests. Its model-based profitability test estimates a loop's workload to determine if the loop should be parallelized. If this cannot be determined at compile time, the decision is deferred to run time using an OpenMP if clause. The other profitability test uses profile information to compare the sequential version with the parallel version. The compiler also has a tuning method called window-based empirical tuning, which searches all optimization techniques of the compiler to find a combination that performs best at run time. This method was also used to show which optimization technique had the most impact on performance; this has been summarized in section 3.6.
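A minimal sketch of such a run-time test is shown below; the iteration-count threshold is arbitrary and is not Cetus' actual heuristic.

    void scale(double *a, int n, double c)
    {
        /* The loop only runs in parallel when the iteration count is large
           enough for the threading overhead to pay off; otherwise the OpenMP
           if clause makes it execute sequentially. */
        #pragma omp parallel for if(n > 10000)
        for (int i = 0; i < n; i++)
            a[i] *= c;
    }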

4.1.7 Parallware

Parallware [47] is a proprietary source to source compiler that uses the domain independent kernel method presented by Andión et al. [48]. It is able to insert OpenMP or OpenACC annotations on the parallelized loops that it detects. It was shown that the polyhedral optimization tools Pluto and Graphite were ineffective at parallelizing a number of applications that the domain independent kernel method was able to parallelize [48]. A possible reason for the poor efficiency could be the syntactic sensitivity of Pluto, which the authors acknowledge in the report.

4.1.8 CAPS

The CAPS compiler [26] is a proprietary source to source compiler that can convert sequential C code into accelerator code using OpenHMPP and OpenACC directives. It allows building portable applications for many-core platforms, such as Nvidia GPUs, AMD GPUs and Intel MIC. The CAPS compiler tries to partition the code into standalone pieces that can run in parallel, called codelets. A codelet is a function that does not contain any global variables, except if these have been declared by the HMPP directive "resident". It does not contain any function calls with an invisible body (that cannot be inlined); this includes the use of libraries and system functions such as malloc and printf. Every function call must refer to a static pure function (no function pointers) [49]. The function should not return anything; this way the codelet can be executed remotely and asynchronously. By using codelets, it becomes easier to classify regions as parallel. The CAPS compiler generates CUDA or OpenCL code depending on the target specified for a code region. The compiler also provides auto-tuning techniques that optimize the trade-off between performance and portability.

4.2 Translators

This category contains tools that do not parallelize code themselves, but help convert source code from one parallel paradigm to another. The parallel paradigms were touched upon in section 2.2. The following tools have in common that they translate C code that uses OpenMP worksharing construct pragmas for SMP platforms.

4.2.1 OpenMP2HMPP

OpenMP2HMPP [29] is a source to source translator that takes OpenMP code that is annotated with the added directive clause target CUDA and creates the corresponding OpenHMPP version of the code. This code can then be compiled with a HMPP compatible compiler such as CAPS, to get an automatically generated CUDA version of the code.

4.2.2 Step

STEP [50] is a source to source translator built with the PIPS workbench. STEP is able to convert code with OpenMP worksharing constructs into MPI processes. This enables the program to run on distributed processors with distributed memory. It also supports combining OpenMP with MPI (hybrid programming), where the outer worksharing constructs are turned into MPI processes while the inner worksharing constructs are kept intact. STEP works in the following way, as described by Millot et al. [51]. First a parallel loop with an OpenMP worksharing construct is found, and the contents of this loop are extracted to a separate procedure. This procedure takes as input the start and end indices, and the arrays and variables that are used in the loop region. The second task is to determine what data needs to be updated after the parallel section has been executed for each process in the distributed system. This means that the updated data has to be passed around between the processes. After this, the code is generated. When starting the resulting application, the user specifies the number of processes to execute with, so that it can be determined which iterations each process will handle.
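The sketch below shows the general idea of such a translation on a trivial loop. It is a simplified illustration of mine, not the code STEP generates; the real tool also handles uneven chunk sizes and performs the data-update analysis described above. MPI_Init and MPI_Finalize are assumed to be called elsewhere.

    #include <mpi.h>

    #define N 1024

    /* Shared-memory original:
           #pragma omp parallel for
           for (int i = 0; i < N; i++) a[i] = f(i);
       Distributed version: every rank computes its own chunk of the iteration
       space and the chunks are then exchanged so that all ranks see the whole
       array (N is assumed divisible by the number of processes). */
    void compute_distributed(double a[N], double (*f)(int))
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;
        int start = rank * chunk;

        for (int i = start; i < start + chunk; i++)
            a[i] = f(i);

        /* Corresponds to STEP's update step: make the data written by each
           process visible to all the others. */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      a, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }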

4.3 Assistance

Assistance tools do not parallelize code at all, but analyze code and give the programmer additional knowledge about their code. Typically a debugger or a profiler is a valuable tool for gaining knowledge about how your code executes.

4.3.1 Pareon

Pareon [52] is a semi-automatic tool that can analyze a program for parallelism. The workflow of Pareon is to have the developer implement the parallel code; the assumption is that the developer wants to keep control over their code. Pareon compiles a program into an executable for a virtual target and records the memory accesses performed by the executable. If the host computer is single threaded, Pareon will still be able to find the parallelism in the program since the host is not the assumed target. After the profiling of the executable has been made, the result can be displayed in the Pareon graphical interface. This interface gives the user an overview of how the program executes in its current version. The user can then explore the program and tell Pareon to parallelize loops. Pareon will then calculate the parallel schedule for the loops and display the schedule and the potential speedup of the program. After a parallelization is made, Pareon will suggest how the user can implement the parallel modifications herself. If there are dependencies between iterations, Pareon will not be able to parallelize the loop, but will provide suggestions on how to remove the dependencies.

4.4 Comparison of tools and reflection

In this chapter short introductions to a number of tools have been presented. Many of these tools cover the same domain. This section will compare them, since there is overlap.

4.4.1 Polyhedral optimizers and performance

The polyhedral optimization tools PoCC, Par4all, Polly and Graphite overlap in their functionality, and it is unclear which of them performs best. Previous evaluations have been made using Polybench 2.0 by Amini et al. [53] and Grosser et al. [43]. These tests have been performed under different conditions and cannot be compared directly. The tests were performed on an older version of Polly, when it still used PoCC for its parallelization optimization; currently it implements its own optimization algorithm based on the algorithm used in Pluto. It is shown that the speedup is significant on these benchmarks, although Polly struggles with two of the benchmarks, which resulted in performance degradation. This confirms that the tools cannot be trusted blindly. The performance of the parallel code was, however, better than the sequential version on average.

The other tools that have been tested with Polybench 2.0 are the Par4all tool and the CAPS compiler. In these tests, it is shown that the accelerator code yields significant speedups compared to the OpenMP implementation. Not all tests were performed in the same way as the tests on Polly. Both the CAPS version and the Par4all CUDA version struggle on some of the benchmarks, because copying the memory to the GPU takes more time than executing the program sequentially. Overall the tools perform well, and it is worth noting that Polly makes use of PoCC, so indirectly it is a benchmark of PoCC as well.

Since these tests were carried out, the tools have evolved, so it is not clear whether the efficiency is the same today. It is not clear how these tools would perform on a real application either, thus a new test has to be carried out. As shown by Andión et al. [48], the polyhedral optimizers Pluto and Graphite struggled with finding parallelism in several tests, such as reduction recognition, which is a common construct in real applications.

4.4.2 Auto-tuning incorporation and performance

Cetus is an interesting alternative since it does not use the polyhedral model for its optimizations, and it has tuning capabilities such as a profitability test to determine whether the parallelized loop is beneficial over the sequential loop. This means that the program will never perform worse than the sequential version of the program. The most interesting feature of Cetus is the tuning approach that is able to explore an optimization space using a method called window-based empirical tuning, where the tuning algorithm selects a combination of optimization techniques and deduces which one performs best, but only considers a small window of the program since searching the whole program is exhaustive. It is not clear how this implementation performs against a polyhedral optimizer, so further tests have to be carried out. Cetus has been evaluated by Bae et al. [37] with the NAS parallel benchmark, showing significant performance on a number of the benchmarks, comparable to hand-optimized code.

Par4All also incorporates a profitability test, available in PIPS, before it generates code for accelerators. It uses heuristics to check whether the time for data transfers exceeds the processing time. As shown by Amini et al. [53], both CAPS and Par4all are efficient on a number of the Polybench 2.0 tests. Both CAPS and Par4all are said to have a profitability check [40] [26], but looking at these tests it is clear that it either does not work in all cases or has not been used. Overall Par4all performs better than CAPS in this test, but it is stated by Amini et al. [53] that only simple annotations have been inserted by hand. CAPS is capable of finding regions that can be parallelized using different wizards, such as CodeletFinder and the HMPP wizard [26]; it is not clear to what degree these have been used in the test.

Table 4.1: Functional differences in the tools.

Tool      Reductions   Variable privatization   Interprocedural analysis   Multiple files   Parallelization method
Aesop     Yes          Yes                      Inlined functions          Yes (1)          Traditional
Cetus     Yes          Yes                      Inlined functions          No               Traditional
Par4All   Yes          No                       Yes                        Yes              Polyhedral
Pluto     No           No                       No                         No               Polyhedral
Polly     No           No                       Inlined functions          Yes (1)          Polyhedral

(1) If optimization is performed after linking multiple files.

4.4.3 Functional differences

Table 4.1 shows some functional differences that are essential for the effectiveness of the parallelization. Interprocedural analysis is one such feature. It makes it possible to parallelize loops that contain function calls, given that the called function does not affect global state in any way. For these tools, however, the interprocedural analysis is limited in different ways. Aesop and Polly, for instance, do not perform interprocedural analysis directly; instead, smaller functions are inlined during the preparation phase. Another feature is whether the tool can resolve simple dependencies in order to make a loop parallel, such as reductions or variable privatization.

In section 2.2, an introduction was given on how to code for different types of systems. Table 4.2 shows the different input and output languages for a number of tools that help the developer deploy their code to different architectures. Pluto and PoCC need someone or something to find static control parts (SCoPs) for them before they can perform their optimizations, and they are therefore much more limited for real usage. A SCoP is a region which is static, meaning that the control flow can be predicted at compile time. This means that the control statements, i.e. if statements, contain conditions which do not depend on values calculated during runtime. The Step tool cannot parallelize by itself, but given parallelized code from any of the other tools it is capable of generating MPI code. Par4all can generate code both for SMP systems and for accelerators.

Table 4.2: A rough overview of what the investigated tools take as input and what they can output.

Tool      Input languages               Output languages
Aesop     LLVM IR (1)                   LLVM IR instrumented with OSAL thread calls
Cetus     C                             C annotated with OpenMP
Par4All   C, Fortran                    C, Fortran annotated with OpenMP, C with OpenCL or CUDA
Pluto     C annotated with SCoP         C annotated with OpenMP and/or MPI (2)
PoCC      C annotated with SCoP         C annotated with OpenMP
Polly     LLVM IR (1)                   LLVM IR instrumented with OpenMP (libgomp) calls
Step      C annotated with OpenMP       C with MPI and/or OpenMP annotations

(1) Takes any language that can compile to LLVM IR; examples are Assembly, C, C++, Fortran and Ada.
(2) Pluto has an experimental version that can create distributed programs.

4.5 Conclusion

In this chapter, a comparison of a number of tools has been presented to get a better overview of how these tools can be used to help the programmer parallelize her code. The focus of this report is to make use of parallelization tools, and therefore the tools that are interesting to investigate and evaluate further are: Aesop, Cetus, Par4All, Pluto and Polly. Cetus, Par4All and Pluto are tools that are frequently referenced in previous work on automatic parallelization, while Aesop and Polly are new in the area. They are all freely available (open source) and incorporate the state-of-the-art methods presented in the previous chapters. No tool incorporates speculative threading, and the level of auto-tuning is minimal, which is unfortunate.

The portability of the code they generate is limited to platforms that support OpenCL, CUDA or OpenMP. A typical embedded platform can consist of multiple heterogeneous cores, and ideally all processing units should be kept busy. Currently, SMP platforms and one core that offloads to accelerators using OpenCL or CUDA are the two types of architectures that can be handled by the parallelization tools. Additionally, clusters of SMP platforms can be handled by generating MPI code. However, since all of the tools are able to handle SMP platforms, this is what will be the main focus of the report. This study will not look any further into translator tools, since they assume already parallelized code. Pareon will be looked into further, since it can be a reliable alternative to automatic parallelization.

Chapter 5

Programming guidelines for automatic parallelizers

Automatic parallelizers are currently very sensitive to how the code is structured. The reason for this is the complexity of analyzing code, and the tools need all the help they can get. To make safe and reliable parallelizations, the problem has to be reduced to a specific type of problem to avoid complexity. This chapter explains the key steps that need to be taken to make efficient use of today's parallelizers. However, how to write the code depends on which parallelization tool is being used; therefore this guide tries to collect the superset of the constraints on the source code.

5.1 How to structure loop headers and bounds

Parallelization tools in general require that the loop bounds are defined with a deterministic loop exit. This means that the loop will always perform a fixed number of iterations that depends on the upper bound. This upper bound has to be loop invariant, meaning that it is the same for each iteration. The stepping value has to be linear (i.e. addition or subtraction), since this makes it easier to calculate the induction variable. The induction variable i will then be calculable at compile time and the iteration count can be derived. The body should not include statements that can make the loop exit prematurely, such as a break statement, unless it is the only loop exit. Figure 5.1 displays a couple of loop examples with predictable iteration counts. The first one is a so-called canonical loop, to which many parallelizers transform all loops in order to generalize the analysis and parallelization. Other types of loop structures, such as while loops, are also handled by some parallelizers as long as the iteration count depends on a predictable induction variable. The induction variable should not be updated except in the loop header, or once in the body in the case of while loops. Functions within the loop header are forbidden, except for a few cases. The function has to be pure, and the tool has to have interprocedural analysis to check for this accurately. The function also needs to be loop invariant. None of the investigated tools can check all of these properties for a function in the loop header, so such functions have to be avoided. A few functions are allowed, namely max(), min() and abs(). These functions should, however, not take the induction variable as a parameter; only parameters that are loop invariant are allowed.

for(i = 0; i < N; ++i){
    //body
}
for(i = -12;; i+=5){
    if(i>12)
        break; //the only loop exit
}
i = 0;
while(i < N){
    //body
    ++i;
}

Figure 5.1: Allowed loop bounds.

for(i = -12; i != 412; i+=5){
    if(i == -2)
        i = f(n);
    //body
}
for(i = 0; i < N; i*=5){
    //body
}

Figure 5.2: Disallowed loop bounds.

5.2 Static control parts

Static control parts (SCoPs) are the type of loop structure required for polyhedral analysis. A SCoP is a region with one loop entry and one loop exit where the iteration space is predictable at compile time. Inside the static control part, conditional statements and control flow have to be predictable. Figure 5.3 shows an example of this: loops may only contain static control flow, and this example has a conditional that has to be evaluated at runtime. Hiding unpredictable branches within function calls is a possibility (see Figure 5.4). The polyhedral automatic parallelizer is, however, not able to analyze whether functions are side-effect free, so the programmer has to ensure that a function called within a SCoP is in fact free from side effects; otherwise the programmer will end up with a program with unpredictable behavior. With the unpredictable branch hidden, the for loop can still be parallelized, but the analysis methods are limited.

//Potential SCoP
for(i = 0; i < num_features; i += 1) {
    //omitted body

    dot_w_err1 = 0.0;
    dot_w_err2 = 0.0;
    for(j = 0; j < num_images; j += 1) {
        dot_w_err1 += weights[j]*err1[j];
        dot_w_err2 += weights[j]*err2[j];
    }

    //unpredictable control flow
    if (dot_w_err1 < dot_w_err2) {
        //assignments to ps[i] and error[i] omitted
    } else {
        //...
    }
}

Figure 5.3: A loop that does not qualify as a static control part because of the unpredictable branch.

//Potential SCoP
for(i = 0; i < num_features; i += 1) {
    //omitted body

    dot_w_err1 = 0.0;
    dot_w_err2 = 0.0;
    for(j = 0; j < num_images; j += 1) {
        dot_w_err1 += weights[j]*err1[j];
        dot_w_err2 += weights[j]*err2[j];
    }
    struct MyFuncResult res = myFunc(dot_w_err1,dot_w_err2);
    ps[i] = res.ps;
    error[i] = res.error;
}

Figure 5.4: A loop that qualifies as a static control part.

5.3 Loop bodies

Within loop bodies it is important not to introduce loop carried dependencies. These are created by induction variables, reductions, global variables and array accesses. Parallelizing tools are capable of handling certain types of loop carried dependencies, but avoiding them as much as possible increases the chance of parallelizing the loop. The parallelization tools investigated cannot create critical regions around code that reads and writes a shared variable. Figure 5.5 shows an example of a shared array and scalar (count) that block a parallelizable loop. Writing the code differently makes it possible to parallelize it. The trick is to keep in mind that loop iterations are predictable: for example, if a maximum number of iterations can be calculated, then calculate it, as in Figure 5.6. In this example the critical region has been moved out of the loop.

rectangles = malloc(capacity*sizeof(Rectangle));
count = 0; // <- unpredictable counter
//Not parallel
for(j=0;j<=height-L;j++){
    for(i=0;i<=width-L;i++){
        double score = ApplyDetector(classifier,&sub_integral_image);

        if( score > threshold ){
            //Start critical region
            rectangles[count] = (Rectangle){i*step,j*step,L,L};
            count++;
            if(count >= capacity){
                capacity += 1000;
                rectangles = realloc(rectangles, capacity*sizeof(Rectangle));
            }
            //End critical region
        }
    }
}

Figure 5.5: Critical region within the loop.

rectangles = malloc(element_capacity*sizeof(Rectangle));
score = malloc(element_capacity*sizeof(double));
//Parallel
for(j=0;j<=height-L;j+=step){
    for(i=0;i<=width-L;i+=step){
        score[j][i] = ApplyDetector(classifier,&sub_integral_image);
        rectangles[j][i] = (Rectangle){i*step,j*step,L,L};
    }
}
detected_rectangles = malloc(capacity*sizeof(Rectangle));
count = 0;
//Not parallel
for(i=0;i<element_capacity;i++){
    if(score[i] > threshold){
        detected_rectangles[count] = rectangles[i];
        count++;
        if(count >= capacity){
            capacity += 1000;
            detected_rectangles = realloc(detected_rectangles, capacity*sizeof(Rectangle));
        }
    }
}

Figure 5.6: Critical region fissioned out of the loop.

5.4 Array accesses and allocation

Array accesses are the biggest hurdle when analyzing programs for parallelism, so some assumptions are made about them to simplify the analysis. A two dimensional array is easier to analyze than a linearized array where the two dimensions are accessed like a vector, using a stride equal to the width of the matrix (A[i*width+j] vs. A[i][j]). A parallelizer may have the functionality to delinearize the one dimensional array into a multi dimensional array, but this should not be taken for granted. The array subscripts should preferably only contain linear expressions using loop induction variables or loop invariant variables.

Allocation of an array can be made either on the stack or on the heap. Automatic parallelizers are not able to privatize heap allocated objects, since OpenMP clauses are only able to specify stack allocated data as private. Setting a heap allocated object as private would mean that the pointer gets privatized and not the data. Ideally the parallelizers would move heap allocation calls (malloc) into the scope in which they can be privatized, since each thread then has a private pointer. Currently this has to be performed by the programmer (see Figure 5.7).

(1) Before

int * a = malloc(50*sizeof(int));
//Not parallel since a will
//not be declared private
for(int i = 0; i < N; i++){
    for(int j = 0; j < 50; j++){
        a[j] = MyFunc(i,j);
    }
    //...
}
free(a);

(2) After

//Parallel since a can
//be declared private
for(int i = 0; i < N; i++){
    int * a = malloc(50*sizeof(int));
    for(int j = 0; j < 50; j++){
        a[j] = MyFunc(i,j);
    }
    //...
    free(a);
}

Figure 5.7: Move private dynamic allocation inside the loop scope.

5.5 Variable scope

Where a variable is defined matters for parallelization tools. Even though a privatization pass exists in some parallelization tools, it is not guaranteed that it will be able to classify the variable as private. By defining variables as deep in the loop nest as possible, the variables can be removed from the list of variables that need to be analyzed for variable privatization, and there is less chance that the parallelizer will by mistake classify a private variable as shared, as in Figure 5.8. To resolve this, refer to Figure 5.9.

double A[N];
double x[N];
for(i = 0; i < N; ++i){
    for(j = 0; j < N; ++j){
        A[j] = 4;
    }
    for(j = 0; j < N; ++j){
        x[i] = A[j];
    }
}

Figure 5.8: A is classified as shared, even though it is private in theory.

double x[N];
for(i = 0; i < N; ++i){
    double A[N];
    for(j = 0; j < N; ++j){
        A[j] = 4;
    }
    for(j = 0; j < N; ++j){
        x[i] = A[j];
    }
}

Figure 5.9: A is in a scope where it cannot be shared between the iterations over i, and is thus private.

5.6 Function calls and stubbing

Many of the parallelization tools do not have any form of interprocedural analysis. This means that the programmer has to write code in place of the function call. This is obviously not the optimal way of writing C code, since readability decreases and code cannot be reused without copying and pasting. Some tools are able to inline functions, but are usually configured to calculate a cost and inline a function depending on that cost. How this cost is calculated can vary, but code size is usually a factor. Inlining a function removes a call and a return instruction, but the bigger code size can increase cache misses. An additional benefit of inlining a function call is that more optimization opportunities can emerge when the boundary between the caller and the callee is removed.

Par4all is able to reliably analyze functions using interprocedural analysis, but it has limitations; the following rules apply only to that tool. Inaccessible function definitions cannot be reasoned about by automatic parallelizers. It is therefore necessary to either hide the function call by commenting it out, or to write stub code for the function so that the parallelizer has a definition it can reason about. The most common C standard library functions are already dealt with by Par4all and do not need to be stubbed. Function calls in general need to be side-effect free; this means that they are not allowed to write to global variables or to memory locations provided as arguments. As is the case for Par4all, recursive functions are difficult for the tool to analyze, and parallelization of these should not be expected. In theory an optimization pass may be able to rewrite a recursive function into an iterative loop, which is simpler to parallelize; an example is the tail recursion elimination pass that is a common transform used in LLVM.

Automatic parallelizers cannot reason about already parallelized applications. Just like inaccessible function calls, the parallel constructs must be hidden from the automatic parallelizer, either by stubbing the parallel regions with a sequential region or by commenting out parallel OpenMP statements or thread library functions.
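As a hypothetical example of such a stub (log_progress stands in for any library function whose body the tool cannot see), the real definition is temporarily replaced by a side-effect-free one so that the loop calling it can still be analyzed:

    /* Stub for an externally defined logging function. It has no side effects,
       so a parallelizer can safely analyze loops that call it; the real
       implementation is linked back in after the parallelization step. */
    void log_progress(int iteration)
    {
        (void)iteration;   /* intentionally does nothing */
    }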

5.7 Function pointers

Programmers that have been in contact with functional languages may feel the urge to use function pointers extensively when programming in C. This was actually the case for the face recognition application used in this thesis, as shown in Figure 5.10. Function pointers are, however, not supported by any of the automatic parallelization tools, so they need to be avoided. As a side note: in the particular case in Figure 5.10, zipWith is only called with three different functions (Add(), Sub() and Mul()). Ideally, a compiler could generate specialized functions by identifying which functions are sent as parameters, in order to remove the function pointer.

void zipWith( double f(double,double), //function pointer
              double* t1,              //array term1
              double* t2,              //array term2
              double* r,               //store result here
              uint32_t size ){         //Size of t1, t2 and r
    int i;
    for(i=0;i<size;i++){
        r[i] = f(t1[i],t2[i]);
    }
}

Figure 5.10: Function pointers should be avoided.
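For illustration, a specialized version generated for the Add case could look like the sketch below; zipWithAdd is a name invented for the example and not something any of the tools produce today.

    #include <stdint.h>

    static double Add(double a, double b) { return a + b; }

    /* Specialized zipWith for the Add operation: the indirect call through the
       function pointer is gone, so the loop body is a plain element-wise
       addition that an automatic parallelizer can analyze and inline. */
    void zipWithAdd(double *t1, double *t2, double *r, uint32_t size)
    {
        for (uint32_t i = 0; i < size; i++)
            r[i] = Add(t1[i], t2[i]);
    }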

5.8 Alias analysis problems: Pointer arithmetic and type casts

Pointer arithmetic creates complexity for the alias analysis component in the parallelizers. Avoiding it increases the chance that the parallelizers perform correctly, and oftentimes the code becomes more readable after the change. Type casts add another kind of complexity for parallelizers, since they reinterpret one memory layout as another. Figure 5.11 shows the two different cases. The last example shows a commonly occurring statement where c functions as a shorthand for faster typing and, in some cases, readability; this statement complicates alias analysis as well.

int x;
int y;
double a[N-1];

//This type cast captures both x and y
//When incremented it also points at a[]
double *z = (double*) &x;

for(i=0;i<N-1;i++){
    double *c = &a[i];  //shorthand pointer into a[]
    //...
}

Figure 5.11: Two examples of how to complicate alias analysis.

5.9 Reductions

Reductions come in many forms, but only a subset of them are handled by automatic parallelizers. For instance, the OpenMP reduction clause only handles primitive operations that are associative and commutative. The reason for this is that with these properties, the order in which the reduction is executed does not matter. Threads may then perform their own reduction on a subset of the data, and the partial results may afterwards be further reduced into a common reduction variable. Examples of operations that are associative and commutative are addition, multiplication and the boolean operators. The division operator, for example, is not commutative and is therefore not handled.

The polyhedral automatic parallelizers currently do not handle reductions within loops. To get around this problem the programmer can fission the reduction out of the loop and let it execute separately from the parallel loop, in its own sequential loop (see Figure 5.12).

//Original loop: independent work and a reduction in the same loop
for(i = 0; i < N; ++i){
    //independent work storing a result per iteration
    //reduction of the result into a shared scalar
}

//After fission: the independent work can run in parallel,
//and the reduction runs afterwards in its own sequential loop
for(i = 0; i < N; ++i){
    //independent work storing a result per iteration
}
for(i = 0; i < N; ++i){
    //reduction of the stored results into a shared scalar
}

Figure 5.12: Fission out the reduction.
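When the target is plain OpenMP rather than a polyhedral tool, the same kind of loop can instead keep the reduction and annotate it explicitly; the dot product below is a generic sketch of mine, not code from the thesis applications.

    double dot_product(const double *x, const double *y, int n)
    {
        double sum = 0.0;
        /* Each thread accumulates a private partial sum; the partial sums are
           combined into sum when the parallel loop ends. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }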

5.10 Conclusion

In this chapter, a description was given of how to write and structure legacy code so that it is accepted by automatic parallelizers. Several examples have been given that are connected to the legacy face recognition application used in this study. Function analysis is a feature that is necessary for reliable parallelization, but only one of the tools is capable of it. Additionally, variable privatization and reduction handling are not supported by all of the tools, which adds to the constraints on how the code must be structured. The occurrence of the necessary changes in legacy projects has not been quantified, so the benefit of automating any of these refactoring steps cannot be deduced. It is, however, clear that many of these coding patterns occur in general applications, and this was also the case for the applications the tools were evaluated on.

Chapter 6

Implementation

The study phase resulted in a knowledge base covering a large number of different parallelization methods. I also composed a set of guidelines that can be used to create applications that can make use of automatic parallelization tools. The objective of the implementation phase was to evaluate how well the tools work and whether it is beneficial to use them. The second objective was to identify the improvements that can be made to the automatic parallelization tools, using the findings from the study together with the results and knowledge gained from the implementation. In this chapter the implementation approach is presented, together with a set of requirements that an automatic parallelizer should fulfill in order to be beneficial for the programmer.

6.1 Implementation approach

Before the implementation phase, I ported a face recognition application from Matlab code to C code by hand. The resulting C code is in sequential form, and this application is what I consider the legacy application to parallelize. The application in this state was not suitable to run through any of the parallelization tools, which meant it was necessary to refactor it. The implementation consisted of doing these refactoring steps on the C code by hand, using the guidelines from the study phase, in order to get better results with the parallelizing tools. This process is similar to parallelizing the code by hand. Because of this, it is interesting to see whether refactoring the code by hand to make use of parallelizing tools is worthwhile compared to parallelizing the code by hand. Both methods were timed with a rough estimate.

The implementation phase also consisted of a comparison of the efficiency and performance of the parallelization tools. The efficiency is defined as how well the parallelization tool is able to parallelize the sequential program. The performance is defined as the speedup of the parallelized program compared to the sequential one. This resulted in an analysis of why some tools perform better than others, which can be used to identify the key features of a parallelization tool. The efficiency and performance were measured using the PolyBench benchmark, which consists of small highly parallel programs in sequential form. The efficiency of the tools was also measured using the legacy application. Lastly, each tool is checked against a set of requirements to determine whether the tool is beneficial for a programmer to use to parallelize her code. This list is also used to identify how the tools can be improved to be more beneficial.

The tools that have been selected for the evaluation are Aesop [44], Polly [35], Pluto [39], Par4all [42] and lastly Cetus [46], since these are free open source alternatives. They are also easy to use, which was the major argument for selecting these tools. Additionally, Pareon [52] has been used to analyze the legacy application for parallelism and helped out when parallelizing the application.

6.2 Requirements

The following requirements (listed in Table 6.1) have been identified as useful traits of an automatic parallelizer that ports legacy code to different architectures for embedded systems. It is interesting to know which tools fulfill the most of these requirements, in order to find where future effort on developing a parallelization tool for embedded systems should be put.

Table 6.1: The list of requirements for an automatic parallelization tool.

Tag        Requirement
SWREQ01    Handle serial C code
SWREQ02    Handle interprocedural C code
SWREQ03    Handle serial code for other languages
SWREQ04    Handle interprocedural code with multiple languages
SWREQ05    Handle already parallelized code
SWREQ06    Parallelization must not result in race conditions
SWREQ07    Parallelization must not result in deadlocks
SWREQ08    Reduction dependencies shall be handled
SWREQ09    Privatizable variables shall be privatized and not introduce a dependency
SWREQ10    Code shall be maintainable with one version
SWREQ11    Code generation shall be configurable for different target architectures
SWREQ12    Code generation shall be able to target symmetric multi-processors
SWREQ13    Code generation shall be able to target distributed systems
SWREQ14    Code generation shall be able to target hybrid distributed systems
SWREQ15    Code generation shall be able to target heterogeneous systems
SWREQ16    Code generation shall be able to make use of vector instructions
SWREQ17    Code generated for accelerators shall be asynchronous
SWREQ18    Parallel code shall have a load balanced schedule
SWREQ19    Parallelized programs shall give a performance increase
SWREQ20    Less time consuming than to parallelize by hand
SWREQ21    Find parallelism automatically

SWREQ01 - SWREQ04 An automatic parallelization tool has to handle at least one language; C is a popular language for embedded systems and is a reasonable first target. There are, however, several C standards to select from, and ideally all of them shall be supported. Additionally, interprocedural C code is a necessity for efficient dependency analysis, since it is likely that legacy code contains function calls within parallel regions. Embedded software is written in several other languages, such as Assembly and C++ to name two, so it is reasonable that a parallelization tool is able to handle these languages as well. Ideally, programs that are composed of several modules in different languages shall be supported.

SWREQ05 - SWREQ07 Legacy code may contain functions that already make use of parallelism. An automatic parallelizer shall therefore be able to handle such code in different ways. The parallelizer could ignore the region and trust that it is fine, or it could replace it with its own parallelization. The biggest problem here is to analyze the existing parallelism. Additionally, parallelization shall not introduce faults in the program such as race conditions. If the variables in the region are private, then they shall be privatized. If the region has variables that are shared, a critical region shall be created when reasonable, so that the loop can be parallelized. If critical regions are inserted, they shall not create deadlocks or other types of starvation, as this results in a non-functional program.

SWREQ08, SWREQ09 Many loop carried dependencies in applications are reductions, or the result of reusing a variable scoped outside the loop that could be scoped inside it. An automatic parallelizer shall be able to analyze the dependencies, identify these two cases and handle them accordingly, for a higher chance of successful parallelization. Only loops that have had all of their dependencies removed shall be parallelized.

SWREQ10 - SWREQ17 For simplicity, a single version of the legacy application shall be portable to several targets; maintaining several ported versions requires extensive work whenever a change is added to the main version. Multicore embedded systems come in several different forms, thus different code generation targets are required. Multicore processors with shared memory are an initial target requirement, as this case is heavily standardized by the OpenMP committee. Distributed systems and hybrid distributed systems need means to communicate with threads that are spawned on remote processors. An automatic parallelizer can deal with this use case by inserting message passing instructions; there are several message passing implementations that suit embedded systems and that can be used instead of the bigger MPI libraries. Heterogeneous systems are a common design for embedded systems; specialized cores can accelerate the program and are a very attractive addition to a system. These cores are not standardized, however, and generating specialized code is not always possible. Vendor specific compilers may be required as well. Code that will be run on an accelerator shall execute asynchronously so that the CPU is not idling. An automatic parallelizer shall also have support for vector instructions, as they have become a common element in modern processors.

SWREQ18 - SWREQ21 Load balance is an important aspect of a high performing program. By making use of the OpenMP standard dynamic schedules, this is taken care of for SMP systems. Load balancing becomes a much more complex problem when dealing with distributed memory, however, as there will be added communication overhead between processors when fetching more work or copying memory between processors. A parallelized program shall execute faster than the sequential version to be an acceptable port. A slower program is a useless port, and small gains can demotivate parallelizing the application as well; the cores are better off processing something else. The work required for porting the application using automatic parallelization tools shall be significantly less than parallelizing by hand. A human can generally analyze applications more efficiently than a tool, and having to rewrite the application to suit parallelization tools gives the programmer greater insight into the application, which means he could easily parallelize the application himself.
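As an example of the load-balanced schedule referred to in SWREQ18, a generic OpenMP sketch could look as follows; process_window is a hypothetical work function with uneven per-iteration cost.

    void process_window(int w);   /* hypothetical work function, cost varies per window */

    void process_all_windows(int num_windows)
    {
        /* A dynamic schedule hands out small batches of iterations to idle
           threads instead of using a fixed static split, which balances the
           load when iterations have very different execution times. */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int w = 0; w < num_windows; w++)
            process_window(w);
    }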

Chapter 7

The applications to parallelize

To evaluate the efficiency of the parallelization tools, it was decided to run two complex face recognition applications and a number of small applications that do heavy parallel processing of data. In this chapter the applications are presented briefly.

7.1 Face recognition applications

I have written two applications that will be considered legacy applications that we want to run on a parallel architecture. These applications have been written to run on a single core, since this was the previous target platform. At first glance it can appear complex to parallelize these applications by hand, so they are ideal for showing the benefit of using automatic parallelizers. There are several face recognition applications out there already making use of parallel hardware; in fact, the biggest computer vision library, OpenCV [54], is already optimized with vector instructions and coarse grained parallelism using OpenMP. Therefore it was decided to create the two applications from scratch, only making use of some essential OpenCV library calls for loading and resizing images. These particular applications are based on the Viola-Jones object detection framework [55] using an AdaBoost learning algorithm [56], which picks the best features out of a big set of features. Each feature on its own is for the most part very weak at classifying a face, but by combining several features a strong classifier can be created. This work is not about the actual face recognition algorithm used, but a short explanation of the two applications is presented in this section, as it is necessary for understanding the parallelism of the applications.

7.1.1 Training application

The training application is used to find features that can be used to distinguish a set of related pictures from unrelated pictures. In my implementation the trainer will find the features that distinguish human faces from other pictures. In Figure 7.1 a rough control flow of the application is shown.

Figure 7.1: Training application for face recognition.

The execution starts with loading a number of training images of size 19x19 pixels using OpenCV. To speed up processing of these images later, the integral image is calculated for each one. This is a parallel task.

The features that are going to be used are then enumerated. A feature checks for differences in intensity in a region of a picture; these detect, for example, edges and corners. The enumeration of features creates all sizes and positions which are possible within the 19x19 region. This part contains a counter that is non-trivial to remove, and is therefore not parallel.

After the features have been created, the learning algorithm starts. The algorithm is built using an outer loop and a parallel inner loop that iterates over features. It starts with setting up weights which describe how much impact a correct classification of an image will have. If an image is incorrectly classified, the weight will be updated so that in the next iteration the image will have a bigger impact on the selection of the feature. In the parallel inner loop, a feature is applied to each training image. For each image a value is calculated which determines how significant this feature was in that particular image. By adjusting a threshold that is compared with the feature significance, it is possible to minimize the number of misclassified images. Each inner loop iteration is independent of the others and is therefore parallelizable. After the inner loop is complete, the feature that had the least number of misclassified images is selected and saved. Using this selected feature, the weights associated with each image are updated. Updating the weights creates a loop carried dependency, and the outer loop is because of this not parallelizable. When an arbitrary number of features have been found, which is specified by the user, the execution is complete and the features, which together make up one strong classifier, can be saved to a file for later use by the detector application.

This application is highly parallel since it spends most of the execution time in the parallel inner loop. It is thus a very good test to see how well the tools work on a more complex application.
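In rough C-like form, the structure of the learning phase looks like the sketch below. The function and variable names are invented for the illustration and the real application differs in detail; the point is that only the inner loop over features is parallel.

    #include <float.h>

    double evaluate_feature(int f, const double *weights, int num_images);    /* hypothetical */
    void   update_weights(double *weights, int best_feature, int num_images); /* hypothetical */

    void train(double *weights, int num_images, int num_features, int rounds)
    {
        for (int round = 0; round < rounds; round++) {        /* sequential outer loop */
            double best_error   = DBL_MAX;
            int    best_feature = -1;

            /* The feature evaluations are independent of each other and can run
               in parallel; only the selection of the minimum must be protected. */
            #pragma omp parallel for
            for (int f = 0; f < num_features; f++) {
                double err = evaluate_feature(f, weights, num_images);
                #pragma omp critical
                if (err < best_error) { best_error = err; best_feature = f; }
            }

            /* The weight update uses the feature just selected, which is the
               loop carried dependency that keeps the outer loop sequential. */
            update_weights(weights, best_feature, num_images);
        }
    }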

7.1.2 Detector application

The detector application is used to classify whether a picture contains a face or not. The training images used are 19x19 pixels, while real pictures can be up to several millions of pixels. The trained classifiers are only able to determine whether a picture contains a face if they are applied on the same picture size they were trained for. Therefore, it is necessary to split a picture into subwindows, where each window has to be checked for a face. Additionally, a face can appear at a different scale in the picture; thus scaling the picture to different sizes before applying the classifiers becomes a requirement.

Figure 7.2: Detector application for face recognition.

In Figure 7.2 the flow of the detector application is shown. First it loads a classifier from a file. After this it starts to read input frames one by one. When a frame is fetched it is first converted to gray scale, and after this the program enters a function for detecting faces. This function consists of a nested loop of four levels. The outermost loop iterates over different scale factors that will be applied to the frame. The next two loops move a sub window of size 19x19 horizontally and vertically over the frame. For each sub window the strong classifier is applied, which is a list of features that are applied to the sub window. Each feature gives a score that is summed and compared against a threshold. If the score is greater than the threshold, the sub window is classified as containing a face. Each positively classified sub window results in a rectangle drawn on the output frame of the application. All the loops mentioned here are parallelizable. The number of sub windows is, however, (in general) larger than the number of scaling factors, thus the sub window loops are the ones that will benefit the execution time the most if parallelized, as the processor count grows.
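A rough sketch of that loop nest, with the two sub window loops merged into one parallel iteration space, is shown below; the names and the scaling rule are invented for the illustration and do not match the application exactly.

    int  apply_classifier(const unsigned char *frame, int x, int y, int step); /* hypothetical */
    void mark_face(int x, int y, int step);                                    /* hypothetical */

    void detect_faces(const unsigned char *frame, int width, int height, int num_scales)
    {
        for (int s = 0; s < num_scales; s++) {        /* few iterations, left sequential */
            int step  = s + 1;                        /* made-up scaling rule */
            int y_max = height - 19 * step;
            int x_max = width  - 19 * step;

            /* The sub window loops dominate the work; collapse(2) merges them
               into a single parallel iteration space so all cores stay busy. */
            #pragma omp parallel for collapse(2)
            for (int y = 0; y <= y_max; y += step)
                for (int x = 0; x <= x_max; x += step)
                    if (apply_classifier(frame, x, y, step))
                        mark_face(x, y, step);
        }
    }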

Like the training application, this application also has a highly parallel region, and it will be interesting to see how the parallelization tools perform on it.

7.2 PolyBench benchmark applications

Simple applications are used to benchmark the tools, to gain an insight into how the tools compare in terms of efficiency. PolyBench-3.2 [57] has been selected as the benchmark, since Pluto, Par4all, Polly and Aesop are able to parse and run it. The benchmark is written in C, whereas the NAS parallel benchmark [58], which was also considered, is mainly written in Fortran. Each application contains around 10 to 30 lines of code, with complex loop nests performing specialized mathematical calculations. PolyBench provides predefined array sizes that can be selected at compile time. In this benchmark the standard dataset and the large dataset have been selected, because smaller datasets are not going to show significant performance differences compared to the reference value. Furthermore, the smallest (mini) dataset is used to verify that the parallel versions produce the same output as the sequential version.

The PolyBench kernels contain several issues regarding the algorithms used in the kernels, as identified by Yuki [59]. This benchmark is only used as a verification that the automatic parallelizers are able to parallelize some applications and to get an overview of how the parallelizers compare.

The PolyBench kernels can be categorized into these categories: data mining problems, linear algebra, linear algebra solvers and stencil problems. They have all been designed to be parallelizable. The linear algebra kernels are vector and matrix operations in different constellations; the benchmark kernels that fall into this category are: 2mm, 3mm, atax, bicg, doitgen, gemm, gemver, gesummv, mvt, symm, syr2k, syrk and trmm. The linear algebra solvers solve systems of linear equations using different algorithms; the benchmark kernels that fall into this category are: cholesky, durbin, dynprog, gramschmidt, ludcmp and trisolv. The data mining problems, correlation and covariance, calculate statistical measures of how linearly related two variables are. Stencil problems involve updating a grid of data iteratively; the benchmark kernels that fall into this category are: adi, fdtd-2d, fdtd-apml, jacobi-1d-imper, jacobi-2d-imper and seidel-2d. The following two applications do not fit into any of the categories above: floyd-warshall and reg-detect. Floyd-warshall computes the shortest paths between each pair of nodes in a graph. It is unknown what reg-detect is processing [59].

Chapter 8

Results from evaluating the tools

In this chapter the results from the benchmarks on the selected automatic parallelization tools are presented. Additionally, the results from parallelizing the face recognition applications are presented and analyzed. The evaluation was made on a machine with an Intel(R) Xeon(R) CPU E5-2440 @ 2.40 GHz with 8 cores.

8.1 Compilation flags

Table 8.1: Compilation flags for the individual tools.

Tool      Version   Flags
Aesop1    3.4       -lgomp -losal_thread_barrier -lpthreads -lruntime_pthreads_create
Par4All   1.4.5     –
Pluto     0.10.0    --tile --parallel
Polly1    3.4       -O3 -polly-ignore-aliasing -polly -enable-polly-openmp
GCC2      4.9       -fopenmp -O3

1 Flags are used by the LLVM environment.
2 Source code generated from Pluto and Par4all was compiled using GCC.

To run the parallelization steps on the applications, the compilation flags listed in Table 8.1 were used. The only flag that needs further explanation is -polly-ignore-aliasing, which was required to work around a bug in the tool at the time of the benchmark. The flag makes Polly assume that no aliasing occurs in the source code, which is not a safe assumption in real legacy applications; an illustrative example follows below.
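A small illustrative example (not taken from the benchmarks) of why ignoring aliasing is unsafe:

    /* If a and b alias (for example b == a + 1), the iterations are not
     * independent, so an optimizer that ignores aliasing may parallelize
     * or reorder this loop incorrectly. */
    void shift(double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + 1.0;
    }

    /* A caller can legally create the overlap: */
    void caller(double *buf, int n)
    {
        shift(buf, buf + 1, n - 1);   /* a and b overlap */
    }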

8.2 PolyBench results

As described in section 6.1, five tools were to be benchmarked using the PolyBench benchmark. Running PolyBench with Cetus would have required additional work and it was therefore left out of this evaluation, but it is expected that it would perform close to Aesop, since it builds on the same concepts. Figure 8.1, Figure 8.2 and Figure 8.3 show the speedup over the sequential version of each benchmark for each tool that was tested. Each application was run four times: first with the smaller data set using 4 and then 8 cores, and then with the bigger data set using 4 and 8 cores respectively. These different runs show whether the performance scales with the number of processor cores, and the bigger data size shows the effect of intelligently organizing reads and writes through tiling.

There were several issues when compiling or running some of the benchmarks, which meant that they could not be included in the results. Whenever this occurred the result was set to 0, which appears in the figures as a missing bar. Why these issues occurred is hard to say without analyzing and debugging each failing combination thoroughly. The missing bars for application 13 are due to the large amount of data used in that application, which the sequential version could not handle. A run is considered a failure when the application does not run after being processed by a parallelization tool, or when the optimized application is slower than the sequential version.

Overall there is a benefit of using parallelization tools on the PolyBench applications; for the most part the tools do not cause a slowdown. The applications are, however, very small in code size and written in a format that is easily analyzed by the automatic parallelizers. In spite of this, there are some applications that the tools struggle with, which adds to the skepticism of using parallelization tools in real situations. All tools failed to parallelize applications 11 and 24; in these cases it would have been better not to parallelize at all. Additionally, Polly fails on applications 10 and 20, Pluto fails on 7 and 10, and Aesop fails on applications 6, 18, 29 and 30. In other words, more than 10% of the parallelized applications resulted in a failure.
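The speedup reported in the figures is the sequential execution time divided by the parallel execution time for the same kernel and data set. A minimal illustration of such a measurement is given below; it is not the harness used in the evaluation, the kernel is an arbitrary stand-in, the single-threaded run approximates the sequential version, and the thread count 8 simply matches the evaluation machine.

    #include <stdio.h>
    #include <omp.h>

    #define N 4096
    static double x[N], y[N];

    static void kernel(void)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i] += 0.5 * x[(i + j) % N];
    }

    int main(void)
    {
        omp_set_num_threads(1);     /* approximate the sequential version */
        double t0 = omp_get_wtime();
        kernel();
        double t_seq = omp_get_wtime() - t0;

        omp_set_num_threads(8);     /* 8 cores, as on the evaluation machine */
        t0 = omp_get_wtime();
        kernel();
        double t_par = omp_get_wtime() - t0;

        printf("speedup = %.2f\n", t_seq / t_par);
        return 0;
    }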

Figure 8.1: Results from the PolyBench benchmarks (part 1). The y axis shows speedup.

Figure 8.2: Results from the PolyBench benchmarks (part 2). The y axis shows speedup.

Figure 8.3: Results from the PolyBench benchmarks (part 3). The y axis shows speedup.

The performance gain that the polyhedral loop optimizers are capable of producing shows that there is a big benefit of using automatic parallelizers on some types of problems. The linear algebra applications 1, 2, 4, 5, 9, 15, 16, 17, 22, 26, 27, 28 and 30 are easy to parallelize, but generally favor the parallelizers that use polyhedral optimizations with tiling, namely Pluto and Polly, which reach remarkable speedups. It is however not guaranteed that the generated loop schedule is the most favorable for performance, as can be seen in 27 and 28 when comparing Pluto with Polly. Pluto and Polly also perform well on the two data mining problems, 7 and 8, for the same reason. The linear algebra solvers 6, 10, 11, 18, 21 and 29 were more difficult to parallelize. Par4All managed to get good speedups on half of these applications, and performance close to the sequential version on the other half. It is unknown why this is, since the problems look similar to the linear algebra applications. The stencil problems 3, 12, 13, 10, 20 and 25 favor Par4All and Aesop. These problems are simple to parallelize, but the polyhedral method appears to have difficulty finding good schedules for these kinds of problems, which contain statements of the form A[i] = A[i-1] + A[i] + A[i+1]. Andión et al. [48] identify the same limitation using their own tests.
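For example, a Jacobi-style one-dimensional stencil of this form is trivially parallel over the spatial loop while the time loop stays sequential. The sketch below is illustrative only, not the PolyBench code:

    #define TSTEPS 100
    #define N      10000
    static double A[N], B[N];

    void jacobi_1d_like(void)
    {
        for (int t = 0; t < TSTEPS; t++) {
            /* Each i reads only old values of A and writes a distinct
             * element of B, so the spatial loop is embarrassingly parallel;
             * finding a good schedule across the time loop is what the
             * polyhedral tools attempt, with mixed results here. */
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                B[i] = (A[i - 1] + A[i] + A[i + 1]) / 3.0;

            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                A[i] = B[i];
        }
    }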

8.3 Parallelization results on the face recognition applications

The more interesting test is whether the parallelizers are able to parallelize a bigger application, such as the two described in section 7.1. The source-to-source compilers often come with a coding guide that must be followed in order to utilize the automatic parallelizers. These guides were not considered when developing the applications, so it was not possible to use all the tools right away. When the applications were finished, they were refactored according to the coding guidelines presented in this report. The time it took to get the training application accepted by each tool is shown in Table 8.2; the same procedure was carried out for the classifier application and is shown in Table 8.3. Both applications are about the same code size, approximately 2000 lines of code. In contrast to the small PolyBench applications, these two applications are made up of several functions, but their loop nests are simpler.

The original applications had several regions where writes to shared memory would be needed if parallelized. These regions had to be rewritten so that writes only occur to private memory, and this is what took the longest time to refactor. Apart from this, Pluto and Cetus are more limited: Cetus requires that the whole application is visible in one function, while Pluto requires that if statements with unpredictable control flow are hidden away in a function, and that the loops to be checked for parallelism are annotated with a special pragma (illustrated after this paragraph). To simplify the analysis further, and since only Par4all has interprocedural analysis, the functions that were called from within the parallel part of the training program were inlined when running Cetus, Aesop and Polly.

Getting accepted by the automatic parallelizer is only the first step; another is to get a valid optimized application. For the training application Aesop did not output a valid application. Neither did Par4All, but its mistake was easily fixed manually in the parallel source code. The classifier application was more difficult to parallelize: only Par4all and Pluto managed to output an application that compiled, Cetus could not analyze the application well enough to move on to optimization, and both Aesop and Polly generated programs that resulted in segmentation faults. In Figure 8.4 the resulting speedup from parallelizing the training application can be seen. The application was run with 1, 2, 4 and 8 threads respectively, configured using the OMP_NUM_THREADS environment variable before executing the application.
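As an illustration of the Pluto annotation mentioned above, the candidate loop nest is delimited with Pluto's scop pragmas. The function below is illustrative code, not taken from the applications:

    /* Pluto only analyzes the region between the scop pragmas; everything
     * inside must be a static control part (affine loop bounds and array
     * accesses, no unpredictable control flow or side-effecting calls). */
    void scale_rows(double A[1024][1024], const double w[1024])
    {
        #pragma scop
        for (int i = 0; i < 1024; i++)
            for (int j = 0; j < 1024; j++)
                A[i][j] = w[i] * A[i][j];
        #pragma endscop
    }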

Table 8.2: Refactoring time and validity of parallelized training application.

Tool      Refactoring time   Valid program
Aesop     None               No
Cetus     2 days             Yes
Par4all   1 day              No1
Pluto     2 days             Yes
Polly     None               Yes

1 An error in reduction recognition occurred, which invalidated the program. It was however trivial to remove this parallelization by hand.

Table 8.3: Refactoring time and validity of parallelized classification application.

Tool      Refactoring time   Valid program
Aesop     None               No
Cetus     2 days             No
Par4all   1 day              Yes
Pluto     2 days             Yes
Polly     None               No

Figure 8.4: Speedup for different numbers of cores on the training application after parallelization with the different tools.

Cetus, Par4all and Pluto are the three tools that manage to parallelize the training application with a significant performance gain. All three managed to parallelize the loop that iterates over each feature, where the application spends most of its execution time, hence the good speedup. Polly performs weakly on this particular application, for reasons unknown. For the second application only Par4All and Pluto were able to build a runnable program; the performance can be seen in Figure 8.5. Surprisingly, Pluto generated a schedule for the sub window iterating loop which only utilized half of the cores, while Par4All parallelized the same loop much more simply, by assigning each column (the outer loop) to a thread.

As can be seen from the results of this evaluation, the complex application was too difficult for the selected tools to parallelize. It is hard to claim a winner in this test: all the tools performed poorly when considering the level of autonomy, the refactoring time and the validity of the resulting application. The tool that performed best was Par4all, since it had a high level of autonomy and its output had good performance, disregarding the minor error in the training application.
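The dominant training loop has roughly the following shape: every feature is scored independently, and only the selection of the best (lowest error) feature needs synchronization. The sketch below is a simplified, hypothetical version with dummy data, not the actual training code:

    #include <float.h>
    #include <stdio.h>

    #define NUM_FEATURES 2000
    #define NUM_SAMPLES  500

    /* Dummy stand-in for evaluating one feature over all training samples. */
    static double feature_error(int f, const double *weights)
    {
        double err = 0.0;
        for (int s = 0; s < NUM_SAMPLES; s++)
            err += weights[s] * ((f + s) % 7 == 0);
        return err;
    }

    int main(void)
    {
        static double weights[NUM_SAMPLES];
        for (int s = 0; s < NUM_SAMPLES; s++)
            weights[s] = 1.0 / NUM_SAMPLES;

        double best_error = DBL_MAX;
        int best_feature = -1;

        /* Each feature is evaluated independently; only the final
         * "keep the minimum" step needs synchronization. */
        #pragma omp parallel for
        for (int f = 0; f < NUM_FEATURES; f++) {
            double err = feature_error(f, weights);
            #pragma omp critical
            if (err < best_error) {
                best_error = err;
                best_feature = f;
            }
        }

        printf("best feature %d with error %f\n", best_feature, best_error);
        return 0;
    }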

Figure 8.5: Speedup for different numbers of cores on the classification application after parallelization with the different tools.

Pluto also managed to handle both applications, but it has to be taken into account that after refactoring, the parallel regions contain function calls that Pluto assumes are side effect free, which is generally an unsafe assumption. Additionally, the programmer has to annotate the loop region that should be parallelized, so the autonomy is really low. Aesop and Polly, which are both very young tools, do not require the same level of refactoring; unfortunately they were unable to handle these two programs. The resulting program after optimization is also much harder to analyze because of the LLVM IR format, so it is difficult to identify and correct the errors they introduced.

8.4 Discussion

The two applications could have been rewritten more extensively to give the tools a better chance to parallelize them, but that defeats the purpose of the parallelization tools, since in reality it is as easy for a programmer to parallelize these applications by hand using OpenMP pragmas as it is to refactor the code to enable the use of parallelization tools. Interprocedural analysis is a key method for increasing the autonomy of the parallelization tools, since inlining functions was one of the most time consuming tasks. Assuming that a function is side effect free is unsafe, and interprocedural analysis is necessary to automatically identify the properties of functions called within the parallel region.

For porting legacy code, it was also clear that variable privatization plays a huge role when parallelizing the training application. This is similar to what Bae et al. [37] show with their empirical tuning when evaluating Cetus. These features are therefore a necessity in an automatic parallelizer. OpenMP pragma clauses privatize data arrays stored on the stack, but to privatize data stored on the heap it is required that the allocation is performed within the parallel region through a privatized pointer. Ideally, an automatic step could move dynamic allocations and free statements from outside a loop into the loop, so that the data becomes privatized; this would remove additional loop dependencies. Reduction detection also plays an important part, since it can remove loop carried dependencies that may prevent a loop from being parallelized. A sketch of such a rewrite is shown below.
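The sketch below illustrates the kind of rewrite this implies (illustrative code, not the thesis implementation): the scratch buffer is allocated inside the loop body so that each iteration, and thus each thread, gets its own copy, and the accumulated result is expressed as an OpenMP reduction.

    #include <stdlib.h>

    double process_all(const double *in, int n, int m)
    {
        double sum = 0.0;

        /* 'tmp' allocated inside the loop body is private to each iteration,
         * removing the dependence that a single shared buffer would create;
         * 'sum' is a (+) reduction (error handling omitted for brevity). */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            double *tmp = malloc((size_t)m * sizeof *tmp);
            for (int j = 0; j < m; j++)
                tmp[j] = in[i] * j;
            for (int j = 0; j < m; j++)
                sum += tmp[j];
            free(tmp);
        }
        return sum;
    }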

In summary, it can be concluded from these results that automatic parallelizers are not yet ready for large legacy code projects. Big projects take significant time to refactor, and refactoring the code for automatic parallelizers is both time consuming and gives no guarantee that the application is significantly improved after a pass through the parallelization tools.

To avoid this time consuming refactoring there are three paths. One path is to extend the current research source-to-source compilers to work more autonomously using interprocedural analysis and to handle more special cases in the source language. The second path is to not look at source code at all and move to a lower level representation, as in production compilers such as LLVM/Clang: as long as the source compiles, it can be analyzed for parallelism. The IR is, however, different in the investigated source-to-source compilers compared to GCC and Clang. Production compilers use an IR with a low abstraction level, where C code has been lowered towards hardware instructions and functions and loops have become several basic blocks interconnected with branches. Parallelizing at this level needs other models than those used by source-to-source parallelizers; it requires that the low level code is transformed back into general structures that are easily analyzed, and LLVM provides passes for doing this which both Aesop and Polly use.

Lastly, one could choose not to use automatic parallelization tools at all and just parallelize the code by hand with the help of a profiling tool that finds parallel parts. This path is more reliable and secure, which is an important aspect when working with embedded systems. Pareon was able to identify where the program would benefit the most from being parallelized. The implementation hints that Pareon provides are, however, vague and only identify which variables create loop dependencies; the programmer has to analyze the loop region more thoroughly to identify how it should be rewritten. Analyzing the program with a tool like Pareon took less than an hour, and doing the parallelization by hand using the hints should take a similar amount of time as refactoring the code to work with the parallelization tools.

Chapter 9

Requirements fulfilled by automatic parallelizers

In order for automatic parallelizers to be a usable tool in product development, and specifically in the area of embedded systems, there are a number of requirements that need to be fulfilled. These requirements relate to performance, reliability, portability and development costs. In this chapter each tool investigated in this study is evaluated against the requirements presented in chapter 6.

9.1 Code handling and parsing

All tools are able to handle serial C code (SWREQ01), but only Par4all handles interprocedural C code, and only with some limitations (SWREQ02), such as requiring stubs for functions in linked libraries. Polly and Aesop handle other languages (SWREQ03) and can optimize the code after the linking phase, which means that they could potentially handle interprocedural code for both C and multiple other languages (SWREQ04). None of the tools is able to detect already parallelized code (SWREQ05), which results in over-parallelizing an already parallel region.

9.2 Reliability and exposing parallelism

All the tools are able to parallelize without introducing race conditions (SWREQ06) or deadlocks (SWREQ07). Par4All, Cetus and Aesop have an edge over Pluto and Polly, as they are capable of detecting reductions (SWREQ08) and private variables (SWREQ09), thus exposing more parallel regions; an example of such a loop is shown below. However, Pluto and Polly can achieve bigger speedups for certain parallel regions by optimizing for cache locality.
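For example, a loop of the following shape (illustrative code, not from the evaluated applications) is only recognized as parallel if the tool privatizes t and treats sum as a reduction:

    /* Without privatizing 't' and recognizing 'sum' as a reduction, both
     * variables carry dependences between iterations and the loop cannot
     * be parallelized safely. */
    double dot(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        double t;
        for (int i = 0; i < n; i++) {
            t = a[i] * b[i];   /* t must be privatized */
            sum += t;          /* sum is a (+) reduction */
        }
        return sum;
    }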

9.3 Maintenance and portability

The code can be maintained as a single version (SWREQ10), since with the tools it is possible to keep the sequential version and generate different parallel versions from it. All tools can generate an SMP parallel version (SWREQ12). Additionally, Pluto can generate a version for distributed systems (SWREQ13) and hybrid distributed systems (SWREQ14). Par4All can also generate code for GPUs (SWREQ15), although not for any other heterogeneous system. The generated GPU code is, however, not asynchronous (SWREQ17), meaning that the CPU will be idle while the GPU executes.

9.4 Parallelism performance and tool efficiency

The SMP code generated by all the tools can make use of vector instructions (SWREQ16) in the trivial cases handled by GCC or Clang. The parallel loops in the generated code can be given load balanced schedules using the dynamic scheduling clause in OpenMP (SWREQ18), as illustrated below. However, none of the tools has heuristics for finding the most beneficial region to parallelize in the case of nested parallelism. Additionally, heuristics for guaranteeing a performance increase (SWREQ19) are missing in all tools except Cetus. It is, however, less time consuming to parallelize by hand than to make use of any of the investigated tools (SWREQ20), although the tools are capable of finding parallelism automatically once the input code has been prepared (SWREQ21).
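For illustration, load balancing with the dynamic scheduling clause looks as follows in OpenMP code (a generic sketch, not output from any of the tools):

    /* Iterations whose cost grows with i are handed out in small chunks at
     * run time, so the threads stay balanced instead of the last thread
     * receiving all the expensive iterations. */
    void run_unbalanced(double *out, int n)
    {
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = 0; k <= i; k++)      /* work grows with i */
                s += (double)k / (i + 1);
            out[i] = s;
        }
    }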

Chapter 10

Conclusions

In this report, I have presented a collection of different concepts within the subject of automatic parallelization. Additionally, I have investigated and compared tools with respect to which of these concepts they use. Their efficiency was compared using the PolyBench benchmark, consisting of small programs, and two complex face recognition applications. In this chapter I present the conclusions that can be drawn from the study and implementation as a whole.

10.1 Limitations of parallelization tools

I carried out benchmark experiments to get an understanding of how well the parallelization tools are able to port legacy code. The results show that the investigated tools are not reliable enough for larger code projects. Extensive human involvement is necessary to refactor the code before feeding it to the current parallelization tools in order to draw any benefit from them. The reason for this is that legacy code generally does not follow the coding standard supported by the parallelization tool. Pointer arithmetic, type casts, recursive functions and function pointers are some constructs that may hinder parallelization. Stubs for functions that are inaccessible for analysis, and stubs for already parallelized code, are also required. The tools can also be sensitive to where variables are declared. The LLVM-based parallelization tools are less sensitive to this, since they operate on a low level intermediate representation. They are also less restricted in the code optimizations they can perform, since readable source code is not expected from them, and in addition several different source languages can be used.

10.2 Manual versus Automatic parallelization

I have categorized the tools used for parallelizing legacy code into two groups. The first category of tools, such as Pareon, gives assistance when parallelizing legacy code by hand: they help identify parallel sections and data dependencies, and can potentially speed up the process of parallelizing legacy code manually. The second category of tools parallelizes the code automatically. The approach recommended by this report is to use tools in the first category and to parallelize by hand. The reason is the limitations of the automatic tools: making up for those limitations by refactoring the code is as costly as parallelizing the code by hand. The potential of automatic parallelization, however, still exists. Par4All, the most efficient automatic parallelization tool in this study, nearly handled the two legacy face recognition applications.

If the efficiency of this tool could be migrated to a production compiler, it would not be long until automatic parallelization for coarse-grain parallelism became standard during compilation; fine-grained parallelism, such as SIMD instructions, has already become standard in both GCC and LLVM. There are some critical issues that have to be dealt with, but the complexity of fixing them is uncertain. Apart from minor bugs in analysis and optimization, there is a need for profitability heuristics to determine whether an optimization will result in a speedup.

Portability is also an area where automatic parallelization may have an edge over hand-parallelized code. Maintaining several versions that target different platforms is difficult: if a new feature is to be implemented, all versions have to be updated with the new feature and preferably optimized for each target. With automatic parallelization there could instead be a single sequential version that is maintained, and the binaries for each target are recompiled from it. This, however, assumes that the vendors of the different platforms put in a lot of work so that compilers are able to generate optimized code for their platforms.

10.3 Future work

In this report, I have evaluated the tools' efficiency at parallelizing legacy software using only two complex applications. Ideally, additional programs should be used to evaluate the tools, to validate that the automatic parallelization tools are stable and mature. The functionality of Par4All could be migrated to a production compiler to reduce its sensitivity to how the code is written. Extending the polyhedral optimizers with functionality to resolve common dependencies, such as reductions and privatization of scalars and arrays, would allow further identification of coarse-grain parallelism with a possibility of better parallel schedules, and thus better performance than the traditional parallelizers. Analysis of side effects in functions is also a necessity for identifying parallel regions. Other parallelization methods could be evaluated and compared to those in this thesis, such as thread level speculation or automatic generation of the OpenMP task model, to name two examples. Embedded systems come in many shapes and forms; being able to target several different parallel platforms from one code base would decrease the maintenance time of the code, and therefore an investigation of how to transparently deploy optimized parallel code on different architectures should be carried out. Finally, developing a heuristic that can determine whether it is profitable to parallelize a loop or not would help make such compilers reliable enough for product development for embedded systems.

List of References

[1] The OpenMP® API specification for parallel programming. http://www.openmp.org/. [Online: accessed 20140328].
[2] ITEA2/MANY. State of Art, 2013.
[3] Many-core Programming and Resource Management for High-Performance Embedded Systems. http://www.eurekamany.org. [Online: accessed 20140328].
[4] Alten Sverige AB. http://www.alten.se/en/about-alten/. [Online: accessed 20140328].
[5] ConstRaint and Application driven Framework for Tailoring Embedded Real-time Systems. http://www.crafters-project.org/. [Online: accessed 20140328].
[6] Georgios Tournavitis and Björn Franke. Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 377–388, New York, NY, USA, 2010. ACM.
[7] Michael Süß and Claudia Leopold. Common mistakes in openmp and how to avoid them. In OpenMP Shared Memory Parallel Programming, pages 312–323. Springer, 2008.
[8] Message Passing Interface Forum. http://www.mpi-forum.org/. [Online: accessed 20140328].
[9] Tim Mattson. Patterns for parallel programming. http://parlang.pbworks.com/f/programmability.pdf.

[10] Multicore Communications API Working Group (MCAPI®). http://www.multicore-association.org/workgroup/mcapi.php. [Online: accessed 20140328].
[11] The open standard for parallel programming of heterogeneous systems. https://www.khronos.org/opencl/. [Online: accessed 20140328].
[12] CUDA Zone. https://developer.nvidia.com/cuda-zone. [Online: accessed 20140328].
[13] OpenACC, Directives for Accelerators. http://www.openacc-standard.org/. [Online: accessed 20140328].
[14] OpenHMPP directives. http://www.caps-entreprise.com/openhmpp-directives/. [Online: accessed 20140328].
[15] Chunhua Liao, Yonghong Yan, Bronis R de Supinski, Daniel J Quinlan, and Barbara Chapman. Early experiences with the openmp accelerator model. In OpenMP in the Era of Low Power Devices and Accelerators, pages 84–98. Springer, 2013.
[16] GCC 4.9 Release Series Changes, New Features, and Fixes. http://gcc.gnu.org/gcc-4.9/changes.html. [Online: accessed 20140328].

[17] ARM Cortex-A53. http://www.arm.com/products/processors/cortex-a/cortex-a53-processor.php. [Online: accessed 20140716].
[18] The Parallella board. http://www.parallella.org/board/. [Online: accessed 20140716].
[19] ARM Cortex A9. http://www.arm.com/products/processors/cortex-a/cortex-a9.php. [Online: accessed 20140716].
[20] Xilinx Zynq-7000 series. http://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html. [Online: accessed 20140716].
[21] Epiphany IV. http://www.adapteva.com/epiphanyiv/. [Online: accessed 20140716].
[22] Haoqiang Jin, Dennis Jespersen, Piyush Mehrotra, Rupak Biswas, Lei Huang, and Barbara Chapman. High performance computing using MPI and openmp on multi-core parallel systems. Parallel Computing, 37(9):562–575, 2011. Emerging Programming Paradigms for Large-Scale Scientific Computing.
[23] R. Rabenseifner, G. Hager, and G. Jost. Hybrid mpi/openmp parallel programming on clusters of multi-core smp nodes. In Parallel, Distributed and Network-based Processing, 2009 17th Euromicro International Conference on, pages 427–436, Feb 2009.
[24] ROSE Compiler Infrastructure. http://rosecompiler.org/. [Online: accessed 20140328].
[25] The Portland Group. http://www.pgroup.com/index.htm. [Online: accessed 20140328].
[26] The Caps Compilers. http://www.caps-entreprise.com/products/caps-compilers/. [Online: accessed 20140328].
[27] Convey Computer. http://www.conveycomputer.com. [Online: accessed 20140328].
[28] AMD A-Series APU Processors. http://www.amd.com/en-us/products/processors/desktop/a-series-apu. [Online: accessed 20140328].
[29] Albert Saà-Garriga, David Castells-Rufas, and Jordi Carrabina. Omp2hmpp: Hmpp source code generation from programs with pragma extensions. In High Performance Energy Efficient Embedded Systems (HIP3ES 2014), Jan 2014.
[30] Romain Dolbeau, François Bodin, and Guillaume Colin de Verdiere. One to rule them all? In Multi-/Many-core Computing Systems (MuCoCoS), 2013 IEEE 6th International Workshop on, pages 1–6. IEEE, 2013.
[31] K. Kyriakopoulos and K. Psarris. Nonlinear symbolic analysis for advanced program parallelization. Parallel and Distributed Systems, IEEE Transactions on, 20(5):623–640, May 2009.
[32] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2008.
[33] Arturo González-Escribano and Diego R. Llanos. Speculative parallelization. Computer, 39(12):126–128, December 2006. ISSN 0018-9162.
[34] Arnamoy Bhattacharyya and José Nelson Amaral. Automatic speculative parallelization of loops using polyhedral dependence analysis. In Proceedings of the First International Workshop on Code OptimiSation for MultI and many Cores, page 1. ACM, 2013.
[35] Polly - LLVM Framework for High-Level Loop and Data-Locality Optimizations. http://polly.llvm.org/. [Online: accessed 20140328].

[36] Georgios Tournavitis, Zheng Wang, Björn Franke, and Michael FP O'Boyle. Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping. 44(6):177–187, 2009.
[37] Hansang Bae, Dheya Mustafa, Jae-Woo Lee, Aurangzeb, Hao Lin, Chirag Dave, Rudolf Eigenmann, and Samuel P. Midkiff. The cetus source-to-source compiler infrastructure: Overview and evaluation. International Journal of Parallel Programming, 41(6):753–767, 2013.

[38] PoCC: the Polyhedral Compiler Collection. http://www.cs.ucla.edu/~pouchet/software/pocc/. [Online: accessed 20140328].
[39] PLUTO - An automatic parallelizer and locality optimizer for multicores. http://pluto-compiler.sourceforge.net/. [Online: accessed 20140328].
[40] Mehdi Amini, Corinne Ancourt, Fabien Coelho, François Irigoin, Pierre Jouvelot, Ronan Keryell, Pierre Villalon, Béatrice Creusillet, and Serge Guelton. Pips is not (just) polyhedral software. In International Workshop on Polyhedral Compilation Techniques (IMPACT'11), Chamonix, France, 2011.
[41] Dounia Khaldi, Corinne Ancourt, and François Irigoin. Towards automatic c programs optimization and parallelization using the pips-pocc integration. PDF from http://www.rocq.inria.fr/~pouchet/software/pocc/doc/htmldoc/htmldoc/index.html, 2011.
[42] The Par4all Compiler - An automatic parallelizing and . http://www.par4all.org/. [Online: accessed 20140328].
[43] Tobias Grosser, Armin Größlinger, and Christian Lengauer. Polly—performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22(04), 2012.
[44] A. Koth, T. Creech, and R. Barua. Aesop: The autoparallelizing compiler for shared memory computers. Technical report, Department of Electrical and Computer Engineering, University of Maryland, College Park, April 2013.
[45] Graphite: Gimple Represented as Polyhedra. http://gcc.gnu.org/wiki/Graphite. [Online: accessed 20140328].
[46] The Cetus Project. http://cetus.ecn.purdue.edu/. [Online: accessed 20140328].
[47] Parallware, Automatic Parallelization of Sequential Codes. http://www.appentra.com. [Online: accessed 20140328].
[48] José M. Andión, Manuel Arenaz, Gabriel Rodríguez, and Juan Touriño. A novel compiler support for automatic parallelization on multicore systems. Parallel Computing, 39(9):442–460, 2013. Novel On-Chip Parallel Architectures and Software Support.
[49] OpenHMPP (HMPP for Hybrid Multicore Parallel Programming). http://en.wikipedia.org/wiki/OpenHMPP. [Online: accessed 20140328].
[50] STEP: Système de Transformation pour l'Exécution Parallèle. http://picoforge.int-evry.fr/projects/svn/step/index.html. [Online: accessed 20140328].
[51] Daniel Millot, Alain Muller, Christian Parrot, and Frédérique Silber-Chaussumier. Step: A distributed openmp for coarse-grain parallelism tool. In Rudolf Eigenmann and Bronis R. Supinski, editors, OpenMP in a New Era of Parallelism, volume 5004 of Lecture Notes in Computer Science, pages 83–99. Springer Berlin Heidelberg, 2008.

[52] Vector Fabrics, Improving software performance. http://www.vectorfabrics.com. [Online: accessed 20140328].
[53] Mehdi Amini, Fabien Coelho, François Irigoin, and Ronan Keryell. Static compilation analysis for host-accelerator communication optimization. In Languages and Compilers for Parallel Computing, pages 237–251. Springer, 2013.
[54] Open Source Computer Vision. http://opencv.org/. [Online: accessed 20140328].
[55] Viola–Jones object detection framework. http://en.wikipedia.org/wiki/Viola-Jones_object_detection_framework. [Online: accessed 20140528].
[56] AdaBoost, short for Adaptive Boosting. http://en.wikipedia.org/wiki/AdaBoost. [Online: accessed 20140528].
[57] PolyBench/C the Polyhedral Benchmark suite. http://www.cse.ohio-state.edu/~pouchet/software/polybench/. [Online: accessed 20140528].
[58] NAS Parallel Benchmark. http://www.nas.nasa.gov/publications/npb.html. [Online: accessed 20140328].
[59] Tomofumi Yuki. Polybench kernels. 2013.


TRITA-ICT-EX-2014:153

www.kth.se