UPTEC IT 21001
Degree Project in Computer and Information Engineering
March 2, 2021

Hardware Architecture Impact on Manycore Programming Model

Erik Stubbfält

Civilingenjörsprogrammet i informationsteknologi
Master Programme in Computer and Information Engineering

Abstract

Hardware Architecture Impact on Manycore Programming Model

Erik Stubbfält

Institutionen för informationsteknologi
Besöksadress: ITC, Polacksbacken, Lägerhyddsvägen 2
Postadress: Box 337, 751 05 Uppsala
Hemsida: http://www.it.uu.se

This work investigates how certain hardware architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is compared and contrasted to general-purpose multicore processors, highlighting differences in their memory systems and processor cores. A proof-of-concept implementation of the Concurrency Building Blocks (CBB) programming model is developed for x86-64 using MPI. Benchmark tests show how CBB on EMCA handles compute-intensive and memory-intensive scenarios, compared to a high-end x86-64 machine running the proof-of-concept implementation. EMCA shows its strengths in heavy computations, while x86-64 performs at its best with high degrees of data reuse. Both systems are able to utilize locality in their memory systems to achieve great performance benefits.

Extern handledare: Lars Gelin & Anders Dahlberg, Ericsson
Ämnesgranskare: Stefanos Kaxiras
Examinator: Lars-Åke Nordén
ISSN 1401-5749, UPTEC IT 21001
Tryckt av: Ångströmlaboratoriet, Uppsala universitet

Sammanfattning

This project investigates how different processor architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is analyzed and compared with commercial multicore processors, and differences in their respective memory systems and processor cores are covered. A prototype Concurrency Building Blocks (CBB) implementation for x86-64 is developed with the help of MPI. Benchmark tests show how CBB together with EMCA handles compute-intensive and memory-intensive scenarios, in comparison with a modern x86-64 system running the developed prototype. EMCA shows its strengths in heavy computations, while x86-64 performs best when data is reused to a high degree. Both systems use locality in their respective memory systems in a way that greatly benefits performance.

Contents

1 Introduction
2 Background
  2.1 Multicore and manycore processors
  2.2 Parallel computing
    2.2.1 Different types of parallelism
    2.2.2 Parallel programming models
  2.3 Memory systems
    2.3.1 Cache and scratchpad memory
  2.4 Memory models
  2.5 SIMD
  2.6 Prefetching
  2.7 Performance analysis tools
  2.8 The actor model
  2.9 Concurrency Building Blocks
  2.10 The baseband domain
3 Purpose, aims, and motivation
  3.1 Delimitations
4 Methodology
  4.1 Literature study
  4.2 Development
  4.3 Testing
5 Literature study
  5.1 Comparison of architectures
    5.1.1 Memory system
    5.1.2 Processor cores
    5.1.3 SIMD operations
    5.1.4 Memory models
  5.2 Related academic work
    5.2.1 The Art Of Processor Benchmarking: A BDTI White Paper
    5.2.2 A DSP Acceleration Framework For Software-Defined Radios On x86-64
    5.2.3 Friendly Fire: Understanding the Effects of Multiprocessor Prefetches
    5.2.4 Analysis of Scratchpad and Data-Cache Performance Using Statistical Methods
6 Selection of software framework
  6.1 MPI
    6.1.1 Why MPI?
    6.1.2 MPICH
    6.1.3 Open MPI
7 Selection of target platform
8 Evaluation methods
  8.1 Strong scaling and weak scaling
    8.1.1 Compute-intensive benchmark
    8.1.2 Memory-intensive benchmark without reuse
    8.1.3 Memory-intensive benchmark with reuse
    8.1.4 Benchmark tests in summary
  8.2 Collection of performance metrics
  8.3 Systems used for testing
9 Implementation of CBB actors using MPI
  9.1 Sending messages
  9.2 Receiving messages
10 Creating and running benchmark tests
  10.1 MPI for x86-64
  10.2 CBB for EMCA
11 Results and discussion
  11.1 Compute-intensive benchmark
    11.1.1 Was the test not compute-intensive enough for EMCA?
  11.2 Memory-intensive benchmark with no data reuse
  11.3 Memory-intensive benchmark with data reuse
  11.4 Discussion on software complexity and optimizations
12 Conclusions
13 Future work
  13.1 Implement a CBB transform with MPI for x86-64
  13.2 Expand benchmark tests to cover more scenarios
  13.3 Run benchmarks with hardware prefetching turned off
  13.4 Combine MPI processes with OpenMP threads
  13.5 Run the same code in an ARMv8 system

List of Figures

1 Memory hierarchy of a typical computer system [7].
2 Memory hierarchy and address space for a cache configuration (left) and a scratchpad configuration (right) [2, Figure 1].
3 Main artefacts of the CBB programming model.
4 Conceptual view of a multicore system implementing TSO [7, Figure 4.4 (b)]. Store instructions are issued to a FIFO store buffer before entering the memory system.
5 Categorization of DSP benchmarks from simple (bottom) to complex (top) [5, Figure 1]. The grey area shows examples of benchmarks that BDTI provides.
6 Processor topology of the x86-64 system used for testing.
7 The CBB application used for implementation.
8 Normalized execution times for the compute-intensive benchmark test with weak scaling.
9 Normalized execution times for the compute-intensive benchmark test with strong scaling.
10 Speedup for the compute-intensive benchmark test with strong scaling.
11 Speedup for the compute-intensive benchmark test with strong scaling and 64-bit floating-point addition. Only EMCA was tested.
12 Normalized execution times for the memory-intensive benchmark with no data reuse and weak scaling.
13 Normalized execution times for the memory-intensive benchmark with no data reuse and strong scaling.
14 Speedup for the memory-intensive benchmark with no data reuse and strong scaling.
15 Normalized execution times for the memory-intensive benchmark with data reuse and weak scaling.
16 Cache miss ratio in L1D for the memory-intensive benchmark with data reuse and weak scaling.
17 Normalized execution times for the memory-intensive benchmark with data reuse and strong scaling.
18 Speedup for the memory-intensive benchmark with data reuse and strong scaling.
19 Cache miss ratio for the memory-intensive benchmark with data reuse and strong scaling.

List of Tables

1 Flag synchronization program to motivate why memory models are needed [20, Table 3.1].
2 One possible execution of the program in Table 1 [20, Table 3.2].


1 Introduction

This work is centered around the connections between two areas within computer science, namely hardware architecture and parallel programming. How can a programming model, developed specifically for a certain processor type, be expanded and adapted to run on a completely different hardware architecture? This question, which is a general problem found in many areas of industry and research, is what this thesis revolves around.

The project is conducted in collaboration with the Baseband Infrastructure (BBI) department at Ericsson. They develop low-level software platforms and tools used in baseband software within the Ericsson Radio System product portfolio. This includes the Concurrency Building Blocks (CBB) programming model, which is designed to take full advantage of the Ericsson Many-Core Architecture (EMCA) hardware.

EMCA has a number of characteristics that set it apart from commercial off-the-shelf (COTS) designs like x86-64 and ARMv8. EMCA uses scratchpad memories and simplistic DSP cores instead of the coherent cache systems and out-of-order cores with simultaneous multithreading found in general-purpose hardware. These differences, and more, are investigated in a literature study with a special focus on how they might affect run-time performance.

MPI is used as a tool for developing a working CBB prototype that can run on both x86-64 and ARMv8. This choice is motivated by the many similarities between concepts used in CBB and concepts seen in MPI. Finally, a series of benchmark tests are run with CBB on EMCA and with the CBB prototype on a high-end x86-64 machine. These tests aim to investigate some compute-intensive and memory-intensive scenarios, which are both relevant for actual baseband software. Each test is run with a fixed problem size which is divided equally among the available workers, and also with a problem size that increases linearly with the number of workers. EMCA shows very good performance with the compute-intensive tests. The test (using 16-bit integer addition) is in fact deemed to not be compute-intensive enough to highlight the expected scaling behavior, and a modified benchmark (using 64-bit floating-point addition) is also tested. In the memory-intensive tests, it is shown that x86-64 performs at its best when the degree of data reuse is high and it can hold data in its L1D cache. In this scenario it shows better scaling behavior than EMCA. However, x86-64 takes a much larger performance hit than EMCA when the number of processes exceeds the number of available processor cores.

The rest of this report is structured as follows: Section 2 describes the necessary background theory on the problem at hand. Section 3 discusses the purpose, aims and motivation behind the project, along with some delimitations. Section 4 goes into the methodology used. The literature study is contained in Section 5, and Sections 6 and 7 describe the software and hardware that are used for development. The development of a CBB proof-of-concept and a series of benchmark tests is described in Sections 8, 9 and 10. The results are presented and discussed in Section 11. Finally, Section 12 contains some conclusions and a summary of the contributions made, and Section 13 describes how this work could be continued in the future.

2 Background

To arrive at a concrete problem description, a bit of background theory on hardware architecture and parallel programming is required. The following sections provide details about some of the concepts that will be central later on.

2.1 Multicore and manycore processors

Traditionally, the key to making a computer program run faster was to increase the performance of the processor core that it was running on. The increase in single-core performance over the years was made possible by Moore’s law [13, p. 17], which describes how the number of transistors that fit on an integrated circuit of a given size has doubled approximately every two years. However, this rate of progression started to level off in the late 00s, mainly as a consequence of limits in power consumption. To get more performance out of each unit of energy, the solution was to add more processor cores and have them collaborate on running program code. This type of construction is commonly referred to as a multicore processor. In cases where the number of cores is especially high, the term manycore processor is often used.

2.2 Parallel computing

The performance observed when running a certain algorithm rarely scales perfectly with the number of processor cores added. Instead, the possible speedup for a fixed problem size is indicated by Amdahl’s law [14], which can be formulated as

Speedup = 1 / ((1 − f) + f/s),

where f is the fraction of the program that can benefit from additional system resources (in this case processor cores) to get a speedup of s. Another way of looking at this is that s is the number of processor cores used. This means that there is a portion of the program that can be split up in parallel tasks. The fraction of the code that cannot be parallelized, and thus not benefit from more cores, is represented by 1 − f. For example, with f = 0.95 and s = 128 cores, the speedup is limited to 1/(0.05 + 0.95/128) ≈ 17. Finding and exploiting parallelism, i.e. maximizing f, is crucial to getting the most out of modern multicore hardware in terms of overall performance. Scaling the number of processors used for a fixed problem size, as described by Amdahl’s law, is often referred to as strong scaling [17]. There is also a possibility of increasing the size of the problem along with the number of processor cores. This is called weak scaling. The possible speedup gains for this type of scenario are given by Gustafson’s law [12],

Speedup = (1 − f) + f · s,

where f and s have the same meaning as previously. We can see that there is no theoretical upper limit to the speedup that can be achieved with weak scaling, since the speedup increases linearly with s. In contrast, the non-parallelizable part 1 − f poses a hard upper limit on the speedup for strong scaling, even if s approaches infinity.
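To make the two laws concrete, the following small C program (an illustration added here, not part of the thesis benchmark code) evaluates both formulas for an assumed parallel fraction f = 0.95:

#include <stdio.h>

/* Amdahl's law: speedup for a fixed problem size (strong scaling). */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Gustafson's law: speedup when the problem grows with the core count (weak scaling). */
static double gustafson(double f, double s)
{
    return (1.0 - f) + f * s;
}

int main(void)
{
    const double f = 0.95;  /* assumed parallel fraction */
    for (double s = 2.0; s <= 1024.0; s *= 2.0)
        printf("s = %4.0f   Amdahl = %6.2f   Gustafson = %7.2f\n",
               s, amdahl(f, s), gustafson(f, s));
    return 0;
}

Running it shows the Amdahl speedup flattening toward 1/(1 − f) = 20 while the Gustafson speedup keeps growing linearly.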

2.2.1 Different types of parallelism

There are many ways to divide parallelism into subgroups. In the context of multicore processors and especially in this project, two of the most important types are:

• Data parallelism: The same set of operations is performed on many pieces of data, for example across items in a data vector, and the iterations are independent from one another. This means that the iterations can be split up beforehand and then be performed in parallel.

• Task parallelism: Different parts of a program are split up into tasks, which are sets of operations that are typically different from one another. A set of tasks can operate on the same or different data. If multiple tasks are completely independent of one another they can be run in parallel.

2.2.2 Parallel programming models

When creating a parallel program, the programmer typically uses a parallel programming model. This is an abstraction of available hardware that gives parallel capabilities to an existing programming language, or in some cases introduces an entirely new programming language. Programming models can for example support features such as thread creation, communication, and synchronization primitives.

2.3 Memory systems

Ideally a processor core would be able to access data items to operate on without any delay, regardless of which piece of data it requests. In reality, memories with short access times are very expensive to manufacture and are also not perfectly scalable. This is why modern computer systems have a hierarchy of memory devices attached to them. Figure 1 shows an example of a memory hierarchy, and the technologies typically associated with each level.

Figure 1 Memory hierarchy of a typical computer system [7].

One of the main ideas behind memory hierarchies is to take advantage of the principle of locality, which states that programs tend to reuse data and instructions they have used recently [13, p. 45]. This means that access patterns both in data and in code can often be predicted to a certain extent. Temporal locality refers to a specific item being reused multiple times within a short time period, while spatial locality means that items with adjacent memory addresses are accessed close together in time.


Figure 2 Memory hierarchy and address space for a cache configuration (left) and a scratchpad configuration (right) [2, Figure 1].

2.3.1 Cache and scratchpad memory

Cache memories typically sit close to the processor core. They are high-speed on-chip memory modules that share the same address space as the underlying main memory. Data and instructions are automatically brought into the cache, and there is typically a cache coherency mechanism to ensure that all data items get updated accordingly in all levels of the cache system. All this is typically implemented in hardware, and is therefore invisible to the programmer.

Some processor designs have scratchpad memory, which has the same high-speed characteristics as a cache [2]. The transfer of data to and from a scratchpad memory is typically controlled in software, which makes it different from a cache, where this is handled entirely in hardware. Scratchpad memory requires more effort from software developers, but is more predictable since it does not suffer from cache misses. Scratchpad memory also consumes less energy per access than cache memory since it has its own address space, meaning that no address tag lookup mechanism is needed. Figure 2 shows a schematic view of differences between cache and scratchpad memory.

Important to note is that there are many possible configurations of cache and scratchpad memory within a chip. They may for example be shared across multiple cores or private to a single core, and there may be multiple levels of cache or scratchpad memory where each level has different properties. It is also possible to utilize a scratchpad memory in conjunction with software constructs that make it behave similarly to a cache.
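As a minimal sketch of what software-managed data movement can look like, the following C fragment processes an array in chunks through a core-local buffer. The dma_copy function is hypothetical, standing in for whatever copy or DMA primitive a scratchpad-based platform provides; on a cached system the hardware would perform the equivalent movement transparently.

#define SPM_WORDS 256

/* Hypothetical platform primitive that moves data between main memory
   and the core-local scratchpad. */
extern void dma_copy(void *dst, const void *src, int words);

static int spm[SPM_WORDS];  /* stand-in for the core-local scratchpad */

void process_in_chunks(const int *src, int *dst, int n)
{
    for (int base = 0; base < n; base += SPM_WORDS) {
        int len = (n - base < SPM_WORDS) ? (n - base) : SPM_WORDS;
        dma_copy(spm, &src[base], len);   /* explicit load into the scratchpad */
        for (int i = 0; i < len; i++)
            spm[i] *= 2;                  /* compute on fast local memory */
        dma_copy(&dst[base], spm, len);   /* explicit write-back to main memory */
    }
}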


2.4 Memory models

The memory model is an abstract description of the memory-ordering properties that a particular system has. Important to note here is that these properties are visible to software threads in a multicore system. Different memory models also give compilers different degrees of freedom to reorder and optimize code.

Core 1                     | Core 2                      | Comments
S1: Store data = NEW;      |                             | Initially, data = 0 & flag ≠ SET
S2: Store flag = SET;      | L1: Load r1 = flag;         | L1 & B1 may repeat many times
                           | B1: if (r1 ≠ SET) goto L1;  |
                           | L2: Load r2 = data;         |

Table 1 Flag synchronization program to motivate why memory models are needed [20, Table 3.1].

To understand what a memory model is and why it is needed, look at Table 1. Here, core 2 spins in a loop while waiting for the flag variable to be SET by core 1. The question is, what value will core 2 observe when it loads the data variable in the end? Without knowing anything about the memory model of the system, this is impossible to answer. The memory model describes how instructions may be reordered at a local core.

Cycle | Core 1                | Core 2              | Coherence state of data | Coherence state of flag
1     | S2: Store flag = SET  |                     | Read-only for C2        | Read-write for C1
2     |                       | L1: Load r1 = flag  | Read-only for C2        | Read-only for C2
3     |                       | L2: Load r2 = data  | Read-only for C2        | Read-only for C2
4     | S1: Store data = NEW  |                     | Read-write for C1       | Read-only for C2

Table 2 One possible execution of the program in Table 1 [20, Table 3.2].

One possible outcome of running the program is shown in Table 2. We can see that a store-store reordering has occurred at core 1, meaning that it has executed instruction S2 before instruction S1 (violating the program order). In this case core 2 would observe that the data variable has the “old” value of 0. With knowledge about the memory model, the programmer can for example know where to insert memory barriers to ensure correctness in the program.
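In C11, for example, the flag synchronization from Table 1 can be made correct with release/acquire atomics, which insert the required ordering. This is a sketch of the general technique, not code from the thesis:

#include <stdatomic.h>

int data = 0;
atomic_int flag = 0;

void core1(void)   /* producer */
{
    data = 42;                                              /* S1 */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* S2: the release
                              store cannot be reordered before the store to data */
}

int core2(void)    /* consumer */
{
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
        ;          /* L1 & B1: spin until the flag is set */
    return data;   /* L2: guaranteed to read 42 */
}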

2.5 SIMD

Processor designs with Single Instruction Multiple Data (SIMD) features offer a way of performing the same calculation on multiple data items in parallel within the same processor core [13, p. 10]. These features can be used to obtain data parallelism of a degree beyond the core count of a system, and are often implemented using wide vector registers and operations on these registers. Since this approach needs to fetch and execute fewer instructions than the number of data items, it is also potentially more power efficient than the conventional Multiple Instruction Multiple Data (MIMD) approach.
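As an illustration of the register-based SIMD style used on x86-64, the sketch below adds 16-bit integers 16 lanes at a time with AVX2 intrinsics (an example written for this section, assuming a compiler flag such as -mavx2; it is not code from the thesis):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise addition of 16-bit integers, 16 lanes per 256-bit instruction. */
void add_u16(const uint16_t *a, const uint16_t *b, uint16_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi16(va, vb));
    }
    for (; i < n; i++)
        out[i] = a[i] + b[i];  /* scalar tail for the remaining elements */
}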

2.6 Prefetching

Prefetching is a useful way of hiding memory access latency during program execution. It involves predicting which data and instructions will be used in the near future, and bringing them into a nearby cache. An ideal prefetching mechanism would accurately predict addresses, make the prefetches at the right time, and place the data in the right place (which might include choosing the right data to replace) [9]. Inaccurate prefetching may pollute the cache system, possibly evicting useful items. Many common prefetching schemes try to detect sequential access patterns (possibly with a constant stride), which can be fairly accurate. Another method is to, for each memory access, bring in a couple of adjacent items from memory instead of just the item referred to in the program. Prefetching can be implemented both in hardware and in software.
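Software prefetching can be expressed directly in C with compiler builtins. The sketch below uses the GCC/Clang __builtin_prefetch builtin; the prefetch distance is an assumed tuning parameter, not a universal constant:

#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* elements ahead; assumed value, needs tuning */

long sum_array(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* hints: 0 = read access, 3 = high temporal locality */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        sum += data[i];
    }
    return sum;
}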

2.7 Performance analysis tools

Measuring how a computer system behaves while running a certain application can be done through an automated performance analysis tool. These tools can be divided into two broad categories: Static analysis tools rely on source code insertions for collecting data, while dynamic analysis tools make binary-level alterations and procedure calls at run-time [27]. There are also hybrid tools that utilize both techniques.


Most tools use some kind of statistical sampling, where the program flow of the tested application is paused at regular intervals to run a data collection routine. This can for example provide information about the time spent in each function, and how many times each function has been called. Many tools also utilize a feature present in most modern processor chips, namely hardware performance counters. These are special-purpose registers that can be programmed to react whenever a specific event occurs, for example an L1 cache miss or a branch misprediction. This can provide very accurate metrics without inducing any significant overhead.

2.8 The actor model

The actor model is an abstract model for concurrent computation centered around primitives called actors [1]. They are independent entities which can do computations according to a pre-defined behavior. The actor model is built around the concept of asynchronous message passing, in that every actor has the ability to send and receive messages to and from other actors. These messages can be sent and received at any time without coordinating with the actor at the other end of the communication, hence the asynchrony. When receiving a message, an actor has the ability to:

1. Send a finite number of messages to one or many other actors. 2. Create a finite number of new actors. 3. Define what behavior to use when receiving the next message.

Actors have unique identification tags, which are used as “addresses” for all message passing. They can have internal state information which can be changed by local behavior, but there is no ability to directly change the state of other actors.
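A minimal sketch of an actor's receive loop in C may make the model concrete. The mailbox_pop function and the message layout are hypothetical, introduced here only for illustration:

typedef struct message {
    int sender;    /* identification tag of the sending actor */
    int payload;
} message;

/* Hypothetical runtime primitive: blocks until a message arrives. */
extern message mailbox_pop(int actor_id);

typedef struct actor {
    int id;                                     /* unique identification tag */
    int state;                                  /* private internal state */
    void (*behavior)(struct actor *, message);  /* behavior for the next message */
} actor;

void actor_run(actor *self)
{
    for (;;) {
        message m = mailbox_pop(self->id);
        /* The behavior may send messages, create new actors,
           or replace self->behavior for the next message. */
        self->behavior(self, m);
    }
}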

2.9 Concurrency Building Blocks

CBB is Ericsson’s proprietary programming model for baseband functionality development, designed as a high-level domain-specific language (DSL). A CBB application is translated into C code for specific hardware platforms through a tool called the CBB transform. This makes for great flexibility, since developers do not need to target one platform specifically when writing baseband software.

Figure 3 Main artefacts of the CBB programming model.

Application behavior is defined inside CBB behavior classes (CBCs), seen in the middle of Figure 3. The CBCs are based on the actor model, described in Section 2.8. When a CBC handles an incoming message it can initiate an activity, shown on the right of Figure 3. An activity is an arbitrarily complex directed acyclic graph (DAG) of calls to C functions, and can also contain synchronization primitives. Different message types can be mapped to different activities. The simplest form of a CBC is an actor with a single serializing first-in first-out (FIFO) message queue. It is possible to define CBCs with different queue configurations, but those will not be focused on here. At the top level, an application is structured inside a CBB structure class (CSC). This CSC can in itself contain instances of CBCs and other CSCs, forming a hierarchy of application components.

2.10 The baseband domain

In a cellular network, the term baseband is used to describe the functionality in between the radio unit and the core network. This is where functionalities from Layer 1 (the physical layer) and Layer 2 (the data link layer) of the OSI model [30] are found. The baseband domain also contains Radio Resource Management (RRM) and a couple of other features.

• Layer 1: Responsible for modulation and demodulation of data streams for downlink and uplink respectively. It also performs link measurements and other tasks. This layer is responsible for approximately 75% of the compute cycles within the baseband domain, since it does a lot of computationally intensive signal processing.

• Layer 2: Handles per-user packet queues and multiplexing of control data and user data, among other tasks. This layer has a lot of memory-intensive work, and produces around 5% of the compute cycles.

• RRM: The main task of RRM is to schedule data streams on available radio chan- nel resources, which means solving bin-packing problems with a large number of candidates. This produces 15% of the compute cycles within the baseband domain.

3 Purpose, aims, and motivation

The broader purpose of this work is to investigate how hardware can affect software, and more specifically how certain hardware architectures affect the implementation and performance of a parallel programming model. CBB will be at the center of this investigation, and the result will be a proof-of-concept showing how it can be implemented on COTS hardware such as an x86-64 or ARMv8 chip. Differences between the selected architecture and EMCA will be analyzed, including how these differences manifest themselves in performance. Adapting a programming model to run on new architectures is a general problem that exists in many parts of industry and research. If done successfully it can create entirely new use cases and products. There is also potential to learn how to utilize hardware features that have not previously been considered.

3.1 Delimitations

This work focuses on important aspects of adapting a programming model to new hardware, and not on the actual implementation. Therefore this project does not include a new, fully functional CBB implementation. It instead results in a prototype with sufficient functionality for running performance tests. Section 13 of the report, which describes future work, discusses some necessary steps to make a more complete implementation.

Not all of the aspects described in the comparative part of the literature study will be evaluated with performance tests, since this would require more time and resources than available. Instead, the evaluation focuses on a few of the most relevant metrics. These are described in Section 8.2. The project is also not focused on comparing different programming models with each other. This is however a topic that is investigated in another ongoing project within the BBI department at Ericsson.

4 Methodology

The project is divided into three main parts, which will be described in the following sections.

4.1 Literature study

The first part of this work will consist of a literature study. The goal is to identify key features and characteristics to analyze, both of the programming model and of the available hardware. Special emphasis will be put on key differences between different hardware architectures. The literature study can be found in Section 5.

4.2 Development

A proof-of-concept implementation of some parts of CBB on a new hardware architecture will be built. This is described in Section 9. Some of the tools used in this process are provided by the BBI department at Ericsson. This includes EMCA IDE, which is the Eclipse-based Integrated Development Environment (IDE) used internally at Ericsson to create CBB applications. The literature study will be used as a basis when selecting which hardware platform, and which additional technologies, will be used during the implementation phase. This selection process is outlined in Sections 6 and 7.

4.3 Testing

A series of benchmark tests will be created using the previously created CBB prototype. The same set of tests will also be created using CBB for EMCA. This process is described in detail in Section 10.


A performance analysis tool will be used for gathering performance metrics from the targeted hardware platform. The most important requirement is access to hardware performance counters (see Section 2.7 for more details) for collecting cache performance metrics. The perf tool fits this requirement [18]. It is available in all Linux systems by default, and is therefore the performance tool of choice for this project. Execution time will be measured in code, with built-in timing functions which are described in Section 8.2.

5 Literature study

The literature study is split into three parts. Section 5.1 contains a comparison of the characteristics of three different hardware architectures. Section 2.9 describes the programming model used in this project. Section 5.2 summarizes academic work which may be valuable in later parts of the project.

5.1 Comparison of architectures

This section will compare EMCA to x86-64 and also to ARMv8, and highlight some of the key differences. Intel 64 (Intel’s x86-64 implementation) and ARMv8-A (the general-purpose profile of ARMv8) will be used as references for most of the comparisons.

5.1.1 Memory system

The memory system of EMCA is one of the key characteristics that sets it apart from the typical commercial architecture. Each processor core has a private scratchpad memory for data, and also a private scratchpad memory for program instructions. These memory modules will be referred to as DSP data scratchpad and DSP instruction scratchpad throughout the rest of this report. There is some hardware support for loading program instructions into the DSP instruction scratchpad automatically, making it behave similar to an instruction cache, but for the DSP data scratchpad all data handling has to be done in software. There is also an on-chip memory module, the shared memory, that all cores can use. It has significantly larger capacity than the scratchpad memories, and it is designed to behave in a predictable way (for example by offering bandwidth guarantees for every access).


One of the main reasons for doing so much of the baseband software development in-house is the memory system of EMCA. Most software is designed to run on a cache-coherent memory model, which is not present in EMCA.

x86-64 designs like Intel’s Sunny Cove cores (used within the Ice Lake processor family) have a three-tier cache hierarchy [8]. Each core has a split Level 1 (L1) cache, one for data (L1D) and one for instructions (L1I), and a unified Level 2 (L2) cache which has ∼5-10x the capacity of the combined L1. The architecture features a unified Level 3 (L3) cache which is shared among all cores. It is designed so that each core can use a certain amount of its capacity. Information about the cache coherency protocol used is not publicly available, but earlier Intel designs have been reported to use the MESIF (Modified, Exclusive, Shared, Invalid, Forward) protocol [13, p. 362], which is a snoop-based coherence protocol.

Contemporary ARMv8 designs feature a cache hierarchy of two or more levels [3]. Each core has its own L1D and L1I cache combined with a larger L2 cache, just like x86-64. The cache sizes vary between implementations. It is possible to extend the cache system with an external last-level cache (LLC) that can be shared among a cluster of processor cores, but this depends on the particular implementation. Details about the cache coherency protocol used by ARM are not publicly available.

5.1.2 Processor cores

EMCA is characterized as a manycore design, and it has a higher number of cores than many x86-64 or ARMv8 chips. However, most x86-64 chips and many ARMv8 chips support simultaneous multithreading (SMT), so that each processor core can issue multiple instructions from different software threads simultaneously [28]. This is achieved by duplicating some of the elements of the processor pipeline, and this technique gives the programmer access to a higher number of virtual cores and threads than the actual core count. EMCA does not support SMT.

The processor cores inside EMCA are characterized as Very Long Instruction Word (VLIW) processors. This means that they have wide pipelines with many functional units that can do calculations in parallel, and it is the compiler’s job to find instruction-level parallelism (ILP) and put together bundles of instructions (i.e. instruction words) that can be issued simultaneously. The instruction bundles can vary in length depending on what instructions they contain. The EMCA core has an in-order pipeline, as opposed to the out-of-order pipelines found in both x86-64 and ARMv8.

Since EMCA is developed for a certain set of calculations, namely digital signal processing (DSP) algorithms, its processor cores are optimized for this purpose. There is however nothing fundamentally different about how they execute each instruction compared to other architectures.

5.1.3 SIMD operations

The x86-64 architecture features a range of vector instruction sets, of which the latest generation is named Advanced Vector Extensions (AVX) and exists in a couple of different versions. AVX512, introduced by Intel in 2013 [22], features 512-bit wide vector registers which can be used for vector operations. All instructions perform operations on vectors with fixed lengths. AMD processors currently support only AVX2 (with 256 bits as its maximum vector length), while Intel has support for AVX512 in most of its current processors.

ARMv8 features the Scalable Vector Extension (SVE) [24]. As the name implies, the size of the vector registers used in this architecture is not fixed. It is instead an implementation choice, where the size can vary from 128 bits to 2048 bits (in 128-bit increments). Writing vectorized code for this architecture is done in the Vector-Length Agnostic (VLA) programming model, which consists of assembly instructions that automatically adapt to whatever vector registers are available at run-time. This means that there is no need to recompile code for different ARM chips to take advantage of vectorization, and also no need to write assembly intrinsics by hand. SVE was first announced in 2017, and details about the latest version (SVE2) were released in 2019 [19]. As of today, only higher-end ARM designs feature SVE. Most designs do however support the older NEON extension, utilizing fixed-size vector registers of up to 128 bits.

The Instruction Set Architecture (ISA) used with EMCA has support for SIMD instructions targeted at certain operations commonly used in DSP applications. One example is multiply-accumulate (MAC), which is accelerated in hardware. Similar instructions are available in AVX and SVE as well.

5.1.4 Memory models

The x86-64 architecture uses Total Store Order (TSO) as its memory model [20, p. 39]. There has been a bit of debate about this statement, but most academic sources claim that this is true and the details are not relevant enough to cover here. With TSO, each core has a FIFO store buffer that ensures that all store instructions from that core are issued in program order. The load instructions are however issued directly to the memory system, meaning that loads can bypass stores. This configuration is shown in Figure 4.

Figure 4 Conceptual view of a multicore system implementing TSO [7, Figure 4.4 (b)]. Store instructions are issued to a FIFO store buffer before entering the memory system.

ARM systems use a weakly consistent memory model [6] (also called relaxed consistency). This model makes no guarantees at all regarding the observable order of loads and stores. It can do all sorts of reordering: store-store, load-load, load-store and store-load. Writing parallel software for an ARM processor can therefore be more challenging than doing the same for an x86 processor, since weak consistency requires more effort to ensure program correctness (for example by inserting memory barriers/fences where order must be preserved). The upside is that more optimizations can be done both in software and in hardware, giving the weakly consistent system potential to run an instruction stream faster than a TSO system could. Two store instructions can for example be issued in reverse program order, which is not possible under TSO.

The memory model of EMCA does not guarantee a global ordering of instructions, although there are synchronization primitives for enforcing a global order when needed. Further details on its memory model are not publicly available.

5.2 Related academic work

This section summarizes earlier academic work in related areas that is useful within this project.

5.2.1 The Art Of Processor Benchmarking: A BDTI White Paper

Berkeley Design Technology, Inc. (BDTI) is one of the leading providers of benchmarking suites for DSP applications. This white paper [5] aims to explain the key factors that determine the relevance and quality of a benchmark test. It discusses how to distinguish good benchmarks from bad ones, and when to trust their results.

Figure 5 Categorization of DSP benchmarks from simple (bottom) to complex (top) [5, Figure 1]. The grey area shows examples of benchmarks that BDTI provides.

They argue that a trade-off has to be made between practicality and complexity. Figure 5 shows four different categories of signal processing benchmarks. Simple ones, based for example on additions and multiply-accumulate (MAC) operations, may be easy to design and run, but may not provide very meaningful results for the particular testing purpose. On the other side of the spectrum there are full applications that may provide useful results, but may be unnecessarily complex to implement across many hardware architectures. Somewhere, often in between the two extremes, there is a sweet spot that provides meaningful results without being too specific. A useful benchmark must however perform the same kind of work that will be used in the real-life scenario that the processor is tested for.

Another factor is optimization. The performance-critical sections of embedded signal processing applications are often hand-optimized, sometimes down to assembly level. Different processors support different types of optimizations (for example different SIMD operations), and allowing all these optimizations in a benchmark makes it more complex and hardware-specific but can also expose more of the available performance.

Many benchmarks focus on achieving maximum speed, but other metrics (such as memory use, energy efficiency and cost efficiency) might also be important factors when determining if a particular processor is suitable for the task. It can also be important to reason about the comparability of the results across multiple hardware architectures, instead of only looking at one processor in isolation.


5.2.2 A DSP Acceleration Framework For Software-Defined Radios On x86-64

This article [11] is concerned with the use of COTS devices for implementing baseband functions within Software-Defined Radios (SDR). The goal is to accelerate common DSP operations with the use of SIMD instructions available in modern x86-64 processors.

The OpenAirInterface (OAI), which is an open-source framework for deploying cellular network SDRs on x86 and ARM hardware, is used as a baseline. Some of its existing functions use 128-bit vector instructions. The authors extend OAI with an acceleration and profiling framework using Intel’s AVX512 instruction set. They implement a number of common algorithms, targeting massive multiple-input multiple-output (MIMO) use cases. A speedup of up to 10x is observed for the DSP functions implemented, compared to the previous implementation within OAI.

Most previous studies within the field have focused on application-specific processors and architectures. This study highlights some of the potential for using SIMD features in modern x86-64 processors for baseband applications.

5.2.3 Friendly Fire: Understanding the Effects of Multiprocessor Prefetches

Prefetching is an important feature of modern computer systems, and its effects are widely understood in single-core systems. This article [16] investigates side-effects that different prefetching schemes can cause in multicore systems with cache coherency, and when these can become harmful.

Four prefetching schemes are investigated: sequential prefetching, Content Directed Data Prefetching (CDDP), wrong-path prefetching and exclusive prefetching. Measurements are done in a simulator implementing an out-of-order sequentially consistent system using the MOESI protocol.

The result is a taxonomy of 29 different prefetch interactions and their effects in a multicore system. The harmful prefetch scenarios are categorized into three groups:

• Local conflicting prefetches: A prefetch in the local core forces an eviction of a useful cache line, which is referenced in the code before the prefetched cache line is.


• Remote harmful prefetches: A prefetch that causes a downgrade in a remote core followed by an upgrade in the same remote core, before the prefetched cache line is referenced locally. This upgrade will evict the cache line in the local core, making it useless.

• Harmful speculation: Prefetching a cache line speculatively, causing unnecessary coherence transactions in other cores. Can for example cause a remote harmful prefetch.

Performance measurements within the different prefetching schemes show that these prefetching effects can be harmful to performance. Some optimizations that can mitigate this effect are also briefly discussed.

5.2.4 Analysis of Scratchpad and Data-Cache Performance Using Statistical Methods

Choosing the right memory technology is important to get good performance and energy efficiency out of embedded systems. This study [15] compares how cache memory and scratchpad memory perform in different types of data-heavy application workloads. It is commonly believed that scratchpad memory is better for regular and predictable access patterns, while cache memory is preferable when access patterns are irregular.

The authors use a statistical model involving access probabilities, i.e. the probability that a certain data object is the next to be referenced in the code. They use this to calculate the optimal behavior of a scratchpad memory, and compare it to cache hit ratios. This is done both analytically and empirically. Matrix multiplication is used as an example of a workload with a regular access pattern. Applications involving trees, heaps, graphs and linked lists are seen as having irregular access patterns.

This work proves that scratchpad memory can always outperform cache memory, if an optimal mapping based on access probabilities is used. Increasing the cache associativity is shown not to improve the cache performance significantly.


6 Selection of software framework

6.1 MPI

Message Passing Interface (MPI) is a library standard formed by the MPI Forum, which has a large number of participants (including hardware vendors, research organizations and software developers) [4]. The first version of the standard specification emerged in the mid-90s, and MPI has since then become the de-facto standard for implementing message-passing programs for high-performance computing (HPC) applications. MPI 3.1, approved in June 2015, is the latest revision of the standard. MPI can create processes in a computer system, but not threads.

MPI is designed to primarily use the message-passing parallel programming model, where messages are passed by moving data from the address space of one process to the address space of another process. This is done through certain operations that both the sender and the receiver must participate in. One of the main goals of MPI is to offer portability, so that the software developer can use the same code in different systems with varying memory hierarchies and communication interconnects.

Since MPI is a specification rather than a library, there are many different implementations. Some examples are Open MPI, MPICH and Intel MPI. There are differences across implementations that might affect performance, but these differences are not focused on within this project.

6.1.1 Why MPI?

Choosing MPI as a basis for implementing CBB on a new hardware architecture is motivated by how well its concepts map onto the concepts found in the actor model and CBB. This includes the following, illustrated with a small code sketch after the list:

• Independent execution. An MPI process is an independent unit of computation with its own internal state, just like an actor. It can execute code completely independently of other processes. A process has its own address space in the computer system.

• Asynchronous message passing. An MPI process can send any number of messages to other processes, using ranks (equivalent to process IDs) as addresses. The operations for sending and receiving messages can be run asynchronously, just like with actors. An MPI process has one FIFO message queue by default.
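As a minimal illustration of these properties (a sketch written for this section, not part of the CBB prototype), the following MPI program lets rank 0 send a message to rank 1, with ranks acting as actor addresses. Compile with mpicc and run with mpiexec -n 2:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* the rank is the process's "address" */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}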


6.1.2 MPICH

The MPI implementation used within this project is MPICH. It was initially developed along with the original MPI standard in 1992 [25]. It is a portable open-source implementation, and one of the most widely used today. It supports the latest MPI standard and has good user documentation. The goals of the MPICH project, as stated on the project website, are:

• To provide an MPI implementation that efficiently supports different computation and communication platforms including commodity clusters, high-speed networks and proprietary high-end computing systems.

• To enable cutting-edge research in MPI through an easy-to-extend modular frame- work for other derived implementations.

MPICH has been used as a basis for many other MPI implementations including Intel MPI, Microsoft MPI and MVAPICH2 [29]. Since MPICH is designed to be portable it can run on x86-64, ARMv8 and also in a variety of other computer systems.

6.1.3 Open MPI

The initial choice for an MPI implementation to use within this project was Open MPI. It is one of the most commonly used implementations and it has excellent user documentation. Open MPI is an open-source implementation developed by a consortium of partners within academic research and the HPC industry [10]. Some of the goals of the Open MPI project are:

• To create a free, open source, peer-reviewed, production-quality complete MPI implementation.

• To directly involve the HPC community with external development and feedback (vendors, 3rd party researchers, users, etc.).

• To provide a stable platform for 3rd party research and commercial development.

Unfortunately there were problems with getting Open MPI to run properly in the x86-64 system used for testing (which is described in Section 8.3). Instead of spending time debugging these problems, Open MPI was replaced with MPICH, which worked without issues.


7 Selection of target platform

As seen in the architecture comparisons in Section 5.1, modern x86-64 and ARMv8 hardware have many similarities. They both incorporate high-performance out-of-order cores with multiple levels of cache, since they are both targeted at general-purpose computing where these features can be beneficial. The differences between EMCA and x86-64 are of the same nature as the differences between EMCA and ARMv8. The decision between x86-64 and ARMv8 is therefore not as significant as the overall question:

What happens when we run CBB applications on modern general-purpose hardware?

The choice of a target platform instead comes down to availability. Ericsson’s development environment is run on x86-64 servers, and accessing additional x86-64 hardware for testing within Ericsson has been less difficult than accessing ARMv8 hardware. Doing all development and testing on x86-64 hardware was therefore the natural choice. As mentioned in Section 6.1.2, the CBB implementation created will work with both hardware platforms since MPICH programs can be compiled and run on both. Porting the CBB implementation from x86-64 to ARMv8 would simply mean moving the source code and recompiling it.

8 Evaluation methods

As described by BDTI [5] (see Section 5.2.1), there has to be a trade-off between practicality and complexity when designing benchmark tests. There is often a sweet spot of tests that provides meaningful results without being too specific. This is the goal of the tests that will be used in this project, which are all described in the following sections.

8.1 Strong scaling and weak scaling

One of the fundamental goals of CBB and EMCA, as described in Section 2.9, is to enable massive parallelism. With this in mind, it is reasonable to test how the degree of parallelism in an application affects performance within the targeted hardware platform. A simple two-actor application will be used, following this basic structure:

1. Actor A and actor B get initiated.


2. Actor A sends a message to actor B, containing a small piece of data.
3. Actor B receives the message and performs some work (described in Sections 8.1.1, 8.1.2 and 8.1.3).
4. Actor B sends a message back to actor A, acknowledging that all calculations are completed.
5. Both actors terminate and the test is finished.

To test different degrees of parallelism, multiple instances of the two-actor application will be run. These instances are completely independent of each other, which means that the parallelism present in the software (namely the data parallelism) will scale perfectly. See Section 2.2 for more details on parallel computing. Each benchmark test will be run with four actors (two of actor A, two of actor B), and then the number of actors will be increased before running the test again. This process will be repeated until reaching 1024 actors, which is significantly more than the number of available processor cores in the computer systems used for testing (described in Section 8.3). Two variations of each benchmark test will be evaluated:

1. Weak scaling: The size of the problem will increase along with the number of program instances.
2. Strong scaling: The size of the problem will be fixed, and split up among all instances of the two-actor application.

8.1.1 Compute-intensive benchmark

As described in Section 2.10, Layer 1 functionality is computationally intensive and is responsible for most of the compute cycles within the baseband domain. This will be simulated by letting actor B perform a large amount of computation after receiving data from actor A. The computation at actor B will consist of addition operations of the following form:

result = result + 10000;

Here, result will be an unsigned 16-bit integer. It will overflow many times during the test, so that the result will always be between 0 and 2^16. The addition operation will be repeated 1 million times by each actor B in the weak scaling scenario, and 1 million/N times by each actor B in the strong scaling scenario with N program instances.
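A sketch of actor B's work in this test could look as follows. The volatile qualifier is an assumption made here to keep a compiler from folding the loop away; the actual benchmark source is not reproduced in this report:

#include <stdint.h>

uint16_t compute_kernel(long iterations)
{
    volatile uint16_t result = 0;
    for (long i = 0; i < iterations; i++)
        result = result + 10000;  /* wraps around modulo 2^16 many times */
    return result;
}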


8.1.2 Memory-intensive benchmark without reuse

Layer 2 applications, as described in Section 2.10, do memory-intensive work. This will be simulated by letting actor B allocate and loop through a data vector after receiving its message from actor A, touching each element once. Two vector sizes will be tested: 1 MB and 128 kB. In the weak scaling scenario, each actor B will have its own data vector of this size. With strong scaling, each actor B will have a data vector sized (initial vector size)/N with N program instances. The data vectors will be dynamically allocated (on the heap).
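Actor B's work in this test might be sketched as below (an illustration written for this section; the actual benchmark source is not reproduced here):

#include <stdlib.h>

void touch_vector(size_t bytes)
{
    char *v = malloc(bytes);       /* dynamically allocated on the heap */
    for (size_t i = 0; i < bytes; i++)
        v[i] = (char)i;            /* touch each element exactly once */
    free(v);
}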

8.1.3 Memory-intensive benchmark with reuse

This is a variation of the test described in Section 8.1.2. The only difference is that, in this test, actor B will loop through its own data 1000 times. This means that there will be data reuse, so that the application can make use of locality and caches.

8.1.4 Benchmark tests in summary

To cover all variations of the different benchmarks, 10 individual test cases will be created and run. These are:

1. Compute-intensive test with weak scaling.

2. Compute-intensive test with strong scaling.

3. Memory-intensive test with no data reuse and weak scaling, with a 1 MB vector.

4. Memory-intensive test with no data reuse and weak scaling, with a 128 kB vector.

5. Memory-intensive test with no data reuse and strong scaling, with a 1 MB vector.

6. Memory-intensive test with no data reuse and strong scaling, with a 128 kB vector.

7. Memory-intensive test with data reuse and weak scaling, with a 1 MB vector.

8. Memory-intensive test with data reuse and weak scaling, with a 128 kB vector.

9. Memory-intensive test with data reuse and strong scaling, with a 1 MB vector.

10. Memory-intensive test with data reuse and strong scaling, with a 128 kB vector.


8.2 Collection of performance metrics

At each individual step of the two benchmark tests described above, these metrics will be recorded:

• Execution time: This indicates the overall performance, in terms of pure speed. Shorter execution times are better. When running on x86-64, this metric will be measured using the built-in MPI_Wtime function in MPI. A corresponding function available in EMCA systems will be used there. The timing methodology follows this scheme (a code sketch follows below):

1. All actors in the system get initialized, and then synchronize using a barrier or similar.
2. One actor collects the current time.
3. The actors do their work.
4. All actors synchronize again.
5. One actor collects the current time again and subtracts the previously collected time.

With this method, all overhead associated with initializing and terminating the execution environment is excluded. The measured time instead only shows how long the actual work within the benchmark tests takes.

• Cache misses in L1D: This shows us how the cache system and the hardware prefetcher perform. Lower numbers of cache misses are preferable. See Section 2.3 for more details. This metric is available in x86-64 systems but not in EMCA, which does not have caches. The perf tool will be used for this. The command used for running a test and measuring cache behavior is:

$ perf stat -e L1-dcache-loads,L1-dcache-load-misses

This shows the number of loads and misses in the L1D cache, and also the miss ratio in %.

All steps of each benchmark test will be run three times, and then the average numbers produced from these three runs will be used as a result. This is to even out the effects of unpredictable factors that might produce noisy results.
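On x86-64, the five-step timing scheme above maps directly onto MPI calls. A minimal sketch, with do_work standing in for the benchmark body:

#include <mpi.h>

double time_region(void (*do_work)(void))
{
    MPI_Barrier(MPI_COMM_WORLD);   /* steps 1: all processes synchronize       */
    double start = MPI_Wtime();    /* step 2: collect the current time         */
    do_work();                     /* step 3: the actual benchmark work        */
    MPI_Barrier(MPI_COMM_WORLD);   /* step 4: synchronize again                */
    return MPI_Wtime() - start;    /* step 5: subtract the previously
                                      collected time                           */
}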


8.3 Systems used for testing

The x86-64 system that will be used for running benchmarks is a high-end server machine with AMD processors built on their Zen 2 microarchitecture [23]. Some of its specifications are summarized below.

• 2 x AMD EPYC 7662 processors.

• 128 physical cores in total (64 per chip).

• 256 virtual cores in total (128 per chip) using SMT.

• 32 kB of private L1D cache per physical core.

• 512 kB of private L2 cache per physical core.

• 4 MB of L3 cache per physical core (16 MB shared across four cores in each core complex).

• The system is running Linux Ubuntu 20.04.1 LTS.


Figure 6 Processor topology of the x86-64 system used for testing.

Figure 6 shows the topology of the processor cores and cache hierarchy in the x86-64 system. This graphical representation was obtained with the hwloc command line tool in Linux.


The benchmarks will also be run on a recent iteration of Ericsson's manycore hardware. As discussed in Section 5.1, it has a number of specialized DSP cores with private scratchpad memories for instructions and data, and also an on-chip shared memory that all cores can use. Detailed specifications of this processor are confidential and cannot be described here.

9 Implementation of CBB actors using MPI


Figure 7 The CBB application used for implementation.

A simple CBB application was created using EMCA IDE, consisting of just two CBCs. A graphical representation of the application is seen in Figure 7. Here, the outer box labeled sc represents the top-level CSC. There is one CBC instance named dataActor, which is of the CBC type with the same name, and one instance printerActor of the type with the same name. The names of the CBCs originate from some of Ericsson's user tutorial material and are not representative of their behavior. The out port of dataActor is connected to the in port of printerActor, meaning that they are aware of each other's existence and can send messages to each other. The coloring of the ports in Figure 7 and the "∼" label on one of them symbolize the direction of the communication; certain message types can be sent from dataActor to printerActor, and other message types can be sent in the other direction. The ports named dataP, which connect dataActor to the edge of the CSC, will not be used.

The application was run through the CBB transform targeting EMCA, which generated a number of C code and header files. These files were then used as a basis for creating new C functions targeting the x86-64 platform, with the help of MPI calls. Details about all the MPI routines mentioned in the following sections can be found on the official MPICH documentation webpage [26].


9.1 Sending messages

CBB generates a "send" function for each port of each CBC. The contents of these functions were rewritten to do the following:

1. MPI_Isend is used to post a non-blocking send request. This call returns immediately, without ensuring that the message has been delivered to its destination. The function gets arguments describing the message contents and which process to deliver it to.

2. MPI_Wait then blocks execution until MPI_Isend has moved the message out of its send buffer, so that the buffer can be reused.

The reason for using these two MPI calls instead of MPI_Send, which is a single blocking call that performs the same task, is to enable an overlap between communication and computation. This can be accomplished by doing some calculations between the two MPI calls, as in the sketch below.
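A minimal sketch of such a rewritten send function, assuming the payload, size, destination rank, and tag are supplied by the caller; the function name is illustrative, not the generated CBB name:

#include <mpi.h>

void port_send(const void *msg, int nbytes, int dest, int tag)
{
    MPI_Request req;

    /* Post a non-blocking send; this call returns immediately. */
    MPI_Isend(msg, nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD, &req);

    /* Computation that does not touch msg could be overlapped here. */

    /* Block until the message has left the send buffer, so that the
       buffer can safely be reused. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}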

9.2 Receiving messages

There is a generated “receive” function for each port of each CBC. These were rewritten to have the following behavior:

1. MPI_Probe is a blocking call that checks for incoming messages. When it detects a message, it writes some information to a status variable and returns. The status information includes the tag of the message and the ID of the source process.

2. MPI_Get_count is then used to determine how many bytes of data the message contains.

3. MPI_Recv is called last. This function is a blocking call, but it will not cause a stall in execution, since the previous calls have ensured that there actually is an incoming message.

Using these three MPI calls instead of only MPI_Recv allows messages to be received without knowing all their specifics in advance; MPI_Recv requires arguments describing the tag, source, and size of the incoming message, which we find out using MPI_Probe and MPI_Get_count. This enables the definition of message types with varying data contents, as in the sketch below.
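A minimal sketch of such a rewritten receive function, assuming the caller provides a buffer large enough for any message type; again, the names are illustrative:

#include <mpi.h>

int port_receive(void *buf, int max_bytes)
{
    MPI_Status status;
    int nbytes;

    /* Block until a message with any tag arrives from any source. */
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    /* Find out how many bytes the message contains. */
    MPI_Get_count(&status, MPI_BYTE, &nbytes);
    if (nbytes > max_bytes)
        return -1;  /* message too large for the caller's buffer */

    /* Receive it; this cannot stall, since the probe has already
       matched an incoming message. */
    MPI_Recv(buf, nbytes, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return nbytes;
}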


10 Creating and running benchmark tests

10.1 MPI for x86-64

With MPI, there is a need to initialize the execution environment and create the MPI processes corresponding to each CBC. This is done in an additional code file, test_cases.c, which is used to run the actual tests. Each MPI process runs its own copy of the code, which follows this basic structure (see the sketch after the list):

1. Initiate the MPI execution environment with MPI_Init and determine the process ID by calling MPI_Comm_rank.

2. Use the process ID to determine which actor type and instance it corresponds to, according to a lookup table or similar. The actor then knows whether it is a dataActor or a printerActor in this case.

3. Run the code corresponding to the current test case. This part will differ depending on which test is being run; see Section 8.1. The time measurement, as discussed in Section 8.2, is also part of this step.

4. Terminate the MPI process and exit the execution environment with MPI_Finalize.
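A minimal sketch of this structure, assuming a simple even/odd mapping from rank to actor type instead of the actual lookup table; the two actor bodies are hypothetical stand-ins for the rewritten CBB-generated code:

#include <mpi.h>
#include <stdio.h>

/* Hypothetical actor bodies; the real ones call the rewritten CBB code. */
static void run_data_actor(int rank)    { printf("dataActor, rank %d\n", rank); }
static void run_printer_actor(int rank) { printf("printerActor, rank %d\n", rank); }

int main(int argc, char **argv)
{
    int rank;

    /* 1. Initiate the MPI environment and determine the process ID. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 2. Map the process ID to an actor type and instance. */
    if (rank % 2 == 0)
        run_data_actor(rank);
    else
        run_printer_actor(rank);

    /* 3. Test-case code and time measurement run inside the actor bodies. */

    /* 4. Terminate the process and exit the execution environment. */
    MPI_Finalize();
    return 0;
}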

All necessary code files and headers are compiled using mpicc, a compiler command for MPICH that uses the system's default C compiler (gcc in this case) along with the additional linkage needed by MPICH. make is used to produce a single binary for the complete application. To run a test case, a variation of the following command is used:

$ mpiexec -n X -bind-to hwthread bin/test

Here X is the total number of actors (MPI processes) that will be present in the system. Since the actors operate in pairs (one doing work and the other just sending messages), X must be an even number. The -bind-to hwthread option makes sure that every MPI process gets associated with one hardware thread (virtual core), which reduces the process management overhead. This is beneficial for performance and also makes the behavior more predictable. bin/test points to the binary generated by mpicc.


10.2 CBB for EMCA

Creating the benchmark tests with CBB is a simpler process. The two-actor application seen in Figure 7 had already been generated using EMCA IDE and the CBB transform (as described in Section 9). The code for doing the actual work (as described in Sections 8.1.1, 8.1.2 and 11.3) is then added inside code files generated by the CBB transform. Further details about the code structure inside CBB will not be described here. Since memory allocation in the DSP data scratchpad and the shared memory has to be done manually on EMCA, and the two do not share address spaces, the following structure was used for allocating memory in the memory-intensive benchmark tests on EMCA:

if (vector_size < threshold_size)
    // allocate space in DSP data scratchpad
else
    // allocate space in shared memory

Here, the threshold size is used as a cross-over point between the two memory units. It is smaller than the actual size of the DSP data scratchpad memory, which leaves space that could be used for system-internal data.

11 Results and discussion

This section contains results collected when running the benchmark tests on EMCA and x86-64. To make the results comparable across architectures, all execution times have been normalized. This means that each individual execution time is divided by the first execution time in that series, so that every data series (every line in a graph) starts at 1.0. Hence any value above 1.0 means worse (slower) performance, and any value below 1.0 means better (faster) performance. This makes it possible to focus on the scaling properties in each benchmark test, instead of execution times in absolute terms. Actual execution times are discussed briefly but not shown in figures. In the tests that involve strong scaling it is also relevant to look at the speedup, which is the inverse of the normalized execution time: the initial execution time divided by the current execution time. This shows how many times faster the execution is compared to the first run (represented by 1.0 here as well). Thus, for speedup, a lower number means worse performance and a higher number means better performance.
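Expressed as formulas, where $t_i$ denotes the execution time of the $i$-th run in a series, the normalized execution time and the speedup are:

\[
    t_i^{\mathrm{norm}} = \frac{t_i}{t_1},
    \qquad
    S_i = \frac{t_1}{t_i} = \frac{1}{t_i^{\mathrm{norm}}}
\]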


In Section 11.3, covering the memory-intensive benchmark with data reuse, cache miss ratios on x86-64 will also be presented. This is the only test that takes advantage of caches, which is why this metric is relevant here but not in the other tests. Finally, there will be a discussion about complexity and optimizations in the software, and how these factors affect the performance results. This discussion is found in Section 11.4.

11.1 Compute-intensive benchmark

[Figure: line graph with series x86 and EMCA. Y-axis: execution time (normalized per data series), 0 to 90. X-axis: total number of actors/processes, 0 to 1024.]

Figure 8 Normalized execution times for the compute-intensive benchmark test with weak scaling.

Figure 8 shows how both systems perform in the compute-intensive test with weak scaling. We can see that the execution time in the EMCA system scales linearly with the number of actors created. This is true even when the number of actors gets significantly larger than the number of processor cores available.

The x86-64 system behaves very differently. It performs well compared to EMCA up until hitting 256 actors, which is also the number of virtual processor cores available in the system. After that we can see a big performance penalty. The execution time increases fairly linearly until 768 actors. At this point, the execution time is over 80 times longer than the baseline. The normalized execution time for x86-64 finally tapers off slightly, ending at around 75 for 1024 actors.

This is an interesting phenomenon that could possibly be explained by the distribution of actors onto processor cores. Half of the actors spend most of their time performing computations and the other half spend their time waiting for a message response, so there could be a congestion of many computing actors on the same virtual core. In that case they have to wait for each other, causing a performance penalty. If the actors doing computations are more evenly distributed across the cores, they interfere less with each other, which could make for better performance even as the number of actors increases. This is a theory that could be confirmed with further experiments in the future. The phenomenon appears in all the x86-64 benchmark tests, but will not be discussed again in later sections.

[Figure: line graph with series x86 and EMCA. Y-axis: execution time (normalized per data series), 0 to 180. X-axis: total number of actors/processes, 0 to 1024.]

Figure 9 Normalized execution times for the compute-intensive benchmark test with strong scaling.

In Figure 9 we can see the normalized execution times when testing the strong scaling property. The behavior of EMCA is indistinguishable from the previous test, with a linearly increasing normalized execution time ending at just below 60. Since the numbers are so similar in the compute-intensive benchmark both when the total amount of computation increases (weak scaling) and when it is fixed (strong scaling), one could assume that this kind of computation is "cheap" even when done many times. In that case the overhead cost of managing many actors in the system dominates, and it is this cost that we see in Figure 8 and Figure 9. This theory is investigated further in Section 11.1.1.

The x86-64 system shows a similar behavior as before when the number of actors passes the number of virtual processor cores. The big difference is that the normalized execution time ends at around 160 in this test, which is close to double what we saw in the weak scaling test. This reveals a possible drawback of the methodology used. Since each series starts with two actors doing computations (and two more which send and wait for messages), the weak scaling actors do 1 million additions each, and this initial run takes 2.35 ms in absolute terms (absolute times are not seen in the figures). The strong scaling actors start at 1 million / 2 = 0.5 million additions each, taking 1.11 ms. With 1024 actors the weak scaling variant takes 153 ms while the strong scaling variant takes 174 ms, which are much more similar numbers. If both tests had started with only one actor doing computations (two actors in total), both scaling variants would have started at around 2 ms and we would have seen much more similar normalized execution times when the number of actors gets large.

[Figure: line graph with series x86 and EMCA. Y-axis: speedup, 0 to 7. X-axis: total number of actors/processes, 0 to 256.]

Figure 10 Speedup for the compute-intensive benchmark test with strong scaling.


Finally, Figure 10 shows the speedup of the execution in the strong scaling variant. We can see that EMCA does not speed up at all. This matches the previous discussion well; computations are "cheap" and the overhead for creating actors dominates, so we get no benefit from splitting up the computation among the processor cores. The x86-64 data shows a different behavior. The amount of computation is actually visible in its performance, since there is an evident benefit in splitting it up among the cores. We get the best performance with 64 actors, producing a speedup of around 6. After that the overhead from creating and managing processes in the system takes over, and the speedup is back below 1.0 when we hit 256 actors. Beyond that point the speedup only decreases further, so those runs are not included in the figure.

11.1.1 Was the test not compute-intensive enough for EMCA?

As discussed previously, the compute-intensive benchmark on EMCA gave us identical execution times with weak scaling and strong scaling. This was not expected, and a theory is that the test was not compute-intensive enough for the EMCA processor. To explore this theory, the test was modified to use 64-bit floating-point addition, which is a more computationally intensive operation than the previously used 16-bit integer addition. This test was run with 4, 16 and 128 actors.
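To illustrate the difference between the two operation types, the sketch below shows hypothetical stand-ins for the benchmark work loop; the actual kernel is defined by the benchmark in Section 8.1, so the simple accumulation loops here are an assumption, not the real code:

#include <stdint.h>

/* Hypothetical stand-in for the original work loop. */
int16_t work_int16(const int16_t *v, long n)
{
    int16_t acc = 0;
    for (long i = 0; i < n; i++)
        acc += v[i];    /* original test: 16-bit integer addition */
    return acc;
}

/* Hypothetical stand-in for the modified work loop. */
double work_fp64(const double *v, long n)
{
    double acc = 0.0;
    for (long i = 0; i < n; i++)
        acc += v[i];    /* modified test: 64-bit floating-point addition */
    return acc;
}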


[Figure: line graph with series EMCA. Y-axis: speedup, 0 to 4. X-axis: total number of actors/processes, 0 to 128.]

Figure 11 Speedup for the compute-intensive benchmark test with strong scaling and 64-bit floating-point addition. Only EMCA was tested.

The results from the modified compute-intensive benchmark test with floating-point addition on EMCA are shown in Figure 11. We can see that there was a performance benefit in doing strong scaling here, i.e. in splitting up the floating-point calculations among actors. This shows that floating-point operations are indeed more computationally heavy on EMCA than what we tested previously, and it strengthens the theory that the previous test was not compute-intensive enough.


11.2 Memory-intensive benchmark with no data reuse

[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: execution time (normalized per data series), 0 to 700. X-axis: total number of actors/processes, 0 to 1024.]

Figure 12 Normalized execution times for the memory-intensive benchmark with no data reuse and weak scaling.

Looking at Figure 12, we can see how the two systems perform in the memory-intensive benchmark with no data reuse and weak scaling. The EMCA system performs in a similar way with both the larger and the smaller data vector, ending at normalized execution times of 73 and 100 respectively. When looking at actual execution times, we can see that the initial test run with the larger data vector has a 15-20 times longer execution time than the initial test run with the smaller vector. What we are seeing here are normalized execution times for allocating large vectors in the shared memory. It seems as though allocating and looping through large vectors takes longer to begin with, but scales better, than allocating and looping through small vectors.

An interesting point here is that there are cases where we try to allocate more memory in the shared memory than what is available. This did not seem to cause any trouble when running the test; all actors could allocate and write to their memory vectors at all times. A theory is that the function used for dynamic memory allocation in EMCA does not return until there is available space in the shared memory, which would cause a congestion of actors waiting for memory allocation. This theory has however not been confirmed, and it is not evident when looking at the normalized execution times.

The x86-64 system shows very different scaling properties for the two vector sizes, with similar behavior as EMCA for the larger vector and significantly worse scaling for the smaller vector. This can be explained by looking at the absolute execution times. With the larger vector, the initial execution time (with 4 actors in total) is 2.07 ms and the final execution time (with 1024 actors in total) is 179 ms. The corresponding times for the smaller vector are 0.3 ms and 172 ms respectively. The initial execution time is thus almost 7 times longer with the larger vector than with the smaller one, while the final execution times are very similar, so the final normalized values end up at roughly 86 for the larger vector and 573 for the smaller, a difference of around 500. This is a drawback of showing normalized execution times instead of actual execution times; showing actual execution times would however make it harder to compare architectures. In conclusion, allocating and writing data to a small vector is much faster than doing the same with a large vector, but when the number of processes and vectors gets very large, the overhead from process management dominates and the execution times start to look more similar.

[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: execution time (normalized per data series), 0 to 1600. X-axis: total number of actors/processes, 0 to 1024.]

Figure 13 Normalized execution times for the memory-intensive benchmark with no data reuse and strong scaling.

When testing strong scaling, it is hard to evaluate EMCA by only looking at Figure 13, since the x86-64 numbers dominate. A figure showing only the EMCA numbers would reveal that the normalized execution time with a 1 MB vector ends at 0.79 with 1024 actors, which means that this execution is even faster than the run with 4 actors in total. This is a symptom of the difference in response time between the DSP data scratchpad and the shared memory. With the smaller vector, the normalized execution time with 1024 actors is 10.1, because less of the work involves the shared memory to begin with. More on the DSP data scratchpad versus the shared memory follows in the discussion of the speedup numbers below.

When looking at the x86-64 numbers, we can see a similar behavior as in the weak scaling test when the number of actors grows large. The gap between the two vector sizes is however even larger here, with the normalized execution times hitting a maximum of almost 1400. Like before, this is a result of the difference in initial execution times: 1.02 ms with the larger vector and 0.131 ms with the smaller vector. With 1024 actors, the execution times are 192 ms and 160 ms respectively. Again, they start with very different execution times because of the difference in vector sizes, but end up with very similar execution times when the process management overhead dominates.

[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: speedup, 0 to 18. X-axis: total number of actors/processes, 0 to 256.]

Figure 14 Speedup for the memory-intensive benchmark with no data reuse and strong scaling.

As a continuation of the previous discussion we can now look at Figure 14, showing how the two systems benefit from splitting up the work across many actors. With EMCA, we can see something interesting when testing the large vector. The speedup increases dramatically as the number of actors increases, peaking at a 16x performance gain for 64 actors. This is because the 1 MB vector is split up among the actors so that, at some point, each actor can put its chunk inside the DSP data scratchpad instead of the shared memory. As described in Section 10.2, all this has to be done in code instead of hardware. Since the benchmark is written in such a way that anything smaller than a threshold value is put in the DSP data scratchpad and anything larger in the shared memory, there is a point (somewhere between 32 and 64 actors in total) where we cross this threshold. The biggest jump in speedup is seen when going from 32 to 64 actors in total, so the results seem correct. After that point, the overhead from using many actors starts to show and the speedup tapers off.

With the smaller vector size tested on EMCA, the same effect is seen but on a much smaller scale. In the initial execution the data vectors are put in the shared memory, but very soon the vectors are instead put in the DSP data scratchpad. The maximum speedup observed is 4.6 with 8 actors in total. Then the overhead takes over and the speedup starts to decrease again.

With the x86-64 system, we see a similar speedup phenomenon as on EMCA for both the larger and the smaller vector. Here, the memory management is done with malloc and free, and the memory system is abstracted away from the programmer. Examining the numbers more closely, the test with the larger vector peaks at 32 actors in total (16 using memory), and the test with the smaller vector at 16 actors in total (8 using memory). This translates to allocation of vectors of 64 kB and 16 kB respectively. It is not entirely apparent why the performance peaks at these particular points. The implementation of malloc and free can depend on both the compiler and the operating system, which is a factor that has not been investigated in this project. Also, there is no obvious caching benefit, since there is no data reuse in this case.


11.3 Memory-intensive benchmark with data reuse

[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: execution time (normalized per data series), 0 to 90. X-axis: total number of actors/processes, 0 to 1024.]

Figure 15 Normalized execution times for the memory-intensive benchmark with data reuse and weak scaling.

Figure 15 shows how the test systems behave with a high degree of data reuse and weak scaling. Both the larger and the smaller vector are larger than the threshold value used for the DSP data scratchpad on the EMCA system, meaning that all data is stored in shared memory throughout this test. We see a linear behavior with normalized execution times ending at just above 80 for both vector sizes. The conclusion we can draw from this is that the allocation of shared memory behaves in a very predictable way.

The caches in the x86-64 system show their strengths in this benchmark test, with great scaling properties. The normalized execution times stay at just above 1 until we hit 256 actors, which is the same as the number of virtual cores in the system. More importantly, this means that the number of actors doing memory-intensive work is 128, which is the same as the number of physical cores. After that the normalized execution times increase, ending at around 5 for both vector sizes. The 1 MB vector is too large to fit in the L1D (32 kB) and L2 (512 kB), but it fits in the shared L3 cache (16 MB shared across four physical cores). The 128 kB data vector is also too large for the L1D cache but fits inside the private L2 cache. The actual execution times show that the test with the larger vector takes about 10x longer than the test with the smaller vector. This is assumed to reflect the performance difference between the L2 and L3 caches in this particular system.

[Figure: line graph with series 1MB Vector and 128kB Vector. Y-axis: cache miss ratio, 0.00% to 2.00%. X-axis: total number of actors/processes, 0 to 1024.]

Figure 16 Cache miss ratio in L1D for the memory-intensive benchmark with data reuse and weak scaling.

The cache miss ratios observed in L1D in this test, with weak scaling, are shown in Figure 16. Note the scaling of the y-axis; the observed miss ratios range from 0.37% to 1.68%, so the miss ratio stays very low at all times. Since the L1D in the system is 32 kB, none of the vectors fit inside L1D at any time during this test. The low cache miss ratios are likely due to a very accurate hardware prefetcher, which can detect sequential access patterns and bring data into the L1D cache before it is needed. It is hard to analyze the fluctuations observed in the cache miss ratios, since the numbers are so small. The overall conclusion is instead that the hardware prefetcher does a good job of keeping the miss ratios low in L1D. It could be interesting to collect this metric with the hardware prefetcher turned off; this is discussed in Section 13.3.


[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: execution time (normalized per data series), 0 to 2. X-axis: total number of actors/processes, 0 to 1024.]

Figure 17 Normalized execution times for the memory-intensive benchmark with data reuse and strong scaling.

When looking at the strong scaling properties of this benchmark test in Figure 17, we can see that both the EMCA system and the x86-64 system benefit greatly from splitting up the memory-intensive work with data reuse across multiple processor cores (note the scaling of the y-axis here). With a large number of actors, we can see that EMCA outperforms the x86-64 system with better scaling properties.

The tests on x86-64 show a performance degradation when the number of actors passes the number of virtual cores. Just like in previous tests, the normalization makes the test with the smaller vector size stand out as though it takes the biggest performance hit. This occurs when the data chunk used by each actor gets so small that looping through it does not take a significant amount of time; instead, the process management overhead starts to show and the normalized execution time goes up. With the larger vector size, looping through the data (which is getting smaller and smaller per actor) still dominates the time measurements. Just like in previous tests, the test with the larger vector and the test with the smaller vector start at very different execution times with 4 actors in total (770 ms and 105 ms respectively) but end up at similar execution times with 1024 actors in total (181 ms and 170 ms respectively).


[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: speedup, 0 to 40. X-axis: total number of actors/processes, 0 to 384.]

Figure 18 Speedup for the memory-intensive benchmark with data reuse and strong scaling.

When looking at the speedup numbers in Figure 18, the benefit of using locality in the memory system is obvious. With EMCA, the test with the larger vector size peaks at a speedup of 15.3 with 48 actors doing memory-intensive work (96 actors in total). The speedup with the smaller vector size reaches similar numbers, but declines more rapidly as the number of actors grows large. This is because the proportion of time used for looping through the vector gets very small, and the overhead costs instead become more visible.

In the x86-64 system, we can see that the test with the larger vector size achieves great speedup numbers. Initially, the vector fits in the L2 cache but not the L1D cache. Each actor's chunk should fit inside the 32 kB L1D cache when the number of actors using memory hits 1 MB / 32 kB = 32, so 64 actors in total. We expected to see super-linear speedup at that point, but we do not. It is possible that the distribution of actors onto processor cores prevents this from happening, if many actors doing memory-intensive work are placed on the same physical core (as discussed in Section 11.1). Instead, the numbers increase in a more or less linear way until hitting a speedup of 34 with 256 actors in total (128 using memory). This means that, when splitting up a large vector across actors, we get the maximum performance benefit when using all the available cores in the system. After that, the process management overhead starts to show and the speedup goes down quickly.

With the smaller vector, the data should fit in the L1D cache when the number of actors using memory gets to 128 kB / 32 kB = 4, so 8 actors in total. We can see that the speedup increases most rapidly in the beginning, which means that we get the most benefit from adding processor cores here. The best speedup achieved is 18.3 at 192 actors in total. Then the process management overhead starts to show, and after passing 256 actors the speedup declines rapidly.

[Figure: line graph with series 1MB Vector and 128kB Vector. Y-axis: cache miss ratio, 0.00% to 4.00%. X-axis: total number of actors/processes, 0 to 1024.]

Figure 19 Cache miss ratio for the memory-intensive benchmark with data reuse and strong scaling.

Figure 19 shows the cache miss ratios observed when running this particular test case. Just like in Figure 16, we can see that the cache miss ratios stay very low; the highest observed miss ratio is 3.5% for the 128 kB vector with 8 actors in total, and the lowest is 0.65% with the 1 MB vector and 64 actors in total. The small peaks seen at small numbers of actors are likely due to the characteristics of the hardware prefetcher in the system, which is not investigated in detail here. The conclusion from Figure 19 is, just like in the previous test, that the hardware prefetcher helps to keep the cache misses in L1D fairly low at all times.

11.4 Discussion on software complexity and optimizations

There are some factors affecting the performance results which have not been investigated in detail. One important aspect is that there are different levels of software complexity and optimization in the two test systems. Since CBB was developed specifically to run on EMCA, its implementation is highly optimized for this hardware. This is one of the main benefits of co-designing software and hardware. A potential drawback of this approach is that it can be complex to migrate the software to another hardware platform. In this project MPI turned out to be a helpful tool for getting a working prototype on a new hardware platform, but no time has been spent optimizing the code for the x86-64 test machine. Instead the MPI-based prototype produced results that are assumed to be "good enough" to reason about, which leads to useful conclusions. The performance achieved with MPI can be viewed as a lower bound on what is actually achievable on an x86-64 system.

In addition, it would be possible to optimize the benchmark-specific code for the different hardware platforms. This was briefly discussed in Section 5.2.1. An example would be to change the access pattern to the data vectors in the memory-intensive tests with data reuse. It would then be possible to loop through a chunk of the vector small enough to fit inside the DSP data scratchpad and the L1D cache respectively, before moving on to the next chunk. This would dramatically increase the use of locality in the software, and would likely result in higher performance on both EMCA and x86-64. A sketch of such a blocked access pattern follows below.
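A minimal sketch of the idea, assuming a simple read-modify-write pass over the vector with a configurable number of reuse passes; CHUNK would be tuned to the L1D cache on x86-64 or to the DSP data scratchpad on EMCA:

#include <stddef.h>

#define CHUNK (32 * 1024 / sizeof(int))   /* e.g. sized for a 32 kB L1D */

void process_blocked(int *v, size_t n, int reuse_count)
{
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = (base + CHUNK < n) ? base + CHUNK : n;

        /* Reuse this chunk repeatedly while it is resident in fast
           memory, before moving on to the next chunk. */
        for (int r = 0; r < reuse_count; r++)
            for (size_t i = base; i < end; i++)
                v[i] += 1;
    }
}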

12 Conclusions

The objective of this project was to investigate connections between parallel programming models and the hardware that they run on. We saw that the EMCA processors differ in many ways from commercial designs like x86-64 and ARMv8. Their memory systems and processor cores were described and analyzed, highlighting a number of their key characteristics.

The project has provided insights into how a programming model can be adapted to run on new hardware. The MPI library was suggested as a tool for making a CBB implementation that can be used on both x86-64 and ARMv8. A prototype supporting actors with message passing capabilities was developed.

Benchmarks targeting compute-intensive and memory-intensive scenarios were developed and tested on EMCA and a high-end x86-64 machine. In the compute-intensive tests we saw that EMCA had better scaling properties than the x86-64 system with very many actors, but worse scaling properties with few actors. A modified benchmark was tested on EMCA, strengthening the theory that EMCA is very good at heavy calculations and that the original test was not compute-intensive enough. The memory-intensive tests showed how both systems can utilize locality in their memory systems by putting data in a cache or a scratchpad memory. We also saw how the hardware prefetcher in the x86-64 system helps to keep cache miss ratios low. Overall, EMCA showed more linear scaling behavior than the x86-64 system. Having more actors than the number of processor cores gave an immediate performance penalty on the x86-64 system in all tests, something that was not as apparent on EMCA.

13 Future work

There are a number of aspects that did not fit within the scope of this project but could be investigated in the future. Some examples are listed below.

13.1 Implement a CBB transform with MPI for x86-64

As previously stated, this project focused on creating a proof-of-concept model for CBB on x86-64. This could be expanded to become a full CBB implementation. The CBB transform could then take any CBB code and automatically generate the corresponding MPI code. It is difficult to estimate how comprehensive and time-consuming this work would be.

13.2 Expand benchmark tests to cover more scenarios

The benchmark tests used within this project could be expanded and modified to test more performance aspects. For example, it could be interesting to use a real baseband application with the kind of signal processing operations used in a cellular base station. Comparing the performance of such an application across multiple hardware architec- tures might be relevant for Ericsson.

13.3 Run benchmarks with hardware prefetching turned off

To dive deeper into the characteristics of the cache system in the x86-64 machine, it could be interesting to run the benchmark tests with the hardware prefetcher turned off. This would worsen performance overall, but the results would then be easier to relate to the cache specifications.


13.4 Combine MPI processes with OpenMP threads

There could be a benefit in splitting up tasks inside an MPI process using threads. As mentioned in Section 6.1, MPI only supports the creation of processes. It could be interesting to make a CBB implementation using MPI for process creation and something like OpenMP for thread-level parallelism. This is an approach which is widely used and studied, for example by Rabenseifner et al. [21].

13.5 Run the same code in an ARMv8 system

As mentioned in Section 6.1.2, MPICH is portable and can run on ARMv8-based systems as well. The code written in this project could easily be compiled and run on such a system, testing the same or other benchmarks. This would be an easy way to make additional comparisons between EMCA, x86-64 and ARMv8.


References

[1] G. A. Agha, Actors: a model of concurrent computation in distributed systems. Cambridge, Mass: MIT Press, 1986.

[2] B. Anuradha and C. Vivekanandan, “Usage of scratchpad memory in embedded systems — State of art,” in 2012 Third International Conference on Computing, Communication and Networking Technologies (ICCCNT’12), 2012, pp. 1–5.

[3] ARM Holdings. (2020, Sep.) ARM Cortex-A Series Programmer’s Guide for ARMv8-A – Chapter 11. Caches. Accessed: 2020-09-17. [Online]. Available: https://developer.arm.com/documentation/den0024/a/caches

[4] B. Barney. (2020) Message Passing Interface (MPI). Lawrence Livermore National Laboratory. Accessed: 2020-10-27. [Online]. Available: https://computing.llnl.gov/tutorials/mpi/

[5] Berkeley Design Technology, Inc. (2006) The Art of Processor Benchmarking: A BDTI White Paper. Accessed: 2020-10-06. [Online]. Available: https://www.bdti.com/MyBDTI/pubs/artofbenchmarking.pdf

[6] N. Chong and S. Ishtiaq, “Reasoning about the ARM Weakly Consistent Memory Model,” in Proceedings of the 2008 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness: Held in Conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’08), ser. MSPC ’08. New York, NY, USA: Association for Computing Machinery, 2008, p. 16–19. [Online]. Available: https://doi-org.ezproxy.its.uu.se/10.1145/1353522.1353528

[7] W. Chu. (2020, Jun.) Caching and Memory Hierarchy. Accessed: 2020-09-16. [Online]. Available: https://medium.com/@worawat.chu/caching-and-memory-hierarchy-fc7a9b9efcca

[8] I. Cutress. (2019, Aug.) Cache and TLB updates - The Ice Lake Benchmark Preview: Inside Intel’s 10nm. Accessed: 2020-09-17. [Online]. Available: https://www.anandtech.com/show/14664/testing-intel-ice-lake-10nm/2

[9] B. Falsafi and T. F. Wenisch, “A Primer on Hardware Prefetching,” Synthesis Lectures on Computer Architecture, vol. 9, no. 1, pp. 1–67, 2014. [Online]. Available: https://doi.org/10.2200/S00581ED1V01Y201405CAC028

[10] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, concept, and design of a next generation MPI implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004, pp. 97–104.

[11] G. Georgis, A. Thanos, M. Filo, and K. Nikitopoulos, "A DSP Acceleration Framework For Software-Defined Radios On x86-64," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1648–1652.

[12] J. L. Gustafson, "Reevaluating Amdahl's Law," Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988. [Online]. Available: https://doi-org.ezproxy.its.uu.se/10.1145/42411.42415

[13] J. L. Hennessy and D. A. Patterson, Computer Architecture - A Quantitative Approach (5th Edition). Elsevier, 2012. [Online]. Available: https://app.knovel.com/hotlink/toc/id:kpCAAQAE11/computer-architecture/computer-architecture

[14] M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era,” Computer, vol. 41, no. 7, p. 33–38, Jul. 2008. [Online]. Available: https://doi.org/10.1109/MC.2008.209

[15] J. Absar and F. Catthoor, "Analysis of scratch-pad and data-cache performance using statistical methods," in Asia and South Pacific Conference on Design Automation, 2006, 6 pp.

[16] N. D. E. Jerger, E. L. Hill, and M. H. Lipasti, “Friendly fire: understanding the effects of multiprocessor prefetches,” in 2006 IEEE International Symposium on Performance Analysis of Systems and Software, 2006, pp. 177–188.

[17] X. Li. (2018, Nov.) Scalability: strong and weak scaling. PDC Center for High Performance Computing – KTH Royal Institute of Technology. Accessed: 2020-09-17. [Online]. Available: https://www.kth.se/blogs/pdc/2018/11/scalability-strong-and-weak-scaling/

[18] Linux Kernel Organization, Inc. (2020, Sep.) perf(1) — Linux manual page. Accessed: 2020-11-19. [Online]. Available: https://man7.org/linux/man-pages/man1/perf.1.html

[19] B. Mann and N. Stephens. (2019, Apr.) New Technologies for the Arm A-Profile Architecture. Accessed: 2020-09-23. [Online]. Available: https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/new-technologies-for-the-arm-a-profile-architecture


[20] V. Nagarajan, D. J. Sorin, M. D. Hill, and D. A. Wood, "A Primer on Memory Consistency and Cache Coherence, Second Edition," Synthesis Lectures on Computer Architecture, vol. 15, no. 1, pp. 1–294, 2020. [Online]. Available: https://doi.org/10.2200/S00962ED2V01Y201910CAC049

[21] R. Rabenseifner, G. Hager, and G. Jost, "Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes," in 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2009, pp. 427–436.

[22] J. Reinders. (2013, Jul.) Intel AVX-512 Instructions. Accessed: 2020-09-23. [Online]. Available: https://software.intel.com/content/www/us/en/develop/articles/intel-avx-512-instructions.html

[23] T. Singh, S. Rangarajan, D. John, R. Schreiber, S. Oliver, R. Seahra, and A. Schaefer, "2.1 Zen 2: The AMD 7nm energy-efficient high-performance x86-64 microprocessor core," in 2020 IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 42–44.

[24] N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico, and P. Walker, "The ARM Scalable Vector Extension," IEEE Micro, vol. 37, no. 2, pp. 26–39, 2017.

[25] The MPICH Project. (2019, Nov.) MPICH Overview. Accessed: 2021-01-05. [Online]. Available: https://www.mpich.org/about/overview/

[26] The MPICH Project. (2019, Nov.) MPICH User Documentation. Accessed: 2021-01-05. [Online]. Available: https://www.mpich.org/static/docs/latest/

[27] J. Thiel. (2006) An Overview of Software Performance Analysis Tools and Techniques: From GProf to DTrace. Washington University in St. Louis. Accessed: 2020-10-06. [Online]. Available: https://www.cse.wustl.edu/~jain/cse567-06/ftp/sw_monitors1/

[28] D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing on-chip Parallelism," SIGARCH Comput. Archit. News, vol. 23, no. 2, pp. 392–403, May 1995. [Online]. Available: https://doi-org.ezproxy.its.uu.se/10.1145/225830.224449

[29] Wikipedia contributors. (2019) MPICH – Wikipedia. Accessed: 2020-11-20. [Online]. Available: https://en.wikipedia.org/wiki/MPICH

[30] Wikipedia contributors. (2020) OSI model – Wikipedia. Accessed: 2020-11-20. [Online]. Available: https://en.wikipedia.org/wiki/OSI_model
