UPTEC IT 21001
Degree Project in Computer and Information Engineering
March 2, 2021

Hardware Architecture Impact on Manycore Programming Model

Erik Stubbfält

Civilingenjörsprogrammet i informationsteknologi
Master Programme in Computer and Information Engineering

Abstract

Hardware Architecture Impact on Manycore Programming Model

Erik Stubbfält

Institutionen för informationsteknologi
Besöksadress: ITC, Polacksbacken, Lägerhyddsvägen 2
Postadress: Box 337, 751 05 Uppsala
Hemsida: http://www.it.uu.se

This work investigates how certain hardware architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is compared and contrasted to general-purpose multicore processors, highlighting differences in their memory systems and processor cores. A proof-of-concept implementation of the Concurrency Building Blocks (CBB) programming model is developed for x86-64 using MPI. Benchmark tests show how CBB on EMCA handles compute-intensive and memory-intensive scenarios, compared to a high-end x86-64 machine running the proof-of-concept implementation. EMCA shows its strengths in heavy computations, while x86-64 performs at its best with high degrees of data reuse. Both systems are able to utilize locality in their memory systems to achieve great performance benefits.

Extern handledare: Lars Gelin & Anders Dahlberg, Ericsson
Ämnesgranskare: Stefanos Kaxiras
Examinator: Lars-Åke Nordén
ISSN 1401-5749, UPTEC IT 21001
Tryckt av: Ångströmlaboratoriet, Uppsala universitet

Sammanfattning

This project investigates how different processor architectures can affect the implementation and performance of a parallel programming model. The Ericsson Many-Core Architecture (EMCA) is analyzed and compared with commercial multicore processors, and differences in their respective memory systems and processor cores are covered. A prototype Concurrency Building Blocks (CBB) implementation for x86-64 is developed with the help of MPI. Benchmark tests show how CBB together with EMCA handles compute-intensive and memory-intensive scenarios, in comparison with a modern x86-64 system running the developed prototype. EMCA shows its strengths in heavy computations, while x86-64 performs best when data is reused to a high degree. Both systems use locality in their respective memory systems in a way that greatly benefits performance.

Contents

1 Introduction
2 Background
  2.1 Multicore and manycore processors
  2.2 Parallel computing
    2.2.1 Different types of parallelism
    2.2.2 Parallel programming models
  2.3 Memory systems
    2.3.1 Cache and scratchpad memory
  2.4 Memory models
  2.5 SIMD
  2.6 Prefetching
  2.7 Performance analysis tools
  2.8 The actor model
  2.9 Concurrency Building Blocks
  2.10 The baseband domain
3 Purpose, aims, and motivation
  3.1 Delimitations
4 Methodology
  4.1 Literature study
  4.2 Development
  4.3 Testing
5 Literature study
  5.1 Comparison of architectures
    5.1.1 Memory system
    5.1.2 Processor cores
    5.1.3 SIMD operations
    5.1.4 Memory models
  5.2 Related academic work
    5.2.1 The Art Of Processor Benchmarking: A BDTI White Paper
    5.2.2 A DSP Acceleration Framework For Software-Defined Radios On x86-64
    5.2.3 Friendly Fire: Understanding the Effects of Multiprocessor Prefetches
    5.2.4 Analysis of Scratchpad and Data-Cache Performance Using Statistical Methods
6 Selection of software framework
  6.1 MPI
    6.1.1 Why MPI?
    6.1.2 MPICH
    6.1.3 Open MPI
7 Selection of target platform
8 Evaluation methods
  8.1 Strong scaling and weak scaling
    8.1.1 Compute-intensive benchmark
    8.1.2 Memory-intensive benchmark without reuse
    8.1.3 Memory-intensive benchmark with reuse
    8.1.4 Benchmark tests in summary
  8.2 Collection of performance metrics
  8.3 Systems used for testing
9 Implementation of CBB actors using MPI
  9.1 Sending messages
  9.2 Receiving messages
10 Creating and running benchmark tests
  10.1 MPI for x86-64
  10.2 CBB for EMCA
11 Results and discussion
  11.1 Compute-intensive benchmark
    11.1.1 Was the test not compute-intensive enough for EMCA?
  11.2 Memory-intensive benchmark with no data reuse
  11.3 Memory-intensive benchmark with data reuse
  11.4 Discussion on software complexity and optimizations
12 Conclusions
13 Future work
  13.1 Implement a CBB transform with MPI for x86-64
  13.2 Expand benchmark tests to cover more scenarios
  13.3 Run benchmarks with hardware prefetching turned off
  13.4 Combine MPI processes with OpenMP threads
  13.5 Run the same code in an ARMv8 system

List of Figures

1 Memory hierarchy of a typical computer system [7].
2 Memory hierarchy and address space for a cache configuration (left) and a scratchpad configuration (right) [2, Figure 1].
3 Main artefacts of the CBB programming model.
4 Conceptual view of a multicore system implementing TSO [7, Figure 4.4 (b)]. Store instructions are issued to a FIFO store buffer before entering the memory system.
5 Categorization of DSP benchmarks from simple (bottom) to complex (top) [5, Figure 1]. The grey area shows examples of benchmarks that BDTI provides.
6 Processor topology of the x86-64 system used for testing.
7 The CBB application used for implementation.
8 Normalized execution times for the compute-intensive benchmark test with weak scaling.
9 Normalized execution times for the compute-intensive benchmark test with strong scaling.
10 Speedup for the compute-intensive benchmark test with strong scaling.
11 Speedup for the compute-intensive benchmark test with strong scaling and 64-bit floating-point addition. Only EMCA was tested.
12 Normalized execution times for the memory-intensive benchmark with no data reuse and weak scaling.
13 Normalized execution times for the memory-intensive benchmark with no data reuse and strong scaling.
14 Speedup for the memory-intensive benchmark with no data reuse and strong scaling.
15 Normalized execution times for the memory-intensive benchmark with data reuse and weak scaling.
16 Cache miss ratio in L1D for the memory-intensive benchmark with data reuse and weak scaling.
17 Normalized execution times for the memory-intensive benchmark with data reuse and strong scaling.
18 Speedup for the memory-intensive benchmark with data reuse and strong scaling.
19 Cache miss ratio for the memory-intensive benchmark with data reuse and strong scaling.

List of Tables

1 Flag synchronization program to motivate why memory models are needed [20, Table 3.1].
2 One possible execution of the program in Table 1 [20, Table 3.2].


1 Introduction

This work is centered around the connections between two areas within computer science, namely hardware architecture and parallel programming. How can a programming model, developed specifically for a certain processor type, be expanded and adapted to run on a completely different hardware architecture? This question, which is a general problem found in many areas of industry and research, is what this thesis revolves around.

The project is conducted in collaboration with the Baseband Infrastructure (BBI) department at Ericsson. They develop low-level software platforms and tools used in baseband software within the Ericsson Radio System product portfolio. This includes the Concurrency Building Blocks (CBB) programming model, which is designed to take full advantage of the Ericsson Many-Core Architecture (EMCA) hardware.

EMCA has a number of characteristics that set it apart from commercial off-the-shelf (COTS) designs like x86-64 and ARMv8. EMCA uses scratchpad memories and simplistic DSP cores instead of the coherent cache systems and out-of-order cores with simultaneous multithreading found in general-purpose hardware. These differences, and more, are investigated in a literature study with a special focus on how they might affect run-time performance.

MPI is used as a tool for developing a working CBB prototype that can run on both x86-64 and ARMv8. This choice is motivated by the many similarities between concepts used in CBB and concepts seen in MPI. Finally, a series of benchmark tests are run with CBB on EMCA and with the CBB prototype on a high-end x86-64 machine. These tests aim to investigate some compute-intensive and memory-intensive scenarios, which are both relevant for actual baseband software. Each test is run with a fixed problem size which is divided equally among the available workers, and also with a problem size that increases linearly with the number of workers. EMCA shows very good performance with the compute-intensive tests. The test (using 16-bit integer addition) is in fact deemed to not be compute-intensive enough to highlight the expected scaling behavior, and a modified benchmark (using 64-bit floating-point addition) is also tested. In the memory-intensive tests, it is shown that x86-64 performs at its best when the degree of data reuse is high and it can hold data in its L1D cache. In this scenario it shows better scaling behavior than EMCA. However, x86-64 takes a much larger performance hit than EMCA when the number of processes exceeds the number of available processor cores.

The rest of this report is structured as follows: Section 2 describes the necessary background theory on the problem at hand. Section 3 discusses the purpose, aims and motivation behind the project, along with some delimitations. Section 4 goes into the methodology used. The literature study is contained in Section 5, and Sections 6 and 7 describe the software and hardware that are used for development. The development of a CBB proof-of-concept and a series of benchmark tests is described in Sections 8, 9 and 10. The results are presented and discussed in Section 11. Finally, Section 12 contains some conclusions and a summary of the contributions made, and Section 13 describes how this work could be continued in the future.

2 Background

To arrive at a concrete problem description, a bit of background theory on hardware architecture and parallel programming is required. The following sections provide details about some of the concepts that will be central later on.

2.1 Multicore and manycore processors

Traditionally, the key to making a computer program run faster was to increase the performance of the processor core that it was running on. The increase in single-core performance over the years was made possible by Moore’s law [13, p. 17], which describes how the number of transistors that fit on an integrated circuit of a given size has doubled approximately every two years. However, this rate of progression started to level off in the late 00s, mainly as a consequence of limits in power consumption. To get more performance out of each unit of energy, the solution was to add more processor cores and have them collaborate on running program code. This type of construction is commonly referred to as a multicore processor. In cases where the number of cores is especially high, the term manycore processor is often used.

2.2 Parallel computing

The performance observed when running a certain algorithm rarely scales perfectly with the number of processor cores added. Instead, the possible speedup for a fixed problem size is indicated by Amdahl’s law [14], which can be formulated as

Speedup = 1 / ((1 − f) + f/s),

where f is the fraction of the program that can benefit from additional system resources (in this case processor cores) to get a speedup of s. Another way of looking at this is that s is the number of processor cores used. This means that there is a portion of the program that can be split up in parallel tasks. The fraction of the code that cannot be parallelized, and thus not benefit from more cores, is represented by 1 − f. For example, with f = 0.95 and s = 128 cores, the speedup is limited to 1/(0.05 + 0.95/128) ≈ 17. Finding and exploiting parallelism, i.e. maximizing f, is crucial to getting the most out of modern multicore hardware in terms of overall performance. Scaling the number of processors used for a fixed problem size, as described by Amdahl’s law, is often referred to as strong scaling [17]. There is also a possibility of increasing the size of the problem along with the number of processor cores. This is called weak scaling. The possible speedup gains for this type of scenario are given by Gustafson’s law [12],

Speedup = (1 − f) + f · s,

where f and s have the same meaning as previously. We can see that there is no theoretical upper limit to the speedup that can be achieved with weak scaling, since the speedup increases linearly with s. In contrast, the non-parallelizable part 1 − f poses a hard upper limit on the speedup for strong scaling, even if s approaches infinity.
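To make the two laws concrete, the following small C program (an illustration added here, not part of the thesis benchmark code) evaluates both formulas for an assumed parallel fraction f = 0.95:

#include <stdio.h>

/* Amdahl's law: speedup for a fixed problem size (strong scaling). */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Gustafson's law: speedup when the problem grows with the core count (weak scaling). */
static double gustafson(double f, double s)
{
    return (1.0 - f) + f * s;
}

int main(void)
{
    const double f = 0.95;  /* assumed parallel fraction */
    for (double s = 2.0; s <= 1024.0; s *= 2.0)
        printf("s = %4.0f   Amdahl = %6.2f   Gustafson = %7.2f\n",
               s, amdahl(f, s), gustafson(f, s));
    return 0;
}

Running it shows the Amdahl speedup flattening toward 1/(1 − f) = 20 while the Gustafson speedup keeps growing linearly.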

2.2.1 Different types of parallelism

There are many ways to divide parallelism into subgroups. In the context of multicore processors and especially in this project, two of the most important types are:

• Data parallelism: The same set of operations is performed on many pieces of data, for example across items in a data vector, and the iterations are independent from one another. This means that the iterations can be split up beforehand and then be performed in parallel.

• Task parallelism: Different parts of a program are split up into tasks, which are sets of operations that are typically different from one another. A set of tasks can operate on the same or different data. If multiple tasks are completely independent of one another they can be run in parallel.

2.2.2 Parallel programming models

When creating a parallel program, the programmer typically uses a parallel programming model. This is an abstraction of available hardware that gives parallel capabilities to an existing programming language, or in some cases introduces an entirely new programming language. Programming models can for example support features such as thread creation, communication, and synchronization primitives.

2.3 Memory systems

Ideally a processor core would be able to access data items to operate on without any delay, regardless of which piece of data it requests. In reality, memories with short access times are very expensive to manufacture and are also not perfectly scalable. This is why modern computer systems have a hierarchy of memory devices attached to them. Figure 1 shows an example of a memory hierarchy, and the technologies typically associated with each level.

Figure 1 Memory hierarchy of a typical computer system [7].

One of the main ideas behind memory hierarchies is to take advantage of the principle of locality, which states that programs tend to reuse data and instructions they have used recently [13, p. 45]. This means that access patterns both in data and in code can often be predicted to a certain extent. Temporal locality refers to a specific item being reused multiple times within a short time period, while spatial locality means that items with adjacent memory addresses are accessed close together in time.


Figure 2 Memory hierarchy and address space for a cache configuration (left) and a scratchpad configuration (right) [2, Figure 1].

2.3.1 Cache and scratchpad memory

Cache memories typically sit close to the processor core. They are high-speed on-chip memory modules that share the same address space as the underlying main memory. Data and instructions are automatically brought into the cache, and there is typically a cache coherency mechanism to ensure that all data items get updated accordingly in all levels of the cache system. All this is typically implemented in hardware, and is therefore invisible to the programmer.

Some processor designs have scratchpad memory, which has the same high-speed characteristics as a cache [2]. The transfer of data to and from a scratchpad memory is typically controlled in software, which makes it different from a cache, where this is handled entirely in hardware. Scratchpad memory requires more effort from software developers, but is more predictable since it does not suffer from cache misses. Scratchpad memory also consumes less energy per access than cache memory since it has its own address space, meaning that no address tag lookup mechanism is needed. Figure 2 shows a schematic view of differences between cache and scratchpad memory.

Important to note is that there are many possible configurations of cache and scratchpad memory within a chip. They may for example be shared across multiple cores or private to a single core, and there may be multiple levels of cache or scratchpad memory where each level has different properties. It is also possible to utilize a scratchpad memory in conjunction with software constructs that make it behave similarly to a cache.
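As a minimal sketch of what software-managed data movement can look like, the following C fragment processes an array in chunks through a core-local buffer. The dma_copy function is hypothetical, standing in for whatever copy or DMA primitive a scratchpad-based platform provides; on a cached system the hardware would perform the equivalent movement transparently.

#define SPM_WORDS 256

/* Hypothetical platform primitive that moves data between main memory
   and the core-local scratchpad. */
extern void dma_copy(void *dst, const void *src, int words);

static int spm[SPM_WORDS];  /* stand-in for the core-local scratchpad */

void process_in_chunks(const int *src, int *dst, int n)
{
    for (int base = 0; base < n; base += SPM_WORDS) {
        int len = (n - base < SPM_WORDS) ? (n - base) : SPM_WORDS;
        dma_copy(spm, &src[base], len);   /* explicit load into the scratchpad */
        for (int i = 0; i < len; i++)
            spm[i] *= 2;                  /* compute on fast local memory */
        dma_copy(&dst[base], spm, len);   /* explicit write-back to main memory */
    }
}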


2.4 Memory models

The memory model is an abstract description of the memory-ordering properties that a particular system has. Important to note here is that these properties are visible to software threads in a multicore system. Different memory models also give compilers different degrees of freedom to reorder and optimize code.

Core 1                     | Core 2                      | Comments
S1: Store data = NEW;      |                             | Initially, data = 0 & flag ≠ SET
S2: Store flag = SET;      | L1: Load r1 = flag;         | L1 & B1 may repeat many times
                           | B1: if (r1 ≠ SET) goto L1;  |
                           | L2: Load r2 = data;         |

Table 1 Flag synchronization program to motivate why memory models are needed [20, Table 3.1].

To understand what a memory model is and why it is needed, look at Table 1. Here, core 2 spins in a loop while waiting for the flag variable to be SET by core 1. The question is, what value will core 2 observe when it loads the data variable in the end? Without knowing anything about the memory model of the system, this is impossible to answer. The memory model describes how instructions may be reordered at a local core.

Cycle | Core 1                | Core 2              | Coherence state of data | Coherence state of flag
1     | S2: Store flag = SET  |                     | Read-only for C2        | Read-write for C1
2     |                       | L1: Load r1 = flag  | Read-only for C2        | Read-only for C2
3     |                       | L2: Load r2 = data  | Read-only for C2        | Read-only for C2
4     | S1: Store data = NEW  |                     | Read-write for C1       | Read-only for C2

Table 2 One possible execution of the program in Table 1 [20, Table 3.2].

One possible outcome of running the program is shown in Table 2. We can see that a store-store reordering has occurred at core 1, meaning that it has executed instruction S2 before instruction S1 (violating the program order). In this case core 2 would observe that the data variable has the “old” value of 0. With knowledge about the memory model, the programmer can for example know where to insert memory barriers to ensure correctness in the program.
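In C11, for example, the flag synchronization from Table 1 can be made correct with release/acquire atomics, which insert the required ordering. This is a sketch of the general technique, not code from the thesis:

#include <stdatomic.h>

int data = 0;
atomic_int flag = 0;

void core1(void)   /* producer */
{
    data = 42;                                              /* S1 */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* S2: the release
                              store cannot be reordered before the store to data */
}

int core2(void)    /* consumer */
{
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
        ;          /* L1 & B1: spin until the flag is set */
    return data;   /* L2: guaranteed to read 42 */
}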

2.5 SIMD

Processor designs with Single Instruction Multiple Data (SIMD) features offer a way of performing the same calculation on multiple data items in parallel within the same processor core [13, p. 10]. These features can be used to obtain data parallelism of a degree beyond the core count of a system, and are often implemented using wide vector registers and operations on these registers. Since this approach needs to fetch and execute fewer instructions than the number of data items, it is also potentially more power efficient than the conventional Multiple Instruction Multiple Data (MIMD) approach.
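As an illustration of the register-based SIMD style used on x86-64, the sketch below adds 16-bit integers 16 lanes at a time with AVX2 intrinsics (an example written for this section, assuming a compiler flag such as -mavx2; it is not code from the thesis):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise addition of 16-bit integers, 16 lanes per 256-bit instruction. */
void add_u16(const uint16_t *a, const uint16_t *b, uint16_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi16(va, vb));
    }
    for (; i < n; i++)
        out[i] = a[i] + b[i];  /* scalar tail for the remaining elements */
}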

2.6 Prefetching

Prefetching is a useful way of hiding memory access latency during program execution. It involves predicting which data and instructions will be used in the near future, and bringing them into a nearby cache. An ideal prefetching mechanism would accurately predict addresses, make the prefetches at the right time, and place the data in the right place (which might include choosing the right data to replace) [9]. Inaccurate prefetching may pollute the cache system, possibly evicting useful items. Many common prefetching schemes try to detect sequential access patterns (possibly with a constant stride), which can be fairly accurate. Another method is to, for each memory access, bring in a couple of adjacent items from memory instead of just the item referred to in the program. Prefetching can be implemented both in hardware and in software.
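Software prefetching can be expressed directly in C with compiler builtins. The sketch below uses the GCC/Clang __builtin_prefetch builtin; the prefetch distance is an assumed tuning parameter, not a universal constant:

#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* elements ahead; assumed value, needs tuning */

long sum_array(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* hints: 0 = read access, 3 = high temporal locality */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        sum += data[i];
    }
    return sum;
}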

2.7 Performance analysis tools

Measuring how a computer system behaves while running a certain application can be done through an automated performance analysis tool. These tools can be divided into two broad categories: Static analysis tools rely on source code insertions for collecting data, while dynamic analysis tools make binary-level alterations and procedure calls at run-time [27]. There are also hybrid tools that utilize both techniques.


Most tools use some kind of statistical sampling, where the program flow of the tested application is paused at regular intervals to run a data collection routine. This can for example provide information about the time spent in each function, and how many times each function has been called. Many tools also utilize a feature present in most modern processor chips, namely hardware performance counters. These are special-purpose registers that can be programmed to react whenever a specific event occurs, for example an L1 cache miss or a branch misprediction. This can provide very accurate metrics without inducing any significant overhead.

2.8 The actor model

The actor model is an abstract model for concurrent computation centered around primitives called actors [1]. They are independent entities which can do computations according to a pre-defined behavior. The actor model is built around the concept of asynchronous message passing, in that every actor has the ability to send and receive messages to and from other actors. These messages can be sent and received at any time without coordinating with the actor at the other end of the communication, hence the asynchrony. When receiving a message, an actor has the ability to:

1. Send a finite number of messages to one or many other actors. 2. Create a finite number of new actors. 3. Define what behavior to use when receiving the next message.

Actors have unique identification tags, which are used as “addresses” for all message passing. They can have internal state information which can be changed by local behavior, but there is no ability to directly change the state of other actors.
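A minimal sketch of an actor's receive loop in C may make the model concrete. The mailbox_pop function and the message layout are hypothetical, introduced here only for illustration:

typedef struct message {
    int sender;    /* identification tag of the sending actor */
    int payload;
} message;

/* Hypothetical runtime primitive: blocks until a message arrives. */
extern message mailbox_pop(int actor_id);

typedef struct actor {
    int id;                                     /* unique identification tag */
    int state;                                  /* private internal state */
    void (*behavior)(struct actor *, message);  /* behavior for the next message */
} actor;

void actor_run(actor *self)
{
    for (;;) {
        message m = mailbox_pop(self->id);
        /* The behavior may send messages, create new actors,
           or replace self->behavior for the next message. */
        self->behavior(self, m);
    }
}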

2.9 Concurrency Building Blocks

CBB is Ericsson’s proprietary programming model for baseband functionality development, designed as a high-level domain-specific language (DSL). A CBB application is translated into C code for specific hardware platforms through a tool called the CBB transform. This makes for great flexibility, since developers do not need to target one platform specifically when writing baseband software.

Figure 3 Main artefacts of the CBB programming model.

Application behavior is defined inside CBB behavior classes (CBCs), seen in the middle of Figure 3. The CBCs are based on the actor model, described in Section 2.8. When a CBC handles an incoming message it can initiate an activity, shown on the right of Figure 3. An activity is an arbitrarily complex directed acyclic graph (DAG) of calls to C functions, and can also contain synchronization primitives. Different message types can be mapped to different activities. The simplest form of a CBC is an actor with a single serializing first-in first-out (FIFO) message queue. It is possible to define CBCs with different queue configurations, but those will not be focused on here. At the top level, an application is structured inside a CBB structure class (CSC). This CSC can in itself contain instances of CBCs and other CSCs, forming a hierarchy of application components.

2.10 The baseband domain

In a cellular network, the term baseband is used to describe the functionality in between the radio unit and the core network. This is where functionalities from Layer 1 (the physical layer) and Layer 2 (the data link layer) of the OSI model [30] are found. The baseband domain also contains Radio Resource Management (RRM) and a couple of other features.

• Layer 1: Responsible for modulation and demodulation of data streams for downlink and uplink respectively. It also performs link measurements and other tasks. This layer is responsible for approximately 75% of the compute cycles within the baseband domain, since it does a lot of computationally intensive signal processing.

• Layer 2: Handles per-user packet queues and multiplexing of control data and user data, among other tasks. This layer has a lot of memory-intensive work, and produces around 5% of the compute cycles.

• RRM: The main task of RRM is to schedule data streams on available radio chan- nel resources, which means solving bin-packing problems with a large number of candidates. This produces 15% of the compute cycles within the baseband domain.

3 Purpose, aims, and motivation

The broader purpose of this work is to investigate how hardware can affect software, and more specifically how certain hardware architectures affect the implementation and performance of a parallel programming model. CBB will be at the center of this investigation, and the result will be a proof-of-concept showing how it can be implemented on COTS hardware such as an x86-64 or ARMv8 chip. Differences between the selected architecture and EMCA will be analyzed, including how these differences manifest themselves in performance. Adapting a programming model to run on new architectures is a general problem that exists in many parts of industry and research. If done successfully it can create entirely new use cases and products. There is also potential to learn how to utilize hardware features that have not previously been considered.

3.1 Delimitations

This work focuses on important aspects of adapting a programming model to new hardware, and not on the actual implementation. Therefore this project does not include a new, fully functional CBB implementation. It instead results in a prototype with sufficient functionality for running performance tests. Section 13 of the report, which describes future work, discusses some necessary steps to make a more complete implementation.

Not all of the aspects described in the comparative part of the literature study will be evaluated with performance tests, since this would require more time and resources than available. Instead, the evaluation focuses on a few of the most relevant metrics. These are described in Section 8.2. The project is also not focused on comparing different programming models with each other. This is however a topic that is investigated in another ongoing project within the BBI department at Ericsson.

4 Methodology

The project is divided into three main parts, which will be described in the following sections.

4.1 Literature study

The first part of this work will consist of a literature study. The goal is to identify key features and characteristics to analyze, both of the programming model and of the available hardware. Special emphasis will be put on key differences between different hardware architectures. The literature study can be found in Section 5.

4.2 Development

A proof-of-concept implementation of some parts of CBB on a new hardware architecture will be built. This is described in Section 9. Some of the tools used in this process are provided by the BBI department at Ericsson. This includes EMCA IDE, which is the Eclipse-based Integrated Development Environment (IDE) used internally at Ericsson to create CBB applications. The literature study will be used as a basis when selecting which hardware platform, and which additional technologies, will be used during the implementation phase. This selection process is outlined in Sections 6 and 7.

4.3 Testing

A series of benchmark tests will be created using the previously created CBB prototype. The same set of tests will also be created using CBB for EMCA. This process is described in detail in Section 10.


A performance analysis tool will be used for gathering performance metrics from the targeted hardware platform. The most important requirement is access to hardware performance counters (see Section 2.7 for more details) for collecting cache performance metrics. The perf tool fits this requirement [18]. It is available in all Linux systems by default, and is therefore the performance tool of choice for this project. Execution time will be measured in code, with built-in timing functions which are described in Section 8.2.

5 Literature study

The literature study is split into three parts. Section 5.1 contains a comparison of the characteristics of three different hardware architectures. Section 2.9 describes the programming model used in this project. Section 5.2 summarizes academic work which may be valuable in later parts of the project.

5.1 Comparison of architectures

This section will compare EMCA to x86-64 and also to ARMv8, and highlight some of the key differences. Intel 64 (Intel’s x86-64 implementation) and ARMv8-A (the general-purpose profile of ARMv8) will be used as references for most of the comparisons.

5.1.1 Memory system

The memory system of EMCA is one of the key characteristics that sets it apart from the typical commercial architecture. Each processor core has a private scratchpad memory for data, and also a private scratchpad memory for program instructions. These memory modules will be referred to as DSP data scratchpad and DSP instruction scratchpad throughout the rest of this report. There is some hardware support for loading program instructions into the DSP instruction scratchpad automatically, making it behave similar to an instruction cache, but for the DSP data scratchpad all data handling has to be done in software. There is also an on-chip memory module, the shared memory, that all cores can use. It has significantly larger capacity than the scratchpad memories, and it is designed to behave in a predictable way (for example by offering bandwidth guarantees for every access).


One of the main reasons for doing so much of the baseband software development in-house is the memory system of EMCA. Most software is designed to run on a cache-coherent memory model, which is not present in EMCA.

x86-64 designs like Intel’s Sunny Cove cores (used within the Ice Lake processor family) have a three-tier cache hierarchy [8]. Each core has a split Level 1 (L1) cache, one for data (L1D) and one for instructions (L1I), and a unified Level 2 (L2) cache which has ∼5-10x the capacity of the combined L1. The architecture features a unified Level 3 (L3) cache which is shared among all cores. It is designed so that each core can use a certain amount of its capacity. Information about the cache coherency protocol used is not publicly available, but earlier Intel designs have been reported to use the MESIF (Modified, Exclusive, Shared, Invalid, Forward) protocol [13, p. 362], which is a snoop-based coherence protocol.

Contemporary ARMv8 designs feature a cache hierarchy of two or more levels [3]. Each core has its own L1D and L1I cache combined with a larger L2 cache, just like x86-64. The cache sizes vary between implementations. It is possible to extend the cache system with an external last-level cache (LLC) that can be shared among a cluster of processor cores, but this depends on the particular implementation. Details about the cache coherency protocol used by ARM are not publicly available.

5.1.2 Processor cores

EMCA is characterized as a manycore design, and it has a higher number of cores than many x86-64 or ARMv8 chips. However, most x86-64 chips and many ARMv8 chips support simultaneous multithreading (SMT), so that each processor core can issue multiple instructions from different software threads simultaneously [28]. This is achieved by duplicating some of the elements of the processor pipeline, and this technique gives the programmer access to a higher number of virtual cores and threads than the actual core count. EMCA does not support SMT.

The processor cores inside EMCA are characterized as Very Long Instruction Word (VLIW) processors. This means that they have wide pipelines with many functional units that can do calculations in parallel, and it is the compiler’s job to find instruction-level parallelism (ILP) and put together bundles of instructions (i.e. instruction words) that can be issued simultaneously. The instruction bundles can vary in length depending on what instructions they contain. The EMCA core has an in-order pipeline, as opposed to the out-of-order pipelines found in both x86-64 and ARMv8.

Since EMCA is developed for a certain set of calculations, namely digital signal processing (DSP) algorithms, its processor cores are optimized for this purpose. There is however nothing fundamentally different about how they execute each instruction compared to other architectures.

5.1.3 SIMD operations

The x86-64 architecture features a range of vector instruction sets, of which the latest generation is named Advanced Vector Extensions (AVX) and exists in a couple of different versions. AVX512, introduced by Intel in 2013 [22], features 512-bit wide vector registers which can be used for vector operations. All instructions perform operations on vectors with fixed lengths. AMD processors currently support only AVX2 (with 256 bits as its maximum vector length), while Intel has support for AVX512 in most of its current processors.

ARMv8 features the Scalable Vector Extension (SVE) [24]. As the name implies, the size of the vector registers used in this architecture is not fixed. It is instead an implementation choice, where the size can vary from 128 bits to 2048 bits (in 128-bit increments). Writing vectorized code for this architecture is done in the Vector-Length Agnostic (VLA) programming model, which consists of assembly instructions that automatically adapt to whatever vector registers are available at run-time. This means that there is no need to recompile code for different ARM chips to take advantage of vectorization, and also no need to write assembly intrinsics by hand. SVE was first announced in 2017, and details about the latest version (SVE2) were released in 2019 [19]. As of today, only higher-end ARM designs feature SVE. Most designs do however support the older NEON extension, utilizing fixed-size vector registers of up to 128 bits.

The Instruction Set Architecture (ISA) used with EMCA has support for SIMD instructions targeted at certain operations commonly used in DSP applications. One example is multiply-accumulate (MAC), which is accelerated in hardware. Similar instructions are available in AVX and SVE as well.

5.1.4 Memory models

The x86-64 architecture uses Total Store Order (TSO) as its memory model [20, p. 39]. There has been a bit of debate about this statement, but most academic sources claim that this is true and the details are not relevant enough to cover here. With TSO, each core has a FIFO store buffer that ensures that all store instructions from that core are issued in program order. The load instructions are however issued directly to the memory system, meaning that loads can bypass stores. This configuration is shown in Figure 4.

Figure 4 Conceptual view of a multicore system implementing TSO [7, Figure 4.4 (b)]. Store instructions are issued to a FIFO store buffer before entering the memory system.

ARM systems use a weakly consistent memory model [6] (also called relaxed consistency). This model makes no guarantees at all regarding the observable order of loads and stores. It can do all sorts of reordering: store-store, load-load, load-store and store-load. Writing parallel software for an ARM processor can therefore be more challenging than doing the same for an x86 processor, since weak consistency requires more effort to ensure program correctness (for example by inserting memory barriers/fences where order must be preserved). The upside is that more optimizations can be done both in software and in hardware, giving the weakly consistent system potential to run an instruction stream faster than a TSO system could. Two store instructions can for example be issued in reverse program order, which is not possible under TSO.

The memory model of EMCA does not guarantee a global ordering of instructions, although there are synchronization primitives for enforcing a global order when needed. Further details on its memory model are not publicly available.

5.2 Related academic work

This section summarizes earlier academic work in related areas that is useful within this project.

5.2.1 The Art Of Processor Benchmarking: A BDTI White Paper

Berkeley Design Technology, Inc. (BDTI) is one of the leading providers of benchmarking suites for DSP applications. This white paper [5] aims to explain the key factors that determine the relevance and quality of a benchmark test. It discusses how to distinguish good benchmarks from bad ones, and when to trust their results.

Figure 5 Categorization of DSP benchmarks from simple (bottom) to complex (top) [5, Figure 1]. The grey area shows examples of benchmarks that BDTI provides.

They argue that a trade-off has to be made between practicality and complexity. Figure 5 shows four different categories of signal processing benchmarks. Simple ones, based for example on additions and multiply-accumulate (MAC) operations, may be easy to design and run, but may not provide very meaningful results for the particular testing purpose. On the other side of the spectrum there are full applications that may provide useful results, but may be unnecessarily complex to implement across many hardware architectures. Somewhere, often in between the two extremes, there is a sweet spot that provides meaningful results without being too specific. A useful benchmark must however perform the same kind of work that will be used in the real-life scenario that the processor is tested for.

Another factor is optimization. The performance-critical sections of embedded signal processing applications are often hand-optimized, sometimes down to assembly level. Different processors support different types of optimizations (for example different SIMD operations), and allowing all these optimizations in a benchmark makes it more complex and hardware-specific but can also expose more of the available performance.

Many benchmarks focus on achieving maximum speed, but other metrics (such as memory use, energy efficiency and cost efficiency) might also be important factors when determining if a particular processor is suitable for the task. It can also be important to reason about the comparability of the results across multiple hardware architectures, instead of only looking at one processor in isolation.


5.2.2 A DSP Acceleration Framework For Software-Defined Radios On x86-64

This article [11] is concerned with the use of COTS devices for implementing baseband functions within Software-Defined Radios (SDR). The goal is to accelerate common DSP operations with the use of SIMD instructions available in modern x86-64 processors.

The OpenAirInterface (OAI), which is an open-source framework for deploying cellular network SDRs on x86 and ARM hardware, is used as a baseline. Some of its existing functions use 128-bit vector instructions. The authors extend OAI with an acceleration and profiling framework using Intel’s AVX512 instruction set. They implement a number of common algorithms, targeting massive multiple-input multiple-output (MIMO) use cases. A speedup of up to 10x is observed for the DSP functions implemented, compared to the previous implementation within OAI.

Most previous studies within the field have focused on application-specific processors and architectures. This study highlights some of the potential for using SIMD features in modern x86-64 processors for baseband applications.

5.2.3 Friendly Fire: Understanding the Effects of Multiprocessor Prefetches

Prefetching is an important feature of modern computer systems, and its effects are widely understood in single-core systems. This article [16] investigates side-effects that different prefetching schemes can cause in multicore systems with cache coherency, and when these can become harmful.

Four prefetching schemes are investigated: sequential prefetching, Content Directed Data Prefetching (CDDP), wrong-path prefetching and exclusive prefetching. Measurements are done in a simulator implementing an out-of-order sequentially consistent system using the MOESI protocol.

The result is a taxonomy of 29 different prefetch interactions and their effects in a multicore system. The harmful prefetch scenarios are categorized into three groups:

• Local conflicting prefetches: A prefetch in the local core forces an eviction of a useful cache line, which is referenced in the code before the prefetched cache line is.


• Remote harmful prefetches: A prefetch that causes a downgrade in a remote core followed by an upgrade in the same remote core, before the prefetched cache line is referenced locally. This upgrade will evict the cache line in the local core, making it useless.

• Harmful speculation: Prefetching a cache line speculatively, causing unnecessary coherence transactions in other cores. Can for example cause a remote harmful prefetch.

Performance measurements within the different prefetching schemes show that these prefetching effects can be harmful to performance. Some optimizations that can mitigate this effect are also briefly discussed.

5.2.4 Analysis of Scratchpad and Data-Cache Performance Using Statistical Methods

Choosing the right memory technology is important to get good performance and energy efficiency out of embedded systems. This study [15] compares how cache memory and scratchpad memory perform in different types of data-heavy application workloads. It is commonly believed that scratchpad memory is better for regular and predictable access patterns, while cache memory is preferable when access patterns are irregular.

The authors use a statistical model involving access probabilities, i.e. the probability that a certain data object is the next to be referenced in the code. They use this to calculate the optimal behavior of a scratchpad memory, and compare it to cache hit ratios. This is done both analytically and empirically. Matrix multiplication is used as an example of a workload with a regular access pattern. Applications involving trees, heaps, graphs and linked lists are seen as having irregular access patterns.

This work proves that scratchpad memory can always outperform cache memory, if an optimal mapping based on access probabilities is used. Increasing the cache associativity is shown not to improve the cache performance significantly.


6 Selection of software framework

6.1 MPI

Message Passing Interface (MPI) is a library standard formed by the MPI Forum, which has a large number of participants (including hardware vendors, research organizations and software developers) [4]. The first version of the standard specification emerged in the mid-90s, and MPI has since then become the de-facto standard for implementing message-passing programs for high-performance computing (HPC) applications. MPI 3.1, approved in June 2015, is the latest revision of the standard. MPI can create processes in a computer system, but not threads.

MPI is designed to primarily use the message-passing parallel programming model, where messages are passed by moving data from the address space of one process to the address space of another process. This is done through certain operations that both the sender and the receiver must participate in. One of the main goals of MPI is to offer portability, so that the software developer can use the same code in different systems with varying memory hierarchies and communication interconnects.

Since MPI is a specification rather than a library, there are many different implementations. Some examples are Open MPI, MPICH and Intel MPI. There are differences across implementations that might affect performance, but these differences are not focused on within this project.

6.1.1 Why MPI?

Choosing MPI as a basis for implementing CBB on a new hardware architecture is motivated by how well its concepts map onto the concepts found in the actor model and CBB. This includes the following, illustrated with a small code sketch after the list:

• Independent execution. An MPI process is an independent unit of computation with its own internal state, just like an actor. It can execute code completely independently of other processes. A process has its own address space in the computer system.

• Asynchronous message passing. An MPI process can send any number of messages to other processes, using ranks (equivalent to process IDs) as addresses. The operations for sending and receiving messages can be run asynchronously, just like with actors. An MPI process has one FIFO message queue by default.
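As a minimal illustration of these properties (a sketch written for this section, not part of the CBB prototype), the following MPI program lets rank 0 send a message to rank 1, with ranks acting as actor addresses. Compile with mpicc and run with mpiexec -n 2:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* the rank is the process's "address" */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}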


6.1.2 MPICH

The MPI implementation used within this project is MPICH. It was initially developed along with the original MPI standard in 1992 [25]. It is a portable open-source implementation, and one of the most widely used today. It supports the latest MPI standard and has good user documentation. The goals of the MPICH project, as stated on the project website, are:

• To provide an MPI implementation that efficiently supports different computation and communication platforms including commodity clusters, high-speed networks and proprietary high-end computing systems.

• To enable cutting-edge research in MPI through an easy-to-extend modular frame- work for other derived implementations.

MPICH has been used as a basis for many other MPI implementations including Intel MPI, Microsoft MPI and MVAPICH2 [29]. Since MPICH is designed to be portable it can run on x86-64, ARMv8 and also in a variety of other computer systems.

6.1.3 Open MPI

The initial choice for an MPI implementation to use within this project was Open MPI. It is one of the most commonly used implementations and it has excellent user documentation. Open MPI is an open-source implementation developed by a consortium of partners within academic research and the HPC industry [10]. Some of the goals of the Open MPI project are:

• To create a free, open source, peer-reviewed, production-quality complete MPI implementation.

• To directly involve the HPC community with external development and feedback (vendors, 3rd party researchers, users, etc.).

• To provide a stable platform for 3rd party research and commercial development.

Unfortunately there were problems with getting Open MPI to run properly in the x86-64 system used for testing (which is described in Section 8.3). Instead of spending time debugging these problems, Open MPI was replaced with MPICH, which worked without issues.


7 Selection of target platform

As seen in the architecture comparisons in Section 5.1, modern x86-64 and ARMv8 hardware have many similarities. They both incorporate high-performance out-of-order cores with multiple levels of cache, since they are both targeted at general-purpose computing where these features can be beneficial. The differences between EMCA and x86-64 are of the same nature as the differences between EMCA and ARMv8. The decision between x86-64 and ARMv8 is therefore not as significant as the overall question:

What happens when we run CBB applications on modern general-purpose hardware?

The choice of a target platform instead comes down to availability. Ericsson’s development environment is run on x86-64 servers, and accessing additional x86-64 hardware for testing within Ericsson has been less difficult than accessing ARMv8 hardware. Doing all development and testing on x86-64 hardware was therefore the natural choice. As mentioned in Section 6.1.2, the CBB implementation created will work with both hardware platforms since MPICH programs can be compiled and run on both. Porting the CBB implementation from x86-64 to ARMv8 would simply mean moving the source code and recompiling it.

8 Evaluation methods

As described by BDTI [5] (see Section 5.2.1), there has to be a trade-off between practicality and complexity when designing benchmark tests. There is often a sweet spot of tests that provides meaningful results without being too specific. This is the goal of the tests that will be used in this project, which are all described in the following sections.

8.1 Strong scaling and weak scaling

One of the fundamental goals of CBB and EMCA, as described in Section 2.9, is to enable massive parallelism. With this in mind, it is reasonable to test how the degree of parallelism in an application affects performance within the targeted hardware platform. A simple two-actor application will be used, following this basic structure:

1. Actor A and actor B get initiated.


2. Actor A sends a message to actor B, containing a small piece of data.
3. Actor B receives the message and performs some work (described in Sections 8.1.1, 8.1.2 and 8.1.3).
4. Actor B sends a message back to actor A, acknowledging that all calculations are completed.
5. Both actors terminate and the test is finished.

To test different degrees of parallelism, multiple instances of the two-actor application will be run. These instances are completely independent of each other, which means that the parallelism present in the software (namely the data parallelism) will scale perfectly. See Section 2.2 for more details on parallel computing. Each benchmark test will be run with four actors (two of actor A, two of actor B), and then the number of actors will be increased before running the test again. This process will be repeated until reaching 1024 actors, which is significantly more than the number of available processor cores in the computer systems used for testing (described in Section 8.3). Two variations of each benchmark test will be evaluated:

1. Weak scaling: The size of the problem will increase along with the number of program instances.
2. Strong scaling: The size of the problem will be fixed, and split up among all instances of the two-actor application.

8.1.1 Compute-intensive benchmark

As described in Section 2.10, Layer 1 functionality is computationally intensive and is responsible for most of the compute cycles within the baseband domain. This will be simulated by letting actor B perform a large amount of computation after receiving data from actor A. The computation at actor B will consist of addition operations of the following form:

result = result + 10000;

Here, result will be an unsigned 16-bit integer. It will overflow many times during the test, so that the result will always be between 0 and 2^16. The addition operation will be repeated 1 million times by each actor B in the weak scaling scenario, and 1 million/N times by each actor B in the strong scaling scenario with N program instances.
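A sketch of actor B's work in this test could look as follows. The volatile qualifier is an assumption made here to keep a compiler from folding the loop away; the actual benchmark source is not reproduced in this report:

#include <stdint.h>

uint16_t compute_kernel(long iterations)
{
    volatile uint16_t result = 0;
    for (long i = 0; i < iterations; i++)
        result = result + 10000;  /* wraps around modulo 2^16 many times */
    return result;
}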


8.1.2 Memory-intensive benchmark without reuse

Layer 2 applications, as described in Section 2.10, do memory-intensive work. This will be simulated by letting actor B allocate and loop through a data vector after receiving its message from actor A, touching each element once. Two vector sizes will be tested: 1 MB and 128 kB. In the weak scaling scenario, each actor B will have its own data vector of this size. With strong scaling, each actor B will have a data vector sized (initial vector size)/N with N program instances. The data vectors will be dynamically allocated (on the heap).
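Actor B's work in this test might be sketched as below (an illustration written for this section; the actual benchmark source is not reproduced here):

#include <stdlib.h>

void touch_vector(size_t bytes)
{
    char *v = malloc(bytes);       /* dynamically allocated on the heap */
    for (size_t i = 0; i < bytes; i++)
        v[i] = (char)i;            /* touch each element exactly once */
    free(v);
}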

8.1.3 Memory-intensive benchmark with reuse

This is a variation of the test described in Section 8.1.2. The only difference is that, in this test, actor B will loop through its own data 1000 times. This means that there will be data reuse, so that the application can make use of locality and caches.

8.1.4 Benchmark tests in summary

To cover all variations of the different benchmarks, 10 individual test cases will be created and run. These are:

1. Compute-intensive test with weak scaling.

2. Compute-intensive test with strong scaling.

3. Memory-intensive test with no data reuse and weak scaling, with a 1 MB vector.

4. Memory-intensive test with no data reuse and weak scaling, with a 128 kB vector.

5. Memory-intensive test with no data reuse and strong scaling, with a 1 MB vector.

6. Memory-intensive test with no data reuse and strong scaling, with a 128 kB vector.

7. Memory-intensive test with data reuse and weak scaling, with a 1 MB vector.

8. Memory-intensive test with data reuse and weak scaling, with a 128 kB vector.

9. Memory-intensive test with data reuse and strong scaling, with a 1 MB vector.

10. Memory-intensive test with data reuse and strong scaling, with a 128 kB vector.


8.2 Collection of performance metrics

At each individual step of the two benchmark tests described above, these metrics will be recorded:

• Execution time: This indicates the overall performance, in terms of pure speed. Shorter execution times are better. When running on x86-64, this metric will be measured using the built-in MPI_Wtime function in MPI. A corresponding function available in EMCA systems will be used there. The timing methodology follows this scheme (a code sketch follows below):

1. All actors in the system get initialized, and then synchronize using a barrier or similar.
2. One actor collects the current time.
3. The actors do their work.
4. All actors synchronize again.
5. One actor collects the current time again and subtracts the previously collected time.

With this method, all overhead associated with initializing and terminating the execution environment is excluded. The measured time instead only shows how long the actual work within the benchmark tests takes.

• Cache misses in L1D: This shows us how the cache system and the hardware prefetcher perform. Lower numbers of cache misses are preferable. See Section 2.3 for more details. This metric is available in x86-64 systems but not in EMCA, which does not have caches. The perf tool will be used for this. The command used for running a test and measuring cache behavior is:

$ perf stat -e L1-dcache-loads,L1-dcache-load-misses

This shows the number of loads and misses in the L1D cache, and also the miss ratio in %.

All steps of each benchmark test will be run three times, and then the average numbers produced from these three runs will be used as a result. This is to even out the effects of unpredictable factors that might produce noisy results.
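On x86-64, the five-step timing scheme above maps directly onto MPI calls. A minimal sketch, with do_work standing in for the benchmark body:

#include <mpi.h>

double time_region(void (*do_work)(void))
{
    MPI_Barrier(MPI_COMM_WORLD);   /* steps 1: all processes synchronize       */
    double start = MPI_Wtime();    /* step 2: collect the current time         */
    do_work();                     /* step 3: the actual benchmark work        */
    MPI_Barrier(MPI_COMM_WORLD);   /* step 4: synchronize again                */
    return MPI_Wtime() - start;    /* step 5: subtract the previously
                                      collected time                           */
}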


8.3 Systems used for testing

The x86-64 system that will be used for running benchmarks is a high-end server machine with AMD processors built on their Zen 2 microarchitecture [23]. Some of its specifications are summarized below.

• 2 x AMD EPYC 7662 processors.

• 128 physical cores in total (64 per chip).

• 256 virtual cores in total (128 per chip) using SMT.

• 32 kB of private L1D cache per physical core.

• 512 kB of private L2 cache per physical core.

• 4 MB of L3 cache per physical core (16 MB shared across four cores in each core complex).

• The system is running Linux Ubuntu 20.04.1 LTS.


Figure 6 Processor topology of the x86-64 system used for testing.

Figure 6 shows the topology of the processor cores and cache hierarchy in the x86-64 system. This graphical representation was obtained with the hwloc command line tool in Linux.


The benchmarks will also be run on a recent iteration of Ericsson's manycore hardware. As discussed in Section 5.1, it has a number of specialized DSP cores with private scratchpad memories for instructions and data, and also an on-chip shared memory that all cores can use. Detailed specifications of this processor are confidential and cannot be described here.

9 Implementation of CBB actors using MPI


Figure 7 The CBB application used for implementation.

A simple CBB application was created using EMCA IDE, consisting of just two CBCs. A graphical representation of the application is seen in Figure 7. Here, the outer box labeled sc represents the top-level CSC. There is one CBC instance named dataActor, which is of the CBC type with the same name, and one instance printerActor of the type with the same name. The names of the CBCs originate from some of Ericsson's user tutorial material and are not representative of their behavior. The out port of dataActor is connected to the in port of printerActor, meaning that they are aware of each other's existence and can send messages to each other. The coloring of the ports in Figure 7 and the "∼" label on one of them symbolize the direction of the communication; certain message types can be sent from dataActor to printerActor, and other message types can be sent in the other direction. The ports named dataP, which connect dataActor to the edge of the CSC, will not be used.

The application was run through the CBB transform targeting EMCA, which generated a number of C code and header files. These files were then used as a basis for creating new C functions targeting the x86-64 platform, with the help of MPI calls. Details about all the MPI routines mentioned in the following sections can be found on the official MPICH documentation webpage [26].


9.1 Sending messages

CBB generates a "send" function for each port of each CBC. The contents of these functions were rewritten to do the following:

1. MPI_Isend is used to post a non-blocking send request. This call returns immediately, without ensuring that the message has been delivered to its destination. The function gets arguments describing the message contents and which process to deliver it to.

2. MPI_Wait then blocks execution until MPI_Isend has moved the message out of its send buffer, so that the buffer can be reused.

The reason for using these two MPI calls instead of MPI_Send, which is a single blocking call that performs the same task, is to enable an overlap between communication and computation. This can be accomplished by doing some calculations between the two MPI calls, as in the sketch below.
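A minimal sketch of such a rewritten send function, assuming the payload, size, destination rank, and tag are supplied by the caller; the function name is illustrative, not the generated CBB name:

#include <mpi.h>

void port_send(const void *msg, int nbytes, int dest, int tag)
{
    MPI_Request req;

    /* Post a non-blocking send; this call returns immediately. */
    MPI_Isend(msg, nbytes, MPI_BYTE, dest, tag, MPI_COMM_WORLD, &req);

    /* Computation that does not touch msg could be overlapped here. */

    /* Block until the message has left the send buffer, so that the
       buffer can safely be reused. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}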

9.2 Receiving messages

There is a generated “receive” function for each port of each CBC. These were rewritten to have the following behavior:

1. MPI_Probe is a blocking call that checks for incoming messages. When it detects a message, it writes some information to a status variable and returns. The status information includes the tag of the message and the ID of the source process.

2. MPI_Get_count is then used to determine how many bytes of data the message contains.

3. MPI_Recv is called last. This function is a blocking call, but it will not cause a stall in execution, since the previous calls have ensured that there actually is an incoming message.

Using these three MPI calls instead of only MPI_Recv allows messages to be received without knowing all their specifics in advance; MPI_Recv requires arguments describing the tag, source, and size of the incoming message, which we find out using MPI_Probe and MPI_Get_count. This enables the definition of message types with varying data contents, as in the sketch below.
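A minimal sketch of such a rewritten receive function, assuming the caller provides a buffer large enough for any message type; again, the names are illustrative:

#include <mpi.h>

int port_receive(void *buf, int max_bytes)
{
    MPI_Status status;
    int nbytes;

    /* Block until a message with any tag arrives from any source. */
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

    /* Find out how many bytes the message contains. */
    MPI_Get_count(&status, MPI_BYTE, &nbytes);
    if (nbytes > max_bytes)
        return -1;  /* message too large for the caller's buffer */

    /* Receive it; this cannot stall, since the probe has already
       matched an incoming message. */
    MPI_Recv(buf, nbytes, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return nbytes;
}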


10 Creating and running benchmark tests

10.1 MPI for x86-64

With MPI, there is a need to initialize the execution environment and create the MPI processes corresponding to each CBC. This is done in an additional code file, test_cases.c, which is used to run the actual tests. Each MPI process runs its own copy of the code, which follows this basic structure (see the sketch after the list):

1. Initiate the MPI execution environment with MPI_Init and determine the process ID by calling MPI_Comm_rank.

2. Use the process ID to determine which actor type and instance it corresponds to, according to a lookup table or similar. The actor then knows whether it is a dataActor or a printerActor in this case.

3. Run the code corresponding to the current test case. This part will differ depending on which test is being run; see Section 8.1. The time measurement, as discussed in Section 8.2, is also part of this step.

4. Terminate the MPI process and exit the execution environment with MPI_Finalize.
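A minimal sketch of this structure, assuming a simple even/odd mapping from rank to actor type instead of the actual lookup table; the two actor bodies are hypothetical stand-ins for the rewritten CBB-generated code:

#include <mpi.h>
#include <stdio.h>

/* Hypothetical actor bodies; the real ones call the rewritten CBB code. */
static void run_data_actor(int rank)    { printf("dataActor, rank %d\n", rank); }
static void run_printer_actor(int rank) { printf("printerActor, rank %d\n", rank); }

int main(int argc, char **argv)
{
    int rank;

    /* 1. Initiate the MPI environment and determine the process ID. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 2. Map the process ID to an actor type and instance. */
    if (rank % 2 == 0)
        run_data_actor(rank);
    else
        run_printer_actor(rank);

    /* 3. Test-case code and time measurement run inside the actor bodies. */

    /* 4. Terminate the process and exit the execution environment. */
    MPI_Finalize();
    return 0;
}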

All necessary code files and headers are compiled using mpicc, a compiler command for MPICH that uses the system's default C compiler (gcc in this case) along with the additional linkage needed by MPICH. make is used to produce a single binary for the complete application. To run a test case, a variation of the following command is used:

$ mpiexec -n X -bind-to hwthread bin/test

Here X is the total number of actors (MPI processes) that will be present in the system. Since the actors operate in pairs (one doing work and the other just sending messages), X must be an even number. The -bind-to hwthread option makes sure that every MPI process gets associated with one hardware thread (virtual core), which reduces the process management overhead. This is beneficial for performance and also makes the behavior more predictable. bin/test points to the binary generated by mpicc.


10.2 CBB for EMCA

Creating the benchmark tests with CBB is a simpler process. The two-actor application seen in Figure 7 had already been generated using EMCA IDE and the CBB transform (as described in Section 9). The code for doing the actual work (as described in Sections 8.1.1, 8.1.2 and 11.3) is then added inside code files generated by the CBB transform. Further details about the code structure inside CBB will not be described here. Since memory allocation in the DSP data scratchpad and the shared memory has to be done manually on EMCA, and the two do not share address spaces, the following structure was used for allocating memory in the memory-intensive benchmark tests on EMCA:

if (vector_size < threshold_size)
    // allocate space in DSP data scratchpad
else
    // allocate space in shared memory

Here, the threshold size is used as a cross-over point between the two memory units. It is smaller than the actual size of the DSP data scratchpad memory, which leaves space that could be used for system-internal data.

11 Results and discussion

This section contains results collected when running the benchmark tests on EMCA and x86-64. To make the results comparable across architectures, all execution times have been normalized. This means that each individual execution time is divided by the first execution time in that series, so that every data series (every line in a graph) starts at 1.0. Hence any value above 1.0 means worse (slower) performance, and any value below 1.0 means better (faster) performance. This makes it possible to focus on the scaling properties in each benchmark test, instead of execution times in absolute terms. Actual execution times are discussed briefly but not shown in figures. In the tests that involve strong scaling it is also relevant to look at the speedup, which is the inverse of the normalized execution time: the initial execution time divided by the current execution time. This shows how many times faster the execution is compared to the first run (represented by 1.0 here as well). Thus, for speedup, a lower number means worse performance and a higher number means better performance.
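Expressed as formulas, where $t_i$ denotes the execution time of the $i$-th run in a series, the normalized execution time and the speedup are:

\[
    t_i^{\mathrm{norm}} = \frac{t_i}{t_1},
    \qquad
    S_i = \frac{t_1}{t_i} = \frac{1}{t_i^{\mathrm{norm}}}
\]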


In Section 11.3, covering the memory-intensive benchmark with data reuse, cache miss ratios on x86-64 will also be presented. This is the only test that takes advantage of caches, which is why this metric is relevant here but not in the other tests. Finally, there will be a discussion about complexity and optimizations in the software, and how these factors affect the performance results. This discussion is found in Section 11.4.

11.1 Compute-intensive benchmark

[Figure: line graph with series x86 and EMCA. Y-axis: execution time (normalized per data series), 0 to 90. X-axis: total number of actors/processes, 0 to 1024.]

Figure 8 Normalized execution times for the compute-intensive benchmark test with weak scaling.

Figure 8 shows how both systems perform in the compute-intensive test with weak scaling. We can see that the execution time in the EMCA system scales linearly with the number of actors created. This is true even when the number of actors gets significantly larger than the number of processor cores available.

The x86-64 system behaves very differently. It performs well compared to EMCA up until hitting 256 actors, which is also the number of virtual processor cores available in the system. After that we can see a big performance penalty. The execution time increases fairly linearly until 768 actors. At this point, the execution time is over 80 times longer than the baseline. The normalized execution time for x86-64 finally tapers off slightly, ending at around 75 for 1024 actors.

This is an interesting phenomenon that could possibly be explained by the distribution of actors onto processor cores. Half of the actors spend most of their time performing computations and the other half spend their time waiting for a message response, so there could be a congestion of many computing actors on the same virtual core. In that case they have to wait for each other, causing a performance penalty. If the actors doing computations are more evenly distributed across the cores, they interfere less with each other, which could make for better performance even as the number of actors increases. This is a theory that could be confirmed with further experiments in the future. The phenomenon appears in all the x86-64 benchmark tests, but will not be discussed again in later sections.

[Figure: line graph with series x86 and EMCA. Y-axis: execution time (normalized per data series), 0 to 180. X-axis: total number of actors/processes, 0 to 1024.]

Figure 9 Normalized execution times for the compute-intensive benchmark test with strong scaling.

In Figure 9 we can see the normalized execution times when testing the strong scaling property. The behavior of EMCA is indistinguishable from the previous test, with a linearly increasing normalized execution time ending at just below 60. Since the numbers are so similar in the compute-intensive benchmark both when the total amount of computation increases (weak scaling) and when it is fixed (strong scaling), one could assume that this kind of computation is "cheap" even when done many times. In that case the overhead cost of managing many actors in the system dominates, and it is this cost that we see in Figure 8 and Figure 9. This theory is investigated further in Section 11.1.1.

The x86-64 system shows a similar behavior as before when the number of actors passes the number of virtual processor cores. The big difference is that the normalized execution time ends at around 160 in this test, which is close to double what we saw in the weak scaling test. This reveals a possible drawback of the methodology used. Since each series starts with two actors doing computations (and two more which send and wait for messages), the weak scaling actors do 1 million additions each, and this initial run takes 2.35 ms in absolute terms (absolute times are not seen in the figures). The strong scaling actors start at 1 million / 2 = 0.5 million additions each, taking 1.11 ms. With 1024 actors the weak scaling variant takes 153 ms while the strong scaling variant takes 174 ms, which are much more similar numbers. If both tests had started with only one actor doing computations (two actors in total), both scaling variants would have started at around 2 ms and we would have seen much more similar normalized execution times when the number of actors gets large.

[Figure: line graph with series x86 and EMCA. Y-axis: speedup, 0 to 7. X-axis: total number of actors/processes, 0 to 256.]

Figure 10 Speedup for the compute-intensive benchmark test with strong scaling.


Finally, Figure 10 shows the speedup of the execution in the strong scaling variant. We can see that EMCA does not speed up at all. This matches the previous discussion well; computations are "cheap" and the overhead for creating actors dominates, so we get no benefit from splitting up the computation among the processor cores. The x86-64 data shows a different behavior. The amount of computation is actually visible in its performance, since there is an evident benefit in splitting it up among the cores. We get the best performance with 64 actors, producing a speedup of around 6. After that the overhead from creating and managing processes in the system takes over, and the speedup is back below 1.0 when we hit 256 actors. Beyond that point the speedup only decreases further, so those runs are not included in the figure.

11.1.1 Was the test not compute-intensive enough for EMCA?

As discussed previously, the compute-intensive benchmark on EMCA gave us identical execution times with weak scaling and strong scaling. This was not expected, and a theory is that the test was not compute-intensive enough for the EMCA processor. To explore this theory, the test was modified to use 64-bit floating-point addition, which is a more computationally intensive operation than the previously used 16-bit integer addition. This test was run with 4, 16 and 128 actors.
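To illustrate the difference between the two operation types, the sketch below shows hypothetical stand-ins for the benchmark work loop; the actual kernel is defined by the benchmark in Section 8.1, so the simple accumulation loops here are an assumption, not the real code:

#include <stdint.h>

/* Hypothetical stand-in for the original work loop. */
int16_t work_int16(const int16_t *v, long n)
{
    int16_t acc = 0;
    for (long i = 0; i < n; i++)
        acc += v[i];    /* original test: 16-bit integer addition */
    return acc;
}

/* Hypothetical stand-in for the modified work loop. */
double work_fp64(const double *v, long n)
{
    double acc = 0.0;
    for (long i = 0; i < n; i++)
        acc += v[i];    /* modified test: 64-bit floating-point addition */
    return acc;
}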


[Figure: line graph with series EMCA. Y-axis: speedup, 0 to 4. X-axis: total number of actors/processes, 0 to 128.]

Figure 11 Speedup for the compute-intensive benchmark test with strong scaling and 64-bit floating-point addition. Only EMCA was tested.

The results from the modified compute-intensive benchmark test with floating-point addition on EMCA are shown in Figure 11. We can see that there was a performance benefit in doing strong scaling here, i.e. in splitting up the floating-point calculations among actors. This shows that floating-point operations are indeed more computationally heavy on EMCA than what we tested previously, and it strengthens the theory that the previous test was not compute-intensive enough.


11.2 Memory-intensive benchmark with no data reuse

[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: execution time (normalized per data series), 0 to 700. X-axis: total number of actors/processes, 0 to 1024.]

Figure 12 Normalized execution times for the memory-intensive benchmark with no data reuse and weak scaling.

Looking at Figure 12, we can see how the two systems perform in the memory-intensive benchmark with no data reuse and weak scaling. The EMCA system performs in a similar way with both the larger and the smaller data vector, ending at normalized execution times of 73 and 100 respectively. When looking at actual execution times, we can see that the initial test run with the larger data vector has a 15-20 times longer execution time than the initial test run with the smaller vector. What we are seeing here are normalized execution times for allocating large vectors in the shared memory. It seems as though allocating and looping through large vectors takes longer to begin with, but scales better, than allocating and looping through small vectors.

An interesting point here is that there are cases where we try to allocate more memory in the shared memory than what is available. This did not seem to cause any trouble when running the test; all actors could allocate and write to their memory vectors at all times. A theory is that the function used for dynamic memory allocation in EMCA does not return until there is available space in the shared memory, which would cause a congestion of actors waiting for memory allocation. This theory has however not been confirmed, and it is not evident when looking at the normalized execution times.

The x86-64 system shows very different scaling properties for the two vector sizes, with similar behavior as EMCA for the larger vector and significantly worse scaling for the smaller vector. This can be explained by looking at the absolute execution times. With the larger vector, the initial execution time (with 4 actors in total) is 2.07 ms and the final execution time (with 1024 actors in total) is 179 ms. The corresponding times for the smaller vector are 0.3 ms and 172 ms respectively. The initial execution time is thus almost 7 times longer with the larger vector than with the smaller one, while the final execution times are very similar, so the final normalized values end up at roughly 86 for the larger vector and 573 for the smaller, a difference of around 500. This is a drawback of showing normalized execution times instead of actual execution times; showing actual execution times would however make it harder to compare architectures. In conclusion, allocating and writing data to a small vector is much faster than doing the same with a large vector, but when the number of processes and vectors gets very large, the overhead from process management dominates and the execution times start to look more similar.

[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: execution time (normalized per data series), 0 to 1600. X-axis: total number of actors/processes, 0 to 1024.]

Figure 13 Normalized execution times for the memory-intensive benchmark with no data reuse and strong scaling.

When testing strong scaling, it is hard to evaluate EMCA by only looking at Figure 13, since the x86-64 numbers dominate. A figure showing only the EMCA numbers would reveal that the normalized execution time with a 1 MB vector ends at 0.79 with 1024 actors, which means that this execution is even faster than the run with 4 actors in total. This is a symptom of the difference in response time between the DSP data scratchpad and the shared memory. With the smaller vector, the normalized execution time with 1024 actors is 10.1, because less of the work involves the shared memory to begin with. More on the DSP data scratchpad versus the shared memory follows in the discussion of the speedup numbers below.

When looking at the x86-64 numbers, we can see a similar behavior as in the weak scaling test when the number of actors grows large. The gap between the two vector sizes is however even larger here, with the normalized execution times hitting a maximum of almost 1400. Like before, this is a result of the difference in initial execution times: 1.02 ms with the larger vector and 0.131 ms with the smaller vector. With 1024 actors, the execution times are 192 ms and 160 ms respectively. Again, they start with very different execution times because of the difference in vector sizes, but end up with very similar execution times when the process management overhead dominates.

[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: speedup, 0 to 18. X-axis: total number of actors/processes, 0 to 256.]

Figure 14 Speedup for the memory-intensive benchmark with no data reuse and strong scaling.

As a continuation of the previous discussion we can now look at Figure 14, showing how the two systems benefit from splitting up the work across many actors. With EMCA, we can see something interesting when testing the large vector. The speedup increases dramatically as the number of actors increases, peaking at a 16x performance gain for 64 actors. This is because the 1 MB vector is split up among the actors so that, at some point, each actor can put its chunk inside the DSP data scratchpad instead of the shared memory. As described in Section 10.2, all this has to be done in code instead of hardware. Since the benchmark is written in such a way that anything smaller than a threshold value is put in the DSP data scratchpad and anything larger in the shared memory, there is a point (somewhere between 32 and 64 actors in total) where we cross this threshold. The biggest jump in speedup is seen when going from 32 to 64 actors in total, so the results seem correct. After that point, the overhead from using many actors starts to show and the speedup tapers off.

With the smaller vector size tested on EMCA, the same effect is seen but on a much smaller scale. In the initial execution the data vectors are put in the shared memory, but very soon the vectors are instead put in the DSP data scratchpad. The maximum speedup observed is 4.6 with 8 actors in total. Then the overhead takes over and the speedup starts to decrease again.

With the x86-64 system, we see a similar speedup phenomenon as on EMCA for both the larger and the smaller vector. Here, the memory management is done with malloc and free, and the memory system is abstracted away from the programmer. Examining the numbers more closely, the test with the larger vector peaks at 32 actors in total (16 using memory), and the test with the smaller vector at 16 actors in total (8 using memory). This translates to allocation of vectors of 64 kB and 16 kB respectively. It is not entirely apparent why the performance peaks at these particular points. The implementation of malloc and free can depend on both the compiler and the operating system, which is a factor that has not been investigated in this project. Also, there is no obvious caching benefit, since there is no data reuse in this case.


11.3 Memory-intensive benchmark with data reuse

[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: execution time (normalized per data series), 0 to 90. X-axis: total number of actors/processes, 0 to 1024.]

Figure 15 Normalized execution times for the memory-intensive benchmark with data reuse and weak scaling.

Figure 15 shows how the test systems behave with a high degree of data reuse and weak scaling. Both the larger and the smaller vector are larger than the threshold value used for the DSP data scratchpad on the EMCA system, meaning that all data is stored in shared memory throughout this test. We see a linear behavior with normalized execution times ending at just above 80 for both vector sizes. The conclusion we can draw from this is that the allocation of shared memory behaves in a very predictable way.

The caches in the x86-64 system show their strengths in this benchmark test, with great scaling properties. The normalized execution times stay at just above 1 until we hit 256 actors, which is the same as the number of virtual cores in the system. More importantly, this means that the number of actors doing memory-intensive work is 128, which is the same as the number of physical cores. After that the normalized execution times increase, ending at around 5 for both vector sizes. The 1 MB vector is too large to fit in the L1D (32 kB) and L2 (512 kB), but it fits in the shared L3 cache (16 MB shared across four physical cores). The 128 kB data vector is also too large for the L1D cache but fits inside the private L2 cache. The actual execution times show that the test with the larger vector takes about 10x longer than the test with the smaller vector. This is assumed to reflect the performance difference between the L2 and L3 caches in this particular system.

[Figure: line graph with series 1MB Vector and 128kB Vector. Y-axis: cache miss ratio, 0.00% to 2.00%. X-axis: total number of actors/processes, 0 to 1024.]

Figure 16 Cache miss ratio in L1D for the memory-intensive benchmark with data reuse and weak scaling.

The cache miss ratios observed in L1D in this test, with weak scaling, are shown in Figure 16. Note the scaling of the y-axis; the observed miss ratios range from 0.37% to 1.68%, so the miss ratio stays very low at all times. Since the L1D in the system is 32 kB, none of the vectors fit inside L1D at any time during this test. The low cache miss ratios are likely due to a very accurate hardware prefetcher, which can detect sequential access patterns and bring data into the L1D cache before it is needed. It is hard to analyze the fluctuations observed in the cache miss ratios, since the numbers are so small. The overall conclusion is instead that the hardware prefetcher does a good job of keeping the miss ratios low in L1D. It could be interesting to collect this metric with the hardware prefetcher turned off; this is discussed in Section 13.3.


[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: execution time (normalized per data series), 0 to 2. X-axis: total number of actors/processes, 0 to 1024.]

Figure 17 Normalized execution times for the memory-intensive benchmark with data reuse and strong scaling.

When looking at the strong scaling properties of this benchmark test in Figure 17, we can see that both the EMCA system and the x86-64 system benefit greatly from splitting up the memory-intensive work with data reuse across multiple processor cores (note the scaling of the y-axis here). With a large number of actors, we can see that EMCA outperforms the x86-64 system with better scaling properties.

The tests on x86-64 show a performance degradation when the number of actors passes the number of virtual cores. Just like in previous tests, the normalization makes the test with the smaller vector size stand out as though it takes the biggest performance hit. This occurs when the data chunk used by each actor gets so small that looping through it does not take a significant amount of time; instead, the process management overhead starts to show and the normalized execution time goes up. With the larger vector size, looping through the data (which is getting smaller and smaller per actor) still dominates the time measurements. Just like in previous tests, the test with the larger vector and the test with the smaller vector start at very different execution times with 4 actors in total (770 ms and 105 ms respectively) but end up at similar execution times with 1024 actors in total (181 ms and 170 ms respectively).


[Figure: line graph with series x86 1MB Vector, x86 128kB Vector, EMCA 1MB Vector, and EMCA 128kB Vector. Y-axis: speedup, 0 to 40. X-axis: total number of actors/processes, 0 to 384.]

Figure 18 Speedup for the memory-intensive benchmark with data reuse and strong scaling.

When looking at the speedup numbers in Figure 18, the benefit of using locality in the memory system is obvious. With EMCA, the test with the larger vector size peaks at a speedup of 15.3 with 48 actors doing memory-intensive work (96 actors in total). The speedup with the smaller vector size reaches similar numbers, but declines more rapidly as the number of actors grows large. This is because the proportion of time used for looping through the vector gets very small, and the overhead costs instead become more visible.

In the x86-64 system, we can see that the test with the larger vector size achieves great speedup numbers. Initially, the vector fits in the L2 cache but not the L1D cache. Each actor's chunk should fit inside the 32 kB L1D cache when the number of actors using memory hits 1 MB / 32 kB = 32, so 64 actors in total. We expected to see super-linear speedup at that point, but we do not. It is possible that the distribution of actors onto processor cores prevents this from happening, if many actors doing memory-intensive work are placed on the same physical core (as discussed in Section 11.1). Instead, the numbers increase in a more or less linear way until hitting a speedup of 34 with 256 actors in total (128 using memory). This means that, when splitting up a large vector across actors, we get the maximum performance benefit when using all the available cores in the system. After that, the process management overhead starts to show and the speedup goes down quickly.

With the smaller vector, the data should fit in the L1D cache when the number of actors using memory gets to 128 kB / 32 kB = 4, so 8 actors in total. We can see that the speedup increases most rapidly in the beginning, which means that we get the most benefit from adding processor cores here. The best speedup achieved is 18.3 at 192 actors in total. Then the process management overhead starts to show, and after passing 256 actors the speedup declines rapidly.

[Figure: line graph with series 1MB Vector and 128kB Vector. Y-axis: cache miss ratio, 0.00% to 4.00%. X-axis: total number of actors/processes, 0 to 1024.]

Figure 19 Cache miss ratio for the memory-intensive benchmark with data reuse and strong scaling.

Figure 19 shows the cache miss ratios observed when running this particular test case. Just like in Figure 16, we can see that the cache miss ratios stay very low; the highest observed miss ratio is 3.5% for the 128 kB vector with 8 actors in total, and the lowest is 0.65% with the 1 MB vector and 64 actors in total. The small peaks seen at small numbers of actors are likely due to the characteristics of the hardware prefetcher in the system, which is not investigated in detail here. The conclusion from Figure 19 is, just like in the previous test, that the hardware prefetcher helps to keep the cache misses in L1D fairly low at all times.

11.4 Discussion on software complexity and optimizations

There are some factors affecting the performance results which have not been investigated in detail. One important aspect is that there are different levels of software complexity and optimization in the two test systems. Since CBB was developed specifically to run on EMCA, its implementation is highly optimized for this hardware. This is one of the main benefits of co-designing software and hardware. A potential drawback of this approach is that it can be complex to migrate the software to another hardware platform. In this project MPI turned out to be a helpful tool for getting a working prototype on a new hardware platform, but no time has been spent optimizing the code for the x86-64 test machine. Instead the MPI-based prototype produced results that are assumed to be "good enough" to reason about, which leads to useful conclusions. The performance achieved with MPI can be viewed as a lower bound on what is actually achievable on an x86-64 system.

In addition, it would be possible to optimize the benchmark-specific code for the different hardware platforms. This was briefly discussed in Section 5.2.1. An example would be to change the access pattern to the data vectors in the memory-intensive tests with data reuse. It would then be possible to loop through a chunk of the vector small enough to fit inside the DSP data scratchpad and the L1D cache respectively, before moving on to the next chunk. This would dramatically increase the use of locality in the software, and would likely result in higher performance on both EMCA and x86-64. A sketch of such a blocked access pattern follows below.
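A minimal sketch of the idea, assuming a simple read-modify-write pass over the vector with a configurable number of reuse passes; CHUNK would be tuned to the L1D cache on x86-64 or to the DSP data scratchpad on EMCA:

#include <stddef.h>

#define CHUNK (32 * 1024 / sizeof(int))   /* e.g. sized for a 32 kB L1D */

void process_blocked(int *v, size_t n, int reuse_count)
{
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = (base + CHUNK < n) ? base + CHUNK : n;

        /* Reuse this chunk repeatedly while it is resident in fast
           memory, before moving on to the next chunk. */
        for (int r = 0; r < reuse_count; r++)
            for (size_t i = base; i < end; i++)
                v[i] += 1;
    }
}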

12 Conclusions

The objective of this project was to investigate connections between parallel programming models and the hardware that they run on. We saw that the EMCA processors differ in many ways from commercial designs like x86-64 and ARMv8. Their memory systems and processor cores were described and analyzed, highlighting a number of their key characteristics.

The project has provided insights into how a programming model can be adapted to run on new hardware. The MPI library was suggested as a tool for making a CBB implementation that can be used on both x86-64 and ARMv8. A prototype supporting actors with message passing capabilities was developed.

Benchmarks targeting compute-intensive and memory-intensive scenarios were developed and tested on EMCA and a high-end x86-64 machine. In the compute-intensive tests we saw that EMCA had better scaling properties than the x86-64 system with very many actors, but worse scaling properties with few actors. A modified benchmark was tested on EMCA, strengthening the theory that EMCA is very good at heavy calculations and that the original test was not compute-intensive enough. The memory-intensive tests showed how both systems can utilize locality in their memory systems by putting data in a cache or a scratchpad memory. We also saw how the hardware prefetcher in the x86-64 system helps to keep cache miss ratios low. Overall, EMCA showed more linear scaling behavior than the x86-64 system. Having more actors than the number of processor cores gave an immediate performance penalty on the x86-64 system in all tests, something that was not as apparent on EMCA.

13 Future work

There are a number of aspects that did not fit within the scope of this project but could be investigated in the future. Some examples are listed below.

13.1 Implement a CBB transform with MPI for x86-64

As previously stated, this project focused on creating a proof-of-concept model for CBB on x86-64. This could be expanded to become a full CBB implementation. The CBB transform could then take any CBB code and automatically generate the corresponding MPI code. It is difficult to estimate how comprehensive and time-consuming this work would be.

13.2 Expand benchmark tests to cover more scenarios

The benchmark tests used within this project could be expanded and modified to test more performance aspects. For example, it could be interesting to use a real baseband application with the kind of signal processing operations used in a cellular base station. Comparing the performance of such an application across multiple hardware architec- tures might be relevant for Ericsson.

13.3 Run benchmarks with hardware prefetching turned off

To dive deeper into the characteristics of the cache system in the x86-64 machine, it could be interesting to run the benchmark tests with the hardware prefetcher turned off. This would worsen performance overall, but the results would then be easier to relate to the cache specifications.


13.4 Combine MPI processes with OpenMP threads

There could be a benefit in splitting up tasks inside an MPI process using threads. As mentioned in Section 6.1, MPI only supports the creation of processes. It could be interesting to make a CBB implementation using MPI for process creation and something like OpenMP for thread-level parallelism. This is an approach which is widely used and studied, for example by Rabenseifner et al. [21].

13.5 Run the same code in an ARMv8 system

As mentioned in Section 6.1.2, MPICH is portable and can run on ARMv8-based systems as well. The code written in this project could easily be compiled and run on such a system, testing the same or other benchmarks. This would be an easy way to make additional comparisons between EMCA, x86-64 and ARMv8.


References

[1] G. A. Agha, Actors: a model of concurrent computation in distributed systems. Cambridge, Mass: MIT Press, 1986.

[2] B. Anuradha and C. Vivekanandan, “Usage of scratchpad memory in embedded systems — State of art,” in 2012 Third International Conference on Computing, Communication and Networking Technologies (ICCCNT’12), 2012, pp. 1–5.

[3] ARM Holdings. (2020, Sep.) ARM Cortex-A Series Programmer’s Guide for ARMv8-A – Chapter 11. Caches. Accessed: 2020-09-17. [Online]. Available: https://developer.arm.com/documentation/den0024/a/caches

[4] B. Barney. (2020) Message Passing Interface (MPI). Lawrence Livermore National Laboratory. Accessed: 2020-10-27. [Online]. Available: https://computing.llnl.gov/tutorials/mpi/

[5] Berkeley Design Technology, Inc. (2006) The Art of Processor Benchmarking: A BDTI White Paper. Accessed: 2020-10-06. [Online]. Available: https://www.bdti.com/MyBDTI/pubs/artofbenchmarking.pdf

[6] N. Chong and S. Ishtiaq, “Reasoning about the ARM Weakly Consistent Memory Model,” in Proceedings of the 2008 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness: Held in Conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’08), ser. MSPC ’08. New York, NY, USA: Association for Computing Machinery, 2008, p. 16–19. [Online]. Available: https://doi-org.ezproxy.its.uu.se/10.1145/1353522.1353528

[7] W. Chu. (2020, Jun.) Caching and Memory Hierarchy. Accessed: 2020-09-16. [Online]. Available: https://medium.com/@worawat.chu/caching-and-memory-hierarchy-fc7a9b9efcca

[8] I. Cutress. (2019, Aug.) Cache and TLB updates - The Ice Lake Benchmark Preview: Inside Intel’s 10nm. Accessed: 2020-09-17. [Online]. Available: https://www.anandtech.com/show/14664/testing-intel-ice-lake-10nm/2

[9] B. Falsafi and T. F. Wenisch, “A Primer on Hardware Prefetching,” Synthesis Lectures on Computer Architecture, vol. 9, no. 1, pp. 1–67, 2014. [Online]. Available: https://doi.org/10.2200/S00581ED1V01Y201405CAC028

[10] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, concept, and design of a next generation MPI implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004, pp. 97–104.

[11] G. Georgis, A. Thanos, M. Filo, and K. Nikitopoulos, "A DSP Acceleration Framework For Software-Defined Radios On x86-64," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 1648–1652.

[12] J. L. Gustafson, "Reevaluating Amdahl's Law," Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988. [Online]. Available: https://doi-org.ezproxy.its.uu.se/10.1145/42411.42415

[13] J. L. Hennessy and D. A. Patterson, Computer Architecture - A Quantitative Approach (5th Edition). Elsevier, 2012. [Online]. Available: https://app.knovel.com/hotlink/toc/id:kpCAAQAE11/computer-architecture/computer-architecture

[14] M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era,” Computer, vol. 41, no. 7, p. 33–38, Jul. 2008. [Online]. Available: https://doi.org/10.1109/MC.2008.209

[15] J. Absar and F. Catthoor, "Analysis of scratch-pad and data-cache performance using statistical methods," in Asia and South Pacific Conference on Design Automation, 2006, 6 pp.

[16] N. D. E. Jerger, E. L. Hill, and M. H. Lipasti, “Friendly fire: understanding the effects of multiprocessor prefetches,” in 2006 IEEE International Symposium on Performance Analysis of Systems and Software, 2006, pp. 177–188.

[17] X. Li. (2018, Nov.) Scalability: strong and weak scaling. PDC Center for High Performance Computing – KTH Royal Institute of Technology. Accessed: 2020-09-17. [Online]. Available: https://www.kth.se/blogs/pdc/2018/11/scalability-strong-and-weak-scaling/

[18] Linux Kernel Organization, Inc. (2020, Sep.) perf(1) — Linux manual page. Accessed: 2020-11-19. [Online]. Available: https://man7.org/linux/man-pages/man1/perf.1.html

[19] B. Mann and N. Stephens. (2019, Apr.) New Technologies for the Arm A-Profile Architecture. Accessed: 2020-09-23. [Online]. Available: https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/new-technologies-for-the-arm-a-profile-architecture


[20] V. Nagarajan, D. J. Sorin, M. D. Hill, and D. A. Wood, "A Primer on Memory Consistency and Cache Coherence, Second Edition," Synthesis Lectures on Computer Architecture, vol. 15, no. 1, pp. 1–294, 2020. [Online]. Available: https://doi.org/10.2200/S00962ED2V01Y201910CAC049

[21] R. Rabenseifner, G. Hager, and G. Jost, "Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes," in 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2009, pp. 427–436.

[22] J. Reinders. (2013, Jul.) Intel AVX-512 Instructions. Accessed: 2020-09-23. [Online]. Available: https://software.intel.com/content/www/us/en/develop/articles/intel-avx-512-instructions.html

[23] T. Singh, S. Rangarajan, D. John, R. Schreiber, S. Oliver, R. Seahra, and A. Schaefer, "2.1 Zen 2: The AMD 7nm energy-efficient high-performance x86-64 microprocessor core," in 2020 IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 42–44.

[24] N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico, and P. Walker, "The ARM Scalable Vector Extension," IEEE Micro, vol. 37, no. 2, pp. 26–39, 2017.

[25] The MPICH Project. (2019, Nov.) MPICH Overview. Accessed: 2021-01-05. [Online]. Available: https://www.mpich.org/about/overview/

[26] The MPICH Project. (2019, Nov.) MPICH User Documentation. Accessed: 2021-01-05. [Online]. Available: https://www.mpich.org/static/docs/latest/

[27] J. Thiel. (2006) An Overview of Software Performance Analysis Tools and Techniques: From GProf to DTrace. Washington University in St. Louis. Accessed: 2020-10-06. [Online]. Available: https://www.cse.wustl.edu/~jain/cse567-06/ftp/sw_monitors1/

[28] D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous Multithreading: Maximizing on-chip Parallelism," SIGARCH Comput. Archit. News, vol. 23, no. 2, pp. 392–403, May 1995. [Online]. Available: https://doi-org.ezproxy.its.uu.se/10.1145/225830.224449

[29] Wikipedia contributors. (2019) MPICH – Wikipedia. Accessed: 2020-11-20. [Online]. Available: https://en.wikipedia.org/wiki/MPICH

[30] Wikipedia contributors. (2020) OSI model – Wikipedia. Accessed: 2020-11-20. [Online]. Available: https://en.wikipedia.org/wiki/OSI_model
