Eindhoven University of Technology

MASTER

A flexibility metric for processors

Huang, S.

Award date: 2019


Department of Electrical Engineering
Electronic System Group

A Flexibility Metric for Processors

Master Thesis Report

Shihua Huang 1028224

Supervisors: Luc Waeijen Henk Corporaal Kees van Berkel

Version 1

Eindhoven, February 2019

Abstract

In recent years, the substantial growth in computing power has encountered a bottleneck, as the miniaturization of CMOS technology reaches its limits. The exponential scaling described by Moore's law no longer occurs. In the coming years of computing, new advancements have to be made on the architectural side. To advance the state of the art, we need a good understanding of the various trade-offs in architectural design. One property people refer to is flexibility. The claim that tradeoffs exist between flexibility and performance/energy efficiency has been used frequently and in diverse contexts. Processor flexibility, however, has not been properly defined, and as such these claims can neither be validated nor exploited. It is therefore impossible to conduct a quantitative flexibility comparison among modern processors such as ASICs, FPGAs, DSPs, GPUs, and CPUs. The lack of such a measure prevents designers from obtaining a deeper understanding of this property and from characterizing its relation to other features of a processor, such as performance and energy efficiency.

In this work, a quantifiable flexibility metric is proposed at the application level, making it applicable to processors with diverse architectures and allowing a valid quantitative comparison between processors. Using a novel approach to normalize applications based on their intrinsic workload, the proposed flexibility metric is applied to 24 platforms across CPUs, GPUs, DSPs, and FPGAs. More importantly, the obtained results confirm the tradeoffs between flexibility and performance/energy/area efficiency, and hint at underlying relations between other properties such as area and energy efficiency. This work aims to provide a starting point for understanding processor flexibility, and to raise awareness and discussion for future designs.

Preface

Hi, this is a flexible preface. The topic is flexibility, but not completely related to processors, because my master thesis is not only full of flexible or inflexible processors, but also persons. So here I will introduce the flexible and inflexible persons in my thesis story.

The preface starts with my motivations for selecting this project. I liked the course Embedded Computer Architecture (ECA) from the Electronic System group, which provided me with a starting point in the field of computer architecture. Another reason is certainly the project itself. I still remember the first meeting with Luc, my thesis mentor. He told me that the word "flexible" can also be an adjective for processors, not in the sense of bending them. Then he started the "story" about how diverse processors may differ in flexibility. He definitely did a good job at selling his "story" to me.

Luc is really amazing; he really knows everything. In this way, he is really flexible. Maybe his hand is able to reach his toe? He told me a lot of "stories" during the whole thesis, and sometimes it was kind of hard to stop him from immersing himself. Of course I also enjoyed the stories. I always wonder whether he has a different brain setup, designed to be so general-purpose. However, general-purpose processors are claimed by most to have high energy consumption. So here comes the question: how can this brain setup be powered sufficiently by only eating bread for lunch? Who knows! Anyway, Luc is the representation of a flexible person I met during my thesis. To conclude, he is like a CPU or FPGA, which is extremely flexible.

Compared to Luc, my friend UDID, a sleepy Magikarp, is extremely inflexible. Six months of observations revealed that the reason she is always sleepy is only eating bread! Indeed, I am insulting her unhealthy diet. Besides, she also has an incredible brain setup, which 90% of the time is busy with processing but produces no output. That is truly a time-consuming process with significant energy overhead. Half a year ago, it was really challenging for me to communicate with her. Mostly when you are saying something, she might keep quiet or will give a dot "." to the conversation. Such an awkward, inflexible physics girl! Anyhow, I did enjoy her mostly silent company during the whole thesis. With my efforts, she eventually talked a lot more than before. Impressive! So the conclusion is: UDID is the least flexible ASIC? However, this special ASIC does not process anything useful and mostly gets stuck in deadlocks.

And me? I consider myself a domain-specific accelerator, a GPU: a bit more flexible than the UDID ASIC and of course less flexible than the Luc CPU. This GPU could have high performance and efficiency but low flexibility. Its goal is to become as flexible as the Luc CPU!

During the whole thesis, every day was a fruitful day for me. Due to decent scheduling and mapping, I could efficiently master new knowledge and handle different challenges. The "incredible" efficiency resulted in an interesting misunderstanding. One day Luc tried to convince me to spend less time on study and more time enjoying my life, since he thought I worked so hard every day and even during weekends. To be honest, I was completely confused at that moment, since the thesis period was the most relaxing time of my master phase. However, this misunderstanding is also warm-hearted. To conclude, I am definitely a lovely pink GPU with high energy efficiency and performance.
In the future, I will work hard at enriching myself and practicing my story skills in different domains, eventually becoming as flexible as the Luc CPU. Dear Henk and Kees, I appreciate your help for my thesis. Are you reading this line? If so, you must be a flexible person as well, congratulations! I hope you like my flexible preface. If not, never mind, have a nice day! :p

Contents

1 Introduction
  1.1 Background
  1.2 Motivation

2 Literature Research
  2.1 Flexibility in Other Fields
    2.1.1 Manufacturing Systems
    2.1.2 Networks
    2.1.3 Power Systems
  2.2 Flexibility of Processors
    2.2.1 Single-ISA Heterogeneous Processors
    2.2.2 Computer Architecture
    2.2.3 Processor Versatility

3 Flexibility as a New Measure
  3.1 Flexibility Definition
  3.2 Flexibility Measure
    3.2.1 Additional Properties of GM and GSD
  3.3 Flexibility Framework
  3.4 Data Normalization
    3.4.1 Normalize to Dataset
    3.4.2 Normalize to Baseline
    3.4.3 Summary of Normalization

4 Experiment Setup
  4.1 Benchmarks
  4.2 Platforms
    4.2.1 CPU
    4.2.2 GPU
    4.2.3 FPGA
    4.2.4 DSP
  4.3 Compiler Directives
    4.3.1 CPU
    4.3.2 FPGA
    4.3.3 DSP

5 Implementation
  5.1 CPU & GPU
  5.2 FPGA
  5.3 DSP
  5.4 Intrinsic Workload Estimator
    5.4.1 IR Interpreter
    5.4.2 Intrinsic Transistors

6 Methodologies
  6.1 Flexibility
  6.2 Secondary Metrics
    6.2.1 Performance
    6.2.2 Energy Efficiency
    6.2.3 Area Efficiency
  6.3 Parallelism
  6.4 Approximation

7 Results
  7.1 Flexibility Analysis
    7.1.1 Native Flexibility Results
    7.1.2 Compiler Directives
  7.2 Other Results
    7.2.1 Performance vs Parallelism
    7.2.2 Area Efficiency vs Energy Efficiency

8 Conclusions

9 Future Recommendations

Bibliography

Appendix

A LLVM Pass: libDynCountPass.so

B Operations in Verilog For Synthesis

C Measurement Results

Chapter 1

Introduction

In this chapter, the constraints faced by processor designers today are discussed, and the importance of introducing a quantifiable flexibility metric for processors is emphasized.

1.1 Background

Over the past years, the explosive growth of semiconductor integration has revolutionized the modern world. As predicted by Moore's law in the 1960s, the computing power of processors has grown exponentially, approximately doubling every 18-24 months [1]. This incredible growth benefits from the miniaturization of CMOS technology, enabling billions of transistors on a single chip. As this technology has reached its limits, the complexity of computer architecture has increased dramatically in order to keep up with Moore's law and achieve superior performance, for example through parallelism at all levels of the hardware/software stack.

In the case of a single processor with a Single Instruction Single Data (SISD) architecture, extending it with processing units (PUs), floating point (FP) hardware, and a single instruction multiple data (SIMD) instruction set can turn it into a SIMD machine. In general, vector and SIMD machines exploit fine-grained and instruction-level parallelism, but pay a significant price in architectural and programming complexity. The development of such instruction-level or thread-level parallelism inevitably introduces overhead, as the increase in speed does not scale linearly with the number of PUs or processors. Certainly, employing more PUs or dedicated hardware increases performance when executing certain types of applications where instruction-level parallelism can be fully exploited. However, the processor loses the capability to equally support other types of applications, which barely benefit from SIMD extensions, and thereby becomes less flexible. In addition, for those applications this additional hardware performs no computations on data and causes an energy overhead, reducing energy efficiency.

When considering energy efficiency in processor design, another architectural feature, flexibility, is frequently invoked. As examined by Hameed et al., there exists a tradeoff between flexibility and energy efficiency in state-of-the-art architectures [2]. Designing ASICs for special-purpose applications delivers far higher performance under tight area and energy consumption budgets. An ASIC can achieve 2-3 orders of magnitude higher energy efficiency and performance than General-Purpose Processors (GPPs), such as CPUs [1]. However, due to such a high level of specialization to its applications, an ASIC can hardly execute other types of applications, exposing its inflexibility. In contrast, GPPs are designed to perform moderately well regardless of application type. Thus, GPPs qualitatively outperform ASICs in terms of flexibility. This seems to confirm the tradeoff between flexibility and performance/energy efficiency.


1.2 Motivation

Properties like performance, power, and energy efficiency have been defined in a quantitative manner. For instance, energy efficiency is normally measured in performance per watt, where performance is inversely proportional to execution time. With these metrics, quantitative comparisons can be conducted among different architectures. In terms of flexibility, there exist various definitions and synonyms in the field of computer architecture, such as adaptability, programmability, generality, and versatility. When taking flexibility as the adaptability of processors to new computations, a programmable processor that can be reused across applications is the most flexible. On the other hand, a processor with fixed logic, such as an ASIC, cannot be adapted, exposing its inflexibility [3]. Among programmable processors, field programmable devices are claimed by some to be less flexible than software programmable processors due to their limited programmability [4]. However, from another perspective, if flexibility is evaluated only by how well a processor can support different applications, FPGAs would be the most flexible, as any hardware, including DSPs, GPUs, and CPUs, can be instantiated on an FPGA.

The fact that the qualitative arguments about processor flexibility vary this much can be explained by the lack of a common understanding of what processor flexibility is, along with the absence of a proper way to quantify flexibility. Researchers can only compare flexibility in a qualitative manner. Karuri et al. compare processor alternatives for current and future Systems on Chip (SoCs) by qualitatively positioning these processors in terms of performance, power dissipation, and flexibility [4]. Fasthuber et al. state that flexibility and energy efficiency are conflicting design goals [5]. Besides, a number of approaches and designs have already been attempted to balance energy efficiency and flexibility in specialized computing, such as the transition to homogeneous and heterogeneous multi-core systems [2]. A heterogeneous processor with a diverse set of cores is considered to have greater flexibility and better energy efficiency [6]. Although researchers make claims about flexibility, a valid common comparison cannot be performed without a quantifiable flexibility metric. In this work several candidate definitions are discussed, and then a quantifiable flexibility metric is proposed which is applicable to diverse architectures.

The underlying motivation for a quantifiable flexibility metric is to give a common notion and measure to processor flexibility. With this metric, hypothetical relations between flexibility and other properties can be further investigated. In product research and development, computing platforms are required to sufficiently support new or updated algorithms, as algorithms change at a striking speed. A flexibility metric can help system architects to design and select computing platforms, whilst improving the understanding of which technical concepts lead to better flexibility. The main contributions of this work are:

• Proposal of a quantifiable flexibility metric.
• Flexibility measurements of 24 platforms on 14 benchmarks.
• A novel approach to normalize applications based on their intrinsic workload.
• An open source tool to automate the extraction of intrinsic workload for arbitrary programs.
• An examination of the tradeoffs between flexibility and other properties.

The remainder of the report is organized as follows. Chapter 2 discusses the existing flexibility definitions in literature. Chapter 3 introduces a new metric for processor flexibility and a normalization method based on the intrinsic workload of applications. Chapter 4 explains the experiment setup. The implementations on the diverse platforms and the workload estimator are described in Chapter 5. The applied methodologies and the flexibility results are analyzed in Chapters 6 and 7. Finally, concluding remarks and future recommendations are provided in Chapters 8 and 9.

Chapter 2

Literature Research

In the field of computer architecture, few studies have striven to define and quantify processor flexibility. Therefore, this chapter starts by discussing the existing flexibility definitions in other fields. Next, several candidate definitions for processor flexibility are discussed.

2.1 Flexibility in Other Fields

2.1.1 Manufacturing Systems

For manufacturing systems, flexibility is defined as the sensitivity of the system to changes [7]: the lower the sensitivity, the higher the flexibility. A flexible manufacturing system is claimed to be less sensitive to changes. Two distinct and generic viewpoints on flexibility are reviewed by Chryssolouris [7]. The first viewpoint considers flexibility as an intrinsic attribute of systems, represented as a function of the available characteristics of the manufacturing system. For instance, the entropic flexibility index (FI) measures system flexibility based on the available choices that the system provides. The second viewpoint considers flexibility as a relative attribute which depends not only on the manufacturing system itself, but also on the external demands placed upon it. For example, the flexibility of a system or machine can be measured relative to a reference task set [8]. In conclusion, the first viewpoint obviates the need for external demands, but thereby also ignores the actual variations introduced by those demands, so the obtained flexibility value may be meaningless. Hence, the second viewpoint, which considers external demands, is preferable. Applying the second viewpoint to processors decouples system flexibility from the internal complexity of the system, which makes it relatively easy to measure.

2.1.2 Networks

For networks, flexibility refers to the ability to support new requests, e.g. changes in requirements or new traffic distributions [9]. Kellerer et al. challenge the network with "new requests" to examine network flexibility. The proposed flexibility metric is based on an architecture and a network topology, not specific to certain types of systems. This approach quantifies system changes above the basic architecture level. The metric defined for network flexibility is formalized in Eq. 2.1 [9].

\text{Network flexibility} = \frac{|\text{supported new requests within time threshold } T|}{|\text{total given new requests}|} \quad (2.1)

The network flexibility defined by this equation lies within the interval [0, 1], since the number of supported new requests is always smaller than or equal to the total number of given requests. For networks, the potential changes are new requests, while for processors we consider diverse types of applications as changes. Another essential point is that the evaluation criterion for a network is to check whether new requests are supported or not, whereas for executing applications on processors we also consider performance and energy efficiency when assessing application support.


2.1.3 Power Systems

For power systems, flexibility refers to the ability of a power system to deploy its resources in response to changes in the net load [10]. Net load is defined as the residual demand that must be supplied after using all Renewable Energy (RE). With the increased share of RE in power systems, the unpredictability and uncertainty associated with the net load increase, due to the high variation in the output of intermittent renewable energy sources. It becomes crucial for power system planners to consider operational characteristics that might influence system flexibility. Vishwamitra and Sayed propose a composite metric for assessing the flexibility available from generating units [11]. A comprehensive set of flexibility parameters of a power generator is considered, together with parameter importance to weight these parameters. In the final step, the robustness of the Composite Flexibility Metric (CFM) is assessed, and the selected normalization scheme is examined and adjusted in a sensitivity analysis step. Figure 2.1 illustrates the steps to construct a CFM [11].

Figure 2.1: Steps to construct a CFM

This framework ensures that the methodological decisions in each construction step are made scientifically. Some worthwhile elements can be extracted and applied to quantifying processor flexibility, such as the normalization step, which enables the comparison among individual indicators in different measurement units. In general, the examples above consider flexibility as a system property and quantify flexibility as the insensitivity of the system to external changes, instead of formulating flexibility as a function of diverse system parameters. This approach is suitable for systems with complex designs, such as processors.

2.2 Flexibility of Processors

2.2.1 Single-ISA Heterogeneous Processors

A single-Instruction Set Architecture (ISA) heterogeneous multi-core architecture is a chip multiprocessor composed of a diverse set of cores with varying size, performance, and complexity [12]. The underlying motivation for single-ISA heterogeneity is that a diverse set of cores can enable runtime power flexibility and deliver better performance.

Figure 2.2: Two possibilities for selecting four cores

To measure flexibility in single-ISA heterogeneous processors, a clumpiness metric for measuring entropy-based diversity is proposed, which states how well the selected heterogeneous cores cover the design space [13]. A small clumpiness indicates that a set of cores maximizes runtime flexibility. The measure of clumpiness is based on the relation between normalized time and power.


Two examples are presented in Figure 2.2, where points C1, C2, C3 and C4 represent four cores with different configurations. By benchmarking one application on each core separately, the four cores are placed in terms of normalized time and power. Clumpiness is formulated as Eq. 2.2 [13].

\text{clumpiness} = \frac{(x_1 - r_1) + (r_2 - x_n) + \sum_{i=2}^{n-1} d_i}{r_2 - r_1}, \qquad d_i = \left|x_i - \frac{x_{i-1} + x_{i+1}}{2}\right| \quad (2.2)

Clumpiness is calculated over the range [r1, r2] for n ordered points x. r1 and r2 indicate the lower and upper bound of the given range, which is [0, 2] in both Selection 1 and Selection 2. x1 and xn represent the first and the last point, namely the cores with the minimum and maximum normalized power. The distance between two points is measured as the Euclidean distance. In Fig. 2.2, x1 and xn are C1 and C4 respectively. The summation term measures the distance from each intermediate point to the halfway mark between its two neighboring points. Thus, clumpiness quantifies diversity based on the distances between all local points. With this metric, Selection 2 is more clustered than Selection 1; thus Selection 1, with a smaller clumpiness, is evaluated to be more flexible [13].

The drawback of this approach is that the metric is only applicable to heterogeneous processors, and the flexibility value is application-specific, thus varying with different applications. Remarkably, quantifying flexibility as the statistical dispersion of the system behavior is promising, since in this manner flexibility is defined relatively while still reflecting the system's capability. By replacing the different cores with applications benchmarked on one processor, flexibility could also be quantified as a clustering degree based on the distances between the normalized performance or energy efficiency of different types of applications. Thus, the clumpiness measure would be a good candidate for quantifying processor flexibility.
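To make the measure concrete, the sketch below evaluates Eq. 2.2 for two hypothetical selections of four points, treating each point as a scalar position along a single normalized axis; the full metric in [13] places cores in the normalized time/power plane and uses Euclidean distances. All values are invented for illustration.

```c
#include <math.h>
#include <stdio.h>

/* Clumpiness per Eq. 2.2 for points x[0..n-1], sorted ascending within [r1, r2].
   A smaller value means a more evenly spread, hence more flexible, selection. */
double clumpiness(const double *x, int n, double r1, double r2) {
    double sum = (x[0] - r1) + (r2 - x[n - 1]);           /* boundary terms      */
    for (int i = 1; i < n - 1; i++)                       /* intermediate points */
        sum += fabs(x[i] - 0.5 * (x[i - 1] + x[i + 1]));
    return sum / (r2 - r1);
}

int main(void) {
    double spread[]    = {0.3, 0.9, 1.4, 1.9};  /* evenly spread cores */
    double clustered[] = {0.3, 0.4, 0.5, 1.9};  /* clustered cores     */
    printf("spread:    %.3f\n", clumpiness(spread, 4, 0.0, 2.0));
    printf("clustered: %.3f\n", clumpiness(clustered, 4, 0.0, 2.0));
    return 0;
}
```

The clustered selection yields the larger clumpiness value and would therefore be judged less flexible.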

2.2.2 Computer Architecture

A model to define architecture flexibility is proposed by Fasthuber et al. [5]. The motivation for proposing this model is identical to ours: to provide a quantifiable measure of flexibility for different architectures and designs. The model aims to evaluate, in a true/false manner, whether an architecture provides sufficient flexibility to support the minimum requirements for all benchmark parameters.

Figure 2.3: Design flow of the flexibility evaluation model

As presented in Figure 2.3, the proposed model is composed of two parts: an architecture-independent part and an architecture-dependent part. In the architecture-independent part, system/algorithm-related requirements are extracted by profiling a set of codes. Due to this independence, the extracted system requirements can be applied to architectures with different characteristics, which constitutes the architecture-dependent part.

Although the authors claim the model is established in a quantitative manner, a partly qualitative evaluation of the hardware scalability is still involved when examining the architecture capability. Besides, it is hard to apply this model, as currently there is no tool that can comprehensively analyze codes and then extract the corresponding hardware requirements. Furthermore, if processor capability is only evaluated in a true/false manner, then there is no distinction between two architectures that fulfill all requirements. However, the idea of using different applications as external changes to computer architectures is worth considering when determining the definition of processor flexibility.

2.2.3 Processor Versatility

A more generic metric that defines processor versatility is proposed by Van Berkel [14]. In this work, a versatile processor is claimed to be easy to program and to fairly support a variety of applications or uses with different algorithms, data types, etc. [14]. The proposed versatility is presented in Eq. 2.3, which considers the amount of information used to specify a unit of useful work as versatility.

\text{versatility} = \frac{\text{amount of information to specify a unit of work}}{\text{amount of useful work specified for that unit of work}} \quad (2.3)

For processors, a unit of work is defined as an instruction, with which the processor hardware is configured. Thus, the amount of information to specify an instruction can be represented by the instruction width of the Instruction Set Architecture (ISA). In this manner, versatility is defined as a property of the ISA, independent from specific ISA implementations and executed applications. The amount of useful work of instructions is extrapolated as the number of useful operations per instruction. Van Berkel further claims that extending the ISA for the same work or instruction, namely providing more instruction options, signifies increasing versatility. For same-sized instructions, overhead is introduced when more work is required, indicating that versatility decreases [14]. Processor versatility is further formulated as Eq. 2.4.

\text{versatility} = \frac{\text{average instruction size}}{\text{number of useful operations per instruction}} \quad (2.4)

As an example, a Fast Fourier Transform (FFT) based on radix-2 butterflies (Figure 2.4) is used to measure processor versatility. The reason for selecting the FFT is that FFT data is available for most architectures and the impact of diverse compilers can be eliminated, as FFT code is typically hand-optimized [14]. One complex radix-2 butterfly is assumed to take 8 operations: 4 multiplications and 4 additions/subtractions. Assuming the size of each FFT block is N and CPI indicates the typical Cycles Per Instruction, operations per instruction can be formulated as Eq. 2.5.

\text{ops per instruction} = \frac{8 \cdot \frac{N}{2}\log_2 N}{\#\text{cycles per FFT}} \times CPI \quad (2.5)

Figure 2.4: A complex radix-2 butterfly [14]

The definition of processor versatility is essentially the same as processor flexibility. Although this work is the most rigorous attempt to formally define flexibility, some shortcomings hinder its application. The metric seems impractical for processors that do not execute clocked instructions, such as FPGAs. Regarding the approach to obtain the number of useful operations, conducting the complexity analysis of applications is challenging without manual work. Furthermore, using operations as basic units implies that different operations, such as multiplications and additions, are weighted equally, which seems arbitrary.

Overall, flexibility has frequently been assessed based on external changes. Regarding modern processors, the three examples discussed above show that there are tremendous diversities in understanding and quantifying flexibility. The inherent properties of those metrics limit their scope of application, making them impractical for wide use, which is tackled by our proposed flexibility metric.

Chapter 3

Flexibility as a New Measure

After analyzing the existing flexibility definitions, in this chapter we give a new measure for processor flexibility at the application level. A flexibility framework is then provided to give an overview of the steps to assess processor flexibility. The decisions made in each step are explained in detail.

3.1 Flexibility Definition

In literature, the definitions of flexibility in diverse fields are always associated with changes. For power systems, changes are variations in the net load. For networks, changes refer to varying requirements or new traffic distributions in the network. One common point is that all these changes are generated by external demands.

Regarding processors, we consider diverse types of applications as changes. Therefore, the degree of dispersion in how processors support different types of applications is used to quantify flexibility. A flexible processor provides fair support to applications regardless of application type, while outperforming on only a few certain types of applications is considered inflexible. Hence, we define that processor flexibility refers to the invariance of an architecture's energy efficiency, normalized performance, or other secondary metrics, to changes of applications. Less variation indicates better flexibility.

In the proposed definition, flexibility is defined as a relative attribute, determined with respect to other metrics, e.g., performance-flexibility or energy-flexibility. A reference application set is used to assess the support capability of processors. Measuring in the application domain abstracts flexibility from the internal complexity of the architecture and from architectural designs. Still, this flexibility measure is able to reflect architectural designs. For instance, DSPs perform superbly on accumulations and floating-point computations due to their specialized instructions, large accumulator registers, and floating-point hardware units. However, when running an application that can be highly parallelized, such as image processing, a DSP is without doubt inferior to a GPU with massive parallel processing power.

For performance-flexibility in processor design, best-effort techniques including branch prediction and caches can incur considerable performance variation, such as the penalties caused by cache misses. Besides, parallelism techniques like hardware multithreading and vector processing, which benefit certain types of applications, fail to equally support other applications. The performance variation that is thereby introduced degrades performance-flexibility.

Note that processor flexibility in this context is only associated with the dispersion among the normalized results of different applications, irrespective of their absolute values, meaning that it is unnecessary for a flexible processor to have high performance or energy efficiency. This property inevitably creates possibilities to artificially inflate flexibility values, since unproductive operations can be added to narrow the gap between different applications, such as inserting nop operations into applications on a CGRA. However, this behavior lowers the performance and energy efficiency of the processor, which is normally undesirable. Hence, when using the proposed metric to quantify processor flexibility, the benchmark set that the flexibility is measured on should be provided, along with the absolute values of the secondary metrics, such as performance and energy efficiency.

3.2 Flexibility Measure

Several measures exist for quantifying the statistical dispersion among applications. In this section the chosen flexibility measure is justified by comparing some common measures. Common measures of statistical dispersion can be classified into two categories: robust measures and conventional measures. A typical example of a robust measure is the median absolute deviation (MAD), which is resilient to extreme values in a quantitative dataset. MAD is defined as the median of the absolute deviations from the median of the original data, as shown in Eq. 3.1. Thus, this measure ignores a small number of extreme values and focuses only on the median of the dataset. This is in contrast to conventional measures of scale, such as the arithmetic standard deviation (ASD) and the geometric standard deviation (GSD), which consider the entire dataset including extreme values. This reveals that conventional measures are non-robust and are greatly influenced by extreme values [15]. With regard to ASD and GSD, both of these measures describe how spread out a set of numbers is, while the core difference is the preferred average: the arithmetic mean (AM) or the geometric mean (GM).

\mathrm{MAD}(x) = \mathrm{median}\left(\left|x_i - \mathrm{median}(x)\right|\right) \quad (3.1)

\mathrm{ASD}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mathrm{AM}(x)\right)^2}, \quad \text{where } \mathrm{AM}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i \quad (3.2)

\mathrm{GSD}(x) = \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mathrm{GM}(x)}\right)^2}\right), \quad \text{where } \mathrm{GM}(x) = \left(\prod_{i=1}^{n} x_i\right)^{\frac{1}{n}} \quad (3.3)

As shown in Eq. 3.2, AM characterizes the average value of a dataset by summing up all numbers and then dividing by the length of the dataset. ASD is preferred to measure data dispersion when AM is used. The value of ASD indicates the average distance of the datapoints to the mean of the dataset; a high standard deviation indicates high data dispersion. In contrast to ASD, the GSD shown in Eq. 3.3 is dimensionless and is considered a multiplicative factor. GM is used as the average for GSD; it takes the product of all numbers and then raises it to the inverse of the length of the dataset. To conclude, AM and ASD are sum-based values. As such, they are appropriate for additive processes. However, when dealing with multiplicative relationships, such as ratios, growth rates, and speedups, AM over-estimates the data average, and GM as the product-based value is suitable for such multiplicative processes [16]. An advantage of GSD is its invariance to multiplicative scaling, while ASD is not invariant. The proof of this multiplicative property is shown below.
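Before turning to the proofs below, a minimal C implementation of Eq. 3.2 and Eq. 3.3 may help make the definitions concrete; the input values are hypothetical normalized results used purely for illustration.

```c
#include <math.h>
#include <stdio.h>

/* Arithmetic mean and ASD (Eq. 3.2), geometric mean and GSD (Eq. 3.3)
   for a positive dataset x[0..n-1]. */
double am(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s / n;
}
double asd(const double *x, int n) {
    double mu = am(x, n), s = 0.0;
    for (int i = 0; i < n; i++) s += (x[i] - mu) * (x[i] - mu);
    return sqrt(s / n);
}
double gm(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += log(x[i]);   /* product via a log-sum */
    return exp(s / n);
}
double gsd(const double *x, int n) {              /* dimensionless, >= 1 */
    double mu = gm(x, n), s = 0.0;
    for (int i = 0; i < n; i++) {
        double d = log(x[i] / mu);
        s += d * d;
    }
    return exp(sqrt(s / n));
}

int main(void) {
    /* hypothetical normalized results of one processor on four benchmarks */
    double x[] = {2.0, 8.0, 4.0, 16.0};
    printf("AM = %.2f  ASD = %.2f  GM = %.2f  GSD = %.2f\n",
           am(x, 4), asd(x, 4), gm(x, 4), gsd(x, 4));
    return 0;
}
```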

Lemma 3.2.1. GSD is invariant to multiplicative scaling, i.e., GSD(sx) = GSD(x) where x is a positive dataset and s is a positive constant.

Proof. Assume \mathrm{GM}(x) = \left(\prod_{i=1}^{n} x_i\right)^{\frac{1}{n}}; we scale all x_i with a positive factor s, thus

\mathrm{GM}(sx) = \left(\prod_{i=1}^{n} s x_i\right)^{\frac{1}{n}} = s\left(\prod_{i=1}^{n} x_i\right)^{\frac{1}{n}} = s\,\mathrm{GM}(x)

\mathrm{GSD}(sx) = \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{s x_i}{\mathrm{GM}(sx)}\right)^2}\right) = \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mathrm{GM}(x)}\right)^2}\right) = \mathrm{GSD}(x)


Lemma 3.2.2. ASD scales with multiplicative scaling, i.e., \mathrm{ASD}(sx) = s\,\mathrm{ASD}(x), where x is a positive dataset and s is a positive constant.

Proof. Assume \mathrm{AM}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i; we scale all x_i with a factor s, thus

\mathrm{AM}(sx) = \frac{1}{n}\sum_{i=1}^{n} s x_i = s\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = s\,\mathrm{AM}(x)

\mathrm{ASD}(sx) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(s x_i - \mathrm{AM}(sx)\right)^2} = \sqrt{\frac{s^2}{n}\sum_{i=1}^{n}\left(x_i - \mathrm{AM}(x)\right)^2} = s\,\mathrm{ASD}(x)

Taking cycles and execution time as an example, the cycle count is the execution time multiplied by the frequency, as shown in Eq. 3.4. With ASD, quantifying data dispersion in cycles versus execution time leads to a frequency-fold difference in flexibility, while GSD remains unchanged. In this manner, with GSD, performance-flexibility values are defined irrespective of frequency, eliminating the inconsistency in flexibility values when processor performance is boosted by increasing the clock frequency. Similarly, consider a processor A that executes each application 5x slower than a processor B, so that all of A's measured values are 5x smaller in performance terms. With ASD, processor A is evaluated to have 5x less deviation than processor B, which would mean processor A is 5x more flexible than processor B. In contrast, the flexibility results of these two processors remain the same when applying GSD. Thus, it can be concluded that the deviation measured by ASD is strongly influenced by absolute values. That is to say, if flexibility measurements were conducted across diverse types of processors, including desktop and embedded processors, the embedded processors running at low frequencies would most likely be measured as more flexible than any desktop processor when applying ASD.

\#(\text{Cycles}) = \text{Execution time} \times \text{Frequency} \quad (3.4)

Overall, the flexibility measure is required to be sensitive enough to capture exceptional system behavior on applications, which may reflect the lack of software/hardware support or of specialized instructions. For instance, a processor with hardware floating-point units can speed up programs with a large number of floating-point computations, leading to an extremely low execution time compared to a processor that can only provide software routine support. Thus, the proposed metric is expected to be sensitive to extreme values, resulting in the rejection of robust measures. Among the conventional measures, GSD is preferred: as explicitly stated in the flexibility definition, flexibility is measured on normalized values. After data normalization, the original dataset is inevitably transformed and redefined as ratios to a baseline, introducing multiplicative relations. ASD overestimates variation when multiplicative relations are involved. In addition, GM as a product-based value is claimed to be the only correct mean for such multiplicative processes [17]. Hence, in this work, GSD is chosen to quantify processor flexibility. Note that the application of GSD and GM is constrained to positive datasets due to the logarithm and root in the calculation.
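A small worked example with invented numbers makes the scaling argument concrete. Take the dataset x = (2, 8) and the scaling factor s = 3:

\mathrm{AM}(x) = 5, \quad \mathrm{ASD}(x) = 3, \quad \mathrm{AM}(3x) = 15, \quad \mathrm{ASD}(3x) = 9 = 3\,\mathrm{ASD}(x)

\mathrm{GM}(x) = 4, \quad \mathrm{GSD}(x) = e^{\ln 2} = 2, \quad \mathrm{GM}(3x) = 12, \quad \mathrm{GSD}(3x) = 2 = \mathrm{GSD}(x)

Scaling all measurements by the same factor, e.g. converting execution times into cycle counts at a fixed frequency, thus changes ASD but leaves GSD untouched.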

3.2.1 Additional Properties of GM and GSD

Several properties of GM and GSD are further investigated below. The motivations for using GM and GSD are substantial and can be summarized as follows:

• Lemma 3.2.4 suggests that the choice of baseline is irrelevant when using ratios to compare the GMs of the normalized metrics, thus maintaining the original ratio.
• Lemma 3.2.5 suggests that the GSD values are identical when applied to two metrics that are the reciprocal of each other, e.g., energy consumption and energy efficiency.
• Lemma 3.2.6 suggests that the equivalence for GM indicated by Lemma 3.2.3 does not hold for GSD, as \mathrm{GSD}(x/b) \ge \mathrm{GSD}(x)/\mathrm{GSD}(b). Therefore, in general \mathrm{GSD}(x/b)/\mathrm{GSD}(y/b) \ne \mathrm{GSD}(x)/\mathrm{GSD}(y), meaning that the choice of baseline is relevant when using ratios to compare the GSDs of the normalized metrics.


Lemma 3.2.3. The GM of the normalized dataset is equal to the ratio of the GMs of the original dataset and its baseline, i.e., \mathrm{GM}(x/b) = \mathrm{GM}(x)/\mathrm{GM}(b), where x and b are positive datasets.

Proof. Assume \mathrm{GM}(x) = \left(\prod_{i=1}^{n} x_i\right)^{\frac{1}{n}} and \mathrm{GM}(b) = \left(\prod_{i=1}^{n} b_i\right)^{\frac{1}{n}}, thus

\mathrm{GM}\!\left(\frac{x}{b}\right) = \left(\prod_{i=1}^{n} \frac{x_i}{b_i}\right)^{\frac{1}{n}} = \frac{\left(\prod_{i=1}^{n} x_i\right)^{\frac{1}{n}}}{\left(\prod_{i=1}^{n} b_i\right)^{\frac{1}{n}}} = \frac{\mathrm{GM}(x)}{\mathrm{GM}(b)}

Lemma 3.2.4. The ratio of the GMs of different normalized datasets is the same as the ratio of the GMs of the original datasets, i.e., \mathrm{GM}(x/b)/\mathrm{GM}(y/b) = \mathrm{GM}(x)/\mathrm{GM}(y), where x, y, b are positive datasets.

Proof. By Lemma 3.2.3,

\frac{\mathrm{GM}(x/b)}{\mathrm{GM}(y/b)} = \frac{\mathrm{GM}(x)/\mathrm{GM}(b)}{\mathrm{GM}(y)/\mathrm{GM}(b)} = \frac{\mathrm{GM}(x)}{\mathrm{GM}(y)}

Lemma 3.2.5. The GSD of the normalized data is equal to the GSD of the reciprocal of the normalized data, i.e., \mathrm{GSD}(x/b) = \mathrm{GSD}(b/x), where x, b are positive datasets.

Proof. Assume \mu_x = \mathrm{GM}(x) and \mu_b = \mathrm{GM}(b). Due to Lemma 3.2.3, \mu_1 = \mathrm{GM}(x/b) = \mu_x/\mu_b and \mu_2 = \mathrm{GM}(b/x) = \mu_b/\mu_x, yielding \mu_2 = 1/\mu_1.

\mathrm{GSD}\!\left(\frac{x}{b}\right) = \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{x_i/b_i}{\mu_1}\right)^2}\right) = \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{x_i}{b_i} - \ln\mu_1\right)^2}\right) \quad (3.5)

\mathrm{GSD}\!\left(\frac{b}{x}\right) = \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{b_i/x_i}{\mu_2}\right)^2}\right) \overset{\mu_2 = 1/\mu_1}{=} \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{\mu_1}{x_i/b_i}\right)^2}\right)
= \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\mu_1 - \ln\frac{x_i}{b_i}\right)^2}\right) = \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{x_i}{b_i} - \ln\mu_1\right)^2}\right) = \mathrm{GSD}\!\left(\frac{x}{b}\right) \quad (3.6)

Lemma 3.2.6. The GSD of the normalized dataset is always greater than or equal to the ratio of the GSDs of the original dataset and its baseline, i.e., \mathrm{GSD}(x/b) \ge \mathrm{GSD}(x)/\mathrm{GSD}(b), where x, b are positive datasets.

Proof. Assume \mu_x = \mathrm{GM}(x) and \mu_b = \mathrm{GM}(b). Due to Lemma 3.2.3, we know \mathrm{GM}(x/b) = \mu_x/\mu_b.

\mathrm{GSD}\!\left(\frac{x}{b}\right) = \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{x_i/b_i}{\mu_x/\mu_b}\right)^2}\right) = \exp\left(\sqrt{\frac{1}{n}} \cdot \sqrt{\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x} - \ln\frac{b_i}{\mu_b}\right)^2}\right) \quad (3.7)

\frac{\mathrm{GSD}(x)}{\mathrm{GSD}(b)} = \frac{\exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x}\right)^2}\right)}{\exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\ln\frac{b_i}{\mu_b}\right)^2}\right)} = \exp\left(\sqrt{\frac{1}{n}} \cdot \left(\sqrt{\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x}\right)^2} - \sqrt{\sum_{i=1}^{n}\left(\ln\frac{b_i}{\mu_b}\right)^2}\right)\right) \quad (3.8)


As exp is an increasing function, to further compare \mathrm{GSD}(x/b) and \mathrm{GSD}(x)/\mathrm{GSD}(b), we need to compare

(1) = \sqrt{\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x} - \ln\frac{b_i}{\mu_b}\right)^2} \quad \text{and} \quad (2) = \sqrt{\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x}\right)^2} - \sqrt{\sum_{i=1}^{n}\left(\ln\frac{b_i}{\mu_b}\right)^2}.

Obviously (1) ≥ 0 always holds, while (2) can be greater than, less than, or equal to 0. Thus, we need to consider all cases. For (2) < 0, (1) > (2) trivially holds, while in the case (2) ≥ 0 more analysis is required to compare (1) and (2). Before the next step, note that when a > b > 0, a² > b² must hold, as a² − b² = (a + b)(a − b) > 0. That is to say, when both numbers are greater than 0, squaring them does not change the inequality relation between them. Thus it is valid to compare (1)² and (2)².

(1)^2 = \sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x} - \ln\frac{b_i}{\mu_b}\right)^2 = \sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x}\right)^2 - 2\sum_{i=1}^{n}\ln\frac{x_i}{\mu_x}\cdot\ln\frac{b_i}{\mu_b} + \sum_{i=1}^{n}\left(\ln\frac{b_i}{\mu_b}\right)^2 \quad (3.9)

(2)^2 = \sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x}\right)^2 - 2\sqrt{\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x}\right)^2}\cdot\sqrt{\sum_{i=1}^{n}\left(\ln\frac{b_i}{\mu_b}\right)^2} + \sum_{i=1}^{n}\left(\ln\frac{b_i}{\mu_b}\right)^2 \quad (3.10)

From Eq. 3.9 and Eq. 3.10, it can be inferred:

(1)^2 - (2)^2 = 2\left(\sqrt{\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x}\right)^2}\cdot\sqrt{\sum_{i=1}^{n}\left(\ln\frac{b_i}{\mu_b}\right)^2} - \sum_{i=1}^{n}\ln\frac{x_i}{\mu_x}\cdot\ln\frac{b_i}{\mu_b}\right) \quad (3.11)

Then we need to compare (3) = \sqrt{\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x}\right)^2}\cdot\sqrt{\sum_{i=1}^{n}\left(\ln\frac{b_i}{\mu_b}\right)^2} and (4) = \sum_{i=1}^{n}\ln\frac{x_i}{\mu_x}\cdot\ln\frac{b_i}{\mu_b}. If (3) ≥ (4) holds, then we know (1)² ≥ (2)², namely (1) ≥ (2). As before, (3) ≥ 0 always holds, while the sign of (4) is not fixed. In this manner, if (4) < 0, then certainly (3) > (4). In the case (4) ≥ 0, we can further compare (3)² and (4)².

(3)^2 = \left(\sum_{i=1}^{n}\left(\ln\frac{x_i}{\mu_x}\right)^2\right)\cdot\left(\sum_{i=1}^{n}\left(\ln\frac{b_i}{\mu_b}\right)^2\right) \quad (3.12)

(4)^2 = \left(\sum_{i=1}^{n}\ln\frac{x_i}{\mu_x}\cdot\ln\frac{b_i}{\mu_b}\right)^2 \quad (3.13)

Now we use X_i and B_i to represent \ln\frac{x_i}{\mu_x} and \ln\frac{b_i}{\mu_b} to give a clearer view. Thus, we need to compare \sum_{i=1}^{n} X_i^2 \cdot \sum_{i=1}^{n} B_i^2 and \left(\sum_{i=1}^{n} X_i B_i\right)^2, where X_i, B_i \in \mathbb{R}. The Cauchy-Schwarz inequality states that in the Euclidean space \mathbb{R}^n with the standard inner product, the inequality shown in Eq. 3.14 always holds:

\left(\sum_{i=1}^{n} u_i v_i\right)^2 \le \left(\sum_{i=1}^{n} u_i^2\right)\left(\sum_{i=1}^{n} v_i^2\right) \quad (3.14)

Equality holds when \frac{u_1}{v_1} = \frac{u_2}{v_2} = \dots = \frac{u_n}{v_n}. Hence, as indicated by the Cauchy-Schwarz inequality, it can be inferred:

\left(\sum_{i=1}^{n} X_i B_i\right)^2 \le \left(\sum_{i=1}^{n} X_i^2\right)\left(\sum_{i=1}^{n} B_i^2\right) \quad (3.15)


That is to say, (3) ≥ (4) holds for any (4) \in \mathbb{R}; thus it can be further deduced that (1) ≥ (2) is always true for any (2) \in \mathbb{R}. Finally, the following inequality always holds:

\mathrm{GSD}\!\left(\frac{x}{b}\right) \ge \frac{\mathrm{GSD}(x)}{\mathrm{GSD}(b)}

Equality holds only when \frac{x_1}{b_1} = \frac{x_2}{b_2} = \dots = \frac{x_n}{b_n}.

3.3 Flexibility Framework

The flexibility framework illustrated in Figure 3.1 provides a clear overview of how to measure flexibility step by step. Note that in this work compilers are considered to be a part of the platforms of which the flexibility is measured.

Figure 3.1: Framework for flexibility measurements

To quantify flexibility, the first step is to compile and run the same application set on the processors, where a cross-platform language such as OpenCL or a benchmark suite with multi-language support is preferred for executing applications across diverse platforms. Next, with the obtained raw results, data normalization is a prerequisite to ensure that the benchmark results of diverse applications are comparable. Eventually, the proposed flexibility metric is applied to the normalized data to measure flexibility.
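As a minimal sketch of these three steps, assume the raw measurements have already been collected into an m x n matrix of execution times together with one intrinsic-workload value per application (the baseline normalization of Section 3.4.2). The code below normalizes each measurement and reports one performance-flexibility value (GSD) per platform; all names and numbers are invented for illustration.

```c
#include <math.h>
#include <stdio.h>

#define M 2   /* platforms    */
#define N 3   /* applications */

/* Performance-flexibility of one platform: GSD over its normalized performance,
   where normalized performance = intrinsic work / execution time. */
double flexibility(const double t[N], const double work[N]) {
    double lognorm[N], mean = 0.0, var = 0.0;
    for (int j = 0; j < N; j++) {
        lognorm[j] = log(work[j] / t[j]);
        mean += lognorm[j] / N;
    }
    for (int j = 0; j < N; j++)
        var += (lognorm[j] - mean) * (lognorm[j] - mean) / N;
    return exp(sqrt(var));   /* 1 = perfectly flexible, larger = less flexible */
}

int main(void) {
    /* hypothetical execution times (s) of N applications on M platforms */
    double t[M][N] = {{0.10, 0.40, 0.20},    /* platform A */
                      {0.05, 2.00, 0.01}};   /* platform B */
    /* hypothetical intrinsic workload (intrinsic transistors) per application */
    double work[N] = {1.0e9, 4.0e9, 2.0e9};
    for (int i = 0; i < M; i++)
        printf("platform %d: performance-flexibility (GSD) = %.2f\n",
               i, flexibility(t[i], work));
    return 0;
}
```

With these invented numbers, platform A delivers the same normalized performance on every application and obtains a GSD of 1, whereas platform B, whose normalized performance varies by two orders of magnitude, obtains a much larger GSD and is therefore evaluated as less flexible.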

3.4 Data Normalization

Before quantifying flexibility, secondary metrics such as execution time, measured across diverse applications, need to be normalized. The reason is that the applications used as indicators are expressed in different units and on disproportionate scales, so the benchmarked applications represent inequivalent amounts of computational work. For instance, applying a Gaussian filter to images of different sizes results in inequivalent workloads. Therefore, data normalization is required prior to any data analysis and comparison [11]. To clearly explain the normalization methods, the benchmark results (execution time) of n applications measured on m processors are represented as an m x n matrix; the benchmark results of each processor on the n applications can be found in its corresponding row. Two types of normalization methods are introduced below: normalize to dataset and normalize to baseline.

Figure 3.2: Sort the execution times of n applications on m processors into an m x n matrix.

3.4.1 Normalize to Dataset

This normalization technique requires a set of processors to be tested. The dataset here is determined as the benchmark results (execution time) of one application on m processors, namely each column of the m x n matrix. The X_min, µ, and X_max values used in this section represent the minimum, mean, and maximum value of the corresponding dataset (column).

Min-max Normalization

Min-max normalization (Eq. 3.16) is a linear normalization technique. It re-scales indicators into an identical range [0, 1] by subtracting the minimum value and dividing by the range of the indicator values [18]. This technique is based on the extreme values (X_max and X_min), reflecting the ratio of d_point-to-min to d_max-to-min, where d is an abbreviation of distance.

X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \quad (3.16)

where X_norm is the normalized data and X is the original data. The advantages are its identical interval and its high sensitivity to extreme values. Min-max normalization retains the original distribution of a dataset except for the transformation to a common range [0, 1] [19]. However, when outliers are present, the obtained results can be skewed significantly.

Mean Normalization

Similar to min-max normalization, mean normalization (Eq. 3.17) re-scales data based on the extreme values, but into the interval [-1, 1]. This technique reflects the ratio of d_point-to-mean to d_max-to-min, where d is an abbreviation of distance.

X_{norm} = \frac{X - \mu}{X_{max} - X_{min}} \quad (3.17)

where X_norm is the normalized data and X is the original data. Mean normalization has the same pros and cons as min-max normalization. The difference is that it extends the range into the negative interval, down to -1.

Z-score Normalization

Z-score normalization (Eq. 3.18) re-scales indicators to a common scale with a mean of zero and a standard deviation of one, by subtracting the mean of each indicator from a raw value and then dividing by its standard deviation [18]. Thus, the z-score technique indicates how much each element of an indicator deviates from the mean of the distribution. Observed values above the mean have positive z-scores, while values below the mean have negative z-scores.

X_{norm} = \frac{X - \mu}{\sigma}, \quad \text{where } \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu\right)^2} \quad (3.18)

where X_norm is the normalized data and X is the original data. As discussed in Section 3.2, the standard deviation is sensitive to extreme values; therefore, this normalization is not robust. Besides, the z-score replaces the measurement units with the number of standard deviations away from the mean. Thus, it is advisable when comparing indicators with different units.
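The three dataset normalizations can be summarized in a few lines of code. The sketch below applies them to one hypothetical column of the m x n matrix, i.e., the execution times of a single application on four processors.

```c
#include <math.h>
#include <stdio.h>

#define M 4   /* processors: one column of the m x n matrix */

int main(void) {
    double x[M] = {0.2, 0.5, 1.0, 4.3};   /* hypothetical execution times (s) */
    double min = x[0], max = x[0], mean = 0.0, sd = 0.0;
    for (int i = 0; i < M; i++) {
        if (x[i] < min) min = x[i];
        if (x[i] > max) max = x[i];
        mean += x[i] / M;
    }
    for (int i = 0; i < M; i++)
        sd += (x[i] - mean) * (x[i] - mean) / M;
    sd = sqrt(sd);

    for (int i = 0; i < M; i++)
        printf("%.2f  min-max: %.3f  mean: %+.3f  z-score: %+.3f\n",
               x[i],
               (x[i] - min) / (max - min),    /* Eq. 3.16, range [0, 1]  */
               (x[i] - mean) / (max - min),   /* Eq. 3.17, range [-1, 1] */
               (x[i] - mean) / sd);           /* Eq. 3.18, mean 0, sd 1  */
    return 0;
}
```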


3.4.2 Normalize to Baseline

Baseline normalization simply divides all elements of an indicator by a baseline. For instance, the normalized execution time is determined as the execution time divided by a baseline.

X_{norm} = \frac{X}{X_{baseline}} \quad (3.19)

Normalize to Reference

A fundamental approach is normalizing data to a reference set. However, determining a proper reference poses a challenge. As the established flexibility metric suggests, the geometric standard deviation is applied to quantify processor flexibility. Simply taking the benchmark results of one processor as the reference makes the reference processor "the most flexible" by definition. For instance, if a RISC processor is used as the baseline, normalizing a RISC processor to the baseline (RISC) transforms each value in the dataset to 1, meaning that there is no deviation and thus no architecture can be more flexible than a RISC processor. Intuitively this is not true, and most would argue that, for example, an FPGA is more flexible than a RISC processor, since any circuit can be instantiated on an FPGA, including any RISC processor.

Normalize to Intrinsic Work

Normalizing to the inherent amount of work is an alternative baseline normalization. The intuition behind this method is that every application has a certain amount of intrinsic work, and normalizing a measured result to the intrinsic work expresses it per unit of intrinsic work. The essence of attaining the intrinsic work is to construct an ideal, arbitrarily large, stateless combinational circuit in which only basic 2-input-1-output and 1-input-1-output logic gates are employed. Given the inputs, this combinational circuit is capable of directly outputting the results. Any potential speedup achieved by utilizing more gates to reduce the logic depth causes overhead. In this circuit, data transmission is performed by hard-wiring gates, and loops in applications are implemented as being fully unrolled. Thus, operations like data loads, stores, and branches are redundant. Only binary operations that perform computational work are considered to be intrinsic work. Considering the fact that the complexity of gates varies in the actual CMOS implementation, the basic unit gate is further broken down into transistors, e.g., a CMOS XOR gate consists of a minimum of 8 transistors while a basic OR gate only requires 4 transistors. In this manner, the intrinsic work, represented by the minimum number of intrinsic transistors, is a property of the application, independent of the platform it executes on.

As it is intractable to construct the optimal combinational circuit, the circuit complexity is instead approximated by establishing an equivalent circuit out of basic functional circuit blocks such as addition, multiplication, division, and so on. Selecting these basic functional blocks is non-trivial. The only requirement is that the set of blocks has to be Turing complete, which leaves many options. A natural choice is to utilize an existing reduced ISA, since it divides the overall functionality into an elementary set of operations. In particular, the intermediate representation (IR) used by the LLVM compiler is a suitable choice, as it essentially is an abstract high-level model of the most common ISAs. Extracting IR instructions can be done for arbitrary applications that are compilable with the LLVM compiler. Eventually, by weighting every IR instruction based on its circuit complexity, the intrinsic workload of an application is approximated.

Some may argue that in practice memory operations are necessary and more expensive in time and energy consumption than some binary operations, and that it is somewhat unfair to exclude memory operations from the intrinsic work. In this work, the decision to weight memory operations as zero is made for two reasons. One is that in the constructed ideal combinational circuit, memory operations are redundant. The second reason is that the execution counts of memory operations extracted from LLVM IR may heavily depend on the code quality. Hence, we leave the memory issue as future work, to investigate whether there exists a universal weighting method for memory accesses that could improve the current normalization.
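To illustrate the weighting idea, the sketch below sums dynamic LLVM IR opcode counts multiplied by assumed transistor costs. The opcode list and the weights are illustrative placeholders rather than the transistor table actually used in Chapter 5, and in practice the counts would come from an instrumentation pass such as the one in Appendix A.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical per-opcode transistor weights for 32-bit operations. */
struct op_weight { const char *ir_op; long transistors; };

static const struct op_weight weights[] = {
    {"add",   850},  /* e.g. a 32-bit adder             */
    {"mul",  9000},  /* e.g. a 32-bit multiplier        */
    {"xor",   256},  /* 32 x 8-transistor XOR gates     */
    {"load",    0},  /* memory operations weighted zero */
    {"store",   0},
    {"br",      0},  /* control flow weighted zero      */
};

/* ops/counts: dynamic IR opcode names and their execution counts. */
long intrinsic_transistors(const char *ops[], const long counts[], int n) {
    long total = 0;
    for (int i = 0; i < n; i++)
        for (size_t j = 0; j < sizeof weights / sizeof weights[0]; j++)
            if (strcmp(ops[i], weights[j].ir_op) == 0)
                total += counts[i] * weights[j].transistors;
    return total;
}

int main(void) {
    const char *ops[]   = {"add", "mul", "load", "br"};
    const long counts[] = {1000000, 250000, 800000, 120000};
    printf("intrinsic transistors: %ld\n", intrinsic_transistors(ops, counts, 4));
    return 0;
}
```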


3.4.3 Summary of Normalization

In this section, two normalization approaches have been presented: normalize to dataset and normalize to baseline. Dataset normalization is relatively convenient to apply but requires a set of processors to create the dataset. The obtained flexibility value is then associated not only with the benchmarked application set but also with the other benchmarked processors. In contrast, baseline normalization can be conducted individually, and the quantified flexibility value is related only to the applied benchmark set. Between the two baseline normalization methods, it is essential that a common agreement is made on the baseline to which the dataset is normalized. The approach of normalizing to the intrinsic work of applications is promising, as with this approach the normalized data remains meaningful, i.e., normalized performance can be expressed as the amount of intrinsic work per second. The completion of more intrinsic work within a time unit indicates a better performance. Table 3.1 summarizes the different normalization methods that have been discussed in this chapter.

Table 3.1: Summary of normalization methods

Normalization method | Description | Pros | Cons
Min-max normalization | Reflects the ratio of (X_data − X_min) to (X_max − X_min). | Identical range [0, 1]. | May be distorted by extreme values.
Mean normalization | Reflects the ratio of (X_data − X_mean) to (X_max − X_min). | Identical range [-1, 1]. | May be distorted by extreme values.
Z-score normalization | Reflects how data deviates from the mean relative to the average deviation. | Identical mean 0 and standard deviation 1; less influence from extreme values. | Unbounded range.
Baseline normalization | Reflects the speedup of the data relative to the baseline. | Easy to apply; positive dataset. | Hard to determine the baseline.

Once the normalized dataset is obtained, flexibility can be quantified from it. As discussed, GSD with its multiplicative property is advisable when the dataset is expressed as ratios, speedups, or growth factors. Both normalization approaches, normalizing to dataset and to baseline, to some extent implicitly transform the data into percentages or ratios. An example for dataset normalization is min-max normalization, which expresses data as a fraction of X_max − X_min. By dividing the data by one reference, baseline normalization explicitly yields a speedup of the data relative to the baseline. Hence, theoretically speaking, GSD should be applied for both normalization approaches. Note, however, that GSD is only applicable to positive datasets. In the case of normalizing to dataset, the three approaches min-max, mean, and z-score normalization may transform the data into a non-positive interval, preventing the application of GSD. A solution that has been attempted to remove this limit is to shift all numbers in the dataset by |X_min| + δ, where X_min is the minimum value in the dataset and δ is an arbitrary positive number. The additional shift of δ ensures that this approach also works for datasets where X_min equals zero. Note that, to reduce the impact of δ on the original dataset, δ is chosen to be relatively small. In this manner, all normalized numbers in the dataset are guaranteed to be positive. However, shifting the normalized results does not maintain the original monotonicity when comparing data disparity with GSD. For instance, with GSD, processor A may be evaluated to be more flexible than processor B, while after shifting, processor B may be evaluated as more flexible than processor A. The other drawback is that this shifting approach renders the normalized data meaningless. Hence, GSD combined with baseline normalization is the preferred option.

Chapter 4

Experiment Setup

In this chapter, we describe the experiment setup used in this work, including the benchmark set, the platforms, and the simulators. By tuning compiler directives, techniques that may influence flexibility values are further investigated.

With the proposed flexibility metric, the flexibility of processors can be measured quantitatively. To compare the flexibility of diverse computer architectures and examine the common assumptions about flexibility, architectures including CPUs, GPUs, FPGAs, and DSPs are evaluated in this work. A desired property of a flexibility metric is that it is able to distinguish different architecture classes in terms of flexibility.

4.1 Benchmarks

To quantify flexibility for numerous processors, PolyBench/ACC is selected as a generic benchmark set, since it provides multi-language versions of the benchmarks, including C and CUDA [20]. In this work, 14 applications supported by multiple languages are benchmarked with the standard dataset. For domain-specific analysis, a more representative benchmark set can be selected. The involved benchmarks are described in Table 4.1 [20].

Table 4.1: Description of each benchmark in PolyBench

Benchmark | Description
2mm | 2 Matrix Multiplications (D=A.B; E=C.D)
3mm | 3 Matrix Multiplications (E=A.B; F=C.D; G=E.F)
adi | Alternating Direction Implicit solver
correlation | Correlation Computation
covariance | Covariance Computation
doitgen | Multiresolution analysis kernel (MADNESS)
fdtd-2d | 2-D Finite Difference Time Domain Kernel
gemm | Matrix-multiply C=alpha.A.B+beta.C
gramschmidt | Gram-Schmidt decomposition
jacobi-1D | 1-D Jacobi stencil computation
jacobi-2D | 2-D Jacobi stencil computation
lu | LU decomposition
syr2k | Symmetric rank-2k operations
syrk | Symmetric rank-k operations

4.2 Platforms

In this work, flexibility measurements are conducted on 24 platforms. Applications are directly executed on GPUs and CPUs. For FPGAs and DSPs, high-level synthesis (HLS) and simulators

are used to provide greater visibility into application behavior. Compilers, as a part of the target platforms, are tuned for maximum optimization to fully exploit the capability of the processors.

4.2.1 CPU

In total 10 CPUs are included: 6 Intel CPUs and 4 ARM CPUs. The goal is to distinguish and compare embedded and desktop CPUs in terms of flexibility. The benchmarks are compiled with gcc. Table 4.2 provides detailed information on the examined CPUs.

Table 4.2: Overview of CPUs in this work

Processor | ISA | Micro-architecture | #Threads | Compiler
i7-6700 | x86_64 | Skylake | 8 | gcc 4.8
i7-4770 | x86_64 | Haswell | 8 | gcc 4.8
i7-960 | x86_64 | Bloomfield | 8 | gcc 4.8
i7-950 | x86_64 | Bloomfield | 8 | gcc 4.8
i7-920 | x86_64 | Bloomfield | 8 | gcc 4.8
Pentium 4 | x86_64 | Northwood | 2 | gcc 4.8

Processor | ISA | System | #Threads | Compiler
Cortex A15 | ARMv7 | Nvidia JTK1 | 4 | gcc 4.8
Cortex A9 | ARMv7 | Odroid U3 | 4 | gcc 4.8
Cortex A53 | ARMv7 | RPi3 Model B | 4 | gcc 6.3
ARM1176 | ARMv6 | RPi1 Model B | 1 | gcc 6.3

4.2.2 GPU

Aiming at understanding the difference in flexibility between desktop and embedded GPUs, one embedded GPU (Tegra K1) and three desktop GPUs are evaluated in this work by compiling the PolyBench/ACC CUDA version of the applications. Note that the default standard dataset of the C and CUDA versions differs; the dataset size of the CUDA version is therefore modified to be equal to that of the C version, ensuring that the application workload is consistent over all platforms.

Table 4.3: Overview of GPUs

Processor | Chip | Architecture | #Cores | Compiler
Tegra K1 | GK20A | Kepler | 192 | nvcc 6.5
GTX 570 | GF110 | Fermi | 480 | nvcc 7.5
GTX TITAN | GK110 | Kepler | 2688 | nvcc 7.0
GTX 750 TI | GM107 | Maxwell | 640 | nvcc 7.0

4.2.3 FPGA

As PolyBench/ACC lacks support for hardware description languages, Vivado High-Level Synthesis (HLS) is applied to transform the C applications into register transfer level (RTL) implementations, which can be directly targeted to Xilinx programmable devices. Aiming at exploring the impact of the amount and the type of resources on flexibility, Xilinx FPGAs from different families and with different amounts of resources are included, e.g., Virtexuplus is an UltraScale+ version of the Virtex FPGA with more resources compared to the normal Virtex7. All simulations were performed with Vivado HLS v2018.2.

4.2.4 DSP

To examine the hypothesis that parallelism techniques like multithreading incur inflexibility, two multi-threaded Hexagon DSPs from Qualcomm and one single-threaded DSP from Texas Instruments (TI) are included. All the measurements are based on simulations.


Table 4.4: Overview of FPGAs

Family | Device | LUT | FF | DSP | BRAM
Artix7 | xc7a200t | 129000 | 269200 | 740 | 730
Kintex7 | xc7k480t | 597200 | 597200 | 1920 | 1910
Virtex7 | xc7v2000t | 1221600 | 2443200 | 2160 | 2584
Zynq | xc7z100 | 277400 | 554800 | 2020 | 1510
Virtexuplus | xcvu13p | 1728000 | 3456000 | 12288 | 5376
Kintexu | xcku115 | 663360 | 1326720 | 5520 | 4320
Zynquplus | xczu19eg | 522720 | 1045440 | 1968 | 1968

Table 4.5: Overview of DSPs

Processor | #Cores | L1I | L1D | L2 | Simulator
Hexagon V60 | 4 | 16K | 32K | 512K | Hexagon SDK
Hexagon V5 | 3 | 16K | 32K | 256K | Hexagon SDK
TI C6747 | 1 | 32K | 32K | 256K | CCSv4

The Hexagon V60 and V5 DSPs are simulated in the cycle-approximate mode provided by the Hexagon SDK, and Code Composer Studio v4 (CCSv4) provides cycle-accurate simulations for the TI C6747. Table 4.5 provides more details of these three DSPs. With the timing option enabled, hexagon-sim models caches, the optimal multi-threading mode, and processor stalls [21]. With timing disabled, caches are assumed to be perfectly accessed, stalls are excluded, and the Hexagon DSPs are simulated with a simplified multi-threading model.

4.3 Compiler Directives

It can be argued that not optimizing code gives a distorted image of reality, since programmers typically will spend some effort to manually optimize code for accelerators such as GPUs and FPGAs. As it is unfeasible to hand-optimize all benchmarks for each platform, and the code quality would depend heavily on the programmer, a compromise is found by inserting compiler directives. Without directives, compilers, as a part of the platforms whose flexibility is measured, can hardly exploit the maximum potential of the processors. Therefore, to further investigate the impact of applying compiler directives on flexibility, compiler directives are inserted for the CPUs, FPGAs, and DSPs. Regarding GPUs, some manual effort has already been made when implementing the CUDA versions of the benchmarks.

4.3.1 CPU

Multi-threading, which parallelizes tasks among multiple threads, is enabled by OpenMP directives, where threads run concurrently. With the runtime environment, tasks are allocated to different threads. In each kernel, the outermost loop is vectorized. In case dependencies between loops exist, the inner loop is parallelized instead.

4.3.2 FPGA

For FPGAs, directives are inserted to introduce more parallelism and pipelining, aiming at optimizing performance and increasing resource utilization.

4.3.3 DSP

To examine the hypothesis that best-effort and parallelism techniques, such as caches and multithreading, negatively impact flexibility, the Hexagon DSPs are simulated in two modes: timing-accurate and timing-inaccurate mode. In the accurate timing mode, Hexagon models caches, the optimal multithreading mode, and processor stalls. In the inaccurate mode, caches are assumed to be perfectly accessed, stalls are excluded, and a simplified multithreading model is simulated [21].

Chapter 5

Implementation

The implementations involved in benchmarking the different platforms are described. In addition, a practical approach based on LLVM IR to extract the intrinsic workload from arbitrary applications is introduced in this section.

5.1 CPU & GPU

A model implemented in Python is applied to automate benchmarking on GPUs and CPUs. Note that before operating via the Python interface, all benchmarks and Makefiles were already installed in a personal directory which can be accessed from multiple servers. The implementation of the Makefiles guarantees that the executable files generated during the compilation process are named differently based on the accessing server, aiming at avoiding collisions caused by multiple servers operating on the same file.

Figure 5.1: Design flow of Python implementation for CPUs and GPUs

In the first step, two task lists are constructed, one for the CPUs and one for the GPUs. Each task in the list is represented by its server address. Based on the task type and the server address, the threads created in step 2 first establish a connection to the server and then execute the corresponding commands on it. Because race conditions may be caused by multiple threads working concurrently, a barrier is created to block the operation that closes the connections to the servers: only after all benchmarks are completed are the connections closed. Step 3 targets data processing and analysis, involving extracting data from the .txt files, transforming the data into Excel files, and applying and comparing normalization methods. Eventually, the obtained results are plotted.
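A minimal Python sketch of this flow is given below. The server names, directory layout, and Makefile targets are placeholders for illustration only; the actual scripts used in this work are not reproduced here.

import subprocess
import threading

# Hypothetical server addresses and benchmark commands (placeholders).
cpu_servers = ["cpu-server-1", "cpu-server-2"]
gpu_servers = ["gpu-server-1"]
tasks = [("cpu", s) for s in cpu_servers] + [("gpu", s) for s in gpu_servers]

# Step 2: all worker threads wait here before connections are torn down.
barrier = threading.Barrier(len(tasks))

def run_benchmarks(task_type, server):
    # One remote invocation per platform; results are written to a file named
    # after the server to avoid collisions between concurrently running servers.
    cmd = f"cd ~/polybench && make run-{task_type} > result_{server}.txt"
    subprocess.run(["ssh", server, cmd], check=False)
    barrier.wait()   # close connections only after every benchmark has finished

threads = [threading.Thread(target=run_benchmarks, args=t) for t in tasks]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Step 3 (not shown): parse the result_*.txt files, normalize, and plot.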

5.2 FPGA

In Vivado HLS, after synthesizing a C function, a report is generated, providing performance metrics such as resource utilization, estimated clock period, loop latency, function latency in clock cycles, and so on.

The function latency indicates the number of cycles required to complete the function; similarly, the loop latency is given for each loop in the function. In case the design involves a loop with a variable bound, the number of iterations cannot be determined, and Vivado HLS reports the loop latency as unknown, as is the function latency. A solution is to utilize C/RTL co-simulation in Vivado HLS, which simulates functions at RTL level. However, it is extremely time-consuming and memory intensive. For instance, co-simulating an application that runs for 1 second on a single core of an i7-6700 processor may take more than one week to complete and may generate files larger than 100 GB. Therefore, in this work static loop analysis is applied to efficiently compute accurate cycle counts.

for (k = 0; k < m; k++)              // L1
    for (i = k+1; i < m; i++)        // L2
        for (j = k+1; j < m; j++)    // L3
            A[i][j] = A[i][k] * A[k][j];
Listing 5.1: An example loop with variable indexes

An example loop nest with variable indexes is shown in Listing 5.1, where the integer m is a constant greater than 0. To determine the latency of loop L1, the iteration counts of L2 and L3 have to be known. As can be observed, the iteration counts of L2 and L3 depend on the variable k that only varies in L1. Thus, it can be inferred:

\[ \#L2 = \sum_{i=1}^{m} (m - i) = \frac{1}{2} m(m-1) \tag{5.1} \]

\[ \#L3 = \sum_{i=1}^{m} (m - i)^2 = \frac{1}{6} m(m-1)(2m-1) \tag{5.2} \]

With the known loop iteration counts of L2 and L3, the latency of the outermost loop L1 can be determined.
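The closed forms in Eq. 5.1 and Eq. 5.2 can be cross-checked by simply executing the loop nest and counting iterations; the short Python sketch below (not part of the thesis tooling) does exactly that:

def brute_force_counts(m):
    """Count the iterations of L2 and L3 in Listing 5.1 by running the loop nest."""
    l2 = l3 = 0
    for k in range(m):                     # L1
        for i in range(k + 1, m):          # L2
            l2 += 1
            for j in range(k + 1, m):      # L3
                l3 += 1
    return l2, l3

def closed_form_counts(m):
    """Closed forms from Eq. 5.1 and Eq. 5.2."""
    return m * (m - 1) // 2, m * (m - 1) * (2 * m - 1) // 6

for m in (1, 4, 16, 64):
    assert brute_force_counts(m) == closed_form_counts(m)
print("Eq. 5.1 and Eq. 5.2 match the executed iteration counts.")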

Figure 5.2: Design flow of Vivado HLS simulation

By default, Vivado HLS simply translates C functions into Verilog designs; few optimizations are applied without directives. For instance, Vivado HLS keeps the loops in a C function rolled, meaning that one iteration of the loop is synthesized into a block of logic which the RTL design executes sequentially [22]. In this manner, FPGAs cannot fully exploit their massive parallelism, and merely a diminutive part of the resources is utilized. Hence, to further investigate the FPGA flexibility of optimized implementations, optimization directives are inserted to refine the implementations. Since each array is implemented as a block RAM by Vivado HLS, partitioning arrays is a prerequisite to separate one large memory into multiple small memories, which increases the number of simultaneous read and write ports. After array partitioning, PIPELINE directives are inserted above the innermost loops, which provides the best performance under area constraints, as Vivado HLS attempts to unroll all loops nested below a PIPELINE directive. Note that loops with variable bounds cannot be unrolled. Thus, for the example in Listing 5.1, L3 as the innermost loop will not be unrolled, and because of this the approach to determine the loop latency remains feasible. Only pipelining the operations inside L3 still results in a significant improvement in performance. To speed up the simulation process, tasks are parallelized over several remote servers via the university network. Servers are assigned to execute simulations simultaneously with different applications and array partition factors (APFs). With a higher APF, more concurrent memory accesses can be achieved, and thus more parallel computations are possible for some applications when data dependencies allow. Certainly, more hardware resources are required. With other predefined settings included in the tcl files, such as the clock period and the device type, Vivado HLS synthesizes the designs and generates synthesis reports, from which informative data is extracted.

5.3 DSP

Simulations of the Hexagon DSPs were again automated on remote servers. Based on the unique host name of each server, the Makefile assigns different simulation tasks to the servers by passing the application names and the DSP model to be simulated. To attain the final simulation results, the C applications are first compiled with the model flag by the compiler hexagon-clang. Next, the actual simulations start based on the generated files.

5.4 Intrinsic Workload Estimator

As discussed in Section 3.4.2, the intrinsic workload of applications is used as the baseline to normalize the secondary metrics of processors. In this section a practical approach is proposed, based on LLVM IR, to extract the intrinsic workload from arbitrary applications. In the compilation process, applications written in diverse languages are translated by front-ends into a common intermediate representation (IR). On the IR, optimization techniques can be applied generically, instead of compiling directly to a target architecture. After being optimized by the LLVM optimizer, back-ends translate the optimized IR into machine code based on the target ISA. Hence, it can be observed that LLVM IR is platform-independent.

Figure 5.3: Steps of extracting intrinsic workload as baseline

Figure 5.3 illustrates the steps to extract the intrinsic workload of applications. Applications are first compiled into LLVM IR, into which instrumentation code is inserted by the custom pass to call the runtime library. Execution counts of IR instructions are recorded by the runtime library. As loops are fully unrolled in the ideal combinational circuit, where only binary operations are considered, the overhead of loop counters introduced by conditional statements is excluded. Any potential parallelism in IR instructions, such as vectorization, is eliminated explicitly. Listing 5.2 shows an example of an IR instruction with a vector of elements, where two pairs of primitive double-precision operands are added. Instead of passing a vector as an argument, the runtime library is called two times, meaning that the fadd operation on double-precision data is counted as executed twice.

%1 = fadd <2 x double> %load5, %load9
Listing 5.2: An example IR instruction with a vector
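The sketch below is a toy Python illustration (not the actual LLVM pass) of how such a vector instruction is decomposed into per-lane scalar operations before being reported to the runtime library:

import re

def scalar_op_counts(ir_line):
    """Decompose a vectorized IR instruction into per-lane scalar operation counts."""
    counts = {}
    m = re.search(r"=\s*(\w+)\s*<(\d+)\s*x\s*(\w+)>", ir_line)
    if m:                                       # vectorized instruction
        op, lanes, elem_ty = m.group(1), int(m.group(2)), m.group(3)
        counts[(op, elem_ty)] = lanes           # one scalar count per vector lane
    return counts

print(scalar_op_counts("%1 = fadd <2 x double> %load5, %load9"))
# {('fadd', 'double'): 2}  -> the runtime library is invoked twice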


5.4.1 IR Interpreter

The IR execution counts of running a radix-2 FFT are captured in Figure 5.4. Under the Dynamic Instruction Count section, all the involved IR instructions, e.g., memory and binary operations, are listed with their corresponding execution counts. The second section, Binary Operation Count, specifically extracts all binary operations along with the bitwidth of the operated data. For example, the integer multiplication mul processes both 32-bit and 64-bit integers in this application. This feature provides more accurate information on all the binary operations in an application, improving the accuracy of estimating the application's workload. Three additional functionalities are implemented to provide more alternatives for use.

• Change the format of the output results to a more machine-readable form by setting the compilation flag -w2e=True.
• Eliminate the overhead of loop counters introduced by conditional statements by setting the compilation flag -eli=True.
• Only apply the IR interpreter to a certain function by passing the function name to the compilation flag -fNameNotR or -fNameInMain. The difference between these two options is described next.

Figure 5.4: Output results of a radix-2 FFT when N=2048

Design Flow

Figure 5.5a illustrates the design flow of the custom LLVM pass called libDynCountPass.so. The detailed C++ implementation of this pass is included in Appendix A. After translating an application to LLVM IR, an LLVM module is generated, which basically is a collection of global variables and functions. Functions contain basic blocks, and each basic block comprises one or more IR instructions.

Figure 5.5: (a) Design flow of the libDynCountPass.so; (b) Compilation steps

In the design flow, both actions, Insert BasicInstCall and Insert BenchmarkCall, operate on basic blocks. The action Insert BasicInstCall inserts a function call before each IR instruction with two arguments: the operation type and the data bitwidth. The action Insert BenchmarkCall is responsible for inserting the special function calls that signal the runtime library to start or stop functioning. The mechanism for eliminating the overhead introduced by loop counters is simple. Typically, a for loop represented in IR is composed of several basic blocks with dedicated names. When iterating through the basic blocks in a function, the pass checks whether the names of the basic blocks contain for.cond or for.inc. If so, the basic block is skipped. In this manner, the IR instructions in these basic blocks are not instrumented and thus not recorded. As mentioned before, the two compilation flags -fNameNotR and -fNameInMain can be used to specify the function name. The essential difference between these two approaches is the place where the special function calls that signal the runtime library are inserted, i.e., inside or outside of the MeasuredFunction.

MeasuredFunction() {
    call startBench();
    .....
    call stopBench();
    ret;
}
Listing 5.3: For non-recursive functions (-fNameNotR)

main() {
    call startBench();
    MeasuredFunction();
    call stopBench();
    .....
}
Listing 5.4: For functions in main (-fNameInMain)

Listing 5.3 shows the first approach, which is applicable to non-recursive functions. At runtime, when entering the MeasuredFunction, the runtime library is informed to start recording execution counts due to the activation of startBench. Similarly, stopBench is inserted before the return (ret) of the MeasuredFunction, aiming at signalling the runtime library to stop recording. This approach can be applied to a non-recursive function no matter where the function appears. However, because the signalling function calls are placed inside the MeasuredFunction, this approach fails to function correctly for recursive functions such as the FFT. To solve this issue, another approach, presented in Listing 5.4, is developed. Contrary to the first approach, it can only be applied to functions that are called directly from main. The way it works is simple: the special function calls startBench and stopBench are inserted above and below the call to the MeasuredFunction, respectively. In this manner, the runtime library is only signaled when the MeasuredFunction is called or returns in main. Note that the MeasuredFunction should be predefined with the noinline attribute, since compilers would otherwise inline this function when applying optimizations during compilation. Inlining replaces the function call site with the body of the called function. For example, when the MeasuredFunction in Listing 5.4 is inlined, the computations in this function are executed in main, and the implemented IR interpreter is incapable of recognizing the function when only the function name is passed as an argument to the compiler. Certainly, a possible solution to the inlining issue is to utilize special pragmas to manually define the scope in which the IR instructions should be recorded. In this work, this solution is not taken, as we prefer automating the extraction of the intrinsic workload by only passing the function name to the compiler, which is more convenient.

Usage

The compilation steps to apply the custom pass libDynCountPass.so are illustrated in Figure 5.5b. The first step is to compile the target application into a bitcode file with the LLVM compiler clang. Note that LLVM bitcode is a binary representation of the textual LLVM IR. Next, the LLVM optimizer and analyzer opt loads the pass libDynCountPass.so and inserts the instrumentation code into the bitcode file. The instrumented bitcode file is then transformed into an object file by the LLVM static compiler llc. Eventually, the linker cc links the object file with the runtime library and produces the final executable.
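The following Python sketch drives these steps with subprocess calls. The clang/opt/llc/cc invocations are standard tools, but the opt option that enables the custom pass ("-dyncount" below) and the runtime library object file are assumed names used only for illustration:

import subprocess

def instrument(src="app.c", runtime="runtime.o", pass_lib="./libDynCountPass.so"):
    run = lambda cmd: subprocess.run(cmd, check=True)
    run(["clang", "-O1", "-emit-llvm", "-c", src, "-o", "app.bc"])            # C -> bitcode
    run(["opt", "-load", pass_lib, "-dyncount", "app.bc", "-o", "app_i.bc"])  # insert calls (pass name assumed)
    run(["llc", "-filetype=obj", "app_i.bc", "-o", "app_i.o"])                # bitcode -> object file
    run(["cc", "app_i.o", runtime, "-o", "app_instrumented"])                 # link with the runtime library

if __name__ == "__main__":
    instrument()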


Validation

The validity of the IR interpreter is checked with both a non-recursive and a recursive function. The 2D matrix multiplication is an example of a non-recursive function, where the multiplications are conducted between two N x N matrices. Hence, to examine the validity for non-recursive functions, the extracted execution counts for the operations mul and add should meet the relation #(mul) = #(add) = N². For a length-N radix-2 FFT, the execution counts of operations such as fmul, fsub, and fadd should be a constant factor of its complexity N log2 N, where N is a power of 2. Figure 5.6 illustrates the execution counts of the operations mul and add in the 2D matrix multiplication for varying N. Regardless of the value of N, the relation #(mul) = #(add) always holds, and a trend line N² seems to be perfectly followed by #(mul) and #(add), i.e., #(mul) = #(add) = N² holds. The relations between the execution counts of the operations and the FFT block size N are presented in Figure 5.7. Similarly, with varying N the execution counts of the operations fmul, fsub, and fadd follow the same trend as the complexity of the radix-2 FFT, i.e., #(fmul) = 4N log2 N and #(fadd/fsub) = 4N log2 N. That is to say, one complex radix-2 butterfly requires 4 multiplications and 4 additions/subtractions, which accords with the description in Section 2.2.3. To conclude, the IR interpreter is shown to function correctly for both non-recursive and recursive applications.

Figure 5.6: The relation between operation execution counts and N in the 2D matrix multiplication

Figure 5.7: The relation between operation execution counts and N in the Radix-2 FFT
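A check of this kind boils down to testing whether the measured counts are a roughly constant multiple of the expected complexity. The Python sketch below shows one way to express that test; the listed counts are placeholders, not the measurements behind Figures 5.6 and 5.7:

import math

def follows_trend(sizes, counts, complexity, rel_tol=0.05):
    """True if counts[i] / complexity(sizes[i]) is roughly constant over all sizes."""
    ratios = [c / complexity(n) for n, c in zip(sizes, counts)]
    mean = sum(ratios) / len(ratios)
    return all(abs(r - mean) / mean < rel_tol for r in ratios)

# Placeholder counts as the IR interpreter might report them for a radix-2 FFT:
sizes = [256, 512, 1024, 2048]
fmul_counts = [4 * n * math.log2(n) for n in sizes]

print(follows_trend(sizes, fmul_counts, lambda n: n * math.log2(n)))   # True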


5.4.2 Intrinsic Transistors

With the IR interpreter, the execution counts of IR instructions can be recorded dynamically when running the executable compiled from the instrumented IR. To weight IR instructions based on their most intrinsic work, they are first broken down into 2-input-1-output and 1-input-1-output gates. In theory, the workload of all these basic gates is equivalent. However, in a CMOS implementation some gates are more complex than others. Therefore, the minimum number of transistors needed to construct a gate in CMOS is used as a practical measure.

%mul = mul i8 %1, %2 Listing 5.5: An example IR instruction of 8-bit integer multiplication

The first step to obtain the most intrinsic transistors is breaking IR instructions into gates. Two approaches are considered in this project. Taking the IR instruction shown in Listing 5.5 as an example, a combinational multiplier can be constructed, based on which a formula to estimate the gate count can be derived. Another approach is to apply an RTL compiler to synthesize the 8-bit integer multiplication written in Verilog. In this project, we first derive the formulas to estimate the gate count for each involved binary operation, and then compare the estimated gate counts to the results of RTL synthesis, aiming at examining the reliability of both the derived and the synthesized results.

Figure 5.8: Interconnections for an 8x8 combinational multiplier [23]

To explain the steps of the gate count estimation, here we still use the 8-bit integer multiplication as an example. Based on the shift-add multiplication algorithm, Figure 5.8 visualizes the interconnections of an 8x8 combinational multiplier for two unsigned integers.


To be more precise, the multiplicand X and multiplier Y are defined as $X = x_7x_6x_5x_4x_3x_2x_1x_0$ and $Y = y_7y_6y_5y_4y_3y_2y_1y_0$, respectively. Each rectangular box labeled $y_ix_i$ indicates the logic AND of $y_i$ and $x_i$, and each "+" box represents a full adder. Connecting the carries of the full adders in each row forms an 8-bit ripple adder [23]. In this manner, a formula can be derived to estimate the gate count G for this 8-bit multiplication as follows:

\[
\begin{aligned}
G(8\text{-bit unsigned MUL}) &= G(\mathrm{AND}) \cdot \#(\mathrm{AND}) + G(\mathrm{RA}_{8\text{-bit}}) \cdot \#(\mathrm{RA}_{8\text{-bit}}) \\
&= \#(\mathrm{AND}) + 8 \cdot G(\mathrm{FA}) \cdot \#(\mathrm{RA}_{8\text{-bit}}) \\
&= \#(\mathrm{AND}) + 40 \cdot \#(\mathrm{RA}_{8\text{-bit}}) \\
&= 8^2 + 40 \cdot 7 = 344
\end{aligned} \tag{5.3}
\]

In the case of an n-bit unsigned multiplication, it can be further inferred that:

\[
\begin{aligned}
G(n\text{-bit unsigned MUL}) &= G(\mathrm{AND}) \cdot \#(\mathrm{AND}) + G(\mathrm{RA}_{n\text{-bit}}) \cdot \#(\mathrm{RA}_{n\text{-bit}}) \\
&= n^2 + 5n \cdot (n-1) \\
&= 6n^2 - 5n
\end{aligned} \tag{5.4}
\]

Note that RA and FA above denote the ripple adder and the full adder, respectively. Equation 5.4 is a generic function that takes a bitwidth n as input and outputs the estimated gate count for an n-bit unsigned multiplication. Using the same estimation approach, formulas for other operations, including floating-point operations, are listed in Table 5.1.

Table 5.1: Derived formulas to estimate gate count for binary operations

Operation | Formula | Applied algorithm
Unsigned ADD/SUB¹ | 5n | n-bit ripple-carry adder
Unsigned MUL¹ | 6n² − 5n | shift-add multiplier (Fig. 5.8)
Unsigned DIV¹ | 6n² | non-restoring array divider
Floating-point MUL² | 6s² − 9s + 15e + 4 | including an (s−1)-bit unsigned multiplication
Floating-point DIV² | 24s² − 40s + 15e + 17 | including a 2(s−1)-bit unsigned division

¹ n is the bitwidth of the operated data.
² s and e represent the number of significant bits and exponent bits defined by IEEE 754 for floating-point arithmetic.
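The formulas in Table 5.1 are straightforward to encode; the Python sketch below implements them directly (the final print statement is only an illustrative invocation, not a reference result):

# Direct encoding of the gate-count estimation formulas from Table 5.1.
def gates_unsigned_add(n):      # n-bit ripple-carry adder
    return 5 * n

def gates_unsigned_mul(n):      # shift-add multiplier, Eq. 5.4
    return 6 * n * n - 5 * n

def gates_unsigned_div(n):      # non-restoring array divider
    return 6 * n * n

def gates_fp_mul(s, e):         # s significand bits, e exponent bits
    return 6 * s * s - 9 * s + 15 * e + 4

def gates_fp_div(s, e):
    return 24 * s * s - 40 * s + 15 * e + 17

assert gates_unsigned_mul(8) == 344        # agrees with the worked example in Eq. 5.3
print(gates_unsigned_add(32), gates_fp_mul(24, 8))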

After proposing the formulas to estimate the gate counts for the diverse binary operations, the Cadence RTL compiler synthesizes the operations written in Verilog. A Verilog example of an 8-bit unsigned multiplication is shown in Listing 5.6 [24]. Note that only basic gates with fewer than 3 inputs are employed in synthesis, i.e., attributes are set to allow only the 2-input gates AND, NAND, OR, NOR, XOR, and XNOR, and the 1-input INV.

module unsigned_multiply (y, a, b);
    parameter wA = 8, wB = 8;
    input  [wA-1:0] a;
    input  [wB-1:0] b;
    output [wA+wB-1:0] y;
    assign y = a * b;
endmodule
Listing 5.6: An 8-bit unsigned integer multiplication in Verilog

Figure 5.9 shows the comparison between the estimated and the synthesized gate counts of the binary operations.

The gate estimations of the unsigned addition, subtraction, floating-point multiplication and division give quite similar results as the synthesis, with less than 10% deviation. However, for both the unsigned multiplication and division, the established formulas overestimate the number of gates actually required in synthesis. With increasing operand bitwidth, the overestimation of the unsigned multiplication tends to increase, from -6% to 12.9%, eventually reaching 19.7% for 64 bits. In contrast, the overestimation of the unsigned division decreases from 40% to 34%. The overestimation can be explained by the fact that the actual Cadence implementations use more efficient designs than a series of subtractors or adders to perform divisions and multiplications. Overall, the formulas derived from the combinational designs sufficiently validate the synthesis results. Appendix B lists the Verilog code used for synthesis. With the synthesis results, the diverse gates are further weighted by the minimum number of transistors required for each basic CMOS logic gate. Table 5.2 lists the weights of 7 basic gates. More details, including the weights of IR instructions in units of gates, transistors, and logic depth, are summarized in Table 5.3.

Table 5.2: Basic gates weighted in number of transistors

Gate   | AND | NAND | OR | NOR | XOR | XNOR | INV
Trans. |  4  |  4   |  4 |  4  |  8  |  8   |  2

Table 5.3: IR instructions weighted in gate, transistor and logic depth

IR | Gate (32-bit) | Trans. (32-bit) | Depth (32-bit) | Gate (64-bit) | Trans. (64-bit) | Depth (64-bit)
add/sub¹ | 188 | 880 | 63 | 380 | 1776 | 127
fadd/fsub | 1905 | 8086 | 103 | 3494 | 14882 | 233
mul | 5164 | 25458 | 130 | 20100 | 98476 | 313
fmul | 4327 | 20490 | 97 | 16124 | 77246 | 215
udiv | 4486 | 10763 | 1128 | 18217 | 44752 | 4350
sdiv | 4777 | 21300 | 1189 | 18840 | 84616 | 4458
fdiv | 12190 | 54146 | 992 | 66565 | 303284 | 2295
urem | 4616 | 20426 | 1168 | 18452 | 82524 | 4440
srem | 4882 | 21842 | 1228 | 19053 | 85682 | 4552
and | 32 | 128 | 1 | 64 | 256 | 1
or | 32 | 128 | 1 | 64 | 256 | 1
xor | 32 | 256 | 1 | 64 | 512 | 1

¹ 16-bit add/sub operation: #(gate)=92, #(trans.)=432, #(depth)=31; 8-bit add/sub operation: #(gate)=44, #(trans.)=208, #(depth)=15; 1-bit add/sub operation: #(gate)=2, #(trans.)=12, #(depth)=1.
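To show how these weights are used, the Python sketch below combines a few of the 32-bit transistor weights from Table 5.3 with dynamic execution counts, as reported by the IR interpreter, to obtain the intrinsic-transistor total #(IT); the example counts are hypothetical:

# A few per-instruction transistor weights copied from the 32-bit columns of Table 5.3.
TRANSISTOR_WEIGHT_32 = {
    "add": 880, "sub": 880, "fadd": 8086, "fsub": 8086,
    "mul": 25458, "fmul": 20490, "fdiv": 54146,
}

def intrinsic_transistors(execution_counts, weights=TRANSISTOR_WEIGHT_32):
    """Weight dynamic IR execution counts by their per-operation transistor cost."""
    return sum(weights[op] * n for op, n in execution_counts.items())

# Hypothetical counts as the IR interpreter might report them:
counts = {"fmul": 1_000_000, "fadd": 1_000_000, "add": 3_000_000}
print(intrinsic_transistors(counts))    # total intrinsic transistors, #(IT)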


Figure 5.9: The gate count comparison between estimated and simulated results

Chapter 6

Methodologies

In this section, the involved metrics are clarified: how each metric is quantified and what each metric means are described in detail. Note that a new metric, parallelism, is introduced in this section. In addition, the approaches applied to approximate energy consumption and area usage are explained.

6.1 Flexibility

As defined in Section 3, processor flexibility is determined relative to other metrics, such as performance, energy and area efficiency; i.e., applying the flexibility measure to normalized performance based on execution time (ET) gives a performance-flexibility value. Benefiting from the inherent property of the flexibility measure of being invariant to multiplicative relations, measuring flexibility in terms of energy and area efficiency gives the same flexibility value. Since TDP and transistor count are per-processor properties, independent of the executed applications, it can be inferred:

\[
\begin{aligned}
\text{Flexibility}_{\mathrm{perf}} &= GSD\!\left(\frac{\#(IT)}{ET}\right) \\
&\overset{\text{Lemma 3.2.1}}{=} GSD\!\left(\frac{\#(IT)}{ET \cdot Power}\right) = \text{Flexibility}_{\mathrm{energy\ efficiency}} \\
&\overset{\text{Lemma 3.2.1}}{=} GSD\!\left(\frac{\#(IT)}{ET \cdot Area}\right) = \text{Flexibility}_{\mathrm{area\ efficiency}}
\end{aligned} \tag{6.1}
\]
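The multiplicative invariance behind Eq. 6.1 is easy to verify numerically. In the minimal Python sketch below (with hypothetical per-benchmark values and per-processor constants), dividing every benchmark score by the same TDP or transistor count leaves the GSD unchanged:

import math

def gsd(values):
    """Geometric standard deviation: exp of the std. dev. of ln(x)."""
    logs = [math.log(v) for v in values]
    mu = sum(logs) / len(logs)
    return math.exp(math.sqrt(sum((l - mu) ** 2 for l in logs) / len(logs)))

# Hypothetical normalized performance #(IT)/ET of one processor over 4 benchmarks:
perf = [2.0e9, 8.0e9, 1.5e9, 6.0e9]
power, area = 65.0, 1.75e9      # per-processor constants (TDP in W, transistor count)

energy_eff = [p / power for p in perf]
area_eff = [p / area for p in perf]

print(gsd(perf), gsd(energy_eff), gsd(area_eff))   # three identical values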

6.2 Secondary Metrics

The metrics normalized performance, energy efficiency, and area efficiency are the secondary metrics relative to which flexibility is measured. The GM is used to average these metrics, as multiplicative relations are introduced implicitly in their definitions.

6.2.1 Performance

Normalizing the execution time to the estimated intrinsic workload gives the normalized performance. After normalization, performance is expressed as the amount of intrinsic transistors per second. A processor that can complete more intrinsic transistors per second is considered to have better performance.

\[ \text{Performance} = \frac{\#(IT)}{ET} \tag{6.2} \]


6.2.2 Energy Efficiency

Expressed as the amount of intrinsic transistors per joule, energy efficiency is defined as the quotient of performance and power. The more intrinsic transistors that can be completed per joule of energy, the better the energy efficiency.

\[ \text{Energy Efficiency} = \frac{\#(IT)}{ET \cdot Power} \tag{6.3} \]

6.2.3 Area Efficiency

Area is quantified as the number of transistors implemented on a processor, which is independent of the process technology. In this manner, area efficiency is formulated as performance divided by the total number of actual transistors (AT), yielding a transistor utilization rate per second. If more intrinsic transistors can be completed with a given number of transistors, or fewer actual transistors are required to maintain the same performance, the area efficiency is better.

\[ \text{Area Efficiency} = \frac{\#(IT)}{ET \cdot \#(AT)} \tag{6.4} \]

6.3 Parallelism

Parallelism in this context is defined as the intrinsic workload divided by the total number of cycles, i.e., the amount of intrinsic transistors per cycle. The more intrinsic transistors that can be accomplished within one cycle, the more parallelism at the transistor level is achieved.

\[ \text{Parallelism} = \frac{\#(IT)}{Cycles} \tag{6.5} \]
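For reference, the Python sketch below computes the four secondary metrics of Eqs. 6.2-6.5 for one platform and one benchmark; all input numbers are hypothetical placeholders:

def secondary_metrics(it, exec_time_s, power_w, actual_transistors, cycles):
    """Secondary metrics of Eqs. 6.2-6.5 from intrinsic transistors (IT) and raw measurements."""
    performance = it / exec_time_s                        # intrinsic transistors per second  (6.2)
    energy_eff = it / (exec_time_s * power_w)             # intrinsic transistors per joule   (6.3)
    area_eff = it / (exec_time_s * actual_transistors)    # transistor utilization per second (6.4)
    parallelism = it / cycles                             # intrinsic transistors per cycle   (6.5)
    return performance, energy_eff, area_eff, parallelism

print(secondary_metrics(it=3.2e13, exec_time_s=0.8, power_w=64.0,
                        actual_transistors=1.75e9, cycles=2.7e9))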

6.4 Approximation

Measuring processors in terms of performance, energy and area efficiency remains challenging, as it is difficult to unify experimental measurements on processors across different technologies. Besides, processors like the FPGAs and DSPs in this work can only be assessed via simulations. Hence, the estimation of power consumption and area usage is conducted by utilizing existing information and tools, as provided in Table 6.1 and Table 6.2. Note that Table 6.1 lists transistor counts and the thermal design power (TDP), which represents the average power dissipated across all cores when the processor operates at its base frequency [25]. When calculating energy and area efficiency, the same TDP value and transistor count are used for both single-threaded and multi-threaded mode. As the whole multicore processor is considered as one system, utilizing only a single thread in this system leaves the other threads idle, resulting in energy and area overhead. Similarly, the transistor count of an FPGA is used for both the native implementation and the optimized version.


Table 6.1: Related information of processors in this work

No. | Processor | Freq. (MHz) | #Trans. (million) | TDP (watt) | Node (nm) | Ref.
1 | GTX TITAN | 837 | 7100 | 250 | 28 | [26][27]
2 | GTX 570 | 732 | 3000 | 219 | 40 | [26]
3 | GTX 750 TI | 1020 | 1870 | 60 | 28 | [26]
4 | Tegra K1 | 756 | -¹ | 14 | 28 | [26][28]
5 | Pentium 4 | 3400 | 169 | 84 | 90 | [25]
6 | i7-920 | 2670 | 731 | 130 | 45 | [25]
7 | i7-950 | 3070 | 731 | 130 | 45 | [25]
8 | i7-960 | 3200 | 731 | 130 | 45 | [25]
9 | i7-4770 | 3400 | 1400 | 84 | 22 | [25][29]
10 | i7-6700 | 3400 | 1750 | 64 | 14 | [25][30]
11 | ARM1176 | 700 | - | 2.9 | 40 | [31]
12 | Cortex A9 | 1700 | - | 4 | 32 | [32]
13 | Cortex A15 | 2300 | - | 5 | 23 | [33][34]
14 | Cortex A53 | 1200 | - | 4.4 | 40 | [31]
15 | TI C6747 | 300 | 22² | 0.45³ | 65 | [35][36]
16,19 | Hexagon V5 | 650 | - | - | 28 | [37]
17,18 | Hexagon V60 | 2000 | - | - | 14 | [38]

¹ "-" means no information available.
² Speculated based on TI C66x DSPs [35].
³ Estimated by TI Power Estimation Spreadsheet 2013.3.

Table 6.2: Related information of FPGAs in this work

Processor | Freq. (MHz)¹ no opt. | Freq. (MHz)¹ opt. | #Trans.² (million) | Power (watt)³ no opt. | Power (watt)³ opt. | Node (nm)
Artix7 | 116 | 102 | 1025 | 1.4 | 2 | 28
Kintex7 | 120 | 103 | 2370 | 1.7 | 2.5 | 28
Virtex7 | 118 | 105 | 9700 | 2.3 | 3 | 28
Zynq | 118 | 96 | 2200 | 1.7 | 2.6 | 28
Virtexuplus | 118 | 104 | 14000 | 4.1 | 5.5 | 16
Kintexu | 120 | 101 | 5300 | 2.1 | 3.4 | 16
Zynquplus | 117 | 103 | 4100 | 2.1 | 2.8 | 16

¹ Determined by the average estimated clock period computed by Vivado HLS when the target clock period is 10 ns.
² Speculated based on a published value of the Virtex UltraScale XCVU440 [39].
³ Estimated by Xilinx Power Estimator (XPE) 2018.2.2 based on the average design utilization of the benchmark set.

Chapter 7

Results

In this section the flexibility results are analyzed, along with the relation between flexibility and the secondary metrics among GPUs, CPUs, FPGAs, and DSPs. Next, several promising graphs are presented to further investigate implicit relations between other metrics, such as energy efficiency and area efficiency. All detailed results are included in Appendix C.

Figure 7.1: Scatter matrix of metrics to explore the implicit relations. All results are normalized to 14 nm process technology.

A scatter matrix is provided to review the relations among the diverse metrics by observing the pairwise distributions, as shown in Figure 7.1. The diagonal, where each metric is plotted against itself, represents the kernel density estimation of each architecture class, indicating the underlying distribution.

Take the top-left flexibility-flexibility graph as an example: as the red curve suggests, a multi-threaded CPU is most likely to be seen near 0.35 when assessing its flexibility. Note that except for flexibility, the absolute values of the other metrics are presented on a log scale, e.g., 14 actually denotes 10^14. The relations between flexibility and the other secondary metrics are captured in the first column of Figure 7.1. In general, intuitive inverse relationships can be observed between flexibility on the one hand and performance, energy efficiency, and parallelism on the other, where GPUs have superior performance, energy efficiency, and the highest parallelism, but also the least flexibility. In contrast, FPGAs are the most flexible while being the least energy and area efficient. These results confirm the intuitive trade-offs between performance, energy efficiency, and flexibility. However, regarding flexibility and area efficiency, only a slight declining trend can be observed among GPUs, CPUs, and DSPs, whereas a large gap exists between FPGAs and the other processors when ordering by area efficiency. Hence, the tradeoff between flexibility and area efficiency is less explicit than for performance and energy efficiency. Next, more detailed discussions revolve around flexibility. The goal is to analyze how the architecture classes perform in terms of flexibility, and to find the underlying reasons for the observed flexibility differences.

7.1 Flexibility Analysis

As can be observed from Figure 7.2, with flexibility on the horizontal axis, the processors form clusters based on their architectures, which validates that the proposed metric sufficiently distinguishes different architecture classes in terms of flexibility.

(a) Performance and flexibility

(b) Energy efficiency and flexibility (c) Area efficiency and flexibility

Figure 7.2: Visualizing the implicit relations between flexibility and other secondary metrics


7.1.1 Native Flexibility Results

Without manual code optimizations, the architecture classes sorted from the least to the most flexible are: GPUs, CPUs, DSPs, and FPGAs.

GPU

GPUs have superior performance and energy efficiency. As might be expected, this enormous performance advantage comes at the expense of flexibility, which implicitly reveals that GPUs with their SIMD architecture can deliver superior performance by being specialized for massively parallel processing, while being incapable of supporting sequential applications equally well. Only 192 CUDA cores are employed in the embedded GPU Tegra K1, 14x fewer than in the GTX TITAN. Combined with a lower clock speed and 3x smaller on-chip memories, this results in nearly 38x higher performance of the GTX TITAN over the Tegra K1, but only a 2x difference in energy efficiency. However, when comparing flexibility, the embedded GPU is measured to be more flexible than all desktop GPUs, among which the GTX TITAN, having the most cores, is the least flexible. Therefore it can be extrapolated that employing more CUDA cores to achieve more parallelism incurs an increasing dispersion among applications. That is to say, it enlarges the difference between the applications that can fully benefit from more parallelism and the applications that can hardly benefit, and therefore flexibility decreases.

CPU

The CPU, as a typical representative of general-purpose processors, is considered by some to be the processor with the greatest flexibility [4]. However, in this work, both DSPs and FPGAs with native implementations are evaluated to be more flexible than CPUs based on the proposed metric. Contrary to popular belief, these results are logical if we consider the fact that an increasing number of advanced techniques have been employed in modern CPUs to improve the average computation speed, such as advanced branch predictors and deep pipelines. Inevitably, system behavior may vary significantly with the increase in architectural complexity, making the processor less flexible. When comparing flexibility among CPUs, embedded CPUs (ARM) and desktop CPUs (Intel) form two distinct green clusters. Compared to the embedded CPUs, the desktop CPUs with higher performance are measured to be more flexible on average. Amongst all benchmarked Intel CPUs, the i7-4770 with Haswell (4th generation) and the i7-6700 with Skylake (6th generation) seem to be more flexible than the outdated Pentium 4 with NetBurst and most processors with the Nehalem micro-architecture (1st generation). To some extent, this indicates that with upgraded architectures, Intel CPUs tend to improve over time in terms of flexibility.

DSP

VLIW DSPs, as domain-specific processors, are evaluated to be more flexible than general-purpose CPUs on average, which is somewhat expected. Besides the fact that inflexibility emerges in modern CPUs, DSPs have also become more than just "dedicated processors with multiply-accumulate". Traditional DSPs are designed with specialized functionalities, dedicated memory paths, and few registers. Contrary to traditional DSPs, well-connected functional units, orthogonal datapaths, and a large number of registers are employed in VLIW DSPs [40]. Therefore, DSPs with the VLIW architecture are gradually mutating toward GPPs. Another interesting observation is that the TI C6747 DSP is less flexible than the Hexagon DSPs. This result contradicts the hypothesis that the introduction of multi-threading degrades flexibility because it fails to benefit all applications equally. The speedup calculation of the TI DSP relative to other processors shows that the TI DSP performs extremely poorly on the application that implements the alternating direction implicit solver, where extensive floating-point divisions (FDIV) are performed. If this application is excluded, the TI DSP gains 40% in flexibility, resulting in higher flexibility than the two Hexagon DSPs and only slightly lower flexibility than the FPGAs. The results obtained by micro-benchmarking the TI DSP with CCSv4 show that nearly 9 · 10³ cycles are taken to simulate a single double-precision FDIV.

The reason behind this is that the default CCS compiler does not automatically invoke the fast FDIV function from the runtime library when compiling the FDIV operation represented by "/" in C. Only 367 cycles per FDIV are required when the runtime library is used [41]. However, since compilers are a part of the processors to be evaluated, lacking sufficient support for FDIV within the involved application domain leads to inflexibility. Furthermore, in case all operations can be supported equally by all DSPs, single-threading delivers higher flexibility compared to multi-threading.

FPGA

FPGAs with native implementations perform just as poorly as DSPs, while being measured as the most flexible. The reason is intuitive: FPGAs offer reconfigurable logic components which can be reloaded with new configurations for arbitrary applications. Besides, without manual work, the current HLS compiler, as a part of the processor, barely optimizes the native designs, and therefore cannot fully exploit the available resources to achieve massive parallelism; e.g., the average resource utilization rates of the UltraScale+ FPGA boards are only about 1%. In this manner, the unoptimized implementations can be supported by all seven FPGAs without resource constraints, resulting in a compact cluster of FPGAs regarding flexibility, as shown in the performance-flexibility graph of Figure 7.2.

7.1.2 Compiler Directives

As processors cannot be fully exploited with default compilation, some compiler directives are inserted to investigate the resulting changes in flexibility.

CPU

Shifts in flexibility are introduced when activating multi-threaded processing, which in general contributes to modest improvements in performance, energy, and area efficiency, regardless of embedded or desktop CPUs. How much can be gained from multi-threading depends on the number of available hardware threads and the dependencies in the applications. For instance, on an Intel i7-960 with 8 threads, a nearly 6x performance improvement is achieved for matrix multiplications. On the contrary, multi-threaded processing causes resource contention and cannot utilize the hardware efficiently for applications without sufficient task-level parallelism. Additionally, as observed from the performance-flexibility graph, the horizontal shifts of the 4-core processors i7-6700 and i7-4770 are approximately 3x-4x larger than the shifts of the processors that have the same number of cores but the Nehalem micro-architecture. On the other hand, comparing the Pentium 4 to the other processors with 8 threads, the outdated Pentium 4 suffers a smaller flexibility penalty in multi-threading, along with the smallest performance improvement; the Pentium 4, with a single core, can run two threads due to its hyper-threading technique. Overall, this confirms that executing in multi-threading mode degrades flexibility because not all applications can benefit equally from multi-threading. The decline in flexibility is influenced by both the architectural design and the number of available threads: processors with fewer hardware threads tend to suffer less in terms of flexibility when processing in multi-threading mode.

DSP

To verify the hypothesis that best-effort designs such as a cache hierarchy affect system flexibility negatively, the two Hexagon DSPs simulated with perfect caches, no system stalls, and a simplified multi-threading model are included, represented by points 18 and 19 in Figure 7.2a. The constructed ideal DSPs are more flexible than the FPGAs without optimizations, verifying the hypothesis.

FPGA

For the FPGAs, directives are applied to explore the impact of the number and the type of resources on flexibility. Compared to the native FPGA implementations, applying optimization directives contributes only 3x-10x improvements in performance, energy, and area efficiency, while the flexibility sharply declines.


Exploiting more parallelism typically requires more resources. A large variation among applications is introduced when some applications can fully utilize the resources to parallelize computations while other applications cannot. Failing to utilize resources may result from requiring more resources than are available, or from being limited by dependencies in the applications. Besides, the majority of resource limits most likely occur in the DSP slices. Take the symmetric rank-k update as an example: with directives, more than 3000 DSP slices are required, contributing to a more than 130x performance improvement. The heavy demand for DSP slices can only be satisfied by the Virtexuplus and the Kintexu; note that relatively many DSP slices are present in the Kintexu, as it is designed specially for signal processing. The large number of DSP slices benefiting only a few applications exacerbates the disparity, reducing flexibility. Overall, if the data dependencies in the applications allow, significant performance improvements can be achieved by utilizing more resources. In this manner, FPGAs with relatively more resources have higher performance and area/energy efficiency than FPGAs with fewer resources. However, inflexibility is incurred as not all applications can be highly parallelized and thus cannot benefit equally.

7.2 Other results

Figure 7.1 includes several interesting graphs that reveal the underlying relations among different metrics. Amongst them, two graphs deserve some discussion and will be further analyzed.

7.2.1 Performance vs Parallelism

Figure 7.3: Visualizing the implicit relation between performance and parallelism

Figure 7.3 illustrates the relation between performance and parallelism, where an intuitive linear relation can be observed when log-scaling these two metrics. More parallelism leads to higher performance. GPUs with SIMD architectures achieve the most parallelism. Compared to single-threaded CPUs, more parallelism and higher performance are observed for multi-threaded CPUs, which is also expected. The clusters of FPGAs seem to follow a linear trend indicated by Line 2, whilst Line 1 is constructed to represent the general trend of the Intel CPUs. As observed in Figure 7.3, Line 2 appears to be Line 1 shifted downwards. To explain the reason behind these two parallel trend lines, a short mathematical justification is given.

\[
\text{Parallelism} = GM\!\left(\frac{\#(IT)}{ET \cdot f}\right) = \left(\prod_{i=1}^{n} \frac{\#(IT)_i}{ET_i \cdot f}\right)^{\!1/n} = \frac{1}{f} \cdot \left(\prod_{i=1}^{n} \frac{\#(IT)_i}{ET_i}\right)^{\!1/n} = \frac{1}{f} \cdot \text{Performance} \tag{7.1}
\]

where n is the total number of applications and f represents the processor frequency. As implied by Equation 7.1, for each processor its average performance can be expressed as its frequency multiplied by its parallelism.

The next step is to log-scale both parallelism and performance; thus it can be inferred:

\[ \lg(\text{Performance}) = \lg(f) + \lg(\text{Parallelism}) \tag{7.2} \]

where lg(Performance) and lg(Parallelism) represent the Y axis (Performance) and the X axis (Parallelism), respectively, in Figure 7.3. After log scaling, the performance Y and parallelism X of each processor can be formulated as in Equation 7.2. Therefore, it can be extrapolated that, for each processor, after log scaling its performance is proportional to its parallelism and satisfies the linear relation Y = c + X, where the constant c is determined by its frequency. In this manner, frequency is the main factor that influences the spread of each architecture class in terms of parallelism. GPUs seem to sacrifice some clock speed in order to achieve massive parallelism. FPGAs, designed as reconfigurable devices, can hardly execute at high frequency. Hence, to achieve the same magnitude of performance as other processors, more parallelism is required for FPGAs executing at relatively low frequencies.

7.2.2 Area Efficiency vs Energy Efficiency

Figure 7.4: Visualizing the implicit relation between area and energy efficiency

Intuitively, a proportional relation between area and energy efficiency can be observed in Figure 7.4, which indicates that an area-efficient processor is most likely to also have high energy efficiency. Among the processors, GPUs are the most area and energy efficient. Next, the multi-threaded CPUs, the DSPs, and the optimized FPGAs achieve the same magnitude of energy efficiency, while the multi-threaded CPUs are nearly 100x more area efficient than the optimized FPGAs. More remarkably, different tendencies seem to exist between FPGAs and the other processors. FPGAs tend to follow a linear relation indicated by Line 1, whilst another line, Line 2, can be constructed to roughly represent the relation between energy and area efficiency for GPUs, CPUs, and DSPs. As the slopes of these two trend lines suggest, for the same increment in area efficiency, FPGAs can hardly gain an equivalent increase in energy efficiency compared to the other processors. The energy inefficiency may be caused by a high static power consumption on FPGA boards regardless of the specific design. Convincing evidence can be found in Table 6.2: with optimizations, the power consumption increases by less than 50% on average. The largest FPGA, the Virtexuplus, seems to suffer the highest overhead caused by static power consumption; without optimizations its power consumption is already 4.1 W, and only a 1.4 W increment is introduced when applying optimizations. Therefore, compared to other instruction-based processors, FPGAs attain less improvement in energy efficiency, which might be due to their high static power leakage.

Chapter 8

Conclusions

As the exponential growth of computing power has halted due to the end of technology scaling, future advancements have to be made on the architectural side. Therefore, it is essential for researchers to understand and then exploit a variety of tradeoffs in architectural design. In particular, a tradeoff between flexibility and performance/energy efficiency has been frequently claimed and is used in diverse contexts. However, a generically applicable, quantifiable flexibility metric is still missing. In this work a quantifiable flexibility metric that can be applied generically across diverse platforms and benchmarks is proposed, which allows a valid quantitative comparison between processors. Along with a novel normalization method based on the intrinsic workload of applications, this metric evaluates 24 platforms with 14 benchmarks. Being capable of clearly distinguishing diverse architecture classes, this flexibility metric suggests that the major architecture classes ordered from the least to the most flexible are: GPUs, CPUs, DSPs, and FPGAs. Through the flexibility analysis, several techniques that tend to degrade processor flexibility are observed.

• Operations that lack sufficient support, such as floating-point division.
• Parallelism techniques that fail to benefit all applications equally, such as multithreading. Increasing the degree of parallelism typically decreases flexibility.
• Best-effort designs that introduce significant performance variation, such as caches.

With the proposed flexibility metric, the relations between flexibility and performance/energy/area efficiency have been investigated. The tradeoffs between flexibility and performance/energy efficiency are confirmed, while the tradeoff between flexibility and area efficiency is less explicit. Moreover, an interesting parallelism metric is proposed, based on which a straightforward concept is verified: more parallelism results in higher performance. To achieve the same performance as the other architecture classes, more parallelism is required for FPGAs. In addition, the relation between area and energy efficiency is also explored, which indicates that an area-efficient processor is most likely to also have high energy efficiency. Among the diverse architecture classes, reconfigurable devices can hardly attain the same improvement in energy efficiency for the same area efficiency. Finally, to facilitate the wide application of the proposed metric, an open source tool1 is released to automatically extract the intrinsic work of arbitrary applications. Overall, this work provides a starting point in assessing processor flexibility, and we aim to raise awareness and discussion in future computer architecture design, eventually contributing to a generation of new flexible processors.

1https://surfdrive.surf.nl/files/index.php/s/gQiDoItf3cHqPkk

Chapter 9

Future Recommendations

The quantifiable flexibility metric proposed in this work is a starting point for investigating and exploiting processor flexibility. The limitations of this work suggest some possible improvements.

• Utilizing actual power measurements instead of the power values from literature or TDPs, as the average power may vary depending on the executed applications.
• Exploring another approach to quantify the area of processors, as most transistor counts applied in this work are based on speculation. The available information on the transistor counts of modern processors is insufficient, in particular for all ARM CPUs, DSPs, and FPGAs. Missing the transistor counts of the ARM processors makes it impossible to distinguish embedded and desktop CPUs in area efficiency.

Recommendations for future research in further investigating flexibility are summarized:

• Applying the intrinsic work extracted by the workload estimator to the processor versatility presented in Eq. 2.3 [14]. In this manner, a unit of work is defined as an application, and the amount of useful work specified for the application is the number of intrinsic transistors extracted by the workload estimator. However, determining the amount of information to specify each application remains challenging.

Bibliography

[1] M. Dubois, M. Annavaram, and P. Stenström, Parallel Computer Organization and Design. New York, NY, USA: Cambridge University Press, 2012.

[2] R. Hameed, Balancing Efficiency and Flexibility In Specialized Computing. PhD thesis, Stanford University, 2013.

[3] L. Null and J. Lobur, The Essentials of Computer Organization and Architecture. USA: Jones and Bartlett Publishers, Inc., 4th ed., 2014.

[4] K. Karuri and R. Leupers, Application Analysis Tools for ASIP Design: Application Profiling and Instruction-set Customization. Springer Publishing Company, Incorporated, 2014.

[5] R. Fasthuber, F. Catthoor, P. Raghavan, and F. Naessens, Energy-Efficient Communication Processors: Design and Implementation for Emerging Wireless Systems. Springer Publishing Company, Incorporated, 2013.

[6] A. Lukefahr, S. Padmanabha, R. Das, R. Dreslinski, Jr., T. F. Wenisch, and S. Mahlke, "Heterogeneous microarchitectures trump voltage scaling for low-power cores," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, (New York, NY, USA), pp. 237–250, ACM, 2014.

[7] G. Chryssolouris, "Flexibility and its measurement," CIRP Annals, vol. 45, no. 2, pp. 581–587, 1996.

[8] P. H. Brill and M. Mandelbaum, "On measures of flexibility in manufacturing systems," International Journal of Production Research, vol. 27, no. 5, pp. 747–756, 1989.

[9] W. Kellerer, A. Basta, P. Babarczi, A. Blenk, M. He, M. Klugel, and A. M. Alba, "How to measure network flexibility? A proposal for evaluating softwarized networks," IEEE Communications Magazine, pp. 2–8, 2018.

[10] E. Lannoye, D. Flynn, and M. O'Malley, "Evaluation of power system flexibility," IEEE Transactions on Power Systems, vol. 27, pp. 922–931, May 2012.

[11] V. Oree and S. Z. S. Hassen, "A composite metric for assessing flexibility available in conventional generators of power systems," Applied Energy, vol. 177, pp. 683–691, 2016.

[12] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA heterogeneous multi-core architectures for multithreaded workload performance," in Proceedings of the 31st Annual International Symposium on Computer Architecture, pp. 64–75, June 2004.

[13] E. Tomusk, C. Dubach, and M. O'Boyle, "Measuring flexibility in single-ISA heterogeneous processors," in 2014 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 495–496, Aug 2014.


[14] K. van Berkel, "Processor versatility (flexibility) - an attempt at definition and quantification," 2014.

[15] R. R. Wilcox and H. J. Keselman, "Modern robust data analysis methods: measures of central tendency," Psychological Methods, vol. 8, no. 3, pp. 254–274, 2003.

[16] M. MN and B. MJ, "What does it mean? A review of interpreting and calculating different types of means and standard deviations," Pharmaceutics, April 2017.

[17] P. J. Fleming and J. J. Wallace, "How not to lie with statistics: The correct way to summarize benchmark results," Commun. ACM, vol. 29, pp. 218–221, Mar. 1986.

[18] European Union and Joint Research Centre - European Commission, Handbook on Constructing Composite Indicators: Methodology and User Guide, 2008.

[19] A. Jain, K. Nandakumar, and A. Ross, "Score normalization in multimodal biometric systems," Pattern Recognition, vol. 38, no. 12, pp. 2270–2285, 2005.

[20] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a high-level language targeted to GPU codes," in 2012 Innovative Parallel Computing (InPar), pp. 1–10, May 2012.

[21] Qualcomm, "Hexagon simulator user guide," December 2016.

[22] Xilinx, "Vivado design suite user guide: High-level synthesis," April 2017.

[23] J. F. Wakerly, Digital Design: Principles and Practices. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 3rd ed., 2000.

[24] Cadence, "HDL modeling in Encounter RTL Compiler," June 2016.

[25] Intel, https://ark.intel.com/#@Processors, accessed 2019-1-8.

[26] Nvidia, "GeForce," https://www.nvidia.com/en-us/geforce/, accessed 2019-1-8.

[27] C. Angelini, "GeForce GTX Titan X review: Can one GPU handle 4K?," 2015. https://www.tomshardware.com/reviews/nvidia-geforce-gtx-titan-x-gm200-maxwell,4091.html, accessed 2019-1-8.

[28] P. M. M. Pereira, P. Domingues, N. Rodrigues, G. Falcao, and S. De Faria, "Assessing the performance and energy usage of multi-CPUs, multi-core and many-core systems: The MMP image encoder case study," International Journal of Distributed and Parallel Systems, vol. 7, pp. 01–20, 09 2016.

[29] AnandTech, "The Haswell review: Intel Core i7-4770K & i5-4670K tested," 2013. https://www.anandtech.com/show/7003/the-haswell-review-intel-core-i74770k-i54560k-tested, accessed 2019-1-8.

[30] Intel, "Inside 6th gen Intel Core: New code named Skylake," 2016. https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.23-Tuesday-Epub/HC28.23.90-High-Perform-Epub/HC28.23.911-Skylake-Doweck-Intel_SK3-r13b.pdf, accessed 2019-1-8.

[31] M. Cloutier, C. Paradis, and V. Weaver, "A Raspberry Pi cluster instrumented for fine-grained power measurement," Electronics, vol. 5, p. 61, 09 2016.

[32] D. Abdurachmanov, P. Elmer, G. Eulisse, and S. Muzaffar, "Initial explorations of ARM processors for scientific computing," Journal of Physics: Conference Series, vol. 523, no. 1, p. 012009, 2014.


[33] Nvidia, "Whitepaper NVIDIA Tegra K1: A new era in mobile computing," January 2014.

[34] F. Liu, Y. Liang, and L. Wang, "A survey of the heterogeneous computing platform and related technologies," DEStech Transactions on Engineering and Technology Research, 05 2017.

[35] R. Damodaran, T. Anderson, S. Agarwala, R. Venkatasubramanian, M. Gill, D. Gopalakrishnan, A. Hill, A. Chachad, D. Balasubramanian, N. Bhoria, J. Tran, D. Bui, M. Rahman, S. Moharil, M. Pierson, S. Mullinnix, H. Ong, D. Thompson, K. Gurram, O. Olorode, N. Mahmood, J. Flores, A. Rajagopal, S. Narnur, D. Wu, A. Hales, K. Peavy, and R. Sussman, "A 1.25GHz 0.8W C66x DSP core in 40nm CMOS," in 2012 25th International Conference on VLSI Design, pp. 286–291, Jan 2012.

[36] Texas Instruments, "TMS320C6745, TMS320C6747 fixed- and floating-point digital signal processor," June 2014.

[37] L. Codrescu, W. Anderson, S. Venkumanhanti, M. Zeng, E. Plondke, C. Koob, A. Ingle, C. Tabony, and R. Maule, "Hexagon DSP: An architecture optimized for mobile multimedia and communications," IEEE Micro, vol. 34, pp. 34–43, Mar 2014.

[38] Qualcomm, "Qualcomm Hexagon DSP," December 2017.

[39] M. Santarini, "Xilinx ships industry's first 20-nm All Programmable devices," Xcell Journal, pp. 9–15, 2014.

[40] J. A. Fisher, P. Faraboschi, and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools, 2005.

[41] Texas Instruments, "TMS320C67xx divide and square root floating-point functions," February 1999.

[42] J. P. Dawson, "IEEE 754 floating point arithmetic." https://github.com/dawsonjon/fpu, 2012.

Appendix A

LLVM Pass: libDynCountPass.so

This LLVM pass, written in C++, records the dynamic execution counts of IR instructions.

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/InstrTypes.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/CommandLine.h"
#include "llvm-c/Core.h"
// The names of the two standard headers below were lost in the PDF extraction;
// <string> and <vector> are assumed here.
#include <string>
#include <vector>

using namespace llvm;
using namespace cl;
using namespace std;

#define MAIN "main"

// Command-line options of the pass (eliminateOverhead defaults to false).
static cl::opt<bool> eliminateOverhead("eli",
    cl::desc("Eliminate the overhead caused by for.inc and for.cond"));
static cl::opt<bool> writeExcel("w2e",
    cl::desc("Write results to excel file"));
static cl::opt<std::string> functionInMain("fNameInMain",
    cl::desc("Benchmark function called in Main"));
static cl::opt<std::string> functionNotRecursive("fNameNotR",
    cl::desc("Benchmark function not recursive, e.g. FFT"));

namespace {
struct DynCount : public ModulePass {
  static char ID;
  DynCount() : ModulePass(ID) {}
  bool enterFunction = false;
  bool runOnModule(Module &M) override;
  bool runOnFunction(Function &F);
  bool InsertBasicInstCall(BasicBlock &B, Function &F, Constant *logFunc,
                           Constant *printFunc, Constant *write2Excel);
  bool InsertBenchmarkCall(Function::iterator blk, Function &F,
                           Constant *startBench, Constant *stopBench);
  int getTypeString(Type *type);
};
} // namespace

bool DynCount::runOnModule(Module &M) {
  for (Module::iterator fct = M.begin(), fct_end = M.end(); fct != fct_end; ++fct) {
    Function &F = *fct;
    runOnFunction(F);
  }
  return true; // the inserted calls modify the module
}

bool DynCount::runOnFunction(Function &F) {
  LLVMContext &Ctx = F.getContext();
  Module &M = *F.getParent();
  std::vector<Type *> paramTypes = {Type::getInt32Ty(Ctx), Type::getInt32Ty(Ctx)};
  Type *retType = Type::getVoidTy(Ctx);
  FunctionType *logFuncType = FunctionType::get(retType, paramTypes, false);
  FunctionType *printFuncType = FunctionType::get(retType, false);

  // Runtime hooks that the instrumented program links against.
  Constant *logFunc = M.getOrInsertFunction("logop", logFuncType);
  Constant *printFunc = M.getOrInsertFunction("printop", printFuncType);
  Constant *startBench = M.getOrInsertFunction("startBench", printFuncType);
  Constant *stopBench = M.getOrInsertFunction("stopBench", printFuncType);
  Constant *write2Excel = M.getOrInsertFunction("write2Excel", printFuncType);

  for (Function::iterator blk = F.begin(), blk_end = F.end(); blk != blk_end; ++blk) {
    BasicBlock &B = *blk;
    StringRef Fname = F.getName();
    StringRef BBname = B.getName();

    // Eliminate two types of blocks that cause loop overhead.
    // string.find("xx") returns 0 when the name starts with "xx".
    if (!BBname.find("for.cond") || !BBname.find("for.inc")) {
      if (!eliminateOverhead)
        InsertBasicInstCall(B, F, logFunc, printFunc, write2Excel);
    } else {
      InsertBasicInstCall(B, F, logFunc, printFunc, write2Excel);
    }
    InsertBenchmarkCall(blk, F, startBench, stopBench);
  }

  return false;
}

bool DynCount::InsertBenchmarkCall(Function::iterator blk, Function &F,
                                   Constant *startBench, Constant *stopBench) {
  StringRef Fname = F.getName();
  BasicBlock &B = *blk;
  IRBuilder<> builder(&B);

  for (BasicBlock::iterator ist = B.begin(), end = B.end(); ist != end; ++ist) {
    Instruction &I = *ist;

    if (!functionInMain.empty()) {
      /* main() {
           startBench();
           call MeasuredFunctionName;
           stopBench();
         } */
      if (Fname == MAIN) {
        if (isa<CallInst>(I)) {
          StringRef Callname = cast<CallInst>(I).getCalledFunction()->getName();
          if (Callname == functionInMain) {
            builder.SetInsertPoint(&I);
            builder.CreateCall(startBench);
            BasicBlock::iterator it = ist;
            ++it;
            builder.SetInsertPoint(&*it);
            builder.CreateCall(stopBench);
          }
        }
      }
    } else {
      // Measure any function that is not recursive.
      /* MeasuredFunctionName() {
           startBench();
           .....
           stopBench();
           ret
         } */
      if (Fname == functionNotRecursive) {
        if (blk == F.begin() && ist == B.begin()) {
          builder.SetInsertPoint(&I);
          builder.CreateCall(startBench);
        } else if (I.getOpcode() == 1) { // opcode 1 is ret
          builder.SetInsertPoint(&I);
          builder.CreateCall(stopBench);
        }
      }
    }
  }
  return true;
}

bool DynCount::InsertBasicInstCall(BasicBlock &B, Function &F, Constant *logFunc,
                                   Constant *printFunc, Constant *write2Excel) {
  int phiCount = 0;

  for (BasicBlock::iterator ist = B.begin(), end = B.end(); ist != end; ++ist) {
    Instruction &I = *ist;

    StringRef Fname = F.getName();
    int opcode = I.getOpcode();
    Type *Itype = I.getType();

    // I.dump();
    Value *Vopcode, *VdataType, *Vopcodee, *VdataTypee;
    IRBuilder<> builder(&B);

    // phi nodes must be the first instructions of a basic block
    if (isa<PHINode>(&I)) {
      phiCount++;
    } else {
      builder.SetInsertPoint(&I);
    }

    VdataType = ConstantInt::get(Type::getInt32Ty(F.getContext()), 0);
    Vopcode = ConstantInt::get(Type::getInt32Ty(F.getContext()), opcode);

    // The template argument of this isa<> check was lost in the PDF extraction;
    // BinaryOperator is assumed from the surrounding comment and example.
    if (isa<BinaryOperator>(&I)) {
      /* If the detected data type is a vector, construct the function call
         #(numbElements) times with the element type, i.e. split vector
         operations. For instance:

           call void @logop(i32 12, i32 3)
           call void @logop(i32 12, i32 3)
           call void @logop(i32 12, i32 14)
           %121 = fadd <2 x double> %wide.load435, %wide.load439
       */
      if (Itype->isVectorTy()) {
        // I.dump();
        VectorType *vecType = static_cast<VectorType *>(Itype);
        int numbElements = vecType->getNumElements();
        VdataType = ConstantInt::get(Type::getInt32Ty(F.getContext()),
                                     getTypeString(vecType->getElementType()));
        Value *argss[] = {Vopcode, VdataType};
        for (int i = 0; i < numbElements; i++)
          builder.CreateCall(logFunc, argss);
      }
      VdataType = ConstantInt::get(Type::getInt32Ty(F.getContext()),
                                   getTypeString(Itype));
    }

    Value *args[] = {Vopcode, VdataType};

    // in case of several phi nodes
    if (ist == B.getFirstNonPHI()->getIterator() && phiCount) {
      // opcode of phi nodes
      Vopcodee = ConstantInt::get(Type::getInt32Ty(F.getContext()), 53);
      VdataTypee = ConstantInt::get(Type::getInt32Ty(F.getContext()), 0);
      Value *argss[] = {Vopcodee, VdataTypee};
      for (int i = 0; i < phiCount; i++)
        builder.CreateCall(logFunc, argss);
    }

    // Including the ret after the call stopping the benchmark. Phi nodes are
    // skipped here because they were already logged in bulk above (the
    // template argument was lost in extraction; PHINode is assumed).
    if (!isa<PHINode>(&I))
      builder.CreateCall(logFunc, args);
    // Check if it is a return instruction
    if (opcode == 1) {
      IRBuilder<> builder(&B);
      builder.SetInsertPoint(&I);
      if (Fname == MAIN) {
        if (writeExcel) {
          builder.CreateCall(write2Excel);
        } else {
          builder.CreateCall(printFunc);
        }
      }
    }
  }
  return true;
}

int DynCount::getTypeString(Type *type) {
  /* DataType:
     1: half;  2: float;  3: double;  4: 80-bit floating point
     5: 128-bit floating point type (112-bit mantissa)
     6: 128-bit floating point type (two 64-bits, PowerPC)
     7: int1;  8: int8;  9: int16;  10: int32
     11: int64;  12: int128;
     13: pointers;  14: vector;  15: array */
  switch (type->getTypeID()) {
  case Type::IntegerTyID: {
    IntegerType *intType = static_cast<IntegerType *>(type);
    int bitWidth = intType->getBitWidth();
    switch (bitWidth) {
    case 1:   return 7;  // int1
    case 8:   return 8;  // int8
    case 16:  return 9;  // int16
    case 32:  return 10; // int32
    case 64:  return 11; // int64
    case 128: return 12; // int128
    }
  }
  case Type::HalfTyID:      return 1;  // half
  case Type::FloatTyID:     return 2;  // float
  case Type::DoubleTyID:    return 3;  // double
  case Type::X86_FP80TyID:  return 4;  // 80-bit FP
  case Type::FP128TyID:     return 5;  // 128-bit FP type (112-bit mantissa)
  case Type::PPC_FP128TyID: return 6;  // 128-bit FP type (two 64-bits, PowerPC)
  case Type::PointerTyID:   return 13; // pointer
  case Type::VectorTyID:    return 14; // vector
  case Type::ArrayTyID:     return 15; // array
  default:                  return 0;
  }
}

char DynCount::ID = 0;

// Automatically register and enable the pass.
static RegisterPass<DynCount> X("dyncount", "Dynamic instruction count",
                                false /* Only looks at CFG */,
                                false /* Analysis Pass */);
static void registerDynCount(const PassManagerBuilder &,
                             legacy::PassManagerBase &PM) {
  PM.add(new DynCount());
}
static RegisterStandardPasses
    RegisterMyPass(PassManagerBuilder::EP_EarlyAsPossible, registerDynCount);

Listing A.1: IR interpreter written in C++
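The instrumented program must be linked against a small runtime that implements the hooks declared by the pass: logop(opcode, dataType), printop(), startBench(), stopBench(), and write2Excel(). That runtime is not reproduced in this appendix; the following is a minimal sketch of what it could look like, written purely as an illustration. The table dimensions, the CSV output format, and the file name dyncount.csv are assumptions, not the implementation used for the measurements in this thesis.

#include <cstdio>

// Counter table indexed by [LLVM opcode][data-type id from getTypeString()].
// 128 opcodes and 16 type ids are assumed to be sufficient.
static unsigned long long counts[128][16];
static bool counting = false;

extern "C" void startBench() { counting = true; }
extern "C" void stopBench()  { counting = false; }

// Called by the instrumentation for every executed IR instruction.
extern "C" void logop(int opcode, int dataType) {
  if (counting && opcode >= 0 && opcode < 128 && dataType >= 0 && dataType < 16)
    counts[opcode][dataType]++;
}

// Dump all non-zero counters to stdout.
extern "C" void printop() {
  for (int op = 0; op < 128; op++)
    for (int ty = 0; ty < 16; ty++)
      if (counts[op][ty])
        std::printf("opcode %d, type %d: %llu\n", op, ty, counts[op][ty]);
}

// Same data, written as a CSV file that a spreadsheet can open.
extern "C" void write2Excel() {
  std::FILE *f = std::fopen("dyncount.csv", "w"); // file name is an assumption
  if (!f) return;
  std::fprintf(f, "opcode,type,count\n");
  for (int op = 0; op < 128; op++)
    for (int ty = 0; ty < 16; ty++)
      if (counts[op][ty])
        std::fprintf(f, "%d,%d,%llu\n", op, ty, counts[op][ty]);
  std::fclose(f);
}

Since the pass registers itself under the name dyncount (see the RegisterPass call above), it can be loaded into opt with -load libDynCountPass.so and enabled with -dyncount; the eli, w2e, fNameInMain, and fNameNotR options listed at the top of the pass then select the measured region and the output format.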

Appendix B

Operations in Verilog For Synthesis

Listing B.1 shows the Verilog code of the integer binary operations [24]. An IEEE 754 floating-point library in Verilog is available online [42].

module sub_int (y, a, b);
  parameter wA = 32, wB = 32;
  input  [wA-1:0] a;
  input  [wB-1:0] b;
  output [wA:0]   y;
  assign y = a - b;
endmodule

module add_int (y, a, b);
  parameter wA = 64, wB = 64;
  input  [wA-1:0] a;
  input  [wB-1:0] b;
  output [wA+1:0] y;
  assign y = a + b;
endmodule

module unsigned_multiply (y, a, b);
  parameter wA = 64, wB = 64;
  input  [wA-1:0]    a;
  input  [wB-1:0]    b;
  output [wA+wB-1:0] y;
  assign y = a * b;
endmodule

module signed_multiply (y, a, b);
  parameter wA = 128, wB = 128;
  input  signed [wA-1:0]    a;
  input  signed [wB-1:0]    b;
  output signed [wA+wB-1:0] y;
  assign y = a * b;
endmodule

module unsigned_divide (y, a, b);
  parameter wA = 32, wB = 32;
  input  [wA-1:0] a;
  input  [wB-1:0] b;
  output [wA-1:0] y;
  assign y = a / b;
endmodule

module signed_divide (y, a, b);
  parameter wA = 32, wB = 32;
  input  signed [wA-1:0] a;
  input  signed [wB-1:0] b;
  output signed [wA-1:0] y;
  assign y = a / b;
endmodule

module unsigned_modulus (y, a, b);
  parameter wA = 16, wB = 16;
  input  [wA-1:0] a;
  input  [wB-1:0] b;
  output [wB-1:0] y;
  assign y = a % b;
endmodule

module signed_modulus (y, a, b);
  parameter wA = 64, wB = 64;
  input  signed [wA-1:0] a;
  input  signed [wB-1:0] b;
  output signed [wB-1:0] y;
  assign y = a % b;
endmodule

Listing B.1: Operations in Verilog used for synthesis

Appendix C

Measurement Results

Table C.1 and Table C.2 list all results in terms of flexibility, performance, parallelism, and area and energy efficiency. The leftmost column corresponds to the No. column in Table 6.1. A "-" indicates that the corresponding value could not be obtained because no transistor count or power value is available.

Table C.1: Measurement results of all involved processors (without optimizations)

No.  Processor    Flexibility  Performance  Area Efficiency  Energy Efficiency  Parallelism
1    GTX TITAN    0.17         1.14E+15     1.60E+05         4.55E+12           1.36E+06
2    GTX 570      0.18         1.01E+15     3.37E+05         4.62E+12           1.38E+06
3    GTX 750 Ti   0.24         1.08E+15     5.78E+05         1.80E+13           1.06E+06
4    Tegra K1     0.25         2.98E+13     -                2.13E+12           3.14E+04
5    Pentium 4    0.37         9.08E+13     5.37E+05         1.08E+12           2.67E+04
6    i7 920       0.40         8.24E+13     1.13E+05         6.34E+11           3.08E+04
7    i7 950       0.37         1.02E+14     1.40E+05         7.85E+11           3.32E+04
8    i7 960       0.51         1.42E+14     1.94E+05         1.09E+12           4.43E+04
9    i7 4770      0.47         1.05E+14     7.48E+04         1.25E+12           3.08E+04
10   i7 6700      0.53         7.54E+13     4.31E+04         1.18E+12           2.22E+04
11   ARM1176      0.33         8.57E+11     -                2.95E+11           1.22E+03
12   Cortex A9    0.36         9.34E+12     -                2.33E+12           5.49E+03
13   Cortex A15   0.30         1.50E+13     -                3.00E+12           6.53E+03
14   Cortex A53   0.38         4.82E+12     -                1.10E+12           3.44E+03
15   TI C6747     0.44         9.75E+11     4.43E+04         2.87E+12           3.25E+03
16   Hexagon v5   0.60         8.62E+11     -                -                  1.44E+03
17   Hexagon v6   0.56         1.90E+12     -                -                  9.51E+02
-    Artix7       0.73         1.42E+12     1.39E+03         1.01E+12           1.22E+04
-    Kintex7      0.70         1.69E+12     7.14E+02         9.95E+11           1.41E+04
-    Virtex7      0.70         1.67E+12     1.73E+02         7.28E+11           1.41E+04
-    Zynq         0.74         1.53E+12     6.97E+02         9.03E+11           1.30E+04
-    Virtexuplus  0.71         9.77E+11     6.98E+01         2.38E+11           8.28E+03
-    Kintexu      0.70         9.66E+11     1.82E+02         4.60E+11           8.05E+03
-    Zynquplus    0.72         9.63E+11     2.35E+02         4.59E+11           8.28E+03


Table C.2: Measurement results of all involved processors (with optimizations)

No.  Processor    Flexibility  Performance  Area Efficiency  Energy Efficiency  Parallelism
5    Pentium 4    0.36         9.57E+13     5.66E+05         1.14E+12           2.81E+04
6    i7 920       0.37         2.66E+14     3.64E+05         2.05E+12           9.96E+04
7    i7 950       0.34         2.45E+14     3.36E+05         1.89E+12           8.00E+04
8    i7 960       0.46         4.39E+14     6.01E+05         3.38E+12           1.37E+05
9    i7 4770      0.30         1.80E+14     1.29E+05         2.15E+12           5.30E+04
10   i7 6700      0.36         2.10E+14     1.20E+05         3.27E+12           6.16E+04
12   Cortex A9    0.33         2.20E+13     -                5.49E+12           1.29E+04
13   Cortex A15   0.30         3.38E+13     -                6.77E+12           1.47E+04
14   Cortex A53   0.36         1.31E+13     -                2.97E+12           9.32E+03
18   Hexagon v6   0.76         1.85E+12     -                -                  9.23E+02
19   Hexagon v5   0.76         1.48E+12     -                -                  2.47E+03
-    Artix7       0.20         4.58E+12     4.46E+03         2.29E+12           4.50E+04
-    Kintex7      0.19         8.18E+12     3.45E+03         3.27E+12           8.11E+04
-    Virtex7      0.18         7.66E+12     7.89E+02         2.55E+12           7.49E+04
-    Zynq         0.19         7.43E+12     3.38E+03         2.86E+12           7.78E+04
-    Virtexuplus  0.13         9.86E+12     7.04E+02         1.79E+12           9.85E+04
-    Kintexu      0.13         9.57E+12     1.81E+03         2.81E+12           9.77E+04
-    Zynquplus    0.19         4.60E+12     1.12E+03         1.64E+12           4.68E+04
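The positive relation between area efficiency and energy efficiency reported in the conclusions can be spot-checked directly from Table C.1. The sketch below was written for this appendix as an illustration and is not part of the thesis tooling; it hard-codes the (area efficiency, energy efficiency) pairs of the processors in Table C.1 for which both values are available and computes the Pearson correlation of their base-10 logarithms.

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  // (area efficiency, energy efficiency) pairs from Table C.1 for the
  // processors where both values are available.
  std::vector<std::pair<double, double>> rows = {
      {1.60e5, 4.55e12}, {3.37e5, 4.62e12}, {5.78e5, 1.80e13},  // GPUs
      {5.37e5, 1.08e12}, {1.13e5, 6.34e11}, {1.40e5, 7.85e11},  // CPUs
      {1.94e5, 1.09e12}, {7.48e4, 1.25e12}, {4.31e4, 1.18e12},
      {4.43e4, 2.87e12},                                        // DSP
      {1.39e3, 1.01e12}, {7.14e2, 9.95e11}, {1.73e2, 7.28e11},  // FPGAs
      {6.97e2, 9.03e11}, {6.98e1, 2.38e11}, {1.82e2, 4.60e11},
      {2.35e2, 4.59e11}};

  // Pearson correlation of log10(area efficiency) vs log10(energy efficiency).
  double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
  const double n = rows.size();
  for (auto &r : rows) {
    double x = std::log10(r.first), y = std::log10(r.second);
    sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y;
  }
  double cov = sxy / n - (sx / n) * (sy / n);
  double vx = sxx / n - (sx / n) * (sx / n);
  double vy = syy / n - (sy / n) * (sy / n);
  std::printf("correlation = %.2f\n", cov / std::sqrt(vx * vy));
  return 0;
}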
