Performance and Energy-Efficiency Modelling for Multi-Core Processors

Diogo Augusto Pereira Marques

Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering

Supervisors: Doctor Aleksandar Ilic, Doctor Leonel Augusto Pires Seabra de Sousa

Examination Committee
Chairperson: Doctor António Manuel Raminhos Cordeiro Grilo
Supervisor: Doctor Aleksandar Ilic
Member of the Committee: Doctor João Pedro Faria Mendonça Barreto

November 2017

Acknowledgments

I would like to thank my supervisors, Dr. Leonel Sousa and Dr. Aleksandar Ilic, for their support and guidance throughout this Thesis, where their helpful insights were valuable for the development of the work presented here. Furthermore, I would like to thank INESC-ID for providing me with the tools and infrastructure that allowed me to conclude this work. I would also like to thank all the co-authors of my publications and the people from Intel Corporation that helped me in developing this Thesis, namely Roman Belenov, Philippe Thierry, Zakhar A. Matveev, Ahmad Yasin and Jawad Haj-Yahya. I would also like to thank all my friends for helping me through the entire course at IST, and especially my girlfriend, Raynara Silva, for all the support given during this Thesis, pushing me to always do my best. Finally, special thanks to my family for all their support in my academic course, especially my mother and father for their sacrifices and hard work that allowed me to attend a course at IST and carry out this Thesis.

Abstract

In recent years, the increasing computational needs of modern applications brought an increase in the complexity of multi-core processor architectures. Hence, deeply understanding the factors that have a major impact on the performance, power consumption and efficiency of those platforms has become a difficult task. Thus, it is not trivial to guarantee the best execution efficiency for applications on multi-core processors. Given this challenge, insightful tools capable of relating the application requirements with the processor capabilities, such as the Cache-Aware Roofline Model and the Original Roofline Model, are very valuable for programmers, mostly during the prototyping and design stages of the applications. However, the simplicity of these tools brings up certain limitations when characterizing the behavior of real-world applications and determining their execution bottlenecks. To address these limitations, this Thesis proposes a set of Cache-Aware Roofline Model extensions that increase the model insightfulness and usability, in order to provide more accurate hints regarding application optimization. To validate the proposed extensions and methodologies, a set of applications from standard benchmark suites is characterized on the Intel Skylake 6700K, correlating their behavior with the different computational capabilities of the processor and providing primary hints about their main bottlenecks. Moreover, the insights derived from the models proposed in this Thesis made it possible to increase the performance of an application kernel by up to 6.43×, when compared to its unoptimized version, demonstrating the model's usability when optimizing the execution of real applications.

Keywords: Performance, Power consumption, Efficiency, Processor capabilities, Insightful tools, Cache-Aware Roofline Model.

Resumo

Over the last years, the increase in the computational needs of modern applications has caused an increase in the complexity of multi-core processor architectures. Hence, understanding which factors have the greatest influence on the performance, power consumption and efficiency of these platforms has become a great challenge. Therefore, it is not trivial to guarantee the best execution efficiency for applications on multi-core processors. Given this challenge, proficient tools capable of relating, in a simple and fast way, the requirements of an application with the capabilities of the processor, such as the Cache-Aware Roofline Model and the Original Roofline Model, are a valuable asset for programmers, mainly during the prototyping and design of applications. However, the simplicity of these tools entails some limitations when characterizing real applications and determining their bottlenecks. In order to overcome these limitations, this Thesis proposes a set of extensions that increase the capabilities of the Cache-Aware Roofline Model, in order to obtain more precise suggestions for application optimization. To validate the proposed extensions and the adopted methodologies, a set of applications from standard benchmark suites is characterized on the Intel Skylake 6700K, correlating the behavior of the applications with the different computational capabilities and identifying their main bottlenecks. In addition, the suggestions derived from the models proposed in this Thesis enabled a performance increase of an application kernel of up to about 6.43×, when compared with the original version, demonstrating the usability of the model in optimizing the execution of real applications.

Palavras-chave: Performance, Power consumption, Efficiency, Processor capabilities, Insightful tools, Cache-Aware Roofline Model.

Contents

Abstract
Resumo
List of Figures
List of Tables
List of Algorithms
List of Acronyms

1 Introduction
   1.1 Motivation
   1.2 Objectives
   1.3 Main contributions
   1.4 Outline

2 Background: Insightful modeling of modern multi-core processors
   2.1 Modern multi-core processors and performance analysis
      2.1.1 Intel Ivy Bridge and Skylake micro-architectures
      2.1.2 Top-Down method for application analysis and detection of execution bottlenecks
   2.2 State-of-the-art approaches for insightful modeling of multi-cores
      2.2.1 Performance Roofline Modeling
      2.2.2 Power, Energy and Energy-Efficiency Roofline Modeling
      2.2.3 Remarks on Original and Cache-aware Roofline principles
      2.2.4 State-of-the-art approaches on extending the usability of insightful models
   2.3 Open challenges in insightful modeling
   2.4 Summary

3 Reaching the architecture upper-bounds with micro-benchmarking
   3.1 Tool for fine-grain micro-architecture benchmarking
   3.2 Micro-architecture benchmarking
      3.2.1 Exploring the maximum compute performance
      3.2.2 Memory subsystem benchmarking
   3.3 Summary

4 Proposed insightful models: Construction and experimental validation
   4.1 Proposed Cache-Aware Roofline Model (CARM) extensions: Model construction
      4.1.1 State-of-the-art CARM construction
   4.2 Experimental validation of proposed CARM extensions
   4.3 Summary

5 Application characterization and optimization in the proposed insightful models
   5.1 Experimental setup
   5.2 Evaluation methodology
   5.3 Case Study: Toypush mini-application
      5.3.1 CARM-guided application optimization example
   5.4 Characterization of real-world applications in the proposed models
      5.4.1 Application characterization in the Single Precision (SP) Scalar LD CARM extension
      5.4.2 Application characterization in the DP Scalar 2LD/ST CARM extension
      5.4.3 Application characterization in the Double Precision (DP) Scalar LD CARM extension
   5.5 Summary

6 Conclusions and Future Works
   6.1 Future Works

References

List of Figures

2.1 Central Processing Unit (CPU) pipeline for a Skylake micro-architecture [1]
2.2 Memory subsystem for Intel micro-architectures
2.3 Connection between cache levels, GPU and system agent [1]
2.4 Top-Down Analysis hierarchy [2]
2.5 Original Roofline Model (ORM) and CARM memory traffic [3]
2.6 ORM and CARM [3, 4]
2.7 Performance Cache-Aware Roofline Model for an Intel 6700K quad-core processor (Skylake)
2.8 Intel Advisor Roofline: Performance characterization of Minighost loops
2.9 Original Roofline Models for energy-efficiency and power consumption
2.10 Power consumed by the processor Intel 3770K Ivy Bridge [5]
2.11 Power CARM Models for Intel 3770K Ivy Bridge [5]
2.12 Energy and Energy-Efficiency CARM for Intel 3770K Ivy Bridge
2.13 Application with different problem sizes in Intel 3770K Ivy Bridge

3.1 Benchmarking tool general layout
3.2 Floating Point (FP) units maximum performance using Advanced Vector Extensions (AVX) Single Instruction Multiple Data (SIMD) DP instructions
3.3 FP units maximum power consumption using AVX SIMD DP instructions
3.4 FP units performance using AVX SIMD DP instructions
3.5 FP units power consumption using AVX SIMD DP instructions
3.6 FP units performance and power consumption for different instruction set extensions in Intel Skylake 6700K (4 cores)
3.7 Top-Down Method for Fused Multiply-Add (FMA) AVX SIMD DP at nominal frequency
3.8 Memory subsystem bandwidth for LD AVX SIMD DP at nominal frequency
3.9 Memory subsystem power consumption for LD AVX SIMD DP at nominal frequency
3.10 Memory subsystem power consumption for LD AVX SIMD DP at nominal frequency
3.11 Memory ratios bandwidth for AVX SIMD DP at nominal frequency
3.12 Memory ratios power consumption for AVX SIMD DP at nominal frequency
3.13 FP units performance and power consumption for different instruction set extensions in Intel Skylake 6700K (4 cores)
3.14 Top-Down Method for 2LD/ST AVX SIMD DP at nominal frequency

4.1 Proposed CARM extensions for AVX DP FP instructions for Intel Skylake 6700K (4 cores, 2LD/ST)
4.2 Proposed CARM extensions for AVX LD and ST operations for Intel Skylake 6700K (4 cores)
4.3 Proposed CARM extensions for 2LD/ST ratio with Streaming SIMD Extensions (SSE) and Scalar DP instructions for Intel Skylake 6700K (4 cores)
4.4 AVX DP LD and 2LD/ST memory bandwidth evaluation and state-of-the-art CARM for Intel Skylake 6700K (4 cores)
4.5 Performance and power consumption LD AVX SIMD DP CARM validations for Intel Ivy Bridge 3770K (1 core)
4.6 Performance and power consumption LD AVX SIMD DP CARM validations for Intel Skylake 6700K (1 core)
4.7 Performance and power consumption 2LD/ST AVX SIMD DP CARM validations for Intel Ivy Bridge 3770K (1 core)
4.8 CARM for AVX SIMD DP at nominal frequency

5.1 Toypush instruction mix (5.1a) and Top-Down metrics (5.1b)
5.2 CARM characterization of main Toypush kernels in Intel Skylake 6700K
5.3 CARM model: Toypush optimization characterization in Intel Skylake 6700K
5.4 Instruction distribution and Top-Down analysis for SP Scalar LD applications
5.5 Application characterization within state-of-the-art CARM and proposed SP Scalar LD extension
5.6 Application characterization with SP Scalar LD COMPS CARM
5.7 Instruction distribution and Top-Down analysis for DP Scalar 2LD/ST applications
5.8 Application characterization within state-of-the-art CARM and proposed DP Scalar 2LD/ST extension
5.9 Application characterization with SP Scalar LD COMPS CARM
5.10 Power consumption characterization methodology
5.11 Batch 1: Instruction distribution and Top-Down analysis for DP Scalar LD applications
5.12 Batch 1: Application characterization within state-of-the-art CARM and proposed DP Scalar LD CARM extension
5.13 Application efficiency characterization with proposed DP Scalar LD CARM extension
5.14 Batch 2: Instruction distribution and Top-Down analysis for DP Scalar LD applications
5.15 Batch 2: Application characterization within state-of-the-art CARM and proposed DP Scalar LD CARM extension
5.16 Application characterization with DP Scalar LD INST CARM

List of Tables

2.1 State-of-the-art works
5.1 Performance and arithmetic intensity of Toypush kernels before and after optimization

List of Algorithms

1 Generic memory benchmark
2 Generic FP benchmark
3 Multiply and Add (MAD) DP AVX Benchmark for Intel Ivy Bridge
4 FMA DP AVX Benchmark for Intel Skylake
5 Generic CARM benchmark

List of Acronyms

AI Arithmetic Intensity

AMT Asynchronous Many-Task

AVX Advanced Vector Extensions

BPU Branch Prediction Unit

CARM Cache-Aware Roofline Model

CPU Central Processing Unit

DMI Direct Media Interface

DP Double Precision

DRAM Dynamic Random Access Memory

DSB Decoded Icache

ECM Execution-Cache-Memory Model

FMA Fused Multiply-Add

FP Floating Point

FPGA Field Programmable Gate Array

GPU Graphics Processing Unit

HLS High-Level Synthesis

HPC High Performance Computing

IDQ Instruction Decode Queue

LLC Last Level Cache

LSD Loop Stream Detector

MAD Multiply and Add

MSR Model Specific Register

MSROM Micro-Code Store Read-Only Memory

NUMA Non-Uniform Memory Access

OI Operational Intensity

ORM Original Roofline Model

PAPI Performance Application Programming Interface

PCIe Peripheral Component Interconnect express

PMU Performance Monitoring Unit

RAPL Running Average Power Limit

SIMD Single Instruction Multiple Data

SP Single Precision

SPEC Standard Performance Evaluation Corporation

SSE Streaming SIMD Extensions

TSC Time Stamp Counter

1. Introduction

In order to keep up with the growing computational needs of parallel applications, almost every feature of modern multi-core processors has undergone continuous improvements over the last years. However, this provoked an increase in the complexity of multi-cores, which may diversely impact their performance, power, and energy-efficiency [3]. Consequently, it is difficult to guarantee the best execution efficiency for the applications, since the evaluation of all possible application execution bottlenecks is far from being a trivial task, especially when different execution domains are involved (e.g., performance versus power consumption).

In this process, it is very important to relate the application requirements with the capabilities of the system where they are executed. Although cycle-accurate simulators and/or methods that use hardware counters to perform an extensive experimental evaluation (e.g., the Top-Down Method [2]) provide an in-depth characterization of the architecture/application capabilities, those environments are usually too complex and hard to develop. The alternative is to use simple, insightful and more intuitive approaches for modeling the micro-architecture upper-bounds for performance, power consumption and energy-efficiency [3, 5]. These models typically focus only on some micro-architecture features when describing the multi-core upper-bounds, hence they can smooth the work of programmers in the stages of design and prototyping. One of the most widely used insightful approaches is roofline modeling [6], which considers the maximum capabilities of specific functional units (usually, double precision Floating Point (FP) units) and of the memory hierarchy (in terms of bandwidth). Due to their simplicity, roofline models are commonly used to detect the main bottlenecks in the application execution and to provide useful optimization guidelines. As a result, these models can quantify the potential of applications to reach the micro-architecture maximums (rooflines).

1.1 Motivation

In past decades, technological improvements and micro-architectural innovation led to an exponential performance increase of processor architectures. This performance growth is typically coupled with an increase in the overall system complexity due to the introduced micro-architecture and system-wide enhancements, e.g., a higher number of cores with advanced pipeline functionalities and a memory hierarchy with deeper and diversified levels [7, 8]. Besides the constant improvements in the architectural features, the number of transistors on chip was also increasing with each new processor micro-architecture, according to Moore's Law (the doubling of transistors on chip every 18 months) [9]. However, the current technology is reaching its limit and the occurrence of dark silicon, i.e., the impossibility of having all transistors operating at the same time, turned into a huge obstacle for performance, power and energy-efficiency scaling in modern multi-core processors [9]. Due to this phenomenon, different approaches to build more energy-efficient architectures are needed, and they are even moving towards highly heterogeneous designs [8].

The increased complexity and diversity of contemporary architectures impose significant challenges to ensure efficient execution and performance portability of real-world and commonly used applications. These challenges mainly arise from the difficulty to fine-tune the application execution on a given computing platform in respect to the capabilities of hardware resources and their ability to satisfy the application-specific characteristics and demands. Therefore, relating the application run-time behavior with the capabilities of the underlying hardware resources is crucial for identifying the potential execution bottlenecks (arising from the inefficient use of system resources), as well as for assessing the application potential to fully exploit the hardware capabilities.

Besides these challenges at the computer architecture level, the inherent complexity of modern parallel applications imposes an additional burden when optimizing their execution and improving performance on general-purpose hardware. In particular, a real-world application may involve a high degree of execution heterogeneity via the inclusion of several execution phases, each exercising a specific set of hardware resources. Hence, different application phases may experience different execution bottlenecks from the hardware perspective, thus requiring a set of different optimization techniques to be applied. For example, the performance of application phases bound by the inefficient use of the memory hierarchy can be improved by optimizing the memory access pattern and cache utilization, while code vectorization can be applied to boost the performance of phases that do not fully exploit the capability of the functional units. In real-world scenarios, even for a single application phase, it might be necessary to simultaneously or iteratively apply several different techniques until reaching the desired performance.

In such a broad optimization space, evaluating the benefits and trade-offs across a set of different solutions and optimization goals is far from being a trivial task. To identify the most appropriate implementations, optimization techniques and main hardware execution bottlenecks, a range of complex execution scenarios must be considered (even at the level of a single core or a functional unit). Although architecture-specific testing and simulation environments can precisely model the functionality of architectures and applications, these environments are rather complex, hard to use and difficult to develop [10, 11]. However, for fast prototyping, simple and insightful performance models are essential and particularly useful for computer architects and application designers, since they provide the means to quickly assess and relate the main characteristics of the architectures and applications.

To this respect, the approaches based on roofline modeling are particularly useful, since they provide simple and intuitive ways to combine the application needs with the micro-architecture upper-bounds. In roofline modeling, there are two main approaches: the Original Roofline Model (ORM) [4] and the Cache-Aware Roofline Model (CARM) [3, 5]. The main difference between these two models is the way the memory traffic is perceived. The ORM only considers the data traffic between consecutive memory levels (usually, Dynamic Random Access Memory (DRAM)), while CARM considers the data traffic as seen by the core and through the entire memory hierarchy. On the other hand, both models consider the upper-bounds for memory and computation performance, and they allow evaluating how far the application is from exploiting the maximum processor capabilities. Besides this, these models are useful to determine whether the application is mostly memory or compute bound. Due to its usefulness, roofline modeling is the starting point of several works. ORM is used in the analysis of Non-Uniform Memory Access (NUMA) systems [12–14], for the analysis of application execution bottlenecks [15] and for application in Asynchronous Many-Task (AMT) runtimes [16]. Regarding CARM, it was used to characterize real applications [17], and several tools to facilitate its analysis were developed, such as in [18]. Moreover, it resulted in a collaboration with Intel Corporation, and CARM is integrated as a fully supported feature in Intel Advisor 2017, contained in Intel® Parallel Studio XE 2017 [19]. However, due to their simplicity, these models do not take into account all important micro-architecture features, which might be relevant for the characterization of certain workloads. For example, real applications have

different instruction mixes and, since these models only consider FP operations, some application characteristics cannot be analyzed. Moreover, since the models assume that computations and data transfers completely overlap in time, for applications where data transfer time and compute time are mainly sequential, the results might not be the most accurate, especially regarding power consumption. Finally, despite providing a set of high-level information, in certain scenarios it is not possible to distinguish the actual execution bottleneck for some applications, e.g., to quantify the impact of the bandwidth between the memory levels in the performance CARM.

In order to tackle those issues, the main objective of this Thesis is to extend the insightfulness of existing roofline models, by proposing a set of novel methods and extensions to overcome CARM limitations while maintaining the model simplicity, e.g., by including additional information regarding the potential execution bottlenecks, such as the bandwidth between memory levels and the impact of different instruction mixes. In order to accomplish this objective, it is fundamental to find a trade-off between the level of detail of the modeling and how easily it can be understood by the user. In fact, the outcomes of this Thesis are expected to be used as a starting point for further improvements of the CARM implementation in Intel Advisor.

1.2 Objectives

In order to overcome the previously mentioned limitations, this Thesis has the following objectives:

• Benchmarking real applications, by using hardware counters to characterize the application behavior on real platforms and by analyzing the application source codes to correlate their instruction mix, followed by tests on state-of-the-art platforms/models, with the purpose of acquiring additional information regarding application execution bottlenecks.

• Proposing a set of novel insightful methodologies and extensions to more precisely characterize the application behavior and execution bottlenecks in the performance, power consumption and energy-efficiency CARMs.

• Benchmarking the Intel Skylake 6700K and Intel Ivy Bridge 3770K micro-architectures, through the development of benchmarks and by using hardware instrumentation tools, to characterize the upper-bounds of the architectures for different scenarios, e.g., different instruction mixes, different load/store ratios, etc.

• Validating the proposed CARM extensions for performance and power consumption in Intel Ivy Bridge 3770K and Intel Skylake 6700K, by focusing on different micro-architecture capabilities, e.g., different load/store ratios and the utilization of different arithmetic units.

• Providing a set of suggestions and improvements for the current implementation of the Intel Advisor CARM, in order to further improve its insightfulness.

1.3 Main contributions

In this Thesis, a set of application-centric micro-architecture insightful models for performance, power consumption and energy-efficiency is proposed, aiming at improving the insightfulness of the state-of-the-art models in order to cover a wide range of execution scenarios from both the micro-architecture and application perspectives. The proposed models explicitly consider the impact of different instruction types, ratios of memory operations and instruction set extensions on the realistically attainable upper-bounds of modern multi-core processors. In addition, a set of novel and redefined general roofline models is also investigated in order to provide a more precise characterization of the potential execution bottlenecks for applications whose execution is not necessarily dominated by FP operations. As such, these models provide a foundation to derive more general insightful micro-architecture models based on the fundamental roofline modeling principles.

In order to fully demonstrate the usefulness and insightfulness of the proposed models, an extensive experimental validation, evaluation and analysis are performed on real hardware platforms (equipped with the quad-core Intel Skylake 6700K and Ivy Bridge 3770K processors) and real-world applications, including a set of FP benchmarks from the Standard Performance Evaluation Corporation (SPEC) suite. For these applications, the information extracted from the proposed CARMs is also compared with the state-of-the-art CARM implementation, demonstrating their ability to provide an application characterization of higher accuracy. In particular, by following the optimization guidelines given by the proposed models, the performance of several application kernels was improved by up to 6.43 times when compared to their unoptimized versions.

The initial outcomes and research achievements from this Thesis were communicated at the HPCS 2017 international conference with the following contributions:

• Diogo Marques, Helder Duarte, Aleksandar Ilic, Leonel Sousa, Roman Belenov, Philippe Thierry and Zakhar A. Matveev. "Performance Analysis with Cache-Aware Roofline Model in Intel Advisor", In Proceedings of the International Conference on High Performance Computing & Simulation (HPCS'17), Genoa, Italy, July 2017. (paper in collaboration with Intel Corporation);

• Diogo Marques, Helder Duarte, Leonel Sousa, and Aleksandar Ilic. "Analyzing Performance of Multi-cores and Applications with Cache-aware Roofline Model", In Special Session on High Performance Computing for Application Benchmarking and Optimization (HPBench'17), co-located with the International Conference on High Performance Computing & Simulation (HPCS'17), Genoa, Italy, July 2017. (extended abstract)

In addition, the outcomes and experimental evaluations from this Thesis were also presented in several tutorials and invited talks at international conferences, such as SC'17, HiPEAC'17, HPCS'17, HPBench'17 and NESUS'17. Furthermore, the presented methodology for the experimental assessment of the micro-architecture upper-bound capabilities (in particular, for the attainable bandwidth of different levels of the memory hierarchy) was indirectly used to improve the Intel Advisor CARM implementation (in the alpha and beta development stages of the tool).

1.4 Outline

This document is structured as follows:

• Chapter 2 - Background: Insightful modeling of multi-core processors: This chapter presents a summary of the state-of-the-art. It starts by briefly explaining the Skylake micro-architecture, which is the multi-core processor used for the developments in this Thesis. Besides, the Top-Down method [2, 20] is also presented, to better relate its metrics with the Intel micro-architecture capabilities. Next, roofline modeling is introduced, where its features, approaches and open challenges are explained. Finally, a brief overview is provided regarding the state-of-the-art works, mainly based on roofline modeling and related to the application behavior in several systems (mostly in the Central Processing Unit (CPU)), by referring to their importance to this Thesis.

• Chapter 3 - Reaching the architecture upper-bounds with micro-benchmarking: This chapter presents all the steps performed to obtain the different micro-architecture upper-bounds. It starts by proposing the benchmarking tool and the structure of the designed benchmarks. Besides, the Intel Ivy Bridge 3770K and Intel Skylake 6700K capabilities are compared from the performance, power consumption and energy-efficiency points of view. Finally, Top-Down analysis is performed on the proposed benchmarks in order to assess their quality and accuracy.

• Chapter 4 - Proposed insightful models: Construction and experimental validation: In this chapter, several CARM instances reflecting different micro-architectural capabilities are proposed. In particular, a CARM model reflecting the maximum upper-bounds of the memory subsystem and FP units of Intel Skylake 6700K is constructed (state-of-the-art CARM). The insights provided by this model are compared with the characterization obtained with the extensions proposed in Chapter 5. Moreover, CARM validation on Intel Skylake 6700K and Intel Ivy Bridge 3770K is performed, using the tool presented in Chapter 3.

• Chapter 5 - Application characterization and optimization in the proposed insightful models: In this chapter, the insightfulness and usability of the proposed CARM extensions are demonstrated. To accomplish this task, a deep characterization of a set of real-world FP applications from the SPEC suite is performed in the proposed CARM instances, by taking into account the application instruction distribution and Top-Down analysis. Besides, the Toypush miniapp [21] is optimized based on the insights provided by the proposed CARM extensions. Finally, the state-of-the-art CARM insights are compared with the characterization of the proposed CARM extensions.

• Chapter 6 - Conclusions and Future Works: In this chapter, the conclusions obtained from the work performed in this Thesis are presented, as well as possible future work that can further increase the CARM insightfulness.

2. Background: Insightful modeling of modern multi-core processors

The research work in this Thesis is mainly focused on proposing novel methods and extensions to existing approaches for insightful modeling of multi-core processors, aiming to increase their accuracy and usability. With this aim, in this chapter, an in-depth overview of the main concepts and the required background information is provided, which is necessary to facilitate the understanding of the proposed solutions and their scientific contributions. For this purpose, a thorough overview of the architecture of modern multi-core processors is presented, as well as the state-of-the-art approaches for insightful modeling of their performance, power consumption and energy-efficiency. A special emphasis is also given to exposing the main challenges and open problems in this research area, which are specifically tackled in this Thesis.

To this respect, two different micro-architectures from the Intel processor family are firstly introduced (i.e., Intel Ivy Bridge [1] and the most recent Intel Skylake [7]), by providing an overview of their overall structure, pipeline functionality and memory hierarchy. Besides, one of the most relevant methods for performance analysis in multi-core processors, i.e., the Top-Down method [2], is introduced, correlating its metrics with the presented Intel micro-architectures. Furthermore, the most relevant approaches for insightful modeling of the performance, power consumption and energy-efficiency of multi-core processors are deeply examined, with a specific emphasis given to ORM [4] and CARM [3, 5]. This chapter showcases the usability of both ORM and CARM, and it also states a set of open challenges, in order to further improve their insightfulness. In addition, other state-of-the-art models and extensions to roofline modeling are analyzed, as well as the breakthrough methods for application characterization and analysis [1, 2].

2.1 Modern multi-core processors and performance analysis

The first multi-core processors appeared around 2001, with the IBM POWER4 processor [22], but only in 2005 did they outnumber single-core processors, with the release of the AMD Dual-Core Opteron and the Intel Pentium D. Since then, with the improvement of silicon technology, multi-core processors have increased their performance and computational capabilities, mostly following Moore's Law [9]. In particular, until 2016, Intel micro-architectures followed a tick-tock manufacturing model, where the "tock" represents the introduction of a new micro-architecture (or a substantial improvement over the previous one), while the next "tick" processor maintains the same micro-architecture with a reduced manufacturing technology. This strategy was followed since 2006, e.g., the Sandy Bridge architecture tock (32 nm) and the Ivy Bridge architecture tick (22 nm).

For the most recent Skylake micro-architecture (14 nm "tock", introduced in 2015), the "tick-tock" manufacturing and design model is officially discontinued, thus suggesting the potentially diminishing applicability of Moore's Law in the most recent micro-architectures. In turn, Intel now adopts the "process-architecture-optimization" model, where the first Skylake successor, i.e., the Kaby Lake micro-architecture (launched in 2017 for the desktop market), is still produced in 14 nm technology, thus not adhering to a reduction in the size of the transistors. It is also announced that Intel will introduce one more Skylake/Kaby Lake optimization step (Coffee Lake, expected in 2017), before releasing the 10 nm "tick", i.e., Cannonlake in 2018. As a result, Intel Skylake-based micro-architectures are expected to dominate the CPU market in the near future, at least for the next several years. For these reasons, in the scope of this Thesis, two Intel micro-architectures are considered, namely the Ivy Bridge and Skylake micro-architectures, whose fundamental aspects are subsequently analyzed. It is also worth noting that the choice of primarily focusing on Intel CPUs (versus other manufacturers, e.g., AMD) is motivated by their clear dominance in High Performance Computing (HPC) environments, e.g., more than 90% of the Top500 supercomputers rely on Intel devices and architectures.

Moreover, the Top-Down method [2] is a performance analysis methodology developed by Intel Corporation. It correlates application characteristics with processor capabilities, by identifying the main bottlenecks that limit performance. Since it is a comprehensive model, which evaluates a large number of possible bottlenecks that can limit application performance, the Top-Down method will be used in this Thesis to confirm the correctness of the insights provided by the proposed CARM extensions.

2.1.1 Intel Ivy Bridge and Skylake micro-architectures

Despite the introduced enhancements, different micro-architecture generations from the Intel Core processor family share a very similar basic pipeline structure. For the Intel Skylake micro-architecture, Figure 2.1 presents a high-level overview of the CPU core pipeline. The core pipeline can be divided in two main parts: the frontend (in-order execution) and the backend (out-of-order execution). The frontier between them is the Instruction Decode Queue (IDQ), which can hold up to 64 micro-operations (µops) and contains a Loop Stream Detector (LSD), able to detect loops of up to 64 µops [1].

In the frontend, the µops are delivered to the IDQ by three components: the Micro-Code Store Read-Only Memory (MSROM), the Decoded Icache (DSB) and the Legacy Decode Pipeline. The Legacy Decode Pipeline obtains the instructions from the L1 instruction cache and delivers up to 5 µops per cycle to the IDQ. The DSB is fed by the Legacy Decode Pipeline and it stores the latest fetched and decoded µops. As such, the DSB allows bypassing the Legacy Decode Pipeline for a set of recently decoded instructions, which is very useful for loop execution (i.e., in cases when the instructions are repeated in each loop iteration). The DSB can deliver up to 6 µops per cycle to the IDQ. Finally, the MSROM can issue a maximum of 4 µops per cycle and it is only used for instructions longer than 4 µops [1]. The instruction flow is controlled by the Branch Prediction Unit (BPU), which designates the next instruction to be forwarded to the IDQ either from the DSB or from the traditional decoding pipeline (i.e., by fetching the instruction from the L1 instruction cache and by decoding it in the Legacy Decode Pipeline).

From the IDQ, the µops enter the renamer block, where several execution steps can be performed, such as: binding the dispatch ports with execution resources; zero-idiom operations (to clear register contents to zero using common operations, e.g., XOR); one-idiom operations (to set all the register bits to 1 using common operations, e.g., CMPEQ); and zero-latency register move operations (to exchange the content between registers). As it can be observed, these operations are performed before the instruction scheduling stage, which reduces the scheduler's workload and complexity, thus resulting in overall performance improvements [1].

Figure 2.1: CPU pipeline for a Skylake micro-architecture [1].
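As an illustration of the rename-stage idioms mentioned above, the following minimal sketch (not taken from the Thesis code; compiler behavior assumed for GCC/Clang with AVX enabled) shows how a zero idiom typically reaches the renamer:

```c
#include <immintrin.h>

/* XORing a register with itself is a zero idiom: the result is known to be
   zero regardless of the previous register contents, so the renamer can
   resolve it without dispatching a uop to an execution port and without
   creating a dependency on the old register value. Compilers typically
   emit vxorpd/vxorps for this intrinsic. */
static inline __m256d fresh_accumulator(void) {
    return _mm256_setzero_pd();   /* usually compiled to vxorpd ymm,ymm,ymm */
}
```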

In the scheduler, µops are forwarded to the respective dispatch ports. In both micro-architectures (i.e., Ivy Bridge and Skylake), ports 0, 1 and 5 are mainly used for FP operations, while ports 2, 3 and 4 are dedicated to memory operations. In the Skylake micro-architecture, the additional ports 6 and 7 are introduced to provide further enhancements, mainly for integer arithmetic and memory operations, respectively. As a result, the Skylake processor can dispatch a ready µop to one of eight different ports for execution (versus six ports in Ivy Bridge). Both micro-architectures support Single Instruction Multiple Data (SIMD) instructions, e.g., Advanced Vector Extensions (AVX) and Streaming SIMD Extensions (SSE), as well as scalar instructions (e.g., ADD and MUL). In contrast to Ivy Bridge, the Skylake micro-architecture provides full support for AVX Double Precision (DP) FP Fused Multiply-Add (FMA) operations in two different execution ports (see Figure 2.1) [1]. In Ivy Bridge, however, AVX FMA instructions can only be replicated by simultaneously performing a multiply instruction followed by an addition in two different ports, which is referred to herein as the AVX FP Multiply and Add (MAD) operation. As a result, at the same operating frequency and for the same instruction set, the AVX FP throughput is effectively doubled in Skylake when compared to Ivy Bridge. In detail, in Intel Ivy Bridge, FP MUL and FP ADD are served by two different ports (ports 0 and 1, respectively), and these two instructions can be executed in the same clock cycle [1], i.e., one MAD per clock. Hence, when using AVX SIMD DP instructions, the AVX vector length allows performing 4 flops per instruction and, correspondingly, the Ivy Bridge micro-architecture can deliver up to 8 flops per cycle for AVX DP MAD operations. For example, at the nominal frequency (3.5 GHz), the Intel Ivy Bridge 3770K processor can deliver a maximum performance of 8×3.5=28 GFLOPS/s per core, i.e., 112 GFLOPS/s when all four cores are fully utilized (4×28). Regarding Intel Skylake, AVX DP FP FMA instructions are served by ports 0 and 1, and the respective functional units can deliver 8 flops per cycle (per port). Thus, Intel Skylake has a maximum throughput of 16 flops per cycle, i.e., double the maximum throughput of Intel Ivy Bridge. Hence, at the nominal frequency (4 GHz), the Intel Skylake 6700K processor can deliver a maximum performance of 16×4=64 GFLOPS/s per core, i.e., 256 GFLOPS/s for 4 cores, which corresponds to an increase of about 2.3 times when compared to the Intel Ivy Bridge 3770K processor.

Figure 2.2: Memory subsystem for Intel micro-architectures: (a) Intel Skylake micro-architecture; (b) Intel Ivy Bridge micro-architecture.

Figures 2.2a and 2.2b present the memory subsystem organization for the Intel Skylake and Ivy Bridge micro-architectures, respectively. In both architectures, ports 2 and 3 are reserved for load operations, while port 4 serves the store instructions from the core to the L1 data cache. Moreover, in the Skylake micro-architecture, there is an additional port 7, which is reserved for store address calculation, in order to provide full support for two loads and one store instruction (2LD+ST) per cycle per core. Furthermore, from Ivy Bridge to Skylake, the width of the connection lane between the ports and the L1 data cache was increased from 128 bits to 256 bits, i.e., from 16 bytes to 32 bytes per port (see Figure 2.2). As a result, Skylake supports a maximum theoretical throughput of 32bytes×3ports=96 bytes per cycle (per core), while Ivy Bridge can only deliver 16bytes×3ports=48 bytes per cycle (per core).

The memory subsystem of the Ivy Bridge and Skylake micro-architectures contains three cache levels (L1, L2 and L3) and DRAM. The L1 and L2 caches are private to each core, and their sizes are 32 KB and 256 KB per core, respectively. L3 and DRAM are shared between the cores and their sizes vary according to the processor model and system configuration. For example, in the scope of this Thesis, two different computing platforms were evaluated: i) an Ivy Bridge-based system with a quad-core Intel 3770K processor, 8 MB of L3 cache and 8 GB of DRAM; and ii) a Skylake-based platform with a quad-core Intel 6700K processor, 8 MB of L3 cache and 32 GB of DRAM. It is also worth noting that the L1 instruction and data caches are separated, while the L2 and L3 caches include both instructions and data [1].

The connection between the cores and the L3 cache is made through a ring interconnection to multiple slices of this memory level, as shown in Figure 2.3. This connection is a coherent bi-directional ring bus that delivers 32 bytes per cycle at each stop and connects three different parts of the chip: the cores and L3, the on-chip Graphics Processing Unit (GPU), and the system agent that includes the DRAM controller, Direct Media Interface (DMI) controller and Peripheral Component Interconnect express (PCIe) controller.

Figure 2.3: Connection between cache levels, GPU and system agent [1].

Finally, this micro-architecture also supports speculative data loads to one of the cache levels, using several hardware pre-fetching mechanisms, which can improve performance for codes dominated by sequentially ordered memory accesses.
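To make the above throughput figures concrete, the sketch below shows the general shape of an FMA throughput micro-kernel (a minimal illustration in C intrinsics, assuming GCC/Clang with -O2 -mfma; the actual benchmarks used in this Thesis are presented in Chapter 3 and Algorithm 4). Since the Skylake FMA units have a latency of several cycles at a throughput of two FMAs per cycle, multiple independent dependency chains are needed to keep ports 0 and 1 saturated:

```c
#include <immintrin.h>

/* Each _mm256_fmadd_pd performs 4 DP multiply-adds = 8 flops, so one loop
   iteration executes 8 x 8 = 64 flops. At 16 flops/cycle and 4 GHz, this
   kernel should approach 64 GFLOPS/s on a single Skylake 6700K core. */
double fma_throughput(long iters) {
    __m256d acc[8];                            /* 8 independent chains */
    for (int k = 0; k < 8; k++)
        acc[k] = _mm256_set1_pd(k + 1.0);
    const __m256d b = _mm256_set1_pd(1.0 + 1e-9);
    const __m256d c = _mm256_set1_pd(1e-9);
    for (long i = 0; i < iters; i++)
        for (int k = 0; k < 8; k++)            /* unrolled by the compiler */
            acc[k] = _mm256_fmadd_pd(acc[k], b, c);
    __m256d s = acc[0];                        /* reduce the accumulators  */
    for (int k = 1; k < 8; k++)                /* to defeat dead-code      */
        s = _mm256_add_pd(s, acc[k]);          /* elimination              */
    double out[4];
    _mm256_storeu_pd(out, s);
    return out[0] + out[1] + out[2] + out[3];
}
```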

2.1.2 Top-Down method for application analysis and detection of execution bottlenecks

Recently, a Top-Down method for counter-based application analysis was proposed in [1, 2], which represents a breakthrough approach to identify the different execution bottlenecks that limit application performance in modern out-of-order CPUs. It aims at solving the limitations of traditional methods that do not take into account several characteristics of modern CPUs, e.g., CPU stalls that overlap among different functional units, speculative execution and the effects of branch misprediction.

The Top-Down concept is based on a structured drill-down method that guides the user to the critical areas within the processor pipeline by relying on the CPU Performance Monitoring Unit (PMU) (in particular, on the hardware performance counters). In detail, the Top-Down method decouples the processor pipeline in a tree-like structure, where each node represents a potential execution bottleneck (in different parts of the CPU pipeline) and each node is attributed a specific weight to emphasize its relevance, as presented in Figure 2.4.

Figure 2.4: Top-Down Analysis hierarchy [2].

When applying the Top-Down methodology, it is possible to identify the predominant execution bottlenecks, since the Top-Down method reports the contribution of each of these pipeline parts to the overall application execution, i.e., how the different parts of the CPU pipeline are used by the application. As such, the component with the highest utilization in the Top-Down hierarchy can be considered the predominant limiting factor for the application execution. Typically, this analysis should be performed between the nodes at the same level of the hierarchy (since they refer to the same pipeline stage), starting from the top level. Afterwards, the nodes in the inner hierarchy should be examined only for the top nodes marked as the predominant sources of bottlenecks.

As previously referred, a modern out-of-order CPU engine is divided in two main parts: the frontend and the backend (see Figure 2.1). The former is responsible for fetching instructions and transforming them into micro-operations (decoding), while the latter is responsible for scheduling, executing and retiring the micro-operations. Therefore, in the Top-Down method, the pipeline analysis is divided in four categories: retiring, bad speculation, frontend bound and backend bound (see Figure 2.4).

Since the backend receives micro-operations from the frontend, an application is frontend bound when the backend is under-supplied. In this case, the application can be bound by the bandwidth or the latency of the frontend, where the former signals inefficiency in the fetch units, while the latter is directly connected with fetch starvation.

Bad speculation includes all the stalls originating from branch mispredictions and machine clears, i.e., when the entire CPU pipeline is cleared due to memory ordering violations, self-modifying code or loads that refer to illegal addresses. Thus, it includes stalls from two main situations: 1) the pipeline slots used to issue micro-operations that do not retire; and 2) the slots where the issue pipeline is blocked due to a misspeculation.

The retiring category takes into account the issued micro-operations that eventually get retired. The best-case scenario corresponds to a retiring of 100%, i.e., when the processor retires the maximum amount of micro-operations per cycle. However, this does not imply that the application is fully optimized. For example, a high retiring for a non-vectorized application can suggest possible improvements by introducing SIMD instructions in the code.

Lastly, the backend bound node is divided in core bound and memory bound parts. In the core bound part, a stall can occur due to execution starvation or suboptimal port utilization. On the other hand, the memory bound part includes execution stalls that occur while serving the data requests from the memory hierarchy (which can be further decoupled on a per memory level basis).

The Top-Down method was recently extended to provide the power consumption breakdown for different components in the CPU pipeline [20], by following an approach similar to the one used in the performance Top-Down method [2]. As a consequence, this power breakdown method allows decoupling the contribution of the frontend, backend and core to the overall power consumption. For this purpose, a set of hardware counters is used to correlate the performance metrics with the power consumption. Each counter is associated with a weight, obtained through a set of experimental tests performed on a specific micro-architecture and a micro-architecture simulator. However, the proposed power breakdown method currently covers only the single-core execution and a set of private caches in the memory subsystem (i.e., L1 and L2) [20].

Despite the model complexity (due to the high number of hardware counters), the Top-Down method allows deeply correlating the application behavior with the micro-architecture capabilities, as well as covering a wide range of potential sources of application execution bottlenecks from the micro-architecture perspective. For these reasons, the Top-Down method will be relied upon in this Thesis to complement the analysis and validation of the herein proposed roofline modeling approaches and extensions, as well as to confirm the efficiency and accuracy of the developed micro-benchmarks for the in-depth experimental evaluation of the micro-architecture capabilities.
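As a reference for how the four top-level categories are derived, the sketch below implements the level-1 breakdown following the formulas in [2]. The counter names in the comments refer to Intel PMU events (how their values are read is platform-specific and omitted here), and the four-slot issue width is assumed, as in the micro-architectures modeled in this Thesis:

```c
/* Level-1 Top-Down metrics, each expressed as a fraction of the total
   pipeline slots (4 slots per cycle on the modeled Intel cores). */
typedef struct {
    double slots_issued;     /* UOPS_ISSUED.ANY */
    double slots_retired;    /* UOPS_RETIRED.RETIRE_SLOTS */
    double fetch_bubbles;    /* IDQ_UOPS_NOT_DELIVERED.CORE */
    double recovery_cycles;  /* INT_MISC.RECOVERY_CYCLES */
    double clocks;           /* CPU_CLK_UNHALTED.THREAD */
} td_counters;

void td_level1(const td_counters *c, double *fe_bound, double *bad_spec,
               double *retiring, double *be_bound) {
    double slots = 4.0 * c->clocks;
    *fe_bound = c->fetch_bubbles / slots;
    *bad_spec = (c->slots_issued - c->slots_retired
                 + 4.0 * c->recovery_cycles) / slots;
    *retiring = c->slots_retired / slots;
    /* whatever fraction is left is attributed to the backend */
    *be_bound = 1.0 - (*fe_bound + *bad_spec + *retiring);
}
```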

2.2 State-of-the-art approaches for insightful modeling of multi-cores

The characterization and optimization process of applications can be a difficult task, due to the micro-architecture complexity and application heterogeneity. For a given algorithm and application implementation, determining the current execution bottlenecks is far from being a trivial job, since it requires relating the application characteristics/demands with the capability of the different subsystems in the processor pipeline. This is an especially challenging process when considering applications with a large diversity of instruction types in their instruction mixes, which can simultaneously exercise several different components of the micro-architecture, e.g., different levels of the memory hierarchy and/or functional units.

In these scenarios, approaches for insightful modeling of multi-core processors are valuable resources for computer architects and application developers, allowing to ease the characterization and optimization of applications through a fast analysis and an intuitive visual representation of the most relevant micro-architecture capabilities. In order to be insightful and general, the model cannot include too many micro-architectural details, as this would lead to a model that is too complex and/or architecture-specific. As such, a general insightful model needs to incorporate only the minimum set of architecture-related information in order to be able to provide important guidelines about the primary application execution bottlenecks. As a result, insightful modeling represents a trade-off between the level of detail (modeling accuracy) and model simplicity.

Roofline modeling is an insightful modeling method widely used in both academia and industry, which has already provided several contributions in micro-architecture and application analysis [6]. It represents an intuitive and insightful tool, which allows characterizing the application behavior in multi-core, many-core or accelerator processor architectures. It combines, in a single plot, the inherent hardware limitations and the application optimization potential, by modeling the architecture attainable upper-bounds for performance, power consumption, energy or efficiency [6]. Roofline modeling relies on the observation that memory operations and computations can be executed concurrently in modern out-of-order processors, thus the overall execution can be limited either by the time to perform the computations or by the time to perform the memory accesses. Hence, roofline models contain two distinct regions: the memory bound and the compute bound regions, which are useful to pinpoint the potential application execution bottlenecks [3, 4, 6].

In the existing literature, there are two main approaches for roofline modeling: the ORM [4] (also referred to as the Classic Roofline Model) and the recently proposed CARM [3, 5]. Both models relate intensity, i.e., the ratio between computations and memory traffic, with different metrics (performance, power, energy-efficiency), in order to facilitate the application characterization and provide important optimization guidelines.

2.2.1 Performance Roofline Modeling

In general, the performance roofline models relate intensity to FP performance and memory bandwidth (memory traffic). However, CARM [3, 5] and ORM [4] analyze the memory traffic differently. While ORM only observes the traffic between two specific memory levels (usually between the Last Level Cache (LLC) and DRAM), CARM considers the complete memory hierarchy by observing the memory traffic from the core point of view, as shown in Figure 2.5. Hence, CARM can represent in a single plot the realistically attainable bandwidth ($B_y$) of each memory level $y$, where $y \in \{L1, L2, \ldots, LLC, DRAM\}$. In addition, the throughput of the FP units is seen equally by both models and it is used to represent the peak compute performance of a given processor ($F_p$ in flops/s).

Since ORM and CARM observe the memory traffic differently, the intensity used in each of these models is also different. ORM introduces the term Operational Intensity (OI) to denote compute operations per byte of data traffic to/from a specific level of the memory hierarchy [4]. For example, the DRAM variant of ORM only observes the data traffic between the LLC and DRAM, i.e., bytes transferred to/from the DRAM, herein referred to as DRAMbytes. As such, the OI in the DRAM ORM is expressed in flops per DRAMbyte. On the other hand, CARM uses Arithmetic Intensity (AI), i.e., the ratio between compute operations and the total number of bytes originating from the instructions in the application code (regardless of the memory level where those requests are served in the memory hierarchy) [3]. As a consequence, the AI in CARM is expressed in flops per byte.

Figure 2.5: ORM and CARM memory traffic [3].
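To make the difference between AI and OI concrete, consider a simple triad-like kernel (an illustrative example, not one of the benchmarks used in this Thesis):

```c
/* 2 flops (one multiply, one add) and 3 x 8 = 24 bytes requested by the
   code per iteration, so the CARM AI is 2/24 = 1/12 flops/byte, regardless
   of where the accesses are actually served in the memory hierarchy. */
void triad(double *a, const double *b, const double *c, long n) {
    for (long i = 0; i < n; i++)
        a[i] = b[i] + 2.0 * c[i];
}
```

The ORM OI of the same kernel, in contrast, depends on the run-time DRAM traffic: if all three arrays stream to/from DRAM, the OI is also about 1/12 flops/DRAMbyte (ignoring write-allocate traffic), whereas if the working set fits in cache, the DRAM traffic tends to zero and the OI grows accordingly.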

In order to construct each of these models, it is necessary to take into account their respective approaches. Since both models see the FP unit throughput equally, the time to perform a given amount of flops ($\phi$) is expressed as $\phi/F_p$, corresponding to the time involved in computations ($T_c$). Regarding the time to perform memory transfers ($T_m$), it differs between CARM and ORM. In ORM, the time to transfer an amount of bytes served by DRAM ($\beta_D$, i.e., DRAMbytes), with DRAM bandwidth $B_D$, is given by $\beta_D/B_D$. Thus, the ORM application execution time is calculated by:

$$T(OI) = \max\{T_c, T_m\} = \phi \times \max\left\{\frac{1}{B_D \times OI}, \frac{1}{F_p}\right\}. \quad (2.1)$$

Hence, the maximum attainable performance of the architecture in the DRAM ORM, i.e., $F_a(OI)$, is defined as:

$$F_a(OI) = \frac{\phi}{T(OI)} = \min\left\{B_D \times OI, F_p\right\}. \quad (2.2)$$

On the other hand, from the CARM point of view, the time to transfer the amount of bytes ($\beta$), served by the memory level $y$ with bandwidth $B_y$, is given by $\beta/B_y$. Consequently, the CARM application execution time for level $y$ is expressed as $T_y(AI) = \phi \times \max\left\{\frac{1}{B_y \times AI}, \frac{1}{F_p}\right\}$. Hence, the CARM maximum attainable performance, $F_{a,y}(AI)$, is expressed as:

$$F_{a,y}(AI) = \min\left\{B_y \times AI, F_p\right\}. \quad (2.3)$$
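A minimal sketch of Equations (2.2) and (2.3) in code form is given below. The peak performance value is taken from Section 2.1.1, while the L1 bandwidth used here is the theoretical value derived there (96 bytes/cycle/core × 4 GHz × 4 cores), not a measured one (realistic, measured upper-bounds are obtained in Chapter 3):

```c
#include <math.h>
#include <stdio.h>

/* Attainable performance roofline: min{B x I, Fp}, as in Eqs. (2.2)/(2.3). */
static double attainable(double intensity, double bandwidth, double fp_peak) {
    return fmin(bandwidth * intensity, fp_peak);
}

int main(void) {
    double fp   = 256e9;    /* Skylake 6700K, 4 cores, AVX DP FMA [flops/s] */
    double b_l1 = 1536e9;   /* theoretical L1 bandwidth [bytes/s]           */
    /* The ridge point is the minimum intensity that reaches Fp. */
    printf("ridge point: %.4f flops/byte\n", fp / b_l1);       /* ~0.1667 */
    printf("Fa(AI=0.05): %.1f GFLOPS/s\n",
           attainable(0.05, b_l1, fp) / 1e9);                   /* 76.8   */
    return 0;
}
```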

Figure 2.6: ORM and CARM [3, 4]: (a) ORM in Intel 3770K Ivy Bridge; (b) CARM in Intel 3770K Ivy Bridge.

Figures 2.6a and 2.6b present ORM and CARM models, respectively, for Intel Ivy Bridge 3770K, with three cache levels and DRAM. Both models are plotted for the DP FP AVX instructions, with the intensity in x-axis and performance in y-axis (both axis in the log scale). As presented in Figure 2.6b, CARM includes in a single plot all the memory levels, represented by four slanted roofs (one for each memory level). Each slanted roof delimits the memory bound region of the respective memory level (L1, L2, L3 and DRAM). The maximum attainable performance in this region is limited by L1 cache bandwidth, while the remaining levels offer a lower attainable performance, due to the bandwidth reduction when data is fetched further away from the core. ORM only contains one slanted roof (Figure 2.6a), representing the bandwidth between the LLC and DRAM. In the right part of the models, a set of horizontal roofs forms the compute bound region, which describes the processor computational capabilities. Since Intel Ivy Bridge supports vectorized instructions, e.g., AVX, SSE, and scalar instructions (such as ADD and MUL), the compute region can include one horizontal roof for each instruction type. In particular, Intel Ivy Bridge achieves maximum FP throughput when using DP AVX MAD, corresponding to the FP peak performance, as shown in Figures 2.6a and 2.6b. Furthermore, the intersection between the horizontal and slanted roofs (i.e., the ridge point) represents the minimum intensity that allows to reach Fp and it also demarks the point where computation time is equal to the memory transfer time [3, 6]. As a consequence, the ridge point defines the boundary between the two regions of the model, i.e., the compute (on the right side of the ridge point) and memory bound (on the left side of the ridge point) regions. Since the application is usually plotted with a single point within the roofline chart, if the application intensity is on the right side of the ridge point, it is compute bound; if it is on the left side, the application is characterized as memory bound [3, 6]. It is worth to note that the ORM can also be applied to other memory levels but, instead of using DRAM bandwidth, it is constructed with the peak bandwidth of the desired memory level. Hence, to analyze the gains when applying different application optimization strategies (e.g., improving the memory access pattern), it is necessary to construct and simultaneously use several different representations of the model, one for each memory level [4]. Furthermore, application characterization greatly differs in CARM and ORM. In ORM, since only DRAM traffic is analyzed, the implementation of certain optimization techniques (e.g. cache blocking) can provoke a reduction in the DRAM traffic, thus increasing OI. As a result, the application point can move from the memory bound region towards the compute bound region. In CARM, since the memory traffic is seen from the core perspective, the optimization techniques do not modify the AI (the AI is the property of the application), unless the applied optimizations change the algorithm itself. Hence, CARM allows to visualize the optimization potential of

Figure 2.7: Performance Cache-Aware Roofline Model for an Intel 6700K quad-core processor (Skylake).

A given application (kernel) is typically plotted as a single point in the CARM, with respect to its AI and the performance obtained when executed on a given platform. Since the AI of the application point is not expected to vary significantly when applying optimization techniques, a simple rule of thumb can be followed in the CARM when determining the potential execution bottlenecks and deriving the optimization guidelines. Given the position of the application point in the CARM plot, an imaginary vertical line should be drawn at the application AI; all rooflines intersected by this imaginary line then represent potential sources of execution bottlenecks that limit the application performance.

Figure 2.7 presents an example of a Cache-aware Roofline plot where two kernels are reported. The first kernel (marked with "M") has an AI marked with "A1", which is underneath the L3 bandwidth ceiling; it is thus memory bound, and its performance can potentially be improved by applying memory-related optimizations (e.g., by improving the cache utilization and the memory access pattern). The second kernel (marked with "C") has an AI denoted with "A2", which is underneath the peak performance ceilings; it is thus compute bound, and its performance can potentially be improved by code vectorization and the use of advanced ISA extensions.

When deciding which optimization techniques to apply, special attention should be paid to the memory and/or compute bottlenecks signaled by the rooflines positioned directly above the application point. Hence, when applying different optimization techniques, it is expected that the application point moves along the y-axis towards the uppermost roofline (i.e., application performance improves by breaking the above-positioned rooflines without significant changes in the AI on the x-axis). This observation does not necessarily hold for application optimization based on the ORM and OI, due to their strong dependency on hardware properties [3–5].
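The rule of thumb described above can be sketched in a few lines of C: draw a vertical line at the application's AI and report every roof lying above the measured performance. The bandwidth and peak values below are illustrative placeholders for a hypothetical quad-core CPU, not measured Skylake parameters.

    #include <stdio.h>

    /* Hypothetical roofs (bandwidths in GB/s, FP peak in GFLOPS/s);
     * in practice these would be replaced with benchmarked values. */
    struct roof { const char *name; double bandwidth; };

    int main(void) {
        struct roof roofs[] = {{"L1", 400.0}, {"L2", 200.0},
                               {"L3", 100.0}, {"DRAM", 30.0}};
        double Fp = 256.0;            /* FP peak (flat roof)          */
        double AI = 0.25, perf = 5.0; /* application point in the plot */

        for (int i = 0; i < 4; i++) {
            double slanted = roofs[i].bandwidth * AI;
            double roof = slanted < Fp ? slanted : Fp;
            if (roof > perf) /* roof above the point: potential bottleneck */
                printf("%s roof at %.1f GFLOPS/s limits the application\n",
                       roofs[i].name, roof);
        }
        return 0;
    }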

Intel Advisor Roofline

In 2017, CARM was integrated as an official feature of Intel Advisor (also referred to as Intel Advisor Roofline), where the process of building the roofline plots and the in-depth application characterization are fully automated with respect to the hardware platform where the applications are executed [19]. Intel Advisor is a software tool for analyzing application behavior on a wide range of Intel processors, covering all contemporary Intel CPU micro-architectures (from Nehalem to Skylake) up to massively parallel devices (e.g., the Intel Xeon Phi x200 family, codenamed Knights Landing).


Figure 2.8: Intel Advisor Roofline: Performance characterization of Minighost loops.

By tightly coupling a set of tools from the Vectorization Advisor and the Roofline Analysis, Intel Advisor can now provide insightful performance and design hints to help in application optimization. To exploit the capabilities of Intel Advisor Roofline, it is required to run the Survey and Trip Counts / FLOPS analyses when profiling an application. During this phase, Intel Advisor also performs a set of quick benchmarks to assess the CARM-related performance parameters of a given execution platform, such as the realistically attainable bandwidth from different memory levels to the core and the peak performance of different arithmetic units. These parameters are subsequently used to automatically construct all necessary rooflines in the performance CARM for a given micro-architecture. Performance data of the target application is also extracted during the Survey and Trip Counts analysis, e.g., the total amount of floating point operations (flops), the total amount of requested data (bytes), the execution time and the vectorization efficiency. By combining this analysis with the performance CARM, the final outcome of the Intel Advisor Roofline is produced, where all loops and functions of the target application are characterized in the CARM plot.

Figure 2.8 presents an example of the Intel Advisor CARM characterization (hierarchical mode) for several loops of the Minighost application [23] on a single core of the Intel 6700K processor (Skylake). As it can be seen, the automatically constructed performance CARM in Intel Advisor encapsulates the previously elaborated features of the performance CARM, representing the attainable micro-architecture performance upper-bounds for different levels of the memory hierarchy (from L1 to DRAM). The loops of the Minighost application are represented as dots in the CARM plot, whose size and color are selected according to their execution time, going from green to yellow and finally red. Green points are usually unworthy of attention, since their contribution to the overall application execution time is very small. However, the contribution of red and yellow points to the overall execution time is more significant, thus they represent the potential candidates for optimization.

In the most recent update of Intel Advisor, the Hierarchical Roofline feature was introduced, which allows visualization of the aggregated performance of several kernels with respect to the parent kernel that invokes their execution. As shown in Figure 2.8, this functionality is attained by connecting several application kernels into a single parent dot, thus evaluating the flops and bytes contribution of each loop/function in the main kernel (see the connection of kernel 1 in Figure 2.8).

This feature increases the insightfulness of the Intel Advisor Roofline, since it eases the source code analysis with a hierarchical application characterization.

As it can be observed, the Intel Advisor CARM provides a set of powerful tools for in-depth application performance analysis on a given architecture, and it eases the selection of optimization techniques that can be applied to increase application performance. This, in turn, avoids wasting time on micro-optimizations that do not contribute greatly to the overall performance of the application. In the scope of this Thesis, Intel Advisor is mainly used to facilitate application analysis. In particular, this tool allows defining the hotspots with the biggest impact on the overall execution time of the application, which are the main focus of this Thesis when characterizing the applications. Besides, Intel Advisor provides the assembly code for all the measured kernels, allowing an extensive analysis of the instructions and instruction set extensions utilized by each hotspot. Since the Intel Advisor CARM and its hierarchical version represent the first steps towards insightful application characterization, these charts are used as baselines for comparison with the CARM extensions proposed in this Thesis. Finally, the instruction mix analysis in Intel Advisor might provide additional insights into application design and code quality, as well as additional hints on possible execution bottlenecks.

However, the contributions of this Thesis greatly surpass the pure utilization of a set of Advisor features. In particular, a special focus is given to uncovering the CARM construction methodology that provides the visual representation of the model. This analysis allows pinpointing the shortcomings of the state-of-the-art CARM implementation, which may result in inconclusive (or even misleading) characterization and optimization hints for a set of real-world applications. In this respect, this Thesis also proposes a set of strategies and recommendations on how to further improve the roofline insightfulness, not only for the Intel Advisor CARM implementation, but also for roofline modeling in general.

2.2.2 Power, Energy and Energy-Efficiency Roofline Modeling

In previous years, the main concern in application optimization was performance maximization alone. However, due to technology and architectural constraints, recent trends focus more on energy-efficient execution. In order to address this problem, ORM and CARM were extended to provide power consumption, energy consumption and energy-efficiency modeling of CPU micro-architectures [5, 24, 25]. These new models adopt approaches similar to those of the respective performance models, thus they inherit all the previously mentioned differences between CARM and ORM in the performance domain.

Original Roofline Model

The authors in [24–26] applied the ORM performance principles to power, energy and energy-efficiency. While in the performance model the computation time and the memory transfer time can overlap, the energy consumed when performing computations ($\phi\varepsilon_{flop}$) and memory transfers ($\beta_D\varepsilon_{mem}$) cannot follow this principle. In [24, 25], by considering a constant power ($\pi_0$), which does not depend on any executed operation, the total energy consumption of an application is expressed as:

$$E = \phi\varepsilon_{flop} + \beta_D\varepsilon_{mem} + \pi_0 T = \phi\varepsilon_{flop} \times \left(1 + \frac{B_\varepsilon}{OI} + \frac{\pi_0 T}{\varepsilon_{flop}\,\phi}\right), \qquad (2.4)$$

where $\phi$ represents the amount of flops performed and $\beta_D$ is the amount of DRAM bytes transferred. Equation (2.4) depends on three parameters: the constant energy per flop ($\varepsilon_{flop}$), the constant energy per byte ($\varepsilon_{mem}$), and the energy provided by the constant power ($\pi_0 T$), which is linear in time [24–26]. From this equation, the models for energy-efficiency ($\phi/E$) and power ($E/T$) can be derived, which are represented in Figures 2.9a and 2.9b, respectively, for the Intel Ivy Bridge 3770K. Both figures are plotted with the operational intensity on the x-axis and the respective metric (power or efficiency) on the y-axis.
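As an illustration of Equation (2.4), the sketch below evaluates both its direct and factored forms; under the stated definitions the two must agree, which makes it a quick consistency check. All function names and parameter values are illustrative assumptions.

    /* Sketch of Eq. (2.4). Assumed units: e_flop [J/flop],
     * e_mem [J/byte], pi0 [W], T [s], OI = phi/beta_D [flop/byte]. */
    double orm_energy_direct(double phi, double beta_D, double T,
                             double e_flop, double e_mem, double pi0) {
        return phi * e_flop + beta_D * e_mem + pi0 * T;
    }

    double orm_energy_factored(double phi, double OI, double T,
                               double e_flop, double e_mem, double pi0) {
        double B_eps = e_mem / e_flop;          /* energy balance point */
        return phi * e_flop * (1.0 + B_eps / OI /* memory energy term   */
                                   + pi0 * T / (e_flop * phi));
    }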

(a) ORM Energy-Efficiency model [26]. (b) ORM Power Model [26].

Figure 2.9: Original Roofline Models for energy-efficiency and power consumption.

In the energy-efficiency ORM extension, the energy balance point ($B_\varepsilon = \varepsilon_{mem}/\varepsilon_{flop}$) is introduced, which corresponds to the operational intensity at which bytes and flops consume the same amount of energy. This parameter defines the memory bound and compute bound regions from the efficiency point of view, as shown in Figure 2.9a [24–26]. Regarding the ORM power model (Figure 2.9b), it is worth mentioning that the maximum power corresponds to the ridge point in the performance model. Furthermore, when the OI increases, the application moves deep inside the compute bound region and, as expected, its average power tends towards the computation power. On the other hand, by reducing the OI, the workload becomes memory bound and its power becomes limited by the DRAM [24–26].

Cache-Aware Roofline Model

Recently, the CARM principles were applied to model the power consumption and energy-efficiency upper-bounds of modern multi-cores with multiple levels of memory hierarchy [5]. For this purpose, the multi-core system is modeled with three internal power domains: the core domain ($P_c$), which corresponds to the power consumed by the units related to instruction execution and the memory subsystem; the uncore domain ($P_u$), related to the power consumption of the remaining on-chip components; and the package domain ($P_p$), i.e., the overall power consumed by the chip. The relation between these three domains is given by

$$P_p = P_c + P_u. \qquad (2.5)$$

Through a set of experimental benchmarks, performed on the Intel Ivy Bridge 3770K with three cache levels and DRAM, the power consumed by the FP units and the memory subsystem is obtained, as presented in Figure 2.10. Correspondingly, the core domain can be divided in two parts: the memory subsystem and the FP units. In the memory subsystem (presented in Figure 2.10a), due to the increase in activity of the different cache levels, the core domain power $P^{\beta}_{c,y}$, where $y \in \{L1, L2, L3, DRAM\}$, increases from the L1 to the L3 cache, since more cache levels are used when data is fetched further away from the core. However, when accessing DRAM, the bandwidth seen from the core is reduced and the activity in the caches diminishes (the core stalls while data is not fetched from DRAM), which causes a reduction in the power of the core domain. In what concerns the uncore domain power ($P^{\beta}_u$), it is constant for cache accesses and increases when the DRAM is used, since the memory controller and interconnect are more intensely utilized.

(a) Memory subsystem (b) FP units

Figure 2.10: Power consumed by the Intel 3770K Ivy Bridge processor [5].

Regarding the FP units power (presented in Figure 2.10b), the core domain power ($P^{\phi}_c$) initially increases with the number of performed FP operations, stabilizing when the maximum performance is achieved. The uncore power ($P^{\phi}_u$) is constant, since only arithmetic units are utilized, and it is equal to the uncore power observed for cache accesses.

When FP operations and memory operations are performed simultaneously, they share certain components in the processor's pipeline. Thus, $P^{\beta}_{c,y}$ and $P^{\phi}_c$ include two main components: 1) a variable power contribution ($P^{v,\beta}_{c,y}$ and $P^{v,\phi}_c$); and 2) the constant power of the chip ($P^q_c$), due to the shared components. Hence, the power consumption in the core domain when only FP operations are performed is expressed as $P^{\phi}_c = P^q_c + P^{v,\phi}_c$, while the power consumption in the core domain corresponding to a memory level $y$ is given by $P^{\beta}_{c,y} = P^q_c + P^{v,\beta}_{c,y}$. $P^{v,\phi}_c$ and $P^{v,\beta}_{c,y}$ correspond to the variable power of the FP units and of the memory level $y$, respectively. Based on these parameters, the power of the core domain, for a given memory level $y$, is calculated by:

$$P_{c,y}(AI) = P^q_c + P^{v,\beta}_{c,y} \min\left(1, \frac{F_p}{B_y AI}\right) + P^{v,\phi}_c \min\left(1, \frac{B_y AI}{F_p}\right), \qquad (2.6)$$

where $B_y$ is the bandwidth of the memory level $y$ [5].

Furthermore, the uncore power only varies when DRAM accesses are performed. Consequently, $P^{\phi}_u = P^{\beta}_{u,y} = P^q_u$ when $y \neq D{\rightarrow}C$. Therefore, the uncore domain power is given by:

$$P_{u,y}(AI) = P^q_u + P^v_u \frac{T_D}{T(AI)} = P^q_u + P^v_u \min\left(1, \frac{F_p}{B_{D \rightarrow C}\, AI}\right), \qquad (2.7)$$

where $P^v_u$ is the variable power consumed by the uncore components, $B_{D \rightarrow C}$ is the DRAM bandwidth seen from the core and $T_D$ is the time spent serving the DRAM requests.

Finally, the package domain power results from the sum of Equations (2.6) and (2.7), producing the analytic power CARM, presented in Figure 2.11a. In particular, at the ridge point, since $T_c = T_m = T$, both the FP units and the memory subsystem contribute the most to the overall power. Thus, while in the performance CARM the ridge point indicates the best operating point (the lowest AI that achieves maximum performance), the ridge point in the power CARM corresponds to the maximum power consumption. As the AI increases towards the compute bound region, the power consumption asymptotically decreases towards the power consumed when only FP computations are performed. On the other hand, as the AI decreases in the memory bound region, the power consumption decreases towards the one consumed when only performing memory transfers to/from the specific memory level $y$.

However, this model does not take into account the transition between memory levels, which occurs gradually, as shown in Figure 2.10a. Hence, the total power CARM [5] is developed (presented in Figure 2.11b), including all the possible transitional power consumption states and defining an upper-bound for the power consumption of the micro-architecture.
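The analytic power CARM of Equations (2.6) and (2.7) can be sketched directly in C; the P-parameters below stand in for the benchmarked constants of [5] and are placeholders only, a sketch rather than the model's actual implementation.

    /* Sketch of Eqs. (2.6)-(2.7): core and uncore power at a given AI
     * for one memory level y. Assumed units: By and Bdc in GB/s,
     * Fp in GFLOPS/s, all P_* parameters in W. */
    static double min1(double x) { return x < 1.0 ? x : 1.0; }

    double core_power(double AI, double By, double Fp, double Pq_c,
                      double Pv_mem, double Pv_fp) {
        return Pq_c + Pv_mem * min1(Fp / (By * AI))  /* memory activity */
                    + Pv_fp  * min1(By * AI / Fp);   /* FP activity     */
    }

    double uncore_power(double AI, double Bdc, double Fp,
                        double Pq_u, double Pv_u) {
        /* Variable uncore power weighted by the fraction of time spent
         * serving DRAM requests, T_D / T(AI). */
        return Pq_u + Pv_u * min1(Fp / (Bdc * AI));
    }

    /* Package power, Eq. (2.5): core_power(...) + uncore_power(...). */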

(a) Analytic power CARM (b) Total power CARM

Figure 2.11: Power CARM models for the Intel 3770K Ivy Bridge [5].

Based on the power CARM equations, the total energy and energy-efficiency models can be derived. In the core domain, the energy CARM for a given memory level $y$ is defined as:

$$E_{c,y}(AI) = P_{c,y}(AI)\,T(AI) = \phi\left[\frac{P^q_c}{\min\left(B_y AI,\, F_p\right)} + \frac{P^{v,\beta}_{c,y}}{B_y AI} + \frac{P^{v,\phi}_c}{F_p}\right], \qquad (2.8)$$

while the energy-efficiency CARM in the core domain is given by:

$$\varepsilon_{c,y}(AI) = \frac{F_{a,y}(AI)}{P_{c,y}(AI)} = \frac{\phi}{E_{c,y}(AI)} = \frac{B_y AI\, F_p}{P^q_c \max\left(F_p,\, B_y AI\right) + P^{v,\beta}_{c,y} F_p + P^{v,\phi}_c B_y AI}. \qquad (2.9)$$
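Reusing the two earlier sketches, the energy-efficiency of Equation (2.9) follows as attainable performance over core power (GFLOPS/s over W yields GFlops/J); again a hedged sketch with all parameters illustrative, not a reference implementation.

    /* Sketch of Eq. (2.9): efficiency = F_a,y(AI) / P_c,y(AI), reusing
     * carm_attainable_gflops() and core_power() from the sketches above. */
    double carm_efficiency(double AI, double By, double Fp, double Pq_c,
                           double Pv_mem, double Pv_fp) {
        return carm_attainable_gflops(By, Fp, AI)
             / core_power(AI, By, Fp, Pq_c, Pv_mem, Pv_fp);
    }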

In the memory bound region, the total energy CARM (Figure 2.12a) is almost constant, since the execution time is completely dominated by memory operations. On the other hand, in the compute bound region, the amount of flops increases, dominating the execution time and, consequently, increasing the energy consumption.

(a) Total Energy CARM (b) Total Energy-Efficiency CARM

Figure 2.12: Energy and Energy-Efficiency CARM for the Intel 3770K Ivy Bridge.

Regarding the total energy-efficiency CARM (presented in Figure 2.12b), in the memory bound region the lowest efficiency is obtained for the DRAM, while the highest efficiency corresponds to the L1 cache. In the compute bound region, the energy-efficiency increases with the AI and converges to the maximum efficiency of the architecture, which can only be achieved for AI → ∞. Furthermore, the ridge point (i.e., the intersection between the memory and compute roofs) in the performance CARM does not correspond to the point where the maximum efficiency is achieved, since it refers to the maximum power consumption, which does not guarantee the most energy-efficient execution. The energy-efficiency CARM is interpreted by relying on the regions of high energy-efficiency, which are defined starting from the minimum AI required to achieve 99% of the maximum energy-efficiency for each level of the memory hierarchy (see Figure 2.12b).

2.2.3 Remarks on Original and Cache-aware Roofline principles

In order to better showcase the differences between ORM and CARM, Figures 2.13a and 2.13b present the characterization of three different iterative applications in both models, namely: APP-D (limited by DRAM), APP-L3 (limited by L3 cache) and APP-L1 (limited by L1 cache). These applications were designed to reach the model upper-bounds corresponding to the accessed memory level.

(a) ORM [3]. (b) CARM [3].

Figure 2.13: Application with different problem sizes in Intel 3770K Ivy Bridge.

In the first iteration, the applications APP-L1 and APP-L3 are characterized equally in both models (see the points marked with "1st"), since all memory operations fetch data from DRAM. However, in the remaining iterations, the accesses are served by the respective memories (L1 or L3), thus the DRAM traffic is reduced and the OI in ORM increases. Hence, the applications move from the memory bound region to the compute bound region and, according to ORM, it is possible to achieve the FP peak performance with these applications. On the other hand, the CARM AI does not change with the number of performed iterations, since the memory traffic seen from the core remains the same. In contrast with ORM, CARM shows that APP-L1 cannot have its performance further improved, while APP-L3 performance can be boosted to achieve the FP peak performance. Despite the difference in characterization, the two workloads reach the same performance in both models, since the performance in both ORM and CARM is reflected from the core perspective (i.e., it corresponds to the throughput of the FP units). Furthermore, APP-D is the only application that is characterized equally by both models, since all its accesses are served by DRAM.

By comparing CARM and ORM, it is possible to state some advantages of the former over the latter. For example, to correctly characterize the applications with ORM, it is necessary to construct several model instances, in order to take into account the whole memory hierarchy. In particular, for the previously presented application examples, since the ORM's OI is related to the data traffic between two memory levels, three different plots would be necessary to characterize these three applications correctly. On the other hand, CARM allows characterizing all the applications in a single plot, increasing the model simplicity and insightfulness. Besides, while the CARM construction relies mainly on experimental measurements [3, 5], the ORM is constructed based on manufacturer datasheets (performance ORM) and/or mathematical interpolations and approximations (power, energy and energy-efficiency ORMs) [4, 24–26]. Hence, the ORM does not take into account possible architectural limitations, while the CARM can reflect the system upper-bounds with more accuracy, allowing a more accurate characterization of applications and the selection of the best optimization techniques.

2.2.4 State-of-the-art approaches on extending the usability of insightful models

Micro-architecture modeling and application characterization are tackled in several state-of-the-art works, with the objective of easing application characterization and optimization, since this task can become quite challenging when considering the micro-architecture complexity and application heterogeneity. The most representative scientific works in this research area are presented in Table 2.1.

Table 2.1: State-of-the-art works.

Paper     Year   Architecture   Model   Objective
[1, 2]    2014   CPU            Other   The Top-Down Method for performance analysis
[12]      2011   CPU            ORM     Study of NUMA systems using the roofline model
[13, 14]  2014   CPU            ORM     Introducing application life cycle and memory latency in the roofline model
[15]      2014   CPU            ORM     Extending the identification of bottlenecks in the roofline model
[16]      2016   CPU            ORM     Application of the roofline model to AMT runtimes
[20]      2016   CPU            Other   The Top-Down Method for power analysis
[27]      2010   CPU            ORM     Introducing a memory concurrency modeling approach
[28]      2012   GPU            ORM     Extending the roofline model to predict performance prior to implementation
[29]      2013   FPGA           ORM     Extending the roofline model to target FPGA performance
[30, 31]  2015   CPU            Other   Execution-Cache-Memory (ECM) model
[32]      2017   GPU            CARM    Extending CARM to GPU architectures
[33]      2017   CPU            CARM    Application analysis with Intel Advisor CARM
[34]      2017   CPU            CARM    PIC code performance analysis
[35]      2017   CPU            CARM    Monte Carlo simulations optimization
[36]      2017   CPU            CARM    Optimization and parallelization of B-spline based orbital evaluations

In order to improve the insightfulness and portability of the different roofline modeling principles, several state-of-the-art works propose to extend this methodology to characterize new bottlenecks and platforms. The works presented in [12–16, 27] propose several extensions to ORM, mainly targeting CPU micro-architectures, while the studies presented in [28] and [29] focus on the ORM applicability to GPUs and Field-Programmable Gate Arrays (FPGAs), respectively.

Regarding the ORM FPGA extension proposed in [29], the roofline model is constructed based on High-Level Synthesis (HLS) tools, in order to relate algorithm performance and I/O bandwidth. In this work, the architecture design is fully driven by the characteristics of the specific algorithms, thus it is necessary to reconstruct the model for each algorithm. Besides, the relation between computation capabilities and resource consumption (area) is introduced by combining the ORM principles with the main FPGA characteristics. The GPU extension proposed in [28], i.e., the Boat Hull model, also represents an algorithm-based model, which does not strictly rely on any architectural characteristics. This work mainly aims at predicting the algorithm performance before its implementation, by including into ORM the information about algorithm classes [37], in order to characterize the algorithm as memory bound or compute bound. Both of these works, i.e., [29] and [28], can be seen as the first steps towards ORM portability across different platforms. Since CPU+GPU and CPU+FPGA architectures are becoming increasingly popular, these works can help developers to define the upper-bounds of such novel heterogeneous systems.

Furthermore, the ORM is also extended to NUMA systems in [12–14]. The work presented in [12] adds roofs for the performance peaks attained with different core utilizations (e.g., when only 1/4 of the cores are used) and for the different memory bandwidths caused by memory imbalance (e.g., when 50% of the memory traffic is handled by a single memory controller). The works [13, 14] introduce application characterization for different phases during execution and investigate the relation between performance, memory bandwidth, operational intensity and the latency of memory accesses. These extensions provide additional insights about the application behavior during its runtime, thus allowing the identification of the application kernels that represent the main performance bottlenecks.

The works presented in [15, 16, 27] specifically focus on multi-core processors. In [27], the main objective is to evaluate the effects of concurrent cache misses with ORM principles, in order to identify possible memory bottlenecks in multi-threaded applications, e.g., race conditions. In [15], the existence of additional bottlenecks is explored by characterizing different components in the processor pipeline. In detail, by using a cycle-by-cycle analysis in an Intel Xeon processor simulator, the throughput, latency, issue and stall behavior of several components (e.g., the reorder buffer, the reservation station and the load/store buffers) are obtained, in order to extend the ORM memory and compute bound regions. Finally, the work presented in [16] extends the ORM to AMT (asynchronous many-task) runtimes in multi-cores. The analysis of these applications is quite challenging, since their asynchronous nature hides the runtime overheads. To address this problem, a model based on ORM is created for the sequential units.

Although this Thesis is focused on CARM and multi-core CPUs, the extensions proposed in the above-referred works, including those targeting FPGA, GPU and NUMA systems, represent orthogonal research approaches that can be adapted for CARM in multi-core processors, in order to further increase the insightfulness and usability of the model in a wide range of possible scenarios. In this respect, it is also worth noting that the importance of roofline modeling is evidenced in works that adopt similar modeling principles to provide more elaborate micro-architecture models, such as the ECM [30, 31]. The objective of this model is to estimate the application execution time and to characterize its memory bottlenecks, by estimating the access time of each memory level. Although the ECM inherits an approach to memory bandwidth modeling similar to the one proposed in CARM, its construction requires a large amount of hardware performance counters, which increases its complexity and limits its usability to the specific set of platforms that support the required counters and expose them to the end-user.

Although recent, the performance CARM [3] has already been used to aid architecture design and for the optimization and characterization of applications from different scientific domains [18, 38–42], while several tools have also been proposed to ease CARM-based analysis [17, 43, 44]. The works presented in [33–36] use the Intel Advisor CARM to characterize the performance of different applications, as well as to investigate the impact of the applied optimizations. In particular, [33] compares the ORM and CARM application characterizations and the insights provided by both models. Furthermore, the CARM principles are extended to NVIDIA GPU architectures in [32], while the works proposed in [33, 34] use the Intel Advisor CARM on the Intel Xeon Phi architecture. These works demonstrate the CARM portability across platforms and architectures for all modeling domains, i.e., performance, power and energy-efficiency.

To the best of our knowledge, there are no scientific studies tackling CARM extensions and the characterization of real-world applications for power consumption and energy-efficiency. Thus, in order to address this problem, the main objective of this Thesis is to correlate application performance, power and efficiency, and to propose additional extensions to this model, in order to further improve its insightfulness.

2.3 Open challenges in insightful modeling

Due to its simplicity, roofline modeling contains certain limitations that affect all modeled domains, i.e., performance, power consumption and energy-efficiency. For example, modern applications contain a huge diversity of instructions, such as integer and floating-point; however, the existing roofline modeling methodologies only consider one type of arithmetic operations at a time (in particular, floating-point arithmetic operations). To address this issue, it is necessary to introduce awareness of a wide range of operations into roofline modeling, in order to allow a more accurate characterization of a wider range of applications from different scientific areas.

This observation is directly related to the complexity of contemporary micro-architectures and the ability of the existing roofline modeling approaches to fully expose their upper-bounds. As referred to in Section 2.1, modern multi-cores provide support for a vast range of instructions and different ISA extensions, which may affect the architecture performance, power and efficiency in many different ways. Hence, by focusing only on a subset of micro-architecture features and pipeline components, the existing roofline modeling approaches may hide some important bottlenecks and insights from computer architects and developers for certain types of applications. For example, different applications may require a different amount of load and store operations to be performed, which does not necessarily result in the maximum utilization of the memory ports in the micro-architecture backend. In fact, depending on the ratio of load and store operations, certain components in the memory subsystem may be underutilized, thus provoking a significant change in the attainable bandwidth upper-bounds (for each level of the memory hierarchy). This effect necessarily implies modifications in the modeling of the memory bound regions. By introducing the load/store ratio into roofline modeling, the applications can be characterized more accurately, providing important insights about their behavior.

Furthermore, when focusing on CARM and its Intel Advisor implementation, it should be noted that it may provide limited characterization information for applications whose performance lies between two different roofs (slopes) in the memory bound region. Since each memory roof corresponds to a hit rate of 100% in the respective memory level, the introduction of complementary methods is required to extend the insightfulness in the memory bound region of the model.

These open challenges represent some of the main research topics tackled in this Thesis, via extensive micro-architecture benchmarking that uncovers the upper-bounds for different instruction types and mixes. Based on this evaluation on real hardware, different flavors and types of CARMs for all modeling domains (performance, power consumption and energy-efficiency) are proposed and experimentally validated. The proposed models aim at extending the insightfulness of the existing models when determining the application execution bottlenecks and selecting the best optimization strategy. Furthermore, novel and alternative roofline modeling approaches are also investigated in this Thesis, in order to provide support for a wider range of applications, which are not necessarily dominated by FP operations.

2.4 Summary

This chapter starts by presenting an overview of the main aspects of a standard Intel core pipeline. In particular, the Intel Skylake 6700K and Intel Ivy Bridge 3770K processors are analyzed in detail, stating some of the enhancements implemented between processor generations. When comparing both processors, there is a clear improvement in the Intel Skylake 6700K throughput in the FP units and the memory subsystem, in order to fulfill the increasing computational demands.

Moreover, two micro-architecture modeling methods used in the scope of this Thesis are introduced, i.e., the Top-Down Method and roofline modeling. The Top-Down analysis allows decoupling the main application bottlenecks according to the processor capabilities, organizing several performance limiters in a hierarchical tree. Since these metrics allow a better understanding of the main factors that limit application performance, the Top-Down Method plays an important role when validating the insights provided by the proposed CARM extensions, which are one of the main objectives of this Thesis.

Regarding roofline modeling, the two main existing approaches are introduced, i.e., CARM and ORM, for performance, power consumption and energy-efficiency. Besides, the CARM and ORM approaches are compared, stating their main differences in application characterization. Moreover, the Intel Advisor CARM implementation is also analyzed, providing a first look at its main features and their usability and insightfulness. Furthermore, several state-of-the-art works are analyzed, verifying the usability of insightful modeling and, in particular, of roofline modeling in application characterization and micro-architecture benchmarking for different architectures and accelerators, such as GPUs, FPGAs and many-core systems.

Finally, the chapter finishes with a discussion of some current CARM limitations. The state-of-the-art CARM implementations do not take into account the different processor capabilities, such as different instructions, instruction set extensions and load/store ratios. Since an application can exercise different processor capabilities, extending the CARM analysis to include this information is extremely important to provide a more accurate workload characterization, which might ease the design and optimization process of applications.

3. Reaching the architecture upper-bounds with micro-benchmarking

Contemporary multi-core CPUs support a wide variety of instruction types and extended instruction sets, in order to fulfill the computational demands of modern applications. Thus, to identify the main bottlenecks in the application execution that prevent it from exploiting the maximum potential of a given processor architecture, it is essential to first characterize and experimentally assess the sustainable upper-bound capabilities of the micro-architecture (e.g., the realistically achievable bandwidth of different memory levels and the throughput of different computational units). Since multi-core processors employ highly complex out-of-order engines, these parameters depend on a variety of pipeline components, their internal structure and features, which directly impact the micro-architecture capabilities.

To address this issue, accurate micro-architecture benchmarking is extremely important, allowing the characterization of the system throughput upper-bounds for the memory subsystem and the arithmetic units. Moreover, by relying on experimental benchmarking of the micro-architecture, it is possible to assess the realistic architectural limitations and upper-bounds, which do not necessarily correspond to the nominal (theoretical) specifications provided by the vendors in data-sheets. In fact, the experimental evaluation may also reveal properties that are not even disclosed in vendor specifications. However, designing a set of micro-architecture benchmarks that fully exercises different components in the processor pipeline is not a trivial task, as it is shown in this chapter.

In this chapter, an extensive set of benchmarks is constructed and performed for the Intel Ivy Bridge and Intel Skylake micro-architectures to deeply evaluate the capabilities of different subsystems in their pipelines. In particular, the benchmarks were created to evaluate the throughput upper-bounds of the complete memory hierarchy (caches and DRAM), as well as of different types of FP units, by considering a diverse set of instructions and/or extended instruction sets. In particular, the memory subsystem benchmarking considers different load/store ratios, which affect the sustainable memory bandwidth of the system. Hence, assessing the impact of the load/store ratios can provide additional insights about the micro-architecture capabilities and allow a more accurate application characterization (tailored according to the application characteristics/demands). Moreover, in this chapter, a specifically developed tool is introduced, describing its workflow and hardware counter access. This tool is designed to support the fine-grained experimental evaluation on real hardware using a set of architecture-specific benchmarks, also developed in the scope of this Thesis. In addition, the structure of each benchmark is presented, explaining its construction and the methods selected to improve its quality and reliability.

3.1 Tool for fine-grain micro-architecture benchmarking

In order to measure the amount of elapsed cycles, the number of performed memory/arithmetic instructions, and the energy consumption in different parts of the processor chip, it is necessary to access a set of hardware counter registers, i.e., the Model Specific Registers (MSRs) built into the processor [45]. Each MSR is identified by its unique address, which is used to read and modify the register content, e.g., to obtain measurements or to configure the counters.


Figure 3.1: Benchmarking tool general layout.

These operations can be executed with the assembly instructions rdmsr (to read a counter value) and wrmsr (to configure a counter) [45]. However, the access to MSRs can only be performed from kernel space. Hence, to access the counters from user space (e.g., during the application run-time), it is necessary to incorporate a separate kernel module to connect the kernel and user sides. In order to achieve this functionality and to obtain the processor throughput upper-bounds for the different memory subsystem levels and arithmetic units, a benchmarking tool was developed, whose layout is presented in Figure 3.1.

The tool relies on the kernel module from [46], which provides the communication interface between user space and kernel space through a set of system calls. In the scope of this Thesis, the kernel module was modified to improve its execution efficiency, reduce the overheads and incorporate additional functionalities. For example, the tool was upgraded to allow access to the Running Average Power Limit (RAPL) interface, in order to measure the energy consumption in different parts of the processor chip (core, uncore and the overall processor chip, i.e., package). Besides, it was also enriched to support the entire set of uncore events, allowing measurements in a wide range of platform components.

As shown in Figure 3.1, after the tool initialization, the threads are created using the pthreads interface. In each thread, besides the Time Stamp Counter (TSC) monitoring that guarantees an accurate measurement of the elapsed clock cycles, the counters to obtain the desired performance measurements are configured. In this part, the kernel module creates the MSR configuration on the user side, which is forwarded to the kernel space with the desired counter address and command (read or write), by using the system calls and the assembly instructions rdmsr and wrmsr.

To configure the counters, three main steps are performed. First, it is necessary to enable the counters by configuring the IA32_PERF_GLOBAL_CTRL MSR [45]. In this MSR, the first 8 bits enable the general purpose counters, while bits 32 to 34 enable the fixed counters; thus, to enable all the counters, all these bits must be set to 1. Next, each general purpose counter must be configured by using the respective IA32_PERFEVTSEL MSR [45]. In this register, the event select and the unit mask of the desired hardware performance event must be written in bits 0 to 7 and 8 to 15, respectively. Moreover, it is also possible to define counter masks (e.g., to count only the number of cycles when more than 4 instructions are delivered to the core) and to restrict counting to user mode, kernel mode, or both. Finally, the measurement is read from the respective IA32_PMC MSR [45]. It is important to notice that the Intel Skylake 6700K and the Intel Ivy Bridge 3770K have a limited number of hardware counters that can be accessed at any given time. In particular, both micro-architectures only support 4 general purpose counters per core with hyper-threading enabled and 8 without hyper-threading [45].
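For illustration, the sketch below programs and reads one general purpose counter through the standard Linux msr driver (/dev/cpu/*/msr, requiring root and the msr module) instead of the custom kernel module used in this Thesis; the MSR addresses and bit fields follow the Intel SDM [45], while the chosen event (INST_RETIRED.ANY_P) is only an example.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define IA32_PERF_GLOBAL_CTRL 0x38F  /* global enable bits           */
    #define IA32_PERFEVTSEL0      0x186  /* config for general counter 0 */
    #define IA32_PMC0             0x0C1  /* value of general counter 0   */

    static uint64_t rd(int fd, uint32_t msr) {
        uint64_t v = 0;
        pread(fd, &v, sizeof v, msr);    /* rdmsr executed in kernel space */
        return v;
    }

    static void wr(int fd, uint32_t msr, uint64_t v) {
        pwrite(fd, &v, sizeof v, msr);   /* wrmsr executed in kernel space */
    }

    int main(void) {
        int fd = open("/dev/cpu/0/msr", O_RDWR);  /* requires root */
        if (fd < 0) { perror("msr"); return 1; }

        /* Event select (bits 0-7), unit mask (bits 8-15), count in user
         * mode (bit 16) and kernel mode (bit 17), enable (bit 22);
         * 0xC0/0x00 encodes INST_RETIRED.ANY_P. */
        wr(fd, IA32_PERFEVTSEL0,
           0xC0 | (0x00 << 8) | (1u << 16) | (1u << 17) | (1u << 22));
        wr(fd, IA32_PERF_GLOBAL_CTRL, 0x1);  /* enable PMC0 */

        uint64_t before = rd(fd, IA32_PMC0);
        /* ... code under measurement ... */
        uint64_t after = rd(fd, IA32_PMC0);
        printf("retired instructions: %llu\n",
               (unsigned long long)(after - before));
        close(fd);
        return 0;
    }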

Since the RAPL MSRs are read-only and do not operate at the per-core level, their configuration is performed through batches containing all the necessary addresses, in order to obtain all the readings in a single communication between user and kernel spaces, thus minimizing the overheads imposed by the tool when performing the micro-architecture experimental evaluation. Independently of the executed benchmarks, the energy consumption in Intel Ivy Bridge and Intel Skylake is reported in several registers that refer to different domains of the processor chip, i.e., MSR_PP0_ENERGY_STATUS (core energy usage) and MSR_PKG_ENERGY_STATUS (socket energy usage). The difference between these two counters is referred to herein as the uncore (off-core) energy usage. In addition, although officially not supported in the data-sheets, the tested Intel Skylake processor (6700K) also allows measuring the DRAM energy usage with the counter MSR_DRAM_ENERGY_STATUS, which provides an estimation of the DRAM energy consumption [45].

After the MSR interface configuration, the tool provides two separate execution modes when performing the micro-architecture benchmarking: 1) counter training to minimize the overheads; and 2) benchmark execution. The first stage aims at reducing the impact of the micro-benchmarking overheads, since some hardware counters may not provide the most accurate measurements (counting overheads). Besides, there are also certain portions of the benchmark code that contain instructions from the benchmark skeleton (overhead code), e.g., loop control instructions, which are not the main subject of the experimental evaluation. Thus, it is possible to "train" the counters by correcting the measurements obtained in stage two, i.e., by subtracting the overhead measurements. The overhead code is placed before the benchmarked code, in a distinct inline function. To implement this correction, the TSC, the performance counters and the energy consumption registers are read in both stages. Once the parallel execution finishes, the tool reports the median values obtained from 1024 runs.
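The conversion of the raw RAPL readings into joules can be sketched as follows, again assuming the standard Linux msr driver; raw energy counts are expressed in multiples of 1/2^ESU joule, where ESU is taken from bits 12:8 of MSR_RAPL_POWER_UNIT, and the MSR addresses are from the Intel SDM.

    #include <stdint.h>
    #include <unistd.h>

    #define MSR_RAPL_POWER_UNIT    0x606
    #define MSR_PKG_ENERGY_STATUS  0x611  /* package (socket) domain */
    #define MSR_PP0_ENERGY_STATUS  0x639  /* core domain             */
    #define MSR_DRAM_ENERGY_STATUS 0x619  /* DRAM (estimated)        */

    static uint64_t rdmsr64(int fd, uint32_t msr) {
        uint64_t v = 0;
        pread(fd, &v, sizeof v, msr);
        return v;
    }

    /* Returns the accumulated energy of one RAPL domain in joules.
     * The 32-bit counter wraps, so two samples must be taken close
     * in time and subtracted modulo 2^32. */
    double rapl_energy_joules(int fd, uint32_t status_msr) {
        unsigned esu = (rdmsr64(fd, MSR_RAPL_POWER_UNIT) >> 8) & 0x1F;
        uint64_t raw = rdmsr64(fd, status_msr) & 0xFFFFFFFFu;
        return (double)raw / (double)(1ULL << esu);
    }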

Algorithm 1 Generic memory benchmark
for i < time do
    for j < repeat do
        MEM_INST
        MEM_INST
        (...)
    end for
    MEM_INST
    MEM_INST
    (...)
end for

Algorithm 2 Generic FP benchmark
for i < time do
    for j < repeat do
        FP_INST
        FP_INST
        (...)
    end for
    FP_INST
    FP_INST
    (...)
end for

The general structure of the developed micro-benchmarks for the evaluation of the upper-bound capabilities of the memory subsystem and FP units is shown in Algorithms 1 and 2, respectively. As it can be observed, both types of test codes share a similar structure and are constituted by two loops. The outer loop ensures that the performed test attains a certain predefined time duration, in order to increase the evaluation accuracy for benchmarks with small amounts of flops and bytes. In addition, since the core and socket RAPL counters take 50 ms to update their values, the outer loop is essential to guarantee the stability of the energy consumption measurements. By executing a set of tests with different time durations, it was experimentally assessed that accurate and stable readings are achieved when each test iteration takes approximately 100 ms.

When designing the micro-benchmarks, special care must be taken regarding the amount of instructions that can be placed in the micro-benchmark body (see MEM_INST and FP_INST in Algorithms 1 and 2). For example, having too many instructions may provoke evictions from the L1 instruction cache. In this scenario, additional memory transfers occur from the unified L2 cache, whose increased utilization for instructions may impact the L2 bandwidth evaluation for pure data transfers. Furthermore, this scenario causes an increase in the measured power, which severely degrades the evaluation accuracy regardless of the type of instructions being tested. For example, for the FP benchmark, the obtained power consumption will not reflect the power consumed exclusively by the FP units, but rather the superposition of the power consumption of the FP units and the instruction cache. By performing tests with different amounts of instructions, this phenomenon was verified after approximately 980 instructions. To overcome this issue, an inner loop with a fixed size is introduced within the benchmark structure.

On the other hand, for the inner loop, it is also necessary to take into account the opposite effect, i.e., having too few instructions inside the loop. By creating an inner loop with a small amount of instructions, they will fit inside the LSD (Loop Stream Detector). In this case, the processor may shut down (or clock gate) some components in the decoding pipeline, thus reducing the power consumption in the frontend. Hence, in order to avoid unexpected power measurements, the inner loop needs to have a size of at least 64 instructions, which is sufficient to eliminate this effect.

Based on this set of restrictions, the inner loop size and the number of necessary repetitions are calculated for each benchmark, according to the total amount of instructions to be performed, in order to maximize the number of instructions executed in the inner loop. Since the amount of instructions may not be a multiple of the loop size, the remaining instructions are placed after the inner loop, as shown in Algorithms 1 and 2. For example, considering a total of 255 instructions and an inner loop size of 64 instructions, the inner loop would have three repetitions and 63 instructions would be placed after it. It is also worth emphasizing that the tests were designed to minimize register dependencies, which can now only occur when all the available registers are used. However, due to the high amount of repetitions, these effects are effectively hidden by the architecture's out-of-order engine.
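The sizing rule described above (a fixed-size inner loop with the remainder emitted after it) can be captured in a short helper; the function and constant names are hypothetical, used only to make the 255-instruction example concrete.

    #include <stdio.h>

    #define INNER_LOOP_SIZE 64  /* minimum size that avoids LSD effects */

    /* Sketch of the benchmark-generator sizing rule: maximize the
     * instructions executed inside the fixed-size inner loop and emit
     * the remainder after it. */
    void plan_benchmark(unsigned total_insts,
                        unsigned *repeats, unsigned *tail) {
        *repeats = total_insts / INNER_LOOP_SIZE; /* inner-loop iterations */
        *tail    = total_insts % INNER_LOOP_SIZE; /* placed after the loop */
    }

    int main(void) {
        unsigned r, t;
        plan_benchmark(255, &r, &t);
        printf("repeats=%u tail=%u\n", r, t); /* prints: repeats=3 tail=63 */
        return 0;
    }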

3.2 Micro-architecture benchmarking

In this chapter, the previously described tool (see Figure 3.1) and a set of specifically designed micro-benchmarks are used to perform a fine-grain experimental evaluation of different Intel micro-architectures, in order to uncover their maximum capabilities for different parts of the CPU engine. To fully characterize each core component, a set of benchmarks is performed with different amounts of executed instructions. These experiments are performed on two different Intel Core client processors at their nominal frequencies, namely the Intel Ivy Bridge 3770K (3.5 GHz) and the Intel Skylake 6700K (4 GHz). Both processors have three cache levels (L1 with 32 KB, L2 with 256 KB and L3 with 8 MB) and DRAM, with 32 GB in the Intel Skylake and 8 GB in the Intel Ivy Bridge system. The micro-benchmarks are run on CentOS 7.2.1511 and compiled with Intel Compiler 17.0.4.196.

3.2.1 Exploring the maximum compute performance

As referred to in Section 2.1, the Intel Ivy Bridge and Intel Skylake micro-architectures greatly differ in compute capability, especially for DP FP arithmetic. In particular, Ivy Bridge only provides separate FP units for AVX ADD and MUL operations (MAD) in two different ports, while Intel Skylake includes full AVX FMA support in each of the two available ports for FP arithmetic. For this reason, two different micro-benchmarks are developed, one for each micro-architecture, i.e., the benchmark codes for MAD (Intel Ivy Bridge) and FMA (Intel Skylake), as presented in Algorithms 3 and 4, respectively (both using AVX SIMD DP instructions).

To measure the amount of performed AVX SIMD DP instructions, the counters FP_ARITH:256B_PACKED_DOUBLE (Intel Skylake) and SIMD_FP_256:PACKED_DOUBLE (Intel Ivy Bridge) are configured [45]. These tests follow the structure previously presented in Algorithm 2, by substituting the macro FP_INST with the following instructions: vmulpd and vaddpd for Intel Ivy Bridge and vfmadd132pd for Intel Skylake.

Algorithm 3 MAD DP AVX Benchmark for Intel Ivy Bridge
for i < time do
    for j < repeat do
        vmulpd %ymm0,%ymm0,%ymm0
        vaddpd %ymm1,%ymm1,%ymm1
        (...)
        vmulpd %ymm14,%ymm14,%ymm14
        vaddpd %ymm15,%ymm15,%ymm15
    end for
    vmulpd %ymm0,%ymm0,%ymm0
    vaddpd %ymm1,%ymm1,%ymm1
    (...)
end for

Algorithm 4 FMA DP AVX Benchmark for Intel Skylake
for i < time do
    for j < repeat do
        vfmadd132pd %ymm0,%ymm0,%ymm0
        vfmadd132pd %ymm1,%ymm1,%ymm1
        (...)
        vfmadd132pd %ymm12,%ymm12,%ymm12
        vfmadd132pd %ymm13,%ymm13,%ymm13
    end for
    vfmadd132pd %ymm0,%ymm0,%ymm0
    vfmadd132pd %ymm1,%ymm1,%ymm1
    (...)
end for
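As an illustration of how a kernel such as Algorithm 4 might be emitted from C, the following sketch uses GCC extended inline assembly; it mirrors the 14-register unroll of the algorithm, with repeat chosen by the sizing rule from Section 3.1. The function name is hypothetical and this is not the Thesis's actual benchmark generator.

    /* Sketch of the Algorithm 4 inner loop in GCC extended inline
     * assembly. Each vfmadd132pd performs 4 FMAs = 8 DP flops;
     * dependencies are avoided by cycling through ymm0-ymm13, as in
     * the algorithm above. */
    static void fma_kernel(long repeat) {
        for (long j = 0; j < repeat; j++) {
            __asm__ volatile(
                "vfmadd132pd %%ymm0,  %%ymm0,  %%ymm0\n\t"
                "vfmadd132pd %%ymm1,  %%ymm1,  %%ymm1\n\t"
                /* ... ymm2 through ymm12 unrolled identically ... */
                "vfmadd132pd %%ymm13, %%ymm13, %%ymm13\n\t"
                ::: "xmm0", "xmm1", "xmm13");
        }
    }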

By using the previously elaborated micro-benchmarking methodology within the developed tool, an extensive set of benchmarks was performed on each tested micro-architecture by varying the amount of performed FP operations. In particular, on each processor, more than 3000 tests were performed to obtain the desired results. Each test was repeated 1024 times and the median values of the counter measurements and the TSC were reported, in order to obtain stable results and a more accurate characterization of the processor capabilities. The obtained experimental results for single-core and multi-core MAD and FMA performance are presented in Figures 3.2a and 3.2b, for Intel Ivy Bridge and Intel Skylake, respectively.

(a) MAD performance for Intel Ivy Bridge 3770K. (b) FMA performance for Intel Skylake 6700K.

Figure 3.2: FP Units maximum performance using AVX SIMD DP instructions.

As it can be observed in Figures 3.2a and 3.2b, while filling the pipeline (slanted region), the performance increases with the amount of flops performed, until approximately 40 flops per core in Intel Ivy Bridge and 48 flops per core in Intel Skylake. As such, at least 10 AVX ADD/MUL instructions must be performed across both Ivy Bridge ports (5 instructions per port) to reach the maximum AVX DP FP MAD performance.

(a) MAD power consumption for Intel Ivy Bridge 3770K. (b) FMA power consumption for Intel Skylake 6700K.

Figure 3.3: FP Units maximum power consumption using AVX SIMD DP instructions.

In contrast, the Intel Skylake FMA units require 6 AVX FMAs across both ports (i.e., 3 instructions per port). These results may suggest a much higher efficiency of the FMA units implemented in the Intel Skylake architecture. Once the pipeline is completely filled with instructions (constant region), the processor achieves its maximum throughput. It is worth emphasizing that the developed micro-benchmarks attained the theoretical maximum performance on both architectures. In particular, the single-thread benchmarking achieved 28 GFLOPS/s in Intel Ivy Bridge (2 ports × 4 flops (1 MAD) × 3.5 GHz) and 64 GFLOPS/s in Intel Skylake (2 ports × 8 flops (1 FMA) × 4 GHz). For the multi-thread test (4 cores), Intel Ivy Bridge reached 112 GFLOPS/s (4×28) and Intel Skylake achieved 256 GFLOPS/s (4×64), since the FP performance scales linearly with the number of cores. As it can be observed, Intel Skylake offers about 2.3× higher performance than Intel Ivy Bridge (for both single-core and multi-core execution), mainly due to the inclusion of powerful AVX FMA units operating at a higher frequency.

The corresponding power consumption results are presented in Figures 3.3a (Intel Ivy Bridge) and 3.3b (Intel Skylake), for three different domains of the processor chip, i.e., core, uncore and package, including both single-core and multi-core execution scenarios. Although the power consumption takes more time to stabilize at its maximum value, a behavior similar to the one observed in the performance domain can be noticed. For single-thread tests, the Intel Ivy Bridge maximum power consumption in the core domain is around 13.5 W, while Intel Skylake consumes approximately 19 W, confirming that higher performance comes at the cost of increased power consumption (although these two micro-architectures rely on different manufacturing technologies). When all four cores are used, Intel Ivy Bridge consumes about 40 W and the Intel Skylake power consumption is approximately 60 W. Hence, in contrast to the performance, the power consumption does not scale linearly with the number of cores. This observation may suggest the existence of shared components in the core domain that are always active, regardless of whether a single thread or multiple threads are being used.

As only AVX DP FP arithmetic instructions are executed in the units inside the processing core, the uncore and DRAM power consumptions (Intel Skylake) are constant (throughout the entire test) and do not depend on the number of cores utilized. Being the superposition of the core and uncore power domains, the package power follows the trend observed in the core domain. For this reason, as well as to improve the readability and understanding of the power consumption results presented in this Chapter, the package power is omitted, since it typically does not provide additional insights.

By combining the experimental results obtained when evaluating the maximum computational performance (see Figure 3.2) and power consumption (see Figure 3.3), it is possible to provide a cross-comparison of the different micro-architectures in terms of their energy-efficiency (GFlops/J).
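The theoretical peaks quoted above follow from a simple product, sketched here for reference; peak_gflops is a hypothetical helper, not part of the benchmarking tool.

    /* Peak FP throughput = ports x flops-per-instruction x GHz x cores. */
    double peak_gflops(int ports, int flops_per_inst, double ghz, int cores) {
        return ports * flops_per_inst * ghz * cores;
    }
    /* peak_gflops(2, 4, 3.5, 1) = 28  -> Ivy Bridge MAD, single core
     * peak_gflops(2, 8, 4.0, 4) = 256 -> Skylake FMA, four cores     */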

(a) FP units performance for Intel Ivy Bridge 3770K. (b) FP units performance for Intel Skylake 6700K.

Figure 3.4: FP units performance using AVX SIMD DP instructions.

(a) FP units power consumption for Intel Ivy Bridge 3770K. (b) FP units power consumption for Intel Skylake 6700K.

Figure 3.5: FP units power consumption using AVX SIMD DP instructions.

Ivy Bridge 3770K can deliver about 2 GFlops/J (28 GFlops/s at 13W), while Intel Skylake 6700K provides 3.4 GFlops/J (64 GFlops/s at 19W). For multi-core execution, Skylake also outperforms Ivy Bridge in terms of energy- efficiency by delivering about 4.3 GFlops/J (versus 2.8 GFlops/J in Ivy Bridge). As it can be concluded, Intel Skylake 6700K offers significant improvements in energy-efficiency when compared to the Intel Ivy Bridge 3770K, namely about 70% for single-core and around 54% for multi-core computations. As previously referred, modern multi-core processors support a variety of compute instructions/units, e.g. ADD, MUL and FMA, which influence their performance upper-bounds. Hence, in order to provide a full charac- terization of compute capabilities of modern processors, it is necessary to benchmark arithmetic units for different FP instructions. The performance results obtained for 4 cores and different FP instructions are presented in Figures 3.4a and 3.4b, for Intel Ivy Bridge and Intel Skylake, respectively. As expected, Intel Ivy Bridge and Intel Skylake achieve the maximum performance when MAD and FMA instructions are performed, respectively. In both systems, ADD and MUL operations achieve the same throughput and the maximum performance equal to the half of the one achievable for MAD instructions in the Intel 3770K Ivy Bridge or FMA instructions in the Intel 6700K Skylake processor, respectively. As it can be observed, the benchmark tests are able to attain maximum theoretical performance of each instruction, i.e., ADD and MUL achieve 56 GFLOPS/s in Intel Ivy Bridge, while Intel Skylake, due to the enchantments in its architecture is able to attain 128 GFLOPS/s for each of these instructions. As observed in power consumption results presented in Figures 3.5a (Intel Ivy Bridge) and 3.5b (Intel Skylake), for multi-core computations, the highest power consumption corresponds to the instruction type that guarantees the maximum throughput in each platform. Regarding ADD and MUL instructions, their power consumption is


(a) FP units performance for different instruction set extensions in Intel Skylake 6700K (4 cores). (b) FP units power consumption for different instruction set extensions in Intel Skylake 6700K (4 cores).

Figure 3.6: FP units performance and power consumption for different instruction set extensions in Intel Skylake 6700K (4 cores).

Regarding ADD and MUL instructions, their power consumption is equal within the same processor, i.e., 33 W in Intel Ivy Bridge and 56 W in Intel Skylake, which represents an increase of 23 W between these two generations. Besides, it can be verified that, for different instruction types (units) and the same data precision, higher performance implies higher power consumption in both architectures. Regarding the energy-efficiency of multi-core FP ADD/MUL computations, a decrease of about 35% for Ivy Bridge and 55% for Skylake can be inferred when compared to their maximum achievable energy-efficiency for MAD or FMA instructions, respectively. However, Intel Skylake 6700K still offers a better energy-efficiency of about 2.3 GFlops/J versus 1.7 GFlops/J in Intel Ivy Bridge 3770K, i.e., about a 35% energy-efficiency improvement.

The throughput of a multi-core processor also depends on the used instruction set extension, e.g., AVX, SSE or scalar. The performance results obtained for the FP units, when using different extensions, are presented in Figure 3.6a for Intel Skylake 6700K (all 4 cores). In these tests, Intel Ivy Bridge results are not presented, since the analysis is similar to those already performed in this chapter and would not provide additional insights. As it can be observed in Figure 3.6a, the maximum attainable performance is achieved for all three tests, i.e., 256 GFLOPS/s for FMA AVX DP, 128 GFLOPS/s for FMA SSE DP and 64 GFLOPS/s for FMA Scalar DP. As expected, SSE performance is half of AVX, since an SSE vector holds only two DP elements (half of the AVX vector length). Moreover, Scalar DP performance is half of SSE, since each scalar instruction operates on a single DP element. Regarding the power consumption, presented in Figure 3.6b, the highest power consumption in the FP units is achieved when using AVX instructions (approximately 60 W). SSE and scalar instructions attain lower power consumption, i.e., 45 W and 29.9 W, respectively. Hence, from the energy-efficiency point of view, FMA AVX DP achieves 4.27 GFLOPs/J (256 GFLOPS/s at 60 W), while FMA SSE DP and FMA Scalar DP only achieve 2.84 GFLOPs/J (128 GFLOPS/s at 45 W) and 2.14 GFLOPs/J (64 GFLOPS/s at 29.9 W), respectively. Thus, in order to use the full potential of Intel Skylake 6700K from the energy-efficiency point of view, AVX instructions must be utilized.

In order to assess the quality of the developed micro-benchmarks and their ability to fully exercise the FP units in the processor architecture, the Top-Down Method [2] (see Section 2.1) was applied to the constructed benchmarks when evaluating the maximum performance of the Intel Skylake 6700K for AVX DP FP FMA operations on all four cores. The results obtained with the Top-Down analysis are presented in Figure 3.7, which gives a breakdown of the predominant sources of performance bottlenecks in different parts of the processor pipeline.


Figure 3.7: Top Down Method for FMA AVX SIMD DP at nominal frequency.

Since only arithmetic operations are performed, the memory subsystem does not limit performance, thus the memory bound contribution is zero. Besides, the frontend does not stall the backend execution, thus the frontend bound and bad speculation categories also do not contribute to diminishing performance. As it can be observed, before hitting the maximum peak performance, i.e., while filling the pipeline, the main bottleneck is core bound, since the amount of instructions is insufficient for the processor to achieve its maximum retirement rate. Thus, the utilization of the dispatch ports is the main performance limiter. In contrast, when maximum performance is achieved, the processor can only retire two FP instructions per cycle, i.e., only half of the retirement slots are used. As a result, the retiring contribution is around 50%. Finally, since only half of the dispatch ports reserved for computations are used (the remaining two ports do not support AVX DP FP arithmetics), the core bound contribution is around 50%. As it can be concluded, the developed micro-benchmarks were capable of fully exploiting the processor capabilities for AVX DP FP computations, by achieving the maximum possible retirement rate and core utilization, while exhibiting negligible (or zero) execution overheads in the other pipeline domains.
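The benchmarks evaluated here are hand-written in assembly, so the following C sketch with AVX2 intrinsics is only an illustration of the principle they rely on, under the assumption of a compiler that keeps the chains in registers: enough independent accumulators to cover the FMA latency, so that both FMA ports stay busy every cycle.

#include <immintrin.h>
#include <stdio.h>

/* Minimal FMA throughput sketch (compile with -O2 -mfma). The thesis benchmarks
   are written directly in assembly; this intrinsics version only illustrates
   the idea: enough independent FMA chains to cover the 4-cycle FMA latency on
   the two FMA ports, so the pipeline stays full. */
int main(void) {
    __m256d acc[12], mul = _mm256_set1_pd(1.0000001), add = _mm256_set1_pd(1e-9);
    for (int i = 0; i < 12; i++) acc[i] = _mm256_set1_pd(1.0);

    const long iters = 1L << 24;
    for (long i = 0; i < iters; i++)            /* 12 independent FMAs per iteration */
        for (int j = 0; j < 12; j++)
            acc[j] = _mm256_fmadd_pd(acc[j], mul, add);

    __m256d sum = acc[0];
    for (int j = 1; j < 12; j++) sum = _mm256_add_pd(sum, acc[j]);
    double out[4];
    _mm256_storeu_pd(out, sum);                 /* keep the results alive */
    printf("%f\n", out[0]);
    /* flops = iters * 12 FMAs * 8 flops; divide by elapsed time for GFLOPS/s */
    return 0;
}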

3.2.2 Memory subsystem benchmarking

Regarding the memory subsystem, its capabilities differ between the Intel Ivy Bridge and Intel Skylake micro-architectures, as referred in Section 2.1. While Intel Ivy Bridge can deliver a maximum of 48 bytes per cycle (two loads and one store of 16 bytes each), the Intel Skylake bus width between the core and the L1 data cache was increased to support a theoretical throughput of 96 bytes per cycle (two loads and one store of 32 bytes each). These enhancements across micro-architectures have a great impact on the memory subsystem capabilities, as will be demonstrated throughout this section.


(a) LD bandwidth for Intel Ivy Bridge 3770K. (b) LD bandwidth for Intel Skylake 6700K.

Figure 3.8: Memory subsystem bandwidth for LD AVX SIMD DP at nominal frequency.


(a) LD power consumption for Intel Ivy Bridge 3770K. (b) LD power consumption for Intel Skylake 6700K.

Figure 3.9: Memory subsystem power consumption for LD AVX SIMD DP at nominal frequency.

The benchmarks utilized to characterize the memory subsystem are created by substituting the macro MEM_INST, in Algorithm 1, with the respective memory instructions, i.e., vmovapd addr, reg for loads and vmovapd reg, addr for stores. In order to measure the number of performed load instructions, the events MEM_INST_RETIRED.ALL_LOADS (Intel Skylake) and MEM_UOPS_RETIRED.ALL_LOADS (Intel Ivy Bridge) are configured in the tool. For store instructions, the counters MEM_INST_RETIRED.ALL_STORES (Intel Skylake) and MEM_UOPS_RETIRED.ALL_STORES (Intel Ivy Bridge) are utilized in the measurements. In order to obtain the desired results, an extensive set of benchmarks following the described methodology is performed on both processors. To obtain accurate and stable memory bandwidth measurements, each presented curve involved more than 500 tests and, exactly as in the FP benchmarking, each test is repeated 1024 times, reporting the median value of the counter and clock measures.

The bandwidth results obtained for load instructions are presented in Figures 3.8a and 3.8b for Intel Ivy Bridge and Intel Skylake, respectively. As can be observed in both figures, the highest bandwidth is obtained in both systems when all the accesses are served by the L1 cache. Besides, the performed tests achieved the maximum theoretical bandwidth, i.e., 112 GB/s in Intel Ivy Bridge (two loads of 16 bytes per cycle) and 256 GB/s in Intel Skylake (two loads of 32 bytes per cycle) for single-thread benchmarks. The bandwidth of the remaining memory levels decreases as the data is fetched further away from the core. In the single-thread test, L2 achieved 128 GB/s in Intel Skylake and approximately 38 GB/s in Intel Ivy Bridge, while the L3 cache attained 60 GB/s in Intel Skylake and 30 GB/s in Intel Ivy Bridge. Finally, DRAM achieves around 7.5 GB/s in Intel Ivy Bridge and 14 GB/s in Intel Skylake.

Since the L1 and L2 caches are private to each core, their bandwidth scales linearly with the number of cores, hence the L1 maximum attainable bandwidth is 448 GB/s (4 × 112) in Intel Ivy Bridge and 1024 GB/s (4 × 256) in Intel Skylake. Although L3 is shared between the cores, each core has its own slice in the ring interconnect, so the same effect occurs. In contrast, DRAM bandwidth does not scale linearly with the number of cores, since it is shared by all cores and all of them use the same connection to access it. For multi-thread execution, the DRAM bandwidth in Intel Ivy Bridge is approximately 24.5 GB/s, while Intel Skylake attains 30.7 GB/s. Hence, in both multi-thread and single-thread tests, Intel Skylake always delivers higher throughput than Intel Ivy Bridge for all memory levels. In particular, Intel Skylake offers about 3.37× higher L2 bandwidth than Intel Ivy Bridge in both tests, corresponding to the biggest difference when comparing the two architectures. This may suggest that the L2 cache underwent considerable improvements across micro-architecture generations.
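As an illustration of the load test just described, the C sketch below streams AVX loads over a buffer sized to fit the memory level under test (e.g., half of the 32 KB L1). It is only a simplified stand-in for the assembly benchmarks: the accumulating adds are there solely to keep the compiler from removing the loads.

#include <immintrin.h>
#include <stddef.h>

/* Sketch of the load-bandwidth test (compile with -O2 -mavx). buf must be
   32-byte aligned (e.g., allocated with aligned_alloc(32, bytes)) and sized
   to fit the memory level under test. The real benchmarks issue the vmovapd
   instructions directly in assembly and count them with the events above. */
double ld_kernel(const double *buf, size_t n_doubles, long repeats) {
    __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
    for (long r = 0; r < repeats; r++) {
        for (size_t i = 0; i + 8 <= n_doubles; i += 8) {  /* two loads per step */
            acc0 = _mm256_add_pd(acc0, _mm256_load_pd(buf + i));
            acc1 = _mm256_add_pd(acc1, _mm256_load_pd(buf + i + 4));
        }
    }
    double out[4];
    _mm256_storeu_pd(out, _mm256_add_pd(acc0, acc1));
    return out[0];  /* bytes moved = repeats * n_doubles * sizeof(double) */
}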


(a) LD power consumption for Intel Ivy Bridge 3770K. (b) LD power consumption for Intel Skylake 6700K.

Figure 3.10: Memory subsystem power consumption for LD AVX SIMD DP at nominal frequency.

The corresponding power consumption results are presented in Figures 3.9a and 3.10a (Intel Ivy Bridge) and 3.9b and 3.10b (Intel Skylake). In order to better visualize the single-core power consumption, the curves are placed in separate figures. As can be observed in all figures, core power consumption increases as the data is served by higher levels of the processor memory hierarchy, due to the increasing utilization of the cache levels, achieving its maximum when all caches are being used, i.e., when data is fetched from L3. For the L1 cache, the power consumption is about 11.2 W in Intel Ivy Bridge and 17.5 W in Intel Skylake. Moreover, the L2 and L3 caches in Intel Ivy Bridge consume around 11.7 W and 11.9 W, respectively, while in Intel Skylake their power consumption is approximately 18.2 W (L2 cache) and 18.88 W (L3 cache). Thus, similarly to FP performance, higher bandwidth is accompanied by an increase in power consumption (when comparing the same cache levels). For DRAM, due to the reduction in bandwidth when data is served by this memory level, the cores stall while waiting for data, reducing the power consumption to about 16.2 W in Intel Skylake and 11.3 W in Intel Ivy Bridge. When comparing the two architectures, there is a clear increase in power consumption from Intel Ivy Bridge to Intel Skylake, with a maximum increase of 59% in the L3 cache.

As with the FP units, the power consumption of the memory subsystem also does not scale linearly with the number of cores. For multi-thread tests, Intel Ivy Bridge power consumption in the L1 cache is around 30.8 W and approximately 51 W in Intel Skylake, representing an increase of 65.6% between processors. Moreover, the power consumption of the L2 and L3 caches is about 33.2 W and 33.6 W in Intel Ivy Bridge, while in Intel Skylake their power consumption is approximately 53 W and 54.3 W, respectively. For DRAM, Intel Ivy Bridge 3770K consumes about 29.6 W, while Intel Skylake 6700K DRAM power consumption is approximately 41.6 W.

By combining the bandwidth and power consumption results, it is possible to compare the efficiency of the memory subsystem in both processors. For single-thread benchmarking, the L1 cache in Intel Ivy Bridge 3770K can deliver about 10 GB/J (112 GB/s at 11.2 W), while Intel Skylake 6700K is able to provide approximately 14.63 GB/J (256 GB/s at 17.5 W). Across all memory levels, Intel Skylake 6700K is always more efficient than Intel Ivy Bridge 3770K. In particular, the L3 cache in Intel Skylake 6700K can provide a maximum of 3.178 GB/J (60 GB/s at 18.88 W), while Intel Ivy Bridge 3770K only delivers 2.52 GB/J (30 GB/s at 11.9 W). For multi-thread execution, since the increase in bandwidth is much higher than the increase in power consumption, each processor is able to deliver even higher efficiency than when working with only one core. Moreover, Intel Skylake 6700K continues to provide higher efficiency than Intel Ivy Bridge 3770K. In particular, for the L1 cache, Intel Skylake 6700K is able to deliver up to 20 GB/J (1024 GB/s at 51 W), while Intel Ivy Bridge 3770K only delivers 14.5 GB/J (448 GB/s at 30.8 W).


(a) Memory ratios bandwidth for Intel Ivy Bridge 3770K. (b) Memory ratios bandwidth for Intel Skylake 6700K.

Figure 3.11: Memory ratios bandwidth for AVX SIMD DP at nominal frequency.


(a) Memory ratios power consumption for Intel Ivy Bridge 3770K. (b) Memory ratios power consumption for Intel Skylake 6700K.

Figure 3.12: Memory ratios power consumption for AVX SIMD DP at nominal frequency.

This allows concluding that the memory subsystem in Intel Skylake 6700K is also more efficient than that in Intel Ivy Bridge 3770K. Regarding the uncore power (see Figures 3.9a and 3.9b), it is constant and equal to the uncore power in the FP test while the caches are utilized, increasing when DRAM is utilized. The DRAM power domain in Intel Skylake follows the same tendency. The package power corresponds to the sum of the uncore and core power, having a behavior similar to the core power consumption. Due to this, the uncore, DRAM and package curves are not presented in the following tests, since their behavior is similar to the one exposed here and would not provide additional insights.

Furthermore, as referred in Section 2.1, Intel Ivy Bridge and Intel Skylake contain two ports to dispatch loads and one to dispatch stores. Thus, these micro-architectures support a wide range of load/store ratios, e.g., LD, ST, LD/ST and 2LD/ST, which vary the memory subsystem throughput. The bandwidth results obtained with four cores for different load/store ratios are presented in Figures 3.11a and 3.11b, for Intel Ivy Bridge and Intel Skylake, respectively.

In both architectures, the highest bandwidth in the L1 cache is achieved when two loads and one store are performed together (i.e., the 2LD/ST ratio). In Intel Ivy Bridge, the tests achieved the maximum attainable L1 bandwidth for all ratios, i.e., 448 GB/s for LD and LD/ST, 224 GB/s for ST and 672 GB/s for 2LD/ST. In Intel Skylake, the L1 maximum attainable bandwidth is obtained for LD and LD/ST (1024 GB/s) and for ST (512 GB/s). However, the 2LD/ST ratio only achieved around 1355 GB/s, approximately 86.6% of the maximum theoretical bandwidth (1536 GB/s).
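The 2LD/ST pattern discussed here simply interleaves two loads with one store, so that the two load ports and the store port can all be filled in the same cycle. A minimal C sketch of that access pattern (again only illustrative, since the real tests issue the raw vmovapd sequences in assembly):

#include <immintrin.h>
#include <stddef.h>

/* Sketch of the 2LD/ST access pattern (compile with -O2 -mavx): every step
   issues two 32-byte loads and one 32-byte store, matching the two load
   ports and one store port. src and dst must be 32-byte aligned. */
void ld2st_kernel(const double *src, double *dst, size_t n_doubles) {
    for (size_t i = 0; i + 8 <= n_doubles; i += 8) {
        __m256d a = _mm256_load_pd(src + i);            /* load 1 */
        __m256d b = _mm256_load_pd(src + i + 4);        /* load 2 */
        _mm256_store_pd(dst + i, _mm256_add_pd(a, b));  /* store  */
    }
}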

Since the maximum attainable bandwidths were achieved for all the remaining memory ratios, and on two completely different micro-architectures, this may suggest the existence of an undisclosed bottleneck in the memory dispatch ports of Intel Skylake 6700K.

Furthermore, different memory ratios utilize the data bus connecting the memory levels at different rates. Thus, the bandwidth of the remaining memory levels is also affected by the load/store ratio. In Intel Skylake, all the remaining levels are limited by the LD bandwidth, achieving around 509 GB/s in L2, 241 GB/s in L3 and 30 GB/s in DRAM. On the other hand, the Intel Ivy Bridge L2 (161 GB/s) and L3 (120 GB/s) maximum throughputs correspond to the 2LD/ST bandwidth, while the DRAM upper-bound matches the LD bandwidth, achieving about 24.5 GB/s.

The respective power consumption results are presented in Figures 3.12a and 3.12b, for Intel Ivy Bridge and Intel Skylake, respectively. Here, only the power consumption corresponding to the bandwidth upper-bounds of each memory level is evaluated, in order to simplify the memory subsystem analysis. In the L1 cache, when using the 2LD/ST ratio, Intel Ivy Bridge power consumption is about 34 W, while Intel Skylake 6700K consumes around 63 W, i.e., 85% more than Intel Ivy Bridge. The same analysis from the energy-efficiency point of view reveals that Intel Skylake 6700K delivers 21.5 GB/J (1355 GB/s at 63 W), while Intel Ivy Bridge is able to provide 19.8 GB/J (672 GB/s at 34 W). Thus, for the L1 cache, Intel Skylake 6700K is 8.6% more efficient than Intel Ivy Bridge 3770K. Moreover, in Intel Skylake, when using the LD ratio, the L2, L3 and DRAM power consumptions are approximately 53 W, 54.3 W and 41.6 W, respectively. On Intel Ivy Bridge, the L2 and L3 power consumptions when using the 2LD/ST ratio are about 36.3 W and 39 W, respectively, while DRAM consumes nearly 29 W (LD ratio). Thus, from the energy-efficiency point of view, Intel Skylake 6700K delivers 9.6 GB/J (509 GB/s at 53 W), 4.44 GB/J (241 GB/s at 54.3 W) and 0.72 GB/J (30 GB/s at 41.6 W) for L2, L3 and DRAM, respectively, when using the LD ratio. On the other hand, Intel Ivy Bridge 3770K only delivers 4.43 GB/J (161 GB/s at 36.3 W), 3.08 GB/J (120 GB/s at 39 W) and 0.84 GB/J (24.5 GB/s at 29 W) for L2, L3 (2LD/ST ratio) and DRAM (LD ratio), respectively. Thus, when both processors are using the maximum bandwidth at each memory level, Intel Skylake 6700K is more efficient in the L1, L2 and L3 caches than Intel Ivy Bridge 3770K. However, for DRAM, Intel Ivy Bridge 3770K is 17% more efficient than Intel Skylake 6700K.

It is important to notice that in Intel Skylake the ratios mixing loads and stores, i.e., the LD/ST and 2LD/ST ratios, achieve lower power consumption in the L2 cache than in the L1 cache. Since no information about store behavior across the different memory levels is disclosed in the Intel manuals, it is difficult to explain this behavior. However, in the Intel Skylake micro-architecture, the L2 cache associativity was reduced from 8 ways in previous generations to 4 ways, which may have changed the way loads and stores are handled when executed together.

Similar to the FP units, the memory subsystem throughput also depends on the utilized instruction set extension. The bandwidth results obtained for the 2LD/ST ratio using different instruction set extensions, namely AVX DP, SSE DP and Scalar DP, are presented in Figure 3.13a. As can be observed, 2LD/ST AVX corresponds to the maximum upper-bound across all memory levels. In particular, its L1 cache bandwidth is equal to 1355 GB/s. The SSE and Scalar curves attain much lower bandwidth, since their vector lengths are smaller.
While AVX supports memory transfers of 32 bytes each in Intel Skylake 6700K, SSE and scalar instructions can only handle 16 bytes and 8 bytes, respectively, achieving 553 GB/s (SSE) and 298 GB/s (Scalar). In the remaining cache levels, 2LD/ST AVX attains 417.7 GB/s in L2, 214.3 GB/s in L3 and 18.6 GB/s in DRAM. Moreover, 2LD/ST SSE achieves 320 GB/s in L2, 197 GB/s in L3 and 15.3 GB/s in DRAM. Finally, the 2LD/ST Scalar test only achieves 187.3 GB/s in L2, 119.8 GB/s in L3 and 14.9 GB/s in DRAM.


(a) Memory subsystem performance for different instruction set extensions in Intel Skylake 6700K (4 cores). (b) Memory subsystem power consumption for different instruction set extensions in Intel Skylake 6700K (4 cores).

Figure 3.13: Memory subsystem performance and power consumption for different instruction set extensions in Intel Skylake 6700K (4 cores).

In the corresponding power consumption results, presented in Figure 3.13b, power consumption is completely dominated by the 2LD/ST AVX instructions, achieving about 63 W in L1. Regarding the SSE and scalar curves, in L1 their power consumption is equal, at approximately 47 W. However, in the remaining memory levels, the SSE power consumption easily surpasses the scalar power. The AVX test power consumption is approximately 58.7 W in L2, 61.4 W in L3 and 41.25 W in DRAM. For the SSE test, the power consumption in the remaining memory levels is about 51.1 W in L2, 55.1 W in L3 and 34.9 W in DRAM. Finally, for the scalar benchmark, the values obtained are 49.1 W in L2, 52.3 W in L3 and 32.9 W in DRAM. Thus, for all memory levels, the AVX power consumption is always superior to SSE and Scalar.

From the energy-efficiency point of view, when accessing the L1 cache, 2LD/ST AVX delivers up to 21.5 GB/J (1355 GB/s at 63 W). Since the scalar and SSE power consumptions are equal in L1 but SSE instructions attain higher bandwidth, scalar instructions are less energy-efficient than SSE. In fact, SSE instructions can provide, in the L1 cache, 11.77 GB/J (553 GB/s at 47 W), while Scalar DP only delivers 6.34 GB/J (298 GB/s at 47 W). In the remaining memory levels, 2LD/ST AVX delivers 7.12 GB/J (417.7 GB/s at 58.7 W) in L2, 3.5 GB/J (214.3 GB/s at 61.4 W) in L3 and 0.45 GB/J (18.6 GB/s at 41.25 W) in DRAM. Furthermore, SSE instructions provide 6.26 GB/J (320 GB/s at 51.1 W) in L2, 3.57 GB/J (197 GB/s at 55.1 W) in L3 and 0.44 GB/J (15.3 GB/s at 34.9 W) in DRAM. Finally, Scalar instructions deliver 3.81 GB/J (187.3 GB/s at 49.1 W) in L2, 2.29 GB/J (119.8 GB/s at 52.3 W) in L3 and 0.45 GB/J (14.9 GB/s at 32.9 W) in DRAM. Hence, to fully exploit the memory subsystem of Intel Skylake 6700K from the energy-efficiency point of view, it is necessary to use AVX instructions, since they achieve the maximum possible efficiency.

To also evaluate the memory benchmarks, the Top-Down method is applied to the 2LD/ST ratio test for 4 cores in Intel Skylake. The obtained results are presented in Figure 3.14. The further data is fetched from the core, the more cycles are spent performing memory operations, increasing the memory bound contribution, which reaches around 96% in DRAM. Besides, this also reduces retiring, due to the increasing amount of time it takes to perform the operations. Finally, the core bound metric increases up to the L2 cache, since the number of cycles to fetch data is balanced with the port utilization. However, in L3 and DRAM, the port utilization diminishes (due to the number of cycles required to serve the data), reducing the core bound contribution.


Figure 3.14: Top Down Method for 2LD/ST AVX SIMD DP at nominal frequency.

3.3 Summary

In this chapter, an extensive set of benchmarks is constructed in order to fully characterize the FP units and the memory subsystem in Intel Skylake 6700K and Intel Ivy Bridge 3770K. This kind of analysis is extremely important in order to fully characterize the main upper-bounds of multi-core processors and possible micro-architectural limitations which are not reflected in theoretical datasheets. Besides, modern multi-core CPUs support an extensive set of instructions, load/store ratios and instruction set extensions, e.g., AVX and SSE, which influence the processor throughput.

To accomplish this task, a tool designed to perform an accurate experimental evaluation on real hardware is proposed in the scope of this Thesis. This tool utilizes the hardware performance counters built into the processors, in order to obtain the necessary measurements to evaluate the system capabilities. Besides, the benchmark structure is explained, revealing the options taken during its construction, with the objective of obtaining maximum precision, accuracy and stability in the performed benchmarks.

Next, the results obtained with the tool were evaluated. In this part, the Intel Ivy Bridge 3770K and Intel Skylake 6700K processors are compared for a diversity of instructions and load/store ratios. In general, it was demonstrated that the enhancements in the Intel Skylake micro-architecture allow it to achieve higher levels of performance, power consumption and energy-efficiency than Intel Ivy Bridge.

Finally, the quality of the benchmarks is assessed with the Top-Down analysis. From this, it was possible to conclude that the benchmarks fully exploit the micro-architecture capabilities in both the memory and FP tests, revealing an accurate characterization of the micro-architecture upper-bounds.

4. Proposed insightful models: Construction and experimental validation

As it was previously shown in Chapter 3, the micro-architecture upper-bounds depend on several factors, such as the utilization of different instruction set extensions and instruction types, the ratio of different memory operations (load and store instructions), etc. However, in its integral version, CARM mainly considers the absolute performance, power consumption and energy-efficiency upper-bounds of a given micro-architecture, e.g., by focusing on the AVX ISA extensions for FP DP arithmetics and the maximum attainable bandwidth of the different memory levels by considering the 2LD+ST ratio of memory instructions [3, 5]. As a consequence, depending on the characteristics and demands of a specific application, this model might not provide the most accurate characterization for applications that are intrinsically unable to exploit those micro-architecture maximums, e.g., in cases when the applications do not employ the AVX extensions, use a specific subset of the FP units and/or have a different ratio of load/store operations in their instruction mix. This is a specific gap that the work proposed in this Thesis intends to close.

With this aim, a set of application-centric micro-architecture insightful models for performance, power consumption and energy-efficiency is proposed in this Chapter, herein referred to as CARM extensions. These models aim at improving the insightfulness of the state-of-the-art models by covering a wide range of execution scenarios from both the micro-architecture and application perspectives. The proposed set of CARM extensions is presented for the Intel Skylake 6700K processor, evaluating the impact of different processor capabilities on the construction of the different models and on the characterization of potential execution bottlenecks. To perform this analysis, several CARM instances for different instruction types, ratios of memory operations and instruction set extensions are proposed and constructed for performance, power consumption and energy-efficiency.

Besides, the state-of-the-art CARM, which models the maximum throughput upper-bounds for the memory subsystem and FP units, is constructed for Intel Skylake 6700K. The insights provided by this model will be compared in Chapter 5 with the characterization provided by the proposed extensions, in order to assess the usability of the work performed in this Thesis.

Furthermore, an extensive experimental validation of a set of proposed CARM extensions is performed on real hardware platforms by considering two different generations of Intel client micro-architectures from the Intel Core processor family, i.e., the quad-core Intel Ivy Bridge 3770K and the quad-core Intel Skylake 6700K. This evaluation was conducted by considering a range of different instruction types and mixes for both compute and memory operations. To obtain a highly accurate experimental validation of the proposed models, the testing methodology and tools presented in Chapter 3 were used together with a set of micro-benchmarks specifically designed in the scope of this Thesis. These micro-benchmarks allow attaining the maximum upper-bounds of the system and experimentally reaching the modeled maximums in the memory and compute regions for all considered modeling domains, i.e., performance, power consumption and energy-efficiency.

4.1 Proposed CARM extensions: Model construction

As referred in Chapter 3, several different factors can greatly affect the maximum computation capabilities of a micro-architecture, e.g., the utilization of a specific subset of the arithmetic units. As such, in the compute bound region of the proposed CARM extensions, several horizontal roofs can be included, in order to define the upper-bounds for different instructions, such as ADD, MUL and FMA. The performance CARM extension for AVX SIMD DP FP instructions and the 2LD/ST ratio is presented in Figure 4.1a. As it can be observed, the extended performance CARM contains two horizontal roofs, one for ADD/MUL operations and the other for FMA instructions. In accordance with the micro-architecture benchmarking performed in Chapter 3, the compute bound region upper-bound corresponds to the FMA instructions, with a maximum attainable performance of 256 GFLOPS/s when using all 4 cores of the Intel Skylake 6700K processor. The ADD and MUL instructions form the same horizontal roof, since both arithmetic units attain the same throughput, resulting in a performance of 128 GFLOPS/s.
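All of the proposed instances share the same underlying roofline construction, following the standard CARM formulation from [3]: for a kernel with arithmetic intensity I (flops per byte of memory traffic), the attainable performance is

F_{a}(I) = \min\left\{\, B \cdot I,\; F_{p} \,\right\}

where B is the realistic bandwidth of the memory level serving the data and F_p is the peak throughput of the FP units in use. Each extension proposed in this Chapter only swaps in the values of B and F_p measured in Chapter 3 for the specific instruction type, load/store ratio and ISA extension, which is what moves the memory slopes and compute roofs in the figures that follow.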

(a) Performance CARM extension: ADD/MUL and FMA. (b) Power CARM extension: ADD/MUL and FMA.

(c) Energy-efficiency CARM extension: ADD/MUL and FMA.

Figure 4.1: Proposed CARM extensions for AVX DP FP instructions for Intel Skylake 6700K (4 Cores, 2LD/ST).

A similar trend can be observed in the respective power consumption CARM extension, presented in Figure 4.1b. As it can be observed, in the deep memory bound region of the model there is no difference between the L1/ADD and L1/FMA curves, since the same cache level is utilized. However, in the compute bound region, different power consumption is attained when using different subsets of the arithmetic units. More precisely, the power consumption in the proposed CARM extension asymptotically decreases towards the power consumption of the respective FP units being used, i.e., to 56 W for ADD/MUL and 60 W for FMA, which corresponds to the power consumption measurements obtained in Chapter 3 when performing the fine-grain micro-architecture evaluation. In order to facilitate the analysis, only the L1 roof for the ADD instruction is presented. However, the same conclusions are also valid for the L2 and L3 caches and DRAM.

(a) Performance CARM extension: Load operations (LD). (b) Performance CARM extension: Store operations (ST).

(c) Power CARM extension: Load operations (LD). (d) Power CARM extension: Store operations (ST).

(e) Energy-efficiency CARM extension: Load (LD). (f) Energy-efficiency CARM extension: Store (ST).

Figure 4.2: Proposed CARM extensions for AVX LD and ST operations for Intel Skylake 6700K (4 Cores).

Finally, in the energy-efficiency CARM extension, presented in Figure 4.1c, substantial similarities can be observed when compared to the performance model. The energy-efficiency CARM also contains one horizontal roof for each instruction type, delimiting the maximum energy-efficiency that is possible to achieve by using a specific arithmetic unit. As in the power consumption and performance CARM extensions, the energy-efficiency upper-bound in the compute bound region is limited by the FMA instructions, with an efficiency of 4.3 GFlops/J, while ADD and MUL instructions allow achieving an energy-efficiency of around 2.29 GFlops/J.

As evidenced in Chapter 3, the realistically attainable bandwidth for the different memory levels varies significantly depending on the number of memory ports being utilized and the type of memory operations. As a result, the memory bound region of the proposed CARM extensions will be affected by the used load/store ratio. In order to show how different memory ratios affect CARM, several CARM extensions are proposed and depicted in Figure 4.2 for load and store operations and for all modeled domains, i.e., performance, power consumption and energy-efficiency. In all figures, the 2LD/ST L1/FMA curve is also plotted, in order to provide a better visual assessment of the differences between the integral version of CARM and the herein proposed CARM extensions when characterizing the micro-architecture upper-bounds in terms of memory bandwidth.
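The energy-efficiency roofs shown in these figures are not modeled independently: assuming they are read as the point-wise ratio of the performance and power models,

\varepsilon(I) = \frac{F_{a}(I)}{P(I)} \quad [\mathrm{GFlops/J}]

then, in the compute bound region, the efficiency tends to F_p/P of the FP units in use (e.g., 256 GFLOPS/s at 60 W, about 4.3 GFlops/J for AVX FMA, matching the value quoted above).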

(a) Performance CARM extension: SSE instructions. (b) Performance CARM extension: Scalar DP instructions.

Figure 4.3: Proposed CARM extensions for 2LD/ST ratio with SSE and Scalar DP instructions for Intel Skylake 6700K (4 Cores).

Regarding the performance CARM extensions presented in Figures 4.2a (load model) and 4.2b (store model), it can be observed that the L1(LD) bandwidth, i.e., the L1 bandwidth when only load operations are performed, is much closer to L1(2LD/ST) than to L1(ST). This corroborates the results obtained with the micro-architecture benchmarking, where the store tests achieved the lowest attainable bandwidth. In addition, when comparing the same memory level, the load CARM memory roofs correspond to higher performance than the store memory roofs.

In the power consumption CARM extensions, presented in Figures 4.2c (load model) and 4.2d (store model), the same behavior does not fully occur. In the memory region, while L1(LD) attains a higher power consumption than L1(ST), in line with the performance model, the L3(ST) curve attains a higher power consumption than L3(LD). This observation can possibly be attributed to the write-back nature of the LLC: upon a cache miss to fetch the data, for store operations there is additional activity required to serve the write-backs to DRAM. Moreover, as it can be observed in the proposed CARM extension for load operations only, the power consumption of the cache levels when serving only loads is lower than the one obtained for 2LD/ST, while L3(ST) almost matches the power consumption of 2LD/ST. This behavior may suggest a bigger impact of store instructions on the overall power consumption, even for different memory operation mixes.

As expected, the proposed energy-efficiency CARM extensions are similar to the performance extensions, as it can be observed in Figures 4.2e (load model) and 4.2f (store model). However, it can be noticed that in terms of energy-efficiency there are no significant differences between L1(2LD/ST) and L1(LD). This result confirms the observations made in the performance and power domains, where for the different memory levels the load operations can provide lower bandwidth coupled with reduced power consumption and, conversely, higher bandwidth comes at the cost of increased power consumption. As such, from the energy-efficiency point of view, both ratios allow extracting the maximum potential of the micro-architecture. In contrast, the store model upper-bounds in the memory region achieve much lower efficiency than the 2LD/ST ratio.

As previously mentioned, the utilized instruction set extension also influences both the CARM memory and compute regions. As it can be observed in Figure 4.3a for the proposed SSE CARM extension, it is not possible to attain the maximum processor upper-bounds when using SSE instructions, when compared to the AVX upper-bounds. This effect is even more visible when using scalar DP instructions. As presented in the respective CARM extension (see Figure 4.3b), the upper-bounds for scalar DP instructions are even lower than for SSE.

Hence, in the SSE and scalar DP CARM extensions, the useful area for application optimization is greatly reduced, not allowing the maximum peak performance of the processor to be achieved. However, by adopting wider instruction set extensions, an application can move across the different CARM extensions. Ultimately, to fully maximize performance, it is necessary to use AVX instructions, as will be experimentally demonstrated in Chapter 5 when applying a different set of techniques to optimize application execution. In this analysis, the power consumption and energy-efficiency CARM extensions are not presented, since the insights are similar to those previously exposed when analyzing the models presented in Figures 4.1 and 4.2.

4.1.1 State-of-the-art CARM construction

As shown in Chapter 3, the results obtained for the memory subsystem throughput demonstrate that the maximum bandwidth of each memory level does not occur for the same load/store ratio. In particular, for Intel Skylake 6700K, the maximum bandwidth between the L1 cache and the core is achieved for the 2LD/ST ratio, while the bandwidth upper-bounds for L2, L3 and DRAM are only attainable when the LD ratio is utilized. This can be seen in Figure 4.4a, where the bandwidth results for the LD and 2LD/ST ratios are presented for Intel Skylake 6700K. Based on these results, it is possible to construct a CARM extension containing the uppermost limits of the micro-architecture, presented in Figure 4.4b. Since it models the maximum limits of the micro-architecture, the roofs correspond to the processor throughput when using AVX SIMD DP instructions. Besides, the memory region of the model mixes the bandwidths of two distinct ratios, i.e., the LD (L2, L3 and DRAM) and 2LD/ST (L1) bandwidths. In the compute bound region, the horizontal roofs match the maximum peak performance of the FP units when using AVX SIMD DP instructions.
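The construction can be summarized as taking, for every memory level, the maximum measured bandwidth over all load/store ratios; the sketch below does exactly that with the 4-core Skylake 6700K values shown in Figure 4.4a.

#include <stdio.h>

/* Sketch of how the state-of-the-art CARM roofs are assembled: for every
   memory level, take the maximum measured bandwidth over the load/store
   ratios. The values below are the 4-core Skylake 6700K measurements shown
   in Figure 4.4a (GB/s); the resulting mix is 2LD/ST for L1, LD elsewhere. */
int main(void) {
    const char *level[] = {"L1", "L2", "L3", "DRAM"};
    double bw_ld[]      = {1024.0, 509.5, 241.2, 30.8};
    double bw_2ldst[]   = {1355.0, 417.7, 214.3, 18.6};
    for (int i = 0; i < 4; i++) {
        double best = bw_ld[i] > bw_2ldst[i] ? bw_ld[i] : bw_2ldst[i];
        printf("%-4s roof: %.1f GB/s (%s)\n", level[i], best,
               bw_ld[i] > bw_2ldst[i] ? "LD" : "2LD/ST");
    }
    return 0;
}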

(a) LD and 2LD/ST AVX SIMD DP bandwidth test. (b) State-of-the-art AVX DP CARM for Intel Skylake 6700K.

Figure 4.4: AVX DP LD and 2LD/ST memory bandwidth evaluation and State-of-the-art CARM for Intel Skylake 6700K (4 Cores).

As it can be concluded, this modeling approach uses the absolute maximums when considering the system upper-bounds for all FP and memory components, thus it can be considered as a state-of-the-art CARM model. In particular, it represents a combination of the integral CARM and the herein proposed LD CARM extension. As such, this extension does not allow modeling the entire range of processor capabilities, thus it may provide a misleading characterization for certain types of applications, which does not allow fully uncovering the main bottlenecks that limit application performance, as will be shown in Chapter 5.

4.2 Experimental validation of proposed CARM extensions

In order to experimentally validate a set of the proposed CARM extensions, two different generations of Intel client micro-architectures from the Intel Core processor family were considered, namely the quad-core Intel Ivy Bridge 3770K and the quad-core Intel Skylake 6700K. To attain an accurate experimental validation of the proposed models, the testing methodology and tools presented in Chapter 3 were coupled with a set of specifically designed micro-benchmarks. To measure the number of performed AVX SIMD DP instructions, the counters already introduced for the FP units and memory subsystem benchmarking were configured (see Chapter 3), while the RAPL facilities were relied upon to obtain the energy consumption of the different parts of the processor chip.

In contrast to the memory subsystem and FP unit benchmark tests from Chapter 3, where the number of performed memory transfers or FP instructions was increased separately, the CARM validation requires benchmarks that combine these operations in order to recreate different AIs. However, since the AI corresponds to the number of flops over the total number of bytes transferred, it is not possible to sweep the AI efficiently by increasing both the number of bytes (memory transfers) and flops (FP instructions) at each test iteration. Moreover, by changing the number of retired memory instructions, the validation could switch memory levels at some point of the test, which would produce inaccurate results. Hence, to perform a correct validation, the benchmarks must fulfill two conditions: 1) the memory operations must be served by the same memory level throughout the entire test; and 2) the AI has to be increased throughout the test. This is accomplished by keeping the number of memory transfers constant (throughout the entire set of benchmarks), while increasing the number of performed flops, as illustrated in the sketch below.

The structure of the benchmarks developed for the validation of the proposed CARM extensions is presented in Algorithm 5. In order to avoid the use of the LSD and the instruction cache, the presented algorithm follows optimization approaches similar to those applied when designing the computation and memory benchmarks in Chapter 3. However, while Algorithms 2 and 1 (see Chapter 3) consist of only a single inner loop, the CARM validation benchmark structure contains two inner loops. The first loop contains the mix of memory and FP instructions, thus allowing the desired AI to be recreated by overlapping the execution of the required number of FP instructions with the fixed number of memory instructions. On the other hand, the second loop only holds instructions of one type, i.e., memory instructions or FP instructions, which represent the remaining instructions required to construct the desired AI. In particular, when this benchmark is used for the validation of the memory bound regions, the number of FP instructions is lower than the number of memory instructions, thus this loop only contains memory instructions. In contrast, once the number of FP instructions is higher than the number of memory instructions, the second loop is formed only by FP instructions. Hence, in the first loop, the memory and FP instructions are always overlapped, in order to balance the computation and memory transfer times, thus maximizing the probability of reaching the ridge point in the experimental tests. To perform the experimental validation of the proposed CARM extensions, an extensive set of benchmarks following the structure described in Algorithm 5 is created.
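A hedged sketch of how each validation point can be generated under these two conditions (the constants below are illustrative, not the exact values used by the tool): the memory traffic per iteration stays fixed, and the number of FP instructions is derived from the target AI.

#include <stdio.h>

/* Sketch of how each validation point is generated: the memory traffic per
   iteration is fixed (so the test stays within one memory level) and the
   number of FP instructions is chosen to hit the target arithmetic intensity.
   AVX DP: 32 bytes per memory instruction, 8 flops per FMA instruction. */
int main(void) {
    const int mem_inst = 64;                      /* fixed per iteration (illustrative) */
    const double bytes = mem_inst * 32.0;
    double target_ai[] = {0.0625, 0.25, 1.0, 4.0, 16.0};
    for (int i = 0; i < 5; i++) {
        double flops = target_ai[i] * bytes;      /* AI = flops / bytes */
        int fp_inst = (int)(flops / 8.0 + 0.5);   /* 8 flops per AVX FMA */
        printf("AI=%6.3f -> %d MEM_INST, %d FP_INST per iteration\n",
               target_ai[i], mem_inst, fp_inst);
    }
    return 0;
}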
In detail, approximately 1050 tests were executed to obtain the measurements for each presented validation graphic. The experimental results obtained when validating the performance and power consumption CARM extensions on a single core of the Intel Ivy Bridge 3770K processor, using LD operations and the AVX SIMD DP instruction set extension, are presented in Figures 4.5a and 4.5b, respectively.

Algorithm 5 Generic CARM benchmark
for i < time do
    for j < repeat1 do
        MEM_INST
        FP_INST
        MEM_INST
        (...)
        MEM_INST
        FP_INST
        MEM_INST
    end for
    MEM_INST
    FP_INST
    MEM_INST
    (...)
    for k < repeat2 do
        MEM_INST or FP_INST
        MEM_INST or FP_INST
        (...)
        MEM_INST or FP_INST
        MEM_INST or FP_INST
    end for
    MEM_INST or FP_INST
    MEM_INST or FP_INST
    (...)
end for

As it can be observed, the performed tests were able to hit the ridge point for the L1 and L2 caches in both the performance and power consumption models. In fact, an average fitness of 72.2% was obtained for the overall performance validation and, in particular, a fitness of 99.95% was obtained for the L1 cache validation. Since at the ridge point the computation and memory transfer times must be exactly equal, reaching this point requires a very precise balance between these two types of operations. However, since their throughputs are not usually multiples of each other, for some memory levels it is quite hard to achieve this point experimentally. This is the case of the L3 cache, where the experimental tests did not reach this point, although the experimental points are very close to the theoretical curve.

Furthermore, the Intel Skylake performance and power consumption CARM validations, with one core, using the LD ratio and AVX SIMD DP instructions, are presented in Figures 4.6a and 4.6b, respectively. This experimental evaluation also attained the ridge point for the L1 and L2 caches in both models, demonstrating the accuracy of the utilized benchmarks. The L3 ridge point was not hit due to the difficulty in balancing computations and memory transfers. These tests achieved a fitness of 72.77% for the L1 cache validation and 79.7% for the L2 cache.

In order to evaluate the CARM adaptation to different micro-architectural capabilities, the validation for a ratio of two loads and one store is performed on both systems. The Intel Ivy Bridge 3770K performance and power consumption validations are able, once more, to achieve the L1 and L2 cache ridge points, while coming very close to the L3 ridge point. Despite not achieving the L3 roof ridge point, an average fitness of 91.81% is obtained, with a fitness of 99.65% for the L1 cache.

(a) Performance LD AVX SIMD DP CARM validation. (b) Power consumption LD AVX SIMD DP CARM validation.

Figure 4.5: Performance and power consumption LD AVX SIMD DP CARM validations for Intel Ivy Bridge 3770K (1 core).

(a) Performance LD AVX SIMD DP CARM validation. (b) Power consumption LD AVX SIMD DP CARM validation.

Figure 4.6: Performance and power consumption LD AVX SIMD DP CARM validations for Intel Skylake 6700K (1 core).

By performing the same validation test on the Intel Skylake 6700K, the CARM validations presented in Figures 4.8a (performance) and 4.8b (power consumption) are obtained. In this system, for the 2LD/ST ratio, the tests did not achieve the L1 cache ridge point. However, this is mainly due to the possible micro-architectural limitations previously elaborated in Chapter 3. In fact, this demonstrates that CARM is able to characterize existing micro-architectural limitations, which are not taken into account by datasheets or other theoretical tools. Thus, since CARM takes all these phenomena into account, its reliability when characterizing real applications increases. A similar behavior occurs in the power consumption validation, where none of the ridge points is achieved. However, as observed in the memory benchmarks, Intel Skylake power management works with more aggressive mechanisms, which makes it difficult to obtain a completely accurate match between theory and experiments. Besides, while the performance model measures a known number of memory transfers and flops (it is known how many instructions each test executes), the power consumption overlaps contributions from all pipeline components, which can reduce accuracy.

The obtained results demonstrate good tool and benchmarking accuracy when performing the CARM validation, since high fitness values were obtained for Intel Ivy Bridge 3770K, where the L1 fitness surpassed 99%.

(a) Performance 2LD/ST AVX SIMD DP CARM validation for Intel Ivy Bridge 3770K (1 core). (b) Power consumption 2LD/ST AVX SIMD DP CARM validation for Intel Ivy Bridge 3770K (1 core).

Figure 4.7: Performance and power consumption 2LD/ST AVX SIMD DP CARM validations for Intel Ivy Bridge 3770K (1 core).

(a) Performance CARM for Intel Skylake 6700K with 1 core. (b) Power consumption CARM for Intel Skylake 6700K with 1 core.

Figure 4.8: CARM for AVX SIMD DP at nominal frequency.

On the other hand, there is also space for improvement, since the L3 cache validation did not reach the ridge point, although the results were always very close to it. Finally, the Intel Skylake 6700K validation for 2LD/ST demonstrated that CARM reflects possible micro-architectural bottlenecks, differently from purely theoretical models such as the ORM. Besides, the validation results for Intel Skylake 6700K demonstrate the necessity of tuning the tool specifically to this micro-architecture, since the enhancements in this processor relative to Intel Ivy Bridge seem to make it harder to obtain better results.

4.3 Summary

In this chapter, several CARM extensions are proposed in order to take into account different computational and memory capabilities. These extensions can provide a more accurate characterization of real-world applications, by adapting the CARM extensions and the considered processor capabilities to the application specifics. Finally, the state-of-the-art CARM is constructed, by using the uppermost limits of the micro-architecture for the memory and computational roofs obtained in the benchmarking chapter. It is demonstrated that this extension does not adapt to different memory and arithmetic throughputs, thus analyzing real-world applications with this model might be a challenging task, since these applications can use the most varied pipeline components during their execution. Next, the CARM performance and power consumption validations are presented for Intel Skylake 6700K and Ivy Bridge 3770K, for different processor capabilities. The obtained results show a high fit between the experimental
results and the theoretical models, achieving a fitness superior to 99% in some cases.

5. Application characterization and optimization in the proposed insightful models

In order to fully demonstrate the usefulness and insightfulness of the proposed models and CARM extensions, in this chapter an in-depth experimental evaluation and analysis is performed on a real hardware platform (equipped with the quad-core Intel Skylake 6700K processor) and on a set of different real-world applications, by considering all modeling domains, i.e., performance, power consumption and energy-efficiency. Initially, a case study for a mini-application (Toypush) is presented, whose major execution hotspots are deeply analyzed and characterized in the proposed models, in order to uncover the main sources of the execution bottlenecks. By following the guidelines derived from the proposed models, a set of different optimization techniques was applied to the original code of each Toypush hotspot, in order to further improve its performance.

In addition to the Toypush mini-application, a set of real-world applications from the SPEC benchmark suite [47] is also analyzed in the proposed extended CARMs and the herein derived roofline models. This set of novel and redefined general roofline models is investigated herein with the aim of addressing the shortcomings of existing approaches to insightful micro-architecture modeling and of allowing the characterization of a wide range of applications that encapsulate different types of instructions and instruction mixes. In particular, these general roofline models are based on the total number of instructions (not only FP instructions), in order to provide a deeper analysis for applications whose execution is not necessarily dominated by FP operations. As such, these models provide a foundation to derive more general insightful micro-architecture models based on the fundamental roofline modeling principles.

To verify the correctness of the application characterization provided by the proposed models, the Top-Down analysis [1, 2] (see Section 2.1) was also performed and used to correlate the application position in the proposed CARM extensions with the main execution bottlenecks pinpointed by the Top-Down analysis. In addition, to better assess the impact of the proposed CARM extensions and their ability to provide a more accurate application characterization, the information provided by the proposed CARMs is compared with the state-of-the-art CARM implementation presented in Chapter 4. The results of this analysis corroborate the need for application-centric micro-architecture modeling (the research topic specifically investigated in this Thesis) in order to further boost the model insightfulness for a certain set of applications.

5.1 Experimental setup

The experimental evaluation was performed on a computing platform running Linux CentOS 7.2.1511, equipped with a quad-core Intel Skylake i7-6700K processor (operating at the fixed nominal frequency of 4.0 GHz) and 32 GB of DDR4 DRAM. All applications were compiled with the Intel Compiler 17.0.4.196, and during the performed tests Intel Turbo Boost, Hyper-Threading and hardware prefetching were disabled. For the analysis of real-world applications, a set of loops/functions (kernels) of each application was carefully selected according to their impact on the overall application performance, i.e., only the kernels with the highest impact on the total execution time are analyzed, since they correspond to the application hotspots with the biggest potential to increase the overall application performance.

Moreover, special attention is paid to guaranteeing that each application thread is bound to a single core, in order to avoid context switching and diminish its impact on the accuracy of the presented results. Furthermore, to assess the application characteristics and to obtain the measures required to represent them in the proposed CARM extensions (as well as to perform the Top-Down analysis), each application hotspot is manually instrumented with the Performance Application Programming Interface (PAPI) to obtain the measurements of the hardware counters, i.e., the total number of retired FP, load and store operations, RAPL energy measurements, etc.
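As an illustration of this instrumentation step, the following minimal C sketch shows how a single hotspot could be wrapped with PAPI calls. The native event names are assumptions for the Skylake micro-architecture (they may differ across PAPI/libpfm versions), the hotspot is a stand-in loop, and error handling is reduced to asserts.

/* Minimal PAPI instrumentation sketch (illustrative; the event names are
 * assumed Skylake native events and are not guaranteed to match a given
 * PAPI installation). */
#include <assert.h>
#include <stdio.h>
#include <papi.h>

static double a[1024], b[1024], s;
static void hotspot_kernel(void) {            /* stand-in for a real hotspot */
    for (int i = 0; i < 1024; i++)
        s += a[i] * b[i];
}

int main(void) {
    int es = PAPI_NULL;
    long long v[3];

    assert(PAPI_library_init(PAPI_VER_CURRENT) == PAPI_VER_CURRENT);
    assert(PAPI_create_eventset(&es) == PAPI_OK);
    /* Retired scalar DP FP operations, loads and stores (assumed names). */
    assert(PAPI_add_named_event(es, "FP_ARITH:SCALAR_DOUBLE") == PAPI_OK);
    assert(PAPI_add_named_event(es, "MEM_INST_RETIRED:ALL_LOADS") == PAPI_OK);
    assert(PAPI_add_named_event(es, "MEM_INST_RETIRED:ALL_STORES") == PAPI_OK);

    assert(PAPI_start(es) == PAPI_OK);
    hotspot_kernel();                          /* region of interest */
    assert(PAPI_stop(es, v) == PAPI_OK);

    printf("FP: %lld, LD: %lld, ST: %lld\n", v[0], v[1], v[2]);
    return 0;
}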

5.2 Evaluation methodology

For each considered application hotspot, the distribution on a per instruction type basis is first assessed by performing binary instrumentation (i.e., by observing the assembly code). Special attention is paid to decoupling the contribution of load and store operations (i.e., memory operations), FP instructions and all other instruction types in the total number of retired instructions. This information is necessary to provide a preliminary characterization of the application behavior. For example, kernels mainly constituted by FP instructions are expected to be limited by computations. On the other hand, kernels with a high percentage of memory operations in the total number of retired instructions are expected to be more memory-bound, i.e., there is a very high probability that the kernel will be positioned in the CARM memory-bound region.

By relying on the information provided by the binary instrumentation and the distribution of the different instruction types, further analysis is performed in order to derive the information needed to guide the selection of the proposed CARM extension that best suits the characteristics of the considered application hotspot, i.e., to better correlate the application behavior with the micro-architecture capabilities. To this respect, the instruction mix of each hotspot is analyzed in order to determine: i) the predominant instruction type used in the application hotspot, i.e., scalar operations or SIMD ISA extensions (AVX or SSE); ii) the exact ratio of load/store operations; and iii) the considered precision of the arithmetic operations, i.e., Single Precision (SP) or DP. By combining this information, one of the proposed CARM extensions is selected; an illustrative decision helper is sketched below. For example, for an application hotspot that mainly relies on SSE operations, has a very low load/store ratio and uses SP arithmetics, the CARM variant that corresponds to SSE SP FP computations and store bandwidth will be selected, i.e., the SSE ST SP FP CARM. It is worth emphasizing that the integral version of CARM, as proposed in [3], mainly considers the performance upper-bounds of the micro-architecture for AVX 2LD+ST FP DP operations, while the state-of-the-art CARM seems to focus on modeling the absolute maximums of the architecture for AVX SIMD DP instructions (see Section 4.1.1). Furthermore, different hotspots within a single application may expose different requirements, thus imposing the use of different CARM extensions when characterizing their behavior.
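To make these selection criteria explicit, they can be condensed into a small helper. The sketch below is hypothetical: the thresholds and string labels are illustrative only, not a fixed part of the methodology, and merely mirror the three criteria listed above.

/* Hedged sketch of the CARM-variant selection described above; the
 * thresholds are illustrative placeholders. */
#include <stdio.h>

typedef struct {
    const char *isa;        /* dominant type: "scalar", "SSE" or "AVX"   */
    const char *precision;  /* "SP" or "DP"                              */
    double ld_st_ratio;     /* retired loads divided by retired stores   */
} kernel_mix_t;

/* Map the measured load/store ratio to one of the benchmarked memory
 * access patterns (ST, 2LD/ST or LD). */
static const char *select_mem_pattern(double ld_st_ratio) {
    if (ld_st_ratio < 0.5)  return "ST";      /* store-dominated          */
    if (ld_st_ratio > 10.0) return "LD";      /* essentially loads only   */
    return "2LD/ST";                          /* mixed load/store stream  */
}

int main(void) {
    /* Example mix, e.g., as assessed via binary instrumentation. */
    kernel_mix_t k = { "SSE", "SP", 0.3 };
    printf("Selected variant: %s %s %s FP CARM\n",
           k.isa, select_mem_pattern(k.ld_st_ratio), k.precision);
    return 0;
}

For the example mix above, the helper prints "SSE ST SP FP CARM", matching the store-dominated SP SSE hotspot discussed in the text.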

In order to provide an in-depth analysis, the Top-Down method [1, 2] is applied to each of the kernels, determining the main bottlenecks that limit their performance. As referred in Section 2.1, the Top-Down method decouples different micro-architecture bottlenecks in a hierarchical way, dividing them into four main categories: frontend, bad speculation, retiring and backend (the sum of the core and memory contributions). The frontend and bad speculation bottlenecks are mainly connected with the in-order part of the CPU, reflecting backend starvation and branch misprediction effects, respectively. In contrast, retiring and backend bound are related to the out-of-order part of the pipeline. While retiring evaluates the total number of instructions retired per clock by the processor, the backend metric defines whether the main execution bottleneck belongs to the core (the utilization of execution ports) or to the memory hierarchy components.

By correlating the Top-Down analysis with the application characterization provided by CARM, one can expect that a kernel with a high retiring and/or core bound component should be positioned closer to the roofs corresponding to the utilization of components inside the core engine (i.e., computation roofs and/or bandwidth slopes corresponding to the set of private caches). On the other hand, a kernel with a high memory bound component (especially in the DRAM), together with lower retiring and core bound metrics, is expected to be positioned closer to the CARM roofs corresponding to the DRAM bandwidth. This methodology is also confirmed by the Top-Down analysis provided for the memory and FP benchmarks (see Figures 3.7 and 3.14 in Chapter 3), since core and retirement are the predominant factors for the performed FP benchmarks and when evaluating the bandwidth upper-bounds of the cache levels, while the memory bound contribution increases when the data is fetched further away from the core.

This evaluation methodology is applied and verified when fully characterizing the behavior of an FP mini-application (Toypush), as well as for a set of real applications, in particular a set of applications from the SPEC benchmark suite [47]. Since some of the considered SPEC benchmarks contain a huge diversity of instructions in their instruction mix, as well as a very low amount of FP instructions, the insights derived from CARM may be insufficient to fully characterize them. Hence, in these scenarios, a complementary and novel method for roofline modeling is derived and applied herein, where the applications are plotted in a more generic roofline model oriented to the total processor throughput, which relates the ratio of compute instructions (or the total amount of instructions) and memory instructions with the upper-bound capabilities of the out-of-order CPU engine in terms of the amount of instructions that can be retired in a single clock cycle.
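For reference, the first level of this hierarchy can be derived from a handful of hardware events. The following minimal sketch implements the published Top-Down level-1 formulas [1, 2]; the event mnemonics in the comments and the sample totals are assumptions for illustration.

/* Minimal Top-Down level-1 sketch, following the published formulas [1, 2].
 * Inputs are raw counter totals; the event names in comments are the usual
 * Intel mnemonics and are listed here as assumptions. */
#include <stdio.h>

typedef struct {
    double slots_issued;    /* UOPS_ISSUED.ANY               */
    double slots_retired;   /* UOPS_RETIRED.RETIRE_SLOTS     */
    double fetch_bubbles;   /* IDQ_UOPS_NOT_DELIVERED.CORE   */
    double recovery_cycles; /* INT_MISC.RECOVERY_CYCLES      */
    double clocks;          /* CPU_CLK_UNHALTED.THREAD       */
} td_counters_t;

typedef struct { double frontend, bad_spec, retiring, backend; } td_level1_t;

static td_level1_t topdown_level1(td_counters_t c) {
    double slots = 4.0 * c.clocks;   /* 4 issue slots per cycle */
    td_level1_t t;
    t.frontend = c.fetch_bubbles / slots;
    t.bad_spec = (c.slots_issued - c.slots_retired
                  + 4.0 * c.recovery_cycles) / slots;
    t.retiring = c.slots_retired / slots;
    t.backend  = 1.0 - t.frontend - t.bad_spec - t.retiring;
    return t;
}

int main(void) {
    /* Illustrative counter totals (placeholders, not measured data). */
    td_counters_t c = { 5.0e9, 4.2e9, 0.6e9, 0.05e9, 1.5e9 };
    td_level1_t t = topdown_level1(c);
    printf("FE %.2f  BS %.2f  RET %.2f  BE %.2f\n",
           t.frontend, t.bad_spec, t.retiring, t.backend);
    return 0;
}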

5.3 Case Study: Toypush mini-application

Toypush is a single-threaded Fortran application that performs a particle-in-cell push computation [21]. This application consists of three main hotspots (kernels), namely: b_interpol (kernel 1), e_interpol (kernel 2) and eom_eval (kernel 3). For each of these kernels, the instruction distribution on a per type basis is presented in Figure 5.1a, where the contribution of different instruction types is decoupled according to their share in the total amount of retired instructions. As presented in Figure 5.1a, the Toypush kernels involve a substantial diversity of instruction types, such as FP DP scalar and SSE instructions, load and store operations, as well as integer, control (branch) and other types of instructions. In particular, the "others" category includes all the instructions that do not fit in any of the remaining categories, such as move and conversion operations.

[Figure 5.1: Toypush instruction distribution (a) and Top-Down metrics (b).]

As it can be observed in Figure 5.1a, kernel 1 mostly uses SSE DP instructions and is very balanced between the number of FP instructions (34%) and memory accesses (36%). On the other hand, kernels 2 and 3 are completely dominated by FP DP scalar and FP DP SSE instructions, respectively. Furthermore, the load/store ratios vary among the different kernels. While kernels 2 and 3 have a ratio of 3.7 and 3.6, respectively, the kernel 1 ratio is only 0.27. As such, kernel 1 should be plotted in the SSE DP FP ST CARM, kernel 2 in the scalar DP FP 2LD/ST model (it uses scalar instructions) and kernel 3 in the SSE DP FP 2LD/ST model. As a result, kernels 2 and 3 are expected to be characterized as more compute-bound in their CARM extensions, provided that the AI of these kernels allows hitting the compute-bound roofs. On the other hand, since kernel 1 is very balanced between memory instructions and computations, its position within the CARM highly depends on the accessed memory level, as well as on its AI. For example, if the accesses are mostly performed in the L1 cache, then kernel 1 might be compute bound (or dominated by L1 accesses). However, if the memory transfers are mainly served by DRAM, the performance of kernel 1 will be significantly lower, and its overall performance can thus be limited by this memory level.

Further characteristics of the Toypush kernels can be assessed from their Top-Down analysis, which is presented in Figure 5.1b. As it can be observed, kernel 1 is mainly limited by memory (73.3%), in particular by the stores, which corroborates the conclusions derived from its instruction distribution, i.e., it must be plotted in the ST CARM extension. Since a highly store-bound nature is typically coupled with a low port utilization, the performance of this kernel is expected to be quite low, and probably limited by DRAM. On the other hand, kernels 2 and 3 are mainly limited by retiring (69.5% and 63.6%, respectively) and, since the remaining bottlenecks are marked as core-bound, the performance of those kernels is expected to be mainly limited by computations or by the cache levels closer to the core (e.g., L1 and L2). This Top-Down analysis also corresponds to the observations made about their instruction mixes, which are dominated by FP instructions; thus, this type of instructions is expected to have the biggest impact on the performance of kernels 2 and 3.

To confirm this analysis, Toypush is characterized with the state-of-the-art CARM, whose output is presented in Figure 5.2a. To simplify the visualization, the less important hotspots are hidden and only the three main kernels are presented. Each kernel is represented by a single point within the roofline chart and identified by the respective number. As shown in Figure 5.2a, kernel 1 is completely limited by DRAM, due to the huge amount of stores that are inefficiently performed. However, since it lies below the DRAM bandwidth roof, this may also suggest memory latency as the main bottleneck. In addition, kernels 2 and 3 do not match the previous analysis and the observations derived from their instruction mix and Top-Down analysis. The state-of-the-art CARM places these kernels between the DRAM and L3 roofs, meaning that their performance could be pushed down due to inefficient accesses to the L3 and DRAM levels. However, their instruction mix is mainly dominated by FP computation instructions, thus the memory accesses should not have a major impact on the performance of these kernels. Besides, the Top-Down method characterizes both kernels 2 and 3 as completely bound by retiring, which typically does not correspond to any inefficient access to the off-core memory levels. In fact, kernels 2 and 3 are expected to be either compute-bound or bound by the memory levels closer to the core.

[Figure 5.2: CARM characterization of the main Toypush kernels in Intel Skylake 6700K. (a) State-of-the-art CARM: Toypush main kernels. (b) SSE DP ST CARM: Toypush kernel 1 (proposed). (c) Scalar DP 2LD/ST CARM: Toypush kernel 2 (proposed). (d) SSE DP 2LD/ST CARM: Toypush kernel 3 (proposed).]

To better understand some of these inconsistencies in the state-of-the-art CARM, each of these Toypush kernels is plotted within the respective extended CARM proposed in this Thesis, i.e., the SSE DP ST model for kernel 1, the scalar 2LD/ST model for kernel 2 and the SSE DP 2LD/ST model for kernel 3. For kernel 1 (see Figure 5.2b), the ST model characterizes it as completely DRAM bound, which partially matches the characterization provided in the state-of-the-art CARM. However, in this case, the distance between the kernel 1 point and the DRAM roof is much smaller, suggesting that DRAM latency is not a bottleneck. Regarding kernel 2, whose corresponding CARM characterization is presented in Figure 5.2c, it can be noticed that its characterization changes drastically. In the herein proposed model, the kernel 2 performance is now much closer to the scalar ADD roof, which clearly reveals its expected compute-bound nature, correlating with its instruction mix and Top-Down insights. Finally, the kernel 3 characterization provided by the model proposed in this Thesis drastically differs from the one observed in the state-of-the-art CARM, as depicted in Figure 5.2d. In the proposed model, kernel 3 is bound by computations, as expected from the previously elaborated Top-Down and instruction distribution analyses. As a result, the characterization of kernels 2 and 3 showcases the importance of the CARM extensions proposed in this Thesis and paves the way towards application-centric insightful micro-architecture roofline modeling, in order to provide a more precise application characterization and the derivation of more accurate optimization hints.

5.3.1 CARM-guided application optimization example

As previously referred, the Top-Down analysis indicates a high retiring contribution for the scalar and SSE Toypush kernels (kernels 2 and 3), which are characterized in the proposed models as bound by scalar and SSE computations, respectively. From the presented kernel characterizations in the proposed models (see Figures 5.2c and 5.2d), one can derive valuable optimization hints on how to further improve their performance. For example, for this set of compute-bound kernels, the performance can be increased by relying on the more advanced ISA extensions, i.e., AVX in both kernels. In order to optimize the kernels' performance, the AVX compiler flag ("-mavx") is used, forcing the utilization of AVX instructions. However, while kernels 1 and 3 improved their performance, kernel 2 did not show any change, due to dependencies that were preventing vectorization. Thus, to solve this problem, loop unrolling was performed in kernel 2.

[Figure 5.3: Toypush optimization characterization in Intel Skylake 6700K. (a) Toypush optimization Top-Down metrics. (b) CARM: kernel 1 (b_interpol vs. b_interpol_opt). (c) CARM: kernel 2 (e_interpol vs. e_interpol_opt). (d) CARM: kernel 3 (eom_eval vs. eom_eval_opt).]

Top-Down metrics are also applied to the optimized kernels (see Figure 5.3a), in order to predict a possible change in their behavior. Regarding kernel 1, although it continues to be bound by memory (69.1%), there was a reduction in the contribution of the stores (53.7%). Thus, its performance is expected to land slightly above the DRAM roof. On the other hand, kernels 2 and 3 maintain the same behavior, i.e., bound by retiring and core; thus, even after applying the set of optimizations, these kernels should continue to be limited by computations.

For each optimized kernel, the respective characterization in the proposed extended CARMs is provided in Figures 5.3b (kernel 1), 5.3c (kernel 2) and 5.3d (kernel 3). However, kernel 1 is now plotted in the AVX ST model and kernels 2 and 3 in the AVX 2LD/ST CARM. Since no changes to the algorithm were introduced, all kernels maintain the same load/store ratio, even when applying the more advanced ISA extensions (AVX). In these figures, the performance of the original kernels is also plotted, in order to better assess the achieved performance improvement.

By applying the above-mentioned optimization techniques, all three kernels of the Toypush application achieved a substantial improvement in their performance. As it can be observed, the applied optimizations barely affect the AI of the different kernels, due to the fact that CARM observes the data traffic from the core perspective and considers the true arithmetic intensity, which is a property of the algorithm itself. However, by improving the performance of the different kernels, their characterization points moved along the y-axis towards the regions of higher performance. As a consequence, the potential limiting factors for attaining better performance have also changed for the optimized kernels, when compared to their original versions. For example, kernel 1 (Figure 5.3b) is now bound by the L3 cache, while its unoptimized version was DRAM-bound. Kernel 2 (Figure 5.3c) almost reaches the maximum sustainable performance of the architecture by approaching the DP Vector FMA roof, while it was bound by the Scalar ADD peak performance before applying any optimizations. The optimized version of kernel 3 (Figure 5.3d) resides very near the DP Vector ADD roof. It is worth noting that, even for the optimized Toypush kernels, the performance characterization in the proposed set of CARMs is in accordance with their previously elaborated Top-Down analysis.

Table 5.1: Performance and arithmetic intensity of Toypush kernels before and after optimization.

                          Before Optimization    After Optimization
  Application             GFLOP/s      AI        GFLOP/s      AI
  Toypush   Kernel 1        0.89      0.12         2.64      0.11
            Kernel 2        6.27      0.27        40.33      0.29
            Kernel 3       11.64      0.33        27.74      0.38

Table 5.1 summarizes the experimentally obtained performance improvements on a per kernel basis, before and after applying the kernel optimizations. As previously referred, no significant changes can be spotted in the AI of the different kernels when comparing their optimized and non-optimized versions. However, depending on the kernel type, different performance gains were achieved. In particular, through code vectorization, the performance of kernels 1 and 3 is improved by 3.15 and 2.38 times, respectively. The highest performance gains were achieved for kernel 2, whose performance was improved by 6.43 times when compared to the unoptimized version. For kernels 2 and 3, the performance increase is explained by the use of AVX instructions, which allow retiring a higher number of flops per cycle, while for kernel 1 the main factor is the reduced impact of the store instructions on the overall performance.
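To illustrate the kind of transformation applied to kernel 2, the following minimal C sketch (hypothetical code, not the actual Toypush source) contrasts a reduction loop whose loop-carried dependency can inhibit vectorization with an unrolled variant whose independent accumulators give the compiler parallel chains to map onto AVX lanes.

/* Minimal sketch (hypothetical code, not the actual Toypush kernels):
 * a single-accumulator reduction carries a dependency across iterations,
 * which can prevent vectorization under strict FP semantics. Unrolling
 * with independent partial sums breaks the chain. */
#include <stddef.h>
#include <stdio.h>

/* Loop-carried dependency: each iteration needs the previous sum. */
double dot_scalar(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Unrolled by 4 with independent accumulators: the four partial sums
 * form independent dependency chains that map naturally to SIMD lanes. */
double dot_unrolled(const double *x, const double *y, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)              /* remainder loop */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    printf("%.1f %.1f\n", dot_scalar(x, y, 8), dot_unrolled(x, y, 8));
    return 0;
}

This mirrors the effect described above: once the dependency chain is broken, the compiler is free to generate AVX code for the loop.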

5.4 Characterization of real-world applications in the proposed models

In order to confirm the usability of the proposed CARM extensions when characterizing the behavior of real-world applications, the previously elaborated evaluation methodology (also used for the Toypush analysis) is applied herein to characterize FP benchmarks from the SPEC benchmark suite [47]. In these applications, the main hotspots were identified by relying on the analysis provided by the Intel Advisor. These hotspots were subsequently instrumented with PAPI calls to obtain the relevant measures from the hardware performance counters, as well as the energy consumption via the RAPL facility (see Section 5.2). For each considered application hotspot, the respective instruction distributions and load/store ratios were obtained, which allowed classifying the kernels of these applications into three categories according to the used CARM extension, namely: SP scalar LD CARM, DP scalar 2LD/ST CARM and DP scalar LD CARM. The kernels from bwaves, zeusmp, cactusADM, gemsFDTD, tonto, leslie3D and lbm are plotted in the DP scalar LD model. The main hotspots from milc, gromacs, soplex, gamess and calculix are plotted in the DP scalar 2LD/ST CARM extension. Finally, in the SP scalar LD CARM, only the wrf kernels are plotted.

Similarly to the Toypush evaluation, Top-Down metrics and instruction distribution are analyzed on a per application kernel basis. However, the instruction mix of the considered FP SPEC benchmarks (as in the case of any real application with substantial complexity) contains a huge diversity of instructions, which makes the analysis based on instruction types and mixes a quite challenging process. For this reason, the instruction distribution for each kernel is divided into only two main components: memory instructions (represented as load and store instructions) and non-memory instructions (typically referring to instructions performed inside the core execution engine). This allows decoupling whether the applications are mainly using memory ports or other execution ports in the CPU pipeline, thus allowing their behavior within the roofline chart to be predicted. As before, the Top-Down analysis is used herein as a complementary characterization strategy when verifying the insightfulness of the proposed CARM extensions, which are also compared with the state-of-the-art CARM characterization, in order to identify possible inconsistencies (as in the case of kernels 2 and 3 in Toypush).

Moreover, some of the most representative kernels are also evaluated in the proposed CARM extensions for the power consumption and energy-efficiency domains, thus allowing a full characterization of their execution on modern multi-core CPUs. Finally, due to the high diversity of instructions in their instruction mix, some application kernels are plotted in a throughput-oriented roofline model, which is specifically developed in the scope of this Thesis and relates the ratio between the amount of computations (or the total number of retired instructions) and memory instructions with the maximum capability of the CPU engine in terms of the amount of instructions that it can retire per clock cycle.

5.4.1 Application characterization in the SP Scalar LD CARM extension

As previously referred, the SP Scalar LD CARM extension is only applicable to the wrf application kernels, according to their characteristics. By performing the analysis in the Intel Advisor, three main kernels are detected. From the instruction distribution of each wrf kernel, presented in Figure 5.4a, it can be observed that the memory part of the instruction mix is dominated by load instructions in all kernels, with around 35% of the overall contribution. However, as it can be seen, the kernels are mainly constituted by instructions that are not memory related, i.e., that use the dispatch ports 0, 1, 5 and 6 (see Figure 2.1 in Chapter 2).

Furthermore, the main application bottlenecks can be derived from the corresponding Top-Down analysis (see Figure 5.4b). For all kernels, the main bottleneck is retiring, which almost completely dominates the Top-Down analysis. In kernels 1 and 2, the retiring contribution surpasses 70%, while it is around 60% for kernel 3. Besides, there is a balance between the core bound and memory bound metrics, which might influence the performance of the kernels, depending on the accessed memory level. However, due to the high retiring, all kernels should be positioned closer to the private cache or computation roofs in the CARM chart, since such high retiring values are only achievable under these conditions. In particular, since kernel 3 has a slightly increased utilization of the L1 cache, its performance is expected to be slightly higher than the one achieved by kernels 1 and 2.

[Figure 5.4: Instruction distribution (a) and Top-Down metrics (b) for the SP Scalar LD applications (wrf kernels).]

As presented in Figure 5.5a, the characterization provided by the state-of-the-art CARM completely differs from what is expected from the Top-Down analysis. As it can be observed, all the points are characterized as strictly DRAM bound, although the kernel 3 performance is slightly above kernels 1 and 2, due to the use of the L1 cache. Hence, there is no clear correlation between the Top-Down analysis and the state-of-the-art CARM characterization. In contrast, by plotting the kernels in the herein proposed CARM extension (see Figure 5.5b), all the kernels are limited by private caches, i.e., kernels 1 and 2 by the L2 cache and kernel 3 by the L1 cache, which corresponds to the predicted behavior when analyzing the obtained Top-Down results.

[Figure 5.5: Application characterization within the state-of-the-art CARM (a) and the proposed SP Scalar LD extension (b).]

Furthermore, since these kernels have a large diversity of instructions, a novel throughput-oriented approach to roofline modeling is also investigated in the scope of this Thesis, which provides the means for redefining the roofline model in general, by focusing on fundamental execution principles in the CPU pipeline that are not strictly tied to its upper-bound capabilities for FP arithmetics. As presented in Figure 5.6, in the x-axis, instead of relating the amount of flops and transferred bytes, the model now considers the amount of non-memory instructions (herein referred to as COMPS) over the number of memory instructions (MOPS). In the y-axis, the performance is given by the amount of non-memory instructions executed per clock (COMPS/CLK), i.e., the retirement rate of the instructions that originate from the dispatch ports 0, 1, 5 and 6. Hence, the horizontal lines in the herein proposed model correspond to different amounts of COMPS that can be retired per clock. Since there is a total of four ports to dispatch these instructions, the processor can deliver a maximum of 4 COMPS/CLK, representing the peak performance of the processor. To be precise, in the Intel Skylake micro-architecture, a maximum of 4 instructions can be retired in any given clock cycle regardless of the originating port [1, 45]. The remaining horizontal roofs represent different retirement rates, i.e., 3, 2 and 1 COMPS/CLK, respectively. Moreover, the sloped lines in the new model represent the throughput of each memory level, i.e., the number of memory instructions performed per cycle (MOPS/CLK). Similarly to the original CARM, this throughput varies across different memory levels, decreasing as the data is fetched further away from the core. In particular, the maximum throughput between the core and the L1 cache is attained when the 2LD/ST ratio is used and is equal to 3 MOPS/CLK, while the minimum L1 throughput is obtained when using only ST instructions (1 MOP/CLK). Thus, the memory region of the herein proposed model clearly corresponds to the CARM memory bandwidth region. Besides, since performance is now measured as the amount of COMPS executed per cycle, the model is denominated herein as COMPS CARM.

[Figure 5.6: Application characterization with the SP Scalar LD COMPS CARM.]

It is worth emphasizing that the COMPS CARM presented in Figure 5.6 follows an approach similar to the one used to select CARM extensions, i.e., it is constructed for a specific load/store ratio and instruction type, as required by the wrf kernels (which are represented with "cross" symbols in Figure 5.6). When compared with the CARM extension presented in Figure 5.5b, there are slight differences in the application characterization. While in the CARM extension the kernels are bounded by L2 (kernels 1 and 2) and L1 (kernel 3), in the COMPS CARM all the kernels are limited by the compute roof that corresponds to the retirement rate of 2 COMPS per cycle. However, in contrast to CARM, which analyzes only FP instructions, the COMPS CARM takes into account all retired instructions when identifying the primary bottlenecks. Hence, the COMPS model is a good complement to the FP CARM, since it allows globally visualizing the application bottlenecks (similarly to the Top-Down method), while the FP CARM can more precisely describe the execution bottlenecks and provide further optimization hints. In particular, the COMPS CARM can be used to distinguish which roofs in the FP CARM should be considered as the primary source of execution bottlenecks, especially in situations where the application point is positioned between several bandwidth slopes and horizontal roofs.

As in the FP CARM, it is also possible to directly correlate the Top-Down analysis with the COMPS CARM characterization. As previously referred, all three kernels are bounded by retiring (kernels 1 and 2 at around 70% and kernel 3 at approximately 60%). As it can be observed in Figure 5.6, the COMPS CARM characterizes all kernels as bound by retiring. However, the kernel 3 performance is slightly lower than the performance of kernels 1 and 2, due to its lower retiring contribution when compared to the other kernels.
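As a minimal sketch of how the COMPS CARM roofs can be evaluated (assuming the values quoted above: a ceiling of 4 COMPS/CLK and a per-level memory throughput expressed in MOPS/CLK), the attainable performance at a given instruction ratio follows the usual roofline formulation:

/* Minimal COMPS CARM sketch: the attainable non-memory throughput
 * (COMPS/CLK) at a given instruction ratio x = COMPS/MOPS is limited
 * either by the maximum retirement rate or by the memory throughput
 * (MOPS/CLK) of the level serving the accesses, scaled by x. */
#include <stdio.h>

static double comps_roof(double x, double mem_mops_per_clk,
                         double max_comps_per_clk) {
    double mem_limited = mem_mops_per_clk * x;  /* sloped memory roof */
    return mem_limited < max_comps_per_clk ? mem_limited
                                           : max_comps_per_clk;
}

int main(void) {
    /* Example: L1-served accesses with the 2LD/ST ratio (3 MOPS/CLK)
     * and the Skylake retirement ceiling of 4 COMPS/CLK. */
    for (double x = 0.5; x <= 4.0; x += 0.5)
        printf("COMPS/MOPS = %.1f -> attainable = %.2f COMPS/CLK\n",
               x, comps_roof(x, 3.0, 4.0));
    return 0;
}

For this L1 2LD/ST case, the ridge point lies at COMPS/MOPS = 4/3; to the right of it, kernels become retirement bound.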

5.4.2 Application characterization in the DP Scalar 2LD/ST CARM extension

Regarding the application kernels plotted in the proposed DP Scalar 2LD/ST CARM extension, their instruction distribution (see Figure 5.7a) is mainly dominated by memory instructions, typically surpassing 60%. The only exception is kernel 2 from the gamess application, i.e., gamess2, where 60% of the instructions are not memory related. Hence, the gamess2 kernel is expected to be limited by retiring in the Top-Down characterization, while the remaining kernels may be limited by memory, depending on the accessed memory level.

[Figure 5.7: Instruction distribution (a) and Top-Down metrics (b) for the DP Scalar 2LD/ST applications.]

However, by analyzing the results from the corresponding Top-Down analysis, presented in Figure 5.7b, not all kernels obey the expected behavior. While the Top-Down characterization of the milc, soplex and gamess2 kernels is in accordance with the instruction distribution (milc and soplex are mainly memory bound, while gamess2 is bounded by retiring), gromacs and calculix are limited by retiring and core, despite being dominated by memory instructions. However, the gromacs and calculix kernels have a considerable amount of memory accesses served by the L1 cache, which may diminish the memory bound impact and increase the retiring and core bound contributions. Hence, in the CARM characterization, the milc, soplex and gamess2 kernels should be limited by DRAM or positioned between DRAM and the shared cache (L3), while the remaining hotspots should be characterized as bounded by computational roofs or by the cache levels closer to the core.

[Figure 5.8: Application characterization within the state-of-the-art CARM (a) and the proposed DP Scalar 2LD/ST extension (b).]

As it can be observed in the state-of-the-art CARM chart presented in Figure 5.8a, the milc and soplex kernels are completely bound by DRAM. However, both milc and soplex kernels also contain some contribution from retiring and core bound, which means they should not be completely bound by DRAM, but instead attain a performance slightly higher than the one delimited by the DRAM slope (i.e., they should be positioned between L3 and DRAM). This effect can also be observed in the Top-Down analysis provided for the micro-benchmarks that were performed to evaluate the bandwidth upper-bounds of the micro-architecture for different memory levels (see Figure 3.14 in Chapter 3). In detail, when memory accesses are served by DRAM, the core bound and retiring contributions are essentially zero. Furthermore, since these kernels contain a significant amount of memory transfers, reaching the compute bound roof should be almost impossible. However, the state-of-the-art CARM hints that it might be possible to improve the performance of these kernels until hitting the scalar ADD roof. In contrast, in the CARM extension proposed herein, presented in Figure 5.8b, the milc kernels are placed between the L3 cache and DRAM, as expected according to the previously provided analyses. The soplex kernels are also closer to DRAM, despite not being on top of the DRAM roof, which may suggest memory latency as a potential bottleneck. However, as expected from the instruction distribution of each kernel, the characterization proposed with the DP Scalar 2LD/ST CARM extension hints that the milc and soplex kernels' performance can only be boosted up to the L1 cache roof, and never to the computational roof. Hence, when compared to the state-of-the-art CARM, the characterization in the proposed CARM extension provides more accurate hints, in accordance with the predominant type of instructions in the kernel instruction mix (instruction distribution) and with the Top-Down analysis.

Regarding the gromacs and gamess2 kernels, the state-of-the-art CARM evaluation places their performance between the L3 cache and DRAM. However, according to the Top-Down analysis, these kernels should be limited by private caches or computations, which does not strictly correspond to the state-of-the-art CARM characterization. By plotting these kernels in the proposed CARM extension presented in Figure 5.8b, it can be observed that they are positioned closer to the L3 cache roof. Although there is an improvement when compared to the state-of-the-art CARM evaluation, this characterization still does not fully match the previously elaborated expectations from the Top-Down analysis, i.e., that the kernels should be bounded by private caches or computational roofs. However, in contrast to the other kernels, for the gromacs and gamess2 kernels the Top-Down analysis detects a significant contribution from bad speculation and frontend, respectively. Since these metrics are connected to backend starvation and branch misprediction, their presence signals that the application might achieve a lower performance (than the one previously expected), which also influences their characterization in the proposed CARM extension.

Finally, the calculix kernel is characterized as L3 cache bound in the state-of-the-art CARM. However, according to the Top-Down analysis, its main bottlenecks are related to the core ports and retiring; thus, the performance of this kernel should be limited by private caches or computations. This is confirmed by the kernel characterization in the proposed DP Scalar 2LD/ST CARM, which places this hotspot on top of the L2 cache roof, thus matching the insights provided by the Top-Down analysis. In contrast, the guidelines derived from the state-of-the-art CARM seem to be less accurate, since it characterizes the calculix kernel as L3 bound.

[Figure 5.9: Application characterization in the proposed DP Scalar 2LD/ST CARM extension for power consumption (milc, gromacs and soplex kernels).]

In order to obtain more insights regarding the application behavior, the milc, gromacs and soplex kernels are also plotted in the proposed CARM extension for power consumption, as presented in Figure 5.9.

This analysis aims at uncovering the bottlenecks of these kernels from the perspective of the power consumed during their execution, i.e., to pinpoint the power consumption bottlenecks. Regarding gromacs, its characterization in the proposed CARM extension for the power consumption completely correlates with the performance model, i.e., the kernel is placed between the DRAM and L3 cache curves; since this kernel is characterized between DRAM and the L3 cache, its power consumption can be even lower than the one for the L1 and L2 caches.

It is important to notice that the extended power consumption CARM cannot be interpreted in the same way as the performance CARM, i.e., by looking directly at the roofs above the plotted dot. In order to better depict the interpretation methodology for the power consumption CARM, and how it can be correlated with the exact performance CARM characterization, several different examples of power CARM analysis are provided in Figures 5.10a, 5.10b, 5.10c and 5.10d. These examples include emphasized areas that cover the range of possible kernel positions in the power consumption CARM, i.e., the specific areas between the corresponding power curves where a kernel can be observed when its performance lies between two cache levels. While for a kernel performance between the L1 and L2 caches all the area between the corresponding power consumption curves should be taken into account (see arrows in Figure 5.10a), and similarly for the characterization between the L2 and L3 caches (see Figure 5.10b), special attention should be taken when analyzing hotspots placed between the L3 cache and DRAM (see Figure 5.10c), since there is very little room for inconclusive characterization in both the performance and power consumption domains: the highlighted area between the L3 and DRAM lines also includes the L1 and L2 curves. Hence, even if the power consumption of a kernel is observed between the L1 and L2 lines, its bottleneck should not be strictly attributed to those caches if the performance characterization point is not placed on top of the L1 or L2 cache lines.

[Figure 5.10: Power consumption characterization methodology. (a) Power consumption characterization between L1 and L2 cache. (b) Power consumption characterization between L2 and L3 cache. (c) Power consumption characterization between L3 cache and DRAM. (d) Power consumption characterization below DRAM.]

For kernels with performance close to or below the DRAM line, the main area to analyze in the power consumption model should also be below or near the DRAM power curve, as shown in Figure 5.10d. This can be observed for the milc and soplex kernels, whose power consumption characterization is positioned below the DRAM roof.
Although their power consumption characterization does not completely match the performance one, it should be noted that, in contrast to the performance model, which only focuses on the FP operations, the proposed power consumption CARM extension encapsulates many different effects that occur in the CPU pipeline during the application execution, as well as the power contributions from different components and instruction types (not only FP operations). However, the performance and power characterizations in the proposed models show a significant amount of consistency between them, e.g., there are no situations where a kernel bounded by the L1 in performance is bounded by DRAM in the power consumption.

5.4.3 Application characterization in the DP Scalar LD CARM extension

The remaining set of kernels from the FP SPEC benchmarks, i.e., the main hotspots from bwaves, zeusmp, cactusADM, gemsFDTD, tonto, leslie3D, gamess and lbm, is plotted in the DP Scalar LD CARM extension, according to the kernels' initially assessed characteristics based on the predominant type of FP operations and load/store ratios. Due to the large amount of kernels in these SPEC benchmarks, their analysis is separated into two different groups (batches), namely: i) Group 1 with kernels from bwaves, zeusmp, cactusADM, gemsFDTD and gamess (kernel 1); and ii) Group 2 with kernels from the leslie3D, tonto and lbm benchmarks.

For the kernels belonging to the first group (batch), the previously referred methodology was applied in order to assess the instruction distribution on a per kernel basis, as presented in Figure 5.11a. As it can be observed, all kernels are mainly dominated by non-memory instructions, which surpass 60% for all hotspots. Due to this observation, the Top-Down characterization is expected to identify the retiring component as the main bottleneck for all kernels, although the retiring contribution may be reduced depending on the accessed memory level. In fact, the respective Top-Down analysis presented in Figure 5.11b confirms this assumption, since retiring dominates the Top-Down analysis across all kernels. Furthermore, the memory accesses should also impact the performance of these kernels, since their contribution is superior to 20% in almost every kernel. The only exceptions are the cactusADM and gamess1 kernels, where the memory contribution is approximately 4% and 1%, respectively. Due to the dominance of the retiring component, together with the core bound contribution, all the kernels are expected to be limited by private caches or computational roofs in the CARM plot. For example, gamess1 memory accesses are mainly served by the L1 cache, thus its performance can even be characterized between the L1 and L2 roofs. However, if memory accesses are mainly served by the L3 cache or DRAM, the performance of these kernels can be lower than expected.

[Figure 5.11: Batch 1: Instruction distribution (a) and Top-Down analysis (b) for DP Scalar LD applications.]

By plotting the kernels in the state-of-the-art CARM (see Figure 5.12a), the provided kernel characterization generally does not match the Top-Down analysis, i.e., there is no kernel limited by private caches or by the computation roofs. Instead, the state-of-the-art CARM analysis places all kernels around the DRAM slope or between the L3 and DRAM roofs. However, when the performance of these kernels is plotted in the proposed CARM extension, a shift towards the private cache levels can be observed for all kernels, which closely matches the expectations from the Top-Down analysis.

[Figure 5.12: Batch 1: Application characterization within the state-of-the-art CARM (a) and the proposed DP Scalar LD CARM extension (b).]

In detail, the gamess1 kernel characterization in the proposed CARM extension completely matches the expected behavior, since the proposed CARM places it slightly above the L2 cache slope. Furthermore, the zeusmp kernels 1 and 3 and the cactusADM kernel also completely correlate with the Top-Down insights, since their performance is now limited by private cache levels, in particular the L2 cache. The performance of the cactusADM kernel is also affected by the frontend bound contribution, indicating backend starvation and, consequently, resulting in a slightly lower attainable performance, as captured in the proposed CARM extension. Although the performance of the remaining kernels is closer to the core components in the proposed CARM chart, which demonstrates a more accurate analysis, these kernels are still characterized as bounded by the shared cache, i.e., the L3 cache. For these kernels, the majority of the memory accesses is served by DRAM, which naturally results in a significantly reduced performance due to the lower DRAM bandwidth. However, since memory bound is not the main bottleneck in the Top-Down analysis, the kernels cannot be completely limited by the DRAM roof; hence, their performance is closer to the L3 cache in the proposed model, never reaching the private caches or computational roofs.

In order to provide a different perspective when analyzing the bwaves, zeusmp, gamess1, cactusADM and gemsFDTD kernels, their behavior, characteristics and energy-efficiency are also assessed in the corresponding energy-efficiency CARM extension, as presented in Figure 5.13. This model is constructed by following the same principles used when selecting the performance CARM extension, i.e., the DP Scalar LD CARM, thus providing a coherent comparison between the performance and energy-efficiency evaluations.

[Figure 5.13: Application energy-efficiency characterization with the proposed DP Scalar LD CARM extension.]

As it can be observed in Figure 5.13, the energy-efficiency characterization of the kernels belonging to the first evaluation group does not drastically differ from the one observed in the performance domain. Kernels 1 and 3 from zeusmp maintain the same relative position to the L2 roof in both models, indicating a good balance between performance and power consumption. The position of the remaining kernels is slightly shifted further away from the limiting roof, when compared to the performance characterization, although their main execution bottlenecks remain the same. This suggests that the power consumption of these kernels is higher, resulting in a reduced energy-efficiency; thus, the execution of these kernels should be optimized, especially in what concerns the power consumed during execution. For example, by applying different optimization techniques that allow the respective kernels to achieve a performance closer to the L1 roof, the energy-efficiency of these kernels is also expected to increase significantly (i.e., the characterization point in the energy-efficiency plot will move nearer the L1 roof), since the power consumption for L1 accesses is also lower. Hence, the main objective when optimizing applications from the energy-efficiency point of view is to maximize the efficiency, i.e., to improve the application performance while reducing or maintaining a similar level of power consumption. As it can be observed in Figure 5.13, all analyzed kernels have a very low AI, which prevents them from entering the regions of high energy-efficiency (i.e., the regions where it is possible to achieve 99% of the maximum processor energy-efficiency). In order to achieve this goal, the structure of the algorithms has to be completely redesigned (if possible at all) to provide significant changes in the kernels' AI, i.e., to shift the kernels towards the right side of the energy-efficiency CARM ridge point, where it would be theoretically possible to attain the maximum efficiency sustained by the architecture.
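For reference, the energy-efficiency roofline underlying this characterization can be written as the ratio between the attainable performance and the power consumption. The following is a minimal sketch of this relation (not the exact derivation used in the Thesis), assuming the usual CARM notation where B and F_p denote the memory bandwidth and peak FP performance of the selected model variant and P(AI) the corresponding power consumption:

\varepsilon(AI) \;=\; \frac{F_a(AI)}{P(AI)} \;=\; \frac{\min\{B \cdot AI,\; F_p\}}{P(AI)} \qquad [\mathrm{GFLOPS/J}]

Since F_a(AI) only saturates at F_p to the right of the ridge point (AI = F_p/B), kernels with very low AI remain on the bandwidth-limited slope, which is why the analyzed kernels cannot reach the regions of maximum efficiency without algorithmic changes to their AI.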

The second group of characterized FP SPEC benchmarks contains the kernels from the leslie3D, tonto and lbm applications. As it can be observed in the corresponding instruction distribution, presented in Figure 5.14a, the leslie3D kernels are predominantly constituted by non-memory instructions, thus they are expected to be limited by retiring in the Top-Down analysis. In contrast, the tonto and lbm kernels are quite balanced between memory and non-memory instructions. Hence, predicting their behavior is far from a trivial task, since it depends on the memory level accessed by the memory instructions. If the requests are all served by private caches, their retiring should be high in the Top-Down analysis. On the other hand, if the majority of the memory accesses refer to DRAM, the kernels should be memory bound.

[Figure 5.14: Batch 2: Instruction distribution (a) and Top-Down analysis (b) for DP Scalar LD applications.]

These effects are clearly visible in the Top-Down evaluation, which is presented in Figure 5.14b. As expected from the leslie3D instruction mix, these kernels are mainly limited by retiring (approximately 60% in all kernels), with a significant contribution from the memory bound component. In fact, according to the Top-Down characterization, leslie3D memory accesses are mainly served by DRAM. Hence, despite being predominantly limited by retiring, the performance of these kernels should be placed between the L3 cache and DRAM roofs. In contrast to the leslie3D kernels, the memory bound component of the tonto kernels is much lower, and they are completely bound by retiring and core bound, indicating that their performance might be limited by private caches or computations. Finally, the lbm kernel is predominantly bounded by memory. In particular, the majority of its memory accesses are served by DRAM, thus it should be characterized as DRAM bound in the CARM plot. However, since it is not purely memory bound, the existence of some retiring contribution might suggest that the lbm kernel performance can be positioned slightly above the DRAM roof.

[Figure 5.15: Batch 2: Application characterization within the state-of-the-art CARM (a) and the proposed DP Scalar LD CARM extension (b).]

By plotting these kernels in the state-of-the-art CARM (Figure 5.15a), the insights provided by the Top-Down analysis are, once again, not completely verified. In the state-of-the-art CARM, all kernels are positioned below the L3 cache roof, although the Top-Down metrics indicate the opposite. The only kernel that somewhat matches the Top-Down analysis is lbm, which is completely bounded by DRAM (although the Top-Down retiring component suggests that its performance should surpass DRAM). However, when the same set of kernels is characterized in the proposed CARM extension, as presented in Figure 5.15b, a clear relation between the Top-Down analysis and the DP Scalar LD CARM can be observed. In detail, in the proposed roofline chart, all the kernels behave according to the Top-Down and instruction mix analyses. In particular, the tonto kernels are bounded by private caches, which reflects their high retiring and core bound nature in the Top-Down evaluation. Moreover, the lbm and leslie3D kernels, due to the retiring contribution and DRAM accesses, have their performance placed between the L3 and DRAM roofs. In fact, the leslie3D kernels are now mainly limited by L3 cache accesses, which indicates the previously referred contribution of DRAM accesses in applications with high or moderate retirement rates.

However, the proposed CARM does not fully explain the behavior of the tonto kernels, despite providing a correct and accurate characterization. In particular, kernel 3 achieves lower performance than kernels 1 and 2, although kernel 3 has a retiring component (74%) much higher than kernels 1 and 2 (58% in both). Since this application has a high diversity in its instruction mix, the FP-based CARM may not allow for the full characterization of all execution bottlenecks that may originate from the other instruction types present in the application instruction mix. With the objective of explaining this small inconsistency, an extension similar to the COMPS CARM is also investigated in this Thesis. In contrast to the COMPS CARM (see Section 5.4.1), which relates the ratio of COMPS and MOPS with the amount of COMPS retired per cycle, this newly redefined roofline model considers in the x-axis the total amount of instructions (INST) over the amount of memory instructions, i.e., MOPS, as shown in Figure 5.16. Hence, in the y-axis, the performance is given by the total amount of instructions retired per clock cycle (IPC). Since the processor can only retire 4 instructions per cycle, the horizontal roofs are equal to the ones presented in the COMPS CARM. A similar methodology is applied to the slanted roofs, which consider the throughput of each memory level in terms of MOPS per cycle. Furthermore, the proposed general roofline model inherits all the main CARM characteristics, including the construction for different load/store ratios and instruction types. In addition, since this novel roofline modeling approach expresses the performance via the amount of retired INST per cycle, it is denominated herein as INST CARM.

[Figure 5.16: Application characterization with the DP Scalar LD INST CARM.]

As it can be observed in Figure 5.16, the tonto kernels are characterized as compute bound in this model. In fact, the highest performance is now achieved by kernel 3, as expected from the Top-Down analysis. Kernels 1 and 2 also corroborate the Top-Down evaluation, since their lower performance is directly connected to their lower retiring contribution. This demonstrates that the proposed INST CARM can also complement the FP CARM analysis, similarly to what was presented for the COMPS CARM.
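As a final minimal sketch (the counter totals below are illustrative placeholders, e.g., as obtained with PAPI), the coordinates of an application point in both generalized models follow directly from the measured totals:

/* Minimal sketch: computing application coordinates for the COMPS and
 * INST CARM charts from measured totals (hypothetical counter values). */
#include <stdio.h>

int main(void) {
    /* Illustrative totals, e.g., as obtained from hardware counters. */
    double inst   = 4.0e9;            /* total retired instructions  */
    double mops   = 1.2e9;            /* retired loads + stores      */
    double cycles = 1.5e9;            /* unhalted core cycles        */
    double comps  = inst - mops;      /* non-memory instructions     */

    printf("COMPS CARM point: x = %.2f COMPS/MOPS, y = %.2f COMPS/CLK\n",
           comps / mops, comps / cycles);
    printf("INST  CARM point: x = %.2f INST/MOPS,  y = %.2f IPC\n",
           inst / mops, inst / cycles);
    return 0;
}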

As it can be concluded, all the different flavors of roofline modeling proposed in this Thesis significantly enrich general insightful micro-architecture modeling. When tightly coupled, these models constitute a powerful architecture and application analysis framework, mainly due to their ability to provide a more accurate architecture and application characterization. By relying on the proposed set of easy-to-understand and intuitive models to visually represent the current execution bottlenecks, application developers and computer architects can conduct very fast first-order analyses, apply different optimization techniques by following the guidelines given by the proposed models, or more easily evaluate and decide between different design choices.

5.5 Summary

In this chapter, the usefulness and insightfulness of the proposed CARM extensions are demonstrated, by performing an in-depth experimental evaluation and analysis of a set of different real-world applications on a real hardware system with a quad-core Intel Skylake 6700K processor. This analysis is performed from different modeling domains, such as performance, power consumption and energy-efficiency.

Initially, the main hotspots of the Toypush mini-application are deeply evaluated and characterized in the proposed CARM models, which adapt to the kernel specifics (e.g., load/store ratio and instruction type). This evaluation involves the instruction distribution and Top-Down analyses, which are correlated with the CARM behavior. Besides, a set of optimizations is applied to the Toypush kernels, in order to maximize their performance and to assess the impact of the optimizations on the proposed CARM characterization. These optimizations allowed obtaining performance improvements of up to 6.43× when compared to the non-optimized Toypush version.

Next, a set of real-world applications from the SPEC benchmark suite is also analyzed with the corresponding CARM extensions, for performance, power consumption and energy-efficiency. Due to the specifics of each application, their kernels are divided into three main proposed model categories: SP scalar LD CARM, DP scalar 2LD/ST CARM and DP scalar LD CARM. The kernel performance analysis follows the same methodology utilized in the Toypush case-study. Besides, since the SPEC benchmark applications are also evaluated from the power consumption and energy-efficiency points of view, the methodology to correctly analyze these CARM charts is also presented throughout the kernel evaluation. Finally, the COMPS and INST CARM extensions are introduced, in order to complement the FP CARM insightfulness, by allowing to distinguish which FP CARM roofs should be considered first as the primary execution bottlenecks, in particular, when the kernel performance is placed between two roofs. Moreover, the state-of-the-art CARM performance characterization was compared with the proposed CARM characterization. The obtained results show a clear improvement in kernel characterization when using the proposed CARM extensions, since the insights provided by these novel extensions are in accordance with the expected behavior indicated by the Top-Down analysis.

The results presented throughout this chapter allow concluding that the proposed CARM extensions, together with the performance, power consumption and energy-efficiency methodologies, clearly improve on the current state-of-the-art CARM insightfulness and usability. Thus, their inclusion in Intel Advisor can easily boost the capabilities of this powerful tool, further easing the design and optimization of real applications.

6. Conclusions and Future Works

Due to the increasing micro-architecture and application complexity, optimizing applications from the performance, power consumption and energy-efficiency points of view is not an easy task for software developers. Hence, a characterization methodology capable of providing useful insights about the application behavior according to the micro-architecture capabilities is extremely important when addressing these challenges. To address this issue, this Thesis proposed a set of CARM extensions, aiming at increasing the model insightfulness and usability when characterizing real-world applications. Besides, the work performed in the scope of this Thesis also aims at investigating the CARM portability across processors from different and more recent Intel micro-architectures, in particular, from the Intel Ivy Bridge 3770K to the Intel Skylake 6700K.

To achieve these objectives, a tool specially designed for Intel micro-architectures was developed, which allows characterizing the system upper bounds for several different computational capabilities, such as different instructions, instruction set extensions and load/store ratios. The obtained results allowed creating several CARM instances, each evaluating different processor capabilities. These instances were successfully validated on two different computing platforms, equipped with a quad-core Intel Skylake 6700K and an Intel Ivy Bridge 3770K, which also demonstrated the accuracy of the created benchmarks. Furthermore, modern real-world applications can contain a diverse instruction mix, which increases their complexity and, consequently, makes it difficult to characterize their behavior and to select the best optimization techniques to improve their execution efficiency. Hence, the proposed models aim at bridging this gap, by correlating the application specifics with the different micro-architectural capabilities.

In order to demonstrate the usability of the proposed extensions, a case-study with the Toypush mini-application and the characterization of FP benchmarks from the SPEC suite were performed, by following the proposed performance, power consumption and energy-efficiency analysis methodologies. In the case of Toypush, its main hotspots were deeply analyzed, by taking into account their instruction mix and Top-Down bottlenecks, and, based on this analysis, the correct CARM instance was applied to each kernel. By relying on the insights provided by the proposed CARM extensions, the Toypush kernels were optimized, which allowed achieving performance improvements of up to 6.43× when compared to the non-optimized codes.

Besides, the proposed methodology was compared with the insights derived from the state-of-the-art CARM implementation. This analysis shows the capability of the proposed models to provide a more accurate characterization of the application behavior. Furthermore, it was also possible to fully correlate the results of the in-depth profiling of application execution bottlenecks (using the Top-Down analysis) with the proposed CARM characterization. On the other hand, the state-of-the-art CARM typically gives a more limited set of information when explaining the main execution bottlenecks for the majority of the application kernels. Moreover, two extra CARM extensions were proposed, namely the COMPS CARM and the INST CARM. These extensions redefine the CARM throughput analysis, by expressing the application performance as the number of (compute or total) instructions retired per cycle, relative to the total amount of executed instructions.
The experimental results presented in this Thesis clearly show that the proposed extensions increase the CARM insightfulness and usability when analyzing real-world applications. Moreover, the proposed COMPS CARM and INST CARM can be considered as a first step towards the derivation of more general roofline modeling approaches, since these models consider the entire set of instructions and not only the FP instructions. Finally, since CARM is a model based on experimental measurements, it can be applied to any architecture, as shown in the validation results, where CARM was successfully applied and validated to model the performance, power consumption and energy-efficiency upper bounds of a quad-core Intel Skylake 6700K processor.

6.1 Future Works

As it can be observed from the characterization performed in Chapter 5, there are some aspects that could further improve the CARM insightfulness when analyzing real-world applications. Based on the proposed COMPS and INST CARM extensions, the CARM horizontal roofs could be extended to include mixes between integer and FP operations (one possible composition is sketched at the end of this section). Hence, the majority of the applications, with their diverse mixes of integer and FP instructions, would be characterized more accurately. Besides, since integer instructions allow performing memory accesses of 1 and 2 bytes, the corresponding roofs should also be included in the CARM chart.

Moreover, there are some application kernels that mix different data precision types, i.e., they contain memory accesses and computations on both SP and DP data. Hence, by including in the CARM the roofs that correspond to these mixes, different processor capabilities can be taken into account, thus improving the model usability.

Finally, further research is needed to fully uncover the impact of frontend and bad speculation stalls on the performance in the CARM. Although they may provoke a reduction in the kernel performance, the CARM does not seem able to fully take these bottlenecks into account. Hence, the CARM can be extended to include memory and computational roofs representing the memory and computational throughput when frontend and bad speculation problems occur.
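As a purely illustrative starting point for such mixed horizontal roofs, and under the assumption that the integer and FP instruction streams are retired against separately benchmarked peak throughputs, the combined peak for a given instruction mix could be composed harmonically:

\[ F_{mix}(\phi_{int}) = \left( \frac{\phi_{int}}{F_{int}} + \frac{1-\phi_{int}}{F_{fp}} \right)^{-1}, \]

where \(\phi_{int}\) denotes the fraction of integer instructions in the mix, and \(F_{int}\) and \(F_{fp}\) the individually measured integer and FP peak throughputs. This composition is only a sketch: any contention between the integer and FP execution ports would have to be captured experimentally, in the same measurement-based spirit as the remaining CARM roofs.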
