Eindhoven University of Technology
Faculty of Electrical Engineering
Electronic Systems Group

Modeling and Analysis of a Cache Coherent Interconnect

Thesis Report

submitted in partial fulfillment of the requirements for the degree of Master of Science in Embedded Systems

by: Uri Wiener

Supervisors: Prof. Kees Goossens, Dr. Andreas Hansson

v2.2, August 2, 2012

Abstract

System on Chip (SoC) designs integrate heterogeneous components, increasing in number, performance, and memory-related requirements. Such components share data using distributed shared memory, and rely on caches for improving performance and reducing power consumption. Complex interconnect designs function as the heart of these SoCs, gluing together components using a common interface protocol such as AMBA, with hardware-managed coherency a key feature for improving system performance. Such an interconnect is ARM's CCI-400, the first to implement the AMBA 4 AXI Coherence Extensions (ACE). Capturing the behavior of such a SoC-oriented interconnect and its impact on the performance of a system requires accurate modeling, simulation using adequate workloads, and precise analysis. In this work we model ACE and a CCI-like interconnect using gem5, a Transaction-Level Modeling (TLM) simulation framework. We extend gem5's memory system to support ACE-like transactions. In addition, we improve the temporal behavior of gem5's interconnect model by better representing resource contention. As such, this work intertwines two challenging computer-architecture topics: coherence protocols and on-chip interconnects. Performance analysis is demonstrated on a multi-cluster mobile-device-like compute-subsystem model, using both PARSEC benchmarks and BBench, a web-page-rendering benchmark. We show how the system-level coherency extensions introduced to gem5's memory system provide insights about modeling coherent single- and multi-hop SoC interconnects. Our simulation results demonstrate that the impact of snoop-response latency on overall system performance is highly dependent on the amount of inter-cluster sharing. The changes introduced significantly improve the interconnect's observability, enabling better characterization of a workload's memory requirements and sharing patterns.

You start off as fast as you can, and very slowly increase the pace.

-Alon ‘Krembo’ Sagiv, “Operation Grandma”

Acknowledgments

Chronologically, those who made it happen should be credited first. Had it not been for the match-making efforts of Dr. Benny Akesson and Prof. Kees Goossens, I would have never crossed the English Channel. For that I give them my full appreciation. While Kees bravely became my supervisor, Andreas had the questionable pleasure of directly supervising my work. For their help and cooperation throughout the past year, I take off my hat. While supervising is done from a bird's-eye view, the man who was affected the most by my presence must be Sascha Bischoff. Thank you for all your help, knowledge sharing, and patience, and for translating Dr. Taylor's English when times were tough. Many, many thanks to my friendly colleagues at ARM - Akash, Charlie, Djordje, Hugo, Matt E., Matt H., Radhika, Peter, Rene, Robert and William (alphabetical order, guys, nothing else) - for their companionship and support. My appreciation goes also to the gifted people behind ACE and CCI in Cambridge and Sheffield, who cooperated, spared time, and shared insights whenever I asked. In addition, I'd like to thank all gem5 developers, and especially Ali, Dam and Chris, for providing tips, patches, feedback, and the plenty of time required to make it all work. From a personal perspective, I'd like to thank my family and friends in Israel, the Netherlands and Germany for all their support during my stay in the UK. But above all, my love and endless appreciation go to Tali, who supported me in making this move and bore having the English Channel between us for half a year.

Uri Wiener
Cambridge, United Kingdom
August 2nd, 2012

Contents

List of Figures

List of Tables

List of Abbreviations

1 Introduction
 1.1 Motivation
 1.2 Hardware-based system coherency
 1.3 Goals
  1.3.1 Problem Statement
  1.3.2 Thesis Goals
 1.4 Contribution of this work
 1.5 Thesis organization

2 Related Work
 2.1 System coherence protocols
 2.2 Coherent interconnects
 2.3 Memory system performance

3 Background
 3.1 System coherency and AMBA 4 ACE
  3.1.1 Memory Consistency
  3.1.2 Cache Coherence
  3.1.3 System-level coherency
  3.1.4 AMBA 4 AXI Coherence Extensions
  3.1.5 ACE transactions
 3.2 Interconnect modeled: CCI++
  3.2.1 CoreLink CCI-400
  3.2.2 CCI++: extending ACE and CCI to multi-hop memory systems
 3.3 The Modeling Challenge
 3.4 Simulation and modeling framework: gem5
  3.4.1 gem5
  3.4.2 Simulation flow
  3.4.3 Test-System Example
 3.5 The gem5 memory system
  3.5.1 The basic elements: SimObjects, MemObjects, Ports, Requests and Packets
  3.5.2 Request's lifecycle example
  3.5.3 Intermezzo: Events and the Event Queue
  3.5.4 The model
  3.5.5 A simple request from a master
  3.5.6 A simple response from a slave
  3.5.7 Receiving a snoop request
  3.5.8 Receiving a snoop response
  3.5.9 The bridge model
  3.5.10 Caches and coherency in gem5
  3.5.11 gem5's cache line states
  3.5.12 Main cache scenarios
 3.6 Target platform
  3.6.1 gem5 building block for performance analysis
  3.6.2 Simulator performance
 3.7 Differences between ACE and gem5's memory system
  3.7.1 System-coherency modeling

4 Interconnect model
 4.1 Temporal transaction transport modeling
 4.2 Resource contention modeling
 4.3 ACE transaction modeling
  4.3.1 ReadOnce
  4.3.2 ReadNoSnoop
  4.3.3 WriteNoSnoop
  4.3.4 MakeInvalid
  4.3.5 WriteLineUnique
 4.4 Modeling inaccuracies, optimizations and places for improvement
 4.5 Bus performance observability
 4.6 Conclusions

5 Implementation and Verification Framework
 5.1 Memtest
 5.2 ACE-transactions development
 5.3 ACE transaction verification
  5.3.1 ReadOnce
  5.3.2 ReadNoSnoop
  5.3.3 WriteNoSnoop
  5.3.4 WriteLineUnique
 5.4 Conclusions

6 Performance Analysis
 6.1 Metrics and method
 6.2 Hypotheses
 6.3 Small-scale workloads
  6.3.1 MemTest experiments setup
  6.3.2 MemTest experiments results
 6.4 Large-scale workloads
  6.4.1 PARSEC experiments
  6.4.2 PARSEC results
  6.4.3 Why do the missing results prove our hypothesis
  6.4.4 BBench
  6.4.5 Insight from Bus Stats
 6.5 Results analysis and conclusions
 6.6 Contribution

7 Conclusions and Future Work
 7.1 Technical conclusions
  7.1.1 Interconnect observability
  7.1.2 Modeling SoC platforms
  7.1.3 The impact of delaying snoop responses
  7.1.4 Truly evaluating ACE-like transactions
 7.2 Reflections, main hurdles and difficulties faced
 7.3 Future work

A Contributions to gem5

List of Figures

3.1 Example top-level system using CCI-400. Source: [8]
3.2 big.LITTLE architecture concept. Source: ARM
3.3 big.LITTLE task migration. Source: [12]
3.4 ACE cache line states. Source: [12]
3.5 AXI and ACE channels. Source: [12]
3.6 ACE transaction groups. Source: [12]
3.7 CCI-400 internal architecture. Source: [8]
3.8 gem5 example systems with typical CCI-like interconnects
3.9 gem5 Speed vs. Accuracy Spectrum. Source: [19]
3.10 gem5 simulation inputs, outputs and runtime interfaces
3.11 gem5 2-core test system example
3.12 Master to slave transaction sequence diagram
3.13 Abstracted diagram of the bus
3.14 Target platform block diagram. Source: ARM technical symposium 2011
3.15 gem5 target initial and final target platform overview
3.16 gem5 target initial and final target platform overview

4.1 The bus-model's inheritance diagram
4.2 Added bus outgoing traffic queuing scenarios
4.3 Abstracted diagram of the layered bus

5.1 A 3:2:1 old (unintuitive) MemTest system
5.2 A 3:2:1 new MemTest system
5.3 MemTest 3:1 system
5.4 ReadOnce for an M-state line
5.5 ReadNoSnoop for an E-state line
5.6 MemTest 3 system
5.7 WriteNoSnoop in an ACE context
5.8 WriteNoSnoop for an E-state line in a multi-hop interconnect
5.9 WriteLineUnique for an O-state line in an ACE context
5.10 WriteLineUnique in a multi-hop context

6.1 Small-scale snoop-response test system
6.2 Simulation results for MemTest-based snoop-response latency experiments
6.3 Normalized PARSEC communication patterns
6.4 General simulation and simulator-performance stats for PARSEC benchmarks
6.5 Total DRAM bandwidth for PARSEC benchmarks
6.6 Timing-model CPU statistics for PARSEC benchmarks
6.7 L1 and L2 cache miss rates for PARSEC benchmarks
6.8 Bus stats for PARSEC benchmarks
6.9 PARSEC execution times vs. snoop response penalty
6.10 PARSEC normalized execution times vs. snoop response penalty
6.11 bodytrack-failing system
6.12 PARSEC bodytrack 4 cores 4t simsmall cache occupancy in percent
6.13 BBench Gingerbread and Ice Cream Sandwich snapshots
6.14 BBench transaction distribution demonstration

List of Tables

3.1 Projection of ACE cache line state to MOESI naming. Source: [12]
3.2 Correlation of ACE and gem5 interface semantics
3.3 Main port interface methods
3.4 Projection of gem5 cache line state space to MOESI

4.1 Bus outgoing traffic queuing

List of Abbreviations

ACE    AXI Coherence Extensions
ACP    Accelerator Coherence Port
AMBA   Advanced Microcontroller Bus Architecture
AXI    Advanced eXtensible Interface
BBench Browser-Bench
BFM    Bus-Functional Model (driver/stub)
BLM    Bandwidth and Latency Monitor
CMP    Chip Multi-Processor
CPU    Central Processing Unit
CSS    Compute Sub-System
DMA    Direct Memory Access
DMC    Dynamic Memory Controller
DRAM   Dynamic Random Access Memory
DVM    Distributed Virtual Memory
FSM    Finite State Machine
GIC    Generic Interrupt Controller
GPGPU  General-Purpose computing on a GPU
GPU    Graphics Processing Unit
I/O    Input/Output
ICS    Ice-Cream Sandwich (Android)
IP     Intellectual Property
IPC    Instructions Per Cycle
ISA    Instruction-Set Architecture
kIPS   kilo Instructions Per Second
LLC    Last-Level Cache
LQV    Latency Regulation by Quality Value
LRU    Least-Recently Used
LSQ    Load/Store Queue
MIPS   Million Instructions Per Second
MPSoC  Multi-Processor System on Chip
MSHR   Miss Status Handling Registers
NoC    Network on Chip
O3     Out of Order
OCP    Open Core Protocol
OS     Operating System
OT     Outstanding Transactions
PIO    Programmed I/O
PMU    Performance Monitoring Unit
PoS    Point of Serialization
PQV    Periodic Quality of Service Value Regulation
PV     Programmer's View
QoS    Quality of Service
QVN    Quality of Service Virtual Networks
RT     Real-Time
RTL    Register Transfer Level
SCU    Snoop Control Unit
SoC    System on Chip
SWMR   Single-Writer Multiple-Reader
TLB    Translation Look-aside Buffer
TTM    Time to Market
UFS    Universal Flash Storage
VNC    Virtual Network Computing

Chapter 1

Introduction

1.1 Motivation

The latest consumer-electronics MPSoCs combine many less-powerful CPUs instead of a single very powerful CPU for power efficiency. Such MPSoCs commonly include a GPU for graphics and general-purpose computing, as well as various application-specific accelerators. This mix of hardware components aims at providing compute power in the most efficient way, as power and performance are critical requirements. As the number of cores and IPs increases, so does the importance of communication.

In most cases, these devices are not safety-critical. Chip makers therefore spend huge efforts on improving the average case for better system performance: reducing access latencies, increasing throughput, reducing off-chip accesses, and so forth. Introducing caches is a common approach in such average-case-oriented designs. In a multi-core or multi-cluster SoC, caches must be orchestrated in order to maintain system-wide coherency. This challenge becomes more acute as not just CPUs, but also GPUs make use of caches, and as systems contain a mix of coherent and non-coherent blocks.

System coherency can be established in many ways. Software-based coherency is a conventional and common solution in which any access to shared data must be known up-front, and managed using synchronization mechanisms such as mutexes and semaphores. However, this comes at the expense of significant run-time overheads (e.g. the need to busy-wait for a lock), as well as additional programming challenges. Software-based coherence is infamous for lurking bugs and the additional effort required to avoid them, directly impacting the critical Time-To-Market (TTM). In addition, software-based coherency requires consumers and producers to interact via the main memory. Each DRAM (off-chip) access is orders of magnitude more expensive, both power- and latency-wise, compared to an on-chip data access (e.g. to the SRAMs used by caches). In fact, more than 1000 logical operations could be squeezed in on-chip before offsetting a single DRAM access.

1.2 Hardware-based system coherency

ARM, a semiconductor and software company dominant in the field of mobile phone chips, develops a wide range of IPs such as RISC CPUs, GPUs, VPUs, and high-speed connectivity products. It introduced the Advanced Microcontroller Bus Architecture (AMBA), a de facto standard for on-chip communication, and develops various AMBA IPs. AMBA 4, the most recent version of the AMBA protocol family, contains the AXI Coherence Extensions (ACE) [6], aiming at providing hardware-based system coherency. Among ARM's latest products, the CoreLink CCI-400 Cache Coherent Interconnect [8] is a high-performance, power-efficient interconnect designed to interface between various processors, peripherals, dynamic memory controllers, and so forth. CCI-400 is the first interconnect implementation which supports AMBA 4 ACE, thus enabling hardware-managed coherency for cache-coherent compute subsystems such as ARM Cortex clusters combined with GPUs and coherent I/O devices.

AMBA 4 ACE, and the CCI-400 supporting it, aim at hardware-based system coherency in order to improve on software-based coherency: reducing off-chip accesses, avoiding buggy software, shortening TTM, and so forth. However, the assumptions at the base of ACE are yet to be proven. Investigating the true impact of hardware-based coherency is a challenging problem: it requires developing realistic models of the hardware which supports ACE, as well as suitable workloads: software which actually utilizes this new infrastructure.

Recently ARM presented its big.LITTLE [7] architecture of SoCs, combining heterogeneous sets of processing clusters. In such an architecture, hardware-based coherency is key to allowing fast and inexpensive migration of tasks between clusters. Similarly, GPUs can take advantage of the additional instructions provided to directly

communicate with the CPUs, saving off-chip activity and various synchronization overheads. However, these scenarios have not yet been tested on real systems with dedicated software: there is currently no evidence that ACE is indeed advantageous. This work aims at answering some of these questions, and at providing better tools to understand the true impact of ACE-like protocols in a modern compute subsystem. This research involves the entire spectrum of a system: how decisions at the architectural and micro-architectural levels affect global system performance. Correct modeling, appropriate experiments, and valid conclusions require in-depth understanding across this spectrum.

1.3 Goals

The main objective of this research is to investigate the impact of a hardware-based system-coherent memory system, and in particular of an interconnect which establishes system coherence. This is a challenging task for several reasons:

• Conducting system-level performance analysis in general, and interconnect performance analysis in particular, is both complex and tricky. Firstly, a full-system simulator is required: one that provides sufficiently realistic hardware models, capable of running real-world workloads. At the same time, the simulator has to provide a reasonable level of abstraction, both for enabling high-level models and for performance. The use of RTL simulations is limited in several respects:

  – Implementation availability: a sufficiently mature implementation of the platform under investigation is required, which is rarely the case when exploration has to be done.

  – Simulation speed: RTL simulations have notoriously crippling performance, which severely limits the workloads that can be used.

  – Flexibility and composability: exploring configurations is very limited, and in many cases impractical for large design-space explorations.

  Currently available computer-system modeling infrastructures, such as SimpleScalar [14], usually provide limited capabilities, making them less appropriate for full-system-level evaluation. For example, it is mostly not possible to use them for booting a Linux or Android OS, or to run full-system workloads; mostly their IP library (e.g. GPU, ARM cores, other devices) is not diverse enough for composing realistic complete platforms. Simics [32], for instance, is such a full-system simulation platform, yet it is outdated, unmaintained, and no longer available, as it was commercialized in 1998.

• Availability of appropriate hardware models: there are currently no models for composing a platform which implements hardware-based system coherency. This can in theory be achieved using gem5's Ruby [19] memory model, yet that only remotely resembles CCI.

• Currently there are no workloads (i.e. software and drivers) that utilize or stress I/O coherency. Having the right hardware model is worthless if it is not exercised to demonstrate its full capabilities. There are, however, workloads that stress multi-core/multi-threaded platforms (e.g. PARSEC [17] and SPLASH-2 [44]).

• The underlying assumptions made when designing AMBA 4 ACE seem well-founded, and expectations regarding the potential benefit are high, yet this is a fairly uncharted research territory, with very limited relevant research done in this domain. There are no (published) existing SoC-oriented cache-coherent interconnects available other than CCI. As such, there is no shared knowledge about the impact different design options have on performance and costs.

1.3.1 Problem Statement

As with many other complex IPs, the CCI-400 has to be evaluated for its performance and system-wide implications under various scenarios. Performing rigorous analysis and exploration on an RTL implementation is infeasible. Full-system models for simulation with a real operating system and realistic benchmarks are now within reach. However, they lack adequate modeling of system-coherence support and, more importantly, a cache-coherent interconnect model. Hence, a high-level, functionally correct and performance-reflecting transaction-level model of the cache-coherent interconnect is required.

1.3.2 Thesis Goals

• Capture the behavior of the ARM AMBA 4-family Cache Coherent Interconnect in a transaction-level model.

• Develop a strategy to reason bottom-up about the model's fidelity. This includes defining an evaluation method, performance metrics, generation of stimuli, etc.

• Identify sharing behaviors in full-system simulation using realistic benchmarks. Analyze the interaction between the CPUs, GPUs, memory system and interconnect. Investigate the performance impact of CCI in various systems in terms of total latency per workload, cache statistics (e.g. hit/miss rates, blocking time), interconnect throughput etc.

1.4 Contribution of this work

This work introduces a set of system-coherency transactions to gem5's memory system. The added value of this is manifold:

• It enables others (e.g. gem5 users) to utilize these transactions with various workloads, and to analyze the impact of hardware-based cache coherence in a full-system context.

• The transactions implemented are inspired by AMBA 4 ACE transactions. Yet while ACE is intended to support a single-hop interconnect, gem5's bus model must support being part of a multi-hop interconnect, since gem5 provides flexible connectivity and easy composition of systems of many shapes and kinds. As such, each transaction added to gem5 was designed to comply with the ACE specification when the bus is directly connected to masters and slaves, yet also to handle intricate situations which are not considered by ACE. The extended transactions therefore serve as leading examples, and solve dilemmas raised in multi-hop cache-coherent interconnects. This topic is further discussed in Section 4.3.

• In order to develop, verify and evaluate these extensions in a fine-grained manner, a suitable infrastructure was created, based on gem5's MemTest, as described in Section 5.2. This framework is a convenient vehicle for small-scale experiments (as done in Section 6.3) and for automated functional validation of the memory system, as well as a friendly means for introducing new transaction types to gem5.

In addition, this work was part of a joint effort in making gem5's bus model more realistic in the structural and temporal aspects. This included:

• Improved resource-contention modeling, by splitting the bus into AXI-like layers replacing a unary resource. This is a leap forward from a tri-state-like bus model towards a realistic crossbar model with separate request, response, and snoop channels. Further details are provided in Section 4.2.

• Improved temporal modeling of snoop traffic, from a loosely-timed approach towards approximately-timed snoop responses. This aims at overcoming modeling abstractions made by gem5's memory system at the expense of simulation performance, and is dealt with in Section 4.1.

The interconnect is the heart of its hosting SoC, arbitrating and transporting data in various patterns. As such, being able to collect data from the bus model is essential for gathering insight about the system. In order to improve the bus's observability, statistics of various bus internals were added. These statistics proved their worth during the experiments stage, as an enabling tool for in-depth analysis of the system behavior (e.g. of cross-cluster sharing patterns, or the lack of sharing). The experiments provided in Chapter 6 offer insights about system bottlenecks under various multi-threaded workloads, and above all reasoning about the impact the interconnect has on the system's performance in such circumstances.

Apart from the afore-mentioned contributions, which are directly related to the investigation at hand, indirectly-related contributions to gem5 have been made. These are described in Appendix A.

1.5 Thesis organization

The rest of this report is organized as follows: Chapter 2 discusses recent publications dealing mostly with various approaches for implementing system-level coherency. Chapter 3 contains an in-depth description of system

coherency, ACE, the modeled CCI-like interconnect, and gem5, the simulation framework used for modeling CCI. This includes an overview of the building blocks and semantics of gem5's memory system. This overview is required to understand the modeling challenges and inaccuracies dealt with next. Chapter 4 discusses the main drawbacks of the previous bus model, and which changes had to be implemented for the bus to become a more realistic coherent-interconnect model. Chapter 5 provides lower-level details of the implementation flow, the challenges in extending gem5 to support new transaction types, and a description of the verification process. The infrastructure and methodology developed are described, as these have great value for any future work on gem5's memory system. Chapter 6 describes the experiments performed for gaining insight about the performance of a coherent compute subsystem under various workloads, followed by a discussion of the results. Small-scale experiments were used to clearly demonstrate the impact of varying snoop-response latency on a system's performance. Large-scale experiments provide a better understanding of how dependent this impact is on a workload's sharing patterns. Chapter 7 provides general conclusions from this research, as well as a collection of related challenges that either were beyond the scope of this work, or are now within reach. Appendix A contains a list of contributions the author made to gem5 which were not necessarily related to the problem under investigation, yet were part of a joint open-community effort to improve gem5 as an enabling architectural-exploration platform.

Chapter 2

Related Work

The niche covered by this project intersects the following general domains: on-chip interconnects, cache and I/O coherency, coherent-device sharing patterns, and memory system performance. As SoC-oriented I/O-coherent interconnects are cutting-edge technology, very little published related work exists, either due to the topic's novel nature, or because vendors avoid sharing their proprietary knowledge.

2.1 System coherence protocols

• ARM's latest cache-coherent interconnect, CoreLink CCI-400, is the first to support hardware-managed coherency as defined by AMBA 4 ACE [6]. This is a differentiating feature under investigation, with a promising performance gain wherever interaction between CPU and I/O is a bottleneck. I/O data coherence deals with maintaining coherence of data shared by a CPU and a peripheral device. Typically, CPU and I/O devices interact according to a consumer-producer model, each signaling to the other when data is ready or actions have to be taken, using some signaling mechanism. As consumer devices, such as set-top boxes, contain a growing number of peripherals, enforcing I/O coherence becomes a challenging topic. In [16], four types of solutions for this challenge are presented:

  – Software-managed coherence is typical for traditional embedded architectures. The CPU's caches are explicitly managed by software, for instance by forcing write-backs or performing uncached write operations.

  – Hardware-managed coherence is a typical solution in general-purpose computers, and is in general transparent to software. It may significantly improve average-case performance for some workloads, yet it might cause negative effects (e.g. consuming snoop bandwidth in vain when data is not shared).

  – Scratchpad-based: in this approach, instead of transferring data through the main memory, I/O data passes via private memories which must be completely managed by software. This solution is inappropriate when sharing data across several CPUs, as each CPU can only access its own scratchpad.

  – A hybrid solution that combines all previous solutions: data can flow via more than one channel between devices, making coherence management a tougher task, while aiming at harvesting the fruits of all approaches.

  These four solutions are compared by their potential performance, design complexity (directly affecting the product's time-to-market), silicon costs, and suitable applications. Amongst these solutions, the optimal one depends on application and system characteristics. In this work, the underlying assumption is that hardware-based system coherence can provide sufficient advantages at run time to justify the hardware's complexity. In addition, this work contains a more in-depth quantitative evaluation of the impact of hardware-supported system coherence, providing more insight. The avid reader can find in [16] an extensive explanation of each of the possible solutions for providing system-level coherency.

• The Open Core Protocol (OCP) [40, 30] is an openly-licensed alternative to ARM's AMBA family of protocols, both aiming at system-level integration of IPs. OCP is governed by an international partnership, driven by leading semiconductor companies such as MIPS, Nokia, and Philips. The open-community alternative to ARM's proprietary system-coherence protocol is a system-level cache-coherence extension to OCP, presented in [5]. The proposed extension is a backwards-compatible coherent OCP interface. The authors

discuss the design challenges and implications introduced by this extension. Similarly to ACE's cache-line-state terminology described in Section 3.1.4, the OCP extension supports a range of coherence protocols and schemes; hence both are orthogonal to any specific coherence protocol. The authors describe how a snoopy bus-based scheme can be specified, as well as a directory-based scheme. The correctness of the OCP extension was verified using NuSMV [22], a symbolic model-checking tool. In contrast, in this work correctness is verified by means of simulation-based functional testing.

• Enabling support for ACE transactions, as with any coherence protocol, involves much more than changes to the bus. Hardware support for system coherence requires changes to the entire memory system to support new transactions, including changes to the bus, cache controllers, support for multi-layered hierarchies, etc. Such challenges are discussed in [24], providing both theoretical examples and a real-world multiprocessor system, the SGI Challenge board. While no performance analysis or simulation results are provided in [24], this work also provides a quantitative study, in Chapter 6.

• The concept of hardware cache-coherent I/O is not a recent notion outside the context of mobile devices. In [31], the HP PA-RISC architecture was introduced, as part of the HP 9000 J/K-class program. Its cache-coherency scheme allows the participation of I/O hardware, reducing coherence overheads from the memory system and processors. The concept of hardware-based cache-coherent I/O is presented, including implications on software aiming to utilize it. I/O data transfers are commonly of two kinds: DMA or PIO. DMA transactions take place without the intervention of the host processor. PIO requires the host processor to move data by reading and writing each address. In the presented architecture, PIO transactions are not part of the memory address space, and thus are not a coherency concern. Only DMA transactions are treated as part of the memory-coherency challenge. The heart of the proposed solution is an I/O Adapter which bridges between the processor-memory bus (the Runway bus, which all processors use for snooping) and the HP-HSC I/O bus. A set of limiting assumptions on the transactions made by an I/O master is provided; hence, unlike an ACE or ACE-lite master, it is not just another master. Very limited quantitative data regarding the benefits of this solution is provided. In gem5, as in ACE, any I/O master can participate in the cache-coherence scheme like any other master, with no limitations on the transactions it makes. In addition, allowing ACE-lite masters (see Section 3.1.4 for more details) relieves the master from handling incoming snoops, in cases where this is either redundant or software coherence is used.

2.2 Coherent interconnects

• The tight coupling of interconnect ordering properties to cache-coherence schemes is discussed in [33]. While a shared snooping bus has until now been adequate for multi-core designs, the integration of CPU clusters and more coherent clients on a single chip requires a different type of coherent interconnect. A greedy snooping algorithm is presented, including its applicability to an arbitrary unordered interconnect topology. While a bus enforces total ordering and atomicity, modern interconnects such as the CCI-400 might not. The presented solution requires each cache controller to broadcast coherence requests to all other nodes in the system. A processor first sends its coherence request message to an ordering point to establish the total order. The ordering point then broadcasts the request message to all nodes. The ordering point also blocks on that address, preventing any subsequent coherence requests for the same address from racing with a request already in progress. CCI enforces memory consistency by ordering transactions to the same address using a Point of Serialization (PoS) per master interface. Simply put, this additional hardware stalls any incoming request in case another transaction to the same address is still in flight beyond the PoS. The gem5 bus model, however, does not limit or stall in-flight requests to the same address, thus implementing a weaker consistency model. Similarly to the solution presented, gem5's coherence protocol requires each shareable request to trigger a snoop broadcast to all snoopable masters, as discussed in Chapter 4.

• The system-coherence support provided by ACE enables hardware blocks across the system to snoop each other. Different workloads can demonstrate very different sharing patterns, e.g. how much data is actually transferred on-chip as a result of inter-IP snooping, and how much snooping traffic was in vain. This work provides quantitative insight for various workloads, analyzing a major part of the cost of system coherence. Redundant snoop traffic can be reduced by introducing snoop filters. In [4], a method for filtering snoops based on dynamic detection of sharing patterns is presented. This enables more scalable cache coherence, reducing the impact of the broadcasts at the heart of snoopy protocols. AMBA 4 ACE specifies optional support for external snoop filters in the form of dedicated transactions and possible cache-line state

transitions. For example, a master must broadcast an Evict message for each cache line that it evicts, to notify snoop filters about the change.

• Moving from an atomic bus to split buses decouples the request and response phases of a transaction, enabling new requests to be issued before previous ones have been responded to. This potentially complicates coherence, as conflicting requests can be issued. In [26], two techniques for overcoming this hurdle are proposed: a retry technique and a notification technique. As in [24], the authors provide the SGI Challenge system as a motivating example, in which requests which might create a conflict are delayed. This solution is impractical when a processor cannot snoop data responses intended for other processors and foresee a conflict. The two proposed techniques are evaluated using simulations. Similarly, the gem5 bus model, which was initially a unary resource, has been split into AXI-like layers (as described in Section 4.2) to realistically model an ARM interconnect. The modified gem5 bus model does not limit the transactions in flight, hence conflicting requests can be issued. Coherence is enforced by the cache controllers; by utilizing express snoops and instantaneous upstream requests, the need to implement transitional cache-line states and handle race conditions is avoided. As a side effect, snoop responses occur instantaneously, and are therefore penalized to compensate for the temporal inaccuracy of snoop requests. The method used for improving the temporal modeling of snoops is further discussed in Section 4.1.

• A major difference between ACE and previous non-coherent AMBA protocols is the addition of dedicated AXI channels for snoop requests and responses (namely an address channel, AC; a response channel, CR; and a data channel, CD). A different approach for establishing a coherent interconnect is presented in [38]. The authors observe that control messages are short, while data messages typically carry cache-block-sized payloads. To exploit this duality, the proposed interconnect architecture consists of two asymmetric networks which separate requests from responses, reducing the channel width of the request network. The authors used FLEXUS [43], a cycle-accurate full-system simulator, for simulating a tiled CMP, and ORION 2.0 [29] for power estimations. Their experiments demonstrated reduced power consumption with minor performance degradation.

• Similarly to the approach in [38], a NoC which differentiates short control signals from long data messages is presented in [20]. The differentiation is implemented using traffic prioritization, and its purpose is to reduce cache access delays. However, the proposed architecture is aimed at larger-scale CMPs and therefore utilizes a directory-based coherence protocol. Here again, the multiplexing of coherence traffic on the same channels motivates the added complexity. While priorities are a typical solution in a multi-hop network (and hence in a NoC), the single-hop CCI is aimed at sufficiently small SoCs, where a crossbar interconnect is most adequate.

2.3 Memory system performance

• A significant part of this work aimed at improving the temporal modeling of the gem5 memory system, and the ability to measure its performance. In [35], a similar study was performed, aiming at gaining insight into Intel's Nehalem multiprocessor system. Specifically, that work tried to quantify the performance of the memory system at different locations, as it is composed of multi-level caches, an integrated memory controller, and the Quick Path Interconnect. This work provides similar insights: the transport of transactions through the bus is modeled more accurately, the bus's resource-contention model is improved, and new statistics are introduced to the bus model and its layers, significantly increasing the system's observability.

Chapter 3

Background

3.1 System coherency and AMBA 4 ACE

At the heart of this project lies the CCI-400, a cache-coherent interconnect. Understanding its implications on the system requires familiarity with memory consistency models and cache coherence.

3.1.1 Memory Consistency

Firstly, in order to motivate why these are needed, a simple example is given in Listing 1.

#include <stdbool.h>

#define OLD_VAL 0
#define NEW_VAL 42

// Initial values of the shared global variables:
bool flag = false;
int  data = OLD_VAL;

void thread_1() {
    data = NEW_VAL;
    flag = true;        // signal that data is ready
}

void thread_2() {
    int local_data;
    while (!flag) { }   // busy-wait until thread_1 raises the flag
    local_data = data;  // expected to read NEW_VAL
    (void)local_data;
}

Listing 1: A 2-core shared-memory example

Assume a multi-core system, where thread_1() and thread_2() are simultaneously invoked. One would expect local_data's value to be NEW_VAL at the end of each invocation, yet this is not guaranteed: it depends on the execution order of the load and store instructions issued. In this example, a shared memory is used by two independent processes; only if all instructions are executed sequentially in program order is local_data guaranteed to eventually contain NEW_VAL. Such ordering is one feature of a memory consistency model. The reader can find an extensive introduction to memory consistency and cache coherence in [39], including motivation, formal proofs and high-level concepts, theoretical solutions, and real-world examples. A memory consistency model defines both what behaviors programmers can expect when running programs, and which optimizations system implementors may use for improving performance or reducing costs. The three leading models presented in [39] are:

• Sequential Consistency (SC), the most straightforward model, requiring the "memory order" (a total order on the memory operations of all cores) not to conflict with any "program order" (the order of per-core operations).

• Total Store Order (TSO), a more relaxed model, which enables the use of issued writes, hence write buffers, under some limitations. Whenever explicit ordering is required, "FENCE" instructions are used.

• eXample relaxed Consistency model (XC), which enables out-of-order scheduling under most circumstances, unless ordering is explicitly specified using "FENCE" instructions (a concrete repair of Listing 1 along these lines is sketched below).
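To make this concrete, the following sketch (not part of the thesis; C11 acquire/release atomics are assumed as a stand-in for explicit FENCE instructions) shows one way to make Listing 1 correct under a relaxed model such as TSO or XC:

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag = false;
int data = 0; /* OLD_VAL */

void thread_1(void) {
    data = 42; /* NEW_VAL */
    /* Release store: all earlier writes are made visible before flag is set. */
    atomic_store_explicit(&flag, true, memory_order_release);
}

void thread_2(void) {
    /* Acquire load: once flag reads true, every write that preceded the
       release store in thread_1 is guaranteed to be visible. */
    while (!atomic_load_explicit(&flag, memory_order_acquire)) { }
    int local_data = data; /* now guaranteed to be NEW_VAL */
    (void)local_data;
}

Under SC the original listing already behaves as expected; the acquire/release pair merely restores that guarantee on weaker models.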

Each model raises implementation challenges. A common example is how atomic instructions can be realized. In general, models are evaluated using the four Ps:

• Programmability: how easy it is to understand the model, and to write multi-threaded programs which conform to it.

• Performance: the trade-offs the model allows between performance and costs.

• Portability: the complexity of porting code from one model to another.

• Precision: how precisely the model is defined, ideally in a formal representation and not only in natural language (although the latter is common in practice).

3.1.2 Cache Coherence

The use of caches for improving average-case performance comes at the price of having to manage all copies of a memory location, to avoid making use of obsolete values. The goal of a coherence protocol is to maintain coherence by enforcing the following invariants:

• Single-Writer, Multiple-Reader (SWMR): at any time, for a given memory location, either one core may write (and read) it, or one or more cores may only read it.

• The Data-Value Invariant: updates to the memory location are passed correctly, so that cached copies of the memory location always contain the most recent version.

To implement these invariants, each storage structure (e.g. cache, LLC/memory) is associated with a finite state machine called a coherence controller. The controllers form a distributed system, exchanging messages to ensure these invariants. This interaction is specified by the coherence protocol. Each controller implements an FSM per block, and processes events for each block that might change its current state. For each cache, each block B is characterized by its validity (holding a valid/invalid copy of B), dirtiness (a modified B), exclusivity (permission to modify B), and ownership (responsibility for answering queries regarding B). An example of a primitive protocol is one with states V (valid) and I (invalid), and a transient state IV^D, meaning B is invalid and becomes valid once data D arrives. Most coherence protocols support a subset of the {M(odified), O(wned), E(xclusive), S(hared), I(nvalid)} set of states. While adding more states enables better performance, it comes at the price of design complexity.
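As an illustration only (a hypothetical sketch, not gem5 code), the primitive VI protocol above can be written as one FSM instance per cache block:

/* One FSM instance per cache block. IV_D is the transient state
 * "Invalid, becoming Valid, waiting for data D". */
typedef enum { STATE_I, STATE_IV_D, STATE_V } line_state;
typedef enum { EV_LOAD_MISS, EV_DATA_ARRIVED, EV_INVALIDATE } line_event;

line_state next_state(line_state s, line_event e) {
    switch (s) {
    case STATE_I:
        if (e == EV_LOAD_MISS)    return STATE_IV_D; /* request issued   */
        break;
    case STATE_IV_D:
        if (e == EV_DATA_ARRIVED) return STATE_V;    /* data filled line */
        break;
    case STATE_V:
        if (e == EV_INVALIDATE)   return STATE_I;    /* copy revoked     */
        break;
    }
    return s; /* any other event leaves the state unchanged */
}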

Coherence protocols are generally divided into two groups:

• Snooping Coherence Protocols: a cache controller initiates a request for a block by broadcasting a request message to all other coherence controllers, and the coherence controllers collectively do the right thing. Such protocols rely on the interconnection network to deliver the broadcast messages in a consistent order to all cores, and hence require a shared logical bus with atomic request handling.

• Directory Coherence Protocols: scalable alternatives based on uni/multicasting queries instead of broadcasting. A cache controller initiates a request for a block by unicasting it to the memory controller that is the home for that block. Each memory controller maintains a directory that holds state about each block in the LLC/memory (a possible entry layout is sketched below). When the owner controller receives the forwarded request, it completes the transaction by sending a data response to the requestor. Hence, although scalable, transactions might take more time to complete.
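For illustration, a directory entry per block might be laid out as follows (an assumed layout, not taken from [39] or any specific design):

#include <stdint.h>

/* One directory entry per block, kept at the block's home memory
 * controller. */
typedef struct {
    uint8_t  state;    /* coherence state of the block at the LLC/memory */
    uint64_t sharers;  /* bitmask: which caches currently hold a copy    */
    uint8_t  owner;    /* controller responsible for supplying the data  */
} dir_entry_t;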

CCI-400, for instance, implements a fully-connected snoop topology, enabling each slave interface to snoop the other slave interfaces (see Section 3.2.1 for details). As with memory consistency models, implementation challenges (such as using non-atomic buses) are abundant. Yet while memory consistency models may complicate a programmer's work, hardware-managed cache coherence should be transparent to the programmer.

3.1.3 System-level coherency

Caches have proven to be an efficient means of improving average-case performance in general-purpose computing. They provide low-latency accesses to local copies of the memory, while eliminating many off-chip accesses. As such, the use of caches has spread to other SoC components such as GPUs and accelerators. While the benefits and costs of caches in CPUs are by now well established, this is not the case for systems with caches in multiple devices. System-level coherency extends the principles used for CPU caches to the entire system's scope: the contents of any memory address may reside in more than a single place in the system, as long as all copies are of the most recent data; this avoids making use of stale copies of that address. At any moment, at most one local copy can be modified. To obtain this privilege (write access), no other cache may hold a copy of that line. Maintaining memory consistency and cache coherence can be implemented either in software or in dedicated hardware. Furthermore, sharing of resources (such as the main memory) requires an interconnect, ranging from a simple bus to a complex network-on-chip. In addition to typical general-purpose computers, smartphones, tablets, and other mobile devices are by now all cache-coherent systems. A top-level block diagram of such a system is provided in Figure 3.1.

[Figure 3.1 shows an example system: two Cortex-A15 clusters on ACE interfaces S4 and S3; a Mali-T604 GPU, a coherent I/O device, and a GIC-400, DMA-330, and LCD controller behind MMU-400s, asynchronous bridges, and a NIC-400 on ACE-Lite-plus-DVM interfaces S2 to S0; all connected to the CoreLink CCI-400, whose ACE-Lite master interfaces M2 to M0 lead to a DMC-400 with DDR3/LPDDR2 DRAM, another DMC with Wide I/O DRAM, and a NIC-400 towards other slaves.]

Figure 3.1: Example top-level system using CCI-400. Source: [8]

The system's main processing units are the two coherent Cortex-A15 clusters and the Mali-T604 GPU, sharing distributed memory (including a DDR3/LPDDR2 and a Wide I/O DRAM). The interconnect of this system is the CCI-400, gluing together various clients using the AMBA 4 ACE protocol. The MMU-400 [10] is a system memory management unit, which controls address translation and access permissions; it serves a single master, such as a GPU. Other key components are the DMC-400 [13] (a dynamic memory controller), the NIC-400 (a network interconnect), and the GIC-400 [9] (a generic interrupt controller). The system's main interconnect, the CCI-400, establishes cross-system consistency and coherence based on the ACE protocol. In this example, slave interfaces S4 and S3 (both connected to Cortex-A15 processors) support the ACE protocol. Full coherency and sharing of data is managed between these processors. The CCI-400 is further described in Section 3.2.1.

Advantages over software-based coherency

Software-based coherency requires explicit orchestration between sharing components, and is based on off-chip shared data structures. This is commonly a complex and error-prone task. In addition, running the software

required for managing coherency consumes processing time, and as such also power. Invalidation of cache lines is a must when passing data between caches, introducing off-chip accesses. In general, as the number of caches and their sizes increase, a more efficient solution is needed. In addition to the conventional advantages of using caches, ARM has recently presented its multi-clustered big.LITTLE [7] architecture, depicted in Figure 3.2.

[Figure 3.2 shows a Cortex-A15 cluster ("big": best performance, demanding tasks) and a Kingfisher cluster ("LITTLE": maximum efficiency, "always on" tasks), each with an SCU and L2 cache, connected via 128-bit AMBA 4 interfaces to the CoreLink CCI-400 and the system memory, with a GIC-400 serving both clusters.]

Figure 3.2: big.LITTLE architecture concept. Source: ARM

This architecture introduces a new scenario where system-level hardware-based coherency has a promising advantage. In this scenario, depicted in Figure 3.3, a task is migrated from the outbound processor to the inbound processor, for instance in order to allow it to run on a slower and more power-efficient processor, or vice versa. System coherence allows seamless transfer of data from the outbound processor's cache directly to the inbound processor's cache on-chip, via the interconnect.

[Figure 3.3 shows the migration flow: the inbound processor powers on, invalidates its caches, enables snooping, and restores state while snooping the outbound processor, before resuming normal operation; the outbound processor, upon a task migration stimulus, saves its state, keeps L2 snooping allowed, then cleans its L2 cache, disables snooping, and powers down.]

Figure 3.3: big.LITTLE task migration. Source: [12]

This scenario assumes that most accesses by the inbound cluster will, due to temporal and spatial locality, be to cache lines currently residing in the outbound processor's cache. Once the thread has been migrated to execute on its new cluster, the outbound processor can be powered off. After a period of time, or depending on cache hit-rate changes, the outbound cache can be flushed and powered down.

3.1.4 AMBA 4 AXI Coherence Extensions

As its abbreviation states, ACE [6] is an extension of AXI for providing hardware-based system-level coherency. ACE aims to keep traffic on-chip as much as possible, allowing multiple up-to-date cached copies of the same address. It aims at supporting various types of caches, and not only the typical processor caches. For instance, an I/O-coherent device might have a write-through cache, hence cannot accept a dirty cache line when issuing a read request.

Table 3.1: Projection of ACE cache line state to MOESI naming. Source: [12]

ACE notation   MOESI abbreviation   MOESI meaning   ACE meaning
Unique Dirty   M                    Modified        Not shared, dirty, must be written back to memory
Shared Dirty   O                    Owned           Shared, dirty, must be written back to memory; only one copy can be in the Shared Dirty state
Unique Clean   E                    Exclusive       Not shared, clean
Shared Clean   S                    Shared          Shared, no need to write back, may be clean or dirty
Invalid        I                    Invalid         Invalid

ACE is designed for effective scaling, supporting snoop-based, directory-based, and hybrid coherency mechanisms (using snoop filters). In fact, ACE is protocol-agnostic, and is based on five cache-line state terms:

• Unique: the cache line resides only in this cache.

• Shared: the cache line may be in another cache. A line held in more than one cache must be in the Shared state, and contain the same data.

• Clean: the cache controller does not have to update the main memory.

• Dirty: the cache controller is responsible for updating the main memory.

• Invalid: the cache line must not be considered valid.

The state space of possible combinations of these indicators is depicted in Figure 3.4.


Figure 3.4: ACE cache line states. Source: [12]
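The mapping in Table 3.1 can be restated compactly; the following sketch (illustrative C, not ACE signal semantics) derives the MOESI name from the three state attributes:

#include <stdbool.h>

/* Derive the MOESI name of Table 3.1 from the three ACE state
 * attributes: valid/invalid, unique/shared, clean/dirty. */
typedef struct { bool valid, unique, dirty; } ace_state;

char moesi_of(ace_state s) {
    if (!s.valid) return 'I';            /* Invalid                    */
    if (s.unique) return s.dirty ? 'M'   /* Unique Dirty  -> Modified  */
                                 : 'E';  /* Unique Clean  -> Exclusive */
    return s.dirty ? 'O'                 /* Shared Dirty  -> Owned     */
                   : 'S';                /* Shared Clean  -> Shared    */
}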

The ACE cache-line states are designed to support components using any protocol of the MOESI family. Devices complying with ACE are not required to support all five cache states. For example, an ARM Cortex-A15 internally implements a MESI protocol. As an example, the mapping of ACE to MOESI is described in Table 3.1. In order to accommodate coherence transactions, AXI was extended with three new channels, as depicted in Figure 3.5:

• AC channel - coherent address channel: ACADDR, used for sending the address of a snoop request to a cached master, accompanied by control signals.

• CR channel - coherent response channel: CRRESP, used by the master to respond to each snoop address request; a narrow 5-bit response indicating data transfer, dirty data, and sharing.

• CD channel - coherent data channel: CDDATA, used by the master to provide data in response to a snoop. Optional for write-through caches.

The ACE specification defines two types of ACE master interfaces:

• Full-ACE, which contains all ACE channels, as depicted in Figure 3.5. Hence a full-ACE master can issue snoop requests and can be snooped by the interconnect. An example of such a master is an ARM Cortex-A15 cluster.


Figure 3.5: AXI and ACE channels. Source: [12]

• ACE-lite, which does not include the AC, CR, and CD channels, yet has the additional signals on the existing AXI channels. This enables it to issue snoop requests, but it cannot be snooped itself. An example of such a master is a GPU or a coherent I/O device.

In ACE, unlike AXI, snoop requests must be responded to in order, as they do not have an ID. The interconnect controls the progress of all transactions, and may either issue snoops to masters in parallel or serialize them. Accesses to external memory can be issued upon a snoop miss, or speculatively before snoop responses have arrived. However, if a response from the external memory arrives prior to a snoop response, it is discarded, and the fetch might be re-issued if necessary. Such a scenario is possible, for instance, if one of the caches being snooped is in a low-power state and is therefore slow to respond.
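The speculative-fetch rule just described can be sketched as follows (a hypothetical illustration, not CCI or gem5 code):

#include <stdbool.h>

typedef struct {
    int  pending_snoops;        /* snoop responses still outstanding     */
    bool speculative_discarded; /* memory data arrived before the snoops */
} txn_t;

/* A memory response is only usable once every snoop has answered,
 * since a snooped cache may still return newer (dirty) data. */
bool accept_memory_response(txn_t *t) {
    if (t->pending_snoops > 0) {
        t->speculative_discarded = true; /* drop the data for now */
        return false;
    }
    return true; /* no snoops outstanding: safe to use the data */
}

/* Called on each snoop response that carries no data (a snoop miss).
 * Returns true when the discarded speculative fetch must be re-issued. */
bool must_refetch(txn_t *t) {
    t->pending_snoops--;
    return t->pending_snoops == 0 && t->speculative_discarded;
}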

3.1.5 ACE transactions

At the heart of the ACE protocol are the transaction types, which are aimed at minimizing off-chip traffic at the expense of additional on-chip transactions, added logic, and the derived power consumption. The ACE transactions can be categorized according to several criteria:

• Shareability, which is a generalization of cacheability. A shareable address can reside in more than one cache simultaneously. To clarify the difference between non-shareable and uncacheable addresses: an uncacheable address cannot reside in any cache in the system, whereas a non-shareable address can reside in a cache, yet not in more than one.

• Shareability domain, denoting a group of masters that might share an address range. Each shareable transaction carries its shareability domain, in order to avoid issuing redundant snoops to masters which cannot contain the address.

• Bufferability, or more specifically, whether or not a response has to come from the final destination, and a write request has to be made visible at the final destination. For instance, a normal non-cacheable bufferable read request can be responded to by any master, yet the data should not be stored locally for future use; that is, received data should not cause allocation of the line in the local cache.

• Cache Maintenance transactions, such as invalidation broadcasts. • Write-Back transactions: either requiring the written line to stay allocated or not. In addition, an Evict notification transaction is optional and is only required for supporting snoop filters. • Response data criteria, such as:

– Ownership passing: whether or not the response may be a dirty line. A requesting master might not be able to accept ownership, and thus must be provided with a clean response. – Uniqueness: whether the response may be shared with other caches or must be unique. Figure 3.6 provides an overview of all ACE transactions, including an indication of whether a transaction is supported by ACE-lite masters or only by full-ACE masters. The complete specification of all ACE transactions is available in [6]. The selected transactions that have been modeled are further described in Section 4.3.

Figure 3.6: ACE transaction groups on the read and write channels: the ACE-lite subset (ReadOnce, Read, WriteUnique, WriteLineUnique, Write, and the cache-maintenance transactions CleanShared, CleanInvalid, MakeInvalid), and the full-ACE shareable-read, shareable-write and write-back groups (ReadShared, ReadClean, ReadNotSharedDirty, ReadUnique, MakeUnique, CleanUnique, WriteBack, WriteClean, Evict). Source: [12]

3.2 Interconnect modeled: CCI++

3.2.1 CoreLink CCI-400

"CoreLink" [11] is a family of system IPs developed by ARM, which includes interconnects, memory controllers, system controllers and accompanying tools. It supports ARM's latest AMBA 4 interface specifications, including the AXI Coherency Extensions (ACE) interface protocols. The CoreLink product range includes an SoC-oriented interconnect, the CCI-400 Cache Coherent Interconnect [8], which combines interconnect and coherency functions in a single module. The CCI-400 provides slave interfaces for up to two ACE [12] masters (e.g. Cortex-A15 multicore processors) and three ACE-Lite masters (e.g. Mali-T604 GPUs or coherent I/O devices), and three ACE-Lite master interfaces for connecting slaves (e.g. memory controllers, system peripherals). Additional features of this high-bandwidth, crossbar interconnect are:

• A fully-connected snoop topology: each ACE/ACE-Lite slave interface can snoop the other ACE slave interfaces. This provides higher snoop bandwidth compared to the previous I/O coherency solution provided by ARM, the Accelerator Coherency Port (ACP), which allowed external processors to snoop inside the caches of Cortex-A15 clusters. Moreover, ACP required the entire snooped cluster to be powered up in order to perform snooping from a remote cluster.

• Quality-of-Service (QoS) regulation for shaping traffic profiles, based on latency, throughput, or average outstanding transactions. • A Performance Monitoring Unit (PMU) which consists of an event bus, event counters, a clock cycle counter, etc. It enables counting performance-related events, such as measuring the snoop hit rate for read requests on a particular slave interface.

• A Programmer's View (PV) to control the coherency and interconnect functionality. • Support for speculative fetches: each master interface can issue a fetch in parallel with issuing a snoop request, thus hiding the snoop latency in case of a snoop miss.

• Support of all types of AMBA 4 barrier transactions. These are broadcast from each slave interface to each master interface, ensuring that intermediate transaction source and sink points observe the barrier correctly. This is commonly used to guarantee execution order in less strict memory models, as discussed in Section 3.1.1. • Support of Distributed Virtual Memory (DVM) message transport between masters. DVM transactions enable invalidating other clusters' TLBs. The DVM support is based on a separate physical network, yet is part of the hardware-offloading effort, potentially replacing software-based TLB invalidation. • An independent Point-of-Serialization (PoS) (referred to as a serialization point in [39], or point of coherence) per connected slave. All transactions to any single address in the system have to be serialized; as such, the interconnect arbitrates which request is accepted, and thus orders all requests. Speculative fetches bypass the PoS, yet in case the response from a slave arrives before all snoop responses, it is ignored, as it might contain a stale copy. Figure 3.7 depicts the micro-architecture of the CCI-400. Due to the commercial nature of this product, apart

from the above-mentioned information (originating from its Technical Reference Manual [8]), its internals are mostly undisclosed. As CCI-400 is the first commercial SoC-oriented interconnect with system-level coherency support, it sets itself as a useful vehicle for investigating system-level coherency in a real-world compute-subsystem context. In addition, CCI-400 was planned to be the interconnect of a compute subsystem which is further discussed in Section 3.6, enabling correlation of simulation results with real hardware.

Figure 3.7: CCI-400 internal architecture: five ACE/ACE-Lite slave interfaces (S0-S4) with register slices, QoS latency and outstanding-transaction regulators and burst splitters (ABS), a fully-connected crossbar with a Snoop Router Block (SRB) and Write Unique Block (WUB), three ACE-Lite master interfaces (M0-M2) with register slices, a Performance Monitoring Unit (PMU), a DVM Manager (DVMM), architectural clock gating, and the CCI register block. Source: [8]

3.2.2 CCI++: extending ACE and CCI to multi-hop memory systems

ARM's CCI-400 is a state-of-the-art SoC-oriented interconnect. However, when performing such modeling-based research, one should focus on the most interesting and relevant aspects: • The research is aimed at investigating the impact of hardware-based system-level coherence and does not aim at functional verification of a product. Rather, the product is a vehicle for understanding technology capabilities and trends. The product provides a correlation point for aligning simulations with reality on a small scale, in order to justify large-scale experiments which cannot be performed otherwise. • Creating a cycle-accurate model which fully covers the product's features and temporal behavior has a very high price tag: – Achieving such an accurate level of modeling requires several man-years. – A cycle-accurate model would cripple simulation performance, thus significantly limiting the feasible set of workloads. • Similarly, all hardware models currently available in gem5 provide some level of abstraction, which means they are not a one-to-one representation of a product. As such, some of the abstractions made limit the ability to realistically model CCI's behavior, or will exhibit different performance. Examples of such differences are discussed in Section 3.7. • Investigations should focus on general concepts and should not be limited to any specific implementation. The model should avoid limitations which are platform- or product-specific, and serve a broader audience. As such, the changes introduced to gem5 as part of this work should not be measured on the same scale as CCI-400, as they provide much more: • On one hand, we limited the research to the most interesting and representative ACE transactions supported by CCI-400 (as elaborated in Section 4.3), which resulted in gaining many insights about the challenges in implementing hardware-based system-level coherency. Implementing all ACE transactions would, in many cases, be more of the same when aiming at a qualitative analysis. • On the other hand, the ACE specifications, and the CCI that implements them, were designed to provide a single-hop interconnect solution. That is, ACE transactions can only exist directly between an ACE master, CCI, and ACE slaves. However, as systems scale up and contain more components, the use of buses and bridges becomes a must. Multi-layered hierarchies of coherent components will require a scalable solution, one that does not involve a single, central interconnect. Such complex systems are expected to utilize multi-hop interconnects, in which end-to-end transactions pass through several relays. As such, the gem5 coherent bus model was extended to support ACE-like transactions, such that: – when the bus model is instantiated in a classic CCI context, where it is directly connected only to ACE-like masters and slaves, it functions according to the ACE protocol. Such a system is depicted in Figure 3.8(a): the system contains two CPUs, each with a private cache. Each cache is connected to the main, single interconnect. From the interconnect's perspective, these caches are ACE masters, as they issue ACE-like transactions. In addition, the interconnect is connected to a DRAM controller, representing an ACE-like slave. – when the bus model is instantiated in a multi-layered system containing several buses, e.g. for arbitrating between two level-1 caches and a level-2 cache, all such intermediate buses must be able to handle ACE-like transactions.
However, these intermediate buses must differentiate between slaves which correspond to ACE slaves, and slaves which may lead to ACE-like masters. An example of such a system is depicted in Figure 3.8(b). This system consists of four CPUs in a two-cluster formation. Each cluster has its own level-2 cache, shared by two level-1 caches. Therefore each cluster requires its own CCI-like bus to arbitrate access to the level-2 cache. For instance, in such a system, an intermediate bus might receive a downstream snoop request on its slave port, a use-case which is not possible in the ACE architecture. Such complex systems will most probably exhibit inter-cluster sharing (depending on the workloads at hand). As such, they are extremely important when investigating the impact of hardware-managed system-level coherency. Therefore, the ACE-like transactions modeled were extended beyond the definition of the ACE specifications, to support multi-hop interconnects. To avoid any confusion or misinterpretation of gem5's bus model's implementation of memory transactions, it is better treated

Figure 3.8: gem5 example systems with typical CCI-like interconnects: (a) a gem5 system with an ACE-like interconnect - two MemTest CPUs, each behind a private BaseCache, connected through a single CoherentBus ("CCI") to a SimpleMemory; (b) a gem5 system with interconnects supported by CCI++ - four MemTest CPUs in two clusters, each cluster with two level-1 BaseCaches behind an intra-cluster CoherentBus ("CCI++") and a shared level-2 BaseCache, joined by a central CoherentBus ("CCI++") to a SimpleMemory

as an extended version of CCI, hence a CCI++, as it provides much more in the context of multi-hop interconnects.

As such, each ACE transaction that was implemented required extrapolating the original requirements to a broader context. The impact of such changes is not limited to the bus model alone, but extends to the entire gem5 memory system, as described in Section 3.5. For instance, an ACE-like transaction issued by a CPU must be interpreted by a cache, which should either handle the request on its own, forward it onwards, or generate a different type of transaction and dispatch it to its slave (downstream). We capitalize on the flexible connectivity provided by gem5 to investigate complex systems, rather than limiting the scope of this research to systems with a single central coherent interconnect; a sketch of how such a system can be composed is given below. Figures 3.8(a) and 3.8(b) were automatically generated using gem5. This feature, including interpretation of such diagrams, is further described in Appendix A as one of this project's indirect contributions.
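To illustrate the composition of such a multi-hop system, the following is a minimal sketch in the style of gem5's Python configuration scripts, corresponding to the two-cluster system of Figure 3.8(b). The class, port and parameter names (MemTest, BaseCache, CoherentBus, SimpleMemory, cpu_side/mem_side, slave/master) follow the gem5 conventions used in this work, but the exact parameters are illustrative and may not match any particular gem5 revision:

    from m5.objects import *   # gem5's Python object library

    system = System()
    system.physmem = SimpleMemory(range=AddrRange('512MB'))
    system.membus = CoherentBus()              # central "CCI++" instance
    system.membus.master = system.physmem.port

    for c in range(2):                         # two clusters
        cluster_bus = CoherentBus()            # per-cluster "CCI++" instance
        l2 = BaseCache(size='1MB', assoc=16)
        cluster_bus.master = l2.cpu_side       # cluster bus arbitrates the L2
        l2.mem_side = system.membus.slave      # L2 connects to the central bus
        for i in range(2):                     # two CPUs per cluster
            cpu = MemTest()
            l1 = BaseCache(size='32kB', assoc=2)
            cpu.test = l1.cpu_side             # CPU master port -> L1 slave port
            l1.mem_side = cluster_bus.slave    # L1 master port -> cluster bus
            setattr(system, 'cpu%d' % (2 * c + i), cpu)
            setattr(system, 'l1c%d' % (2 * c + i), l1)
        setattr(system, 'tol2bus%d' % c, cluster_bus)
        setattr(system, 'l2c%d' % c, l2)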

3.3 The Modeling Challenge

A full-system simulation framework enables design-space exploration. It allows performing functional feasibility analysis, performance analysis, and power and cost estimation. Performing such simulations using an actual hardware implementation (e.g. RTL simulation) may be very accurate, but is less convenient for exploring new designs. On the other hand, many simulators lack adequate models for existing components. Models should be as realistic as possible, without compromising on simulation speed. Quantifying "realistic" is a tricky challenge. When analyzing the behavior of full systems, the ability to utilize workloads that run on actual hardware is critical for correlating models. It enables extrapolating from the correlated set of points to any modeled platform, and reasoning about the conclusions made. To that extent, gem5 is a simulation framework which offers convenient exploration, can run real-world workloads, and has been shown to correlate with real hardware.

3.4 Simulation and modeling framework: gem5

3.4.1 gem5

gem5 [19] is a flexible and highly configurable full-system event-driven simulation framework. gem5 has a detailed and flexible memory system, useful for exploring state-of-the-art interconnects and protocols, such as those this project involves. As depicted in Figure 3.9, it offers a wide spectrum of simulation speed vs. accuracy.

Figure 3.9: gem5 speed vs. accuracy spectrum, spanning the CPU models (AtomicSimple, TimingSimple, In-Order, O3), system modes (SE, FS), and memory systems (Classic, Ruby/Garnet). Source: [19]

This spectrum is based on the following orthogonal set of capabilities: • Four CPU models: AtomicSimple (a single-IPC model), TimingSimple (which includes timing of memory references), InOrder (a pipelined in-order CPU), and O3 (a pipelined out-of-order CPU). • Two system modes: System-call Emulation (SE), which emulates most system-level services, thus eliminating the need to model devices and an operating system, and Full System (FS), which provides a bare-metal environment, including peripheral devices and an operating system. • Two memory systems: Classic, a fast and easily configurable system, and Ruby, for studying new coherency mechanisms in networks on chip. In this work we extend the Classic memory system to capture the behavior of the CCI-400. • Support of major popular ISAs, including ARM, ALPHA, MIPS, Power, SPARC and x86. • Ability to boot Linux and Android OSs using the ARM, ALPHA and x86 ISAs. The flexibility provided by gem5 enables creating test systems with multiple CPUs and GPUs (as demonstrated in Section 6). Note that not all combinations of the above-mentioned capabilities are currently supported. gem5 is developed by a wide community [3] including both academia and industry, and has been used in hundreds of publications. Unlike SimpleScalar [1], gem5 is released under an open-source license. gem5's community is very active, keeping gem5 changing - adding more models, supporting more workloads, and fixing more bugs, as in any large-scale software project. Revision control is distributed and based on Mercurial [37], hence based on differential changes from a given revision. Each change proposal is reviewed by the community prior to being committed. A key aspect of gem5 is its object-oriented design. gem5 utilizes standard interfaces such as the port interface (used to connect two memory objects together) and the message buffer interface. Although constantly changing, the port and interface semantics have become more TLM-like. These SystemC [28] and TLM-like semantics are essential in establishing connectivity between independent models. gem5 is continuously correlated with the actual hardware being modeled, such as development boards containing ARM CPUs (e.g. the Snowball SKY-S9500-ULP-C01, as published in [21]).

3.4.2 Simulation flow

The core of gem5 is an event-driven simulation engine which tightly combines C++ and Python sources. Each component in the simulation is represented by a SimObject, reflected simultaneously as a C++ object and as a Python object. The purpose of these two worlds is to enable easy and flexible composition of any system, made possible with Python. The simulator is compiled for a specific architecture (e.g. ARM, x86) and verbosity/optimization level. Figure 3.10 depicts simulation inputs, outputs and runtime interfaces. Generally speaking, a simulation flow contains the following stages:

Figure 3.10: gem5 simulation inputs (test-system .py scripts, Android kernel and disk image, boot loader, .rcS script), outputs (config.ini, stats.txt, simout, simerr, system.terminal, framebuffer.bmp) and runtime interfaces

• Creation of the test-system: the simulator is invoked with several command-line arguments, the most important one being a Python script which composes the test-system and specifies what stimuli the test will use. The test-system is generated in a sequential manner, instantiating its components (CPUs, memories, peripherals) and connecting them into a single entity. In addition, the script specifies what software will be run on the test-system. For a full-system simulation, this includes specifying which Operating System (OS) kernel to use, a disk image, a boot loader, and a script which will be invoked by the guest system (the system being simulated) once it has finished booting. The main test script utilizes other Python scripts for common tasks such as instantiating caches and CPUs. The test script may take input arguments, such as the number of CPUs, the CPU model to be used, cache sizes, cache associativity and so forth. At the end of this process, a config.ini file is generated. This file contains a complete listing of the test-system's components and settings; hence it fully reflects what is being simulated. In addition, we have added automatic visualization of the system in the form of a generated block-diagram. This output is further discussed in Appendix A. • Event-driven simulation: it is either limited by a maxtick parameter, or self-terminated by the guest system. During a full-system simulation, the following outputs are produced: – simout and simerr: the standard output and error streams generated by the host (simulator) – system.terminal: the output of the simulated system's terminal – framebuffer.bmp: the latest contents of the simulated system's display In addition to these outputs, it is possible to interact with the simulated system using: – A telnet connection: enabling text-based control and visibility of the test-system. – A Virtual Network Computing (VNC) session: the test-system runs a VNC server, making it possible to interact with the active session using keyboard and mouse inputs from the outside world. The display is equivalent to the contents of framebuffer.bmp. The simulation stage includes booting the test-system, followed by a test-specific scenario specified in an .rcS script.

• Post-simulation: including reporting of statistics gathered during simulation. These statistics are provided in stats.txt. Note that statistics can be reported and reset during simulation, either once or periodically. For instance, when benchmarking a specific scenario, it might be misleading to include statistics gathered during the booting stage. The gem5 term for such statistics is Stats, following the name of the class which implements all the required functionality.
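As a sketch of how statistics can be scoped to a region of interest (e.g. excluding the booting stage), the following uses gem5's Python-level simulation and statistics calls; the API names reflect gem5's scripting interface of the period, and the tick values are purely illustrative:

    import m5
    # ... test-system composition as in a regular gem5 script ...
    m5.instantiate()

    m5.simulate(2 * 10**12)    # run through the booting stage (illustrative)
    m5.stats.reset()           # discard statistics gathered while booting

    m5.simulate(10**12)        # run the scenario of interest
    m5.stats.dump()            # write the scenario's statistics to stats.txt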

3.4.3 Test-System Example

An example of a test-system is depicted in Figure 3.11. For clarity reasons, this architecture block-diagram is

Figure 3.11: gem5 2-core test system example: two AtomicSimple CPU clusters (each with data and instruction caches and TLBs) sharing an L2 cache through the tol2bus, a memory bus with physical and NV memories, and a bridged IO bus with the RealView peripherals (GIC, RTC, UART, timers), terminal, VNC server and disk image (cf0)

partial. The complete list of components, their configuration and connectivity is given in the config.ini file created by gem5 prior to the main simulation stage. This example system includes two CPU clusters. Each cluster contains an ARM AtomicSimple CPU, a data cache, an instruction cache, an instruction TLB, and a data TLB. Both clusters share a level-2 cache. Multiplexing the level-2 cache requires arbitration; this is the reason for having an instance of gem5's bus model, named tol2bus, in-between the cache and its masters. The system also includes an interrupt controller, various timers, a UART, and components for realizing a terminal and a VNC session. Most of these peripherals have been omitted from Figure 3.11 for clarity.

3.5 The gem5 memory system

3.5.1 The basic elements: SimObjects, MemObjects, Ports, Requests and Packets

Each model instance in gem5 is a SimObject. A particularly interesting type of SimObject are MemObjects: objects which can communicate with other objects via their TLM-like Ports. A MemObject can have either a Master port, a Slave port, or both. Typically: • A master is a model which generates transactions, namely requests. Examples of such models are CPUs, GPUs and traffic generators. • A slave is a model which can only respond to requests it receives. A memory model or a peripheral device such as a UFS model are typical slaves.

• In addition, there are devices which have both a master and a slave port. Such devices include caches, bridges, buses, and communication monitors, which are gem5 shims used to monitor activity between a master and a slave. The collection of such devices, together with all memories, is referred to in this work as the gem5 memory system. The gem5 port semantics enforce clear connectivity and responsibilities. Each port must be connected upon simulation start, and can only be connected to a single port of the opposite type. Buses make use of port vectors, which enable them to connect to more than one master and more than one slave. The implication of these connectivity semantics is that in each system there is a directed connected graph of MemObjects, with edges leading from masters to slaves. Figure 3.8(a) is a simple example of such a graph: each directed edge represents a connection between a Master port and a Slave port. We denote a master's slave and a slave's master as its peer. In order to communicate, a master creates a Request. A request contains an abundance of parameters, with the destination address being one of the most important ones. This request is conveyed to the slave in an enclosing Packet. Throughout the request's lifecycle it may be conveyed using different packets, and eventually it will reach its destination slave. In case a request requires a response, the original packet will be converted to a response and transported back to its initiating master in a similar manner - conveyed in packets. Each packet contains a command field, indicating the purpose of the request. Trivial examples of such command types are a read request, a read response, and a write request, but there are also less common ones, such as a load-conditional response. Each master port can be either snoopable, hence able to accept snoop requests from its slave peer, or non-snoopable. The collection of the port semantics presented thus far is sufficient to model ACE and ACE-lite interfaces, using the mapping provided in Table 3.2. Note that a slave port is snoopable if its peer's port is snoopable.
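The request/packet distinction can be illustrated with a small sketch in plain Python (illustrative only; in gem5 these are C++ classes with many more fields):

    class Request:
        """The long-lived description of a memory access."""
        def __init__(self, addr, size, needs_response=True):
            self.addr = addr                  # destination address
            self.size = size
            self.needs_response = needs_response

    class Packet:
        """A transport container; a request may travel in several packets."""
        def __init__(self, request, cmd):
            self.req = request                # the enclosed request
            self.cmd = cmd                    # e.g. 'ReadReq' or 'WriteReq'
            self.data = None

        def make_response(self):
            # Converting to a response only changes the packet's attributes;
            # the enclosed request instance stays the same.
            self.cmd = {'ReadReq': 'ReadResp',
                        'WriteReq': 'WriteResp'}[self.cmd]

    pkt = Packet(Request(addr=0x1000, size=64), 'ReadReq')
    pkt.make_response()
    assert pkt.cmd == 'ReadResp'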

Table 3.2: Correlation of ACE and gem5 interface semantics

ACE naming         gem5 naming   gem5 snoopable
(full) ACE master  master        yes
ACE-lite master    master        no
ACE slave          slave         no

A packet is considered to be a snoop if it is either marked as an express snoop, or if it is transported in a direction opposite to the conventional direction. Hence a read request that a master receives is considered to be a snoop read request, since a master can only issue read requests and does not handle them. The interaction between a master and a slave port in order to pass on a packet is done using the function calls listed in Table 3.3.

Table 3.3: Main port interface methods

Master port method       Slave port method
send request             receive request
receive response         send response
send snoop response      receive snoop response
receive snoop request    send snoop request

Each of these functions has three variants: • Functional: used by gem5 only for performing backdoor operations, such as loading the kernel binary into the guest-system's memory; hence a functional necessity for bypassing the complexities of the simulated memory system. • Atomic: fully functional with minimal time tracking, used mainly for fast-forwarding, e.g. when using AtomicSimple CPUs, in order to quickly reach an interesting point in time before switching to a detailed mode. • Timing: fully functional with detailed temporal modeling, at the cost of degraded simulation performance.

Changes to the memory system must be reflected in all three modes, as switching between atomic and timing modes can occur at any time, while functional mode can co-exist with both.

3.5.2 Request's lifecycle example

In order to explain how transactions occur in gem5's memory system, several use-cases are provided in the following sections. These use-cases are listed starting from the simplest transaction and moving to more complex scenarios. Instead of providing a single complex scenario, isolated segments demonstrating how each model interacts with the memory system are given. The simplest possible system contains only a single master and a single slave: for instance, a CPU (the initiator, a master) directly connected to a memory (a slave, the responder). Alternatively, it could be a CPU issuing a read request to a cache, resulting in a hit. Such an example is depicted in Figure 3.12.

Figure 3.12: Master-to-slave transaction sequence diagram (CPU, master port, slave port, memory; a request is sent and received downstream, the response is made and sent back upstream)

The general sequence of actions throughout the request's lifecycle is as follows: • Assuming the need to send a request has arisen in the CPU, it will first create a request and enclose it in a packet.

• The master will call its (master) port's send request method. Depending on the circumstances, this could be any of the send functional / atomic / timing request methods; for now we abstract from any mode-related details. • The master port calls its (slave) peer's receive request method, passing the packet with the original request to the slave's port. • The slave port calls its owner's receive request function to handle the request. • The slave (the memory in this case) receives the packet, handles the request (for example, reads from some address and attaches the data to the packet), and transforms the request into a response. This is merely a change in the request's attributes.

• The slave calls its port's send response method. • The slave port calls its peer's (master) port's receive response method. • The master handles the response, and de-allocates both the packet and the response.

In this simple case, during the entire transaction the same packet and request instances were used. In more complex cases the request may be passed on enclosed in different packets along its route. From here onwards the explanations will abstract from the owner-to-port details and will regard them as a single entity for clarity.
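The sequence above can be sketched in plain Python as follows (a simplified illustration of Figure 3.12, not gem5 source; gem5 implements these steps as C++ method calls on port objects):

    class Memory:
        """The slave: handles a request and turns it into a response."""
        def __init__(self):
            self.store = {}

        def recv_request(self, pkt):
            if pkt['cmd'] == 'ReadReq':
                pkt['data'] = self.store.get(pkt['addr'], 0)
                pkt['cmd'] = 'ReadResp'       # same packet becomes the response
            return pkt

    class CPU:
        """The master: creates a packet and sends it to its port's peer."""
        def __init__(self, peer):
            self.peer = peer                  # the slave bound to our master port

        def read(self, addr):
            pkt = {'cmd': 'ReadReq', 'addr': addr, 'data': None}
            resp = self.peer.recv_request(pkt)   # send request, receive response
            assert resp['cmd'] == 'ReadResp'
            return resp['data']

    cpu = CPU(Memory())
    print(cpu.read(0x1000))                   # -> 0 (address never written)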

3.5.3 Intermezzo: Events and the Event Queue

The difference between atomic and timing modes is crucial, and represents a typical tradeoff between modeling accuracy and simulation speed. gem5 is an event-driven simulator which maintains a central event queue for scheduling events at any moment in the future. Each model can make use of this mechanism and register events, which requires providing:

• the time (simulation tick) at which this event should occur • an event handler (process() function) that is called once the event is due.

As such, modeling a sequence of events in gem5 can be done in one of two general ways: • Function calls, as done in Atomic mode, annotating estimated temporal progress and letting the caller deal with actual delays in simulation. This resembles SystemC's loosely-timed notion. • Scheduling an event to happen later on, as done in Timing mode, causing time to actually progress along the sequence of actions. Hence a sequence can be split into several phases which can occur at different times. The result is a much more accurate and realistic behavior, which slows down simulation. This resembles SystemC's approximately-timed notion. One of the changes made as part of this work included improving the temporal behavior of response flow through the memory system by switching from a loosely-timed to an approximately-timed approach. This is further discussed in Section 4.1. The same applies to any of the stages described in Section 3.5.2 and depicted in Figure 3.12: each function call could be, and many times is, replaced with the queuing of an event at a future time, thus modeling time progress and resource latency.
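A toy event queue illustrating the timing-mode approach is sketched below in plain Python (gem5's actual engine is C++; the names here are illustrative):

    import heapq

    class EventQueue:
        """Executes scheduled handlers in simulated-tick order."""
        def __init__(self):
            self.now = 0
            self._queue = []
            self._seq = 0                 # tie-breaker for equal ticks

        def schedule(self, tick, process):
            heapq.heappush(self._queue, (tick, self._seq, process))
            self._seq += 1

        def run(self):
            while self._queue:
                self.now, _, process = heapq.heappop(self._queue)
                process()                 # the event handler

    eq = EventQueue()
    # Atomic mode would call the handlers directly; timing mode schedules
    # them, so simulated time progresses between the phases:
    eq.schedule(100, lambda: print("tick %d: request reaches the slave" % eq.now))
    eq.schedule(150, lambda: print("tick %d: response sent upstream" % eq.now))
    eq.run()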

3.5.4 The bus model

The gem5 bus model is the sole component that can be connected to multiple masters and multiple slaves. The name bus may at times be misleading, since its internal implementation can model any type of interconnect. Figure 3.13 depicts an abstraction of the bus prior to the start of this work. The bus was modeled as a unary

resource that can be either occupied or not. As such, each incoming transaction attempt would first check whether the bus is available to provide service. In case the bus is indeed busy, the requesting port is added to a retry list. An exception was made for snoop requests, i.e. requests coming from a slave, which are passed on in zero time regardless of the bus's state. This modeling peculiarity is discussed in Section 4.1. Once the bus becomes available after a period in which it was occupied, the first port in the retry list is granted the right to utilize the bus. Each slave declares its address range to the bus upon pre-simulation system composition. This is required so that the bus can dispatch requests from a master to the correct slave (the one responsible for that address). Each request that is received and requires a response is added to the outstanding requests list. This is used to distinguish responses (which need to be sent upstream to the initiating master) from snoop responses (which need to be sent downstream towards the snooping bus). In the next sections, the four main use-cases are presented. In order to provide descriptions which are not obsolete, the interaction sequences describe the bus in its current state, after being split into a layered bus. The layered bus is further described in Section 4.2.

Figure 3.13: Abstracted diagram of the bus: masters and slaves attached through master/slave port pairs, a busy indication with a retry list, and an outstanding requests list
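The address-range dispatch mentioned above can be sketched as follows (plain Python with illustrative names; gem5 keeps a comparable range-to-port mapping inside the bus):

    class Bus:
        """Routes request packets to slaves by declared address ranges."""
        def __init__(self):
            self.ranges = []                  # (start, end, slave_port)

        def register_slave(self, start, end, port):
            # Called during pre-simulation system composition
            self.ranges.append((start, end, port))

        def route(self, pkt):
            for start, end, port in self.ranges:
                if start <= pkt['addr'] < end:
                    return port
            raise KeyError("no slave owns address 0x%x" % pkt['addr'])

    bus = Bus()
    bus.register_slave(0x00000000, 0x20000000, 'physmem')
    print(bus.route({'addr': 0x1000}))        # -> 'physmem'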

3.5.5 A simple request from a master

Once a bus receives a request from a master: 1. The bus first checks whether its request layer is idle. Snoop requests are serviced regardless of the resource's availability, while regular requests may be refused if the request layer is busy.

2. The bus forwards the packet upstream to all snoopable masters (hence issues a snoop request). 3. Each master receives the packet and handles the snoop. The packet is forwarded upstream recursively to all masters. In case of a snoop hit, the cache acts according to the coherence protocol, marking the packet as MemInhibited, stating that a master took responsibility for providing a response. Subsequent masters will be aware of this, thus understand that the snoop caused a snoop hit, and act accordingly.

4. The bus sends the request packet to its destination slave according to the packet's destination address. 5. In case the packet is marked as MemInhibited, the slave deletes the packet, as some master has committed to providing a response. Otherwise, it makes the packet a response and schedules the response. 6. The bus marks the request layer as occupied for the duration of the packet's sending time, plus a fixed bus latency. The entire process is done at the same simulation tick, hence in zero time. This introduces several peculiarities, such as the fact that a slave is aware of the snoop result (hit or miss) already at the time of the request's arrival at the bus. The slave only reacts if needed, hence the result of this sequence is equivalent to modeling perfect speculative prefetching: snoop latency is hidden, as snoop-miss responses effectively arrive in zero time at zero cost.
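Steps 2-5 can be condensed into a small sketch (plain Python, illustrative; the real bus model is C++ and far more involved):

    class Cache:
        def __init__(self, lines):
            self.lines = set(lines)
        def snoop_hit(self, addr):
            return addr in self.lines

    class Slave:
        def drop(self, pkt):
            print("slave: packet inhibited, a cache will respond")
        def respond(self, pkt):
            print("slave: providing the data")

    def handle_request(pkt, snoopable_masters, slave):
        for cache in snoopable_masters:       # zero-time snoop broadcast
            if cache.snoop_hit(pkt['addr']):
                pkt['mem_inhibited'] = True   # a master commits to respond
        if pkt.get('mem_inhibited'):
            slave.drop(pkt)                   # step 5: slave discards it
        else:
            slave.respond(pkt)

    handle_request({'addr': 0x40}, [Cache({0x40}), Cache(set())], Slave())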

3.5.6 A simple response from a slave

Once a bus receives a response from a slave:

1. The bus first checks whether its response layer is idle. Responses may be refused if the response layer is busy. 2. The response is sent upstream to the appropriate master.

3. The bus marks the response layer as occupied for the duration of the packet’s sending time, plus a fixed bus latency.

3.5.7 Receiving a snoop request

Snoop requests pass through instantly. As such, once a bus receives a snoop request (from a slave), the received packet is forwarded to all snoopable masters.

3.5.8 Receiving a snoop response

Once a bus receives a snoop response: 1. The bus first checks whether its snoop response layer is idle. Snoop responses may be refused if the snoop response layer is busy. 2. In case the snoop response is due to a snoop request the bus forwarded, it forwards the snoop response to the snoop origin (downstream). 3. In case the snoop response is due to a snoop request the bus itself issued (as a result of receiving a request), it sends a response to the requesting master (upstream). 4. The bus marks the snoop response layer as occupied for the duration of the packet's sending time, plus a fixed bus latency.

3.5.9 The bridge model

Currently gem5 does not support connecting two buses directly to each other, due to an assumption that a response from a bus must always be accepted, which might fail if the receiving side is also a bus and is occupied. The sole functionality of a bridge is to buffer requests in both directions in a finite queue. In case its queue is full, the packet is marked as not-acknowledged (nacked) and sent back.

3.5.10 Caches and coherency in gem5

gem5's cache model implements a MOESI-like coherence protocol. This protocol aims at minimizing the amount of off-chip activity, which might not always realistically represent a system. The term cache model is actually an abstraction which describes both the cache and the cache controller. The cache model is based on three main data structures: • A TagStore, which holds the actual cached data and tags, stored as a list of blocks. The TagStore can implement any replacement policy, yet the one used by default is Least-Recently Used (LRU). Replacements and evictions are determined by the TagStore during the handling of a request.

• A Miss Status Handling Registers (MSHR) block, which contains tracking data regarding cached read and write requests for cache lines that are not in the TagStore and are pending a response. • A Write Buffer for uncacheable, evicted, and dirty lines that need to be written downstream.

3.5.11 gem5's cache line states

Each cache line stored in the TagStore carries the following state indicators: • Valid: indicates whether the data stored is valid. • Readable: indicates whether the data stored can be read. An example of a situation where a line can be valid yet not readable is a write-miss to a previously-readable line (e.g. shared and pending an upgrade to exclusive state). • Writable: indicates that the cache is permitted to make changes to its copy. • Dirty: indicates that the stored copy has to be updated in the main memory.

Examples of optimizations utilized in gem5's cache model which are implementation-dependent and might not realistically model a system are: • A read request for a line which does not reside in any other cache automatically results in the exclusive state. This optimization eliminates the need to request an upgrade (to the exclusive state) upon receiving a subsequent write request, yet does not necessarily represent a real system.

• Ownership passing is supported, yet only when a dirty line is snooped by a read-exclusive request. The impact of ownership-passing policies is further discussed in Section 7.3. • Read requests to a modified line never result in a write-back of the line, which would be the classic expected behavior. Instead, data is passed on-chip, which might not truly represent a system.

• Snoop requests from the slave are handled and forwarded in zero time. This major inaccuracy is intended to avoid race conditions in the memory system, and mostly the need to implement transitional states in the cache controller. Table 3.4 projects gem5's cache line state space onto MOESI. The purpose of this projection is to provide a complete picture which enables correlation with the ACE cache-line state space provided in Table 3.1.

3.5.12 Main cache scenarios

In the following sections, the main possible scenarios are provided.

Table 3.4: Projection of gem5 cache line state space to MOESI

MOESI state   Writable   Dirty   Valid
M             1          1       1
O             0          1       1
E             1          0       1
S             0          0       1
I             0          0       0
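The projection of Table 3.4 can be expressed as a simple lookup (a sketch for clarity, not gem5 source):

    # (writable, dirty, valid) -> MOESI state, per Table 3.4
    MOESI = {
        (True,  True,  True):  'M',
        (False, True,  True):  'O',
        (True,  False, True):  'E',
        (False, False, True):  'S',
        (False, False, False): 'I',
    }
    assert MOESI[(True, False, True)] == 'E'   # writable, clean, valid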

A read request Upon a read request (from the master):

• In case of a hit, the cache line is provided to the master (upstream). • In case of a miss, a cache-line-sized read request is generated and sent to the slave (downstream). Once a response arrives, it is stored locally (except for cases such as an uncacheable read or a full cache), and a response to the original request is provided to the master (upstream).

A write request Upon a write request (from the master):

• In case of a hit, the cache line is updated. • In case of a miss, a read-exclusive request is generated and sent to the slave (downstream). Once a response arrives, the new data is stored locally (here too there are exceptions according to the coherence protocol, which are beyond the scope of this abstracted description), and a write response is sent to the master (upstream).

• In case of an uncacheable write request, the write request is added to a write-back buffer to be sent downstream.

A snoop request Upon a snoop request (from the slave):

• The cache will forward the snoop request upstream in zero time. This ensures all caches are updated instantaneously, thus eliminating the need for transitional cache states and various race conditions. • The cache controller follows the coherence protocol and provides a response. E.g. a snoop read request for a valid cache line will be responded to with the data, and the packet will be marked as MemInhibited, stating that this cache is responsible for scheduling a snoop response. The cache line state will be updated according to the coherence protocol. • In case of a miss, the packet will not be modified.

A downstream snoop response In case of a snoop response from a master:

• If the response is to a request which the cache forwarded, it is forwarded downstream. • Otherwise, it is treated as any response to a read or write request, as described in the read- and write-request scenarios of Section 3.5.12.

3.6 Target platform

The CCI-400 was intended to be the interconnect in a compute subsystem developed by ARM. This would have enabled correlating the model with real hardware, running the same workloads both on hardware and in the simulator. The platform targets hand-held mobile devices. A block-diagram of the target platform is provided in Figure 3.14.

Figure 3.14: Target platform block diagram: a high-performance tablet SoC containing an Eagle 2/4-CPU cluster, a Kingfisher 2/4-CPU cluster, a T604/T608 GPU, display and media subsystems, high-bandwidth peripherals, a Cache Coherent Interconnect (CCI) with NIC-400 interconnects, an LPDDR2/DDR3 memory controller, and system-control, security and external peripherals. Source: ARM technical symposium 2011

Modeling the target platform in gem5 required: • composing an equivalent platform from the available components, • tweaking gem5's models to match the target platform's configuration as much as possible (e.g. clock frequencies, cache sizing and latencies),

• preparing workloads which can be run both on the target platform and on gem5. As a key workload, Browser-Bench (BBench), a web-rendering benchmark, was selected; BBench is further discussed in Section 6.4.4. The gem5 target platform was initially designed as an abstracted version of the planned target platform. Its high-level design is provided in Figure 3.15(a).

Figure 3.15: gem5 target platform overview: (a) the simplified target platform - two A15 clusters with private L2 caches, a "GPU" traffic generator, a CCI-like interconnect, physmem, extendedMem and the RealView peripherals behind a NIC; (b) the end-goal target platform - A15 and little clusters, a GPU, HD-video and display traffic generators on a real-time NIC, a CCI, physmem and the RealView peripherals

The gem5 target platform: • Is based on gem5's RealView ARM development board configuration, peripherals and memory map, and is hence capable of running the same software/OS (including BBench) as a currently-available hardware product,

• Is composed of two clusters of two ARM Cortex-A15-like out-of-order (O3) CPUs, each CPU with a private level-1 cache and a shared level-2 cache per cluster, • Contains a trace player / random traffic generator as a starting point to mimic GPU behavior, generating read and write traffic to a private memory (extendedMem) and also read requests to the main memory (physmem), • Contains a CCI-like interconnect, capable of issuing ACE-like system-level coherent transactions,

• Communication monitors (described in Section 3.6.1) on each of the interconnect's ports, for improved observability of the traffic around the interconnect. The communication monitors are depicted as discs on each of the interconnect's ports. This model was implemented and tested. However, during testing of the target platform model, simulation performance issues (discussed in Section 3.6.2) were detected and required re-evaluating the investigation approach. A model which more closely resembles the actual target platform - containing a GPU model (instead of a traffic generator) and additional real-time-sensitive video and display traffic generators utilizing a separate interconnect - is depicted in Figure 3.15(b). This model was not realized, due to the performance issues discussed in Section 3.6.2 and the lack of an adequate GPU model, which was essential for investigating sharing patterns between CPUs and the GPU.

3.6.1 gem5 building block for performance analysis

A fundamental feature in any simulation framework is observability. In order to analyze a system's performance, the simulator must provide infrastructure for monitoring events and collecting statistics during a simulation. gem5 provides two main tools for this matter:

• Stats is a class which provides convenient means of collecting scalar or vector inputs upon sampling events. These can either simply record values, or perform more complex arithmetic operations (Formulas).

• Communication monitors enable in-depth analysis of the accumulated traffic between a master and a slave port. The communication monitor acts as a non-functional shim, sniffing all port activity. Example statistics provided by the monitor are bandwidth and latency statistics, transaction distributions and so forth. In both cases, statistics can be reset or output to a file, either once or periodically.
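Attaching such a monitor amounts to interposing it between two ports, as in the following sketch in the style of gem5's Python configuration (class and port names follow gem5 conventions of the period and may vary by revision):

    from m5.objects import *   # gem5's Python object library

    membus = CoherentBus()
    physmem = SimpleMemory()
    monitor = CommMonitor()            # non-functional shim

    membus.master = monitor.slave      # bus -> monitor
    monitor.master = physmem.port      # monitor -> memory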

3.6.2 Simulator performance

The process of testing the target platform model raised questions regarding the simulator's performance when running a test system which utilizes multiple O3 CPUs. Repeated testing and analysis of the simulator's progress, based on the reported kIPS, implied that simulating a real-world workload would require an unreasonable amount of time for the scope of this project. For example, we discovered that it would take several weeks to run a BBench simulation for a system with four O3 CPUs. In order to verify these findings, an in-depth investigation using Valgrind [36] and Callgrind [42] was performed. Callgrind is a profiling tool which logs function and library call history at runtime, including the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the number of such calls. Gaining such insight comes at the price of significantly degraded run-time. Results for a partial BBench run using the target platform model described in Section 3.6 are provided in Figure 3.16. The simulator was found to be spending a major part of the time performing memory allocation and de-allocation as part of the O3-model CPUs' operation. The impact on overall simulation performance was acute, as the target platform contained four instances of the O3 model. The investigation's findings spurred several changes in gem5 which resulted in a 12% performance improvement, yet it was concluded that full simulations of real-world workloads cannot be performed using the O3 model. Alternatives are discussed as part of Chapter 6.

3.7 Differences between ACE and gem5’s memory system

When approaching the task of modeling the ACE protocol in gem5, we first need to understand the gap between the current state of gem5's memory system and what ACE tries to achieve. The purpose of this section is to provide motivation for these differences using a leading example. As stated before, the ACE protocol is aimed at single-hop interconnects, while gem5's connectivity requires supporting multi-hop interconnects. An example of ACE's intention is the listing of all expected cache-line after-states per transaction. For instance, the ACE specifications for a WriteNoSnoop request (discussed in Section 4.3.3 and in [6]) require the end-state of the cache line to never be dirty (modified).

• In a single-hop interconnect, the interconnect would receive a WriteNoSnoop request and forward it directly to the slave, assuming it is a memory device that will be updated with the new data; therefore any line containing this address could not be dirty once the transaction is over. • However, in gem5, a CPU may issue a WriteNoSnoop to any type of peer slave. In case this is a cache, a WriteNoSnoop does not require updating the end memory responsible for this address, but simply specifies that the address is not shareable, in order to avoid unnecessary snooping. As such, the receiving cache should: – in case of a hit to a writable line, perform a write-hit; – in case of a miss or a hit to a readable line, issue a ReadNoSnoop downstream, which is inherently also a read-exclusive request. – Only once a lowest-level bus receives a WriteNoSnoop request will it perform the request as defined by the ACE specifications. This subtlety is a typical example of how an ACE transaction must be extended, and cannot be modeled as-is in gem5; a sketch of the extended cache-side handling follows below. ACE transactions influence the entire memory system, including caches and buses, and not only the bus. Additional differences are discussed as part of Section 4.3.
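The following is a minimal sketch of this extended, cache-side WriteNoSnoop handling (plain Python with hypothetical helper names; not gem5 source):

    def handle_write_no_snoop(cache, addr, data, fetch_downstream):
        """Sketch of the extended WriteNoSnoop handling inside a cache."""
        line = cache.get(addr)
        if line and line['writable']:
            line['data'], line['dirty'] = data, True   # write-hit
        else:
            # Miss, or hit to a merely readable line: first gain an
            # exclusive copy; ReadNoSnoop is inherently read-exclusive here.
            line = fetch_downstream('ReadNoSnoop', addr)
            line.update(data=data, dirty=True, writable=True)
            cache[addr] = line
        # Only the lowest-level bus performs the ACE-defined WriteNoSnoop,
        # updating the end memory responsible for the address.

    # Hypothetical downstream fetch, standing in for the rest of the hierarchy:
    fetch = lambda cmd, addr: {'data': None, 'dirty': False, 'writable': True}
    cache = {}
    handle_write_no_snoop(cache, 0x80, 42, fetch)
    print(cache[0x80]['data'])                         # -> 42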

3.7.1 System-coherency modeling

gem5's system-level coherency relies on a set of conventions which keep the system coherent at any point in time:

• All snoop requests received by a bus or a cache from its slave are forwarded to all upstream masters, as described in Section 3.5: – The bus forwards all requests arriving from any of its masters to all snoopable masters (upstream). – The bus forwards all snoop requests originating from its slaves towards its masters. – The cache forwards all snoop requests originating from its slave to its master.

• The memory system relies on all upstream requests being forwarded in zero time, in order to keep the contents of all caches coherent without the need for transitional states, and without the hassle of race conditions. Hence in gem5, cache-line state changes occur unrealistically fast compared to ACE. However, as both gem5 and ACE utilize a MOESI-like coherence protocol, the after-states modeled in gem5 in the CCI context must be the same. This leading requirement is fulfilled per implemented ACE transaction, as described in Section 4.3. Compliance can be demonstrated by issuing the modeled ACE transactions in a gem5 system which resembles the CCI context (as in Figure 3.8(a)). Extending each ACE transaction should guarantee that system-level coherence is maintained in any form of interconnect and memory system. Furthermore, the notions of shareability and bufferability discussed in Section 3.1.5 are missing from gem5 and are essential for modeling ACE transactions. The above-mentioned requirements for modeling extended, ACE-like, hardware-managed system-level coherency have been implemented and are further discussed in Chapter 4.

Figure 3.16: Callgrind profiling results for a partial BBench run: (a) call-graph statistics summary; (b) call-graph visualization of O3 activity

Chapter 4

Interconnect model

The gem5 bus model provides mature ground for modeling realistic system-level coherent interconnects. However, some key points in the model had to be dealt with in order to increase its correlation with a CCI-like interconnect. In this chapter, the gaps between the old gem5 bus model and the desired model are discussed. Solutions for bridging these gaps are provided, some by means of implementation and some as guidelines for future work. During this project the bus model has also been modified by other members of the gem5 community, mostly by my supervisor, Dr. Andreas Hansson. Although these changes are not fruits of this work, they brought the bus model much closer to an ACE-like interconnect. Problems in the bus model that have been dealt with are:

• Connectivity semantics: gem5's memory system lacked a strict notion of master and slave ports; a port could be connected to any other port. Recent work modified all ports to be either master or slave ports, enforcing strictly directed connectivity with distinguished responsibilities. The introduced semantics made gem5 ports AXI-like. An additional change, whereby each master port can be either snoopable or not, was another step towards ACE-like interfaces.

• The bus model was previously used in connectivities which should and should not handle coherent traffic. The bus functionality was redesigned around a BaseBus base-class, from which two types of models inherit, as depicted in Figure 4.1:

Figure 4.1: The bus model's inheritance diagram: SimObject (deriving from EventManager and Serializable) is specialized by MemObject and then BaseBus, from which CoherentBus and NoncoherentBus inherit; DetailedCoherentBus derives from CoherentBus

– A NoncoherentBus class, which does not support any coherent traffic (hence cannot snoop or be snooped), and which better resembles non-coherent interconnects than the original bus, – A CoherentBus class, which supports both coherent and non-coherent traffic, and resembles a CCI-like interconnect.

– A more timing-accurate bus model was designed, the DetailedCoherentBus, which is derived from the CoherentBus and is a step closer to a realistic interconnect. It is further described in Section 4.1. • The bus utilized a unary resource for modeling contention, as described in Section 3.5.4. This type of contention is suitable for modeling tri-state-like buses, whereas modern medium-scale interconnects are typically crossbar-based. An improved resource contention mechanism, in the form of layers, was introduced as part of the bus-class split into coherent and non-coherent buses. This solution is further discussed in Section 4.2. • The bus model supported a very limited set of coherent transactions. As such, support for ACE-like transactions, extended to suit both CCI-like and multi-hop interconnects, has been added; it is discussed in Section 4.3. • The bus model did not contain any performance monitoring capabilities, which crippled the ability to analyze its performance and its impact on the entire guest (simulated) system. In order to provide such capabilities, statistics were added to all bus models, as described in Section 4.5. • The memory system as a whole does not support barrier transactions, although these are part of the ACE specifications. While barriers (of both data and sync types) may have a significant impact on a system's performance, they are currently implemented within gem5's CPU models. The effort estimated to remedy this situation was beyond the scope of this work. As it is both an interesting feature and one with potentially significant performance impact, it is further discussed in Section 7.3.

4.1 Temporal transaction transport modeling

The gem5 memory model utilizes an approach which is a mix of the loosely-timed and approximately-timed approaches. Hence in some cases function calls are used, passing on the responsibility of modeling temporal behavior, and in other cases event queuing is used, better modeling time progress in each model. As discussed in Section 3.5, in order to avoid race conditions and transitional states in caches, snoop requests are passed on in zero time. To compensate for this approximation, gem5's memory system penalizes snoop responses with longer latencies, aiming to provide a reasonable aggregate delay for the entire transaction. However, snoop responses, as well as several other cases covered in this section, pass through the bus in a loosely-timed manner, passing the responsibility to mimic time progress to the receiving end. In order to improve the bus model's temporal-behavior accuracy, transaction scheduling using queued ports was implemented. As explained in Section 3.5.3, gem5 provides the infrastructure to delay the execution of events until a future point in time. This infrastructure was used to delay outgoing traffic from the bus to its destination port, instead of letting the bus directly send the packet with timing annotations that would have to be dealt with by the receiving end. The change is based on queued ports, which maintain an event queue, enabling activities to be scheduled in addition to the basic functionality of a port. Table 4.1 lists the outgoing traffic cases which now utilize queuing.

Table 4.1: Bus outgoing traffic queuing

Bus port      Traffic direction               Scenario                    Figure
Slave port    upstream (bus to a master)      response from a slave       4.2(a)
Master port   downstream (bus to a slave)     request from above          4.2(b)
Master port   downstream (bus to a slave)     snoop response from above   4.2(c)

While the first two cases are trivial to grasp, the third scenario is not: it requires a system in which there are buses in-between two levels of caches, and a situation where a snoop is responded to and the response flows downstream through the bus. This flow of a response from a master to a slave cannot occur in classic ACE, but only in such a multi-hop interconnect. All three scenarios are depicted in Figure 4.2. A description of all three scenarios: • In Figure 4.2(a), the CPU issues a request which traverses downstream through the bus to the memory. The request traversal is marked as a green arrow. The response, marked with a blue arrow, is queued (hence scheduled for future sending) at the bus's slave port (marked by a red Q sign).

• In Figure 4.2(b), the same scenario occurs. Queuing has also been added for request traffic going downstream towards the slave. The same marking method is used.

(a) response from a slave (b) request from above (c) snoop response from above

Figure 4.2: Added bus outgoing traffic queuing scenarios

• In Figure 4.2(c), a read request is issued by CPU 3. Assuming the cache line only resides in the level-1 cache of CPU 1, the request will be sent downstream to the main interconnect (membus), which will forward a snoop request upstream (marked in blue), through level-2 cache 0 and through an intermediate bus. The snoop response (marked in purple) will enter a queued master port en route from the intermediate bus to the level-2 cache.

This set of changes shifted time-delaying the packet from being the cache's responsibility to a split responsibility which better models a real system. In the updated implementation, the cache adds its own latency on top of the arrival moment, and not on top of a timing annotation provided by the bus. In addition, the semantics of the annotated time have been updated such that the bus provides the delay taken for the packet to traverse the bus, and not absolute delivery times. In order to provide backwards compatibility, the CoherentBus was not modified; rather, a new class which derives from the CoherentBus is used - the DetailedCoherentBus. One reason for not forcefully applying this change to the CoherentBus is that the additional queuing has a negative simulation-performance impact; hence it is essential to maintain a less accurate yet faster bus model. Note that the added queuing mechanism is not applied to any packets marked as express snoop, as they must be passed on in zero time. In addition to queuing in these three scenarios, a dedicated snoop response latency parameter was added to the DetailedCoherentBus in order to investigate the impact of penalizing snoop responses on the system's performance. Evaluation of the above-mentioned changes is extensively discussed in Chapter 6.
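The queuing mechanism can be pictured with a minimal sketch (illustrative types, not gem5's actual Port and Packet classes): instead of annotating an outgoing packet with a delay that the receiver must honor, the bus schedules the send itself on a tick-ordered event queue, so the receiver simply observes the packet at its true arrival time.

#include <cstdint>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

using Tick = uint64_t;

struct Packet { const char *name; };

class QueuedPort {
    // Earliest-deadline-first queue of (send time, packet).
    using Entry = std::pair<Tick, Packet>;
    struct Later {
        bool operator()(const Entry &a, const Entry &b) const {
            return a.first > b.first;
        }
    };
    std::priority_queue<Entry, std::vector<Entry>, Later> txQueue;

public:
    // The bus adds its own traversal delay and queues the packet,
    // instead of annotating the packet and sending it immediately.
    void schedSend(Packet pkt, Tick now, Tick busDelay) {
        txQueue.push({now + busDelay, pkt});
    }

    // Drain every packet whose scheduled send time has been reached.
    void processUpTo(Tick now) {
        while (!txQueue.empty() && txQueue.top().first <= now) {
            std::cout << "tick " << txQueue.top().first << ": send "
                      << txQueue.top().second.name << '\n';
            txQueue.pop();
        }
    }
};

int main() {
    QueuedPort port;
    port.schedSend({"response"}, /*now=*/0, /*busDelay=*/1000);
    port.schedSend({"snoop response"}, /*now=*/0, /*busDelay=*/50000);
    port.processUpTo(60000);  // both sends fire at their scheduled ticks
}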

4.2 Resource contention modeling

The bus model, in its previous form, was presented in Section 3.5.4. The bus's availability was determined by a unary shared resource, regardless of the initiating port and the type of transaction (whether it is a request, a response, a snoop request or a snoop response). To better model modern on-chip buses, the notion of layers was introduced. A layer represents a unary resource. Each layer has a state which represents its current availability:

• IDLE, indicating that the layer is not occupied.
• BUSY, indicating that the layer is occupied for some duration due to an in-flight transaction.

• RETRY, when the layer has finished serving a port and now checks whether any other port is waiting for the layer to service its request.

Each bus now utilizes layers for modeling contention over its resources:

• Both the non-coherent and the coherent bus have separate request and response layers. The bus will only serve a request if its request layer is idle. Similarly, the bus will only serve a response if its response layer is idle.
• The coherent bus has an additional layer, a snoop response layer, for modeling contention over a snoop-response channel. This added channel brings the bus a step closer towards an ACE-like interconnect.
• Conceptually, the coherent bus has an additional virtual snoop request layer with infinite capacity and zero latency - due to the gem5 memory system peculiarity of avoiding intermediate cache-line states, which forces the bus to serve snoop requests at any time.

An abstracted overview of the layered coherent bus is given in Figure 4.3. As a reference, recall the unary-resource-based bus depicted in Figure 3.13.
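The layer mechanics can be summarized with a short, hypothetical sketch of the IDLE/BUSY/RETRY cycle and the retry list (names are illustrative, not gem5's actual classes):

#include <deque>
#include <iostream>
#include <string>

enum class LayerState { Idle, Busy, Retry };

class Layer {
    LayerState state = LayerState::Idle;
    std::deque<std::string> retryList;  // ports waiting for the layer

public:
    // A port asks to use the layer; if the layer is busy, the port is
    // queued and will be retried when the layer frees up.
    bool tryAccess(const std::string &port) {
        if (state == LayerState::Idle) {
            state = LayerState::Busy;
            return true;
        }
        retryList.push_back(port);
        return false;
    }

    // Called when the in-flight transaction completes: enter RETRY,
    // wake the longest-waiting port (if any), then return to IDLE.
    void release() {
        state = LayerState::Retry;
        if (!retryList.empty()) {
            std::cout << "retrying port " << retryList.front() << '\n';
            retryList.pop_front();
        }
        state = LayerState::Idle;
    }
};

int main() {
    Layer requestLayer;
    requestLayer.tryAccess("cpu0");  // granted: layer was idle
    requestLayer.tryAccess("cpu1");  // denied: queued on the retry list
    requestLayer.release();          // wakes cpu1, then back to IDLE
}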


Figure 4.3: Abstracted diagram of the layered bus

This set of improvements was a result of the group's work, as part of joint parallel efforts to enable more accurate modeling of the memory system. Following this change, the bus can easily be extended to a true crossbar interconnect by having a request layer per slave port and a response layer per master port, which is the equivalent of the two multiplexer sets that implement a crossbar.

4.3 ACE transaction modeling

A key contribution of this work lies in a set of ACE-like transactions that have been added to gem5's memory system. The ACE specifications introduce coherence-related transactions and attributes (as discussed in Section 3.1.5), yet gem5 already supports system-level coherency for its current set of transactions. As such, the transactions selected to be modeled needed to provide added value, rather than simply slow down the simulator. Furthermore, it is beyond the scope of this work to model the complete set of ACE transactions, nor would this be a cost-efficient approach when analyzing a system. In order to evaluate which ACE transactions are most important to model, a study was conducted. The most useful input for this study would have been the ACE transaction distribution in real workloads. However, there

is currently neither ACE hardware nor any software utilizing it. As such, we consulted with ARM's CCI-400 design and verification team. This included analyzing which transactions are expected to be most influential, most common, and for what type of master. As a reference point, we were aided by the transaction distributions used throughout the validation process of the product - knowing that these estimations might be misleading. Note that each transaction modeled is described in detail starting in Section 4.3.1. We expect the following distribution patterns:

• CPUs will mostly issue shareable transactions, a few unshared transactions, and cache maintenance transactions, which are expected to be an enabler in the big.LITTLE architecture.
• GPUs will issue mostly ReadOnce read requests, and WriteLineUnique or WriteLine write requests. This assumption was also confirmed by a throughput-computing R&D specialist conducting research in that domain. However, WriteLine requires cache-line (byte-enable) strobes, which currently exist only in a side branch of gem5, and thus WriteLine will not be modeled.
• Video controllers will exhibit a distribution similar to the GPU's.
• Coherent I/O devices are expected to issue unshared reads, ReadOnce reads, unshared writes, and a combination of WriteLineUnique and WriteUnique.

• Barriers and DVM transactions were expected to be issued in relatively minor quantities, yet have a non-negligible impact. However, as discussed in Section 4, due to gem5 limitations this topic is beyond the scope of this research.

When prioritizing the transactions to be modeled, we also considered the expected implementation effort required for each transaction. gem5's coherence protocol has fixed ownership-passing and on-chip data-passing policies, as described in Section 3.5.10. Every non-exclusive read request issued in gem5 is answered with a clean copy. As such, ReadClean, ReadNotSharedDirty and ReadShared would all result in the same end state if modeled as-is in gem5. Hence, since ReadShared is already implemented, ReadClean and ReadNotSharedDirty are not supported. In order to add meaningful support for these two transactions, one must add the notion of a master's capability to receive ownership. Each request must then contain an additional attribute stating whether the responsibility for this address can or cannot be passed on to the initiator. An example of such a case is an I/O-coherent device with a write-through cache, which cannot accept dirty data. This further research possibility is raised as part of Section 7.3. The validation process of all ACE-like transactions is described in Section 5.3, including simulation examples demonstrating each transaction's impact on the system. In the following sections, each ACE transaction that was added to gem5 is described, including:

• Its ACE semantics, including a typical use-case.
• How the ACE specifications were extended to fit a multi-hop interconnect, while complying with the ACE definition when a single-hop interconnect is modeled.
• Implementation details that provide insight as to how similar transactions can be extended and modeled in gem5.

4.3.1 ReadOnce

ReadOnce is a read transaction that is used in a region of memory that is shareable with other masters. Hence the cache line requested might reside in a cache. This transaction is used when a snapshot of the data is required. The location must not be cached locally for future use. The data passed must therefore be clean, meaning ownership of this cache line must not be passed on to the requesting master. A typical use-case for ReadOnce is a GPU that needs to read configuration data, or contents that it will not require for any future use. The shareable domain in such a case would contain the CPUs and the GPU. The main differences between ReadOnce and ReadShared (or simply Read in gem5 terminology) are:

• The response should not result in allocation of a cache line. This is not the same as an uncacheable address, which is memory-region specific and not request-specific.
• The cache-line owner can keep its line in a writable state, even if it is dirty.

• The cache-line owner must not pass ownership.

Therefore, the main benefit of a ReadOnce compared to a ReadShared is that a cache line in M-state will not be forced to O-state, nor will it have to perform a write-back operation or a read-from-memory - saving two off-chip operations. A cache containing an M-state line being snooped by a ReadOnce must supply its data but does not need to modify its line state. This type of transaction must not be generated randomly without any constraints, as it assumes the line is not in the request initiator's cache. A failing scenario would consist of the line being in M-state in the initiator's cache, followed by a ReadOnce issued by the initiator; the result would be the fetching of a stale copy from the main memory. In order to support ReadOnce in gem5, a new packet attribute had to be introduced: noAllocate. This indication implies that the response must not cause allocation in any cache along the traversal of the request. No changes to the bus are required, as ReadOnce is forwarded to all snoopable masters. A more accurate implementation would involve the notion of shareability domains, limiting the set of masters being snooped to those of the same domain as specified in the request. Extending ReadOnce to work in a multi-hop interconnect requires modifying each cache which is snooped by a ReadOnce snoop to forward the snoop as-is, without any modifications to its cache lines. In case of a cache hit, the cache is responsible for providing a response without making any changes to the line state.
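As a rough illustration of the noAllocate idea (the types below are simplified stand-ins, not gem5's actual Packet and cache classes), a cache on the response path hands the data to the requester without filling a line when the flag is set:

#include <cstdint>
#include <iostream>
#include <unordered_map>

struct Packet {
    uint64_t lineAddr;
    bool noAllocate;  // set for ReadOnce-style snapshot reads
};

class Cache {
    std::unordered_map<uint64_t, char> lines;  // addr -> toy line state

public:
    // Called when response data arrives from below.
    void handleFill(const Packet &pkt) {
        if (pkt.noAllocate) {
            std::cout << "pass data up, no line allocated\n";
            return;  // snapshot read: do not cache the line
        }
        lines[pkt.lineAddr] = 'S';  // ordinary read: allocate in Shared
        std::cout << "line allocated\n";
    }
};

int main() {
    Cache c;
    c.handleFill({0x1000, true});   // ReadOnce response
    c.handleFill({0x1000, false});  // regular read response
}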

4.3.2 ReadNoSnoop

ReadNoSnoop is a read transaction that is used in a region of memory that is not shareable with other masters. Note that this does not mean uncacheable: the line can be cached, but not in other masters' caches. The response must not contain shared or dirty data, hence ownership-passing is not permitted. The purpose of this transaction is to avoid issuing snoop requests in vain, in cases where there is a-priori knowledge that this cache line will not reside in any other master's cache. Thus issuing ReadNoSnoop transactions cannot be transparent to the programmer and is derived from the system architecture. To support this type of transaction in gem5, a new packet attribute has to be added, introducing the notion of shareability. Shareability can be interpreted as a generalization of cacheability, since shareable implies cacheable, but non-shareable may be either cacheable or not. A ReadNoSnoop request packet should be marked as non-shareable. The bus model should not forward this request to any other master as a snoop request. Insights about ReadNoSnoop and its possible extension to multi-hop interconnects:

• Since the requested line must not be shared, this is in fact a private case of a ReadExclusive request, where no snooping is performed beyond the transaction's route from the initiating master to either a containing cache or a slave.
• As the request is for an exclusive state, caches along the transaction's traversal can contain the line only in states M (if inclusive) or I. This requirement has to be implemented in the cache model.

ReadNoSnoop cannot be randomly generated, as it assumes a line does not reside in any cache outside the port chain leading from the requesting master to the destination slave. In case the line does reside in another cache, the initiating master might read a stale copy from the destination slave. A typical scenario for ReadNoSnoop would be a coherent I/O device issuing a ReadNoSnoop request to an address which is cacheable (e.g. a cache line-fill request) yet not shared with any other master, such as a private buffer in the main memory.
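The bus-side consequence of the shareability attribute can be sketched as follows (a minimal illustration, assuming a hypothetical isShareable packet flag as described above): snoops are fanned out only for shareable requests.

#include <iostream>
#include <vector>

struct Packet { bool isShareable; const char *cmd; };

struct SnoopPort { int id; };

void forwardRequest(const Packet &pkt,
                    const std::vector<SnoopPort> &snoopers) {
    if (pkt.isShareable) {
        // Shareable: snoop every other master before going to memory.
        for (const auto &p : snoopers)
            std::cout << "snoop " << pkt.cmd << " to port " << p.id << '\n';
    }
    std::cout << "forward " << pkt.cmd << " to memory\n";
}

int main() {
    std::vector<SnoopPort> snoopers{{0}, {1}};
    forwardRequest({false, "ReadNoSnoop"}, snoopers);  // no snoops issued
    forwardRequest({true,  "ReadShared"},  snoopers);  // snoops all masters
}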

4.3.3 WriteNoSnoop

A WriteNoSnoop transaction is used in a region of memory that is not shareable with other masters. A WriteNoSnoop transaction can result from a program action (e.g. a store) or an update of the main memory for a cache line that is in a non-shareable region of memory. WriteNoSnoop responses must never be shared or dirty. Similarly to ReadNoSnoop, the original purpose of this request is to avoid issuing snoop requests in vain when the cache line cannot reside in a different cache in the system. WriteNoSnoop is a useful methodological example for demonstrating the difference between the ACE specification and its extension to a multi-hop memory system:

• In ACE, a master which issues a WriteNoSnoop is directly connected to the main interconnect (CCI). The request is then directly forwarded to the appropriate slave, thus performing a memory write operation.
• In gem5, a master issuing a WriteNoSnoop might be connected to multiple layers of caches and buses along the path to the destination slave. Assuming the master is connected to a cache:
  – In case of a cache miss, the cache will issue a ReadNoSnoop request to exclusively obtain the cache line. Once the cache line is obtained, the line is modified and a write response is provided to the issuing CPU.
  – In case of a cache hit, as the line is not shareable, it is either in E-state or M-state. The cache will update the line and provide a write response to the issuing CPU.
  – Upon eviction of the cache line, since it is a non-shareable line, a WriteBack request can be issued downstream without performing any snooping.

Thus in the context of gem5, a WriteNoSnoop can also trigger a ReadNoSnoop and a non-snooping WriteBack request.

4.3.4 MakeInvalid

A MakeInvalid transaction is used in a region of memory that is shareable with other masters. The transaction ensures that the cache line can be held in a Unique state, by broadcasting an invalidation message to all possible sharers. This permits the master to carry out a store operation to the cache line, but the transaction does not obtain a copy of the data for the master. All other copies of the cache line are invalidated. The request must be of full cache-line size. This request type is a member of the cache maintenance set of transactions provided by ACE. An example of a transaction that utilizes it is WriteLineUnique, which was implemented and is further described in Section 4.3.5. MakeInvalid is the first transaction introduced to gem5 that broadcasts a request: it does not require a response, yet it should be forwarded to all snoopable masters. As the gem5 memory system did not support broadcasts prior to implementing MakeInvalid, broadcasting had to be implemented, as described below. In ACE, a master issues a MakeInvalid request, which is forwarded by the interconnect to all other snoopable masters. In gem5, the broadcast message reaches each coherent bus, which in turn forwards this invalidation-snoop message upstream. The process continues until all snoopable masters in the system have received the invalidation message.

Broadcast support

gem5 already supported transaction types which do not require a response, such as gem5's WriteBack, which blindly writes data downstream with no need of a response. However, such transactions are aimed at a specific, single destination. Once a packet reaches its destination, the request conveyed by the packet is de-allocated. In a broadcast, however, a receiving end must not attempt to free the conveyed request; on the other hand, the memory system must implement some sink for eventually freeing the original request. Since this infrastructure is transaction-independent, it has been developed and posted separately. To implement broadcasting, a new packet attribute was added, which denotes that the packet contains a forwarded request, and thus should not be freed at the receiving master's end, but forwarded on upstream (e.g. to a bus's masters, or to a cache's master port). It is the responsibility of the broadcast initiator to free the request once the broadcast is over. This mechanism resembles the isMemInhibit mechanism used to instantaneously flag (hence broadcast, sharing knowledge between components) that a request has been responded to. This need is a result of the possible multi-level memory systems gem5 supports, and is a generalization of ACE and CCI.
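The ownership rule - receivers of a forwarded broadcast must never free the request, while the initiator frees it once the broadcast is over - can be illustrated with a small sketch. Here a shared_ptr stands in for gem5's manual request lifetime management, and all the types are illustrative only:

#include <iostream>
#include <memory>

struct Request { const char *cmd; };

struct Packet {
    std::shared_ptr<Request> req;  // shared, so forwarding never frees it
    bool isForwarded;              // set once the broadcast fans out
};

// A receiving port either forwards the packet further upstream or, if
// nothing sits above it, acts as a sink and just drops its reference.
void receiveBroadcast(Packet pkt, int upstreamPorts) {
    if (upstreamPorts == 0) {
        std::cout << "sinking " << pkt.req->cmd << '\n';
        return;
    }
    pkt.isForwarded = true;
    for (int i = 0; i < upstreamPorts; ++i)
        receiveBroadcast(pkt, 0);  // toy one-level fan-out
}

int main() {
    Packet p{std::make_shared<Request>(Request{"MakeInvalid"}), false};
    receiveBroadcast(p, 3);
    // The initiator holds the last reference; the request is freed here,
    // only after the whole broadcast has completed.
}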

4.3.5 WriteLineUnique

A WriteLineUnique transaction is used in a region of memory that is shareable with other masters. A single write occurs, which is required to propagate to main memory. A WriteLineUnique transaction must be a full cache-line store, and all bytes within the cache line must be updated. According to ACE, a master issues a WriteLineUnique towards CCI. In turn, CCI performs two operations:

• Sends the write request downstream to a slave,

• Sends MakeInvalid requests to all other snoopable masters.

In gem5, each master port receiving the MakeInvalid broadcast must forward it upstream towards all snoopable masters. The MakeInvalid broadcast is sunk once it reaches a slave port which does not have any further downstream port. This solution is similar to the method used to sink requests in a slave when the request has already been responded to by some cache in the system. The major difference between ACE and the gem5 extension arises when the request results in a hit on an exclusive or modified cache line: in such cases no broadcast has to be performed, as the line is guaranteed not to reside in any other cache in the system. Otherwise, the bus behavior in gem5 is the same as CCI's: the data is written to main memory and an invalidation is broadcast to all other snoopable masters. The purpose of a WriteLineUnique is to inform all masters that might contain a line that its entire contents will be written, and as such there is no reason to perform a WriteBack as would be required when a WriteLine is issued.

4.4 Modeling inaccuracies, optimizations and places for improvement

Although this work made extensive changes to the gem5 memory system, there are still several modeling inaccuracies which should be revised in order to make the memory system more realistic:

• Perfect access speculation: currently a snoop hit results in the memory controller not handling a request but deleting it. This means a speculative fetch to the DRAM was magically avoided. This harms simulation accuracy, as the memory controller should have been occupied - thus we lose the bandwidth, latency, and resource contention effects that would have been observed had the speculative fetch been performed. The afore-mentioned oddity is a result of the express-snooping mechanism. Another inaccuracy caused by this mechanism is the parallel zero-time snooping; a real interconnect might serialize snoop requests, or implement different fetch speculation.
• In addition, snoop misses practically come at zero cost.
• The memory system adopts an optimization which provides a cache line in an exclusive state upon a read request which did not result in any snoop hit. This is not magical modeling, yet it is not a by-the-book implementation of the MOESI protocol.
• Due to all of the afore-mentioned reasons, it seems inevitable to re-design the snooping mechanism such that no zero-time or magical knowledge sharing can be done. The effect of such a change would be noticeable in workloads that generate sufficiently high inter-cluster snoop hit rates (passing data between clusters), while a memory-intensive client simultaneously drains the DRAM's bandwidth. It would be best to correlate gem5's memory system with existing hardware using such a mixed workload.
• The cache-line ownership-passing policy has significant impact on the reported performance. For instance, data is passed on-chip when a read-exclusive snoop request results in a hit on a modified line. A by-the-book implementation would require performing a write-back to update the main memory with the modified line prior to providing the requesting master a clean, exclusive copy. Instead, data is passed on-chip together with ownership of the line. The impact of such assumptions can significantly change overall system performance, and the mechanism might therefore be redesigned to support other protocols.
• It is due to the afore-mentioned trade-offs that the actual benefit of the implemented ACE transactions cannot be demonstrated using the current gem5 memory system. An elaborated discussion with per-transaction-type explanations is provided in Section 7.1.4. Nevertheless, the insight gathered during the process of implementing and extrapolating ACE transactions to gem5 is a significant contribution on its own.
• The bus model now utilizes layers to better model resource contention. However, it still misses a mechanism for limiting the number of outstanding transactions. The work done on the bus brought it significantly closer to a modern interconnect, yet this key feature is still missing.

• Currently barriers are implemented in the CPU models. Offloading barriers to the memory system could have significant impact. The avid reader can find an informal description of ARM barriers in [34], or the exact ACE definition in [6].

• The bus model does not model any PoS or QoS features, nor address striping. These features should be investigated for their impact; e.g. QoS support should be tested to verify the assumptions made during the interconnect's design.

All of the afore-mentioned topics have troubled several members of the active gem5 community. They range from challenging to daunting and are left as future work for well-justified reasons. Nevertheless, important computer-architecture and SoC design conclusions can be drawn based on these features, and as such they should be carefully considered as part of future research. Further topics proposed as future work are provided in Section 7.3.

4.5 Bus performance observability

In order to analyze the interconnect's performance and evaluate the changes made to the bus and the memory system, statistics were added to the models. In Section 3.6.1 the main performance-analysis building blocks gem5 offers were presented. The main advantage of utilizing stats in the bus models and in each layer is the ease of gaining insight from just a few new stats. The current alternative would be to utilize multiple communication monitors, one between each pair of ports which are of interest from a system perspective, and to post-process their data to get a system-wide overview. Utilizing communication monitors also comes at a high performance price compared to stats, as communication monitors result in deeper call stacks, and in additional allocation and de-allocation of the SenderState structs which are attached to each packet passing through a communication monitor. Since statistics collection and dumping has a negative impact on performance, statistics have to be carefully considered. To improve observability with a moderate performance impact, a small set of statistics was added.

Global bus statistics

The following statistics are collected on a bus level, per bus type:

• Throughput: bandwidth (bytes/sec) provided by the bus as a result of servicing packets of any kind, aggregated over all ports. In case the bus is a coherent bus, the throughput is the sum of the data (regular requests and responses) and snoop traffic throughput.
• Transaction distribution: per transaction type (e.g. read request, read exclusive request, write response, upgrade request), the number of transactions serviced by the bus, aggregated over all ports.
• A per-port transaction distribution: per master/slave port, the number of transactions of any type serviced by the bus.
• In both the coherent and non-coherent bus, aggregate data through the bus is monitored and used for calculating the bus's throughput.
• In the coherent bus, aggregate snoop data through the bus is monitored in addition.

The above-mentioned statistics are calculated by aggregating packet data and type, collected upon servicing each incoming packet. The packet's request type and payload size (when applicable) are recorded and averaged over the sampling period.

Layer statistics

The following statistics are collected per bus layer, for gaining resource-specific insight:

• Occupancy: the average number of master or slave ports waiting for the layer to become available; more specifically, ports pending in the layer's retry list.
• Utilization: the percentage of time the layer was occupied (not in IDLE state).
• Average waiting ports: the average number of ports pending for the layer to become available for their service.
• Average waiting time in the retry list: the average time a port waits from its arrival in the retry list until the layer becomes available for service.
• A per-port occupancy: reflecting the time each port waits until the layer becomes available for its service.

The above-mentioned statistics are based on bookkeeping done upon any layer state change.
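As an example of the bookkeeping behind the utilization statistic, consider the following sketch (illustrative, not gem5's actual Stats API): busy time is accumulated on every state change, and the occupied fraction is computed at dump time.

#include <cstdint>
#include <iostream>

using Tick = uint64_t;

class LayerStats {
    Tick busySince = 0, busyTotal = 0;
    bool busy = false;

public:
    void occupy(Tick now)  { busy = true;  busySince = now; }
    void release(Tick now) { busy = false; busyTotal += now - busySince; }

    // Utilization over [0, now): fraction of time not spent in IDLE.
    double utilization(Tick now) const {
        Tick b = busyTotal + (busy ? now - busySince : 0);
        return now ? static_cast<double>(b) / now : 0.0;
    }
};

int main() {
    LayerStats s;
    s.occupy(100);
    s.release(600);                            // busy for 500 ticks
    std::cout << s.utilization(1000) << '\n';  // prints 0.5
}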

4.6 Conclusions

• Making changes to gem5's memory system requires in-depth understanding of a vast amount of code. Specifically, the bus and cache mechanisms are far from trivial and require a very intensive study period before changing a single character.
• In order to make any changes to such complex systems, it is best to have proper documentation, convenient observability, and fine-grained test rigs for experimenting with changes before invoking a real workload such as BBench.

• Such debugging infrastructure has been added to better visualize complex interactions in the memory system.
• Such a fine-grained testing infrastructure has been created, based on a MemTest unit-test-like tester which is further described in Section 5.1.
• Applying real-world specifications to an exploration infrastructure introduces very challenging design questions. These questions can lead to insight about how a system or interconnect should be designed. Some of the questions raised are provided as suggested future-work topics in Sections 4.4 and 7.3.

Chapter 5

Implementation and Verification Framework

Before commencing any code changes to gem5's memory system, in-depth knowledge of the transaction flows and the implemented coherence protocol had to be acquired. gem5 includes a convenient verbosity system which enables triggering informative messages as needed, yet these were not sufficient for bridging the knowledge gap. Furthermore, no workloads were available that were small enough to isolate sequences and interactions. For this reason, additional debug information was added to the existing mechanism, and a small-scale testbench named MemTest was used. In this chapter MemTest is described, focusing on improvements made that are useful for testing and making changes to the memory system. It is then used for validating the ACE transactions. Lastly, the statistics that have been added to the bus model for improved observability are discussed.

5.1 MemTest

This project required making several types of changes to gem5's memory system, all throughout its hierarchy: in buses, caches, packet and request methods, and so forth. As gem5's memory system and the interactions which occur in a system are very intricate, a fine-grained tool for learning and testing small-scale sequences of events was required. gem5 already had a unit-test-like infrastructure called MemTest, yet it suffered from various problems and was in general obsolete. The main changes included:

• The old MemTest was based solely on random generation of requests. However, it did not generate any true sharing, as each CPU would generate a request to a specific byte in a cache line, according to its unique CPU ID. The updated MemTest generates truly random requests, and thus generates true sharing. As such, MemTest can serve as a functional tester or small-scale regression for the memory system.
• In order to enable fine-grained manual injection of interesting sequences of instructions into the memory system, scenario support was added to MemTest. Each scenario contains a list of instructions to be injected, including basic details such as the CPU that should initiate the request, the type of the request, the time at which the request should be initiated, and so forth. This boosted development and verification progress and is demonstrated in Section 5.2.
• The original MemTest composed test systems in a recursive manner, which eventually created awkward systems. Figure 5.1 contains a '3:2:1' (hence three L2 caches, each connected to two L1 caches, each connected to a CPU) system using the original MemTest. Figure 5.2 contains a functionally equivalent '3:2:1' system using the improved MemTest. The visual difference makes it much easier to understand the simulated system.
• Supporting any test-system hierarchy: a MemTest system is defined using a fanout-like tree specification given as a colon-separated string, where each level is defined by the number of caches (or CPUs in the last level) per cache in the previous level. For example, a '2:1' specification describes a system with two caches, each connected to a CPU. A '2:3:1' specification describes a system with two L2 caches, each connected to three L1 caches, each of which is connected to one CPU. While the original MemTest supported a limited set of specifications, the modified MemTest supports any specification (a sketch of parsing such a specification follows this list). Note that the purpose of the funcmem is functional checking of each performed request; this memory is a scoreboard and not part of the simulated system.
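As a sketch of how such a colon-separated tree specification can be interpreted (the parser below is illustrative, not MemTest's actual configuration code), each field gives the fan-out at one level, and the cumulative product gives the number of units per level:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<int> parseTreeSpec(const std::string &spec) {
    std::vector<int> fanout;
    std::stringstream ss(spec);
    std::string field;
    while (std::getline(ss, field, ':'))
        fanout.push_back(std::stoi(field));
    return fanout;
}

int main() {
    // "2:3:1": two L2 caches, three L1 caches under each L2,
    // and one CPU per L1, giving six CPUs in total.
    auto levels = parseTreeSpec("2:3:1");
    int units = 1;
    for (size_t i = 0; i < levels.size(); ++i) {
        units *= levels[i];
        std::cout << "level " << i << ": " << units << " units\n";
    }
}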


Figure 5.1: A 3:2:1 old (unintuitive) MemTest system


Figure 5.2: A 3:2:1 new MemTest system

5.2 ACE-transactions development

A major part of this project required adding support for ACE-like transactions to gem5's memory system. Development and testing of these transactions was heavily based on MemTest, as it allowed injecting any stimuli and covering corner cases easily. Each newly-introduced transaction was tested using:

• Scenarios in which other caches in the system contain the accessed cache line in states M/O/E/S/I, in order to verify that caches were correctly modified and the cache-line after-state is as required. Each scenario contained a warm-up stage meant to bring the system to the required situation before the transaction under test is issued.
• Simple and complex systems, for instance systems with two and four CPUs, with or without level-2 caches, to verify snooping is forwarded and handled correctly - hence systems with either single- or multi-hop interconnects.
• Either the Atomic or Timing memory mode, to verify functional correctness in both modes.

5.3 ACE transaction verification

The following section describes how each of the ACE-like transactions was tested. Since a verification process covering all situations for all test systems would be exhaustive and lengthy, only illustrative traces are provided. All provided examples are for the Timing memory mode. The provided snapshots contain the injected scenario and snippets from gem5's log which contain any initiation or completion of a request, as well as main cache activities such as cache-line state changes. Note that all addresses specified in the input scenario are offsets that are added during simulation on top of a base address (0x10000 in this case), thus the logs will contain a different address. Time units are simulation ticks, which are equivalent to 1 ps. Requests are always issued with a sufficiently long break between consecutive transactions such that there is never more than one transaction in flight.

5.3.1 ReadOnce

This section contains a simple scenario for testing a ReadOnce request to a line which resides in a cache in the system in M-state. The test system is a simple '3:1' MemTest system depicted in Figure 5.3.


Figure 5.3: MemTest 3:1 system

The test scenario is trivial:

• A Write request to address 0 is issued by CPU 0 (system.cpu0) at time 0. As a result, CPU 0's cache (system.l1c0) will issue a read-exclusive request to obtain the cache line in E-state, and then modify the line (updating it to M-state) due to the write request.
• A ReadOnce request to the same address is issued by CPU 2 (system.cpu2) at time 3000. The request is sent to the bus, forwarded (as a snoop read) to all other caches, and handled as a snoop hit by system.l1c0, which later provides a snoop response with the data while keeping its line in M-state. The test ends once the ReadOnce response arrives at CPU 2.

The scenario and a snippet of the simulation log are provided in Figure 5.4.

Figure 5.4: ReadOnce for an M-state line

5.3.2 ReadNoSnoop

This section contains a test scenario for testing a ReadNoSnoop request to a line which resides in a cache in the system in E-state. The test system is a simple '3:1' MemTest system depicted in Figure 5.3. The test scenario is very simple:

• A ReadEx (read exclusive) request to address 0 is issued by CPU 0 (system.cpu0) at time 0. As a result, CPU 0's cache (system.l1c0) will issue a read-exclusive request to obtain the cache line in E-state.
• A ReadNoSnoop request to the same address is issued by CPU 2 (system.cpu2) at time 3000. The request is sent to the bus, yet not forwarded to any other master, but directly sent to the main memory, which provides the response. CPU 2's cache handles the response and allocates a cache line. In the end state, the cache line is in E-state.

This scenario obviously breaks coherency in the system, yet it is injected for testing and demonstration purposes. The purpose of ReadNoSnoop is to avoid snooping in vain, and what we demonstrated is that the bus indeed does not forward ReadNoSnoop requests upstream as snoops.

Figure 5.5: ReadNoSnoop for an E-state line

5.3.3 WriteNoSnoop

This section contains two test scenarios for testing a WriteNoSnoop request:

• A scenario for demonstrating WriteNoSnoop as defined in the ACE specifications, in a system which utilizes a single-hop interconnect, as depicted in Figure 5.6:


Figure 5.6: MemTest 3 system

In this scenario, CPU 2 issues a WriteNoSnoop request to address 0 at time 0. As a result, no snooping is performed and the bus directly sends the request to the memory, which provides a completion response. The scenario and a snippet of the simulation log are provided in Figure 5.7.

Figure 5.7: WriteNoSnoop in an ACE context

• A scenario for demonstrating WriteNoSnoop in a simple multi-hop interconnect, as depicted in Figure 5.3.

Figure 5.8: WriteNoSnoop for an E-state line in a multi-hop interconnect

– CPU 0 initiates a ReadEx (read exclusive) request to address 0 at time 0. CPU 0's cache receives the line and stores it in E-state.
– CPU 2 initiates a WriteNoSnoop request to address 0 at time 3000. CPU 2's cache initiates a ReadNoSnoop request, receives the line, stores it in E-state and then modifies it using the write request's data. As such, the cache-line state changes from E to M. The bus does not issue any snooping during the write operation.

Also in this case, the toy example breaks coherency, as two caches in the system now contain different values representing the same address. Yet this illegal scenario is for demonstration purposes only.

5.3.4 WriteLineUnique

WriteLineUnique triggers the generation of MakeInvalid in its ACE context, and thus MakeInvalid is also covered in this section. Similarly to Section 5.3.3, two scenarios are provided: one demonstrating an ACE implementation, and another a multi-hop interpretation. Both scenarios make use of a two-cluster system as depicted in Figure 3.8(b).

• In its ACE implementation, when the bus receives a WriteLineUnique request, it issues a MakeInvalid broadcast to all other masters to invalidate their copies of the cache line being accessed. Such a scenario is provided in Figure 5.9.

– The first two instructions are issued to create a situation with a line in O-state: CPU 0 issues a write request to address 0 at time 0, eventually leading to the line being in M-state in its cache (system.l1c0). A following read request from CPU 1 at time 1000 leads to system.l1c0 moving from M-state to O-state and passing a copy to CPU 1's cache. The end result of these first two instructions is a system with two copies of the same cache line, one in O-state (system.l1c0) and one in S-state (system.l1c1).
– At time 3000, CPU 3 issues a WriteLineUnique to address 0. As this write instruction is of cache-line size, its level-1 cache allocates the line and updates it to M-state. An invalidation (MakeInvalid) broadcast is sent downstream and forwarded to all snoopable masters. The copies in caches system.l1c0 and system.l1c1 are both invalidated and the transaction ends.

• In the gem5 context, a WriteLineUnique issued by a master can be received by an intermediate cache. In such a case, if the cache line is in the Unique-Clean (E) or Unique-Dirty (M) state, there is no need to broadcast an invalidation request, as the coherence protocol ensures the line will not reside in any other cache downstream in the system, nor in any other cluster.

Figure 5.9: WriteLineUnique for an O-state line in an ACE context

Figure 5.10: WriteLineUnique in a multi-hop context

– The system wakes up, and at time 3000 CPU 3 issues a Read request to address 0. This eventually leads to the cache line being in E-state - and not, as one might have expected, in S-state - due to an optimization described in Section 4.4, since the line does not reside in any other cache in the system at that moment.
– At time 4000, CPU 3 issues a WriteLineUnique. While in the classic ACE context this would have triggered an invalidation broadcast, here, since the line is already in E-state, there is no need to issue one. The cache-line state is updated to M-state and the transaction ends.

5.4 Conclusions

• MemTest proved to be essential for studying small-scale interactions in gem5's memory system, and an enabler for rapidly introducing new types of transactions. It has been demonstrated to be an easy tool for creating any sought situation in the memory system, even a coherency-breaking one.
• All ACE-like transactions were verified using MemTest.
• Automatically visualizing test systems makes verifying a test system's structure trivial. This feature is further discussed in Appendix A.
• The afore-mentioned changes are useful for a large crowd of gem5 users. Specifically, they provide users with more convenient tools and improved system observability.

Chapter 6

Performance Analysis

6.1 Metrics and method

Throughout the experiments, the following indicators will be used to investigate a system's performance:

• General system, CPU and cache statistics: test execution time (guest time), CPU idle-cycle percentage, CPU memory references, L1 I-cache miss rate, L1 D-cache miss rate, L2 miss rate.
• Memory stats: DRAM read / total bandwidth, aggregate total / read data through the bus.
• Bus stats: throughput, request / response / snoop response layer utilization, average time in the snoop response layer retry list, aggregate data and snoop data through the bus.
• Simulator performance: simulation time (wall-clock), host instruction rate, host tick rate. This is meant for evaluating how reasonable it is to run various sizes of workloads.

All information gathered will be based on gem5's stats mechanism, which was described in Section 3.6.1.

6.2 Hypotheses

The importance of an engineered post-experiment hypothesis section is mostly methodological. The purpose of this section is to state the expected trends and behavior throughout the simulations covered in this chapter. Regarding the impact of snoop response latency, and queued responses in general:

• Little if any change should be seen when sweeping the snoop response latency upwards for all tests with a single core. There might be some impact, as the system also contains an I/O cache which can be snooped, yet the sharing patterns will most probably be negligible.
• For larger systems, we expect all system indicators to degrade as the snoop response latency knob is scaled up.
• For all experiments, we expect simulations utilizing the CoherentBus (and not the DetailedCoherentBus) to consistently perform better, as its temporal behavior is less accurate, in too good a way.
• These hypotheses are based on an underlying assumption that PARSEC benchmarks exhibit intensive inter-core and inter-cluster sharing patterns. As inter-cluster sharing decreases, so will the impact of the snoop response latency (if there is no sharing between clusters, there is no snoop-hit traffic on the interconnect, and then no responses are penalized).
• Due to the acute simulation performance problem discussed in Section 3.6.2, running lengthy simulations with multiple A15 O3 cores would not be feasible. For instance, a BBench run on a four-core platform would require weeks of processor time.

Regarding the impact of splitting the bus into layers for improved resource contention modeling:

• There is no question about this move's necessity: the previous single-resource bus was an outdated model. Since the bus has been changing rapidly, we cannot establish a fair comparison of old versus new.

Regarding evaluation of ACE transactions:

• As briefly described in Section 4.4, and as will be elaborated in Section 7.1.4, the current state of gem5's memory system does not provide a reference system for conducting a non-discriminating comparison.

6.3 Small-scale workloads

We are interested in evaluating the impact of snoop response service latency and bus resource contention on a system's performance. We aim to demonstrate how the added bus statistics enable observing what the system bottlenecks are. In order to stress the bus such that snoop response latency makes a difference, we have to utilize workloads which are inherently multi-threaded with intensive sharing patterns, since a lack of sharing means no snoop hits will occur and thus no snoop response traffic will be observed. In order to evaluate a set of features, it is wise to start with small-scale workloads. In such workloads, oriented to stress specific features, there is less noise in the system; in a full system running a Linux OS, for instance, it is harder to reason about small-scale effects. Furthermore, small-scale simulations are also a form of verification of the work done, reassuring us that the implementation is correct and functions as required. A convenient tool for small-scale explorations is MemTest, which was discussed in Section 5.1. MemTest can be used to conveniently generate any arbitrary symmetric system, and can be easily configured to inject transactions into the memory system in interesting patterns.

6.3.1 MemTest experiments setup

The most primitive yet representative system that can be used for an experiment which deals with snoop traffic on the bus is depicted in Figure 6.1.


Figure 6.1: Small-scale snoop-response test system

The test system:

• The system and stimuli were designed to provide controllable and significant sharing behavior and snoop traffic through the bus.
• Contains two MemTest CPU transaction generators. Each of them is configured to generate 10,000 random read or write transactions, with a 65% probability of issuing a read request.

• The memory size is configurable. Each generated request is issued to a random destination address within this memory range. As a result, the smaller the memory, the higher the likelihood of an access resulting in a snoop hit (triggering a snoop response).
• Each CPU has a private 32 KB cache.
• Both caches are connected to the main interconnect. The interconnect's snoop response latency can be adjusted.
• Each system was simulated with memory sizes ranging from 128 B to 32 MB. Beyond that range, the chance of randomly sharing a cache line is not interesting in this context, as its impact on system performance will be negligible.

• Each system was simulated using a range of snoop penalty latencies (from 10 ns to 90 ns), and in addition using the previous coherent bus model (CoherentBus) as a reference. A sketch of the stimulus generation follows this list.
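The stimulus described above can be approximated with the following sketch (the code structure and parameter names are illustrative, not MemTest's actual implementation; the parameter values are those of the setup above): each generator issues 10,000 uniformly random requests, 65% of them reads, into the configured memory range.

#include <cstdint>
#include <iostream>
#include <random>

int main() {
    const uint64_t memSize = 128;  // swept from 128 B to 32 MB
    const int numRequests = 10000;
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<uint64_t> addr(0, memSize - 1);

    int reads = 0;
    for (int i = 0; i < numRequests; ++i) {
        bool isRead = coin(rng) < 0.65;  // 65% read probability
        uint64_t a = addr(rng);
        (void)a;  // here the request would be issued to the memory system
        reads += isRead;
    }
    // The smaller memSize is, the more likely two CPUs touch the same
    // cache line, and hence the more snoop hits the bus must service.
    std::cout << reads << " reads of " << numRequests << " requests\n";
}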

6.3.2 MemTest experiments results

Results for the MemTest-based set of experiments are provided in Figure 6.2. In all the observed metrics, the results demonstrate four separate phases:

• At the lowest range of the x-axis scale, which represents the memory address range to which random requests were issued (roughly 10^2 to 10^3 bytes), execution times (depicted in Figure 6.2(a)) decrease as the shared memory area increases. At this point the aggregate snoop data through the bus (depicted in Figure 6.2(c)) is maximal and decreasing, as is the miss rate. This can be attributed to the tiny shared address space: the likelihood that each cache contains an address required by the other cache is highest. The reason for the relatively high execution times is the higher miss rates (depicted in Figure 6.2(g)) and the additional snoop traffic through the bus. The reason for the increased miss rate is the high probability that cache 0 will try to write to a line which resides in cache 1. Such an access invalidates cache 1's line, thus increasing cache 1's miss rate on subsequent instructions.

• As the shared range increases towards 10^4.5 bytes, the increased shared memory range causes fewer such cross-invalidations (observable in Figure 6.2(b)), resulting in lower cache miss rates and thus shorter execution times.
• In the range around the cache size we observe a sharp change in all parameters. Obviously, once the shared memory range is larger than the cache size, the miss rate rises sharply, degrading performance in all metrics. This is also observable in the deep dive in aggregate snoop data, seen in Figure 6.2(c).
• Beyond 10^5 bytes the sharing patterns decay, and the system becomes limited by the physical memory, as the DRAM bandwidth seen in Figure 6.2(f) saturates.

The results are all consistent with our hypotheses:

• As seen in Figure 6.2(a), the reference bus always outperforms the detailed coherent bus.
• In systems which utilize the DetailedCoherentBus, as the snoop penalty rises, the system's performance decreases.

The set of metrics used provided insight about the system's bottlenecks for any of the configurations.

6.4 Large-scale workloads

The small-scale experiments provided almost absolute observability regarding the impact of the bus's snoop response latency on system performance as a function of shareability: how much the clusters actually snoop-hit each other. Although the small-scale experiments used toy systems, they provided meaningful insight, and the confidence level in the changes made has been partially established. In order to evaluate the same phenomena on larger-scale, full systems with real-life workloads, we use PARSEC [17] benchmark tests.

51 100 0.0025 Coherent Bus 1200000 10 ns Coherent Bus 30 ns 10 ns 80 0.0020 50 ns 1000000 30 ns 70 ns 50 ns 90 ns 70 ns 800000 60 90 ns 0.0015

600000 40 0.0010 Coherent Bus 400000

Execution time (seconds) 10 ns

30 ns utilization - layer occupancy (%) 0.0005 20

50 ns total snoop data through layer (bytes) 200000 70 ns 90 ns 0.0000 0 0 2 3 4 5 6 7 8 2 3 4 5 6 7 8 4 5 7 10 10 10 10 10 10 10 10 10 10 10 10 10 10 102 103 10 10 106 10 108 shared memory area size (bytes) shared memory area size (bytes) shared memory area size (bytes) (a) Execution time (b) Snoop response layer utilization (c) Aggregate snoop data through the bus

1e9 7 2.0 10 1010 Coherent Bus Coherent Bus 10 ns 1.8 10 ns 9 6 30 ns 10 10 30 ns 50 ns 1.6 50 ns 70 ns 70 ns 108 90 ns 1.4 90 ns 105

1.2 107

104 1.0 Coherent Bus 106

bus throughput (bytes/s) 10 ns 0.8 30 ns 103 50 ns 105 0.6

total (non snoop) data through the bus (bytes) 70 ns Total bandwidth to/from this memory (bytes/s) 90 ns 2 0.4 4 10 2 3 4 5 6 7 8 2 3 4 5 6 7 8 10 4 5 7 10 10 10 10 10 10 10 10 10 10 10 10 10 10 102 103 10 10 106 10 108 shared memory area size (bytes) shared memory area size (bytes) shared memory area size (bytes) (d) Aggregate data (non-snoop) through the (e) Bus throughput (data and snoop) (f) DRAM throughput bus

1.0 Coherent Bus 10 ns 0.9 30 ns 50 ns 70 ns 0.8 90 ns

0.7

0.6 miss rate for overall accesses

0.5

0.4 4 5 7 102 103 10 10 106 10 108 shared memory area size (bytes) (g) Cache miss rate

Figure 6.2: Simulation results for MemTest-based snoop-response latency experiments

6.4.1 PARSEC experiments

PARSEC is a benchmark suite composed of multi-threaded programs. The suite was designed to be representative of next-generation shared-memory programs for chip multiprocessors. Initial exploration of PARSEC benchmarks performed at ARM on systems with two CPUs showed little sensitivity to the L2 hierarchy (whether shared or private), as well as insensitivity to snoop latencies. Our hypothesis was that the modeling performed was partial and that the current bus model would demonstrate variations in the results as a function of the snoop response latency. All experiments are run on the target-platform gem5 model as presented in Section 3.6, which contains four cores, or slimmed versions with one or two cores. Since invoking a simulation with four A15 O3 cores was seen to be problematic performance-wise (as discussed in Section 3.6.2), simulations were run using Timing CPUs. The following PARSEC benchmarks were used:

• Blackscholes: financial analysis: an option-pricing kernel that uses the Black-Scholes partial differential equation (PDE). Involves a large number of read requests and very few write requests. Exhibits a light and regular communication pattern (each core has a consumer-producer relation with a single other core: neighborhood communication).
• Bodytrack: a computer vision application: a body-tracking application that locates and follows a marker-less person. Exhibits a very light (sparse) communication pattern (only a few of the threads have a consumer-producer relation with a different core).
• Canneal: an engineering workload: a cache-aware simulated-annealing kernel which optimizes the routing cost of a chip design. Consumes a very high DRAM bandwidth, and exhibits a very high IPC and high cache and DTLB miss rates. Exhibits an extremely heavy sharing pattern (each core communicates intensively with the rest of the cores).
• Fluidanimate: fluid dynamics animation: a fluid dynamics application that simulates physics for animation purposes with the Smoothed Particle Hydrodynamics (SPH) method. Exhibits high extra-CPU utilization, and a light and regular communication pattern (similar to Blackscholes).
• Swaptions: financial analysis: prices a portfolio of swaptions with the Heath-Jarrow-Morton (HJM) framework. High extra-CPU utilization; dense but irregular communication pattern.
• Streamcluster: data mining: a stream-clustering data-parallel algorithm used for finding medians. Exhibits low sharing behavior, and uses shared memory mostly as read-only. Its communication pattern is similar to Blackscholes' (neighborhood communication), yet less intensive.

• Vips: media processing: an image processing application; exhibits an extremely high L2 miss rate (~50%) and high DRAM write bandwidth. Exhibits an irregular communication pattern.
• x264: media processing: an H.264 video encoding application; spawns many threads (not more than 4 simultaneously), stresses the I-side (I-cache, branch prediction, ITLB), and exhibits an irregular communication pattern.

The provided analysis and description of each of the benchmarks is based on [15, 17, 18] and previous work conducted at ARM. As PARSEC is multi-threaded, it is important to understand the communication patterns between CPUs in each benchmark, as each benchmark may exhibit different sharing or consumer-producer patterns. Figure 6.3 provides insight into the sharing patterns each PARSEC benchmark exhibits on a multi-core platform. The gray level of each cell in a matrix represents how intense the communication between two CPUs is, with darker shades representing intensive communication. The axis units are CPU IDs. The PARSEC benchmark suite offers six standardized input sets. According to [18], larger input sets guarantee the same properties as all smaller input sets. As such, to enable exploration of a large set of configurations, the simsmall input set was used. According to [17] the PARSEC benchmark set is diverse in the synchronization methods its benchmarks use, working sets, locality, data sharing and off-chip traffic. This makes PARSEC a representative workload bundle for our needs.

6.4.2 PARSEC results The results provided in the following section are based on statistics collected during the benchmark runs. Each simulation requires a Linux OS to be booted prior to starting the test, yet the boot process can severely affect the results, especially when using small datasets whose run time is short compared to the boot time. As such, statistics were reset after the boot process, and the results shown cover the benchmark period only.

Figure 6.3: Normalized communication between different CPUs during the entire parallel phase of the program for the PARSEC benchmark suite (panels: blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, streamcluster, swaptions, vips, x264; axes: producer CPU vs. consumer CPU; darker cells indicate more communication). Source: [15]

The selected set of results aims at providing insight into system performance and into how memory-intensive each benchmark is. Where normalized results are provided, normalization is done per configuration (number of cores, number of threads, and benchmark used) using the result of the CoherentBus as a reference point. Hence, the normalized values presented are the ratio between each of the DetailedCoherentBus's statistics and that of the CoherentBus. Results are only provided where a reference value exists, as the sketch below illustrates.
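As an illustration of this normalization convention, consider the following minimal Python sketch; the statistic names and values are made up for the example and do not come from gem5's actual output.

    # A minimal sketch of the per-configuration normalization described above.
    def normalize(detailed_stats, reference_stats):
        """Return per-stat ratios DetailedCoherentBus / CoherentBus.

        Values are only produced when a reference value exists and is
        non-zero, mirroring the convention used in this chapter.
        """
        normalized = {}
        for name, value in detailed_stats.items():
            ref = reference_stats.get(name)
            if ref:  # skip missing or zero reference values
                normalized[name] = value / ref
        return normalized

    # Example: normalized execution time for one configuration
    detailed = {"sim_seconds": 1.68}
    reference = {"sim_seconds": 1.60}
    print(normalize(detailed, reference))  # {'sim_seconds': 1.05}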

Execution time and simulator performance From a system-level point of view, the two most important metrics are execution time and energy. There is ongoing work on adding power modeling to gem5, yet this is currently unavailable. The execution times presented are for a parallel execution, hence they represent guest time and not an aggregation of the processing times of all CPUs. This section provides an overview of PARSEC execution (guest system) times and simulator performance. Results are for the CoherentBus (and not the DetailedCoherentBus), which serves as the reference bus in the rest of the provided results.

Figure 6.4: General simulation and simulator-performance stats for PARSEC benchmarks, per core/thread configuration. Panels: (a) execution times overview, (b) normalized execution times, (c) simulation time, (d) host instruction rate, (e) host tick rate.

An overview of the execution times demonstrates the rich variety even when using the simsmall dataset: tests take from less than a second to more than a minute of guest time to finish. The benchmarks differ significantly in their behavior:
• Most of the benchmarks, e.g. blackscholes, canneal, and swaptions, demonstrate decreasing execution times as more cores are available. This can be observed both from Figure 6.4(a) and from the normalized execution times in Figure 6.4(b).
• In the case of streamcluster, using two threads per core significantly crippled performance. It should be stated that the CPU model used models an in-order core, and results are expected to differ significantly with a multi-threading-capable CPU.
Simulator performance results are provided to give the reader a glimpse of gem5's capabilities, not as part of a scientific investigation. They provide insight into which activities slow down the simulation. The

host instruction rate results provided in Figure 6.4(d) demonstrate that for the benchmarks and CPU model used, the impact of simulating multiple cores is negligible - a positive piece of information. This of course comes at the price of accuracy, as the O3 model, which provides more accurate results, has demonstrated severe degradation as the number of CPUs increases. The host instruction rate can be seen as the simulator's horsepower: how much computation can be done in a unit of time. The simulator's tick rate in Figure 6.4(e), on the other hand, represents an imaginary clock frequency: the rate at which guest time is simulated compared to host time. A tick represents one picosecond. Since the simulated system contains components which utilize different clock frequencies, this is obviously an over-simplification. The simulator's performance is therefore a combination of the two. During simulations of the reference bus, the simulator exhibited an average of roughly 1.2 MIPS. Canneal was the main culprit, demonstrating that intensive inter-core communication slows down the simulator significantly; this observation is most visible in Figure 6.8(a), and by examining canneal's communication pattern in Figure 6.3. The arithmetic relating these metrics is sketched below.
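The following minimal Python sketch makes the relation between the two performance metrics explicit; the sample values are invented for illustration and merely chosen to land near the ~1.2 MIPS average reported above.

    # Illustrative arithmetic for the two simulator-performance metrics.
    TICKS_PER_SECOND = 1e12  # gem5 ticks represent one picosecond

    host_seconds = 3600.0        # wall-clock time on the host
    guest_ticks = 30e9           # simulated ticks consumed by the run
    host_instructions = 4.3e9    # guest instructions simulated

    tick_rate = guest_ticks / host_seconds        # ticks/s ("imaginary clock")
    inst_rate = host_instructions / host_seconds  # inst/s
    slowdown = TICKS_PER_SECOND / tick_rate       # host seconds per guest second

    print(f"tick rate: {tick_rate:.3g} ticks/s, "
          f"instruction rate: {inst_rate / 1e6:.2f} MIPS, "
          f"slowdown: {slowdown:.0f}x")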

DRAM performance This section deals with the observed usage behavior of gem5's main memory model, which represents both a memory controller and the dynamic/external memory itself. These results provide insight regarding off-chip memory accesses, which typically severely impact both performance and power consumption.

Figure 6.5: Total DRAM bandwidth for PARSEC benchmarks, per core/thread configuration. Panels: (a) total DRAM bandwidth, (b) DRAM read bandwidth, (c) aggregate data read from the DRAM, (d) normalized aggregate data read from the DRAM, (e) aggregate data written to the DRAM.

The observed total and read DRAM bandwidths appearing in Figures 6.5(a) and 6.5(b) respectively are very much benchmark-dependent. Canneal demonstrates a much higher bandwidth than all other benchmarks. All benchmarks demonstrate scalable bandwidth, dependent on either the number of threads (as in vips) or only on the number of CPUs (as in fluidanimate). To demonstrate the claimed scalability and provide a clearer comparison, the normalized aggregate data read from the DRAM is depicted in Figure 6.5(d). In most benchmarks there is a distinct scaling: the read bandwidth scales up while the amount of data read scales down. We assume that the parallelizable workloads issue roughly the same number of read requests per time unit per CPU as a single-CPU system would, causing the increase in total bandwidth. The down-scaling of aggregate data read is due to the added caches in systems with more CPUs. This behavior demonstrates that the gem5 memory system currently behaves as a distributed shared cache, and that snoop traffic overheads might not be realistic enough. This pitfall was dealt with in this work, yet the added snoop response latency is just one link in a chain of changes required to model such systems

more realistically. On average each benchmark reads about 1.5 GB of data, albeit with different patterns and execution times. Regarding written data, depicted in Figure 6.5(e): unlike the aggregate data read, here the benchmarks vary significantly in the total amount of data written. At most 0.4 GB is written (in vips), and in some benchmarks almost no data is written. Only in vips does the amount of data written scale down as the number of threads increases. Again, these differences are reasonable as the benchmarks differ significantly.

CPU and cache stats The statistics presented in this chapter focus on Timing CPU and cache statistics currently available in gem5. The abundance of available statistics is impressive, yet here we focus on the statistics most relevant to this investigation, in order to identify system bottlenecks in the memory hierarchy. The importance of these statistics is the insight they provide about the characteristics of each benchmark, such as how memory-intensive a benchmark is.

Figure 6.6: Timing-model CPU statistics for PARSEC benchmarks, per core/thread configuration. Panels: (a) CPU idle cycles percentage, (b) CPU memory references, (c) CPU memory references per second.

Figure 6.6(a) is provided under the assumption that a CPU in such benchmarks is idle mostly when waiting for a memory operation to finish. As our CPU model is not an O3 model, this should provide a clear indication of how memory-performance-limited the system is. There is a visible scaling of the idle cycle percentage as more cores are added, yet not necessarily a linear increase. Another means of understanding the impact of the memory system is to understand the qualities of the benchmark. By observing the rate of memory references in Figure 6.6(c) we can see how memory-intensive each benchmark is, and how its dependence on the memory system scales as more CPUs are added. There are several different patterns; for instance, for blackscholes we can observe a linear decrease as a function of the number of CPUs in the system. The swaptions benchmark, on the other hand, exhibits a fairly constant rate of memory references. In streamcluster, adding a second thread per core significantly increases the number of memory references. To complete the picture of how a benchmark depends on the memory system, we must also observe the cache and interconnect performance.

Figure 6.7: L1 and L2 cache miss rates for PARSEC benchmarks, per core/thread configuration. Panels: (a) L1 I-cache miss rate, (b) L1 D-cache miss rate, (c) L2 miss rate (all in misses/accesses).

Caches are designed to decrease access times under two main assumptions: temporal and spatial locality. While

high I-cache miss rates are related to code sparsity, D-cache miss rates indicate how scattered the accessed data is. As such, workloads can exhibit very different I-cache and D-cache miss rates, as code and data are orthogonal. For instance, the x264 benchmark demonstrates a relatively high I-cache miss rate in Figure 6.7(a) compared to other benchmarks, and low D-cache miss rates in Figure 6.7(b). A higher miss rate moves the performance bottleneck lower in the memory system hierarchy, towards the L2 cache, the main interconnect, and the external memory. L2 miss rates are provided in Figure 6.7(c) and demonstrate again how memory-intensive canneal is. Comparing Figures 6.7(c) and 6.6(c) enables us to learn not just how memory-intensive a workload is, but how sparse and inconsistent its memory accesses are. For instance, while x264 is the most memory-intensive in terms of memory reference rate, its miss rate is amongst the lowest in the benchmark set. Neither L1 cache exhibits any exceptional patterns. However, the L2 miss rates of blackscholes, bodytrack and swaptions in Figure 6.7(c) demonstrate a 3x to 6x increase in four-CPU configurations compared to other configurations. This can be attributed to the fact that four-CPU configurations have two L2 caches, thus data that was once stored in a single shared L2 cache is now split between two caches. This may be a useful indicator for characterizing inter-cluster sharing patterns in a workload. Obviously, the L2 miss rates strongly correlate with the DRAM bandwidth provided in Figure 6.5(a). All miss rates provided are the ratio between total misses and total accesses performed. For systems with more than a single cache per level, the value represents that of the lowest-indexed cache (hence the left-most in a system block diagram such as Figure 6.11) of that level, and not an average. This decision is mostly for post-processing convenience, under the assumption that thread dispatching by the Linux SMP scheduler is uniform. A minimal sketch of this convention follows.
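The following Python fragment sketches this miss-rate convention; the cache naming and the (misses, accesses) tuples are illustrative assumptions rather than gem5's exact statistic layout.

    # Miss rate = misses / accesses, taking the lowest-indexed cache of a
    # level (e.g. 'l20' before 'l21') instead of averaging across caches.
    def level_miss_rate(stats, level_prefix):
        caches = sorted(k for k in stats if k.startswith(level_prefix))
        misses, accesses = stats[caches[0]]  # lowest-indexed cache only
        return misses / accesses if accesses else 0.0

    stats = {"l20": (1200, 40000), "l21": (900, 38000)}
    print(level_miss_rate(stats, "l2"))  # uses l20 only: 0.03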

Bus stats One of the contributions of this work is the improved bus observability due to the stats added to the bus models, as described in Section 4.5. The stats used in this section are a subset of these added stats.

Figure 6.8: Bus stats for PARSEC benchmarks, per core/thread configuration. Panels: (a) bus throughput (bytes/s), (b) bus request layer utilization (%), (c) bus response layer utilization (%), (d) bus snoop response average time in retry list (ticks), (e) aggregate snoop data through the bus (bytes), (f) aggregate non-snoop data through the bus (bytes).

The bus throughput provided in Figure 6.8(a) represents the total data and snoop traffic in bytes per second. Here too, canneal stands out, demonstrating high bus bandwidth. The benchmarks can be categorized into three general groups:

• Streamcluster exhibits fixed bandwidth, regardless of the test system (number of cores/threads).
• Bandwidth scales up with the number of cores - as in fluidanimate and bodytrack.
• Bandwidth scales up with the number of threads - as in vips and x264.
All bandwidth measurements strongly correlate with the DRAM bandwidth from Figure 6.5(a), hence reflect little snoop data through the bus (which is validated in Figure 6.8(e)). The bus request layer utilization statistics provided in Figure 6.8(b) strongly correlate with the total bandwidth. The bus snoop response layer demonstrated close-to-zero utilization throughout all the benchmarks. This is a good indication of very low cross-cluster sharing. The aggregate snoop data through the bus, and the average time a snoop response was delayed due to a busy response layer, are depicted in Figures 6.8(e) and 6.8(d).

PARSEC benchmark execution times for various snoop penalties In Section 6.3 a set of small-scale experiments demonstrated the impact of delaying snoop responses on a (toy-)system's performance. One of the major reasons for conducting the PARSEC set of experiments was to test larger systems, using larger multi-threaded workloads, for their sensitivity to snoop performance. For each benchmark and system, we invoked simulations with snoop-response latencies ranging from 10 ns to 90 ns (a hypothetical sweep driver is sketched below). While in the small-scale experiments the effects of our modification were visible and significant, the PARSEC set of simulations demonstrated little to no impact. Execution times per PARSEC benchmark are depicted in Figure 6.9. These results reinforce our findings based on the aggregate snoop data through the bus from Figure 6.8(e): the volume of snoop data passing through the bus is minor in all the tests performed. Minor differences can be seen when normalizing execution times, as depicted in Figure 6.10. Some of the benchmarks, such as blackscholes's simulations on two-CPU systems, demonstrate the expected sensitivity to snoop response latency. However, most of the benchmarks do not exhibit such sensitivity. Note that results for most of the tests which make use of a four-CPU system (either with four or eight threads) are missing. This is further discussed in Section 6.4.3. As the same patterns of insensitivity to snoop response latency were consistently observed for the other statistics, no further figures are provided, as they would not add any new insights.
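A hypothetical sweep driver in Python is sketched below; the script name, flag spellings, and configuration labels are illustrative assumptions, not the exact interface of the scripts used in this work.

    import subprocess

    BENCHMARKS = ["blackscholes", "bodytrack", "canneal", "fluidanimate",
                  "streamcluster", "swaptions", "vips", "x264"]
    CONFIGS = ["1_cores_1t", "1_cores_2t", "2_cores_2t",
               "2_cores_4t", "4_cores_4t", "4_cores_8t"]
    SNOOP_LATENCIES_NS = [10, 30, 50, 70, 90]

    for bench in BENCHMARKS:
        for cfg in CONFIGS:
            for lat in SNOOP_LATENCIES_NS:
                # Invoke one gem5 run per (benchmark, configuration, latency).
                subprocess.run([
                    "./build/ARM/gem5.fast", "configs/target_platform.py",
                    f"--benchmark={bench}", f"--sysconfig={cfg}",
                    f"--snoop-response-latency={lat}ns",
                ], check=False)  # failing runs (Section 6.4.3) must not stop the sweep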

6.4.3 Why the missing results support our hypothesis Throughout Section 6.4.2, the vast majority of four-core-based simulation results are missing. Almost all of the missing results are those which utilize the DetailedCoherentBus, which makes use of queued ports for delaying responses, and specifically for delaying snoop responses. The missing results are tests which terminated prematurely due to gem5 causing a segmentation fault, the bread-and-butter bug indicator of every programmer. As gem5 is a bleeding-edge simulation infrastructure, such bugs are not necessarily due to a recent change. In most cases where the DetailedCoherentBus caused simulations to end prematurely, the reference bus, the CoherentBus, did not. In search of the root cause of this bug, it was found not to be a bug in the DetailedCoherentBus, but rather one in gem5's cache model. Simply put, the new bus pushed the simulated system into a new corner which was not supported by the simulator. This finding is very useful:
• It demonstrates that a system is sensitive to the new method of delaying snoop responses.
• This bug was only discovered in systems with four CPUs. Of all the systems simulated, only the four-CPU systems are composed of two independent clusters connected to the main bus. Each cluster has its own level-2 cache, hence in such systems two level-2 caches can communicate via the bus. In one- and two-CPU systems, only a single level-2 cache exists.
In order to better understand the root cause, we re-simulated a failing setup, a four-core four-threaded bodytrack test, using a 90 ns snoop response latency. The system is depicted in Figure 6.11. As stated earlier, a reasonable guess was that the problem is related to a level-2 cache, as one of the main differences between the tests that passed and those which did not was the two-cluster formation. Investigation led to the root cause: a full MSHR list in one of the level-2 caches. Hence the cache apparently became overly

Figure 6.9: PARSEC execution (guest) times vs. snoop response penalty (coherent bus reference, 10-90 ns) per configuration. Panels: (a) blackscholes, (b) bodytrack, (c) canneal, (d) fluidanimate, (e) streamcluster, (f) swaptions, (g) vips, (h) x264.

Figure 6.10: PARSEC normalized execution (guest) times vs. snoop response penalty (coherent bus reference, 10-90 ns) per configuration. Panels: (a) blackscholes, (b) bodytrack, (c) canneal, (d) fluidanimate, (e) streamcluster, (f) swaptions, (g) vips, (h) x264.

Figure 6.11: Block diagram of the failing bodytrack test system: four CPUs (each with L1 instruction, data, and TLB-walker caches) arranged in two clusters, each cluster connecting through its own DetailedCoherentBus to a shared level-2 cache (l20, l21); both L2 caches attach to the main DetailedCoherentBus (membus), together with the memory (physmem), a bridge to the non-coherent I/O bus, and the RealView peripherals. The failing L2 cache (l21) is highlighted.

occupied and continued to issue more requests to the memory system until the system broke due to a software bug in the simulator. Figure 6.12 compares the occupancy levels of all caches in the failing configuration and in an identical system using a 10 ns snoop response latency.

Figure 6.12: PARSEC bodytrack, 4 cores, 4 threads, simsmall: average cache occupancy (percent) over time for all caches in the system, with (a) a 90 ns snoop latency and (b) a 10 ns snoop latency.

The main difference between the two systems is the occupancy level of l21, i.e., level-2 cache number 1. This cache is also highlighted in Figure 6.11. In Figure 6.12(a), the occupancy level of level-2 cache number 1 rises until the cache is full. This eventually triggers the software bug, in which a response to a read request cannot be stored in the cache. Typically the data is passed on upstream without a cache line being allocated, yet in this case the bodytrack test ended in a corner case that was not supported. This analysis provided micro-level insight into the impact different snoop response latencies can have, and hence supports our hypothesis. Inevitably, a blocked level-2 cache can severely degrade execution times and overall system performance.

6.4.4 BBench In order to perform full-system performance analysis, it is crucial to make use of real-world workloads that stress the entire system, rather than mostly the CPUs. In the case of smartphones, web browsers are the most representative application. Browser Bench (BBench [25]) is an interactive smartphone benchmark suite that includes a web-browser page rendering benchmark. BBench is automated, repeatable, and fully contained (to avoid any external dependencies such as network performance). It exercises not only the web browser, but also the underlying libraries and operating system. BBench is multi-threaded, in contrast to the SPEC benchmark suite, which is single-threaded. It has a much higher misprediction rate due to code size and sparseness. BBench makes use of more shared libraries (due to high-level software abstractions) and system calls. Its core is the rendering of eleven of the most popular and complex sites on the web: Amazon, BBC, CNN, Craigslist, eBay, ESPN, Google, MSN, Slashdot, Twitter, and YouTube, making use of dynamic content, video, images, Flash, CSS, and HTML5. BBench's main metric is the load time of each website.

BBench Simulation Running a BBench simulation, from system startup till BBench results are available, takes roughly 11.5 hours on a 2.7 GHz Intel Xeon server. The given wall-clock time is for the fastest and least accurate CPU model (AtomicSimple). For more accurate models, simulation might take considerably longer. The simulated time (hence the time it would take to run BBench on a real system matching gem5's ARM full-system) is about 2 minutes of guest time, once booting has finished. During a full-system gem5 simulation, the display is dumped to a bitmap framebuffer. Snapshots from a BBench run are provided in Figure 6.13. The BBench results page is shown in Figure 6.13(d). The "no network connection" dialog box is a known persistent error which

does not affect the benchmark result, as no network connection is required. All sources and online documentation are available at [2].

Figure 6.13: Snapshots taken during BBench simulations on the Gingerbread (GB) and Ice Cream Sandwich (ICS) Android OSs: (a) home screen on GB, (b) BBC website on GB, (c) home screen on ICS, (d) results page on GB.

6.4.5 Insight from Bus Stats BBench was meant to be a leading workload in this research. Only when we discovered that it is not feasible to run BBench on a simulated platform containing four A15 O3 CPUs did BBench's importance start to fade, clearing the stage for other, more appropriate alternatives such as PARSEC. PARSEC's scalable dataset was a key enabler to running multi-core simulations. The most important metric provided by BBench is the benchmark's execution time. Yet running BBench to completion is beyond what gem5 can currently cope with in reasonable time. However, in order to demonstrate the usefulness of the added bus stats, BBench was run using a Timing CPU. Figure 6.14 demonstrates one of the new bus capabilities: distributions of the transactions that have passed through the bus, available both in aggregate and per master or slave port. Figure 6.14 visualizes the transaction distribution for the aggregate traffic through the bus during a BBench simulation. Boot time has been omitted. Instead of plotting a set of curves, each of which represents the number of transactions of one type, we stack these plots as layers. The layers' stacking order is the aggregate amount (volume) of transactions per type, and the legend follows the same order. A minimal sketch of this rendering is provided below.
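The following Python sketch reproduces this stacked-layer rendering style with matplotlib; the transaction types shown are a subset of those in Figure 6.14, and the counts are synthetic rather than actual BBench data.

    import numpy as np
    import matplotlib.pyplot as plt

    t = np.arange(20, 60)  # sample seconds, boot time omitted
    series = {
        "ReadResp":  np.random.rand(t.size) * 6e7,
        "ReadReq":   np.random.rand(t.size) * 6e7,
        "WriteReq":  np.random.rand(t.size) * 2e7,
        "Writeback": np.random.rand(t.size) * 1e7,
    }

    # Stack layers ordered by aggregate volume per transaction type,
    # and let the legend follow the same order.
    order = sorted(series, key=lambda k: series[k].sum(), reverse=True)
    plt.stackplot(t, [series[k] for k in order], labels=order)
    plt.xlabel("time [sec]")
    plt.ylabel("transactions per sample second")
    plt.legend(loc="upper right")
    plt.savefig("bbench_transaction_distribution.png")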

6.5 Results analysis and conclusions

• The set of experiments discussed in this chapter provided insights, from the micro level to the macro level, regarding the possible impact of revised traffic scheduling, mostly of snoop responses.
• In addition, it demonstrated the need for increased observability at key points in a system, such as its main interconnect. Such observability enables easy detection of bottlenecks, which is crucial when designing systems and workloads.

Figure 6.14: BBench transaction distribution demonstration: stacked per-type transaction counts per sample second over time (ReadResp, ReadReq, WriteResp, WriteReq, Writeback, ReadExResp, ReadExReq, UpgradeResp, UpgradeReq, FlushReq, and further, lower-volume types).

• The results demonstrate that a significant impact from delaying snoop responses in a gem5-like system is expected only when high levels of inter-cluster sharing are present. Currently even in PARSEC, which is a typical benchmark for such relatively intensive multi-threaded workloads, the impact is not critical and in many cases hardly noticeable. According to [15], 'PARSEC benchmarks have significantly less communicating writes (4.2% on average) than Splash-2 applications (20.8% on average)'.
• One should take into consideration that the results shown cover the less interesting systems, those which only contain one cluster. The experiments with multi-clustered systems should be rerun once the open bug in gem5's caches is fixed, to evaluate the impact in such systems.

• The bus's snoop layer statistics, notably the utilization and aggregate data statistics, can determine how much inter-cluster sharing actually exists, as they represent snoop hits between two remote clusters.
• The experiments should be rerun with more detailed CPU models and larger datasets, to evaluate whether the results presented here can be extrapolated.

• In most cases additional threads had either no impact or a positive impact. However, in the streamcluster tests adding more threads significantly increased execution time; additional threads thus inevitably come with a price tag.
• The modeling of snoop requests in gem5 must be improved to draw realistic, meaningful conclusions. The real cost of task migration currently cannot be realistically evaluated with gem5. This is based on our understanding of the current gem5 memory model and its optimizations. The current memory system is quite optimistic and behaves as a distributed shared cache with zero-time snoop-miss penalties.
• The expected impact of system coherency on workloads that are not inherently multi-threaded or cooperating (e.g. BBench, browsing, and typical applications) is currently minor. This is based on the low snoop traffic observed on the bus.

6.6 Contribution

• The experiments provided important insight regarding a multi-processor SoC-like model running multi-threaded workloads, whether large or small. They shed light on the PARSEC benchmark tests, their sharing patterns, and their memory requirements. This was a direct result of improving the system's observability by introducing statistics to the bus models.

• Specifically, the impact of snoop response latencies was investigated and demonstrated as meaningful in situations where sharing patterns are intensive.

• The entire flow of invoking simulations and post-processing statistics directly into figures has been automated, allowing any gem5 user to easily re-run diverse sets of PARSEC benchmarks on a multi-clustered SoC model.

Chapter 7

Conclusions and Future Work

• The proposal for this research project started as a one-line bullet: 'Cache coherent interconnect and model fidelity: the aim of this project is to capture the behavior of the ARM AMBA4-family Cache Coherent Interconnect in a transaction-level model and develop a strategy (stimuli, metrics, evaluation method etc.) to reason bottom-up about the model fidelity'. Only after months of study and hard work can one grasp the immense magnitude this short statement holds.
• This project required intensive knowledge ramp-up in a long list of cutting-edge domains: AMBA, system-level coherence, ACE, CCI, transaction-level modeling, target platforms, gem5 at the user level, workloads, designing systems with gem5, and the gem5 memory model - bus, caches, and their interaction. Studying all these topics and unifying them was a challenging task on its own.
• gem5 on its own provided a sea of challenges: being bleeding-edge comes at the high price of breaking software, missing deliverables, a constantly changing patch queue, a severe need for documentation, and performance issues.

• On the same note, being a constantly changing platform made it possible to contribute and help shape gem5 into a better simulation framework. This included both directly related work as part of this project, as well as tangential contributions, which are discussed in Appendix A.
• The accumulated knowledge was a key factor in being able to bridge between ACE and gem5's memory system. Seeing both sides, each implementing system-level coherence for a different purpose, made it possible to see which pieces are missing in both of these puzzles. Some of these missing pieces are discussed next.

7.1 Technical conclusions 7.1.1 Interconnect observability • The introduced bus stats have proven to be critical in evaluating on-chip traffic characteristics. This is an enabler for easily finding system bottlenecks.

• Nevertheless, observability comes at a performance price. This tradeoff is a classic modeling challenge.

7.1.2 Modeling SoC platforms • gem5 provides a convenient framework for exploring SoC architectures and modeling existing platforms.

• Such a platform was composed in order to correlate gem5 simulations with real hardware.
• Simulator performance currently significantly limits the possible workloads when using gem5 to model a multi-processor SoC with detailed CPU models.
• Nevertheless, smart selection of smaller workloads which exhibit the same patterns, e.g. using SimPoints [41], can still yield very meaningful results.

7.1.3 The impact of delaying snoop responses • We have demonstrated the impact of penalizing snoop responses with varying latencies on a system's performance for various workloads. We concluded that the impact heavily depends on the amount of inter-cluster sharing, and thus on the snoop traffic that passes through the main interconnect, as is the case in I/O-coherent systems.

• gem5 currently exhibits the behavior of a distributed shared cache due to the modeling-dependent low overheads of system coherency.
• In order to properly analyze sharing patterns, cache snoop statistics are essential and should be added. Currently gem5 only provides statistics for traffic from the master side (e.g. hit/miss rates, and not snoop hit/miss rates).

• It is essential to deal with gem5’s MemInhibit-related challenges, to enable realistic modeling of snoop costs, speculative fetching, and resource contention.

7.1.4 Truly evaluating ACE-like transactions As mentioned in Section 4.4, the current modeling peculiarities of the gem5 memory system prevent us from evaluating the true impact of ACE transactions:
• ReadOnce is meant to eliminate the cost of downgrading and then upgrading a cache line as a result of a snoop hit. However, in gem5 snoop hits result in on-chip data transfers, so the effects are minor. For a more realistic comparison, a non-coherent reference system should be provided, or a system which does not have a snoop channel on its bus.
• ReadNoSnoop and WriteNoSnoop are meant to avoid issuing snoop requests in vain to masters which are known not to hold a cache line in their caches. However, snoop misses currently come at zero cost.
• WriteLineUnique is meant to eliminate the cost of peer caches being required to perform a write-back upon a snoop-write hitting a modified line. This is done by broadcasting a MakeInvalid invalidation message. However, in gem5, in case of a snoop-write hit to a modified line, the modified line will pass on-chip to the issuing master. Thus where the ACE architects intended to save the cost of a write-back, in gem5 the eliminated cost would be that of a snoop response containing the dirty line. Obviously the cost of an on-chip transaction is nothing compared to an off-chip transaction. However:

• Understanding how to model ACE transactions in gem5 is of great value. It enables followers (e.g. whoever would like to add support for snoop filters to gem5 to evaluate their impact) to follow a recipe which we have written for them.
• The infrastructure created during this project for developing and testing changes in the memory system, discussed in Section 5.1, is an enabler for further similar investigations. In addition, it serves as a functional-verification regression tool, being able to effectively stress the memory system.
• Understanding the modeling peculiarities, flaws, pitfalls, or inaccuracies means we understand where we are today, what a more realistic model would be, and what should be done to bridge the gap. Understanding why the situation is what it is today is crucial: sometimes the current model describes a specific implementation, and at other times it might abstract from details for simulation-performance reasons.
• Understanding a simulator's performance means we know our limits: what can be evaluated and what cannot, as has been demonstrated with several fixes resulting from this work. Also, awareness of a problem is a key factor in solving it.

7.2 Reflections, main hurdles and difficulties faced

During the work on this project, challenges and surprises were waiting around every corner. Obviously, if there were no challenges, there would be no need for the research in the first place. This section aims at providing useful practical guidelines, a postmortem, or lessons learnt, that might improve future projects.

• From the moment this project was born till the day it was submitted, it has constantly evolved and diverted. While this process of re-evaluation and adaptation is fundamental, it comes with a high price tag. Most problematically, this project relied on external deliverables such as a realistic GPU model, corporate cooperation enabling correlation at a micro-benchmark level, and scalable simulation performance. Each of these links was essential for completing the chain of goals set for this project. It would be best to be more pessimistic when a project is as time-limited as this one was, since each diversion derails work. Eventually, time should be spent on the most fruitful and promising tasks, de-risking should be done as soon as possible, and external dependencies should be avoided as much as possible.
• The background knowledge required for this project was by all means extensive. The time required for such a study should be well planned, well accounted for, and preferably be mostly spent during the preparation stage.

• Being part of a bleeding-edge software project means frequently changing code, environment, performance, and consistency. This is especially crippling when developing code in core models, rather than end nodes. Furthermore, as discussed in Section 6.4.3, since this simulation infrastructure is still young, unexpected bugs can hold back work at any moment. While finding root causes and solving bugs is the best solution (as was the case a couple of times, as described in Appendix A), it comes at the expense of other scheduled tasks that will not be accomplished.

Obviously, none of the above relieves the horses carrying the research wagon down the road of any responsibility (or blame).

7.3 Future work

During this work I gained vast knowledge and insight regarding system-level coherency, interconnects, and simulation challenges. Many interesting related topics drew my attention, along with ideas for potentially promising topics to investigate. This section describes those topics which seem both feasible and worth investing time in.

The most promising • Ownership passing policies: gem5 currently utilizes a fixed cache coherence protocol, in which ownership is passed only when an M-state cache line is hit by a snoop-write. However, the ACE specification has a broader range of capabilities. For instance, a request can state whether ownership can be passed as part of the transaction. One motivation for passing ownership is the ability to avoid performing write-backs. Assuming the cache line will be required by some consumer soon, it is better to keep the line dirty on-chip longer. On the other hand, some clients (e.g. I/O-coherent devices with write-through caches) might not be able to receive dirty data. In such a case, either the interconnect can be made responsible for performing the write-back, or the dirty response can also be sent back to the responding master. Eliminating write-backs can have a significant power, bandwidth, and performance impact. The underlying assumption, that the line will be needed again soon, is one of the fundamental principles of caches - temporal locality. Passing ownership to the requesting master as a hot potato can be seen as updating an LRU indicator for a line in a distributed shared cache.
• gem5's memory system lacks the notion of QoS, which is crucial for multiplexing high-bandwidth consumers such as a GPU with latency-sensitive consumers such as CPUs or video controllers. This hot topic can provide plenty of insight into how on-chip traffic should be managed.

• While the CCI-400 is a crossbar-based interconnect aimed at current mobile compute subsystems, future interconnects might adopt scalable interconnect topologies, such as network-based ones. gem5 provides convenient ground for exploring such interconnects.
• Evaluating the true cost of migrating tasks is essential to a big.LITTLE architecture. For gem5 to accurately model these costs, several inaccuracies that have been discussed in Section 4.4 have to be resolved. This will also enable investigating the true price of speculative fetching, which can hide the latency of a snoop miss at the expense of potentially performing a redundant off-chip access. Its usefulness depends on the current workload's sharing behavior. In order to investigate the effectiveness of such a feature, gem5 has to be modified, as currently an unrealistically perfect speculation is implemented.

Adding the missing pieces • A realistic GPU model will enable investigating the sharing patterns between CPUs and a GPU. This insight would be very important for system architects.
• The target platform was described in Section 3.6. A simplified platform was eventually used due to the missing GPU model and performance issues. However, once these are dealt with, implementing the more complex system depicted in Figure 3.15 will provide a more realistic model of a hand-held device, which could be better correlated with a fabricated SoC.
• The CCI-400 utilizes several interesting features which were conceived by experienced architects with limited exploration tools. gem5 provides convenient ground for evaluating such features, amongst them (a toy illustration of the first follows this list):

– Address striping over several interconnect interfaces, for load balancing of external memory transactions.
– A per-interface point of serialization (PoS): each shared address in a coherent system must have a PoS for all masters that share that address. In essence, the purpose of a PoS is to prevent same-address read or write transactions from being serviced out of order, and to enforce barriers.
– Limiting the number of outstanding transactions in the interconnect.
– Barrier support, which is currently implemented in the CPU models.
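As a toy illustration of address striping, the Python sketch below distributes addresses across two interfaces at a hypothetical 4 KB granularity; the granularity and interface count are assumptions for the example, not CCI-400 parameters.

    STRIPE_BYTES = 4096   # assumed stripe granularity
    N_INTERFACES = 2      # assumed number of memory interfaces

    def interface_for(address):
        """Map a physical address to a memory interface for load balancing."""
        return (address // STRIPE_BYTES) % N_INTERFACES

    assert interface_for(0x0000) == 0
    assert interface_for(0x1000) == 1  # next stripe goes to the other interface
    assert interface_for(0x2000) == 0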

Leverage what is already there • Utilize ACE transactions: currently no piece of software makes use of ACE's system-coherency support. As gem5 now supports ACE transactions, existing workloads can be modified to utilize ACE. In addition, data from the TLB can be used to determine the shareability domain of each request.

• As mentioned in Section 7.1.4, gem5's transaction bank can be further extended with additional ACE-like system-level coherency transactions.
• The bus model can be correlated with an existing hardware platform to improve its fidelity.

Appendix A

Contributions to gem5

As part of the work on this project, numerous contributions have been made. Some of these are directly related to or required for this research, and some were the result of a personal whim or an attempt to improve gem5. The project-related contributions include:
• Introducing stats to the bus model.
• Creating the DetailedCoherentBus, which makes use of queued ports and introduces the snoop-response latency knob.

• Adding support for a set of ACE-like transactions to the memory system.
• Creating a fixed-configuration system-composition script for modeling the target platform.
• Reviving, revising and improving MemTest; adding support for scenario inputs and the ability to generate any hierarchy of test systems.

• Adding informative messaging throughout the memory system.
In addition, I had the privilege of contributing to the active gem5 community in several ways:
• DOT [23] based automated system visualization. This added feature generates a block diagram of the simulated system upon invocation of the simulator. The diagram contains a connected directed graph where each arrow points from a master port to a slave port. This output makes it trivial to comprehend a test system's hierarchy. Figure 5.2 is an example of such an automatically generated image. Each node represents an instance of a model such as a CPU, a cache, or a bus. Arrows connect nodes which represent master or slave ports.
• Introducing statistics post-processing scripts which make use of pickle (Python object serialization) to store statistics in a compact file format enabling convenient data retrieval (a minimal sketch follows this list).
• Demonstrating online stats visualization in gem5 for live monitoring purposes, based on Python Matplotlib's [27] interactive mode.
• Profiling gem5 and raising awareness of performance issues (as described in Section 3.6.2), demonstrating where time is spent during simulation.

• Wiki contributions related to the memory system and to the integrated use of Eclipse's CDT and PyDev for providing a complete integrated development and debug environment.
• Reporting bugs (e.g. in gem5's memory system and traffic generator), and providing fixes for some of them.
• Various utility scripts, e.g. for semantically comparing system description files (config.ini), and for verifying ACE-like instructions when a cache line is already in the system in any of the MOESI states.
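As an example of the pickle-based post-processing mentioned above, the sketch below parses a simplified 'name value' statistics dump and serializes it; the parsing is deliberately naive, and the file layout is an assumption about gem5's plain-text stats output rather than a specification of it.

    import pickle

    def parse_stats(path):
        """Collect 'name value' pairs from a plain-text stats dump."""
        stats = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 2:
                    try:
                        stats[parts[0]] = float(parts[1])
                    except ValueError:
                        pass  # skip headers and non-numeric fields
        return stats

    stats = parse_stats("m5out/stats.txt")
    with open("m5out/stats.pkl", "wb") as f:
        pickle.dump(stats, f)  # compact format for convenient retrieval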

Bibliography

[1] Simplescalar llc. http://www.simplescalar.com/. [Online; accessed 23-January-2012]. [2] Bbench. http://www.gem5.org/Bbench, 2011. [Online; accessed 12-December-2011].

[3] The gem5 simulator system. http://www.gem5.org, 2011. [Online; accessed 12-December-2011]. [4] N. Agarwal, L.S. Peh, and N.K. Jha. In-network coherence filtering: snoopy coherence without broadcasts. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 232–243. ACM, 2009.

[5] K. Aisopos, C.C. Chou, and L.S. Peh. Extending open core protocol to support system-level cache coherence. In Proceedings of the 6th IEEE/ACM/IFIP international conference on Hardware/Software codesign and system synthesis, pages 167–172. ACM, 2008. [Online; accessed 18-July-2012].

[6] ARM. Amba axi and ace protocol specification (free registration required). https://silver.arm.com/ download/download.tm?pv=1198016, 2011. [Online; accessed 17-July-2012].

[7] ARM. Big.little processing with arm cortex-a15 & cortex-a7. http://renesasmobile.com/news-events/ news/ARM-big.LITTLE-whitepaper.pdf, 2011. [Online; accessed 17-July-2012]. [8] ARM. Corelink cci-400 cache coherent interconnect technical reference manual. http://infocenter. arm.com/help/topic/com.arm.doc.ddi0470c/DDI0470C_cci400_r0p2_trm.pdf, 2011. [Online; accessed 12- December-2011].

[9] ARM. Corelink gic-400 generic interrupt controller technical reference manual. http://infocenter.arm.com/ help/index.jsp?topic=/com.arm.doc.ddi0471a/index.html, 2011. [Online; accessed 1-August-2012]. [10] ARM. Corelink mmu-400 system memory management unit technical reference manual. http://infocenter. arm.com/help/index.jsp?topic=/com.arm.doc.ddi0472a/index.html, 2011. [Online; accessed 1-August- 2012].

[11] ARM. Corelink system ip & design tools for amba. http://www.arm.com/products/system-ip/amba/index. php, 2011. [Online; accessed 12-December-2011]. [12] ARM. Introduction to amba 4 ace. http://www.arm.com/files/pdf/CacheCoherencyWhitepaper_ 6June2011.pdf, 2011. [Online; accessed 12-December-2011]. [13] ARM. Corelink dmc-400 dynamic memory controller technical reference manual. http://infocenter.arm. com/help/index.jsp?topic=/com.arm.doc.ddi0466b/index.html, 2012. [Online; accessed 1-August-2012]. [14] T. Austin, E. Larson, and D. Ernst. Simplescalar: An infrastructure for computer system modeling. Computer, 35(2):59–67, 2002.

[15] N. Barrow-Williams, C. Fensch, and S. Moore. A communication characterisation of splash-2 and parsec. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 86–97. IEEE, 2009. [16] T.B. Berg. Maintaining i/o data coherence in embedded multicore systems. Micro, IEEE, 29(3):10–19, 2009. [17] C. Bienia, S. Kumar, J.P. Singh, and K. Li. The parsec benchmark suite: Characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pages 72–81. ACM, 2008.

[18] C. Bienia and K. Li. Fidelity and scaling of the PARSEC benchmark inputs. In 2010 IEEE International Symposium on Workload Characterization (IISWC), pages 1–10. IEEE, 2010.

[19] N. Binkert, B. Beckmann, G. Black, S.K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D.R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, 2011.

[20] E. Bolotin, Z. Guz, I. Cidon, R. Ginosar, and A. Kolodny. The power of priority: NoC-based distributed cache coherency. In First International Symposium on Networks-on-Chip (NOCS 2007), pages 117–126. IEEE, 2007.

[21] A. Butko, R. Garibotti, L. Ost, and G. Sassatelli. Accuracy evaluation of gem5 simulator system (to be published). In 7th International Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC). IEEE, 2012.

[22] A. Cimatti, E. Clarke, E. Giunchiglia, F. Giunchiglia, M. Pistore, M. Roveri, R. Sebastiani, and A. Tacchella. NuSMV 2: An open source tool for symbolic model checking. In Computer Aided Verification, pages 241–268. Springer, 2002.

[23] J. Ellson, E. Gansner, L. Koutsofios, S. North, and G. Woodhull. Graphviz: Open source graph drawing tools. In Graph Drawing, pages 594–597. Springer, 2002.

[24] M. Guher. Physical design of snoop-based cache coherence on multiprocessors.

[25] A. Gutierrez, R.G. Dreslinski, T.F. Wenisch, T. Mudge, A. Saidi, C. Emmons, and N. Paver. Full-system analysis and characterization of interactive smartphone applications. 2011.

[26] W. Hlayhel, J. Collet, and L. Fesquet. Implementing snoop-coherence protocol for future SMP architectures. Euro-Par'99 Parallel Processing, pages 745–752, 1999.

[27] J.D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, pages 90–95, 2007.

[28] IEEE. IEEE Draft Standard for Standard SystemC(R) Language Reference Manual. IEEE P1666/D3, May 2011, pages 1–674, 2011.

[29] A.B. Kahng, B. Li, L.S. Peh, and K. Samadi. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 423–428. European Design and Automation Association, 2009.

[30] P.D. Karandikar. Open Core Protocol (OCP): An introduction to interface specification. In 1st Workshop on SoC Architecture, Accelerators & Workloads, volume 10, 2010.

[31] T.J. Kjos, H. Nusbaum, M.K. Traynor, and B.A. Voge. Hardware cache coherent input/output. Hewlett-Packard Journal, 47:52–59, 1996.

[32] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, 2002.

[33] M.R. Marty. Cache coherence techniques for multicore processors. ProQuest, 2008.

[34] P.E. McKenney. Memory barriers: A hardware view for software hackers. Linux Technology Center, IBM Beaverton, 2010.

[35] D. Molka, D. Hackenberg, R. Schone, and M.S. Muller. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In 18th International Conference on Parallel Architectures and Compilation Techniques (PACT '09), pages 261–270. IEEE, 2009.

[36] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. ACM SIGPLAN Notices, 42(6):89–100, 2007.

[37] B. O'Sullivan. Distributed revision control with Mercurial. Mercurial project, 2007.

[38] C. Seiculescu, S. Volos, N.K. Pour, B. Falsafi, and G. De Micheli. CCNoC: On-chip interconnects for cache-coherent manycore server chips. 2011.

[39] D.J. Sorin, M.D. Hill, and D.A. Wood. A primer on memory consistency and cache coherence. Synthesis Lectures on Computer Architecture, 6(3):1–212, 2011.

[40] Open Core Protocol Specification, Release 2.0. http://www.cvsi.fau.edu/download/attachments/852535/OpenCoreProtocolSpecification2.1.pdf?version=1, 2003. [Online; accessed 18-July-2012].

[41] V.M. Weaver and S.A. McKee. Using dynamic binary instrumentation to generate multi-platform SimPoints: Methodology and accuracy. In Proceedings of the 3rd International Conference on High Performance Embedded Architectures and Compilers, pages 305–319. Springer-Verlag, 2008.

[42] J. Weidendorfer. Sequential performance analysis with Callgrind and KCachegrind. Tools for High Performance Computing, pages 93–113, 2008.

[43] T.F. Wenisch, R.E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J.C. Hoe. SimFlex: Statistical sampling of computer system simulation. IEEE Micro, 26(4):18–31, 2006.

[44] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In ACM SIGARCH Computer Architecture News, volume 23, pages 24–36. ACM, 1995.
