Parallel SystemC Simulation for Electronic System Level Design

Dissertation approved by the Faculty of Electrical Engineering and Information Technology of the Rheinisch-Westfälische Technische Hochschule Aachen (RWTH Aachen University) for the award of the academic degree of Doktor der Ingenieurwissenschaften (Doctor of Engineering)

submitted by Diplom-Ingenieur Jan Henrik Weinstock from Göttingen

Reviewers: Universitätsprofessor Dr. rer. nat. Rainer Leupers, Universitätsprofessor Dr.-Ing. Diana Göhringer

Date of the oral examination: 19 June 2018

This dissertation is available online on the website of the university library.

Abstract

Over the past decade, Virtual Platforms (VPs) have established themselves as essential tools for embedded system design. Their application fields range from rapid prototyping through design space exploration to early software development. This makes VPs a core enabler for concurrent HW/SW design – an indispensable design approach for meeting today's aggressive marketing schedules. VPs are essentially a simulation of a complete microprocessor system, detailed enough to run unmodified target binary code. During simulation, VPs provide non-intrusive debugging access as well as reporting on non-functional system parameters, such as execution timing and estimated power and energy consumption.

To accelerate the construction of a VP for new systems, developers typically rely on pre-existing simulation environments. SystemC is a popular example of this and has become the de-facto reference for VP design since it became an official IEEE standard in 2005. Since then, however, SystemC has failed to keep pace with its users' demands for high simulation speed, especially where embedded multi-core systems are concerned. Because SystemC only utilizes a single processor of the host computer, the underlying sequential discrete event simulation algorithm becomes a performance bottleneck when simulating multiple virtual processors.

It is the main goal of this thesis to overcome this bottleneck by utilizing multi-core computers as simulation hosts, given that they are abundantly available today. To that end, it presents new tools and modelling methodologies that facilitate the parallelization of new and existing VPs, even if they include legacy source code and models that were written without parallelism or thread-safety in mind. After proposing a new parallel simulation engine for SystemC, this thesis continues by investigating modelling approaches that address commonly encountered issues when migrating from sequential to parallel simulation environments, such as the functionally correct representation of optimistic exclusive memory access models.

The efficacy of the proposed techniques is evaluated using realistic VPs as the driving case studies. In the past, these platforms have been used productively, either in design space exploration or as debug targets for early SW development. Their performance gains due to parallel simulation reach 4 – 8x over the current state-of-the-art implementation of SystemC on modern multi-core host computers.

Acknowledgements

While writing these words I am looking back at the six years (even more if you count my time as a student and research assistant) I spent at the Institute for Communication Technologies and Embedded Systems at RWTH Aachen University. It has truly been a privilege to work alongside so many excellent personalities who have supported and shaped my academic career, and I would like to take the opportunity to thank them here.

First, I would like to thank Professor Rainer Leupers, who encouraged me to pursue a doctoral degree and allowed me to join his research group. His vast expertise and constructive feedback allowed me to refocus my research whenever I lost track and helped me to concentrate on solving the practical problems.

I am grateful to Christoph Schumacher, who accompanied me during my first voyages into SystemC and parallel simulators during my early days at the Institute. His determined and thorough approach to problem solving quickly became a shining example, and his research formed a solid foundation to build upon.

Beyond those already mentioned, I had the pleasure of working with many other Institute members who deserve a mention here. I would like to express my gratitude to Luis Gabriel Murillo, Róbert Lajos Bücs and Stefan Schürmans for our inspirational conversations as well as challenging questions that helped me improve and optimise this work. I would also like to thank my research assistants Christian Jöhri, Alexander Wihl and Florian Walbroel for helping me to carry the workload. A special thank you goes to my colleagues Gereon Onnebrink and Diego Pala for proof-reading and helping me to give this thesis its finishing touch.

Finally, I would like to express my gratitude to my friends and family – those who are still around and those who passed away too soon. It is because of your love and support that this work became possible, and I cannot thank you enough for it.

Jan Henrik Weinstock, June 2018

Contents

1 Introduction
  1.1 Embedded Systems
  1.2 Electronic System Level Design
  1.3 System Level Description Languages
    1.3.1 SystemC
    1.3.2 SpecC
    1.3.3 SystemVerilog
    1.3.4 Summary
  1.4 Anatomy of a Virtual Platform
    1.4.1 Supporting Toolset
    1.4.2 Platform Simulator
  1.5 Thesis Contributions and Outline
  1.6 Synopsis

2 Background
  2.1 Discrete Event Simulation
  2.2 Deterministic Simulation
  2.3 Race Conditions
  2.4 Synchronisation Problem
  2.5 Synopsis

3 Related Work
  3.1 Traditional Parallel Simulation
  3.2 Parallel SystemC


    3.2.1 Synchronous Simulation
    3.2.2 Asynchronous Simulation
    3.2.3 Distributed Simulation
    3.2.4 Accelerator Supported Simulation
    3.2.5 Summary of Parallel SystemC Approaches
  3.3 Fast Instruction Set Simulation
    3.3.1 Compiled Simulation
    3.3.2 Dynamic Binary Translation
  3.4 Synopsis

4 Target Platforms
  4.1 The EURETILE Platform
    4.1.1 Virtual EURETILE Platform
    4.1.2 EURETILE Software and Toolchain
  4.2 The GEMSCLAIM Platform
    4.2.1 GEMSCLAIM Virtual Platform
    4.2.2 GEMSCLAIM Software Environment
  4.3 The OpenRISC Platform
    4.3.1 OpenRISC Virtual Platform
    4.3.2 OpenRISC Software Environment
  4.4 Synopsis

5 Time-Decoupled Parallel SystemC
  5.1 Asynchronous SystemC Simulation
    5.1.1 Simulation Loop and Context
    5.1.2 Simulator Partitioning
  5.2 Cross-thread Communication
    5.2.1 Remote Events
    5.2.2 Remote Event Queues
    5.2.3 Blocking Transport Interface
    5.2.4 Augmented TLM Target Socket
  5.3 Experimental Results

    5.3.1 Experiment Setup
    5.3.2 Sequential Performance
    5.3.3 Parallel Performance
    5.3.4 Lookahead Analysis
  5.4 Limitations and Outlook
  5.5 Synopsis

6 Flexible Time-Decoupling
  6.1 Simulator Operation Modes
    6.1.1 Deterministic Simulation Mode
    6.1.2 Fast Simulation Mode
  6.2 Flexible Inter-thread Communication
    6.2.1 Zero-delay Remote Notifications
    6.2.2 Remote Signals
    6.2.3 Remote Direct Memory Access
  6.3 Temporal Decoupling
    6.3.1 Timing Error
    6.3.2 Mitigation Strategy
  6.4 Experimental Results
    6.4.1 Synthetic Experiments
    6.4.2 GEMSCLAIM Experiment Setup
    6.4.3 GEMSCLAIM Experimentation
  6.5 Limitations and Outlook
  6.6 Synopsis

7 Exclusive Memory Access Models
  7.1 Modelling Considerations
    7.1.1 Load-Linked and Store-Conditional
    7.1.2 The ABA Problem
  7.2 Modelling Approach
    7.2.1 DMI Cache Model
    7.2.2 Memory-based Model

    7.2.3 Transaction-based Model
    7.2.4 Mixed Operation
  7.3 Experimental Results
    7.3.1 Simulation Overhead
    7.3.2 Parallel Performance
  7.4 Limitations and Outlook
  7.5 Synopsis

8 Processor Sleep Models
  8.1 Processor Sleep States
  8.2 The OpenRISC Instruction Set Simulator
    8.2.1 Cache Compiled Simulation
    8.2.2 SystemC Wrapper
  8.3 Processor Sleep Models
    8.3.1 ISS Sleep Model
    8.3.2 DES Sleep Model
  8.4 Experimental Results
    8.4.1 Processor Activity Tracing
    8.4.2 Sleep Model Comparison
    8.4.3 Parallel Performance Analysis
  8.5 Limitations and Outlook
  8.6 Synopsis

9 Parallel SystemC using Time-Decoupled Segments
  9.1 The SystemC-Link Concept
  9.2 Simulation Controller
    9.2.1 Time-Decoupled Co-Simulation
    9.2.2 Segment Scheduling
    9.2.3 Virtual Sequential Environment
  9.3 Communication Infrastructure
    9.3.1 Queue-based Communication
    9.3.2 IMC-based Communication

  9.4 Simulation Structure and Composition
  9.5 Experimental Results
    9.5.1 Scheduling Mode Analysis
    9.5.2 Channel Latency Network Analysis
    9.5.3 OpenRISC Multi-Core Platform
  9.6 Limitations and Outlook
  9.7 Synopsis

10 Conclusion
  10.1 Summary
  10.2 Outlook

Appendix

A SystemC/TLM Simulation Overview
  A.1 SystemC Core Concepts
    A.1.1 Events and Processes
    A.1.2 Module Hierarchy
    A.1.3 Communication Infrastructure
  A.2 Transaction Level Modelling
    A.2.1 Blocking Transport Interface
    A.2.2 Non-blocking Transport Interface
    A.2.3 Direct Memory Interface
    A.2.4 Debug Interface

B The Virtual Components Library
  B.1 Modelling Primitives
    B.1.1 Ports and Sockets
    B.1.2 Peripherals and Registers
    B.1.3 Properties
    B.1.4 Logging
  B.2 Component Models
    B.2.1 Memory Model

    B.2.2 Memory-mapped Bus
    B.2.3 Universal Asynchronous Receiver/Transmitter 8250
    B.2.4 OpenCores SPI Controller
    B.2.5 OpenCores VGA/LCD 2.0 Core
    B.2.6 OpenCores 10/100 Mbps Ethernet

C Experimental Data

Glossary

List of Figures

List of Tables

List of Algorithms

Bibliography

Chapter 1

Introduction

Embedded systems are constantly evolving. From early single processor designs to modern heterogeneous multiprocessor systems, embedded systems have shown a tremendous increase in complexity in order to meet the computational needs of today. Consequently, development of these systems has become a daunting task, involving hundreds of embedded hardware and software engineers, often from different companies. They are facing design questions, such as "Which design makes the system fulfil its requirements optimally?" and "What is the best way to program and debug its complex HW/SW stacks?" Virtual Platforms (VPs) are tools that help to answer these questions.

At its core, a VP is a simulation of a complete hardware system, capable of executing unmodified target software. Given that this simulation is itself again a software program, one can use traditional debuggers to gain non-intrusive insight into all details of the simulated system. This enables inspection of memory contents, processor registers and even single interrupt lines without affecting the state of any component. Compared to hardware prototypes, VPs are also more cost efficient, given that debuggers are widely available today.

Furthermore, their software nature enables additional uses of VPs beyond debugging. Experimenting with platform parameters, such as the size of a memory component or the clock frequency of a processor, helps developers to find the optimal system configuration for the given requirements. For a VP, this experimentation only involves minor changes in its configuration and then restarting the simulation. At most, these changes cost developers a couple of minutes, compared to days for hardware prototypes, allowing rapid design space exploration.

Without VPs, hardware and software development used to be a serialised process. Hardware dependent software, such as device drivers and low-level kernel code, could only be written after the silicon design had been completed. In contrast, VPs can be made available to developers much quicker. Consequently, work on hardware dependent software, such as device drivers, can commence earlier and concurrently with the hardware development. This so-called HW/SW Co-design facilitates integration, reduces engineering cost, and shortens the time-to-market.

However, with all the benefits of a software program, VPs also suffer from serious shortcomings. A major flaw can be found in the underlying simulation algorithm. In order to preserve temporal correctness, it operates purely sequentially. Consequently, the more complex the target system is, the slower the simulation will perform. Given the fact that the number of processors built into modern embedded systems is constantly increasing, VPs are facing a performance bottleneck. Reduced simulation speed directly translates to a reduction in productivity, a problem that virtual platforms were originally designed to overcome. In order to keep them viable as design tools, solutions for the decreasing simulation speed of VPs are desperately needed.

Suggested solutions to the performance bottleneck can be classified into two groups: simulate less and simulate parallel. The former approaches attempt to raise the abstraction level of the VP, which is usually achieved via omission of unneeded modelling detail. Transaction Level Modelling (TLM) [26] is a popular example of this methodology. The latter approaches propose transitioning to parallel simulation techniques. This appears as an attractive solution, given that multiprocessor workstation PCs are widely available today. Problems hindering their adoption lie for the most part in the design of the individual component models, which are not designed with parallelism in mind and are susceptible to parallel programming errors.

This thesis aims at providing solutions for problems encountered during the construction of high performance parallel simulators for embedded systems. It includes modelling techniques for parallel simulation environments and parallel simulation algorithms optimised for VPs. However, before the actual contributions of this work can be enumerated, it is important to first understand the background that makes them necessary. Therefore, the remainder of this chapter is structured as follows. First, Section 1.1 gives an overview of present and future embedded systems that need to be targeted by VPs. Section 1.2 gives an introduction to Electronic System Level (ESL) design before Section 1.3 and Section 1.4 present industry standard approaches and tools for VP development. Finally, Section 1.5 enumerates the contributions of this thesis and Section 1.6 concludes this chapter with a short summary and the outline of the remainder of this thesis.

1.1 Embedded Systems

Traditionally, an embedded system is defined as a computer system that has been embedded into a larger mechanical or electrical system [211]. A popular example of such an early embedded system is the Apollo Guidance Computer [141, 211]. It was in charge of control and navigation of the spacecraft employed during the Apollo program between 1969 and 1972. However, its single core Central Processing Unit (CPU), clocked at 2.048 MHz, is no match for the designs of today. Modern multimedia platforms, such as the Samsung Exynos [156, 215], the Qualcomm Snapdragon [147, 214] or the Apple A9X [208, 176], demonstrate the remarkable development of embedded systems. Those platforms are now commonly equipped with multi-core processors operating between 1.6 GHz and 2.26 GHz [215, 214, 208], dedicated Graphics Processing Units (GPUs) and Digital Signal Processors (DSPs) for audio/video processing and wireless communication. They are employed in the automation, automotive, health care and telecommunication industry, but can generally also be found in everyday consumer electronics, such as mobile phones, smartwatches and TVs. From those early designs until today, embedded systems have been subject to constant improvement, driven by the demand for mobile communication, information and entertainment of our modern society.

[Bar chart: ITRS forecast of the number of processors per year, rising from about 22 in 2016 to 188 in 2026.]

Figure 1.1: ITRS forecast: predicted processor count (CPU and GPU) in embedded consumer electronics (adapted from [85]).

This development was made possible by advancements in semiconductor technology. By downscaling the manufacturing process, one could construct ever more complex designs with higher performance and energy efficiency – an effect predicted by Gordon Moore in 1965 [213, 127]. However, with manufacturing processes reaching natural boundaries, further increases in performance and energy efficiency appeared difficult. With increasing energy consumption and heat production, it became apparent that embedded processors hit the power wall, similar to what happened with desktop computers around the year 2000. A solution to this problem was the adoption of multi-core processor systems in the embedded domain as well. This made it possible to increase the computational power of embedded systems by means of parallel execution, at the cost of increased software complexity.

To cope with the energy limitation imposed on battery-powered embedded systems, Heterogeneous Multiprocessing (HMP) systems are being built today. In contrast to Symmetric Multiprocessing (SMP) desktop systems, HMP designs utilise processors of different kinds. For example, the ARM big.LITTLE architecture [7, 210] couples high performing processors with energy efficient ones. This allows the embedded Operating System (OS) to balance between performance and energy consumption depending on the current conditions: while a smartphone is resting inside its owner's pocket, it does not need high performance cores just to handle background tasks like fetching emails. Instead, the energy efficient cores are activated to handle these tasks at a slower pace so that the user can enjoy a longer battery lifetime.

The trend of adding more and more specialised processors to a system to fulfil tasks in an energy efficient way is further supported by forecasts of the ITRS. As Figure 1.1 shows, the number of CPU and GPU cores in embedded consumer electronics is expected to exceed 100 already by 2024. While the actual numbers are merely educated guesses, the trend shown in Figure 1.1 is already supported by the history of present architecture families, such as the aforementioned Samsung Exynos and Qualcomm Snapdragon. How to deal with increasing numbers of processors in embedded systems is a challenge for hardware, software and tool developers in ESL design.

1.2 Electronic System Level Design

The demand for increased performance, flexibility and energy efficiency caused the adoption of HMP architectures in embedded system design. They are widely considered the optimal design choice for high performance, yet energy efficient devices. However, their inherent complexity brings forth new challenges, for which the traditional embedded system design flow appears impractical. That flow starts with an informal specification document of the target system. Based on such a written document, hardware development starts, including architecture and chip design and finally manufacturing. Only once the first hardware prototype is available can the software development team start writing firmware, device drivers and the actual applications.

Projects based on such a serialised development process struggle to meet the increasingly competitive marketing schedules of today. Moreover, with only an informal specification of the target system, bugs caused by misunderstandings between hardware and software teams are likely to appear. Due to a missing feedback channel between both teams, the dimensioning of the hardware is mostly left to intuition, often leading to over- or under-provisioning of device resources, e.g. memory or cache sizes. All issues combined, the traditional design flow results in a longer development time, higher cost and ultimately a lower quality product.

The research field of ESL design attempts to overcome those challenges by providing developers with novel tools and methodologies. VPs are one of those tools. They replace the informal, written specification with an executable one, i.e., a simulation of the envisioned system. This offers multiple benefits:

• The VP serves as a golden reference for both hardware and software development. This eliminates the chance of misunderstandings and allows the specification to be safely refined later on in the design flow should that become necessary. VPs have been shown to significantly reduce final system bring-up and validation times.

• HW/SW Codesign: Due to its early availability, a VP is the ideal vehicle for software development. Figure 1.2 illustrates how a VP allows programming to start long before the first silicon is available, effectively enabling concurrent design of hard- and software. Moreover, a feedback channel becomes possible, where findings from the software space can be used to improve and optimise the hardware design.

• The VP offers enhanced debug capabilities via non-intrusive introspection into the entire system. Since the VP itself is a program, regular debuggers can be used to inspect and alter hardware details, down to individual registers and interrupt lines.

• Design Space Exploration is enabled by means of VPs. Platform parameters such as the number of processors, clock speed and memory size can be investigated with little to no additional effort to find the best configuration in terms of performance, energy efficiency, flexibility and cost.

[Diagram: the system specification feeds a Virtual Platform; software development (device drivers, firmware, applications) and hardware development (architecture, chip design, manufacturing) proceed concurrently towards system validation, with a feedback path from software to hardware.]

Figure 1.2: HW/SW Codesign: Virtual Platforms enable concurrent HW and SW development, HW design feedback and produce earlier results.

These benefits are by no means just theoretical. One could observe them in practice when ARM introduced their 64-bit architecture (AArch64) [6, 209] in 2011. While the first silicon chips only became available in late October 2012 [5, 9], corresponding software had already been available months before [115, 66]. This was enabled by means of the ARM Fast Models [8], a VP created by ARM as the golden reference for AArch64.

To leverage these benefits, VPs must be available early in the design phase of a system. This might seem at odds with the ever increasing complexity of the systems that a VP must model and, indeed, writing a VP from scratch is a daunting task. A solution can be found when looking at the design process of hardware. Already in 2001, the ITRS foresaw that due to increased Non-Recurring Engineering (NRE) cost and shrinking project schedules, future designs would no longer be built from scratch [84]. Instead, component reuse became the key concept. It is based on Intellectual Property (IP) blocks, which are pre-engineered, pre-verified and off-the-shelf available designs of subcomponents of a system, such as CPUs, GPUs, DSPs, memory controllers, buses and various I/O peripherals.

Due to its overwhelming success, a similar design concept also found its way into mainstream VP development. Vendors supply models in combination with the actual designs in the form of simulation libraries. Design entry tools, such as Platform Architect [182], can then be used to combine models from multiple libraries into a single VP. To achieve interoperability between libraries of different vendors, a common simulation environment is needed. Commonly, such an environment is provided by the System Level Description Language (SLDL) used to construct a VP. The following section gives an overview of the SLDLs used in ESL design today.

1.3 System Level Description Languages

The adoption of SLDLs in embedded system design was mainly driven by the need to describe increasingly complex systems at abstraction levels higher than those offered by the traditional Hardware Description Languages (HDLs), such as Verilog [189] and VHDL [111]. While those HDLs are a perfect match for circuits with a few hundred to a few thousand gates, they are unsuitable for the fast simulation of complex systems as required for VP design.

Language        Applications                               Ancestor   Year   Reference
SystemC         VP modelling, functional verification,     C++        2000   [79, 80]
                high level synthesis
SpecC           high level synthesis, functional           C          1999   [43, 60]
                verification
SystemVerilog   HW modelling, functional verification      Verilog    2004   [78, 81]

Table 1.1: Summary of SLDLs from industry and academia

The search for suitable SLDLs brought forth a variety of requirements that need to be fulfilled by potential candidates:

• Simulation speed is a major concern. In order to be useful as a debugging tool, VPs must perform fast enough to allow interactive use and replay. Therefore, debugging efficiency of a VP is practically linked to simulation speed.

• Abstraction refers to the ability of an SLDL to incorporate designs modelled at different levels of complexity. Those abstraction levels may range from pure functional simulation to instruction- or even cycle-accurate models.

• Expressiveness is another key aspect. It refers to the ability of an SLDL to describe the intended behaviour of a model as efficiently as possible. Early availability can only be achieved if custom models can be assembled swiftly.

• Compatibility with existing designs and modelling tools accelerates adoption of an SLDL by widening the selection of deployable IP blocks in a design. If needed, adaptors must enable use of models from different providers.

• Availability refers to the cost of deploying an SLDL into a production environment. Ideally, existing tools, such as compilers and debuggers, can be re-used, thus avoiding tool acquisition and staff training costs.

It is no surprise that most of these key aspects can be found in the SLDLs that are being deployed successfully in industry and academia today. Besides VP modelling, these SLDLs have also proven their usefulness in a variety of other application areas within the domain of ESL design, such as high level synthesis, functional verification, performance modelling and architectural exploration.

The SLDLs that have been considered for this work are SystemC, SpecC and SystemVerilog. Table 1.1 gives a short overview of these SLDLs, including their ancestral languages and main application areas today. A more detailed discussion with a focus on VP modelling is presented in the following.

1.3.1 SystemC

Since its first prototype release in early 2000 [19, 67], SYSTEMC has become the de-facto industry standard for VP development. In 2005, it was officially standardised by the IEEE [79] and received a matching reference implementation (SystemC 2.2) two years later. The current reference implementation (SystemC 2.3) was released in 2011 [80] and extended the former version with support for TLM [26], which was previously only available separately via external libraries [142].

At its core, SYSTEMC is a C++ class library and therefore benefits from the rich expressiveness and high performance of that language. Complex algorithms, e.g. for compression or encryption, can be incorporated directly into SYSTEMC models by using already established external C/C++ libraries. This reduces modelling effort and enables rapid prototyping. Furthermore, SYSTEMC does not need external tools for simulation, since the simulation engine is part of the executable program. Only regular C/C++ compilers and debuggers are needed, which greatly increases the availability of SYSTEMC based platforms.

Development of SYSTEMC is steered and promoted by the Open SystemC Initiative (OSCI) – now Accellera Systems Initiative [1]. However, initial contributions originate from many different companies and universities, such as Synopsys, IMEC, Infineon and the University of California, Irvine [67].
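As a brief illustration of this point (the example is ours and not taken from the thesis), the following minimal SYSTEMC model is an ordinary C++ program: module, process registration and kernel invocation are all plain C++ constructs provided by the class library, and the resulting executable already contains the complete simulation engine.

#include <systemc.h>   // the SystemC class library; no external simulator is required
#include <iostream>

SC_MODULE(Timer) {
    SC_CTOR(Timer) {
        SC_THREAD(run);                      // register a simulation process
    }

    void run() {
        for (int i = 0; i < 3; i++) {
            wait(10, SC_NS);                 // advance simulation time by 10 ns
            std::cout << "tick at " << sc_time_stamp() << std::endl;
        }
    }
};

int sc_main(int argc, char* argv[]) {
    Timer timer("timer");
    sc_start();                              // run the built-in discrete event kernel
    return 0;
}

Compiling this file with a regular C++ compiler and linking it against the SystemC library yields a stand-alone simulator executable.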

1.3.2 SpecC

The origins of SPECC can be traced back to the work of Professor Daniel Gajski at the University of California, Irvine [60, 55], which later also influenced the design of SYSTEMC [67]. Today, it is promoted and supported by the SpecC Open Technology Consortium (STOC) and enjoys frequent research publications [34, 32] as well as best-paper awards [35] at major Electronic Design Automation (EDA) conferences. Besides its closeness to academia, it has mainly been adopted by Japanese semiconductor and systems companies, such as Toshiba and Hitachi [191].

SPECC is a superset of ANSI-C [95]. Similarly to SYSTEMC, SPECC borrows from the rich expressiveness and performance of its ancestral language. However, a special precompiler [27] must be deployed to convert SPECC models into C code, which only then can be compiled into executable programs. As a result of this dependency, support for external libraries and certain language features, e.g. pointers, is limited [35].

Due to their shared origin, SPECC and SYSTEMC share many similarities. Both employ a discrete event simulation approach and produce stand-alone simulators in the form of executable programs. Both languages support modelling at multiple levels of abstraction, ranging from a purely functional level down to cycle accuracy.

Despite these similarities, SPECC has failed to gain high adoption rates within the EDA industry. While the point can be made that support from Synopsys significantly pushed the popularity of SYSTEMC, it cannot be denied that the additional tool requirements and language limitations of SPECC imposed a major hurdle in terms of availability and industry adoption.

1.3.3 SystemVerilog

SystemVerilog was created based on a donation of the Superlog language to Accellera in 2002 [148]. After another donation concerning verification functionality from the OpenVera language of Synopsys, SystemVerilog became an official IEEE standard in 2005 [78] as a system level hardware description and verification language. Since then, SystemVerilog has received frequent updates and refinements to keep up with the complexity of hardware projects. It is steered and promoted by Accellera, with its latest release stemming from 2012 [181].

SystemVerilog was created as an answer to the rising demands of system level verification that came as a result of the ever increasing complexity of modern hardware designs. It is based on Verilog [189] and extends it with verification facilities, such as assertions and random number generators, as well as support for object oriented programming models. Due to this inheritance, SystemVerilog is a popular SLDL among hardware design teams that can draw upon their existing competence in Verilog. Furthermore, since neither SYSTEMC nor SPECC offer a reliable way to produce silicon based on their designs, SystemVerilog plays an important role in ESL design, specifically regarding its ability to create test benches.

However, its close relation to Verilog also imposes limitations on the use of SystemVerilog as a basis for VP design. Firstly, the integration of high performance Instruction Set Simulators (ISSs) appears difficult, since those are usually based on C/C++ [18, 8]. Secondly, since SystemVerilog is mostly directed towards low abstraction levels, simulation performance is reduced.

1.3.4 Summary

ESL design covers a broad field of topics, ranging from performance modelling, architectural exploration and virtual prototyping to hardware verification and synthesis. Until now, no single tool or language has been identified that targets all of those topics combined in a flexible and efficient way. With increasingly complex designs and more aggressive development schedules, designers rely on multiple SLDLs, as each one excels in its unique application fields.

Since the focus of this thesis is the acceleration of VP simulation performance, it appears only natural for it to be based on the SLDL whose primary use case today is VP modelling, i.e. SYSTEMC. While SPECC also has a strong focus on this field, its limited application in industry makes it a less attractive target for research. The main application field of SystemVerilog is assertion-based hardware verification and it therefore lacks the support that facilitates VP design. For example, it does not support the incorporation of C/C++ based behavioural models for processors and I/O for serial or Ethernet connectivity.

Consequently, the rest of this thesis is concerned with fast simulation techniques for VPs based on the SYSTEMC SLDL. However, many of the challenges encountered in SYSTEMC platforms can also be expected to be present in SPECC simulators and consequently might be overcome using the techniques outlined here.

[Diagram: a VP consists of supporting tools (SW debugger, HW inspector, virtual screen and keyboard, Internet uplink, filesystem) and the platform simulator (IP models, SLDL, and a kernel with event loop, process scheduler and channel management).]

Figure 1.3: Anatomy of a VP

1.4 Anatomy of a Virtual Platform

Contemporary VP installations provide a variety of tools besides the full platform simulator. Furthermore, the simulator itself consists of many different software layers and component libraries. While the structure of a VP will vary slightly depending on its use case, this section identifies a subset shared by most VPs and establishes a common nomenclature. Figure 1.3 depicts this structure. On the top level, a VP is composed of the platform simulator and an accompanying toolset. In the following, these components are described in detail.

1.4.1 Supporting Toolset

To be useful as a debugging tool, a VP needs to provide more tools besides just the system simulator and a SW debugger. This toolset varies depending on the intended use case. For example, working with a VP running the Linux kernel [186, 24] requires support for the creation and inspection of virtual file systems created and deployed during simulation. A VP for an Internet of Things (IoT) device would additionally require support for the emulation of an Internet connection, and so on. An overview of the VP tools relevant for this thesis is given below:

• An In-Circuit Debugger is used to test and debug programs running on the VP. Similar to the operation of regular SW debuggers, such as GDB [49, 177] and the Microsoft Visual Studio Debugger [212, 123], it supports features like breakpoints, single stepping and memory inspection that aid in identifying and removing SW bugs. However, in case an OS kernel is the debug target, additional support for virtual address translation and system calls needs to be provided that would otherwise be managed by the host OS. Typical In-Circuit Debuggers in EDA include rGDB [58], Synopsys PDBG [139] and Lauterbach TRACE32 [108].

Figure 1.4: Synopsys Virtualizer [183] (left) and Windriver Simics [46] (right)

• A HW Inspector is a tool to visualise the internal state and state changes within a VP in a non-intrusive way. Being able to inspect individual registers, cache lines, memories and interrupt lines without altering any system state is one of the major benefits of a VP compared to a classical HW prototype. It is not uncommon to see HW inspection and in-circuit SW debugging features merged into a single tool. Popular examples within the EDA industry are Synopsys Platform Architect MCO [182, 183] and Windriver Simics [113, 47]. Both tools are illustrated in Figure 1.4.

• Internet Uplink tools usually operate in the background and provide local network or Internet connectivity for the Ethernet or WIFI components modelled in the VP. They are crucial for the development of software for IoT devices and frequently offer fault injection functionality to facilitate the testing of device drivers. Most EDA tool providers utilise either SLiRP [98, 57] or TUN/TAP [101, 100] to provide their VPs with network connectivity.

• Virtual Filesystem tools enable access to and modification of the persistent storage data used by a VP. Next to an Internet connection, adding files directly into a filesystem is an efficient way of getting programs into a simulation for debugging or testing. For platforms running the Linux kernel, these tools usually support the ext2, ext3 and ext4 filesystems. A set of filesystem modification tools is offered by libguestfs [89], for example.

• Virtual Input/Output devices allow for interaction with a VP. A fundamental example of this is a virtual Universal Asynchronous Receiver Transmitter (UART) that prints simulation output to and receives input from a terminal window on the host. Additionally, most modern smart devices, such as phones and tablets, typically employ touch screens, which also need to be replicated in a simulation. VNC [11] is a frequent choice among EDA tool providers to connect virtual screens and keyboards to a VP.

1.4.2 Platform Simulator

The platform simulator can be considered the most important component of a VP. Contrary to HDL simulators, a SYSTEMC based VP does not need external tools to manage the simulation process. Instead, this task is handled by the simulation kernel, which is compiled together with the individual component models to form a single executable simulation program. An overview of platform simulator components, as shown in Figure 1.3, is given below:

• The SLDL defines the interfaces between different component models and the simulation kernel. Communication between models in SYSTEMC is done using TLM interfaces or synchronised channels that allow simultaneous read and write accesses. The interface between a model and the kernel is defined by the SYSTEMC Application Programming Interface (API), e.g., wait calls to advance simulation time and spawn calls to create new simulation processes.

• Models describe the actual behaviour of the VP components. They are usually written in C/C++ and made available as off-the-shelf components by hardware vendors to facilitate their reuse in different platforms. Typical models include ISSs, buses, memories and I/O components, such as UARTs and Ethernet cards. Before these models can be used in a SYSTEMC VP, they need to be embedded into a wrapper class, which acts as a bridge between the model library and the SLDL (a minimal wrapper sketch follows this list).

• The simulation Kernel has the task of conducting the simulation. For VPs a Discrete Event Simulation (DES) scheme has proven beneficial and has therefore also been adopted by SYSTEMC. Internally, the kernel utilises a process scheduler to trigger the execution of simulation processes driven by an event loop. To facilitate the exchange of data between different simulation models, a channel manager is deployed and offers a standardised way of communication.
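To make the wrapper concept from the Models bullet concrete, the following sketch (our own illustration; CpuModel is a hypothetical stand-in for a vendor-supplied ISS library) shows how a SystemC-agnostic C++ model can be embedded into a SYSTEMC module that maps simulated cycles onto SYSTEMC time:

#include <systemc.h>

// Hypothetical model class as it might ship in a vendor simulation library.
class CpuModel {
public:
    void step(unsigned cycles) { /* interpret 'cycles' worth of target instructions */ }
};

// Wrapper module: acts as the bridge between the model library and the SLDL.
SC_MODULE(CpuWrapper) {
    CpuModel model;          // the wrapped model knows nothing about SystemC
    sc_time  quantum;        // simulated time represented by one call to step()

    SC_CTOR(CpuWrapper) : quantum(1, SC_US) {
        SC_THREAD(run);
    }

    void run() {
        while (true) {
            model.step(1000);  // let the model execute a block of cycles
            wait(quantum);     // hand control back to the SystemC kernel
        }
    }
};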

When using the OSCI reference implementation of SYSTEMC, the process scheduler may only set up a single process at a time for execution¹. This frees VP developers from worrying about concurrent accesses to shared resources by different simulation models. However, it also inhibits utilisation of the explicit parallelism of the modelled HW by means of modern multi-core workstation PCs.

In Section 2.1, the fundamental simulation algorithm of SYSTEMC is presented and the interactions between events and the process scheduler are outlined in detail. An in-depth explanation of other internals of the SYSTEMC kernel, including the relevant TLM communication interfaces, can be found in Appendix A. On top of that, familiarity with C/C++ is assumed. References such as Meyers [120, 121] and Stroustrup [179, 180] may otherwise be consulted.

¹ At the time of writing, this is also the case for other commercial solutions, e.g., from Synopsys [182].

1.5 Thesis Contributions and Outline

This thesis presents novel techniques that enable the construction of high-performance, parallel simulators for ESL design. At its core are two parallel simulation engines for SYSTEMC based simulators: SCOPE and SYSTEMC-LINK. Both approaches target the performance bottleneck of current state-of-the-art VPs, i.e., the sequential simulation approach employed in all major SYSTEMC implementations today [1, 182]. However, in order to provide useful solutions for ESL design, it is not sufficient to provide parallel simulation engines alone. Problems raised by introducing parallelism can sometimes only be solved by combining support from those engines with novel modelling primitives. To that end, three new modelling primitives are introduced: flexible time-decoupling, exclusive memory access and processor sleep states. Together with SCOPE and SYSTEMC-LINK, these primitives form a basis for the construction of high performance VPs for current and next generation systems. In summary, the contributions of this thesis are as follows:

SCOPE: A parallel SYSTEMC kernel is presented that accelerates simulation speed by taking advantage of contemporary multi-core host architectures. It retains compatibility with existing VPs by adhering to the SYSTEMC API set forth in the IEEE standard 1666 [80]. For optimal performance, the simulation operates in a time-decoupled fashion, meaning that parts of the simulation are allowed to simulate ahead of time.

SYSTEMC-LINK: A SYSTEMC co-simulation engine is introduced that is capable of coupling different SYSTEMC-based simulators to form a single VP with a shared memory address space. Similar to SCOPE, time-decoupling is deployed to boost simulation performance. Additionally, SYSTEMC-LINK provides a virtual sequential environment that enables the use of non-thread-safe models in a parallel simulator.

Flexible Time-Decoupling: This modelling primitive addresses the issue of unpredictable inter-thread communication. Time-decoupled simulators naturally require ahead-of-time knowledge about transactions that are exchanged between different time zones to retain temporal correctness. For example, data received from a component that is ahead in time may not be acted upon until the local time has caught up.

Exclusive Memory Access: Optimistic atomic memory accesses can often not be translated natively into atomic operations on the host, since x86 lacks support for exclusive loads and stores. This becomes especially problematic when transitioning into a parallel simulation environment. Therefore, a modelling primitive is presented that retains functionally correct operation of exclusive memory accesses in such cases.
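As background for this contribution, the snippet below sketches the commonly used workaround of emulating a load-linked/store-conditional pair with a host compare-and-swap (the class and all names are ours; this is not the model proposed in Chapter 7). The final comment hints at the limitation that motivates a dedicated modelling primitive: a compare-and-swap cannot detect the ABA problem that a real exclusive monitor would catch.

#include <atomic>
#include <cstdint>

// Naive LL/SC emulation on top of a host compare-and-swap; illustrative only.
struct ExclusiveMonitor {
    std::atomic<uint32_t>* addr = nullptr; // location marked by the exclusive load
    uint32_t linked = 0;                   // value observed by the exclusive load

    uint32_t load_exclusive(std::atomic<uint32_t>* a) {
        addr = a;
        linked = a->load();
        return linked;
    }

    bool store_exclusive(std::atomic<uint32_t>* a, uint32_t value) {
        if (a != addr)
            return false;  // no matching exclusive load
        uint32_t expected = linked;
        // Succeeds as long as the current value equals 'linked', even if it was
        // changed and changed back in the meantime (the ABA problem).
        return a->compare_exchange_strong(expected, value);
    }
};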

Processor Sleep States: While parallel simulation approaches appear beneficial for high load scenarios, they also impose significant overhead during phases with little to no simulation activity. However, modern architectures will frequently turn off processors to save energy. To exploit this, a modelling primitive is presented that further boosts parallel simulation performance by skipping over idle simulation phases.

1.6 Synopsis

After giving an overview of embedded systems and trends in embedded system design, this chapter introduced VPs as an important tool for embedded SW development. The simulation speed degradation that many next generation platforms are encountering already today was pointed out. It represents the core motivation for this thesis. Furthermore, this chapter briefly visited important aspects of state-of-the-art VP construction: the SLDLs relevant today for ESL design and the underlying structure that comprises a modern VP. Finally, the contributions of this thesis beyond the state-of-the-art were enumerated.

The remainder of this work is structured as follows. First, Chapter 2 presents the DES algorithm employed by SYSTEMC in detail, before discussing issues that arise when replacing it with a parallel one, such as the emergence of race conditions and the loss of deterministic execution. Subsequently, Chapter 3 outlines other scientific work in the domain of parallel and ESL simulation with a focus on VP design. Next, the VPs used throughout this thesis are introduced in Chapter 4. A detailed discussion of each aforementioned contribution of this work can be found in Chapters 5 – 9. Chapter 10 concludes this work and gives an outlook on potential future work in the domain of parallel simulation for ESL design.

Additionally, a Glossary lists all mentioned abbreviations and acronyms. Appendix A briefly visits selected concepts of SYSTEMC, including extensions for transaction level modelling. Appendix B presents the Virtual Components Library (VCL), which contains transaction level models for commonly used components, such as memories, buses and peripherals. Finally, Appendix C holds raw measurement data and additional information for selected experiments.

Chapter 2

Background

The previous chapter introduced VPs as state-of-the-art tools for ESL design and underlined the problem that sequential SYSTEMC performance does not scale with increasing design complexity. To better understand the contributions of this thesis in the domain of parallel SYSTEMC simulation, it is crucial to first lay out the sequential approach deployed by current implementations and discuss issues that are likely to arise when adopting parallel simulation strategies.

To that end, this chapter discusses the following background information: First, Section 2.1 explains the fundamental DES algorithm with a focus on SYSTEMC specific aspects. Next, Section 2.2 points out how changes in this approach can negatively impact the deterministic properties of VPs. Problems related to the parallel execution of non-thread-safe code are identified in Section 2.3 before Section 2.4 introduces the synchronisation problem for parallel simulators. Finally, Section 2.5 concludes this chapter with a brief summary.

2.1 Discrete Event Simulation

On a fundamental level, a SYSTEMC simulation is composed of model state, processes and events. State represents properties of the simulation models, such as registers, signal values, current power consumption, etc. It is intended to be modified by processes, which execute code to model state changes over time. These processes cannot be executed directly and therefore must be bound to events. Events represent state changes at discrete points in time and are more formally introduced by Definition 2.1.

Definition 2.1 An event $e$ describes a simulation state change at a particular point in time. It is modelled as the triple $e = (t_e, S_e, D_e)$, with $S_e \cup D_e = \{p_1, ..., p_n\}$ being sets of processes to be executed once the event occurs at timestamp $t_e$.

Events alone cannot alter simulation state directly, since only processes are capable of executing code and thereby model component behaviour. Instead, processes must link themselves to events in order to signal a request for execution at timestamp $t_e$. Events and processes are linked by means of sensitivity sets: $S_e$ and $D_e$ represent static and dynamic sensitivity sets, respectively. Processes are allowed to alter their dynamic sensitivity set; static sensitivity, however, is fixed upon process creation. The different kinds of sensitivity are more formally introduced in Definition 2.2.

Definition 2.2 A process $p$ is called sensitive to an event $e$ if $p \in S_e \cup D_e$. Consequently, $p$ is called statically sensitive to $e$ if $p \in S_e$ and, respectively, it is called dynamically sensitive to $e$ if $p \in D_e$.
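In SYSTEMC terms, static sensitivity is declared at elaboration time via the sensitive list, whereas dynamic sensitivity is established at run time by wait() or next_trigger() calls. The following module (an illustration of ours, not taken from the thesis) shows both forms:

#include <systemc.h>

SC_MODULE(SensitivityDemo) {
    sc_event e;

    SC_CTOR(SensitivityDemo) {
        SC_METHOD(static_proc);
        sensitive << e;       // static sensitivity: static_proc stays in S_e
        dont_initialize();

        SC_THREAD(dynamic_proc);
    }

    void static_proc() {
        // runs every time e is triggered
    }

    void dynamic_proc() {
        wait(e);              // dynamic sensitivity: joins D_e for this wait only
        // once e has triggered, the dynamic sensitivity to e is cleared again
    }
};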


Internally, the SYSTEMC kernel keeps a list (EQ) of upcoming events sorted in ascending order by their associated timestamps $t_e$. To advance the simulation, the event $e$ with the smallest timestamp is fetched from EQ and the simulation time $t_{sim}$ is set to $t_e$. Advancing simulation time in a discrete fashion by jumping from event to event and skipping over idle phases is the fundamental concept behind DES.

Once an event with a suitable $t_e$ has been fetched, the scheduler moves all its sensitive processes into the ready queue (RQ). If there are multiple events with identical $t_e$, all are fetched and their processes are handled in the same way. In the context of SYSTEMC, this procedure is referred to as triggering:

Definition 2.3 An event $e$ can be triggered, causing all processes $p \in S_e \cup D_e$ to be scheduled for execution. This also clears the dynamic sensitivity set of $e$ and removes $e$ from EQ. More formally: $RQ = RQ \cup S_e \cup D_e$, $D_e = \emptyset$, $t_e = \infty$.

After triggering, the scheduler begins executing the processes from RQ in a cooperative manner. This means that individual processes have to explicitly yield execution in order to allow others to run. When using the reference implementation of SYSTEMC from OSCI [1], only a single process is allowed to execute at any given time. This frees VP designers from worrying about concurrent accesses to shared resources by different simulation models. However, this also inhibits the utilisation of the explicit parallelism of SYSTEMC processes by means of multi-core workstation PCs.

Simulation initially starts with an empty event queue $EQ = \emptyset$ and a ready queue filled with startup processes $RQ = \{p_1^{start}, ..., p_N^{start}\}$, which are defined by the component models. It is the task of these startup processes to fill EQ with events modelling component behaviour. Events are added to EQ by means of notification:

Definition 2.4 An event $e$ can be notified using a time offset $t_{notify,e}$. This causes $e$ to be triggered at timestamp $t_e = t_{sim} + t_{notify,e}$. If $t_{notify,e} = 0$, this is called a delta notification and all processes $p \in S_e \cup D_e$ are executed in the next delta cycle.

Definition 2.4 states that it is possible to notify events without delay. In order to impose a structural order on events occurring at the same time, SYSTEMC introduces the concept of delta cycles. Events notified using delta notifications are guaranteed not to be triggered until all events from the current delta cycle have finished and the next delta cycle has begun. Immediate notifications are an exception to this rule. They are sometimes employed instead of delta notifications due to their smaller performance overhead. Immediate notifications are introduced in Definition 2.5 as follows:

Definition 2.5 An immediate notification refers to directly triggering an event $e$ without prior notification. Sensitive processes $p \in S_e \cup D_e$ are directly scheduled for execution in the same delta cycle.
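The three notification variants map onto the sc_event interface roughly as follows (illustrative snippet of ours, not taken from the thesis). Since an event holds at most one pending notification, the example lets each notification trigger before issuing the next one:

#include <systemc.h>

SC_MODULE(Notifier) {
    sc_event ev;

    SC_CTOR(Notifier) { SC_THREAD(run); }

    void run() {
        ev.notify(10, SC_NS);     // timed notification: triggers at t_sim + 10 ns
        wait(20, SC_NS);

        ev.notify(SC_ZERO_TIME);  // delta notification: triggers in the next delta cycle
        wait(SC_ZERO_TIME);

        ev.notify();              // immediate notification: sensitive processes are
                                  // scheduled within the current evaluation phase
    }
};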

Besides using notification as described in Definition 2.4 and Definition 2.5, processes may also communicate with each other using channels. Channels support synchronised read and write operations and offer processes an event $e_c$ to react to value changes. The simulation kernel keeps a list of channels (CQ) at all times. Definition 2.6 defines channels more formally:

[Flowchart: the kernel repeats an evaluation phase (WQ = RQ; run every process p in WQ; repeat while RQ is not empty), an update phase (update every channel c in CQ; delta = delta + 1) and a notification phase (fetch the next event e from EQ, set $t_{sim} = t_e$, trigger e) until EQ is empty.]

Figure 2.1: Simplified SYSTEMC simulation loop

Definition 2.6 A channel is modelled as the triple $c = (r_c, w_c, e_c)$, with $r_c$ being the value returned when the channel is read. When the channel is written, the new value is stored in $w_c$ and the event $e_c$ receives a delta notification if $w_c \neq r_c$.

To avoid execution order conflicts between processes reading and writing the channel in the same delta cycle, written values are cached and only applied at the end of any given delta cycle during a channel update:

Definition 2.7 An update of a channel $c$ refers to the process of applying new channel values written during a previous delta cycle, i.e., $r_c = w_c$. Only this makes them visible for read operations during subsequent delta cycles.
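For the most common primitive channel, sc_signal, this update behaviour can be observed directly (illustrative snippet of ours):

#include <systemc.h>
#include <iostream>

SC_MODULE(ChannelDemo) {
    sc_signal<int> sig;    // a primitive channel, roughly c = (r_c, w_c, e_c)

    SC_CTOR(ChannelDemo) { SC_THREAD(run); }

    void run() {
        sig.write(42);                         // only stores the new value into w_c
        std::cout << sig.read() << std::endl;  // still prints the old r_c (0)
        wait(SC_ZERO_TIME);                    // let the update phase apply r_c = w_c
        std::cout << sig.read() << std::endl;  // now prints 42
    }
};

int sc_main(int, char*[]) {
    ChannelDemo demo("demo");
    sc_start();
    return 0;
}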

With the help of Definitions 2.1–2.7 it is now possible to give a simplified overview of the simulation loop of SYSTEMC, as illustrated by Figure 2.1. Conceptually, the loop is broken down into three phases. First, the evaluation phase takes place. All processes from RQ are moved into a scheduler-internal waiting queue WQ and are then executed. Meanwhile, new processes may be added to RQ due to immediate notifications. Consequently, this phase may need to be repeated multiple times until there are no more processes scheduled for execution, i.e., until $RQ = \emptyset$. During the update phase the kernel synchronises all of its channels and advances the delta cycle counter. Finally, the notification phase triggers the next event from EQ and advances the simulation time $t_{sim}$. Since this causes new processes to become ready for execution, the loop is repeated with another evaluation phase. Simulation ends once all events from EQ have been handled or, optionally, when $t_{sim}$ exceeds a user defined maximum.
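The loop of Figure 2.1 can be paraphrased in code as follows. This is a deliberately simplified, self-contained sketch of the scheme (not the actual SYSTEMC kernel sources): processes and channel updates are reduced to plain callbacks, and every event carries the merged set of its sensitive processes.

#include <functional>
#include <queue>
#include <vector>

// Schematic sketch of the simplified simulation loop from Figure 2.1.
struct Kernel {
    using Process = std::function<void()>;

    struct Event {
        double time;                     // t_e
        std::vector<Process> sensitive;  // S_e and D_e, merged
    };
    struct Later {                       // orders EQ by ascending timestamp
        bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
    };

    std::priority_queue<Event, std::vector<Event>, Later> EQ;  // upcoming events
    std::vector<Process> RQ;               // processes ready for execution
    std::vector<Process> channel_updates;  // update(c) callbacks, one per channel in CQ
    double   t_sim = 0.0;
    unsigned delta = 0;

    void run() {
        while (true) {
            // Evaluation phase: running a process may refill RQ (immediate
            // notification), so repeat until RQ stays empty.
            while (!RQ.empty()) {
                std::vector<Process> WQ;
                WQ.swap(RQ);
                for (const Process& p : WQ)
                    p();                   // run(p)
            }

            // Update phase: apply pending channel writes (r_c = w_c).
            for (const Process& update : channel_updates)
                update();
            delta++;

            // Notification phase: advance time to the next event and trigger it.
            if (EQ.empty())
                break;                     // all events handled: simulation done
            Event e = EQ.top();
            EQ.pop();
            t_sim = e.time;
            for (const Process& p : e.sensitive)
                RQ.push_back(p);           // RQ = RQ u S_e u D_e
        }
    }
};

In a real kernel, processes registered as callbacks would capture a reference to this structure so that they can notify events (push to EQ) or issue immediate notifications (push to RQ) while they run.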

[Diagram: process $p_0$ sets $i = 1$ and notifies an event; processes $p_1$ ($i = i + 1$) and $p_2$ (print $i$) both run when the event triggers; which one runs first?]

Figure 2.2: Nondeterministic simulation exposing a process execution order dependency between processes $p_1$ and $p_2$

2.2 Deterministic Simulation

Determinism is a desirable attribute for most development tools and VPs are no exception to this. Being able to reliably reach the same state in consecutive simulation runs eases bug hunting sessions, where programs are usually executed repeatedly to isolate errors. During design space exploration, a deterministic VP can deliver more reliable information on system characteristics, such as execution time, cache utilisation and power consumption. As with most major software projects, VPs incorporate many sources of potential nondeterminism, too. These common sources are presented in the following:

• A dependency on User Input frequently results in nondeterministic simulation behaviour. Input usually reaches a VP via an interrupt driven UART. Even minor delays during typing can cause noticeable changes in the interrupt handler mechanisms of the target OS, significantly altering simulation timing.

• Intended nondeterminism via the use of Pseudo Random Number Generators (PRNGs). The use of these components aims at strengthening the cryptographic capabilities of a system, but also voids deterministic simulation by construction.

• Iteration over unordered data or data ordered by pointer value or creation timestamp results in nondeterministic execution. An example of this from the C++ domain would be applying the sort function to a vector of pointers (see the sketch after this list).
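The pointer-sorting pitfall from the last bullet can be reproduced with a few lines of C++ (illustrative only):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int*> items{new int(3), new int(1), new int(2)};

    // The sort key is the pointer itself, i.e. a heap address, not the value it
    // points to. Heap addresses can differ from run to run (allocator behaviour,
    // address space layout randomisation), so the resulting order, and anything
    // derived from it, is not reproducible.
    std::sort(items.begin(), items.end());

    for (int* p : items)
        std::cout << *p << ' ';
    std::cout << '\n';

    for (int* p : items)
        delete p;
    return 0;
}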

Beyond these common sources, a SYSTEMC based VP also needs to take into account potential nondeterminism introduced by the simulation engine. If a simulation is written in a way that its output depends on the order in which the scheduler executes the simulation processes, the result is undefined.

This is illustrated by Figure 2.2: it presents a simulation consisting of three processes ($p_0$, $p_1$ and $p_2$) with a shared variable $i$, which is first initialised by $p_0$ to 1.

[Table: two interleavings of threads A and B, each reading, incrementing and writing back a shared value; with one interleaving the result is 2, with overlapping read-modify-write sequences both threads write 1.]

Figure 2.3: Data race: lack of synchronisation between intended atomic regions of two threads sometimes leads to erroneous results.

Subsequently, $p_0$ triggers an event, which in turn causes $p_1$ and $p_2$ to be scheduled for execution. In such cases, the SYSTEMC standard leaves it open to the implementation which process should execute first¹. It only enforces that, should this situation occur multiple times, the result shall always be the same, even over multiple simulation runs. As a consequence, it is undefined whether $p_2$ runs first (producing the output "1") or last (producing the output "2").

Dependencies on process execution order are by no means just theoretical. Previous work [166, 206] has already identified a similar issue within an event queue model shipped with OSCI SYSTEMC – a model frequently used as the basis for many transaction level memory-mapped buses. It has been shown that this dependency could be exploited to cause any VP using this model to crash, just by inverting the process execution order [166, 206].

Process execution order dependencies become an even greater problem when migrating towards parallel simulation environments. Considering the example from Figure 2.2 again, a naive implementation could issue processes $p_1$ and $p_2$ for parallel execution, since they both become runnable in the same delta cycle. This not only causes the execution order of $p_1$ and $p_2$ to become unpredictable, but also creates a race condition around the global variable $i$. Protection from race conditions, e.g. via proper synchronisation, is not enforced by SYSTEMC and their prevention therefore becomes the task of the employed parallel simulation engine.
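Translated into code, the scenario of Figure 2.2 looks roughly like the following SYSTEMC module (our reconstruction; process and variable names follow the figure). Whether it prints "1" or "2" depends solely on the order in which the scheduler runs p1 and p2:

#include <systemc.h>
#include <iostream>

SC_MODULE(OrderDemo) {
    sc_event e;
    int i;

    SC_CTOR(OrderDemo) : i(0) {
        SC_THREAD(p0);
        SC_THREAD(p1);
        SC_THREAD(p2);
    }

    void p0() { i = 1; e.notify(SC_ZERO_TIME); }         // initialise i, then notify e
    void p1() { wait(e); i = i + 1; }                    // increment once e triggers
    void p2() { wait(e); std::cout << i << std::endl; }  // prints "1" or "2"?
};

int sc_main(int, char*[]) {
    OrderDemo demo("demo");
    sc_start();
    return 0;
}

With the reference implementation the output is repeatable across runs, but which of the two values is printed is an implementation detail of the kernel's process ordering.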

2.3 Race Conditions

The lack of strong requirements for proper shared data synchronisation between dif- ferent SYSTEMC processes makes VPs susceptible to race conditions. A race condition arises whenever the correct execution of a piece of software depends on the timing of processes or threads. Previous work [133] identifies two types of race conditions: data races and general races. Data races are concerned with the atomicity of operations. With the exception of specific compiler known functions, most hand written C/C++ code does not execute

1 IEEE Standard 1666-2011, SystemC Language Reference Manual [80], Chapter 4.2.1.2 20 Chapter 2. Background

Thread A Thread B Value Thread A Thread B Value lock 1 lock 1 read 1 read 1 increment 1 square 1 write 2 write 1 unlock lock 2 lock unlock 1 read 2 read 1 square 2 increment 1 write 4 write 2 unlock 4 unlock 2

Figure 2.4: General race: without enforcing a specific order between threads access- ing shared data, program execution may become nondeterministic. atomically. As a consequence, operations on data shared between two threads must be properly synchronised or may otherwise lead to unintended results.

Definition 2.8 A data race is encountered when two or more processes perform a nonatomic memory operation on shared data. Data races cause programs to behave nondeterministically and lead to incorrect program execution and general failures.

Figure 2.3 gives an example for data races in the sense of Definition 2.8. The programmer intended to have each of the two threads increment a global variable once and therefore expects to see the result 2 at the end. While this happens in the case depicted on the left-hand side of Figure 2.3, a different result can be observed on the right-hand side. Due to scheduling delays outside of the control of the programmer, the store operation of thread A gets delayed, causing it to be executed only after the load of thread B. Consequently, thread B only sees the original value and, like thread A, also computes the increment of "0" instead of "1" as intended. Finally, both threads store the value "1" back to memory. This happens, because the instructions sequence used to increment the value is not atomic. Two solutions exist to overcome this issue. Firstly, one can use increment oper- ations that are guaranteed to execute atomically, such as the lock.xadd instruction in the x86 architecture [83], or one of the compiler known atomic functions [50]. Sec- ondly, a mutex can be used. By convention, this mutex must first be locked before it is allowed to operate on shared data associated with it. Once the operation has completed, the mutex must be unlocked before any other thread can lock it again. The lock- and unlock-operations are provided by the programming environment and are guaranteed to execute atomically. General races are fundamentally similar to process execution order dependencies in SYSTEMC. In this case, the atomicity of code regions is guaranteed, however, their order of execution is not. Depending on which thread is allowed to enter its critical region first, the output of the program might change. General races rarely trigger bugs, but still cause program execution to become nondeterministic. 2.4. Synchronisation Problem 21

Definition 2.9 A general race is encountered when the behaviour of a program de- pends on the order of atomic memory operations performed on shared data by two or more processes. A general race is a failure in programs intended to be deterministic.

Having introduced data- and general races, it is now possible to formally specify race conditions for general parallel programs in Definition 2.10.

Definition 2.10 A program is subject to a race condition if it is subject to either a data race (c.f. Definition 2.8), a general race (c.f. Definition 2.9), or both.

An example for a general race as specified in Definition 2.9 is given in Figure 2.4. Two threads try to lock a mutex in order to be allowed to operate on shared data. Upon successful acquisition, thread A increments that value, while B squares it. When both threads operate in parallel, the result depends on which thread locks the mutex first, i.e., if thread A runs first, the result will be 4, otherwise 2. It is the responsibility of the programmer to enforce the correct ordering between two racing threads. In the context of SYSTEMC, any simulator exposing a dependency on process execution order will be subject to a general race in a parallel simulation environment. However, the inverse is not true: VPs may still be subject to general races, even if no dependency on process execution order exists. This is due to the fact that sequential simulators might still deploy multiple processes, e.g., for networking or external debuggers.

2.4 Synchronisation Problem

Naive approaches to parallel simulation may now simply decide to execute SYSTEMC processes in parallel, potentially protecting global data structures from race conditions using mutexes or splitting them up among the threads. However, such approaches quickly run into the problem of handling simulation time differences between threads advancing at different speeds, e.g., due to load imbalance or insufficient host pro- cessing resources. This effect creates different time zones within a simulator and complicates temporally correct inter-thread communication due to the appearance of causality errors:

Definition 2.11 A causality error is encountered whenever a simulator has to process an event a after it has already triggered another event b, with ta < tb so that tsim > ta holds true.

Processing events not in ascending timestamp order can result in individual events to be triggered too late, as described in Definition 2.11. Doing so results in past state being affected by future actions, a reversal of cause and effect, hence the name. An example for the occurrence of an causality error is given in Figure 2.5. It is assumed that the two processes A and B execute in parallel and advance their own simulation time using timed notifications on the events A0, A1 and B0, B1, respectively. After executing for 3 s, process A has reached simulation timestamp 50ns, while pro- cess B is at 20 ns. Subsequently, B issues a notification on event A2, requesting process 22 Chapter 2. Background

Process A simulation time Process B A1 causality 50ns Notification error notify(30ns) A2 40ns B2

30ns notify(20ns)

notify(20ns) 20ns A0 B1

10ns B0 notify(10ns) notify(20ns) 1s 2s 3s 4s notify(10ns) wall-clock time

Figure 2.5: Causality Error: in order to trigger event A2 at timestamp 30 ns, process A would need to go back in time.

A to be executed at timestamp 40 ns. However, A has already advanced too far into the future to satisfy the request of B since it has already triggered A1. More formally, < a causality error is encountered because tA2 tA1 . Strategies to avoid causality errors can be classified into two groups: conserva- tive and optimistic. Conservative approaches employ rigid synchronisation schemes to prevent causality errors from occurring. This achieves complete temporal correctness but at the cost of reduced parallel performance due to synchronisation overhead. Op- timistic approaches do not directly prevent causality errors from occurring. Instead, the simulation is rolled back to a known-good state, once an error is detected. Parallel performance of optimistic approaches strongly depends on the frequency of errors and the efficiency of the rollback mechanism. Due to the complex nature of modern VPs, rollback strategies appear infeasible, which is why optimistic approaches are rarely encountered in reality2. Among conservative approaches, different synchroni- sation strategies have emerged:

• Synchronous approaches utilise global synchronisation points, for example at the end of a quantum or a delta cycle. Each process needs to wait at such a synchronisation point, until every other process has also reached it. Figure 2.6 visualises this synchronisation scheme for the example given in Figure 2.5. In this case, process A has to wait at a synchronisation barrier and cannot trigger event A1, allowing process B to notify event A2 before it is too late.

2 The term optimistic is not used consistently throughout literature. For example, [90] uses it to describe what [53, 30, 125] would call an asynchronous conservative approach. 2.4. Synchronisation Problem 23

barrier synchronization

Process A A0 A2 A1

Process B B0 B1 B2

0ns10ns 20ns 30ns 40ns 50ns simulation time

Figure 2.6: Conservative parallel simulation using synchronous event processing

lookahead

Process A A0 A2 A1

Process B B0 B1 B2

0ns10ns 20ns 30ns 40ns 50ns simulation time

Figure 2.7: Conservative parallel simulation using asynchronous event processing

• Asynchronous approaches allow individual processes to simulate ahead in time, but limit this using a lookahead. Figure 2.7 applies this synchronisation scheme ∆ to the example from Figure 2.5. It assumes a lookahead of tla = 10ns and therefore pauses process A at timestamp 35ns, since B is still only at 25ns. This leaves enough time for B to issue a notification of event A2 to be triggered at 40 ns. For asynchronous approaches to work correctly, timed notifications must ∆ be issued sufficiently ahead of time, i.e., tnotify,e > tla.

The actual implementation of both, synchronous and asynchronous algorithms depends on the nature of the simulation host. If shared memory is available, barriers and mutexes can be utilised to construct both kinds of synchronisation schemes. In case of a distributed system, simulation processes need to synchronise with each other using message passing. In the context of VP development, the choice between a parallel and distributed simulation approach also depends on the properties of the target system. For example, a VP modelling an SMP system with a single global memory and address space is not suited very well for a distributed approach. If the memory of a processor is simulated by a different process, possibly running on a different host than the virtual processor that wants to access it, every memory operation (i.e, each read, write and fetch operation) would need to be done using message passing. However, transmission of messages is much slower than pointer accesses using shared memory, defeating the point of distributed simulation for performance in the first place. 24 Chapter 2. Background

2.5 Synopsis

Applying parallel simulation techniques to existing VPs is a tremendously difficult task and this chapter has given more background information as to why this is the case. Firstly, the discrete event simulation fundamentals of SYSTEMC were introduced. Already here the first challenges become apparent: the simulation is represented as a discrete list of events, which has to be processed in-order to prevent causality errors. How can this naturally sequential approach be parallelize optimally to best utilise modern multi-core workstations and achieve good performance? Current SYSTEMC implementations offer VP designers certain guarantees of deter- minism, for example with regards to process execution order. It is likely that parallel approaches will not be able to uphold those in all circumstances. Which tools and modelling primitives must therefore be available or developed to support determin- ism in modern parallel VPs? Probably the biggest challenge to overcome for parallel simulation techniques is the legacy code base of most model providers. Since SYSTEMC developers from the beginning never had to worry about concurrency related issues, how can these legacy models be protected from race conditions? Providing answers to only a subset of these questions appears futile. Novel ap- proaches in the domain of parallel SYSTEMC for ESL design must not only provide good parallel performance, but also deterministic modelling primitives that help VP designers to incorporate models into their design that were not designed with paral- lelism in mind. It is the goal of this thesis to provide both. Chapter 3

Related Work

Although the principles of Parallel Discrete Event Simulation (PDES) have been well researched for over thirty years now, it has still not found its way into mainstream ESL simulation. The main problems hindering its adoption have been outlined in Chapter 2. Yet, the combination of multi-core workstation PCs being widely available today and the need for fast simulation techniques for complex multi-core platforms makes PDES still appear as an attractive field of research. This chapter provides a general overview over past and present work from this research field. After briefly visiting traditional parallel simulation approaches in Sec- tion 3.1, it focuses on parallel simulation of SYSTEMC based VPs. To that extent, dif- ferent PDES approaches are discussed in Section 3.2 and relevant publications are enumerated. Subsequently, Section 3.3 outlines alternative approaches to accelerate ESL simulations such as fast instruction set- and hybrid simulation. This chapter is concluded with a brief summary in Section 3.4.

3.1 Traditional Parallel Simulation

One of the foundational works on PDES was presented by Chandy et al. in 1979 [30]. In their work, the authors present a distributed simulator using an asynchronous event processing scheme. It uses messages to communicate timestamp and lookahead information between the simulation processes in the absence of a central scheduler. Later, in 1985, the first optimistic approach to PDES is presented by Jefferson [86]. He introduces the concept of virtual time, which is today more commonly referred to as simulation time, and imposes it onto a distributed simulation. In case of causality errors the system uses antimessages to initiate a rollback to a known good state. Subsequent work in the domain of PDES builds upon and extends these early de- signs. For example, the problem of deadlock detection and prevention in distributed simulators is discussed by Misra et al. [125]. In 1988, Chandy et al. [29] extend previous synchronous and asynchronous simulators with conditional event execution. Events that can be guaranteed not to lead to causality errors with other processes are con- sidered safe to execute, consequently boosting performance of distributed simulators with infrequent process interactions. At this point it became apparent that PDES is a difficult problem and that the implementation of efficient synchronisation mechanisms is highly error-prone. Con- sequently, Bagrodia et al. [14] propose to have a programming language dedicated for writing efficient parallel simulations. For the same reason, Nicol et al. [137] present a method that assembles a distributed simulator out of simpler sequential segments.

25 26 Chapter 3. Related Work

In the following years, many reports were published that summarised the state of the art in PDES and established a common nomenclature [51, 52, 136, 54, 33]. At that time, most research focused on distributed simulation, probably because multiproces- sor computers were not readily available until the turn of the century. One of the first works considering both, parallel and distributed simulation hosts, was presented by Fujimoto [53] in 1999.

3.2 Parallel SystemC

The initial version of SYSTEMC was released in 2000 [19, 67]. At that time, parallel simulation was not yet a concern and deterministic operation was given preference. Because of this and the fact that PDES has proven to be challenging, its first prototype operated strictly sequentially. Also with subsequent update releases, such as SYSTEMC 2.2.0 in 2005 [79] and SYSTEMC 2.3.0 in 2011 [80], this did not change. It became the task of researchers to propose parallel simulation techniques that could be retroactively applied to simulators that were assuming sequential operation. This section gives an overview about these techniques. First, Section 3.2.1 and Section 3.2.2 present related work proposing parallel SYSTEMC simulators utilising synchronous and asynchronous approaches, respectively. Section 3.2.3 is dedicated to distributed simulators utilising message based synchronisation schemes similar to traditional PDES. Next, Section 3.2.4 investigates techniques, where parts of the sim- ulation are offloaded to special accelerators, such as GPUs and Field Programmable Gate Arrays (FPGAs). Finally, related work is summarised in Section 3.2.5.

3.2.1 Synchronous Simulation

In the context of SYSTEMC, synchronous parallel simulation can be achieved using a straightforward approach. Considering the SYSTEMC simulation algorithm depicted in Figure 2.1, it is sufficient to parallelize the execution of processes during the eval- uation phase to achieve synchronous behaviour. However, during that phase any process may access shared state, such as model attributes or the simulation context, so special precaution must be taken to protect it from race conditions. One of the first synchronous parallel SYSTEMC kernels was presented by Chopard et al. [37] in 2006. Simulation processes are distributed onto host threads by the use of node modules, which serve as a structuring element within the object hierarchy of SYSTEMC. Subsequent work by Combes et al. [40] improves on this approach by re- laxing the synchronisation overhead under certain circumstances. In 2009, Ezudheen et al. [143] published a synchronous parallel SYSTEMC kernel that utilises a work steal- ing algorithm in order to achieve automatic load balancing – a task that previous work left to the programmer. In the following year the first open source simulation kernel parsysc was released by Krikun et al. [104, 103]. Contrary to previous approaches, this one uses Intel Thread Building Blocks (ITTB) as the underlying multithreading library, instead of the more commonly used POSIX Threads (pthreads). 3.2. Parallel SystemC 27

In 2011, Yang et al. [218] utilised a synchronous PDES algorithm in order to in- crease simulation performance of the SYSTEMC-based Metropolis system design envi- ronment [15]. A similar approach was also taken by Liao et al. [110] to accelerate cycle-level simulation in UNISIM [12, 192]. However, synchronous parallelization does not necessarily need to happen on delta cycle level as shown by Chung et al. using the SimParallel SYSTEMC simulator [38]. In their approach the authors execute all processes in parallel that respond to a global clock signal. Until this point all previous work only considered gate level SYSTEMC simulation. But with TLM gaining increased popularity in the ESL modelling community due to its higher abstraction level and the resulting performance gains, new approaches were investigated that are better suited for those simulators. One challenge is the compar- atively low simulation work load during single delta cycles, which is insufficient to offset the parallelization overhead introduced by synchronous PDES. As a solution, Ventroux et al. extend their parallel SYSTEMC kernel SCale [193] with a new modelling primitive. It ensures that all processes synchronise their TLM quantum so that they are executed during the same delta cycle [194]. Another challenge for TLM based simulators is that components frequently ex- change messages using Interface Method Calls (IMCs) and as such are prone to race conditions when sender and receiver of a message operate in parallel. Previous work chose to limit inter-thread communication to predefined channels, such as ports and signals in SYSTEMC. For IMCs this is not possible without altering model code. To cir- cumvent this issue, Schumacher et al. extend the parallel SYSTEMC kernel PARSC [165] with so-called containment zones [168, 169]. Components residing in the same zone are simulated sequentially and thereby avoid race conditions. IMCs between compo- nents operating in parallel in different zones are automatically intercepted by a zone gate and subsequently forwarded in a thread-safe manner to the receiver.

3.2.2 Asynchronous Simulation

As already noted in the previous section, synchronous parallel SYSTEMC approaches generally struggle to achieve good performance for transaction level simulators. Be- cause they are optimised for speed, there is only little simulation activity during each individual evaluation phase. This motivated research on asynchronous PDES ap- proaches that can parallelize beyond the delta cycle level. In this context, a common strategy is to group simulation activity from multiple cycles together to form a quan- tum. The quanta for each simulation component can then be efficiently executed in parallel. However, since each quantum may start and end at different timestamps, advanced time synchronisation strategies are needed to avoid causality errors. One possible approach to this challenge is to leave the management of time to the simulation designer. In TLM/T [195], Viaud et al. propose to assign each simulation component its own local time. Messages sent between components receive the local time of the sender and it is up to the receiver to handle proper time synchronisation. This work is later extended by Mello et al., proposing a new modelling style called TLM-DT [119]. It extends the TLM standard with new facilities to manage synchro- 28 Chapter 3. Related Work nisation of distributed time, as well as with new interfaces to propagate local time information between sender and receiver. Lookahead-based approaches allow individual simulation processes to run ahead in time, but limit the local time differences between any two processes. But if inter- process communication cannot be stated sufficiently ahead of time, such approaches incur timing errors when compared to a sequential simulation. In spite of that, the re- duction of synchronisation overhead enables efficient parallel execution, even if simu- lation load is distributed over multiple delta cycles. The first approach of this category was presented by Jones in 2011 [90] for a synthetic target system. Subsequent work focused on lookahead-based parallel simulation of Network-on-Chip (NoC) based architectures [149, 150], since network routing and transmission delays make ahead- of-time communication straightforward. In 2013, Moy introduced the sc-during primitive [129, 128]. It allows developers of a VP to execute a task in parallel and asynchronously to the main simulation, but guarantees that the simulation will wait if the task has not finished after a given du- ration. Furthermore it provides facilities that allow tasks to interact with the SYSTEMC simulator in a thread-safe way. However, in order to take advantage of sc-during, extensive redesign of the VP code is required. The last category of asynchronous PDES approaches utilise static source code analysis methods to identify simulation processes that are safe to execute in parallel, e.g. because they never access shared data. For the domain of SPECC, Chen et al. present an extension to the SPECC kernel that allows processes to be executed out of simulation time order [34, 32, 35]. Subsequent work attempts to introduce this concept also to SYSTEMC based VPs, e.g. by Liu et al. [112]. This idea has already been investigated for gate level SYSTEMC simulations by Savoiu et al. back in 2002 [159, 160]. However, static source code analysis based approaches appear difficult in real life scenarios, given that IP vendors provide their models generally in precompiled form and without access to the source code. 
Moreover, the frequent use of pointers in transaction level and functional models further complicates this analysis.

3.2.3 Distributed Simulation

Distributed simulation approaches generally distribute the simulation load among multiple host processes, often on different hosts. Consequently, shared resource man- agement and scheduling is not available, leaving the task of load-balancing to the simulation developer. Partitioning a VP for distributed simulation is not a trivial task, since communication latencies between different hosts or even between two proces- sors on the same host must be taken into account for optimal performance. Early research in this field with a focus on SYSTEMC was presented by Meftali et al. in 2004, proposing charge-balancing [116] to improve simulation speed of a distributed SYSTEMC simulator using Simple Object Access Protocol (SOAP) based communication interfaces [118, 117]. In the following year, Cox presented RITSim [41], an architec- tural simulator based on SYSTEMC and Message Passing Interface (MPI). In this work, partitioning is performed automatically at compile time based on profiling informa- 3.2. Parallel SystemC 29

tion gathered during the elaboration phase of regular SYSTEMC. A similarly message passing based approach is presented by Huang et al. in 2008. In their work, the authors describe the SCD library [77], which implements a master-slave approach to distributed SYSTEMC simulation and relies on a custom Design Space Exploration (DSE) framework [187] for partitioning. Simulation time is managed by the master, control- ling the local time of the slaves via time advancement messages. In ArchSC [69, 70] Hao et al. utilize novel channel primitives to abstract commu- nication between different simulation segments. Using so-called parallel channels the authors achieve fast simulation speeds, especially for gate- and logic-level simulators. On the other side of the spectrum of abstraction, Niaki et al. present a distributed SYSTEMC approach for the modelling of entire systems as communicating processes within the ForSyDe framework [134, 157]. Khaligh et al. propose a similar approach to distributed SYSTEMC [154], but focus on models that can adapt different abstraction levels to better balance performance and accuracy needs [155, 153]. Good performance of distributed simulators not only depends on an optimal load balancing, but also on efficient time synchronisation and data transportation between the different hosts. In that context, Peeters et al. propose a framework for distributed TLM simulators that offers designers the choice between two synchronisation modes, offering either high precision or high throughput [146]. Subsequent research follows the idea of a relaxed synchronisation scheme in favour of improved simulation speed at the cost of temporal inaccuracies. An example for this is presented by Sauer et al.: CoMix [158] is an interface library for the modelling of distributed VPs that uses a loose synchronisation scheme between its peers for optimal simulation performance. That an efficient time synchronisation needs not always come at the cost of a re- duction in simulation accuracy is demonstrated by Schumacher using the distributed TLM framework disSC [164]. In his work the author cleverly uses a priori commu- nication information to hide communication latencies between simulation hosts. A regular lookahead scheme is utilised to allow individual simulators to operate ahead of time. Routing delays within the demonstrated VP are exploited to deduce the ideal point in time to check again for transactions from peers. Outside of that, peers can operate independently. However, a drawback of this approach is the lack of a return path, making it impossible to model errors such as dropped or tainted transactions.

3.2.4 Accelerator Supported Simulation

Besides offloading simulation activity to other processors or hosts, a subset of the related work attempts to utilise coprocessors that are installed into the host system to further accelerate simulation. These coprocessors - or accelerators - include GPUs, FPGAs and in some cases even Application Specific Instruction Set Processors (ASIPs). In order to efficiently use these as accelerators, two main challenges have to be over- come: firstly, the coprocessors usually operate on a different instruction set, which makes a dynamic translation from the typical x86 code or recompilation of the C/C++ code necessary. Secondly, one must carefully decide which simulation processes to execute on the accelerator and which on the host. Control flow dominated code usu- 30 Chapter 3. Related Work ally performs best on the main processor or an dedicated ASIP, while highly parallel workloads can best be accelerated using GPUs and FPGAs. Most workstation PCs today include discrete GPUs, so it comes as no surprise that the majority of related work regarding accelerator supported SYSTEMC simulation targets them. Their foundation is formed by the two major programming libraries CUDA [140, 135] and OpenCL [185, 178]. A comparison between these two in the context of SYSTEMC is provided by Bombieri et al. [23]. SCGPSim [131, 130] by Nanjundappa et al. represents one of the first attempts to accelerate SYSTEMC using GPUs. The authors combine a source-to-source trans- lator with a custom simulation kernel in order to convert C/C++ code of eligible SYSTEMC processes into CUDA code, which is then compiled for and executed on the GPU. While SCGPSim only supports gate level simulators, subsequent work by Sinha et al. [172] extends the fundamental approach to support transaction level simulators and provides heuristics to identify SYSTEMC processes suitable for execution on GPUs. Vinco et al. present SAGA [196], which further improves on SCGPSim by reducing synchronisation events via static scheduling. FPGAs and custom ASIPs appear as another attractive target for offloading sim- ulation processes that run inefficiently on the host processor. Sirowy et al. present an approach, where selected SYSTEMC processes are first compiled into a MIPS-like byte code. This code can then be executed on an FPGA using a soft microproces- sor called accelerator engine [173, 175]. Subsequent work [174] improves on this by introducing heuristics that dynamically move SYSTEMC processes between processor and FPGA at runtime depending on simulation load. A custom ASIP for accelerated execution of SYSTEMC simulators is presented by Ventroux et al. [193]. Up to 64 of these ASIPs can be combined to form the RAVES hardware architecture. Combined with a custom SYSTEMC kernel, which implements a synchronous parallel simulation approach, RAVES achieves significant performance boosts in gate level and synthetic benchmarks. Finally, approaches exists that have ported SYSTEMC entirely to other microarchitec- tures. Kaouane et al. present SysCellC – an implementation for SYSTEMC that exploits the parallel architecture of the IBM Cell processor [94]. Furthermore, the Intel Single- Chip Cloud Computer (SCC) also appears as an attractive hardware platform for par- allel SYSTEMC, due to its 48 distinct physical cores [76]. Roth et al. investigate this and propose several different synchronous [152, 149] and asynchronous [151] frameworks for parallel SYSTEMC on the SCC.

3.2.5 Summary of Parallel SystemC Approaches

Over the past decade numerous works have identified the sequential operation as the central performance bottleneck of SYSTEMC and investigated parallel simulation technologies that can keep pace with the rising complexity of modern systems. De- pending on the intended application, these approaches differ in key aspects, such as their handling of simulation time and state. 3.2. Parallel SystemC 31

Author Year State Time Driver Abstraction References

Meftali et al. 2004 distributed synchronous CPU RTL [116] Cox 2005 distributed asynchronous CPU RTL [41] Chopard et al. 2006 centralised synchronous CPU RTL [37] Viaud et al. 2006 centralised asynchronous CPU transaction [195] Combes et al. 2008 centralised synchronous CPU synthetic [40] Huang et al. 2008 distributed synchronous CPU RTL [77] Kaouane et al. 2008 distributed synchronous Cell synthetic [94] Ezudheen et al. 2009 centralised synchronous CPU synthetic [143] Hao et al. 2009 distributed synchronous CPU synthetic [69, 70] Khaligh et al. 2009 distributed asynchronous CPU transaction [154, 155] Sirowy et al. 2009 centralised synchronous FPGA RTL [173, 175, 174] Krikun et al. 2010 centralised synchronous CPU RTL [104, 103] Schumacher et al. 2010 centralised synchronous CPU RTL [165] Mello et al. 2010 centralised asynchronous CPU transaction [119] Nanjundappa et al. 2010 centralised synchronous GPU RTL [131, 130] Yang et al. 2011 centralised synchronous CPU synthetic [218] Liao et al. 2011 centralised synchronous CPU synthetic [110] Jones 2011 centralised asynchronous CPU transaction [90] Peeters et al. 2011 distributed synchronous CPU transaction [146] Chen et al. 2012 centralised asynchronous CPU system [34, 32, 35] Sinha et al. 2012 centralised synchronous GPU system [172] Vinco et al. 2012 centralised synchronous GPU system [196] Roth et al. 2012 distributed asynchronous SCC transaction [152, 149] Schumacher et al. 2013 centralised synchronous CPU transaction [168, 169] Roth et al. 2013 centralised asynchronous CPU transaction [149, 150] Moy 2013 centralised asynchronous CPU transaction [129, 128] Niaki et al. 2013 distributed synchronous CPU system [134] Chung et al. 2014 centralised synchronous CPU transaction [38] Sauer et al. 2014 distributed asynchronous CPU transaction [158] Ventroux et al. 2014 centralised synchronous ASIP RTL [193] Weinstock et al. 2014 centralised asynchronous CPU transaction [207, 202, 205] Schumacher 2015 distributed asynchronous CPU transaction [164] Ventroux et al. 2016 centralised synchronous CPU transaction [194] Weinstock et al. 2016 centralised asynchronous CPU transaction [204]

Table 3.1: Overview of related work in parallel SYSTEMC simulation 32 Chapter 3. Related Work

Abstraction Explanation

synthetic The simulation does not describe anything meaningful. It is rather intended to test and demonstrate the performance and capabilities of the underlying approach. RTL The simulation describes a subsystem or a single component on Register Transfer Level (RTL) with cycle accuracy. SYSTEMC ports and signals are frequently used. transaction The simulation models an entire system including processors, memories and buses. Communication is largely handled using TLM transactions and sockets. system The simulation describes an entire system or a cluster of sys- tems. This is often modelled as a set of processes using a mix of signals and transactions for communication.

Table 3.2: Explanation of abstraction levels used in Table 3.1

Table 3.1 gives an overview about major entries in this field of research. It lists authors, bibliographic references, approach details and publication year, starting from the early inception of SYSTEMC and ranging until today. The additional details have been chosen to best differentiate the strengths of the individual approaches, while still keeping them comparable. Moreover, they highlight a shift in research, as applications of SYSTEMC changed over time, e.g., with the standardisation of TLM in 2009. The field State indicates whether the simulation resides within a single address space, or whether it can be distributed among multiple processes and hosts. The field Time states whether the approach uses global synchronisation points to manage sim- ulation time, or whether it is allowed to advance asynchronously. The main driving processing element of the simulation is referred to by the field Driver. The abstraction level targeted by the experimental evaluation of the related work is given in the field Abstraction. Possible abstractions are explained in Table 3.2.

3.3 Fast Instruction Set Simulation

Fast simulation speed does not only depend on the performance of the simulation engine, but also on its models. A well designed VP that is employed as a software development and debugging tool is expected to spend a majority of its runtime inside the ISS. Its task is to interpret the binary target instruction and perform the requested operation based on the current state of the processor model. Given the constantly growing complexity of modern embedded software, fast ISS operation is universally desirable for debugging of target binary code. This is true especially for uniprocessor systems, where it can be argued that parallel simulation would be of little benefit. In case of simulators for Multi-Processor Systems-on-Chip (MPSoCs), fast instruction 3.3. Fast Instruction Set Simulation 33 set simulation combines well with parallel simulation approaches by improving per thread performance. Traditional instruction set simulation operates in three steps. First, the instruction is fetched from memory. Then it is decoded according to the target instruction set spec- ification to identify the kind of operation that is to be performed. Once this is known, the instruction can be executed, which usually means alteration of internal model state, such as registers in case of an arithmetic and logic operations, or external state such as memory or I/O registers in case of load and store instructions. Afterwards, the pro- gram counter is advanced and the entire fetch-decode-execute sequence is repeated for the next instruction. While this approach allows a straightforward implementation, it has the draw- back of inefficient execution of program hotspots. Fetching and decoding need to be repeated every time, even if the same instructions have already been handled before, e.g., while executing a program loop. The resulting performance of traditional inter- pretative ISSs is therefore insufficient for deployment in a VP. However, researchers have identified different approaches to speed up instruction set simulation, which will be swiftly presented in the following.

3.3.1 Compiled Simulation

Compiled ISSs shift the overhead of repeatedly fetching and decoding instructions from run time to compile time. Only instruction execution is performed at run time, allowing significantly higher simulation speeds in comparison to interpretative ISSs. However, the binding between target application and ISS forms a natural drawback of this approach: even minor changes in the application require a re-generation of the ISS, increasing turnaround times during software development and debugging. Initial work in compiled simulators was presented by Mills et al. in 1991 [124] and Zivojnovic et al. in 1995 [220, 219]. Subsequent work by Leupers et al. [109] and Braun et al. [25] allows retargetable generation of simulators based on HDL and Architecture Description Language (ADL) processor models. Hybrid simulation [99] is presented by Kraemer et al. as a special kind of compiled simulation. It allows switching between interpreted simulation and direct native ex- ecution at run time. Similar to conventional compiled simulation approaches, hybrid simulators also require a simulator compiler that generates target application specific binaries for execution on the simulation host. This approach was later refined by Gao et al., improving performance estimation while in native mode [56], and Jovic et al., adding support for extensible cores, such as Tensilica Xtensa [93]. Another variant of compiled simulation is proposed by Nohl et al. [138]. In their approach, the authors move compilation of target instructions back to run time. Once a program counter location is encountered for the first time, it gets decoded and a pointer to an appropriate handling function is stored together with pointers to its operands. Whenever the instruction is then encountered again, this information is reused for execution, effectively skipping costly fetch and decode operations. This approach combines the benefits of compiled simulation, i.e. no extra fetch and decode 34 Chapter 3. Related Work steps, with those of traditional interpretative simulation, i.e. the ability to simulate execution of arbitrary target programs without recompilation.

3.3.2 Dynamic Binary Translation Dynamic Binary Translation (DBT) is the de facto standard for fast and accurate in- struction set simulation within the EDA industry. It is used by virtually all relevant embedded software development and design exploration tools, including ARM Fast- Models [8], Synopsys Virtualizer [183], Windriver SIMICS [46] and QEMU [16]. The core idea behind DBT is the translation of target binary code into a sequence of host executable instructions on the fly. While initial translation is costly, efficient reuse of translation results during program hotspots yields significant performance gains compared to traditional interpretative approaches. Initially, DBT was used to accelerate execution of interpreted languages, such as LISP and BASIC [13]. However, with full system simulators like Shade [39] for SPARC and Embra [216] for MIPS, DBT has found its way into mainstream instruction set simulation around the turn of the century. In 2008, Helmstetter et al. presented SimSoC [71, 20], a DBT based ISS for the ARMv6 instruction set that allows straight- forward integration into a SYSTEMC based VP using standard TLM interfaces. Until today, DBT technology has been constantly improved, either by reducing compilation overhead, optimizing generated code or enhancing caching heuristics. For example, Jones et al. propose the use of large translation units for program hot paths [190, 88] and Böhm et al. suggest offloading translation tasks to parallel worker threads [22, 2], improving performance of DBT for multi-core systems. Finally, approaches exist that attempt to improve the timing and functional cor- rectness of DBT simulators. Since regular DBT operates on basic block level, instruc- tion accuracy usually resembles the highest representable precision level. However, research by Schnerr et al. [161] and Böhm et al. [21] attempts to make DBT suitable for scenarios where cycle accuracy is required, e.g., during architectural exploration.

3.4 Synopsis

Over the past decade many traditional parallel and distributed simulation approaches have been applied to SYSTEMC simulation with varying degrees of success. It appears obvious that a VP exposes a different degree of parallelism than a gate level simula- tion and therefore cannot be expected to perform equally well within a synchronous simulation environment. Table 3.1 indicates a trend towards a combination of TLM and asynchronous PDES for optimal performance. It should come at no surprise that the work presented in this thesis follows a similar path. It contributes novel, scal- able parallelization techniques and models specifically tailored for industry-level VPs incorporating high performance ISSs, as those have been unheeded before. Chapter 4

Target Platforms

A major differentiator between the related work presented in the previous chapter and the technologies outlined in this thesis is the ability to support realistic VPs, such as they are used within the industry today. In this context, a VP is considered realistic, when it can efficiently be used as a tool to test and debug target multi-core software. At least, this requires the presence of multiple ISSs, transaction level models for buses and memories, as well as a supporting front-end toolset (c.f. Section 1.4.1). This chapter outlines three VPs that fulfil these requirements. First, the EURETILE platform is presented in Section 4.1, followed by the GEMSCLAIM platform in Sec- tion 4.2. Both platforms have been developed in the context of publicly funded re- search projects that aim to introduce new and innovative architectural features. The intended project goals are described in the corresponding sections as well. The final platform presented in Section 4.3 resembles an SMP multi-core system based on the OpenRISC architecture. It is sufficiently detailed and fast enough to support operation of the Linux operating system kernel, further substantiating the claim for realism.

4.1 The EURETILE Platform

The European Reference Tiled Architecture Experiment (EURETILE) project [144, 145] started in January 2010 with the goal of investigating massively parallel tiled archi- tectures as a driver for brain-inspired computing. The project addresses scientific and industrial applications, which are expected to require incessantly growing com- putational, memory and communication resources. To cope with these challenges in an energy efficient way, EURETILE adopts strategies from nature. Applications are represented using a hierarchical network of processes, similar to the anatomical hi- erarchical layers of the human brain: neural columns, cortical areas and neocortex. The resulting programs naturally expose a high degree of parallelism without a de- pendency on centralised state. Consequently, tiled many-core platforms appear as an ideal execution environment, given their inherent scalability and power efficiency. Two of these many-tiled platforms have been developed in the course of the project to assess the applicability and efficacy of the proposed approach. One represents high-performance computing and is based on the x86 architecture, while the other one addresses embedded applications and employs a custom Reduced Instruction Set Computer (RISC) core. For the remainder of this work, only the embedded variant will be considered. Its virtual representation is called Virtual EURETILE Platform (VEP) and its structure and models are described in Section 4.1.1. The software stack and corresponding toolchain is presented in Section 4.1.2.

35 36 Chapter 4. Target Platforms

Network Channel TLM Connection

Interrupt Signal Tile Txyz

T011 T111 T211 T311 TIMER AIC

T001 T101 T201 T301 UART RISC

T010 T110 T210 T310 MEM DNP

T000 T100 T200 T300

Figure 4.1: The EURETILE platform in a 4 × 2 × 2 configuration

4.1.1 Virtual EURETILE Platform

The VEP consists of a configurable number of computational tiles that are connected in a 3D grid. Each tile has bidirectional network connections to its six direct neigh- bours in 3D space (i.e., left, right, front, back, top and bottom). Edge tile connections are wrapped around to the opposite side, effectively forming a torus network. The resulting structure is illustrated in the left hand side of Figure 4.1. Conceptually, a total of 64 × 64 × 64 tiles can be supported. However, due to host memory limitations, the maximum number of tiles is currently constrained to 8 × 6 × 4, i.e., 192 tiles. At the core of each computational tile is a custom RISC core, built using LISA [31, 75] and Synopsys Processor Designer [139]. It resembles a 32 bit Harvard architecture and is based on a five stage pipeline with support for bypassing and interlocking. Furthermore, it features auto-increment addressing, conditional execution and inter- rupt support. The employed RISC model operates at cycle accuracy. It is connected to local memory and peripheral components using a memory mapped . A dedicated Distributed Network Processor (DNP) is used to manage connections and transfer data to peer and remote tiles [3]. It handles fragmentation and routing of messages towards their destination, even if it is multiple hops away in the 3D grid. The remainder of the tile is assembled out of inexpensive off-the-shelf compo- nents, such as an UART and a system timer. Interrupt management between these peripheral components, RISC core and DNP managed by an ARM Advanced Inter- rupt Controller (AIC) with support for nested interrupts [10]. The entire design of a computational tile is depicted in the right hand side of Figure 4.1. Load, store and fetch operation of the RISC core can either be handled using memory pointers or transactions. All other communication is always handled using TLM transactions. Furthermore, the VEP is designed to operate in a deterministic fashion, independent of the chosen communication method. This is also supported by the results of the SYSTEMC determinism testing tool SCandal [166, 167, 206]. 4.1. The EURETILE Platform 37

Distributed Compilation EURETILE Operating Application and applications System Layer Tools Generation Linking

per-tile Binary

Process Networks

Operating System

Firmware Drivers feedback performance

Virtual EURETILE Platform

Figure 4.2: The EURETILE software generation toolchain

The EURETILE toolchain incorporates the VEP not only for testing and debug- ging, but also for reporting feedback on application performance back to the mapping and software synthesis tools. To that extent, the VEP can also be used to simulate persistent or transient hardware errors, as they can be expected to occur for such a massively tiled platform. By means of fault injection, various fallback and recovery strategies of the software stack can thus be tested and examined.

4.1.2 EURETILE Software and Toolchain EURETILE uses Kahn Process Networks (KPNs) as its model of computation [162]. All applications must first be represented as a set of independent processes that com- municate with each other exclusively using First In – First Out (FIFO) channels. Dis- tributed application layer tools [163] are used at the next step of the software gen- eration toolchain. They identify feasible mapping candidates for all processes of all concurrent applications. Subsequently, the per-tile DNA-OS [68] is readied based on the outputs of the previous stage (i.e. number of processes assigned to a tile). The resulting code is then compiled using the regular compilers and linked with hardware specific software, such as boot code and device drivers. Once executable binaries have been created for each tile in the system, the VEP is invoked in order to gather performance data via simulation. This data can then be feed back to the mapping tools in order to improve process mapping or strengthen resilience. The whole procedure is illustrated in Figure 4.2. Using the EURETILE methodology, a distributed two-dimensional Fast-Fourier Transform (FFT) application for 32 data points was created. The resulting application consists of 82 processes, including a generator for data initialisation, a consumer for data collection and output, as well as 80 processes that correspond to the FFT stages 38 Chapter 4. Target Platforms and butterfly computations. To handle inter tile communication, one extra task needs to be added per tile, amounting to a total of 98 tasks in a typical 4 × 2 × 2 mesh configuration. In subsequent chapters, this application is referred to just as FFT. A second application for the VEP was created only by using bare metal software, such as firmware and device drivers for the interrupt controller, DNP and system timer. Its purpose is to execute typical network traffic patterns like scatter, exchange and gather at a high frequency. Thereby, it effectively acts as a DNP and driver stress test thanks to its high ratio of communication to computation. Subsequent chapters refer to this application as Presto.

4.2 The GEMSCLAIM Platform

The Greener Mobile Systems by Cross Layer Integrated Energy Management (GEM- SCLAIM) project [36] operated from September 2012 until August 2015. It addresses the problem of the rising power and energy consumption in contemporary embed- ded HMP systems and proposes techniques and optimisations to limit power intake and reduce overall energy consumption. To address these issues, the project takes a cross-layer approach, ranging from an energy aware and optimising compiler, run- time system and OS down to hardware. It thereby introduces novel optimisation techniques that are designed to aid programmers to create more efficient software. The GEMSCLAIM compiler and runtime support source code annotations that al- low programmers to define optimisation goals, such as high performance or low en- ergy consumption [92, 91]. Moreover, it supports adaptive task granularity control in order to optimise system performance and power intake dynamically at runtime [188]. GEMSCLAIM employs a custom embedded OS kernel that is capable of energy- optimised task mapping and scheduling on HMP systems [59]. It provides energy aware resource management via service level agreements based on component level energy monitoring facilities that are directly built into the hardware. Compiler, runtime and OS rely on load, energy and power data, which is pro- vided by a distributed network of hardware sensors. Such sensors are attached to processors, buses, memories and other major components within the platform. They are administrated by a central Load and Energy Monitor (LEM) [114], which also provides a streamlined API for the OS to access component level sensor data. The project follows the principles of HW/SW codesign: to allow an early start for development and testing of the software stack, a GEMSCLAIM Virtual Platform (GVP) has been created that mimics the behaviour of the actual hardware. Furthermore, it offers a set of statically configurable parameters suitable for early design space exploration. Besides others, these parameters include the number, type and clocks of processors as well as the size of global and local memories. The GVP is presented in Section 4.2.1 in detail. After that, Section 4.2.2 introduces the GEMSCLAIM tools and software environment and concludes with a set of benchmarking applications. 4.2. The GEMSCLAIM Platform 39

IMEM IMEM IMEM IMEM

RISC VLIW RISC VLIW

SCPD UART SCPD UART SCPD UART SCPD UART

Load & Energy Sensor MEM SYS TIMER LEM TLM Connection Interrupt Signal

Figure 4.3: The GEMSCLAIM platform using two RISCs and VLIWs

4.2.1 GEMSCLAIM Virtual Platform

The GVP is depicted in Figure 4.3. It incorporates a configurable number of RISC and Very-Long Instruction Word (VLIW) cores. Two different types of processors have been selected to demonstrate the applicability of the proposed approach in HMP systems as they are frequently encountered today, such as ARM big.LITTLE [7, 210], for example. Furthermore, this design choice also allows to widen the spectrum of potential optimisations on the compiler and OS level. The RISC core operates on a mixed 16/32 bit instruction set and is based on a five stage pipeline with support for bypassing and interlocking. The VLIW has four execution units, each operating on 32 bit instructions, resulting in 128 bit instruction words. It is also based on a fully bypassed five stage pipeline. Both cores support interrupts and offer exclusive memory access via an atomic swap operation. They are built using LISA [31, 75] and Synopsys Processor Designer [139]. As shown in Figure 4.3, the processor cores are organised in clusters. Each clus- ter holds one processor (either RISC or VLIW) with dedicated instruction memory (IMEM) and a fast local scratchpad memory (SCPD). Both memories have a default size of 16 MiB. Each core further has access to its own UART for external I/O. Clusters are connected to the rest of the system using an uplink to a shared system bus. This bus hosts a global shared data memory (MEM), a system timer, the LEM [114] and a system configuration device (SYS). The latter is used by the processors to determine at runtime the number of present cores and their individual type and core id. 40 Chapter 4. Target Platforms

The LEM sensor network is also fully modelled in the GVP using TLM. Load, en- ergy and power sensors are attached to various components in the system and connect to the LEM using a NoC like interconnect. These sensors are polled at regular inter- vals by the LEM, which aggregates this data and provides it to the OS using a register interface in combination with a 1KiB internal memory. While load from the indi- vidual components can be derived from incoming and outgoing transactions, power and energy data cannot simply be measured in a VP. Instead, various approaches from the field of ESL power estimation [170, 171, 199] are used to derive this data for processors, memories and buses. For example, the model used for estimating power and energy consumption of a Dynamic Random Access Memory (DRAM) block during a measurement quantum with duration tquantum is shown in Equations 4.1 and 4.2.

PDRAM = nread · Pread + nwrite · Pwrite + Pleak (4.1)

EDRAM = PDRAM · tquantum (4.2)

The model counts the number of reads nread and writes nwrite serviced by the memory during a measurement quantum. To compute power and energy estimations, it further relies on calibration coefficients for DRAM power intake during operation Pread and Pwrite, as well as static leakage power Pleak. These have been measured on a Zynq 7020 development board and are listed in Equation 4.3.

Pread = 1050mW Pwrite = 1005mW Pleak = 585mW (4.3) Power and energy estimation for the processor cores work on instruction level and are based on previous work by Wang et al. [199, 28]. During an offline calibration, average power and energy consumption for each instruction type is measured. Using this information, the processors can report their own average power and accumulated energy by keeping track of all instructions executed during a measurement quantum. The GVP features two modes of operation: full and fast simulation. In full simu- lation mode, all memory operations of the processors are performed using transac- tions. While this results in a reduced simulation performance, it allows to accurately capture the number of operations serviced by the individual components (nread and nwrite). This is a fundamental requirement of the linear power model described in Equation 4.1. The full simulation mode is therefore required when assessing energy optimisation features of the compiler, runtime and OS using the GVP. In fast simulation mode, all fetch, load and store operations of the processors are abstracted using memory pointers, as long as they are not directed towards I/O registers. This allows optimal simulation performance, but disables capturing of com- ponent utilisation data. This mode is therefore best suited for debugging scenarios, where applications are iteratively executed in quick succession and accurate sensor readings are not required. 4.3. The OpenRISC Platform 41

4.2.2 GEMSCLAIM Software Environment

Software development for the GEMSCLAIM platform is done in plain C. Functions can be annotated using compiler directives stating optimisation goals, such as short runtime or low power consumption. These directives are then acted upon by the INSIEME source-to-source compiler [92, 91, 188]. The transformed code is then passed to the backend compilers, which were created using Synopsys Compiler Designer [31, 75, 139]. The resulting object file is then linked together with the OS kernel and the INSIEME runtime library to produce an executable binary. The compilation procedure must be executed twice, once for the RISC and once for the VLIW processors. Using this tool flow, the following two applications have been created. They are intended to be used as benchmarks when assessing simulation performance of the GVP in a typical use-case scenario.

• LEM driver stress test: This application lets both the RISC and the VLIW cores execute computationally intensive synthetic processes, while continuously requesting LEM energy data. Once each process has depleted its energy budget, the application is terminated using a LEM interrupt.

• Ocean current simulator: This is a port of the SPLASH-2 [217] ocean program for the GEMSCLAIM platform, which predicts current changes within a cubic kilometre of ocean over time. To that end, it makes heavy use of the local scratchpad memories as well as the multi-core synchronisation primitives provided by the OS, such as barriers and spinlocks.

4.3 The OpenRISC Platform

The goal of the OpenRISC project [184] is to develop a flexible and powerful RISC Instruction Set Architecture (ISA) under the free GPL license1. Supported by such a license, several open-source Verilog implementations of the OpenRISC ISA have been created and are publicly available [97, 105]. Softcore designs of OpenRISC have been used in Samsung TV devices [17] and NASA satellites [102, 17]. However, as of the time of writing, no OpenRISC silicon has been manufactured. OpenRISC was traditionally designed as a unicore system. Consequently, the reference System-on-Chip (SoC) implementation ORPSoC [96] only features a single processor. Multi-core platforms first required extensions of the original ISA. In 2012, a revision introduced core identification registers and support for optimistic exclusive memory access using load-linked/store-conditional operations [87]. In combination with a new multi-core enabled interrupt controller and tick timer, the first OpenRISC MPSoC became possible. A corresponding VP is presented in Section 4.3.1, with the accompanying software stack being outlined in Section 4.3.2.

1 GNU General Public License (GPL) – http://www.gnu.org/licenses/gpl.html


Figure 4.4: OpenRISC MPSoC in a quad-core configuration

4.3.1 OpenRISC Virtual Platform

Figure 4.4 depicts the OpenRISC Virtual Platform (ORVP) in a quad-core configuration. The processors are connected to shared global memory using a memory mapped bus, forming a regular SMP system. Besides the 128 MiB system memory (RAM), a smaller 1 MiB read-only memory (ROM) is present, which can be used for firmware or device tree data. Naturally, memory size, address map and number of processors can be configured via a configuration file. Two options exist for the ISSs that model the processors. Firstly, there is the reference ISS or1ksim [18], which has been wrapped into a SYSTEMC module for use in a VP. Secondly, an in-house ISS called or1kiss was developed that achieves higher simulation performance thanks to just-in-time cache-compiled simulation technology [138]. Both models have been verified to behave identically by means of instruction tracing. Inter-processor communication is realised using the Multi-Processor Interrupt Controller (MPIC). With the MPIC, the processors can send wakeup interrupts to idle cores, e.g., to perform load balancing on the OS level. Interrupts from peripheral devices are always broadcast to all processors, which individually have to mask interrupts they are not interested in. The decision which processor is responsible for handling which interrupt is made statically by the OpenRISC firmware or OS kernel. The ORVP contains a plethora of virtual I/O components to allow flexible access to the system and interaction between the virtual and host environment. A virtual Ethernet device (ETH) interfaces with the host using TUN/TAP [101, 100] and thereby enables local network and Internet access from within the virtual environment. Support for persistent storage is realised using Multimedia Cards (MMCs) connected via a virtual Serial Peripheral Interface (SPI) bridge. Video output is possible via a framebuffer device (FB). Finally, a UART is available for serial I/O.

All peripheral components were modelled after real IP blocks that are used in the industry today. On the one hand, this strengthens the claim for realism of the ORVP. On the other hand, it also allows reusing device drivers from the official Linux kernel sources and therefore eases the software development effort. Details about the IP blocks and corresponding models can be found in Appendix B. The benefits of the ORVP are twofold. Firstly, each core supports source level debugging via rGDB [58] to facilitate debugging of hardware-dependent software. In fact, thanks to the ORVP, a critical bug in the virtual memory and cache synchronisation subsystems of the OpenRISC Linux port has already been identified and fixed.2 Secondly, in the absence of silicon, OpenRISC multi-core platforms have always been constrained to single or dual core configurations due to the size of the employed FPGA. Using the ORVP, the number of processors is freely selectable, thereby allowing assessment of the scalability of parallel OpenRISC programs.

4.3.2 OpenRISC Software Environment

The ORVP was designed as a typical embedded multi-core platform running the Linux OS kernel [186, 184]. Consequently, two compiler toolchains are required: one for building bare-metal software and the other for user space applications running on Linux. For the bare-metal case and kernel builds, GCC 4.9.2 [62] is used, in combination with Newlib [197] as the supporting C runtime library. The Linux toolchain is based on GCC 4.9.1 and the MUSL C library [48]. Both toolchains support modern programming standards such as C11 [64] and C++14 [61]. The ORVP runs an OpenRISC port of the Linux kernel version 4.3.0 [106] with SMP support for up to 32 cores. User space binaries are based on BusyBox 1.23.2 [198] and provide access to common UNIX command line utilities, such as ls, cd, mkdir and so on. Linux applications may further take advantage of the framebuffer to output images or video via the Simple DirectMedia Layer (SDL), version 1.2 [107]. Finally, the ORVP runs a Telnet service, which allows login to the platform via the network. Thanks to the Linux platform, the ORVP has access to a wide range of applications that can be used for benchmarking without porting overhead. For this work, a set of 16 benchmarks has been selected, including well-known industry applications, such as coremark and dhrystone. Table 4.1 enumerates the chosen benchmarks including their input parameters, split into sequential and parallel workloads. The set of parallel benchmarks consists of applications from the SPLASH-2 suite [217] and parallel versions of the programs coremark and dhrystone. All of them have been configured to utilise four threads, while the sequential workloads are single-threaded. Special care needs to be taken for the benchmarking procedure, since boot time and time spent typing commands must not be included, rendering a plain measurement of the runtime of the VP insufficient. To only measure the wall-clock execution time of the benchmark application, an augmented version of the Linux time command is employed. It uses OpenRISC semihosting instructions to retrieve the current host time in addition to the simulation time of SYSTEMC.

2 see OpenRISC Linux repository – https://github.com/openrisc/linux/pull/6

Benchmark        Type        Problem Size & Program Input
boot             sequential  Kernel boot from reset to login
sleep            sequential  System idle phase for 10 s
fibonacci        sequential  Calculate fib(10000)
dhrystone        sequential  1000000 iterations
dhrystonex4      parallel    Four dhrystone instances
coremark         parallel    Uses pthread backend
mandelbrot       parallel    4 threads, itermax = 1000
barnes           parallel    16384 particles
fmm              parallel    256 particles
ocean-cp         parallel    258 × 258 grid size
ocean-ncp        parallel    258 × 258 grid size
radiosity        parallel    -batch -room
raytrace         parallel    teapot mesh
volrend          parallel    head-scaleddown4 volume
water-nsquared   parallel    512 molecules
water-spatial    parallel    512 molecules

Table 4.1: Benchmark programs for ORVP including problem size

4.4 Synopsis

This chapter has presented three different VPs, modelling platforms that range in type from experimental (VEP) over research (GVP) to contemporary (ORVP). All of them allow development, testing and debugging of target-specific software and enable design space exploration on the system level. VEP and GVP have been used for HW/SW codesign throughout the course of their corresponding research projects. Although the presented VPs have not been used outside of an academic environment, their capabilities still render them sufficiently comparable to VPs deployed by the industry today. The parallel SYSTEMC simulation techniques discussed in the following chapters are benchmarked using at least one of the VPs presented here.
It should be noted that none of the simulation models employed within the presented VPs have been written in a thread-safe way, i.e., access to shared state from different SYSTEMC processes has not been protected with mutexes. Although the source code of all simulation models is available, its modification is not permitted. This keeps the comparison fair and upholds the claim to realism, since most IP blocks from the industry are only provided in closed form and can thus not easily be modified for protection against such races.

Chapter 5

Time-Decoupled Parallel SystemC

When attempting parallelization of a SYSTEMC based VP, developers face a wide variety of possible approaches. First, it must be decided whether the parallelization should take place at the SLDL level at all. If most of the simulation time is consumed inside a single model, it may be more efficient to attempt parallelization of its internal algorithms instead. Otherwise, i.e., if the simulation load is balanced evenly among all participating models, parallel simulation of the employed SLDL appears feasible. Chapter 3 identifies parallel simulation at the SLDL level, i.e., the parallelization of the simulation kernel, as an attractive approach [34, 90, 103, 131, 165, 169, 172, 193]. Construction of a parallel SYSTEMC kernel is no easy task, given its inherent complexity and the legacy APIs that need to be supported in order to be compliant with the standard [80]. A frequent choice among researchers is therefore to extend the existing OSCI kernel implementation with discrete features or modelling primitives that enable parallel simulation. A well known example of such an approach is PARSC [165], which achieves synchronous parallel simulation by executing SYSTEMC processes on multiple host threads when they are active during the same delta cycle. However, asynchronous simulation approaches appear more difficult to realise using the OSCI SYSTEMC kernel as a foundation. Asynchronous approaches need not only be concerned with functional errors caused by race conditions, but also with temporal errors, which emerge as a result of the asynchronicity itself. Since the OSCI implementation lacks the facilities to detect and prevent those issues, it was decided to write a new SYSTEMC kernel from scratch instead. The result – the parallel SYSTEMC kernel SCOPE [207] – is presented hereinafter. First, Section 5.1 outlines the fundamental structure of SCOPE, including a description of the synchronisation mechanism and load distribution strategy. Communication primitives for thread-safe and temporally correct communication between asynchronous parts of the simulation are introduced in Section 5.2. Section 5.3 evaluates performance gains using SCOPE in the context of the Virtual EURETILE Platform. The chapter concludes with a discussion of current limitations of SCOPE in Section 5.4 and a summary in Section 5.5.

5.1 Asynchronous SystemC Simulation

In the past, synchronous parallel SYSTEMC approaches have suffered from a lack of parallelizable simulation activity during any given delta cycle. Consequently, the idea was formed to parallelize beyond this delta cycle limit. Grouping simulation activity from multiple subsequent delta cycles is expected to yield enough parallelizable workload to offset temporal and functional synchronisation overhead. Accordingly, SCOPE implements such an asynchronous parallel simulation approach.


Figure 5.1: Performance of synchronous and asynchronous simulation approaches for two ISSs with fluctuating execution times

The benefits of an asynchronous approach are further exemplified by Figure 5.1. It illustrates how different simulation kernels would schedule the execution of two ISS models with fluctuating execution times, and the resulting performance gains. During the first delta cycle (I), both ISSs are active and can run in parallel using synchronous as well as asynchronous approaches. However, the second and third delta cycles depict a problem often encountered in SYSTEMC. For reasons outside of the control of the kernel, simulation of the second ISS is delayed to the next cycle, e.g., due to a wait call in a bus interconnect model. This forces a serialisation for the synchronously operating kernel, since it cannot parallelize beyond the delta cycle boundary. Asynchronous approaches do not have this problem and are free to issue the simulation of both models in parallel, as long as temporal correctness can be guaranteed. During the following two delta cycles (IV and V), the execution time of the ISSs starts to fluctuate heavily, potentially caused by a change in the instruction mix. The delta cycle execution time of the synchronous approach is limited by the slowest ISS, while the asynchronous one does not need to wait and is free to continue with the next cycle. Note that delta cycles in the asynchronous approach exist per thread and can no longer be compared to the cycles of sequential or synchronous kernels. Overall, the asynchronous approach performs better than the synchronous one in the depicted scenario, since it offers higher flexibility and lower synchronisation overhead.

Interestingly, most VPs already include asynchronous behaviour in the form of TLM quanta: instead of synchronising with simulation time after every cycle, a processor model usually executes thousands of cycles in a chunk and only synchronises with SYSTEMC time afterwards. The amount of time a model is allowed to run ahead asynchronously is called a quantum. Unfortunately, each model decides for itself during which delta cycle its quantum will execute. This likely causes the potentially parallelizable simulation activity to be spread over multiple cycles and renders most synchronous approaches futile, just as described in the previous example of Figure 5.1.
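As an aside, the sketch below shows how a processor model typically uses the standard TLM-2.0 quantum keeper utility to run ahead of SystemC time; the module and the 10 ns per-instruction delay are placeholder assumptions, not taken from the VPs discussed in this work.

#include <systemc>
#include <tlm>
#include <tlm_utils/tlm_quantumkeeper.h>

// Illustrative processor model that only synchronises with SystemC time
// once its quantum (here: 1 us) has been used up.
struct CpuModel : sc_core::sc_module {
    tlm_utils::tlm_quantumkeeper qk;

    SC_CTOR(CpuModel) {
        tlm::tlm_global_quantum::instance().set(sc_core::sc_time(1, sc_core::SC_US));
        qk.reset();
        SC_THREAD(run);
    }

    void run() {
        for (;;) {
            // ... simulate one instruction, assumed to cost 10 ns ...
            qk.inc(sc_core::sc_time(10, sc_core::SC_NS));
            if (qk.need_sync())
                qk.sync(); // wait() until the accumulated local offset has elapsed
        }
    }
};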

5.1.1 Simulation Loop and Context

Fundamentally, SCOPE employs a static and user-configurable number of threads N that operate within a shared memory address space. Each thread i executes its own simulation loop and maintains its own state, such as its local simulation time ti. These threads are loosely synchronised: the local times of the threads do not need to be kept synchronous, but their maximum difference is limited to a constant user-definable lookahead ∆tla. Therefore, each thread is only allowed to simulate ahead in time up to a limit timestamp tlim,i as defined by Equation 5.1. This is also referred to as the lookahead constraint hereinafter.

t_i < t_{lim,i} = \min_{0 \le j < N}\left(t_j + \Delta t_{la}\right) = \min_{0 \le j < N}(t_j) + \Delta t_{la}, \qquad 0 \le i < N \qquad (5.1)


Figure 5.2: SCOPE parallel simulation loop with extended notification phase
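A minimal sketch of a per-thread loop enforcing the lookahead constraint of Equation 5.1 is shown below; the data structures, the spin-wait and all names are illustrative assumptions, not the actual SCOPE implementation.

#include <algorithm>
#include <array>
#include <atomic>
#include <cstdint>
#include <thread>

constexpr std::size_t N = 4;            // number of simulation threads (assumed)
constexpr std::uint64_t DT_LA = 200;    // lookahead dt_la in ns (assumed)

std::array<std::atomic<std::uint64_t>, N> local_time{}; // t_i of every thread

// t_lim,i = min_j(t_j) + dt_la  (Equation 5.1)
std::uint64_t limit_time() {
    std::uint64_t t_min = UINT64_MAX;
    for (const auto& t : local_time)
        t_min = std::min(t_min, t.load(std::memory_order_acquire));
    return t_min + DT_LA;
}

void simulation_loop(std::size_t i, const std::atomic<bool>& done) {
    while (!done.load()) {
        if (local_time[i].load() < limit_time()) {
            // evaluate runnable processes, perform updates, handle timed
            // notifications, then advance the local time t_i
            local_time[i].fetch_add(10, std::memory_order_release); // e.g. one cycle
        } else {
            // extended notification phase: wait for slower threads to catch up
            std::this_thread::yield();
        }
    }
}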

5.1.2 Simulator Partitioning

In SYSTEMC, all state is hierarchically organised in modules. Therefore, VP partitioning and the assignment of individual partitions to contexts, and thereby threads, is also done on a per-module basis. Since all events and processes must be instantiated within a module, they similarly inherit the context of their parent and are assigned to the same thread as it. Thread assignment changes are propagated hierarchically: if a module is moved to another thread, its child modules, events and processes are also transferred. Having a module and its children grouped on the same thread avoids race conditions, since processes frequently operate on data of their parent module. Consequently, a module forms a virtual sequential environment for its child processes, in which accesses to shared state need not be explicitly synchronised1. Modules can be manually or automatically assigned to threads. In the manual case, the developer must programmatically set the affinity property am of module m to the desired thread ID. While this is usually done during elaboration, SCOPE also supports setting module affinity at runtime, effectively allowing custom dynamic load balancing algorithms. If changes to VP code are not possible, module affinity can also be specified using a text file that links modules to threads using their full hierarchical SYSTEMC name. All modules that have not been assigned to a thread – either directly using the affinity property or indirectly via a parent module – are automatically assigned to the first thread, creating an initial virtual sequential environment that spans the entire simulator and must be explicitly split up. Defaulting to a safe configuration facilitates porting VPs to SCOPE and helps prevent accidental races.

1 Shared state may still be subject to general races, but not data races – c.f. Section 2.3
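To illustrate why this virtual sequential environment matters, consider the following plain SystemC module (an illustrative example, not taken from the presented VPs): its two processes access a shared member without any locking, which remains safe under SCOPE only because both processes are guaranteed to execute on the module's thread.

#include <systemc>

struct Peripheral : sc_core::sc_module {
    unsigned pending = 0;   // shared, unprotected state

    SC_CTOR(Peripheral) {
        SC_THREAD(producer);
        SC_THREAD(consumer);
    }

    void producer() {
        for (;;) {
            wait(10, sc_core::SC_NS);
            pending++;                 // no mutex required within the module
        }
    }

    void consumer() {
        for (;;) {
            wait(25, sc_core::SC_NS);
            if (pending) pending--;    // same virtual sequential environment
        }
    }
};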

Algorithm 5.1: Dynamic load rebalancing procedure in SCOPE
Data: A list of movable modules M, a list of non-movable modules F and per-thread scheduling queues Qi

1  foreach i ∈ {0, ..., N − 1} do
2      Qi ← ∅                            /* start with empty scheduling queues */
3  end
4  foreach m ∈ F do
5      Qam ← Qam ∪ m                     /* assign non-movable modules first */
6  end
7  sort M by lm descending;              /* assign modules with high load first */
8  foreach m ∈ M do
9      s ← 0;                            /* shortest queue index */
10     lmin ← ∞;                         /* length of shortest queue */
11     for i ∈ {0, ..., N − 1} do        /* search for shortest queue */
12         l ← Σ_{n ∈ Qi} ln;            /* compute length of Qi */
13         if l < lmin then
14             s ← i; lmin ← l;
15         end
16     end
17     Qs ← Qs ∪ m                       /* assign m to the shortest queue */
18 end

For the automatic case, the developer needs to mark a module as movable for it to be considered for initial partitioning and subsequently for load balancing at runtime. During elaboration, movable modules are assigned to threads in a round-robin fashion. Rebalancing happens after fixed user-definable intervals tbalance. It has the goal of minimising thread idle time or, more specifically, the time each thread spends spinning in the extended notification phase. At runtime, the kernel computes the load lm caused by each module m by accumulating the cycles spent executing all of its processes. At the end of each interval tbalance, a rebalancing commences. It is always coordinated by the simulation context of the first thread. The procedure is started once all other threads have completed their previous notification phase and signalled their readiness for rebalancing. Movable modules are redistributed between the scheduling queues Qi of all threads with the goal of an evenly distributed load, also taking into account the load of non-movable modules. This rebalancing procedure is shown in Algorithm 5.1. The initially empty scheduling queues are first populated with modules that have been fixed to a thread using their affinity property am. Afterwards, the assignment of all movable modules commences. The algorithm picks the module with the highest measured load first and assigns it to the shortest queue, i.e., the queue with the least amount of work so far.
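A compact C++ rendering of this greedy assignment is sketched below; module_t and the load bookkeeping are illustrative assumptions rather than SCOPE's actual data structures, and the queue_load vector is expected to be pre-filled with the load of the non-movable modules.

#include <algorithm>
#include <cstdint>
#include <vector>

struct module_t {
    std::uint64_t load;   // l_m: cycles spent executing this module's processes
    unsigned      thread; // assigned scheduling queue / thread index
};

void rebalance(std::vector<module_t*>& movable,
               std::vector<std::uint64_t>& queue_load) {
    // assign modules with high load first
    std::sort(movable.begin(), movable.end(),
              [](const module_t* a, const module_t* b) { return a->load > b->load; });

    for (module_t* m : movable) {
        // pick the scheduling queue with the least accumulated load so far
        auto shortest = std::min_element(queue_load.begin(), queue_load.end());
        m->thread = static_cast<unsigned>(shortest - queue_load.begin());
        *shortest += m->load;
    }
}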

Regardless of whether the automatic or the manual partitioning approach is used, expert knowledge is still required to achieve an even balance and consequently optimal simulation performance. The reason for this is that VP designers need to specify module thread assignments or manually mark selected modules as movable. While further automation appears feasible in this area, integrity of the virtual sequential environment is given precedence, as it protects the simulation from elusive errors caused by data races within model code. Consequently, the decision was made that any offloading of simulation activity to other threads must be explicitly stated. To achieve good performance, a rule of thumb is to offload modules containing ISSs, as those are usually the busiest components within a VP.

5.2 Cross-thread Communication

While the asynchronous simulation approach offers high simulation performance with low synchronisation overhead, it also introduces new challenges. Communication between components residing on different threads becomes especially difficult, since both execute in parallel and in different time zones. Consequently, SCOPE offers a series of modelling primitives that facilitate cross-thread communication by responding to the following three key challenges:

• Functional correctness is threatened whenever non-thread-safe models are accessed from parallel threads. To avoid such situations, it must be ensured that all code responding to cross-thread communication, such as sensitive processes or callback functions, executes in the context of the same virtual sequential environment as the receiver.

• Temporal correctness must be upheld in order to avoid causality errors, such as future transactions affecting past simulation state (c.f. Definition 2.11). By enforcing all cross-thread communication to be stated sufficiently ahead of time, it is guaranteed that the designated arrival timestamp of a message has not yet elapsed in the context of the receiving thread.

• Determinism is achieved by designing the message transmission mechanism in a way that does not depend on the time difference ∆ti,j between two threads. For example, it must be guaranteed that messages are always forwarded to the receiver in the same order in successive simulation runs. This enables reproducible simulator behaviour and facilitates debugging.

In SYSTEMC, communication is handled via dedicated channels such as ports and signals, which are implemented based on regular SYSTEMC events and event lists. SCOPE additionally offers remote events and remote event lists, which are capable of handling notification requests originating from outside the virtual sequential environment, i.e., from another thread. For a typical VP, TLM sockets and interfaces must also be considered. SCOPE focuses on the TLM blocking transport interface and augments regular TLM simple target sockets with cross-thread communication capabilities.

5.2.1 Remote Events

Remote events allow remote notification and thus enable threads to trigger the execution of SYSTEMC processes on other threads in a thread-safe and temporally correct manner. However, a remote event can only trigger the execution of processes assigned to the same thread as the event itself. Remote sensitivity, i.e., assigning remote processes for execution upon local event triggers, is not allowed. Since remote events may receive multiple notification requests at the same time from their own thread and any number of other threads, the decision when to trigger requires careful consideration. Unlike with regular events, the trigger decision for remote events cannot be made ad-hoc. When a remote event is scheduled to trigger at some point in time, the possibility of a future cancellation by another thread must be taken into consideration, since that other thread might be lagging behind in simulation time. Consider the following example: a remote event e has been assigned to thread i. Thread j issues a remote notification at tj < ti and thread i later requests e to be cancelled again at ti. Since the cancellation is issued at a later point in simulation time, the expected result is that the event does not trigger. However, since time decoupling is active, it might happen that the cancellation is seen by e before the notification, if ti is reached before tj in real time. Since it has not yet been notified, e would simply ignore the cancellation in this case. Finally, to the surprise of the user, the subsequent notification would erroneously go through and cause all sensitive processes to execute. In order to correctly handle multiple notification and cancellation requests originating from different time zones at varying points in real time, remote events need to keep a history of the requests made to them. The decision whether to actually trigger the event is postponed to the trigger time te, when all requests must have been received. Such requests are more formally introduced in Definition 5.1 and form the basis for the subsequently presented trigger decision algorithm.

Definition 5.1 A remote request re ∈ {notify, cancel} of an event e symbolises a call for action from a thread i. A notification request re triggers e at tr = ti + ∆tnotify. A cancellation request re invalidates any other notification qe with tq ≤ tr = ti + ∆tla.

Requests consist of either notification or cancellation actions and timing-specific data, such as the desired trigger time tr = ti + ∆tnotify. Whenever a remote event e receives a request re, it is added to its request history He. Furthermore, if the request is a notification request, the event is also scheduled to trigger at te = tr. Note that tr is not based on the local timestamp but rather on that of the remote thread that has issued the request. Upon reaching te in local time, a trigger decision algorithm is executed to decide whether to actually schedule sensitive processes for execution. Notifications can be aborted using cancellation requests. Just as notification requests, cancellation requests must be stated sufficiently ahead of time to take effect. A cancellation request re causes all notification requests with an earlier timestamp to be ignored. While the event will still be triggered in this case, no sensitive processes will be scheduled for execution. Only notification requests that have not been cancelled or overridden with earlier trigger times actually cause processes to execute.


Figure 5.3: Remote events and remote notification

Possible operations on remote events are illustrated by Figure 5.3. It shows three different application scenarios that need to be taken into account during the final trigger decision algorithm:

(a) A remote notification on an inactive remote event e causes it to trigger on the second thread if stated sufficiently ahead of time, i.e., ∆tnotify ≥ ∆tla.

(b) Remote notification override: an event cannot be notified multiple times. In that case, the notification with the earliest trigger time always wins.

(c) Remote cancellation: a remote event can be returned to an inactive state if cancelled sufficiently ahead of time, e.g., at timestamp tcancel ≤ te − ∆tla.

The trigger decision algorithm for remote events is shown in Algorithm 5.2. Upon every possible trigger time te of remote event e, it is used to determine whether to actually perform the trigger, i.e., schedule processes sensitive to e for execution. To make this decision, the algorithm looks at the history of requests made to the event, starting at the previous time the event was triggered and ending at the highest timestamp that all threads must have reached by now, i.e., tlim = ti − ∆tla (c.f. Equation 5.1). This also implies that all remote requests must be stated sufficiently ahead of time, by at least ∆tla, in order to be considered by this algorithm. First, all requests are ordered by their corresponding tr, with the earliest request appearing at the front of the list. The algorithm then proceeds to iterate over all requests and identifies the timestamp tact at which the event should actually trigger. Initially it is assumed that e does not trigger, so tact is set to ∞ (c.f. line 3 in Algorithm 5.2). When a request is encountered that produces a notification (or an earlier notification if the event is already active), tact is updated accordingly (c.f. lines 8 and 9). Once the algorithm finds a cancel request, it returns e to an inactive state by setting tact back to ∞ (c.f. lines 6 and 7). This procedure is continued until all requests from the relevant history list have been checked. If tact afterwards matches the current local time ti, the event is supposed to trigger and schedules all sensitive processes for execution (c.f. lines 12 – 15). Otherwise no action must be taken, because the event has either been cancelled or is waiting for another trigger at te > ti in the future. Correct application of ahead-of-time notification and cancellation must be enforced by the simulation kernel in order to prevent spurious errors.

Algorithm 5.2: Trigger decision algorithm for remote events

1  Function TRIGGERDECISION(RemoteEvent e)
2      requests ← He[te ... ti − ∆tla];   /* extract relevant request history */
3      tact ← ∞;                          /* assume e is initially inactive */
4      while requests ≠ ∅ do
5          r ← extract earliest from requests;
6          if r is cancel then
7              tact ← ∞;                  /* remote cancellation of e */
8          else if tr < tact then         /* r must be a notify request */
9              tact ← tr;                 /* remote notification (override) of e */
10         end
11     end
12     if tact = ti then
13         RQ ← RQ ∪ Se ∪ De;             /* trigger e (c.f. Definition 2.3) */
14         De ← ∅;
15         te ← ti;
16     end

For timed notifications this task is trivial, since they are always stated ahead of time and therefore already incorporate this concept by design. SCOPE only needs to check whether ∆tnotify ≥ ∆tla and emit an error otherwise. However, delta and immediate notifications according to Definitions 2.4 and 2.5 are incompatible with time decoupling and are therefore disabled for remote events. The case of remote cancellation error checking is more difficult. A cancel request that has been issued too late cannot be detected at te, since the trigger decision algorithm filters out all requests in its recent past up to ti − ∆tla. Consequently, detection must happen the next time the remote event is triggered. An error message is generated if a cancel request is found within the request history between te − ∆tla and te, the last time e was triggered. Because an event may remain inactive after it has been triggered once, not all incorrect ahead-of-time cancellations can be identified this way. Therefore, a final detection attempt must be performed in the destructor of the event at the end of the simulation.
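The core of Algorithm 5.2 can be expressed compactly in C++ as sketched below; the request and history types are assumptions made for illustration and do not reflect SCOPE's actual classes.

#include <systemc>
#include <vector>

enum class req_kind { NOTIFY, CANCEL };

struct remote_request {
    req_kind         kind;
    sc_core::sc_time time;   // requested trigger time t_r (sender time base)
};

// 'history' holds all requests with timestamps in [t_e, t_i - dt_la],
// ordered by ascending request time.
bool should_trigger(const std::vector<remote_request>& history,
                    const sc_core::sc_time& t_local) {
    sc_core::sc_time t_act = sc_core::sc_max_time(); // stands in for "infinity"
    for (const remote_request& r : history) {
        if (r.kind == req_kind::CANCEL)
            t_act = sc_core::sc_max_time();          // cancellation resets the event
        else if (r.time < t_act)
            t_act = r.time;                          // earlier notification wins
    }
    return t_act == t_local; // trigger only if the surviving notification
                             // matches the current local time t_i
}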

5.2.2 Remote Event Queues

Remote event queues are fundamentally similar to remote events and to regular SYSTEMC event queues. They are used to trigger the execution of simulation processes at distinct points in time from a remote thread. It is the task of the remote event queue to handle these remote notifications deterministically and in a temporally and functionally correct fashion. Consequently, remote sensitivity is not supported. Remote event queues differ from remote events in their ability to be notified multiple times without overriding previous notifications. Regular cancel operations are removed, since it is semantically unclear which notification is supposed to be cancelled. In their place, a new cancel-all operation is introduced.

Figure 5.4: Remote event queue notifications and cancellations

If a remote event queue receives a cancel-all request, it must invalidate all pending notifications and return to an inactive state. Because of the cancel-all operation, the order in which notification and cancellation requests are made to the queue is relevant. For example, care must be taken that a delayed cancel request does not invalidate existing notifications originating from a thread that operates ahead of local time. Figure 5.4 illustrates two such application scenarios that the trigger decision algorithm for remote event queues must be prepared to handle:

(a) Multiple remote notifications: unless explicitly cancelled, every notification request should result in the remote event queue triggering and sensitive processes being scheduled for execution if ∆tnotify ≥ ∆tla.

(b) Remote cancellation requests must ensure that all previous (in the context of simulation time, not real time) notification requests are ignored, even if they have not yet been stated due to the time decoupling.

Remote event queues can reuse request objects according to Definition 5.1. For every notification and cancellation request made to a remote event queue q, a corresponding object rq is inserted into the history Hq. If the request is a notification request, the event queue is also scheduled to trigger at tq = tr. The history Hq forms the basis of the trigger decision algorithm displayed in Algorithm 5.3, which is run at each trigger timestamp tq. It is the task of this algorithm to identify whether the request that caused q to trigger has been invalidated by a preceding cancellation request. Similar to how it is done for remote events in Algorithm 5.2, this algorithm also checks the relevant recent history (c.f. line 2). This includes all requests stated between the previous notification request at tq − ∆tla and the latest possible time a request could have been received at ti − ∆tla. The algorithm then continues by iterating over all requests, thereby going backward in time until all requests have been checked (c.f. lines 3 and 4). Should it encounter a cancel request in the recent history, the remote event queue must not be triggered and the routine is aborted (c.f. lines 5 and 6). Otherwise the algorithm proceeds to trigger the queue by scheduling all sensitive processes for execution according to Definition 2.3 and updating its own trigger time tq (c.f. lines 9–11).

Algorithm 5.3: Trigger decision algorithm for remote event queues

1  Function TRIGGERDECISION(RemoteEventQueue q)
2      requests ← Hq[tq − ∆tla ... ti − ∆tla];
3      while requests ≠ ∅ do
4          r ← extract latest from requests;
5          if r is cancel then
6              return;                   /* preceding cancel found, abort trigger of q */
7          end
8      end
9      RQ ← RQ ∪ Sq ∪ Dq;                /* no cancel found, perform trigger of q */
10     Dq ← ∅;
11     tq ← ti;

Error detection for requests that are not stated sufficiently ahead of time is handled in a similar fashion as for remote events. Remote notifications are considered erroneous if ∆tnotify < ∆tla and are identified at run time when the request is stated. A cancel-all request r is considered erroneous if stated too close to tq, i.e., if r ∈ Hq[tq − ∆tla ... tq]. Erroneously stated cancel-all requests must be detected during the trigger and at the end of the simulation in the remote event queue destructor.
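The trigger decision of Algorithm 5.3 reduces to a short check in C++, sketched below; the request type is an assumption, analogous to the remote event sketch above.

#include <vector>

enum class queue_req { NOTIFY, CANCEL_ALL };

// 'window' holds all requests stated in [t_q - dt_la, t_i - dt_la]
bool queue_should_trigger(const std::vector<queue_req>& window) {
    // walk backward in time; any cancel-all request suppresses the trigger
    for (auto it = window.rbegin(); it != window.rend(); ++it)
        if (*it == queue_req::CANCEL_ALL)
            return false;   // preceding cancel-all found, abort trigger
    return true;            // no cancel-all found, perform the trigger
}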

5.2.3 Blocking Transport Interface

The previously introduced modelling primitives remote event and remote event queue allow components on different threads to communicate. However, in order to take advantage of them, they must be manually inserted into VP or model source code. To reduce this programming effort, these modelling primitives have been added to all implementers of the TLM Blocking Transport Interface, effectively allowing transaction objects to be passed across threads without source code modifications.

void socket::b_transport(tlm_generic_payload& tx, sc_time& delay);

Listing 5.1: TLM Blocking Transport Interface

Listing 5.1 shows the TLM Blocking Transport Interface. It is used by TLM sockets to pass a transaction object from an initiator, such as a processor, to a target, such as a memory or bus. The target is then required to either act upon the request or, if it is an interconnect, forward it to the designated target. The transaction object tx encodes the type of operation (read or write), the target address and a buffer that either holds the data to be written or serves as storage for the results of a read operation. The non-negative delay parameter, henceforth called ∆ttx, annotates the local time offset of tx. If a processor runs ahead of simulation time, e.g. during a TLM quantum, ∆ttx must be used to indicate that the transaction is sent ahead of time. The target may now either decide to synchronise with simulation time by calling wait(∆ttx) or is free to also respond ahead of time, possibly increasing ∆ttx to account for the time needed to service the request.
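For illustration, a minimal loosely-timed initiator using the standard TLM-2.0 utilities could issue such an ahead-of-time transaction as sketched below; the target address and the 100 ns offset are placeholders and the module is not part of the VPs discussed here.

#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>

struct Initiator : sc_core::sc_module {
    tlm_utils::simple_initiator_socket<Initiator> socket;

    SC_CTOR(Initiator) : socket("socket") { SC_THREAD(run); }

    void run() {
        unsigned char buf[4] = {};
        tlm::tlm_generic_payload tx;
        tx.set_command(tlm::TLM_READ_COMMAND);
        tx.set_address(0x1000);                 // placeholder target address
        tx.set_data_ptr(buf);
        tx.set_data_length(sizeof(buf));
        tx.set_streaming_width(sizeof(buf));

        // Local time offset: this initiator runs 100 ns ahead of SystemC time.
        sc_core::sc_time delay(100, sc_core::SC_NS);
        socket->b_transport(tx, delay);         // the target may increase delay

        wait(delay);                            // resynchronise with simulation time
    }
};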


Figure 5.5: Classic and time decoupled TLM transaction timing

The left hand side of Figure 5.5 presents an example of a typical TLM transaction including its timing behaviour using regular SYSTEMC. At timestamp t1 = 1 µs an initiator sends a transaction to a target, for example a network router forwarding a packet to its peer. Local packet processing and routing calculation is assumed to take 100 ns. Since local time synchronisation incurs a significant performance penalty, the router model instead forwards the packet ahead of time by setting ∆ttx = 100 ns before invoking the TLM blocking transport interface function. During transaction transmission no time passes and the target receives the packet while simulation time is still at 1 µs. The receiving router is now free to process the packet further, but must notify the sender once reception has completed. Assuming this process takes another 100 ns, ∆ttx is increased again and the interface function returns 200 ns ahead of time. Once the sender receives the response, it calls wait to synchronise with simulation time and thereby concludes the transmission process at timestamp t1 = 1200 ns.

This timing behaviour changes once time decoupling using the SCOPE kernel comes into play. The right hand side of Figure 5.5 illustrates the case where a transaction is transmitted between two threads whose local times differ by 49 ns < ∆tla = 50 ns. As in the first example, the network router sends a packet in the form of a TLM transaction. It is transmitted ∆ttx = 100 ns ahead of time to account for local processing and routing calculations. In order to bridge the time gap between the two threads, 50 ns of the annotated local time offset are consumed. This allows advancing the reception timestamp from 1 µs to 1050 ns. The lookahead constraint (c.f. Equation 5.1) guarantees that this timestamp has not yet elapsed in the context of the second thread, due to ∆tla = 50 ns and t1 = 1 µs.


Consequently, the transaction is received 50 ns ahead of time. The receiver increases ∆ttx again by 100 ns to account for local processing of the packet before returning the successful reception response back to the sender. The backward path works similarly to the forward path and consumes another 50 ns of ∆ttx. Note that this is done even though thread 2 is already ahead in time, since deterministic operation must not rely on the relative time delta ∆ti,j.2 The response is received by thread 1 at t1 = 1100 ns with ∆ttx = 150 ns − ∆tla = 100 ns. After calling wait, the transmission is once more completed at t1 = 1200 ns.

Figure 5.6: Augmented TLM target socket in SCOPE

5.2.4 Augmented TLM Target Socket

Augmented TLM target sockets make use of remote events and remote event queues to implement the blocking transport interface for cross-thread communication as presented in the previous section. They are based on the tagged and non-tagged versions of the standard TLM simple target sockets and are automatically used when compiling against the SCOPE SYSTEMC headers. Consequently, no VP source code modifications are required; recompilation using SCOPE is sufficient to take advantage of parallel simulation. This section outlines the design of augmented TLM target sockets. Due to their similarity, only the non-tagged version is presented. The implementation for tagged sockets follows directly from it.

2 Since threads likely operate at different speeds on the host computer, local times ti and tj advance in an uncontrollable fashion. Consequently, this is also true for derived values, such as ∆ti,j.

Figure 5.6 presents an overview of the flow of a transaction through an augmented socket. A regular simple target socket allows SYSTEMC model wrappers to register a callback function, which will be invoked once the socket receives a blocking transport request (1). Augmented sockets begin by first checking the origin of the transaction. If the request is issued from within the same thread as the socket has been assigned to, no precautions must be taken and the call is forwarded normally (2), ultimately resulting in invocation of the user defined callback function (3). If the transaction is sent from a different thread, it must take a more elaborate route in order to guarantee functional and temporal correctness before continuing on to potentially unsafe model code. First, a remote event etx is created, which works as a wakeup mechanism to allow the calling process to yield until its transport request has been fulfilled. The transaction tx, the local time offset ∆ttx and the wakeup event etx are then put into a remote transaction queue for further processing (4). The calling process yields control by waiting on etx (5). The remote transaction queue (6) uses a remote event queue to wake up a receiver process recv. Once recv becomes active, it extracts all transactions with the smallest timestamp from the queue (7). Note that recv cannot be used directly for forwarding, since the user code might block, e.g. by calling wait, potentially halting the entire transport mechanism. Consequently, each transaction is passed to a free worker thread w that handles this task instead of recv (3). Once the callback returns (8), w notifies etx (9) and the calling process resumes. Since w executes on the same thread as the model, no data races can occur and functional correctness is retained. To ensure temporal correctness, the local time offset ∆ttx must be translated between the time zones of the sender ti and the receiver tj. Since the difference between ti and tj is unknown, the relative time delta ∆ttx must first be converted to an absolute timestamp Ttx, which refers to the intended reception time of transaction tx:

T_{tx} = \Delta t_{tx} + t_i \qquad (5.2)

Using Ttx, w can calculate the new value of ∆ttx in the context of its local time tj:

\Delta t_{tx} \leftarrow T_{tx} - t_j \qquad (5.3)

The activation time of w, i.e. tj, is known; w becomes active as a consequence of thread i inserting tx into the remote transaction queue at time ti with ∆tnotify = ∆tla:

t_j \overset{!}{=} t_i + \Delta t_{la} \qquad (5.4)

Thus, the update computation of ∆ttx from Equation 5.3 can be simplified:

\Delta t_{tx} \leftarrow T_{tx} - t_j                                  (using Equation 5.3)
              \leftarrow T_{tx} - (t_i + \Delta t_{la})                (using Equation 5.4)
              \leftarrow \Delta t_{tx} + t_i - t_i - \Delta t_{la}     (using Equation 5.2)
              \leftarrow \Delta t_{tx} - \Delta t_{la}                 (5.5)
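In code, the per-hop update of Equation 5.5 together with the temporal correctness check reduces to a few lines, sketched here with illustrative names:

#include <systemc>
#include <cassert>

// Translate the annotated local time offset when a transaction crosses a
// thread boundary (Equation 5.5): one lookahead interval is consumed.
sc_core::sc_time translate_delay(const sc_core::sc_time& dt_tx,
                                 const sc_core::sc_time& dt_la) {
    // the sender must have stated the transaction sufficiently ahead of time
    assert(dt_tx >= dt_la && "transaction not stated sufficiently ahead of time");
    return dt_tx - dt_la;
}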

Equation 5.5 shows that the updated value of ∆ttx in the context of thread j only depends on its original value during sending and the constant lookahead ∆tla. It is therefore sufficient to store ∆ttx instead of Ttx in the remote transaction queue, as already shown in Figure 5.6. For the backward path (8 and 9 in Figure 5.6), another update must be performed to convert ∆ttx from time zone tj back to ti. Since the remote event etx is notified using ∆tnotify = ∆tla, the updated value of ∆ttx is again computed using Equation 5.5. During each update, temporal correctness is ensured by asserting ∆ttx ≥ ∆tla, so that the result is non-negative. Otherwise, an error message is issued, informing the developer to either reduce ∆tla or increase ∆ttx. Besides functional and temporal correctness, deterministic operation is another desirable attribute of augmented TLM target sockets. Unfortunately, being a subcomponent, sockets alone cannot enforce determinism at the system level. Instead, an implementation shall be considered sufficient as long as it does not introduce any new nondeterministic behaviour. Potential sources of nondeterminism in the context of TLM target sockets are the local time delta ∆ttx as well as the remote transaction queue. Since the queue might receive transactions from multiple threads, the order in which transactions arrive must be assumed to be random. However, to ensure that outgoing transactions are always reported to the receiver in the same order, an ordering constraint must be established. Transactions are first sorted by their intended reception time Ttx. Should multiple transactions with identical Ttx exist, they are sorted by originating thread ID. This leaves the local time offset ∆ttx as the only remaining source of potential nondeterminism. As shown in Equation 5.5, ∆ttx only depends on the sending timestamp ti and the constant lookahead ∆tla. It follows that all modifications to ∆ttx performed by the augmented socket are of a deterministic nature, if the sender also behaves deterministically, i.e., it sends every transaction at the same time ti in consecutive simulation runs. It can therefore be concluded that the augmented TLM target socket does not introduce any new nondeterminism and therefore retains the degree of determinism of the VP.
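The ordering constraint can be captured by a simple comparator, sketched below; the pending_tx structure is an illustrative stand-in for the entries of the remote transaction queue.

#include <systemc>
#include <algorithm>
#include <vector>

struct pending_tx {
    sc_core::sc_time reception_time; // intended reception time T_tx
    unsigned origin_thread;          // ID of the sending simulation thread
    // ... payload pointer, wakeup event, annotated offset, etc.
};

// Deterministic delivery order: sort by intended reception time first,
// break ties by originating thread ID.
void sort_deterministically(std::vector<pending_tx>& queue) {
    std::sort(queue.begin(), queue.end(),
              [](const pending_tx& a, const pending_tx& b) {
                  if (a.reception_time != b.reception_time)
                      return a.reception_time < b.reception_time;
                  return a.origin_thread < b.origin_thread;
              });
}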

5.3 Experimental Results

To study the potential benefits of asynchronous parallel simulation over sequential and synchronous approaches, multiple benchmarks based on the Virtual EURETILE Platform (VEP) have been conducted. The VEP has been chosen since its tiled architecture forms a suitable representation of future computation platforms and offers plenty of opportunity for parallelization. In total, three simulator variants are studied:

• OSCI: this variant is based on the OSCI SYSTEMC kernel version 2.3 [1, 80]. It represents the state-of-the-art sequential simulation technique.

• parSC: this variant employs the PARSC SYSTEMC kernel [165] including the legaSCi extensions [169]. It represents a synchronous parallel simulation approach.

• SCope: this variant takes advantage of the SCOPE SYSTEMC kernel as described in this chapter and represents an asynchronous parallel simulation approach.


Figure 5.7: Thread partitioning for a 4x4x4 VEP configuration

The performance of each simulator variant is explored using two distinct application scenarios, which are described in the following:

• presto: the presto application is used in this scenario. It exposes a high communication-to-computation ratio and is therefore well suited for studying the performance impacts of the proposed cross-thread communication facilities. The VEP is configured to operate in a 4 × 4 × 4 configuration and the simulation duration is set to Tpresto = 100 ms to allow a full iteration through presto.

• fft: this scenario focuses on the fft application, which exposes a low communication-to-computation ratio and is therefore well suited for highlighting the parallelization efficacy of a simulator variant. To allow all calculations to finish, the simulation duration is set to Tfft = 1487 ms. As with presto, this scenario also employs a 4 × 4 × 4 configuration for the VEP.

Three simulator variants and two application scenarios yield a set of six experiments. Each experiment was repeated a fixed number of times and only averages are reported. Detailed information about the runtime of single experiment iterations, repetition count and the simulation host can be found in Appendix C.

5.3.1 Experiment Setup

For each competing approach, a variant of the VEP has been created by compiling and linking the unmodified VP source code against the corresponding kernel and its header files. For the SCOPE variant, two additional steps were necessary: lookahead identification and partitioning. Since the VEP resembles a tiled architecture, it appears natural to perform partitioning on a tile level. To that end, every tile was marked as movable (c.f. Section 5.1.2), resulting in the initial partitions for two, four and eight threads as shown in Figure 5.7.

Figure 5.8: DNP transaction timing

The reason for these distribution patterns is the order in which the VEP instantiates its tiles within its 3D coordinate system. It starts by creating tiles along the x-axis, resulting in neighbouring tiles being assigned to different threads in a round-robin fashion. This causes layers to be formed along the y-z plane, so that tiles from the same plane are also simulated by the same thread. The next step for the creation of the SCOPE variant of the VEP is lookahead identification. The lookahead ∆tla is constrained by the ability of the simulation to state communication between threads ahead of time. In this case, cross-thread communication happens between neighbouring tiles that have been assigned to different threads, e.g., by a DNP sending a packet to one of its peers. The timing for the transmission of such a packet is illustrated by Figure 5.8. In the illustrated example, the DNP of tile (0/0/0) wishes to send a packet to tile (1/1/0), using (0/1/0) as a relay. Since (0/0/0) and (0/1/0) are simulated on the same thread (c.f. Figure 5.7), regular transaction timing is applied. For the final hop, i.e., from (0/1/0) to (1/1/0), the remote transaction mechanism needs to be used, since the receiver operates on a different thread. Internally, DNPs consist of a transmitter (Tx) and a receiver (Rx) as shown in Figure 5.8.3 Transmitters are responsible for route calculation and for forwarding the transaction to the peer which is next on the route of a packet. Routing calculations require 400 ns, allowing all packets to be sent ∆ttx = 400 ns ahead of time. This enables a maximum lookahead of ∆tla = 200 ns, since the forward and the backward path of a cross-thread transaction each consume one lookahead interval of the annotated offset (c.f. Equation 5.5). Receivers then handle arriving packets and pass them on to their local transmitter for further forwarding or report successful reception via an interrupt to the on-tile RISC processor. Arriving packets are always processed in-time, enforcing an additional wait(∆ttx) call. Because of this call, the timing behaviour outside of the DNP is independent of whether the packet was transmitted locally or across threads.

3 A more detailed description of the DNP internals is presented by Ammendola et al. [3]

Variant   presto runtime (s)   fft runtime (s)
OSCI      502.49               14071.93
parSC     594.19               17361.05
SCope     526.04               15471.38

Figure 5.9: Sequential simulation performance in presto and fft

5.3.2 Sequential Performance

The first set of experiments analyses the sequential performance of the different approaches. To that end, all three variants of the VEP have been executed, but the parallel simulators have been limited to use only one thread. Figure 5.9 reports the runtime for the presto and fft scenarios. In both scenarios, the OSCI variant using regular sequential simulation operates fastest and consequently yields the shortest runtime. In presto, it outperforms the SCOPE and PARSC variants by 4.5% and 15.4%, respectively. Even better results are observed in the fft scenario: here, OSCI offers 9.1% shorter runtimes than SCOPE and 18.9% shorter runtimes than PARSC. This is no unexpected result: OSCI represents the state-of-the-art in sequential ESL simulation and has received constant performance improvements over the years by the EDA community in order to satisfy consumer demands for ever higher simulation speeds. Another reason for the slower simulation speeds of the parallel approaches is the fact that they still perform synchronisation operations on state that is considered to be shared, although there is no other thread it could be shared with. Performance of the PARSC simulator is hit especially hard, since its synchronous approach enforces synchronisation at every delta cycle, whereas SCOPE does this only once its current lookahead interval of ∆tla = 200 ns is depleted, allowing synchronisation-free operation for approximately 20 delta cycles4. A possible workaround for the slow sequential performance of both parallel approaches would be an alternative code path that falls back to regular sequential operation when single-threaded execution is detected. Finally, this first set of experiments allows evaluating the implementation quality of SCOPE. Unlike PARSC, which is based on the OSCI kernel, SCOPE has been implemented from scratch and did not benefit from years of optimisation by the EDA community. However, by achieving simulation speeds close to the industry standard in a worst case situation, i.e., with no parallelization potential, the implementation of the SCOPE kernel must be considered sufficiently optimised to compete in scenarios featuring realistic VPs as they are employed by the industry today.

4 delta cycles ≈ CPU clock cycles = ∆tla · fCPU = 200 ns · 100 MHz = 20 cycles

Variant             presto runtime (s)   fft runtime (s)
SCope (8 threads)   70.67                1606.48
SCope (4 threads)   129.47               3448.50
SCope (2 threads)   266.49               7586.49
parSC (8 threads)   283.69               6849.37
parSC (4 threads)   330.67               8534.11
parSC (2 threads)   432.85               12109.08
OSCI (1 thread)     502.49               14071.93

Figure 5.10: Parallel simulation runtime in presto and fft

5.3.3 Parallel Performance

The second set of experiments is concerned with the parallel performance of the proposed simulation kernel. To that end, the limitation to only use one thread has been lifted and the presto and fft application scenarios are repeated for the SCOPE and PARSC simulator variants. The comparison with OSCI allows assessment of the potential performance gains possible beyond the state-of-the-art. Furthermore, by putting SCOPE into competition with PARSC, evaluation of asynchronous parallel simulation techniques against synchronous ones becomes possible. The six benchmarks are repeated using two, four and eight threads for the PARSC and SCOPE variants. The measured parallel runtime is presented in Figure 5.10, including the sequential runtime for OSCI taken from the previous experiments. Initial observation shows that both parallel simulators outperform the sequential one as soon as they are allowed to use more than one thread. For example, using two threads, the PARSC variant shows a 14% shorter runtime for both the presto and fft scenarios. Furthermore, it can be seen that simulation performance improves continuously as more threads are used. Since both application scenarios employ a 4 × 4 × 4 VEP configuration, i.e., 64 tiles in total, enough parallelizable work is present to offset synchronisation overhead. This trend is expected to continue when using more threads, as long as the simulation host processor provides enough physical cores with linked data caches. Since this work employs a simulation host with an octa-core Intel i7-5960X processor, eight threads have been chosen as a maximum.⁵ When comparing the PARSC and SCOPE variants directly, it can be seen that SCOPE generally provides shorter runtimes than PARSC for the same number of threads. For both scenarios, the fastest simulation performance is achieved using the SCOPE variant with eight threads.

⁵ More details regarding the simulation host machine and processor can be found in Appendix C.


Figure 5.11: Parallel simulation speedup in presto (left) and fft (right)

To put the performance gains into context, speedup values were derived based on the runtimes shown in Figure 5.10, using the OSCI variant as a baseline. Figure 5.11 presents the results. In presto, SCOPE achieves speedups of 1.9×, 3.9× and 7.1× for two, four and eight threads, respectively. In the same scenario, PARSC reaches 1.2×, 1.5× and 1.8×. The results differ from those previously presented by Schumacher et al. [169] for two reasons: (i) a different simulation host is used; (ii) clock sharing [169] is no longer available in the most recent version of the VEP. Because of this, the number of clock events to be handled rises linearly with the number of tiles simulated. While event processing can proceed in parallel when using SCOPE, the synchronously operating PARSC performs this task during its sequential notification phase, thereby creating a bottleneck. A similar behaviour can be seen in the fft application scenario. Because of its synchronous nature, PARSC cannot process update and notification phases in parallel and needs to perform synchronisation at every delta cycle, i.e., every tclock = 10 ns at fRISC = 100 MHz. In contrast, SCOPE is able to simulate the on-tile RISC processor freely for ∆tla = 200 ns before resynchronising with the other threads. Consequently, SCOPE achieves higher simulation speeds than PARSC. Finally, the super-linear speedup of 8.8× in fft using eight threads is noteworthy. A possible reason for this effect can be found in the combination of the cache architecture of the host system and the characteristics of the fft application. The effective first-level cache size accumulates to 8 · 32 KiB = 256 KiB, because a total of eight cores is being used. For the second level, a combined size of 8 · 256 KiB = 2 MiB becomes available. Consequently, more of the working set of the VEP can be accommodated at the same time, dramatically reducing memory access times. This proves to be especially beneficial for the fft application: due to its computationally intensive nature, simulation of the ISS dominates execution time and greatly benefits from faster access, e.g., to the data structures holding the simulated RISC register bank and pipeline registers. While the presto application also benefits from this effect in a similar way, it also spends a comparably higher amount of time handling simulation of the DNP than fft. Since presto constantly uses the remote transaction mechanism, it incurs synchronisation overheads that offset the benefits of a larger accumulated cache.


Figure 5.12: Speedup with varying lookahead ∆tla

5.3.4 Lookahead Analysis

The final set of experiments investigates the impact of the lookahead on the parallel simulation performance of SCOPE. Twelve lookahead values from the interval between 1 ns and 200 ns have been selected. For each lookahead value, the SCOPE variant of the VEP executes the presto and fft application scenarios using two, four and eight threads, resulting in a total of 72 experiments. The runtime of each experiment is then compared to that of the OSCI variant and a speedup is derived. The results are summarised by Figure 5.12. The global trend shows that higher lookahead values yield better performance, due to fewer synchronisation operations. Another general observation is that the fft scenarios reach higher speedups than their presto counterparts for the same lookahead. Since presto makes heavy use of the DNP, it also faces higher performance penalties due to the cross-thread communication overhead than fft, which in turn universally achieves higher speedups for every tested value of ∆tla. This confirms the observations from the previous set of experiments and implies that the nature of an application needs to be taken into account during lookahead identification. Furthermore, it can be seen that speedups initially rise slowly until a lookahead of ∆tla = 10 ns is reached. From that point on, all experiments show a steeper performance increase. To understand this effect, it is important to recall that the clock of all on-tile RISC processors is fclock = 100 MHz. Consequently, each ISS executes one instruction every tclock = fclock⁻¹ = 10 ns of simulation time. Lookahead values smaller than that generate multiple synchronisation operations per simulated instruction. For example, with a lookahead of ∆tla = 5 ns, synchronisation occurs at ti + 0 ns, ti + 5 ns, ti + 10 ns and so on. However, the ISS is only active at ti + 0 ns and ti + 10 ns, leaving the timestamp at ti + 5 ns without any meaningful activity. Such idle cycles are the reason why runtimes of experiments with ∆tla < tclock show reduced performance.

Beyond ∆tla = 10 ns, speedup values for the fft application scenario quickly approach their maximum. A decoupling of two clock cycles, i.e., ∆tla = 2·tclock = 20 ns, is already sufficient to achieve approximately linear speedups of 1.8×, 3.9× and 7.6× for two, four and eight threads, respectively. For the latter case, superlinear behaviour can be observed starting from ∆tla = 3·tclock = 30 ns. Finally, the presto application scenario allows another assessment of the overhead introduced by the remote transaction mechanism. Despite running with the highest number of threads and consequently benefiting from the largest effective cache size, the experiment using eight threads exposes the worst parallel efficiency of 88% at maximum lookahead. For the same lookahead, the experiments with two and four threads reach efficiencies of 95% and 97%, respectively. The reason for this lies in the number of DNP connections that must use the remote transaction mechanism. Because of the partitioning along the x-axis, the variants using two and four threads incorporate only two remote connections per tile, i.e., one for the left and one for the right peer as shown in Figure 5.7. However, when moving on to eight threads, the VEP starts to additionally partition along the y-axis (c.f. Figure 5.7), producing four remote connections per tile. Doubling the number of remote connections in this experiment presents a possible reason for its reduced parallel efficiency, given the high communication-to-computation ratio of the presto application scenario.

5.4 Limitations and Outlook

The SCOPE simulation kernel is a prototypical implementation of the SYSTEMC standard [80] using asynchronous parallel simulation technology; as a prototype, the implementation is necessarily incomplete. Instead, the focus has been set on supporting high-level modelling primitives, such as TLM sockets and transport interfaces, since those are crucial for the construction of realistic VPs. Consequently, the implementation of low-level data types, such as logic values that are typically found in gate-level simulators, has been omitted. Leaving missing implementations open for future work, this section investigates conceptual issues that currently restrict the application domain of SCOPE and therefore impose requirements on VPs that must be met before the benefits of asynchronous parallel simulation can be reaped. In the context of realistic VPs, the following four central constraints have been identified:

(a) Cross-thread communication must be stated ahead of time. This is a crucial precondition for enabling time-decoupling between the individual simulation threads as required by asynchronous simulation approaches. Within the VEP, the routing calculation delay has been exploited for this purpose, and it can be expected that any NoC-based architecture will expose similar opportunities. However, for more traditional embedded SMP or HMP designs, no such calculations are performed, since they are usually interconnected using crossbar topologies. In those cases, the TLM quantum can be exploited: ISSs frequently

execute multiple instructions as a chunk to improve performance. Such a quantum is generally executed ahead of time, thereby offering an integration path.

(b) VP partitioning and lookahead selection is done manually. Both tasks require expert knowledge about VP internals. Especially partitioning poses a challenge, since it is driven by multiple objectives. First, partitioning needs to produce an even load distribution among all threads for optimal performance. While this task is supported by the built-in load balancer of SCOPE, identification of movable modules is still left to the developer. In that context, it is important to place simulation models that access shared data in an unsynchronized fashion on the same thread to avoid race conditions. Second, partitioning defines which TLM connections must make use of the remote transaction mechanism and thereby indirectly impacts the maximum lookahead. Future work might investigate profiling tools that aid developers in this task by automatically identifying high-latency TLM connections most suitable for decoupling.

(c) No support for non-blocking communication across threads. While rarely encountered in commercial solutions, the non-blocking transport interface of TLM is used in circumstances where precise modelling of the timing of a transaction is required. Generally speaking, the non-blocking interface dictates two separate interface function calls for sending a request and fetching the response. Since those calls are not allowed to block, i.e., call wait, the proposed remote transaction mechanism in the augmented TLM target socket cannot be used (c.f. Figure 5.6, step 5). Furthermore, crossing non-blocking calls from different threads exposes the VP to potential deadlocks. Possible solutions to this problem are investigated in Chapter 9 of this thesis. Alternatively, the non-blocking-to-blocking interface conversion mechanism of TLM simple target and initiator sockets can be used as a workaround; a short sketch of this workaround follows after this list.

(d) The direct memory interface cannot be used across threads. The preferred way for ISSs to communicate with memories, e.g., for fetching instructions or reading and writing data, is to use memcpy in combination with a plain C pointer to memory. Distribution of such pointers from the memory to the processor models is orchestrated by the TLM direct memory interface in a non-blocking fashion. Because of this, the proposed remote transaction mechanism cannot be used, similar to the non-blocking transport interface. Furthermore, by having processors access shared memory via pointers from different threads, race conditions within the simulated memory are created when two ISSs access the same address at the same time. Since this voids any claim to deterministic behaviour of a VP, use of the direct memory interface across threads has been disallowed.
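The workaround mentioned in constraint (c) relies on the nb/b conversion built into the TLM-2.0 convenience sockets: if a target registers only a blocking transport handler with a tlm_utils::simple_target_socket, incoming non-blocking calls are translated to b_transport by the socket itself, so the remote transaction mechanism remains usable. The following minimal sketch illustrates the pattern; the module and handler names are illustrative and not part of SCOPE or the VEP.

    #include <systemc>
    #include <tlm>
    #include <tlm_utils/simple_target_socket.h>

    // Hypothetical peripheral that only implements blocking transport. Initiators
    // issuing nb_transport_fw towards this socket are served via the socket's
    // built-in non-blocking-to-blocking protocol conversion.
    struct peripheral : sc_core::sc_module {
        tlm_utils::simple_target_socket<peripheral> socket;

        SC_CTOR(peripheral) : socket("socket") {
            socket.register_b_transport(this, &peripheral::b_transport);
        }

        void b_transport(tlm::tlm_generic_payload& tx, sc_core::sc_time& delay) {
            // ... process the access and add the device latency to 'delay' ...
            tx.set_response_status(tlm::TLM_OK_RESPONSE);
        }
    };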

Since none of these constraints are violated by the VEP, integration with SCOPE was swift and required only reasonable effort. However, realistic VPs might additionally require overcoming the challenges posed by issues (a) and (d). Possible solutions to both are presented and discussed in Chapter 6.

5.5 Synopsis

This chapter has presented the SCOPE SYSTEMC kernel, which extends the industry standard SYSTEMC SLDL with an asynchronous parallel simulation approach realised using time-decoupling. Functional and temporal correctness as well as deterministic execution are retained even for non-thread-safe legacy models, provided that all cross-thread communication exclusively uses regular SYSTEMC channels. Experiments with a realistic VP taken from the EURETILE project show promising results, achieving approximately linear speedups over the state-of-the-art sequential SYSTEMC kernel from OSCI. When comparing to synchronous parallel simulators, SCOPE showed significantly improved performance, resulting in 24 – 63% shorter runtimes on average in comparable scenarios. Integrating SCOPE into an already existing VP was possible with reasonable effort, since only partitioning and recompilation were required, thanks to compliance with the SYSTEMC standard. In conclusion, it can therefore be said that asynchronous parallel simulation techniques appear to be a feasible solution to the problem of simulation speed degradation in modern multi- and many-core VPs. Productivity gains from 2× to 8× faster simulators should prove sufficient to offset the low initial integration costs. Apart from code optimisations, several general opportunities still exist to improve upon the presented work in the future. For example, partitioning is a manually driven process that requires insight into the structure of a VP. While this can conveniently be done on a SYSTEMC module basis, information regarding which model to offload to another thread is left to the expertise of the developer. A supporting toolset that can provide profiling information about the simulator would be helpful in this case. Moreover, full compliance with the remaining TLM interfaces would be desirable.

Chapter 6

Flexible Time-Decoupling

While the asynchronous parallel SYSTEMC simulation approach of SCOPE has proven to be an effective way to accelerate simulators for many-core architectures, it still exposes some tough prerequisites hindering its adoption in more mainstream VPs. Out of those, the most critical one is the requirement to state communication ahead of time in order to bridge the gap between time-decoupled threads. The GEMSCLAIM Virtual Platform, for example, falls victim to this and would first require a modification of the processor models before being compatible with SCOPE. However, such a manual adaptation of processor communication patterns within a VP often requires reengineering, a costly and time-consuming process, which stands at odds with the benefits VPs are supposed to bring to the table in the first place. This chapter presents flexible time-decoupling [202, 205] as a potential solution: by relaxing timing constraints for cross-thread transactions, flexible time-decoupling enables parallel simulation of models without a priori communication knowledge and facilitates integration of the SCOPE kernel with existing VPs. At its core, it represents a tradeoff between simulation accuracy and determinism on the one hand, and high simulation speed and compatibility on the other. As such, the SCOPE kernel has been augmented to support flexible time-decoupling as an optional feature in the form of different simulation modes. Original timing behaviour is retained in the accurate mode. Relaxed timing is permitted in the deterministic and fast modes. The remainder of this chapter is dedicated to an in-depth discussion and evaluation of those simulation modes. First, Section 6.1 gives an overview of the extensions of SCOPE before enumerating the changes to simulation timing implied by each mode. Next, Section 6.2 presents novel communication primitives introduced to the SCOPE kernel in order to support flexible time-decoupling in existing VPs in a seamless fashion. Section 6.3 investigates the interplay between flexible time-decoupling and temporal decoupling as encouraged in TLM. Subsequently, the proposed design is put to the test using synthetic and realistic experiments based on the GEMSCLAIM Virtual Platform (GVP) in Section 6.4. Finally, the chapter concludes with a discussion of limitations and potential future work in Section 6.5 and a summary in Section 6.6.

6.1 Simulator Operation Modes

The goal of flexible time-decoupling is to relax timing constraints for cross-thread transactions while retaining temporal correctness and preventing causality errors. The main constraint of concern here is that the local time offset of a transaction at the time of sending must not be smaller than the predefined lookahead, i.e., ∆ttx ≥ ∆tla, for

both the forward and backward paths of the TLM blocking transport interface (c.f. Section 5.2.4). In that context, ∆ttx is considered an attribute of the third-party simulation model and is assumed immutable in realistic scenarios, for example because of insufficient source code access or lack of expertise in the underlying design. However, many ESL models only communicate synchronously with simulation time [139, 82]. VP designers are consequently forced to set ∆tla = ∆ttx = 0, thereby effectively serialising the simulation and voiding any potential performance gains. To overcome this obstacle, the SCOPE SYSTEMC kernel was extended to support different operation modes that define the degree to which the kernel is allowed to alter model communication timing to support time-decoupling. A simulator operating in any of these modes is considered to make use of flexible time-decoupling in the remainder of this work. Three distinct simulation modes are introduced:

• Accurate Simulation Mode forbids altering model timing. This ensures identical simulator behaviour compared to the sequential OSCI reference kernel. A runtime error will be issued if communication is not stated sufficiently ahead of time and would thus cause a violation of the lookahead constraint. This is the default mode and lets SCOPE operate as described in Chapter 5.

• Deterministic Simulation Mode alters model timing for cross-thread commu- nication. These alterations are performed in a reproducible fashion to support construction of deterministic simulators for use with target software debuggers.

• Fast Simulation Mode alters model timing for cross-thread communication only as much as necessary to cross time zones. Fast mode simulations may behave nondeterministically and are therefore allowed to make use of the TLM Direct Memory Interface (DMI), even across threads, for optimal performance.

The tagged and non-tagged versions of the TLM augmented simple target sockets have been modified to support these simulation modes for flexible cross-thread communication. Similar to the remote transaction mechanism (c.f. Section 5.2.4), the modifications work under the hood and simulators can seamlessly take advantage of them just by compiling against the SCOPE TLM headers. Timing alterations are exclusively performed via modification of the local time offset ∆ttx, and all extra time added is recorded as the per-transaction timing error ∆εtx. When using flexible time-decoupling, any transaction tx stated at timestamp ti in violation of the lookahead constraint is automatically postponed to ti + ∆tfw. In this context, ∆tfw refers to the artificial delay optimally selected by the currently active simulation mode to satisfy the timing constraints. It is then possible to send such a postponed transaction at ti with ∆ttx = ∆tfw, i.e., with a sufficient local time offset. The procedure is repeated on the return path, adding another artificial delay ∆tbw as defined by the simulation mode. The produced timing error of tx is stored with the transaction as ∆εtx = ∆tfw + ∆tbw. Furthermore, every augmented socket s keeps an accumulated timing error ∆εs = Σtx∈s ∆εtx of every flexible transaction that passed through it. In the following, the new flexible time-decoupling simulation modes and their impact on simulation timing are explained in detail.


Figure 6.1: Modified transaction timing in deterministic mode

6.1.1 Deterministic Simulation Mode

When switched into deterministic simulation mode, SCOPE no longer issues a runtime error for remote transactions with ∆ttx < ∆tla. Instead, ∆ttx is dynamically adjusted so that it meets the lookahead constraint. Additionally, the adjustment has the goal of retaining the degree of determinism present in the employed simulation models and of not introducing new randomness, e.g., by relying on the local thread time ti. This can effectively be assured if the chosen time adjustments ∆tfw and ∆tbw are constant. Assuming deterministic behaviour on both sender and receiver side, i.e., ∆ttx = const, choosing the adjustment as described by Equation 6.1 satisfies both requirements while keeping the transaction timing error ∆εtx minimal.

∆tfw = max(∆tla, ∆ttx)        ∆tbw = max(∆tla, ∆ttx)        (6.1)

When using the deterministic timing adjustments as presented by Equation 6.1, the augmented TLM target socket also has to update the transaction timing error ∆εtx. The calculation is performed according to Equation 6.2:

∆εtx = ∆tla − ∆ttx if ∆ttx < ∆tla, and ∆εtx = 0 otherwise.        (6.2)

The timing adjustment procedure for the deterministic simulation mode is illustrated in Figure 6.1. The upper half of the figure presents a typical zero-delay ESL transaction, while the lower half shows the same for a scenario with two threads. The procedure works as follows: before the transaction is forwarded to the remote thread, SCOPE checks if ∆ttx < ∆tla. In that case, ∆εtx is set to ∆tla − ∆ttx and ∆ttx is set to ∆tla. The transaction is then forwarded to the target as normal. On the receiving side, most target models increase ∆ttx by a fixed amount of time required to process the request, henceforth denoted as ∆tp, before returning ahead of time.


Figure 6.2: Modified transaction timing in fast mode when t0 > t1

However, some models may also choose to leave ∆ttx unchanged and instead call wait to account for ∆tp, thereby immediately synchronising with local simulation time. In both cases, once the target returns, a second adjustment of ∆ttx is performed so that ∆ttx ≥ ∆tla also holds for the return path. The mechanism is the same as on the forward path: ∆ttx is increased if necessary and the extra time is accounted for by increasing ∆εtx. Targets that do not call wait and instead return ahead of time perform better in this situation. They increase ∆ttx instead, which in turn reduces ∆εtx and also avoids the overhead caused by time synchronisation. Figure 6.1 presents a worst-case scenario, where ∆ttx must be adjusted twice, for the forward and return paths, resulting in an accumulated ∆εtx = 2∆tla. Note that if the target had not called wait to account for ∆tp, the timing error would have been reduced by that amount. After the transaction response has been returned to the sender, the transaction timing error ∆εtx denotes the total extra time taken for the transmission of the transaction tx, which would not have occurred in a sequential simulation or between senders and receivers located on the same thread.
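Taken together, the deterministic-mode adjustment of Equations 6.1 and 6.2 amounts to a simple clamp of the local time offset. The following sketch shows how it could look for a single hop; the helper and its names are illustrative and do not reflect the actual SCOPE internals.

    #include <systemc>
    #include <algorithm>

    struct hop_adjustment {
        sc_core::sc_time t_fw;   // adjusted offset used to forward the transaction
        sc_core::sc_time error;  // timing error added on this hop
    };

    // Deterministic mode: clamp the stated offset t_tx to the lookahead t_la
    // (Equation 6.1) and record the added time as the timing error (Equation 6.2).
    hop_adjustment adjust_deterministic(const sc_core::sc_time& t_tx,
                                        const sc_core::sc_time& t_la) {
        hop_adjustment h;
        h.t_fw  = std::max(t_tx, t_la);
        h.error = (t_tx < t_la) ? (t_la - t_tx) : sc_core::SC_ZERO_TIME;
        return h;
    }

The same adjustment is applied on the return path, which is where the worst-case error of 2∆tla per transaction comes from.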

6.1.2 Fast Simulation Mode

The second new simulation mode that has been added to SCOPE is called fast mode. It trades deterministic cross-thread communication for an increase in simulation speed and a reduction of the transaction timing error. Considering the deterministic simulation mode described in the previous section, always increasing ∆ttx to ∆tla on both the forward and backward paths corresponds to a worst-case assumption. This is only necessary when, upon sending the transaction, the target is maximally ahead of the initiator by ∆tla and, in turn, when sending the response, the initiator is maximally ahead of the target. However, experiments have shown that this is rarely the case. In realistic cases, it can be assumed that the local time difference between two threads, ∆ti,j = ti − tj, is approximately constant for the entirety of a transaction. A positive value indicates that the initiator on thread i is ahead in time. In this situation, it is not necessary to increase ∆ttx, since the intended arrival timestamp of the transaction cannot yet have elapsed in the context of the target on thread j, which is lagging behind. A transaction timing adjustment is only necessary on the return path in order to get back into the time zone of thread i, which is still ahead. However, since


Figure 6.3: Modified transaction timing in fast mode when t0 < t1

the time difference ∆ti,j is known, the adjustment can be chosen as small as possible in order to minimise ∆εtx. This situation is illustrated in Figure 6.2. A complementary case exists when the target is ahead in time of the initiator, i.e., ti < tj and thus ∆ti,j < 0. Figure 6.3 shows how the timing adjustment occurs on the forward path in this case, but only with the minimum amount necessary to bridge the time gap, i.e., ∆tfw = −∆ti,j. A further timing adjustment on the backward path is not necessary, since the target is ahead in time of the initiator. Equation 6.3 summarises the timing adjustment on the forward (i → j) and backward (j → i) paths for both cases:

∆tfw = max(−∆ti,j, ∆ttx)        ∆tbw = max(∆ti,j, ∆ttx)        (6.3)

Given the timing adjustments as described in Equation 6.3, the transaction timing error ∆εtx can be derived as shown in Equation 6.4:

∆εtx = |∆ti,j| − ∆ttx if ∆ttx < |∆ti,j|, and ∆εtx = 0 otherwise.        (6.4)

The introduction of zero-delay cross-thread communication requires modification of the remote transaction mechanism and the underlying remote events and remote event queues. Notifications with ∆tnotify < ∆tla are no longer rejected in fast mode. Instead, a notification for a remote event e issued by thread i is processed as normal by its owning thread j if tj < te = ti + ∆tnotify. During this procedure, thread j receives a time lock, prohibiting it from advancing its local time until the notification has completed. The fast simulation mode features lower timing errors on average compared to the deterministic one, with ∆εtx < ∆tla versus ∆εtx = 2∆tla, respectively. Furthermore, optimised communication primitives and protocols become available in fast mode that yield higher simulation performance. These are discussed in the following section. However, it should be noted that the fast simulation mode cannot guarantee deterministic timing behaviour, since ∆tfw and ∆tbw both depend on the local time difference ∆ti,j, which in turn depends on how much runtime each simulation thread gets allocated by the host computer. Altering simulation behaviour might cause problems in certain use cases, such as during the debugging of a race condition within the simulated software. In such cases, the ability to exactly reproduce the error in subsequent simulation runs is paramount, so it is suggested to revert back to deterministic mode.
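In contrast to the deterministic clamp, the fast-mode adjustment of Equations 6.3 and 6.4 only bridges the actual time gap between the two threads. A sketch of the forward-path case, again with purely illustrative names:

    #include <systemc>
    #include <algorithm>

    struct fast_hop {
        sc_core::sc_time t_fw;   // offset used to forward the transaction
        sc_core::sc_time error;  // timing error incurred on this hop
    };

    // Fast mode, forward path: the gap is only positive when the target thread
    // (local time t_j) is ahead of the initiator (t_i); otherwise the stated
    // offset t_tx already suffices and no error is incurred on this hop.
    fast_hop adjust_fast_forward(const sc_core::sc_time& t_tx,
                                 const sc_core::sc_time& t_i,
                                 const sc_core::sc_time& t_j) {
        sc_core::sc_time gap = (t_j > t_i) ? (t_j - t_i) : sc_core::SC_ZERO_TIME;
        fast_hop h;
        h.t_fw  = std::max(gap, t_tx);                                 // Equation 6.3
        h.error = (t_tx < gap) ? (gap - t_tx) : sc_core::SC_ZERO_TIME; // Equation 6.4
        return h;
    }

The backward path mirrors this with the roles of ti and tj exchanged, so that, for an approximately constant ∆ti,j, at most one of the two hops contributes to ∆εtx.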

6.2 Flexible Inter-thread Communication

Dropping the requirement for deterministic cross-thread transaction timing allows the construction of new modelling primitives, such as zero-delay remote notifications or remote signals, and enables the use of the TLM DMI across threads. Their design is directed towards a reduction of transaction timing errors and increased simulation performance. This section first introduces zero-delay remote notifications and remote signals before delving into a discussion on the use of DMI in fast mode, thereby trading simulation determinism and timing accuracy for speed.

6.2.1 Zero-delay Remote Notifications

The fast simulation mode enables zero-delay transportation of remote transactions with ∆ttx = 0 across threads when the time zones of sender and receiver match, i.e., when ti = tj. This functionality requires a modification of the remote event and remote event queue primitives, which are used by the remote transaction mechanism within the augmented TLM target sockets. When a zero-delay notification with ∆tnotify < ∆tla is requested while operating in fast mode, a best-effort notification is performed by trying to trigger the event as close as possible to the desired time. Considering a scenario where thread j issues a notification with 0 ≤ ∆tnotify < ∆tla for event e, which is maintained by thread i, three cases must be distinguished:

• Regular Notification: this case occurs when te > ti and is guaranteed whenever ∆tnotify ≥ ∆tla. No timing error is incurred in this situation. However, execution will be nondeterministic in fast simulation mode.

• Zero-delay Notification: this case occurs when te = ti. An attempt to insert e into the event queue of thread i at timestamp te is performed. If this is not possible, a delayed notification is performed instead.

• Delayed Notification: this case occurs when te < ti and the intended trigger time of e has already elapsed in the context of thread i. Instead, a best effort notification is performed, incurring a notification timing error ∆εe.

In order to support delayed and zero-delay notifications, the extended notification phase of SCOPE needs to be adapted. Figure 6.4 presents the changes in the context of Figure 2.1 and Figure 5.2. Internally, SCOPE keeps notified events in three distinct event queues in order to differentiate between events that have received immediate notifications EQimm, delta notifications EQdelta and timed notifications EQtimed. As with regular SYSTEMC, event processing begins with triggering all immediately notified events. Should none exist, a new delta cycle is started by incrementing δcycle, and delta notifications are subsequently handled. Finally, timed notifications are processed. All remote notifications, including zero-delay notifications with ∆tnotify = 0, are internally represented as timed notifications. Consequently, all operations on EQtimed need to be protected from race conditions.


Figure 6.4: Extended notification phase of SCOPE for zero-delay notifications

Each thread i holds a mutex λi for this purpose. Conceptually, holding λi grants the owner the ability to insert events and remote events into EQtimed. Furthermore, any thread can prevent thread i from advancing its local time ti by acquiring λi. As shown in Figure 6.4, a thread must first acquire its lock before it can process timed events. Should that cause the local time ti to advance beyond its limit tlim,i, the lock is released for a short time to give slower threads a chance to notify remote events. In order to avoid thread starvation, a fair ticket-based spinlock is used to implement λi. When a thread j performs a zero-delay notification of a remote event e on thread i with ∆tnotify < ∆tla, it proceeds as follows:

1. Issue a runtime error if not operating in fast simulation mode.

2. Acquire λi in order to prevent thread i from advancing its time any further.

3. Determine te as close as possible to tj + ∆tnotify, i.e., te = max(ti, tj + ∆tnotify).

4. Determine the remote notification error ∆εe = tj + ∆tnotify − te.

5. Insert e into EQtimed with trigger timestamp te.

6. Release λi.

The remote notification error ∆εe denotes the timing error incurred when the designated trigger time has already elapsed in the context of the target thread, i.e., ti > tj + ∆tnotify. If the notifying thread is ahead in time or time-synchronous with the target thread, this error will be zero. This is also the case if ∆tnotify is sufficiently large to bridge the time gap between both threads and allows the event to be triggered at the requested timestamp. However, should thread i have advanced beyond this point, the event will instead be triggered at ti, similar to a delta notification.
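A much simplified sketch of this notification procedure is given below. A plain std::mutex stands in for the fair ticket spinlock λi, the timed event queue is reduced to a multimap, and all names are illustrative; thread-safe interaction with the SystemC kernel itself is assumed to be handled by SCOPE.

    #include <systemc>
    #include <algorithm>
    #include <map>
    #include <mutex>
    #include <stdexcept>

    struct sim_thread {
        std::mutex lambda;                                             // time lock of this thread
        sc_core::sc_time local_time;                                   // t_i
        std::multimap<sc_core::sc_time, sc_core::sc_event*> eq_timed;  // EQ_timed
    };

    // Called by thread j (local time t_j) to notify event e owned by 'owner' (thread i).
    // Returns the deviation from the requested trigger time, recorded as the
    // remote notification error.
    sc_core::sc_time notify_zero_delay(sim_thread& owner, sc_core::sc_event* e,
                                       const sc_core::sc_time& t_j,
                                       const sc_core::sc_time& dt_notify,
                                       bool fast_mode) {
        if (!fast_mode)                                                // step 1
            throw std::runtime_error("zero-delay notification requires fast mode");

        std::lock_guard<std::mutex> lock(owner.lambda);                // step 2
        sc_core::sc_time requested = t_j + dt_notify;
        sc_core::sc_time te = std::max(owner.local_time, requested);   // step 3
        sc_core::sc_time error = te - requested;                       // step 4 (magnitude)
        owner.eq_timed.insert({te, e});                                // step 5
        return error;                                                  // step 6: lock released on return
    }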


Figure 6.5: Remote signal including driver stage between processor and clock model

6.2.2 Remote Signals

Many VPs employ SYSTEMC signals to model interrupt lines between peripheral components and processors. A signal can hold a value, usually of Boolean type to resemble a digital value, and provides an event notification when this value changes. The notification happens in the next delta cycle, triggering sensitive processes to execute and eventually notifying processor models about a rising or falling interrupt signal edge. Since no time passes between a change of a signal value and the event notification, no deterministic and temporally accurate implementation for signals across time-decoupled threads can be provided. However, in fast operation mode, SCOPE converts regular SYSTEMC signals to remote signals, providing a best-effort replacement that guarantees temporal and functional correctness in such cases. Temporal correctness requires that any value written to the remote signal by thread i at ti is not seen by thread j before tj = ti, i.e., not before it has caught up. Should a change be visible any earlier, i.e., at tj < ti, it would be possible for future signal changes to affect past simulation state, resulting in causality violations. Figure 6.5 shows how a signal driver is used to buffer write operations to the signal and bring them into temporal order. A remote event queue is used to store future signal values including the timestamp ts at which they should become visible to the signal. A driver SYSTEMC process is sensitive to the queue and extracts the next value from it before writing it to the actual signal. Determination of the trigger time ts for the signal driver works on a best-effort basis, similar to zero-delay remote notifications. Assuming a thread i writes the value vs at timestamp ti into the signal s, ts is calculated so that the non-negative signal timing error ∆εs = ts − ti becomes minimal. If thread i is ahead in time, the signal must see vs no sooner than at ts = ti. However, if thread i is lagging behind, the value is transmitted as soon as possible at ts = tj. In the latter case, the time lock λj must be acquired to prevent thread j from advancing any further before vs has been written. Computation of ts and ∆εs is summarised by Equation 6.5:

ts = max(ti, tj)        ∆εs = 0 if ti > tj, and ∆εs = tj − ti otherwise        (6.5)

Driver and event queue functionality are both transparently built into regular signals when using SCOPE to facilitate integration with existing VPs. However, they only become active when cross-thread operation in fast mode is detected.
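The driver stage of Figure 6.5 can be pictured roughly as in the following sketch: a writer buffers (timestamp, value) pairs, and a driver process applies them to the underlying signal once local time has caught up with the write timestamp. The real SCOPE implementation hides this machinery inside the signal itself and relies on the time-lock mechanism for thread safety; the stand-alone class below, its names and its simplistic locking are illustrative only.

    #include <systemc>
    #include <mutex>
    #include <queue>
    #include <utility>

    struct remote_signal_driver : sc_core::sc_module {
        sc_core::sc_signal<bool> sig;        // the actual signal seen by readers

        SC_CTOR(remote_signal_driver) : sig("sig") {
            SC_THREAD(drive);
        }

        // Called by the writing thread: buffer the value for timestamp ts.
        void write_at(const sc_core::sc_time& ts, bool value) {
            std::lock_guard<std::mutex> guard(lock);
            pending.push({ts, value});
            wakeup.notify(sc_core::SC_ZERO_TIME);   // wake the driver process
        }

    private:
        void drive() {
            for (;;) {
                wait(wakeup);
                for (;;) {
                    std::pair<sc_core::sc_time, bool> next;
                    {
                        std::lock_guard<std::mutex> guard(lock);
                        if (pending.empty())
                            break;
                        next = pending.front();
                        pending.pop();
                    }
                    // Do not apply the value before its timestamp has been reached.
                    if (next.first > sc_core::sc_time_stamp())
                        wait(next.first - sc_core::sc_time_stamp());
                    sig.write(next.second);
                }
            }
        }

        std::queue<std::pair<sc_core::sc_time, bool>> pending;
        std::mutex lock;
        sc_core::sc_event wakeup;
    };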

6.2.3 Remote Direct Memory Access

The TLM Direct Memory Interface (DMI) offers a standardised way for a processor and a memory model to communicate using pointers instead of IMCs. Given that memory operations like load, store and fetch are used intensely, avoiding the comparably high overhead of transmitting transaction objects yields significantly higher simulation speeds at the cost of reduced timing accuracy. DMI pointers provided by a memory component are always associated with an address range they are valid for, and their distribution and invalidation is orchestrated by a well-defined protocol.¹ Using DMI in a parallel simulation environment exposes a set of key challenges that need to be addressed in terms of determinism and functional correctness:

(a) DMI callbacks are not thread safe. Before a processor model is allowed to access memory data using a DMI pointer, it must first ensure that the pointer is still valid. After handing out such a pointer, a memory component may at any time invalidate it, or alter the address range or access type it is applicable to. Consequently, a processor model must keep a list of all DMI pointers that are currently available to use, and this list needs to be carefully updated as simulation progresses to ensure functional correctness. While this is usually not a problem in sequential simulators, parallel implementations require such a list to be protected from concurrent accesses, since invalidations might be triggered by operations of another processor model running in parallel on another thread.

(b) Parallel use of pointers causes nondeterminism. It is likely that multiple processor models hold a DMI pointer for the same memory region. When those models run in parallel and access the same memory address without proper synchronisation, a race condition within the simulated memory is created. While this is essentially identical to what happens on real hardware, it also causes the simulator to behave nondeterministically. This can render the VP unfit for use in debug scenarios where a complex race condition needs to be tracked down.

(c) Atomic models require atomic implementations. Multi-core processors require some form of atomic memory access operations in order to synchronise with each other. Read-Modify-Write (RMW) operations such as atomic-increment or compare-and-swap are frequently employed for this purpose. However, most processor models only model RMWs as executing atomically in simulated time, but not in real time. The cooperative process scheduler of SYSTEMC guarantees atomic execution for any model code, facilitating the modelling of atomics. However, in a parallel simulation environment, every atomic memory access via a DMI pointer must also execute atomically on the host. Otherwise it could be interrupted by another model running in parallel, breaking functional correctness.

As a result of these challenges, cross-thread DMI is only allowed in fast mode, since it may induce nondeterministic behaviour. Furthermore, some processor models might require manual tuning of atomic memory access operations.
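One way to address challenge (c) is to map the simulated RMW onto a host-atomic operation on the DMI pointer. The sketch below uses the GCC/Clang __atomic builtins for this purpose (C++20 std::atomic_ref would be an alternative); the function name and its integration into an ISS are assumptions, not part of the TLM standard or SCOPE.

    #include <cstdint>

    // 'dmi_ptr' points into the simulated memory as handed out via DMI.
    // The exchange happens in a single host-atomic step, so an ISS running on
    // another simulation thread cannot interleave with it.
    inline uint32_t atomic_swap32(unsigned char* dmi_ptr, uint64_t offset, uint32_t value) {
        uint32_t* word = reinterpret_cast<uint32_t*>(dmi_ptr + offset);
        return __atomic_exchange_n(word, value, __ATOMIC_SEQ_CST);
    }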

¹ An introduction to the TLM DMI protocol can be found in Appendix A.

6.3 Temporal Decoupling

Temporal decoupling refers to the modelling technique that enables simulation processes to operate ahead of local time. It allows the corresponding simulation processes to run ahead of time for a certain quantum ∆tq before having to synchronise with simulation time again by calling wait. This has two main benefits: firstly, calling wait less often reduces overhead and thereby increases simulation speed. Secondly, most ISSs that employ DBT perform best when executing a chunk of target instructions at once, which is facilitated by the concept of a quantum in temporal decoupling. The price of the increased performance is reduced timing accuracy. TLM temporal decoupling therefore shares similar goals and costs with the time-decoupling employed by SCOPE. This section investigates the magnitude of the timing errors caused by both approaches, before presenting an approach that combines temporal and time-decoupling for optimal performance without escalating the timing error.
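In TLM-2.0, temporal decoupling is typically implemented with the tlm_utils::tlm_quantumkeeper utility. The sketch below shows the basic pattern of an ISS loop running ahead of simulation time; the per-instruction delay and the omitted instruction execution are placeholders, not taken from any specific processor model of this thesis.

    #include <systemc>
    #include <tlm>
    #include <tlm_utils/tlm_quantumkeeper.h>

    struct iss_model : sc_core::sc_module {
        tlm_utils::tlm_quantumkeeper qk;

        SC_CTOR(iss_model) {
            SC_THREAD(run);
            qk.set_global_quantum(sc_core::sc_time(1, sc_core::SC_US)); // usually set once per simulation
            qk.reset();
        }

        void run() {
            while (true) {
                // A real ISS would execute one (or several) target instructions
                // here and determine their simulated duration.
                sc_core::sc_time delay(10, sc_core::SC_NS);
                qk.inc(delay);           // run ahead by accumulating a local offset
                if (qk.need_sync())
                    qk.sync();           // wait() until simulation time catches up
            }
        }
    };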

6.3.1 Timing Error

Reducing timing accuracy to increase performance is a typical trade-off in ESL simulation, e.g., in Simics [113, 47] or in the ARM programmer's view [8]. The magnitude of the timing error in the context of the application scenario decides whether a proposed abstraction is acceptable or not. Table 6.1 summarises this timing error for the flexible time-decoupling primitives employed by SCOPE in relation to the chosen lookahead.

Communication     Deterministic Mode               Fast Mode
TLM quantum       ∆εquantum ≤ ∆tq                  ∆εquantum ≤ ∆tq
TLM-BT            ∆εtx = ∆tla − ∆ttx ≤ ∆tla        ∆εtx = |∆ti,j| − ∆ttx < ∆tla
TLM-DMI           n/a                              ∆εDMI ≤ ∆tla
SYSTEMC signal    n/a                              ∆εs = tj − ti < ∆tla

Table 6.1: Cross-thread timing errors with flexible time-decoupling

Taking Equation 5.1 into consideration, the local time difference between two threads is limited by the lookahead, i.e., ∆ti,j < ∆tla. Furthermore, it is assumed that no a priori communication knowledge is present, i.e., ∆ttx = 0. Based on this, the lookahead ∆tla can be derived as the general upper bound for the worst-case timing error. However, it should be noted that ∆εtx is incurred per transaction, resulting in accumulated timing errors for peripheral models without DMI support that require a high rate of I/O operations, such as SD memory cards and similar block devices. Typical sequential VPs incur worst-case timing errors when a memory access is performed at the end of a quantum, right before time is re-synchronised. In this case, the timing error is limited by ∆tq. Assuming this is an acceptable bound for VPs employed by the industry today, choosing ∆tla = ∆tq keeps the additional timing errors from flexible time-decoupling within the same order of magnitude.

6.3.2 Mitigation Strategy

The transaction timing error ∆εtx is of particular interest when assessing timing accuracy. It represents an upper bound for a single transaction-based communication. Timing accuracy in the deterministic and fast simulation modes therefore depends on the number of transactions sent across different time zones. This effect can escalate in certain situations, e.g., when a processor accesses a block device without DMI, as mentioned previously. Here, the processor model spends most of its quantum inside load or store operations, waiting for TLM blocking transport calls to complete. Consequently, the instruction pipeline starves and causes the VP to become unresponsive. A mitigation strategy for this problem is to allow processor models to catch up if they spent their previous quantum waiting for remote transactions. Using the augmented TLM target sockets of SCOPE, a processor model can query ∆εs, i.e., the accumulated extra time its TLM socket spent waiting for remote transactions during the previous quantum, which would not have occurred in a sequential simulation or with regular transactions. Usually, the number of instructions a processor model should execute in a chunk is solely based on the quantum, i.e., ninsn = ∆tq · fclock. Afterwards, time synchronisation is performed by calling wait(∆tq). To compensate for the added delays and allow the instruction pipeline to catch up, a processor model may now additionally execute nextra = ∆εs · fclock instructions. The time required to perform these nextra instructions has already been accounted for during the previous quantum, so no extra synchronisation is required in this case.
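A minimal sketch of this catch-up computation is given below. The accumulated error ∆εs is assumed to be obtainable from the socket through some accessor; the function and its parameters are illustrative, not an actual SCOPE API.

    #include <systemc>
    #include <cstdint>

    // Number of instructions to execute in the next chunk: the regular quantum
    // share plus extra instructions covering the time spent waiting for remote
    // transactions (∆εs) during the previous quantum. The model still
    // synchronises with wait(quantum) only, since the extra time has already
    // been accounted for.
    uint64_t instructions_for_quantum(const sc_core::sc_time& quantum,      // ∆tq
                                      const sc_core::sc_time& socket_error, // ∆εs
                                      double f_clock_hz) {
        uint64_t n_insn  = static_cast<uint64_t>(quantum.to_seconds() * f_clock_hz);
        uint64_t n_extra = static_cast<uint64_t>(socket_error.to_seconds() * f_clock_hz);
        return n_insn + n_extra;
    }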

6.4 Experimental Results

Experimental evaluation of the proposed flexible time-decoupling is performed in two parts. Section 6.4.1 presents a quantitative analysis of the timing error in fast mode. A VP with synthetic application and network processor models is employed to study the impact of an uneven load distribution on timing accuracy. Subsequently, Sections 6.4.2 and 6.4.3 apply flexible time-decoupling in order to parallelize the heterogeneous multi-core GEMSCLAIM Virtual Platform (GVP). Because the GVP requires cross-thread DMI and interrupt signals, its parallelization only became possible thanks to the relaxed timing constraints offered by the primitives introduced in this chapter. The evaluation of speedup and timing error for both sets of experiments uses the industry-standard OSCI SYSTEMC kernel as a baseline reference. Each experiment was repeated a fixed number of times and only averages are reported here. Measurement was performed on a quad-core Intel i7 workstation PC clocked at 2.67 GHz with 12 GB RAM. To ensure consistent benchmarking results, temperature-based dynamic overclocking (Intel Turbo Boost) was disabled. More detailed information about the simulation host, as well as the repetition count and the runtime of single experiment iterations, can be found in Appendix C.


Figure 6.6: VP used for synthetic experiments

6.4.1 Synthetic Experiments

To assess the average timing error imposed on the simulation when operating in fast mode, a synthetic VP has been created, resembling a typical ESL modelling scenario. The platform is illustrated in Figure 6.6. It consists of two processors that communicate with each other using network routers. The processors execute No Operation (NOP) instructions at statically configurable operation frequencies fcpu0 and fcpu1. The load that is caused on a host thread to simulate the execution of a single NOP instruction in a processor can also be configured statically, using the parameters load0 and load1 for CPU0 and CPU1, respectively. This enables modelling of systems where different types of processors, such as RISCs and VLIWs, are simulated together, with the corresponding ISSs causing different loads on the host. Router communication is modelled by the exchange of data tokens. Every ten instructions, the processors trigger their routers to exchange such a token. The router operation frequency is another statically configurable parameter, denoted as frouter0 and frouter1. At their next clock cycle, the routers transmit waiting tokens to their peer instantaneously, i.e., with ∆ttx = 0. On the receiving end, correct transmission is checked and the extra time taken for transportation between threads is accounted towards a global timing error ∆ε. The simulation ends after the exchange of n = 1000 data tokens and the average timing error per transaction ∆εtx = ∆ε/2n is reported.

Symbol      Description                     Value
fcpu0       CPU0 clock frequency            100 MHz
frouter0    Router0 clock frequency         10 MHz
load0       Host cycles per NOP on CPU0     1k ... 10k cycles
fcpu1       CPU1 clock frequency            100 MHz
frouter1    Router1 clock frequency         10 MHz
load1       Host cycles per NOP on CPU1     1k ... 10k cycles

Table 6.2: Experiment parameters for the synthetic VP
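A processor model of this synthetic VP can be pictured roughly as in the following sketch: each NOP burns a configurable number of host cycles and advances simulation time by one clock period, and every tenth instruction a token exchange is triggered. The class, its parameters and the omitted router interface are illustrative and do not claim to reproduce the actual benchmark code.

    #include <systemc>

    struct synthetic_cpu : sc_core::sc_module {
        sc_core::sc_time clock_period;   // 1 / fcpu
        unsigned host_load;              // host cycles consumed per simulated NOP

        SC_HAS_PROCESS(synthetic_cpu);
        synthetic_cpu(sc_core::sc_module_name nm, sc_core::sc_time period, unsigned load)
            : sc_core::sc_module(nm), clock_period(period), host_load(load) {
            SC_THREAD(run);
        }

        void run() {
            unsigned count = 0;
            while (true) {
                for (volatile unsigned i = 0; i < host_load; i++)
                    ;                    // emulate the host load of one NOP
                wait(clock_period);      // one instruction per clock cycle
                if (++count % 10 == 0) {
                    // trigger the local router to exchange a token (omitted here)
                }
            }
        }
    };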

The synthetic VP has been constructed using SCOPE in fast mode and the experiment parameters from Table 6.2. Each processor-router subsystem is simulated on its own thread, causing all communication between the routers to use the flexible remote transaction mechanism.

Figure 6.7: Relative timing errors for synthetic loads
(a) Map of the transaction timing error ∆εtx with variable loads (∆tla = 10 ns)
(b) Average transaction timing error ∆εtx for ∆tla = 1 ns, 10 ns, 100 ns
(c) Average transaction timing error ∆εtx for ∆tla = 2 ns, 20 ns, 200 ns
(d) Average transaction timing error ∆εtx for ∆tla = 5 ns, 50 ns, 500 ns

The lookahead has been chosen to match the operation frequency of the processors, i.e., ∆tla = fcpu0⁻¹ = fcpu1⁻¹ = 10 ns. Various scenarios are tested using different load values for both ISSs. They range between 1k and 10k host cycles per simulated NOP instruction, since those values are typical for cycle-accurate simulators but also for faster ISSs running within a quantum. Figure 6.7a presents the results. The average timing error per transaction ∆εtx approximately lies between 0.2∆tla and 1.6∆tla in this scenario. Low timing errors are encountered when the simulation load on both threads is evenly balanced. In such cases, simulation time will advance close to synchronously in both threads, assuming both get an equal share of runtime on the host. Since the time difference of both threads ∆t0,1 is very small, only a small amount of extra time needs to be added per transaction, thus explaining the lower error bound. The upper error bound can be explained when looking at unevenly balanced load scenarios. Here, one thread generally runs ahead of the other one, which has to carry more load. The time difference between those threads is limited to ∆t0,1 < ∆tla, which means that the lookahead forms the upper limit of ∆εtx. It seems obvious that the balance of simulation load between both threads plays a significant role in the timing error imposed on the system.

In the following, only the ratio of load0 to load1 is considered and the scenarios are repeated with varying lookahead. Figures 6.7b – 6.7d show the results. Two different behaviour groups regarding the timing error can be identified in all tests. The first group consists of scenarios where ∆tla is smaller than the clock period of the processors, i.e., ∆tla < fcpu0⁻¹ = 10 ns. In those cases, the synchronisation overhead allows the receiving thread to simulate ahead in time of the sending one, forcing the sender to add ∆tla as a timing error. On the receiving side, reception of the transaction causes load, in turn allowing the sender to advance in time, so that it is ahead by ∆tla. This forces the receiver to add another ∆tla to the accumulated error in order to return the transaction result to the sender, i.e., ∆εtx ≈ 2∆tla. Therefore, small ∆tla values should be avoided, since they result in an increased relative timing error and reduced parallel speedup. The second group of scenarios comprises tests with ∆tla > fcpu0⁻¹ = 10 ns. In such cases, the maximum error approaches ∆tla for unevenly balanced loads, as previously discussed. The minimum error can generally be found in evenly balanced scenarios. Figures 6.7b – 6.7d show that a higher lookahead raises the average minimum error, ranging from ∆εtx = 0.51∆tla with ∆tla = 10 ns to ∆εtx = 0.77∆tla with ∆tla = 500 ns. The overall trend indicates that an evenly balanced simulation system will produce more accurate results in fast mode. The maximum timing error is 2∆tla, while in most cases with sufficiently high lookahead, ∆εtx ≈ ∆tla can be assumed.

6.4.2 GEMSCLAIM Experiment Setup

In the following, flexible time-decoupling is put to the test using a realistic VP. The GEMSCLAIM platform offers an interesting opportunity, as it features a sizable set of heterogeneous processors and its VP can therefore be expected to greatly benefit from parallelization. So far, the GVP has not been compatible with SCOPE due to its inability to state cross-thread communication between the processors and the main memory ahead of time. However, thanks to the flexible time-decoupling extension, it was possible to create two GVP variants that utilise either fast or deterministic simulation mode. In total, three different variants have been tested:

• OSCI/accur.: this timing accurate variant of the GVP has been constructed using the OSCI SYSTEMC kernel version 2.3 [1, 80] and serves as a baseline for speedup and timing accuracy analysis.

• SCope/fast: this variant employs the SCOPE SYSTEMC kernel in fast simulation mode. Cross-thread communication uses DMI whenever the global shared memory is accessed, while remote transactions are used for I/O components.

• SCope/deter.: this variant has been created by compiling the GVP against the SCOPE SYSTEMC kernel in deterministic mode. Cross-thread DMI is forbidden in this mode, so all memory accesses must use flexible remote transactions.

Except for the minor extension of setting the simulation mode at startup, no changes to the GVP were required to enable parallel simulation with SCOPE.

In order to provide an input stimulus for the simulator, the ocean current simulator from the SPLASH2 [217] benchmark suite has been ported to the GEMSCLAIM architecture. This experiment uses the parallel ocean simulator in its non-contiguous partition configuration, which is henceforth denoted as ocean-ncp. Each RISC and VLIW processor executes a single thread, with the first RISC assuming the role of the master and initialising all data. The SW port places the runtime stacks for each core on the local scratchpad memories. Since these are simulated by the same thread as the corresponding ISSs, the stack and all local variables are always accessed using DMI. Global variables have been placed in the shared memory and must be accessed using flexible remote transactions in deterministic mode. In fast mode, DMI is used for accessing both global and local memories. Synchronisation is achieved using pthreads-like spinlocks and barriers, which have been implemented using an atomic swap operation present in the instruction sets of both RISC and VLIW processors. However, the implementation of this operation relies on DMI pointers and bypasses the restrictions of the deterministic simulation mode. Consequently, deterministic operation of the parallel GVP variants cannot be guaranteed. Hence, an additional functional correctness test of the ocean-ncp benchmark has been performed by comparing its output to the golden reference provided with the application as well as to native execution on x86. To accelerate the experimentation process, the default problem size has then been reduced from 258 to 10, and correct operation was checked again by comparing to native execution output.
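The spinlock mentioned above can be pictured as the following target-software sketch. In the GEMSCLAIM port, the exchange maps onto the atomic swap instruction of the RISC and VLIW ISAs (which, inside the simulator, operates through a DMI pointer); here a GCC/Clang builtin merely stands in for it, and the names are illustrative.

    #include <cstdint>

    typedef volatile uint32_t spinlock_t;

    static inline void spin_lock(spinlock_t* lock) {
        // Atomically write 1 and obtain the previous value; spin while the
        // lock was already taken.
        while (__atomic_exchange_n(lock, 1u, __ATOMIC_ACQUIRE) != 0u)
            ;
    }

    static inline void spin_unlock(spinlock_t* lock) {
        __atomic_store_n(lock, 0u, __ATOMIC_RELEASE);
    }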

6.4.3 GEMSCLAIM Experimentation

Ocean-ncp is executed on the GVP using a symmetric cluster configuration featuring 16 + 16 RISC and VLIW processors. At the end of each complete run through ocean-ncp, simulator runtime and simulation duration are extracted. Each experiment is repeated a fixed number of times and only averages are reported here. Detailed information about the runtime and duration of single experiment iterations, the repetition count and the simulation host can be found in Appendix C. The first set of experiments reports on simulation runtime and duration in fast and deterministic mode when using up to four threads. Figures 6.8a and 6.8c present the results for the runtime when using ∆tla = 10 ns and 100 ns, respectively. The quickest execution is achieved in fast mode using four threads and ∆tla = 100 ns, outperforming the OSCI variant by 3.5×. The same variant still achieves a speedup of 1.7× when limited to only two threads. A lower lookahead does not show a significant impact on these results: with ∆tla = 10 ns, speedups of 3.1× and 1.7× are measured. The deterministic mode yields reduced performance when compared to fast mode, exposing an average slowdown of 0.71× and 0.68× for two and four threads, respectively. This is because of a longer simulation duration due to ∆εtx = 2∆tla being added for each flexible remote transaction. However, despite the longer duration, it still outperforms the OSCI variant, achieving average speedups of 1.3× and 2.4×. Similar to fast mode, deterministic mode runtimes suffer only minuscule slowdowns when low lookahead values are used, such as ∆tla = fcpu⁻¹ = 10 ns.

[Figure 6.8 shows bar charts of simulator runtime in seconds and absolute timing error in µs of simulation time for the OSCI/accurate and SCOPE/accurate (1 thread), SCOPE/fast and SCOPE/deterministic (2 and 4 threads) variants: (a) runtime for ∆tla = 10 ns, (b) timing error for ∆tla = 10 ns, (c) runtime for ∆tla = 100 ns, (d) timing error for ∆tla = 100 ns.]

Figure 6.8: GVP runtime and timing error for SPLASH2 ocean-ncp

Figures 6.8b and 6.8d report the absolute timing error for simulation using flexible time-decoupling. It is derived as the absolute difference between the measured duration and the duration of the OSCI variant, which is used as a baseline. In general, the timing errors for simulations running in deterministic mode are significantly higher. Since these simulators cannot use DMI for accessing the main memory, each access to global variables causes a timing error of ∆εtx = 2∆tla. Consequently, a higher lookahead causes a higher absolute timing error, as can be seen when directly comparing Figures 6.8b and 6.8d. Additionally, the timing error grows the more threads are used, since more remote transactions become necessary. This is because only processors simulated on the first thread can communicate with the memory without timing error, since the memory is also handled by the first thread. Consequently, it is no surprise that the highest timing error is found for ∆tla = 100 ns and four threads. This worst-case absolute timing error corresponds to a relative error of 1.2%.

[Figure 6.9 plots results for the fast and deterministic modes over lookahead values ∆tla from 10 ns to 1000 ns: (a) speedup using 4 threads, (b) relative error with 4 threads on a logarithmic scale.]

Figure 6.9: GVP speedup and relative error for SPLASH2 ocean-ncp

Figure 6.9 presents speedup and relative timing error for both modes using four threads, where ∆tla is varied from 10 ns to 1 µs. Figure 6.9a confirms the previous observation that fast mode generally achieves higher performance gains than deterministic mode. It can be seen that simulation performance for both modes generally improves as the lookahead increases, but saturates at around ∆tla = 100 ns. At this point, the deterministic variant reaches a maximum speedup of 2.4× and any further gains from parallel simulation are offset by longer simulation durations. Beyond ∆tla = 400 ns, performance starts to drop. The fast variant faces a similar phenomenon, but to a much lesser degree. Its timing error is significantly lower since only few remote transactions are used, e.g., for accessing I/O peripherals.

Figure 6.9b shows that the timing error for the fast variant lies approximately between 0.02% and 0.06%. Given that the standard deviation of the measurement data yields a variation coefficient of 0.03%, this relative error can be attributed to measurement noise. For the deterministic variant, however, the relative error rises linearly with ∆tla. This corresponds to the previous observation that ∆εtx ≈ 2∆tla per transaction. The number of transactions, i.e., memory accesses, is approximately constant for the given application, hence the linear relationship.

6.5 Limitations and Outlook

Based on the results of the experimentation, a few limitations of the proposed approach can be identified and used as a foundation for future work.

A key concern is the functionally correct implementation of exclusive memory access operations as required for thread synchronisation in most software. Generally it is sufficient to map atomic target instructions onto corresponding atomic instructions on the host. However, many embedded processors nowadays rely on optimistic load-linked and store-conditional operations, which are not available on x86. A solution to this problem that allows using DMI for optimistic atomic operations in parallel simulators is presented in the next chapter.

A second concern is that the addition of a simulation mode shifts additional workload onto the VP designer. Expert knowledge is required to find a suitable combination of partitioning, lookahead and simulation mode considering the intended use case. Especially the interdependence between lookahead and timing error requires careful elaboration of platform parameters in order to identify the optimal configuration. In cases where no single configuration that suits all use cases can be found, lookahead and simulation mode selection may also be shifted from compile time to simulation time, allowing the VP user to fine-tune them.

Future work might also look into introducing runtime adaptability. Most use cases require high timing accuracy only during certain phases of execution, while other phases should simply be simulated as quickly as possible. An example of this is Linux driver development, where high accuracy is usually only required during initialisation and interrupt handler execution. Therefore, future implementations should allow lookahead and simulation mode to be changed even after elaboration.

6.6 Synopsis

This chapter has presented flexible time-decoupling, which allows the application of the time-decoupled simulation kernel SCOPE in a much wider range of platforms, including realistic VPs from the ESL domain. Such platforms usually incorporate closed-source third-party components without a priori communication knowledge and therefore cannot easily be adapted to operate in a time-decoupled environment. As a solution, new simulation modes have been proposed that facilitate integration of those models into parallel time-decoupled simulators by allowing alteration of their timing behaviour. These new modes provide designers of VPs with a trade-off between simulation speed, timing accuracy and determinism.

The approach was demonstrated by extending the parallel SYSTEMC kernel SCOPE. With minimal extra effort, the heterogeneous GEMSCLAIM platform gained a speedup of 2.4× in deterministic mode at the cost of around 1.2% error in timing. In fast mode, an even higher speedup of 3.5× was achieved by falling back on nondeterministic cross-thread DMI. Further experimentation has shown that an improved load balancing between the individual simulation threads can significantly reduce the timing error imposed by flexible time-decoupling.

Chapter 7

Exclusive Memory Access Models

Exclusive memory access operations form the basis for implementing multi-core synchronisation primitives, such as spinlocks and barriers, as required by most multi-threaded software for SMP and HMP architectures. Without those operations, race conditions are expected to occur within shared memory, resulting in erroneous execution. Multi-core processors usually offer RMW operations, such as Compare-and-Swap (CAS) and Test-and-Set (TAS), for this purpose. These instructions modify memory contents atomically, meaning that they cannot be interrupted by access operations from other processors or interrupt requests from external sources.

Therefore, models for exclusive memory access operations play an important role in the design of multi-core VPs. In this context, it is important to note that for conventional sequential simulators it is sufficient to execute these operations atomically only in simulation time and not necessarily in real time. Since the regular SYSTEMC kernel employs a cooperative scheduler for executing simulation processes using only a single thread, an active processor model always executes atomically until it voluntarily yields to allow simulation of other components. Consequently, there is no danger of another model interfering with its memory accesses, causing them to appear to be executed atomically from the point of view of the rest of the system.

This situation changes when transitioning existing VPs into parallel simulation environments, such as SCOPE, where atomic execution of individual simulation processes can no longer be guaranteed. To ensure functional correctness, exclusive memory access operations must now operate atomically not only in simulation time, but also in real time. Current implementations are incapable of dealing with this challenge and must be replaced with new models. This chapter presents two such models for supporting exclusive memory access operations through the TLM blocking transport interface as well as through DMI [201]. Integration with existing processor models is facilitated by operating at the interface level between such a model and its TLM sockets located within its SYSTEMC wrapper.

The chapter begins by taking stock of which types of exclusive memory access operations are employed in contemporary instruction sets and their corresponding modelling considerations and implications in Section 7.1. Section 7.2 subsequently presents models for exclusive memory access for transaction- and DMI-based approaches. An experimental evaluation of these models is then presented in Section 7.3, where their performance impact is investigated using a realistic VP that runs the Linux kernel. Finally, a discussion on limitations and potential future work is presented in Section 7.4, before the chapter is summarised in Section 7.5.


7.1 Modelling Considerations

The virtual sequential environment of SCOPE ensures sequential execution for all simulation processes within it, as described in Chapter 5. Any regular SYSTEMC communication interface may take advantage of this feature, which in turn facilitates the design of exclusive memory access operations via the TLM blocking transport interface. As long as a blocking transport IMC does not yield, its atomic execution with regard to other IMCs is guaranteed. However, this does not include communication channels that are out of the control of SCOPE, such as plain DMI pointer accesses. Hence, it is necessary to disallow the use of DMI for the duration of the IMC, or to find another modelling strategy that can incorporate DMI accesses, which would otherwise interfere with the virtual sequential environment.

When creating such multi-channel models for exclusive memory access operations in VPs, capabilities of the host computer may be taken into account to facilitate development. Contemporary VPs generally run on x86-based hosts, which by themselves already offer certain atomic operations. Using a lock prefix, a subset of the Arithmetic Logic Unit (ALU) operations of x86 may be marked for atomic execution. Given the Complex Instruction Set Computer (CISC) nature of x86, which allows most ALU instructions to operate directly on memory contents, the lock prefix is sufficient for achieving atomic memory access. This approach has already been used to model atomic CAS operations for the GEMSCLAIM platform in Chapter 6. For this specific VP, the lock prefix has been used with the x86 cmpxchg instruction to atomically swap memory contents via a TLM DMI pointer.

Unfortunately, the atomic operations offered by the x86 instruction set are rarely found in modern embedded instruction sets. Instead, optimistic access operations are deployed, which do not guarantee atomic execution, but instead inform the user afterwards whether the performed modification was atomic or not. Embedded multi-core architectures have been shown to greatly benefit from optimistic approaches, since they do not require bus locking and therefore do not exhibit a performance bottleneck when used frequently. One such approach is based on the two operations Load-Linked (LL) and Store-Conditional (SC) [72], which are explained in the following.
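To illustrate the host-level CAS mentioned above, the following minimal C++ sketch performs an atomic compare-and-swap on target memory through a host pointer obtained via DMI. It assumes a GCC or Clang host compiler on x86, where the builtin is lowered to a lock-prefixed cmpxchg; the function name and the way the pointer is obtained are illustrative only and do not correspond to the actual GEMSCLAIM model code.

#include <cstdint>

// Illustrative helper: atomically replace a 32-bit word in target memory that
// is reachable through a host DMI pointer. On x86 hosts, GCC/Clang lower the
// builtin below to a lock-prefixed cmpxchg instruction.
// 'dmi_ptr' is assumed to already point at the host copy of the target address.
inline bool dmi_compare_and_swap(uint32_t* dmi_ptr, uint32_t expected,
                                 uint32_t desired) {
    // Returns true if *dmi_ptr still held 'expected' and was atomically
    // replaced by 'desired'; false otherwise.
    return __sync_bool_compare_and_swap(dmi_ptr, expected, desired);
}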

7.1.1 Load-Linked and Store-Conditional

One of the first implementations of optimistic atomic operations using the concept of LL/SC was proposed by Jensen et al. [87] in 1987. Traditional atomic operations rely on a single instruction, whereas LL/SC uses two: first, a value from an address is loaded and linked to the processor issuing the request. After loading, the value may be modified locally by the processor before it is written back to memory using a SC instruction. However, the store only succeeds and changes memory contents if the location has not been accessed by any other processor in the meantime. Otherwise, failure is reported and the processor usually restarts the entire process by issuing another LL request. While this operation is not strictly atomic, it appears to be atomic to the rest of the system, since a successful SC operation guarantees the absence of interference by other processors. Listing 7.1 presents an example that implements an atomic increment operation using LL/SC in an ARM-like dialect¹. Furthermore, it has been shown that LL/SC can be used to emulate other synchronisation primitives, such as CAS, in a similar fashion [4].

atoinc: LDREX r1, [addr]   ; load value and link for [addr]
        ADD   r1, r1, #1   ; increment value
        STREX r1, [addr]   ; try to store incremented value
        TEQ   r1, #0       ; did it succeed?
        BNE   atoinc       ; no - try again

Listing 7.1: Atomic increment operation using LL/SC

Designing instruction sets for multi-core processors around optimistic atomics has proven beneficial in the past and many modern ISAs have adopted it. Table 7.1 gives an overview of contemporary embedded RISC architectures and their corresponding instructions implementing LL/SC.

ISA          Load-Linked   Store-Conditional
ARMv7        LDREX         STREX
ARMv8        LDXR          STXR
MIPS32       LL            SC
OpenRISC     l.lwa         l.swa
Power ISA    lwarx         stwcx
RISC-V       LR            SC

Table 7.1: LL/SC in embedded RISC ISAs (adapted from [201])

Reasons for the widespread adoption of LL/SC and its preference over conventional CAS and TAS include:

• Single instruction atomics have to perform multiple operations, i.e., read, modify and write, uninterruptedly, which makes design of a RISC style instruction pipeline difficult and results in long pipeline stages.

• Locking atomics prevent all other cores in the system from fetching data, even if they are not accessing shared data at all. The design of LL/SC prevents those bottlenecks, though it can be argued that it suffers lower performance when competing for shared data with other processors due to failing frequently.

• LL/SC can be cheaply implemented on top of the already existing cache coherency protocols. Whenever the cacheline holding the linked data loses its exclusivity status, a subsequent conditional store must fail. Further failure conditions include external interrupts or a second load link.

¹ This assembly dialect denotes LL and SC as LDREX and STREX, respectively.

Given that most VPs resemble embedded systems employing ISAs from Table 7.1, a suitable implementation for LL/SC in a parallel simulation environment must be found. While it has been shown that CAS and TAS can be emulated efficiently using LL/SC [4], the other way around proves much more difficult. Thus, some approaches already fail a functional correctness test when no parallelism is involved. The ABA problem plays a major role in this and is therefore explained in the following.

7.1.2 The ABA Problem

The ABA problem refers to the fallacy of believing that a memory location has not been changed if it holds the same value at two different observation points. Consider the situation where two processors P1 and P2 access the same memory location using regular load and store operations. First, P1 reads the location and finds the value A. Next, P2 alters it, writing B and, some time later, restoring the value back to A. When P1 afterwards accesses the location again, it reads the original value A. However, deducing from this observation that the value is unchanged would be wrong. Consequently, a functionally correct implementation of LL/SC should result in a failed SC operation in this situation.

Listing 7.2 presents example code that tests for ABA problem susceptibility. It modifies the contents of a memory location according to the sequence A → B → A between a LL and its corresponding SC. If register r2 holds zero after execution, the store-conditional operation has succeeded, indicating a faulty LL/SC implementation.

abatest: LDR   r1, 'A'      ; r1 = 'A'
         LDR   r2, 'B'      ; r2 = 'B'
         STR   r1, [addr]   ; initialise memory to 'A'
         LDREX r1, [addr]   ; place load-link at addr
         STR   r2, [addr]   ; modify memory to 'B'
         STR   r1, [addr]   ; restore memory to 'A'
         STREX r2, [addr]   ; try store-conditional

Listing 7.2: Test for ABA problem susceptibility

However, at the time of writing, many ESL models still allow such an operation to succeed, e.g., QEMU [16] for the architectures in Table 7.1. The reason for this is that the correct emulation of LL/SC on x86 relies on a combination of CAS and tags for every addressable memory location, as suggested by Anderson et al. [4] and Dechev et al. [42]. In this context, tags can be used to count the total number of modifications done to a memory cell, which allows precise detection of A → B → A chains during a SC operation. However, memory tagging incurs a significant overhead and therefore discourages fully correct implementations in VPs striving for optimal performance. For QEMU it is argued that the ABA problem rarely occurs in typical target software and that, therefore, a faster design was given preference [65].
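To make the tagging idea concrete, the following C++ sketch pairs each monitored memory word with a modification counter and lets the conditional store compare value and counter together in a single 64-bit CAS. It is purely illustrative of the concept described above; it is not the approach taken by the models presented in the next section, which deliberately omit tags for performance reasons.

#include <atomic>
#include <cstdint>

// One monitored cell: a 32-bit value plus a 32-bit modification counter (tag),
// packed into a single 64-bit word so both can be compared and swapped at once.
struct tagged_cell {
    std::atomic<uint64_t> word{0};

    static uint64_t pack(uint32_t value, uint32_t tag) {
        return (uint64_t(tag) << 32) | value;
    }

    // Every regular store bumps the tag, so an A -> B -> A sequence still
    // changes the packed word and a pending store-conditional will fail.
    void store(uint32_t value) {
        uint64_t old = word.load();
        while (!word.compare_exchange_weak(old, pack(value, uint32_t(old >> 32) + 1)))
            ; // 'old' is refreshed on failure, so the new tag is recomputed
    }

    uint64_t load_linked() const { return word.load(); } // remember value and tag

    bool store_conditional(uint64_t linked, uint32_t value) {
        // Succeeds only if neither value nor tag changed since load_linked().
        return word.compare_exchange_strong(linked,
                                            pack(value, uint32_t(linked >> 32) + 1));
    }
};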

7.2 Modelling Approach

VP providers are usually not concerned with models for exclusive memory access and instead rely on the ISS to do the work for them [8, 82, 182, 183]. Most multi-core ISSs are provided jointly in a single closed-source black box model. Internally, the cores determine whenever a SC operation should fail and never execute it in the first place. While this approach reduces the development cost of a VP, it also has a number of disadvantages. First, synchronisation between black box ISSs from different vendors appears difficult, since the exclusive access information is not provided outside of the ISS. Second, such an approach leaves little potential for parallel simulation, since the multi-core processor only appears as a single monolithic black box model within the VP and is driven by a single SYSTEMC process.

Contrary to conventional approaches, this section presents a model for exclusive memory access operations that moves synchronisation management out of the individual components into the system level. Its intended purpose is to enable a simulation environment with many separate ISSs that is suitable for parallel simulation with SCOPE. To be considered a viable approach, a model must abide by typical constraints imposed by the modelling environment. This work assumes the following ones:

1. Closed-Source: It is assumed that some simulation models (e.g., ISSs) are only available in compiled form and cannot be modified. Assumptions as to how the ISS interfaces with the VP are stated as needed.

2. Parallel Simulation: The approach must consider that SYSTEMC processes (and therefore all memory access operations) might be executed concurrently. Consequently, critical sections must either be executed from within a virtual sequential environment or offer atomic implementations.

3. Time-Decoupling: The approach must consider that some memory access operations may be performed ahead of time. All LL/SC operations must therefore be synchronised with simulation time by calling wait.

Due to constraint (1), necessary extensions cannot be made directly in the model. Instead, augmented TLM initiator and target sockets have been created that provide the proposed LL/SC functionality. Furthermore, these sockets support temporally correct communication according to constraint (3) for both types of interfaces typically used with TLM. The choice of communication type lies with the VP designer. High-performance simulators intended for software debugging generally use the fast DMI-based communication, while timing-accurate simulators deployed in DSE tasks usually choose the transaction-based path. As a result, a viable modelling approach for exclusive memory access must also respect the choice of the designer regarding speed and accuracy. For example, if only DMI communication is used, the model should adapt and also use a faster implementation. To accommodate this, the proposed approach features two operation modes that can be chosen by the VP designer at design time. Before these modes are introduced, the next section first presents a DMI cache model, which is a requirement for the LL/SC model.

DMI Cache

Index   Address Start   Address End   Flags   Host Pointer
0       0xf7eb0000      0xf7eb0fff    R W     0x00007ffab8731000
1       0xf7eb1000      0xf7eb1fff    R W     0x00007ffab8732000
2       0x00000000      0x00000fff    R -     0x00007ffab66f0000
3       0x00001000      0x00001fff    R -     0x00007ffab66f1000
4       0xf7af2000      0xf7af2fff    - -     0x00007ffab8733000

Figure 7.1: Snapshot of the DMI cache model at runtime

7.2.1 DMI Cache Model

The TLM DMI communication interface defines how initiators, such as processors, and targets, such as memories, may exchange DMI pointers. These pointers must be provided with a set of attributes, including the target memory address range and the access type (read/write) that they may be used for. Moreover, each pointer may be invalidated or have its attributes changed at any time. Since reacquisition of DMI pointers is a costly process, VPs striving for high performance need to manage their DMI pointers efficiently. For this purpose, this section presents a DMI cache model that keeps track of available DMI pointers throughout the simulation. It is optimised to provide a fast translation from a target memory address to a corresponding host memory DMI pointer. Fundamentally, three operations are supported:

• lookup: translates a target address into a DMI pointer. In case no suitable pointer is available, a cache miss is reported.

• poll: triggers the model to perform a cache refill attempt. Using the TLM DMI communication interface, it queries pointers for a specified target address range.

• invalidate: flushes all cache entries whose target address range overlaps with a specified address range.

Storing individual translations from target address to DMI pointer would be inefficient and would cause the cache model to consume excessive amounts of host memory. Instead, DMI pointers are stored for contiguous memory regions, i.e., ranges. As indicated by Figure 7.1, each range holds information about its target memory region as well as its access permissions. The size of each range depends on the granting memory model, but is usually page sized, i.e., 4 KiB, or spans the entire size of the component. Internally, the cache model keeps a list of ranges, which is searched linearly for every translation request. To accelerate subsequent lookups, the list is sorted least-recently-used by moving the previously requested range to the front of the list. If, during a translation request, no matching address range is found, or if the access flags do not match the requested operation type, a cache miss is reported.
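A minimal C++ sketch of such a range-based cache with move-to-front (least-recently-used) lookup is given below. The class layout and names are illustrative and omit the TLM plumbing for polling, hints and thread safety that is described next.

#include <cstdint>
#include <list>

struct dmi_range {
    uint64_t start;        // target address of the first byte covered
    uint64_t end;          // target address of the last byte covered
    bool     allow_read;
    bool     allow_write;
    unsigned char* host;   // host pointer to the byte at 'start'
};

class dmi_cache {
    std::list<dmi_range> ranges;   // most recently used range kept at the front
public:
    // lookup: translate a target address into a host pointer, or report a miss.
    unsigned char* lookup(uint64_t addr, bool is_write) {
        for (auto it = ranges.begin(); it != ranges.end(); ++it) {
            if (addr < it->start || addr > it->end)
                continue;
            if ((is_write && !it->allow_write) || (!is_write && !it->allow_read))
                return nullptr;                        // flags do not match: miss
            ranges.splice(ranges.begin(), ranges, it); // move-to-front (LRU)
            return ranges.front().host + (addr - ranges.front().start);
        }
        return nullptr;                                // no matching range: miss
    }

    // poll would issue a TLM get_direct_mem_ptr call and insert the result here.
    void insert(const dmi_range& r) { ranges.push_front(r); }

    // invalidate: clear the access flags of all overlapping ranges instead of
    // erasing the entries, so they can later be reacquired cheaply.
    void invalidate(uint64_t start, uint64_t end) {
        for (auto& r : ranges)
            if (r.start <= end && r.end >= start)
                r.allow_read = r.allow_write = false;
    }
};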

The cache is filled by polling on the TLM DMI communication interface. An initiator socket may query a DMI pointer in such a way at any time, but will thereby also incur significant performance penalties. To avoid continuous polling when the DMI cache is empty, the augmented TLM initiator socket instead waits for a DMI hint. Such a hint may be signalled by a target after the completion of a blocking transport call, indicating to the initiator that it may instead query a DMI pointer for similar subsequent requests. Once a hint has been received, the cache polls the target and adds the corresponding range, pointer and access permissions to its list.

DMI pointer invalidation requests clear the access permissions of all ranges that overlap with the specified range instead of removing the entries entirely. This way, all affected regions will no longer be considered during future lookups due to insufficient access flags, but need not be removed immediately, which avoids costly heap memory operations. Furthermore, ranges are usually only invalidated for a short while, making reacquisition possible by merely updating their flags, which is generally more efficient than regular polling operations.

Figure 7.1 shows an example snapshot of a DMI cache model at runtime. The first two entries contain 4 KiB ranges corresponding to data memory pages least recently used by the controlling ISS. Both regions are contiguous in target address space as well as in host memory, indicating they are serviced by the same memory model. Entries two and three are read-only pages located at target address 0. Most embedded systems place a ROM component at this location, holding platform-specific boot code. Those entries hold the DMI pointers to this code and are still present in the cache. Unlike a HW cache, this model does not have a maximum capacity and allows older entries to remain active, even if simulation has moved on. Finally, it can be seen that the fourth entry has been invalidated some time earlier, either by the memory or an interconnect component, and is waiting for reacquisition or reuse for a new DMI range.

7.2.2 Memory-based Model

The memory-based model assumes that all memory operations of the processors, such as fetch, load and store, are performed directly by the ISS using a DMI pointer. For exclusive memory access, the ISS will instead call user-defined functions (upcalls) to perform those operations. Due to constraint (2), the model has to consider that store operations from other processors may happen concurrently and may invalidate an ongoing LL/SC operation without prior notification.

Under these conditions, a pure software solution for correct memory access synchronisation appears infeasible and hardware support is required. As previously mentioned, most SYSTEMC simulations are currently built for and run on the x86 architecture, which does not natively support LL/SC operations. Instead, it offers the atomic CAS operation, which stores a data word to a given address if the data stored at that address matches a provided comparison value. Previous work has shown that LL/SC and CAS are equivalent in that either synchronisation primitive can be used to model the other [4].

Algorithm 7.1: Thread-safe LL/SC implementation using CAS

Data: memptr   host pointer retrieved using the DMI cache
      exaddr   linked address (per CPU)
      exdata   originally linked value (per CPU)

Function LOADLINKED(addr, coreid):
    exaddr[coreid] ← addr
    exdata[coreid] ← memptr[addr]
    return exdata[coreid]

Function STORECONDITIONAL(addr, data, coreid):
    if addr ≠ exaddr[coreid] then
        return false
    end
    return CAS(memptr + addr, exdata[coreid], data)
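A C++ rendering of Algorithm 7.1 might look as follows, assuming the upcalls receive the index of the issuing core and that memptr is the host pointer obtained from the DMI cache. The sketch uses the GCC/Clang CAS builtin and is meant to illustrate the idea rather than to reproduce the actual ORVP upcall interface.

#include <cstdint>

constexpr unsigned NUM_CORES = 4;   // illustrative core count

static uint64_t exaddr[NUM_CORES];  // linked address, per CPU
static uint32_t exdata[NUM_CORES];  // originally linked value, per CPU

// Upcall for a load-linked: remember the address and the value it held.
uint32_t load_linked(unsigned char* memptr, uint64_t addr, unsigned coreid) {
    exaddr[coreid] = addr;
    exdata[coreid] = *reinterpret_cast<uint32_t*>(memptr + addr);
    return exdata[coreid];
}

// Upcall for a store-conditional: succeed only if the address still matches the
// linked one and the memory word still holds the originally linked value. The
// CAS executes atomically on the host, so concurrently simulated cores cannot
// interleave with the read-modify-write.
bool store_conditional(unsigned char* memptr, uint64_t addr, uint32_t data,
                       unsigned coreid) {
    if (addr != exaddr[coreid])
        return false;
    return __sync_bool_compare_and_swap(
        reinterpret_cast<uint32_t*>(memptr + addr), exdata[coreid], data);
}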

Algorithm 7.1 exploits this equivalence by modelling LL/SC using CAS under the assumption that processors access memory using a cached DMI pointer, denoted as memptr. For each processor core, two variables exaddr and exdata are defined, which are used to hold information about the operation. Once processor core i performs an LL operation, the corresponding upcall stores the address accessed and the value retrieved in exaddr_i and exdata_i, respectively. Both values are then used during the upcall of a subsequent SC: first it is checked whether the address to be stored to matches the one that was previously linked in exaddr_i. If this is not the case, the operation is aborted. Next, the algorithm attempts to store the data using a CAS operation, which executes atomically on x86 and cannot be interrupted by concurrently running cores. The operation only succeeds if the target memory still contains the same value as stored in exdata_i, which was read during the previous LL upcall. In that case, memptr[addr] receives the new value and the old one is reported back. Otherwise, the upcall returns the new value to the ISS, which in turn must update its internal state to reflect the failed SC operation.

This implementation follows the one proposed by Anderson et al. [4], but omits tagging of linked data. This makes it susceptible to the ABA problem, potentially triggering false positives and performing an SC operation that should have failed. Tags can be used to uniquely identify when the value at a memory address was last modified, e.g., by using timestamps or write counters. When combined with linked address and data, unique tags enable construction of SC operations that detect ABA access sequences. However, in the context of VPs, tags introduce a set of new problems. First, adding tags would require modification of regular store operations, since each store must update the tag to a new unique value. However, due to constraint (1), modification of the ISS might not be possible. Second, since there is no limitation of the memory locations that may be used for LL/SC, tags would need to be stored for each addressable memory unit, i.e., usually for each byte of memory. Even if tags were only created on demand, this would result in a drastic increase of memory consumption, potentially exceeding available host memory for large VPs. Consequently, tags have been omitted from the memory-based LL/SC model.

Susceptibility to the ABA problem does not render the memory-based model unfit for use in realistic VPs. In fact, an efficient avoidance strategy for the ABA problem only requires the values stored to a linked address to be unique. This is the case for the synchronisation primitives used by most embedded SW, such as the Linux kernel and many implementations of the C standard library, e.g., GLIBC [63], Newlib [197] and the MUSL C library [48] for various embedded architectures. Additionally, most ticket-based spinlock implementations operate on a continuously increasing counter, which also guarantees that only unique values are written using LL/SC. Section 7.3 further supports this claim by presenting a case study that boots the Linux kernel for the OpenRISC architecture using the memory-based LL/SC model. However, in case SW is used that requires architecturally correct LL/SC semantics, a VP should prefer the transaction-based model, which is described in the following section.

7.2.3 Transaction-based Model

The transaction-based model assumes that the ISS always uses upcalls for handling all types of memory access operations. Within an upcall, a TLM generic payload object is created and forwarded using the TLM blocking transport communication interface. For LL/SC operations, the augmented TLM initiator socket automatically extends this payload with information about the operation type (LL or SC) and the ID of the initiating processor. Once an augmented TLM target socket receives such an extended transaction, it uses a monitor to synchronise exclusive memory access. Such a monitor needs to be added to each bus or interconnect in the simulator and must be shared among all its target sockets. For closed-source models according to constraint (1), this can be achieved by placing the monitor in a SYSTEMC wrapper, which is typically created to accommodate the TLM sockets of the interconnect.

An exclusive memory access monitor maintains a list of all linked memory locations (links) including the linking processor ID. Due to constraint (2), all operations on this list must be protected from concurrent access, e.g., by using spinlocks. Additionally, each monitor provides the following operations:

• mark adds the address and size of an ongoing transaction and the active processor ID to links whenever a LL request is received. In the default configuration, this operation also invalidates any previous link of that processor. However, the monitor can optionally be configured to support multiple links per processor.

• check looks up the transaction address and active processor ID in links. All entries that overlap with the address are removed. It returns true if an entry with matching address and processor ID has been found, otherwise false. This operation is called whenever a store operation (regular or conditional) is received.

The augmented TLM target socket uses the return value of check to determine whether an SC operation succeeded and should be forwarded to memory or whether its failure should be reported back to the ISS.
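An exclusivity monitor along these lines might be sketched in C++ as shown below. The spinlock and the data layout are illustrative, and the optional multi-link configuration mentioned above is omitted for brevity.

#include <atomic>
#include <cstdint>
#include <vector>

class exclusive_monitor {
    struct link { uint64_t addr; uint64_t size; unsigned cpu; };

    std::atomic<bool> locked{false};   // spinlock protecting 'links' (constraint 2)
    std::vector<link> links;

    void acquire() { while (locked.exchange(true, std::memory_order_acquire)) {} }
    void release() { locked.store(false, std::memory_order_release); }

    static bool overlaps(const link& l, uint64_t addr, uint64_t size) {
        return addr < l.addr + l.size && l.addr < addr + size;
    }

public:
    // mark: called for every LL transaction; replaces any previous link of the
    // same processor (default single-link configuration).
    void mark(uint64_t addr, uint64_t size, unsigned cpu) {
        acquire();
        for (auto it = links.begin(); it != links.end(); )
            it = (it->cpu == cpu) ? links.erase(it) : it + 1;
        links.push_back({addr, size, cpu});
        release();
    }

    // check: called for every store (regular or conditional); removes all
    // overlapping links and reports whether this processor still held one.
    bool check(uint64_t addr, uint64_t size, unsigned cpu) {
        acquire();
        bool own_link = false;
        for (auto it = links.begin(); it != links.end(); ) {
            if (overlaps(*it, addr, size)) {
                own_link |= (it->cpu == cpu);
                it = links.erase(it);
            } else {
                ++it;
            }
        }
        release();
        return own_link;
    }
};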

[Figure 7.2: four-panel diagram of CPU0, CPU1, an interconnect with its exclusivity monitor and links list, and a memory component, illustrating the steps (a) to (l) of the LL/SC sequence described below.]

Figure 7.2: Transaction-based model using interconnect monitors

Figure 7.2 exemplifies this process using a scenario where two processors concurrently access memory address 0x804cf200 using exclusive memory access operations. First, CPU0 issues an LL operation (a), which causes the address to be added to the links of an interconnect monitor (b). The request is then forwarded regularly to the memory component (c), which subsequently returns the contents of the memory address. Next, CPU1 also issues an identical LL request (d), which adds a second entry to links, pointing to the second processor (e), before the request is again forwarded regularly (f). Subsequently, CPU1 executes an SC operation (g) on the same memory address it has previously linked (d). The monitor detects this and removes all entries pointing to that address from links (h), before forwarding the transaction to the memory component (i). When CPU0 issues its SC operation (j), the monitor detects that this address is no longer linked (k) and therefore aborts the operation (l).

The main advantage of the transaction-based LL/SC model is its high modelling detail. It is not susceptible to the ABA problem, since it explicitly monitors every write access. This allows tracing the access history for every linked memory address, causing a link to break immediately upon the first write operation. Furthermore, the model can be configured to reflect architecture-specific intricacies, such as:

• Nested LL/SC: each monitor usually holds only one link per processor. Whenever a new link is acquired, the previous one is overwritten. However, certain architectures may allow nested LL/SC operations, which can be reflected by appending new links instead of overwriting old ones.

• Cascading LL/SC: monitors must be placed in every interconnect in the system. An SC operation is only successful if it passes through the monitors of all interconnects on its way to memory. This results in increased timing accuracy and extends its applicability beyond regular HMP and SMP designs.

• Cache-line accuracy: some architectures may cause links to break already if another processor accesses data on the same cache-line, but not necessarily at the same address. This behaviour can optionally be reflected by checking for a matching cache-line address instead of the actual address during the check operation.

• LL/SC for I/O registers: monitors are agnostic towards the target component behind a specific address. Consequently, it is possible to perform LL/SC operations not only on memory, but also on I/O registers of shared peripheral components, where DMI pointers are naturally unavailable.

A drawback of the transaction-based model is the serialisation of write and LL/SC operations of all processors, which must go through the spinlock-protected monitor. Although the most common memory access operations of a processor, i.e., fetch and load, are not affected by this, access to the monitor might still prove a performance bottleneck. However, reduced performance is a typical trade-off for higher simulation accuracy and must therefore be expected when using the transaction-based model.

7.2.4 Mixed Operation

It is the goal of the mixed operation mode to allow DMI and the transaction-based exclusive memory access model to be used jointly in order to benefit from the advantages of both approaches: high simulation speed without ABA susceptibility. The preferred method of handling memory accesses in high-performance VPs is DMI. However, all DMI accesses bypass the exclusivity monitors, which causes links that should have been broken by an overlapping store operation to remain active. Using DMI in conjunction with the transaction-based model must therefore be expected to produce stale links and break LL/SC functional correctness, unless precautions are taken. To work around this issue, the mixed operation mode imposes two requirements.

The first requirement demands that all memory write accesses must go through the DMI caches and must not be performed by the ISSs internally. This allows keeping track of which addresses the processors write to, while still retaining the ability to use DMI later on to perform the requested operation. Interfacing with its DMI cache requires the ISS to perform an upcall, which is generally slower than direct memory accesses performed internally by the DBT engine. However, since the upcall does not need to go through the TLM blocking transport interface and may instead use DMI, a significant performance increase compared to the transaction-based approach can be expected. Furthermore, the upcall must only be performed for write accesses, allowing the ISS to accelerate the more common load and fetch operations.

The second requirement for the mixed operation mode is that an extra exclusivity monitor must be assigned to each memory component. Without violating constraint (1), this can be achieved by placing it inside its SYSTEMC wrapper along with a TLM target socket. Whenever DMI access is granted to the associated memory, the socket also provides a reference to its monitor back to the initiator. A TLM transaction extension is used for this purpose. The initiator may freely use the DMI pointer for read accesses, but must notify the monitor of a write access using check beforehand.
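The write path of a DMI cache operating in mixed mode might then look roughly like the following C++ sketch, which reuses the illustrative dmi_cache and exclusive_monitor classes from the sketches above and assumes that the monitor reference handed out with the DMI grant is stored alongside the cache.

#include <cstdint>
#include <cstring>

// Illustrative glue for mixed operation: every ISS store goes through this
// upcall, which first breaks any overlapping link in the memory monitor
// (M3 in Figure 7.3) and only then performs the store through the DMI pointer.
struct mixed_mode_port {
    dmi_cache*         cache;    // illustrative cache from the earlier sketch
    exclusive_monitor* monitor;  // monitor reference received with the DMI grant

    bool write(uint64_t addr, const void* data, uint64_t size, unsigned cpu) {
        unsigned char* host = cache->lookup(addr, /*is_write=*/true);
        if (!host)
            return false;                // miss: fall back to a regular transaction
        monitor->check(addr, size, cpu); // invalidate overlapping links first
        std::memcpy(host, data, size);   // then perform the store via DMI
        return true;
    }
};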

[Figure 7.3: block diagram of a quad-core SMP platform with ISS0 to ISS3 and their DMI caches attached to a core bus (monitor M1), a bridge to a peripheral bus (monitor M2) with several I/O components, and a memory with its own monitor M3; the numbered steps (1) to (7) of the example below are annotated in the figure.]

Figure 7.3: LL/SC monitor placement for mixed operation with DMI

All exclusive access operations are realised using transactions and must therefore pass through all monitors on their way from the initiating processor over any number of interconnects to memory. During that process, LL operations create new links in all monitors that they pass through, including the extra one within the memory component. Similarly, a SC operation must also pass through all monitors and is aborted as soon as any monitor reports a broken link. Interconnect monitors report on links broken by regular transactions, while the memory monitor reports on those broken by DMI accesses from within an ISS upcall.

Consider the example presented in Figure 7.3. It shows a typical SMP-based multi-core VP with four ISSs, connected to a high-speed core bus via their DMI caches. A bridge connects them to a peripheral bus, which provides access to main memory as well as to several I/O components. The design features three exclusivity monitors: M1 and M2 observe transactions passing through the core and peripheral buses, respectively. M3 represents the extra memory monitor required for mixed operation. It is assumed that all DMI caches have already acquired DMI access to memory and therefore also hold a reference to M3. Furthermore, it is assumed that ISS0 has performed an LL operation and links have been added to M1, M2 and M3 accordingly.

ISS0 issues an SC operation by performing an upcall to its DMI cache (1). Since such an operation must be performed using a transaction, the cache uses the TLM blocking transport interface to forward it to the core bus (2). Assuming no other SC transaction has intervened, the link in M1 is still active and allows the request to proceed to M2 (3). At that stage the transaction is delayed due to the reduced speed of the peripheral bus compared to the core bus. Simultaneously, ISS3 issues a write operation to the linked address and performs an upcall to its DMI cache (4). According to protocol, the cache first invalidates the link in M3 before accessing the address via a DMI pointer (5). Subsequently, the SC transaction of ISS0 is allowed to proceed on the peripheral bus, invalidating the link in M2 in the process (6). However, upon reaching the memory, no valid link is present in M3 and, consequently, the operation is aborted. Finally, ISS0 is informed about its failed SC attempt (7).

As can be seen from this example, the mixed operation mode allows the application of the more accurate transaction-based LL/SC model within a VP that makes heavy use of DMI. The model does not restrict the use of DMI pointers for reading purposes, allowing the ISSs to operate at optimal performance when fetching instructions and loading data. For write operations, upcalls are used to notify the rest of the system about potentially broken links. The use of monitors and transactions for LL/SC operations guarantees immunity from ABA-related synchronisation errors.

Variant     Fetch   Load   Store   LL    SC
ORVP/DMI    DMI     DMI    DMI     DMI   DMI
ORVP/BT     DMI     BT     BT      BT    BT
ORVP/Mix    DMI     DMI    DMI     BT    BT

Table 7.2: ORVP memory access configurations

7.3 Experimental Results

The proposed LL/SC model for exclusive memory access enables the construction of a parallel variant of the OpenRISC Virtual Platform (ORVP) based on the SCOPE kernel. For the experimentation, the number of OpenRISC ISSs has been fixed to four, which are evenly distributed among all available threads using the automatic load balancer of SCOPE. The lookahead has been set to match the TLM quantum, which comprises 500 processor clock cycles, i.e., ∆tq = ∆tla = 500/fcpu = 5 µs.

The VP can be configured to switch between the memory-based, transaction-based and mixed operation model depending on the type of access. Three variants of the ORVP have been created, which differ in the way they perform exclusive and nonexclusive memory accesses. Table 7.2 summarises the configuration chosen for the respective variant. All use DMI for instruction fetching directly from the ISS without an upcall detour. Intricacies of the individual variants are outlined in the following:

• ORVP/DMI: this variant exclusively uses DMI for all kinds of memory access operations. Internally, the ISSs use CAS to model LL/SC.

• ORVP/BT: with the exception of instruction fetching, this variant uses the TLM blocking transport interface for exclusive and nonexclusive memory access.

• ORVP/Mix: this variant employs the mixed operation LL/SC model. Exclusive operations use transactions, regular operations access memory using DMI.

Two sets of experiments are presented. First, Section 7.3.1 investigates the simulation overhead imposed on the VP by the various LL/SC model variants, before Section 7.3.2 presents the potential performance improvements of parallel simulation with SCOPE over regular OSCI SYSTEMC. Each experiment was repeated a fixed number of times and only averages are reported here. Detailed information about the runtime of single experiment iterations, repetition count and simulation host can be found in Appendix C.

7.3.1 Simulation Overhead

The first set of experiments studies the performance impact of the LL/SC model used for constructing the different ORVP variants. To study this impact, the boot process of the Linux kernel for the OpenRISC architecture has been chosen. This application scenario is often encountered in ESL contexts, given the popularity of the Linux kernel in the embedded space. Furthermore, work on the boot process is often performed at the beginning of a HW/SW codesign based project, e.g., for device driver development, and is therefore considered a prime use case for a VP.

The boot process considered by this evaluation begins at device power-on, i.e., tsim = 0, and ends when an interactive command line prompt is presented via the system UART. Via experimentation, this has been found to happen at tsim = 2 s for OpenRISC. During that time, the system performs the initial setup of kernel data structures and prepares MMUs, TLBs and caches for multi-core operation. Finally, the kernel unpacks the initial ramdisk into memory, which takes up the majority of time in the boot process. Before the command prompt is shown, the userspace init process launches a few background daemons, such as syslogd, inetd and dhcpd.

Booting the Linux kernel in an SMP configuration can be seen as a stress test for its synchronisation primitives and thereby for the exclusive memory access model employed by a VP. This is because synchronised access must be provided in an efficient way to heavily contended shared data structures, such as the IRQ scheduling and arbitration facilities, the virtual memory page tables and the file system. Unfortunately, userspace synchronisation primitives as provided by the C standard library cannot be tested during Linux boot, since none of the init programs make use of them. Consequently, testing of the functional correctness of the proposed LL/SC model for regular pthreads spinlocks must be postponed to the next section.

The Linux boot process has been simulated using all three variants of the ORVP. The results for the runtime are presented in Figure 7.4a, while Figure 7.4b shows information about the total number of LL/SC operations performed during the run, separated by operation type. The first observation is that all models are suitable for operation in a realistic environment, faultlessly completing the run. Despite the ABA susceptibility of the ORVP/DMI variant, no errors were detected and the boot process completed successfully in every test iteration. A deeper investigation into the kernel image disassembly shows that LL/SC instructions are generally used for implementing spinlocks and atomic increment operations. Both kinds produce unique values by exclusively incrementing the values at linked addresses and thereby effectively avoid ABA store chains.

[Figure 7.4 shows bar charts for the ORVP/DMI, ORVP/Mix and ORVP/BT variants: (a) runtime in seconds, (b) number of LL and SC operations.]

Figure 7.4: ORVP runtime and LL/SC operations during Linux boot

While this is essentially in line with the avoidance strategy suggested by Dechev et al. [42], handcrafted algorithms are still able to exploit this susceptibility. For example, executing the program outlined in Listing 7.2 exposes the inaccuracy of the DMI-based model.

Investigating the runtimes shown in Figure 7.4a yields no surprises. The shortest simulation runtime of 26.3 s is achieved using the DMI-based model. The more accurate, transaction-based approach simulates approximately 7.6× longer, requiring 199.4 s for simulating a single boot cycle. In this case, all memory operations are realised as transactions and must pass through the system bus and its arbitration facilities, consuming multiple delta cycles in the process. These extra cycles in combination with the IMC and upcall overhead are the main reasons for the longer runtime of the ORVP/BT variant. The ORVP/Mix variant offers an interesting trade-off in this context. It is not susceptible to ABA store chains and operates 3.2× faster than the transaction-based variant, since it can perform all nonexclusive memory accesses via DMI. Compared to the ORVP/DMI variant, the mixed operation mode runs 2.4× slower and finishes the boot simulation in 62.2 s.

Figure 7.4b presents the total number of LL and SC operations performed by each variant during the Linux boot simulation. The data has been extracted by accumulating the number of exclusive memory operations of each of the four processors as reported by the corresponding ISS. On average, 1.7% of all instructions executed by each ISS are exclusive memory access operations. This number differs among the three VP variants due to changes in simulation timing. For example, in the ORVP/BT variant, all memory communication must go through bus arbitration. SC requests stuck in the bus as a result of congestion are more likely to encounter broken links due to the longer delay since the original LL. This results in a repetition of the original exclusive memory access request, hence the higher number. The DMI variant does not encounter this issue, since most combinations of LL and SC instructions execute atomically within a single quantum. Finally, it can be seen that the number of LL instructions exceeds the number of SC instructions by a small margin, which is unusual, since both operations are typically used in pairs.

Benchmark        Problem Size
boot             Kernel boot from reset to login prompt
fibonacci        Iterative calculation of fib(10000)
mandelbrot       Parallel calculation of the mandelbrot set
dhrystone        One million iterations of the dhrystone [200] benchmark
dhrystone-4x     Four concurrent instances of dhrystone
coremark         coremark [45] benchmark using pthreads
barnes           Implementation of the Barnes-Hut hierarchical N-body method
fmm              Body interaction using the Fast Multipole Method, 16384 particles
ocean-c/nc       Ocean current simulator, 256 × 256 grid, four threads
radiosity        Computation of light distribution, default scene
raytrace         Raytracing using the teapot mesh
volrend          Volumetric rendering using the head-scaleddown4 mesh
water-nsquared   Current and force simulation of 512 water molecules
water-spatial    Same as above, but uses a spatial separation algorithm
cholesky         Cholesky factorisation on a sparse matrix; input file tk29.O
fft              Fast Fourier Transform, 2^15 − 1 complex data points
lu-c/nc          LU matrix factorisation, 512 × 512 matrix size
radix            Integer radix sort with 2^18 keys and a radix of 1024

Table 7.3: Benchmark description and problem size

This discrepancy can be explained when looking at the scheduler subsystem of the Linux kernel. It makes frequent use of CAS and TAS operations, which have been implemented using LL/SC. However, the SC instruction is only executed when the value read by the LL instruction matches a comparison value, as required by CAS and TAS semantics. Since that is not always the case, the number of SC operations can be lower than that of LL operations.

7.3.2 Parallel Performance

With the functional correctness of the proposed LL/SC model ascertained, it is possible to analyse the potential performance gains due to parallel simulation with SCOPE. For this reason, the fastest variant, i.e., ORVP/DMI, has been selected to compete against industry standard sequential simulation. A new variant called ORVP/OSCI has been created for this purpose. It also uses the memory-based LL/SC model and serves as a baseline for speedup calculation. As in the previous experimentation, both ORVP variants have been configured to operate in a quad-core SMP configuration.

[Figure 7.5 shows a bar chart of the speedup (0x to 4x) achieved with four threads for each of the 20 benchmark scenarios listed in Table 7.3.]

Figure 7.5: Parallel performance speedup using four threads

[Figure 7.6 shows the corresponding bar chart of the speedup (0.0x to 2.0x) achieved with two threads for the same benchmark scenarios.]

Figure 7.6: Parallel performance speedup using two threads

In total, 20 application scenarios are considered for benchmarking. The Linux boot scenario is accompanied by other industry standard benchmarks for processor performance evaluation, such as dhrystone [200] and coremark [45]. Furthermore, some well-known algorithms from general computer science, such as fibonacci and mandelbrot, are added to the mix. Finally, the set is completed with nine applications and five kernels taken from the SPLASH2 benchmarking suite [217]. With the exception of the boot scenario, all applications have been built using the musl C library [48] and its corresponding pthreads backend. Table 7.3 gives a short description of each scenario, including the problem size for applications that support different input data sets.

Wall-clock execution time measurement presents itself as a challenge, since the applications must be launched from an operating Linux environment within a running VP simulation. Consequently, just timing the runtime of the simulator, as is done for the boot scenario, is insufficient. Instead, wall-clock runtime is measured using a modified version of the Unix time command, called measure. It reports the duration between program start and exit, but uses host time instead of simulation time. To that extent, host timestamps are retrieved via a specialised semihosting instruction that has been added to the OpenRISC ISSs at an unused opcode. Using measure, the benchmarking routine is as follows: first, a fresh instance of ORVP is started and a login is performed from the host machine to the VP via telnet. Next, the application under test is loaded from SD card via SPI and executed once in order to warm up the filesystem cache. Finally, the actual benchmarking is performed using the measure tool and is repeated multiple times. Results are reported as averages, but detailed runtimes for individual iterations can be found in Appendix C.

Benchmarking results for SCOPE using four threads are presented in Figure 7.5, where each OpenRISC ISS has been mapped to one thread using static load balancing. Here it can be seen that the highest speedup of 3.27× is achieved in the ocean-nc scenario, which corresponds to a parallel efficiency of 0.82. The barnes and lu-c scenarios both experience the lowest speedup of 2.77×. On average, simulation of all benchmarks could be accelerated by 3.1× using SCOPE and the proposed model for exclusive memory access. The variance in achievable speedup is to be expected given the nondeterministic nature of the applications and SCOPE operating in fast mode. A lower deviation can be observed when only two threads are used, as shown in Figure 7.6. In this case, the two threads handle simulation of two ISSs each and achieve an overall lower synchronisation overhead. This is also reflected in the best case speedup of 1.85× for the ocean-nc and mandelbrot scenarios, which therefore operate at a parallel efficiency of 0.93. Finally, the lowest speedup for the ORVP/DMI variant using two threads is encountered in the volrend scenario and amounts to 1.58×.

No synchronisation errors were encountered during experimentation that could indicate a violation of LL/SC semantics due to ABA susceptibility. A deeper investigation shows that the musl C library uses CAS to atomically switch the state of a spinlock. For the OpenRISC port, this operation is implemented using LL/SC, which can cause ABA store chains during lock acquisition and release. Consider the case where a thread attempts to acquire the spinlock and performs an LL operation to retrieve the current state of the lock. Before this thread can switch the spinlock into a locked state, it gets interrupted by another one, also wishing to acquire the lock. As long as the second thread acquires and releases the lock before the first thread resumes operation, the SC of the first thread will succeed, although it should have failed due to a broken link. Fortunately, this anomaly does not pose a problem for actual software: while LL/SC semantics are violated by conducting an SC on a broken link, spinlock semantics remain intact. Since only one thread can ever hold the lock at the same time, shared data access remains synchronised.

7.4 Limitations and Outlook

Based on the results of the experimental evaluation, a set of limitations of the proposed exclusive memory access model can be identified. Fundamentally, they can be separated into conceptual and implementatory issues. The latter kind is concerned with the ease of integration and applicability in a realistic VP under the assumption that closed-source components are used and that interfaces are immutable. Conceptual limitations address the fundamental problem of modelling LL/SC on CAS/TAS-based computer architectures without loss of accuracy. At its centre lies the ABA problem. While related work has presented multiple approaches regarding detection and prevention of ABA store chains, none has so far been shown to be efficiently applicable in realistic VPs. These conventional approaches often require excessive amounts of memory or rigorous synchronisation, which defeats the point of a parallel simulation environment in the first place. A potential solution could be the application of transactional memory. By bundling LL/SC instruction pairs with their enclosed ALU operations into an atomically executing transaction, any interference from concurrent access operations is efficiently prevented without violating LL/SC semantics. However, more research is required regarding the application of transactional memory techniques within high-performance DBT-based ISSs.

Aside from conceptual issues, the proposed design exposes certain implementatory limitations. These are discussed in the following:

• Interfacing with DBT-based ISSs. The model assumes that it is possible to intercept exclusive memory access operations using upcalls. However, closed source multi-core ISSs might not offer this feature and instead orchestrate LL/SC synchronisation internally. This also renders them unfit for use within a VP modelling an HMP system, where they would need to interoperate with ISSs from other vendors. A workaround in this situation must take the actual implementation of the ISS into account. Consequently, no general solution but only a guideline can be provided. Usually it involves withholding DMI memory pointers and relying instead on transaction-based communication, which must be provided anyway for accessing the register interfaces of I/O components. Exclusive access can then be detected by inspecting the instruction register of the processor model and decoding the current opcode.

• Optimising performance of LL/SC in mixed mode. Performance of the LL/SC model in mixed operation mode suffers a slowdown of 2.4× compared to the memory-based model. The reason for this is the serialisation of all store operations, which need to pass through the spinlock-protected exclusivity monitor. Consequently, the single monitor for the entire memory becomes a bottleneck. Alternative implementations may instead choose to provide monitors per page of virtual memory, thereby only serialising store operations on the same page (a sketch of such a per-page monitor lookup is given after this list). The drawback of this approach is that a relatively large number of memory regions must be maintained by the DMI cache, which potentially results in longer lookup times. More research is required in order to evaluate this trade-off.

• Potential Race Conditions during DMI IMCs. DMI poll and invalidation requests are performed by different SYSTEMC processes and may therefore cross between virtual sequential environments, potentially introducing race conditions. Cache poll requests are issued from the same process as the associated ISS. However, invalidations usually originate from memory components or interconnects, which may be simulated on a different thread. Since the SYSTEMC standard demands that DMI IMCs execute uninterruptedly, providing a safe transition between both virtual sequential environments proves to be challenging, but a potential solution is presented in Chapter 9. For ORVP, no special action needed to be taken, since the DMI cache is already provided in a thread-safe version.
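To make the per-page monitor alternative mentioned in the second item above more concrete, the following sketch maps 8 KiB pages to individual locks. The class and its members are purely illustrative and not part of the proposed model; a store would first obtain lock_for(addr) and then update the exclusivity state of that page.

// Sketch of a per-page exclusivity monitor map (illustrative only).
// Instead of one global monitor, stores only serialise against accesses
// that touch the same virtual page.
#include <cstdint>
#include <mutex>
#include <unordered_map>

class page_monitors {
public:
    explicit page_monitors(std::uint64_t page_size = 8 * 1024)
        : m_page_mask(~(page_size - 1)) {}

    // Returns the monitor lock guarding the page that contains addr.
    std::mutex& lock_for(std::uint64_t addr) {
        std::uint64_t page = addr & m_page_mask;
        std::lock_guard<std::mutex> guard(m_map_lock); // protect the map itself
        return m_monitors[page]; // default-constructs a mutex on first use
    }

private:
    std::uint64_t m_page_mask;
    std::mutex m_map_lock;
    std::unordered_map<std::uint64_t, std::mutex> m_monitors;
};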

7.5 Synopsis

When proposing a parallel SYSTEMC engine, such as SCOPE, for accelerating VP simulation performance, one must also solve the problem of modelling exclusive memory access when atomic execution of model code is no longer guaranteed. Conventional approaches place exclusive memory access synchronisation facilities inside the ISS, thereby enforcing sequential simulation and eliminating interoperability with ISSs from other vendors. As an alternative, this chapter has proposed a model that moves these synchronisation facilities out of the ISSs and into the system level, thereby retaining the potential for parallel simulation. Furthermore, the model supports different operation modes to adapt to the preferred TLM communication interface and modelling accuracy of a given VP.

The exclusive memory access model represents the last key technology required for the parallelization of realistic VPs. With its help it became possible to construct a parallel version of the multi-core ORVP, leveraging the SCOPE simulation kernel and achieving peak performance gains of 3.27× when using four threads. Combining results from different application domains, such as processor benchmarking and scientific computing, an average speedup of 3.1× has been achieved. The relevant industry benchmarks Linux boot and coremark have been accelerated by 3.16× and 2.93×, respectively. Beyond being the key enabler for parallel simulation of ORVP, the model offers various tunable parameters to reflect architecture-specific intricacies, such as nesting and cascading of exclusive memory access operations.

Chapter 8

Processor Sleep Models

So far, research in fast SYSTEMC simulation techniques has identified two fundamental routes towards a performance boost: (1) simulate in parallel or (2) simulate less. While parallel simulation has so far proven to be an effective way to accelerate multi-core VPs, conventional sequential VPs have generally taken the second route. It refers to abstraction, where unneeded modelling detail is omitted to reduce workload and thereby increase simulation speed. A popular example of such an approach is TLM. In TLM, bus communication is abstracted using transactions and IMCs instead of modelling wires and individual signals. Since clock cycle accuracy is usually not required for every load or store operation, this detail may be omitted in favour of a faster running simulator, which directly translates to increased SW debugging productivity, a prime use case for VPs.

In this context, the simulation of idle or sleeping processors presents another opportunity for a performance boost via omission. Modern embedded processors generally offer operation states in which they run at a lower clock rate or even halt instruction execution entirely, but also consume significantly less power. Most embedded OSs are optimised to take advantage of this and output hints to switch processors into sleep mode whenever they are not needed, in order to conserve energy and thereby extend battery lifetime. In the context of VPs, it therefore makes sense to have a processor sleep model embedded in an ISS and trace these sleep hints in order to skip simulation whenever the corresponding processor becomes inactive.¹ A combination of such a sleep model and an ISS running a modern OS, such as Linux, can therefore be expected to greatly boost VP simulation performance in sequential and low-load scenarios, where the additional processor cores are not required. One example for this is the Linux boot scenario as introduced in Chapter 7, where SCOPE achieved a parallel speedup of 3.2×. However, it is well understood that the booting procedure of the Linux kernel is a mostly sequential process, where one core bootstraps and initialises the system while secondary cores remain idle, waiting for work from user processes. Consequently, the gains due to parallel simulation must come from offloading the simulation of idle processors to other threads. While idle, these processors usually only execute NOP loops and perform no meaningful work. It is not unreasonable to expect similar performance gains in sequential VPs if the simulation of those idle secondary cores is omitted using a processor sleep model [203].

Therefore, it is the goal of this chapter to propose and evaluate such a processor sleep model in the context of a realistic VP as described in Chapter 4. This chapter begins with a short overview of sleep states in contemporary embedded architectures in Section 8.1 and how they are utilised by Linux. Since sleep modes are tightly coupled to an ISA, Section 8.2 first introduces an in-house developed OpenRISC ISS based on cached compilation, before Section 8.3 outlines the modelling approach. An experimental evaluation is performed in Section 8.4, putting the processor sleep model into context with parallel simulation to assess benefits and compatibility. Next, Section 8.5 discusses its limitations and employs them as a foundation for future work proposals. Finally, this chapter is concluded with a short summary in Section 8.6.

¹ Literature sometimes also refers to processor sleep models as hypersimulation [47, 113].


Architecture   Instruction   Description
ARMv6          mcr           Cache operations register
ARMv7/8        wfi/wfe       Wait for Interrupt/Event
MIPS32         HALT          Halt execution
Power ISA      yield         Set low priority hint
OpenRISC       l.mtspr       Power state register
RISC-V         WFI           Wait for Interrupt

Table 8.1: Sleep signal instructions in popular embedded architectures

8.1 Processor Sleep States

With the advent of embedded processor architectures, driven by early designs such as those from ARM, low power and energy consumption became a primary design goal for processors. General advancements in computer architecture and manufacturing allowed early embedded processors to operate at significantly lower power and energy compared to their HPC counterparts. However, modern multi-core designs must provide comparable computational power and therefore struggle to provide the same energy efficiency as in the early days. To further optimise system energy consumption, it is therefore imperative to disable unneeded cores during low or medium load scenarios. A processor on its own is incapable of identifying these situations and instead relies on the OS to provide it with sleep hints.

The actual implementation of these hints differs among embedded architectures. For example, in ARMv6, the OS may provide a sleep hint to the processor by writing into the cache operations register c7. Similarly, the OpenRISC architecture offers a dedicated special purpose register for the same purpose. Later designs from ARM, IBM and RISC-V offer dedicated instructions, such as Wait-For-Interrupt (WFI), to signal to the CPU to enter a sleep state. Table 8.1 gives an overview of sleep hints for various popular embedded architectures.

Deciding when to give a processor a sleep hint requires careful consideration, since maximum performance cannot be restored instantaneously: a processor usually requires extra time to resume operation after sleep.

The cpuidle subsystem of the Linux kernel therefore contains elaborate heuristics to handle this task. Whenever no tasks are ready for execution and no events are expected to occur in the near future, such as timer events or predictable I/O interrupts, the task scheduler executes the idle process, which repeatedly sends sleep hints to the processor. It originates from the bootstrap process (PID 0), which performs the initial setup and forks the actual init process (PID 1) before jumping into the idle loop as shown in Listing 8.1.²

void cpu_startup_entry()
{
    /* Initialisation code omitted */
    while (1)
        arch_cpu_idle();
}

Listing 8.1: Simplified Linux idle process

The call to arch_cpu_idle allows architecture-specific sleep hints to be sent. Each architecture therefore implements its own version according to Table 8.1; a sketch of what such an implementation looks like for OpenRISC is given at the end of this section. Consequently, this must also be the point at which a processor sleep model intercepts the hint. Given the architecture-specific nature of a sleep model, no general approach for designing one based on SYSTEMC alone can be given. Instead, this chapter studies various approaches for the design of a sleep model for the OpenRISC architecture, in the hope that it will be useful as a template for other architectures as well. Due to this tight interdependence, the following section first introduces the OpenRISC architecture and its ISS, or1kiss.
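For OpenRISC, issuing the sleep hint amounts to setting the doze mode enable (DME) bit in the power management register (PMR) of the PMU via l.mtspr. The snippet below is an illustrative sketch of such an architecture-specific idle routine, not a verbatim copy of the kernel sources; register indices follow the OpenRISC 1000 specification but should be checked against the architecture manual.

/* Illustrative arch_cpu_idle() for OpenRISC (not verbatim kernel code).
 * mfspr/mtspr read and write special purpose registers; SPR_PMR is the
 * power management register (group 8, register 0) and SPR_PMR_DME its
 * doze mode enable bit. */
#define SPR_PMR      0x4000
#define SPR_PMR_DME  0x00000010

static inline unsigned long mfspr(unsigned long spr) {
    unsigned long value;
    __asm__ volatile("l.mfspr %0, %1, 0" : "=r"(value) : "r"(spr));
    return value;
}

static inline void mtspr(unsigned long spr, unsigned long value) {
    __asm__ volatile("l.mtspr %0, %1, 0" : : "r"(spr), "r"(value));
}

void arch_cpu_idle(void) {
    /* request doze mode; the core resumes on the next interrupt */
    mtspr(SPR_PMR, mfspr(SPR_PMR) | SPR_PMR_DME);
}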

8.2 The OpenRISC Instruction Set Simulator

The OpenRISC architecture [184] specifies a family of processors featuring a RISC-like load/store instruction set with 32 general purpose registers, a 32-bit instruction bus and a 32/64-bit data bus. Processors from this family are designed for high-performance networking and embedded scenarios. OpenRISC specifies a modular instruction set in order to provide a trade-off between design complexity and performance for various use cases. In the context of this work, only the basic 32-bit instructions (load/store, arithmetic, logical, etc.) and the 32/64-bit floating point instructions are considered. Beyond that, OpenRISC provides all facilities common to modern processors, such as a privileged execution mode, interrupts and a virtual memory system. Multi-core support is provided via a combination of (1) dedicated instructions for exclusive memory access using LL/SC, (2) core id and system identification registers, and (3) inter-processor communication and signalling facilities.

Figure 8.1 depicts the main components of OpenRISC that have been modelled in the or1kiss ISS. These denote the minimum requirement for supporting a modern operating system, such as Linux, which, in turn, is a prerequisite for efficient utilisation of processor sleep models. A short overview of the modelled capabilities of each component is therefore given in the following:

² http://elixir.free-electrons.com/linux/v4.3/source/kernel/sched/idle.c#L281

Figure 8.1: Main components modelled in the OpenRISC ISS: INSN/DATA MMUs and caches with memory upcalls, the interpreter (fetch, decode and execute stages with decode cache), Tick Timer, PIC/MPIC, PMU and Debug Unit

• INSN/DATA MMUs are required for supporting a virtual memory system. They support 8 KiB pages with read, write and execution protection. Fast virtual to physical address translation relies on a Translation Lookaside Buffer (TLB), which supports refill via HW or SW exceptions.

• INSN/DATA Caches provide a minimal interface that allows the OS to identify cache and cache line sizes. However, for the benefit of performance, neither functional nor timing simulation is performed. This is possible because the caches are write-through and coherent, and thereby effectively transparent to the user.

• Tick Timer is used to provide timing information to the system. It has two main purposes: first, it provides periodic interrupts at a configurable rate to be used for preemption by the scheduler. Second, it offers a high resolution timing reference, which is used as a clock source by Linux.

• PIC/MPIC. The Multi-Processor Interrupt Controller (MPIC) provides interrupt masking control and load balancing for multi-processor systems. This is especially important for the sleep model, since interrupts are the only way to wake a processor up from its idle state.

• PMU. The Power Management Unit (PMU) provides facilities for dynamic power and frequency scaling. However, since Linux does not make use of them, the corresponding models have been omitted. The PMU also provides the special purpose register used for signalling sleep hints.

• Debug Unit is used to enable bare metal debugging via JTAG. In the context of an ISS, this functionality has been replaced with support for debugging using rGDB [58, 177]. It is important for this unit to remain responsive, even if the processor is currently idle, so special precautions must be taken.

• Interpreter is the main facility that fetches and decodes OpenRISC instructions and executes them on the x86 host computer. It operates in an interpretative fashion, but additionally utilises a decode cache [138] to improve performance. Due to its overall importance, the interpreter is described in the following.

Figure 8.2: Decoded instruction for an addition operation with an immediate value (add r1, r2, 0xfb): the decoded instruction object stores the instruction address (0x00000a00) and raw encoding (0x400100fb), pointers to the destination and source operands in the register file, the immediate value 0x00fb and a handler callback (*dest = *src1 + *src2), and is stored in the decode cache

8.2.1 Cache Compiled Simulation

In a cache compiled ISS [138], the results of the decode stage are represented as cacheable objects for efficient reuse. The cache holding these objects is denoted as the decode cache and the objects themselves are referred to as decoded instructions. A decoded instruction holds pointers to all operands, e.g., ISS registers or constant immediate values, as well as to a callback implementing the actual functionality. An example of a decoded instruction is presented in Figure 8.2. Once such an object has been produced by the decoder of the ISS, it is placed in the decode cache alongside its address in instruction memory for efficient lookup. This is especially beneficial when the ISS is executing the same set of instructions repeatedly, e.g., in a loop.

Whenever the program counter of a processor advances to the next address, e.g. by natural increment or by jumping, the ISS first checks its decode cache to see whether the instruction located at that address has already been decoded. If this is the case, there is no need for fetching and decoding the instruction again and it is sufficient to just execute the decoded instruction handler once more. In case of a decode cache miss, i.e., the instruction has not yet been executed or it has been overwritten due to a cache collision, the ISS falls back to a conventional interpreter, which fetches and decodes a new instruction object and inserts it into the decode cache.

Due to the way the decode cache lookup works, special precautions have to be taken for all self-modifying code, such as virtual machines using DBT. Since during lookup the instruction is only compared using its address and not its actual value, it is possible to have stale instruction objects in the decode cache. Consequently, the decode cache must be invalidated whenever new instructions are written into memory over old ones that have already been executed. This is achieved by observing instruction cache flushes, which must be performed according to the specification whenever code is loaded or written to memory.
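To make the lookup procedure more concrete, the following sketch outlines a possible decode cache structure. The type and member names are illustrative and do not mirror the or1kiss sources; decode_into() stands in for the conventional interpreter that fetches and decodes an instruction on a cache miss.

// Illustrative decode cache for a cache compiled ISS (names are not taken
// from or1kiss). Decoded instructions are looked up by their address; on a
// miss or collision the conventional decoder fills the entry.
#include <array>
#include <cstddef>
#include <cstdint>

struct cpu_state;                               // ISS register file etc.

struct decoded_insn {
    std::uint32_t addr = 0;                     // instruction address
    std::uint32_t insn = 0;                     // raw encoding
    void (*handler)(cpu_state&, decoded_insn&) = nullptr;
    std::uint32_t* dest = nullptr;              // operand pointers
    std::uint32_t* src1 = nullptr;
    std::uint32_t* src2 = nullptr;
    std::uint32_t  imm  = 0;
};

class decode_cache {
public:
    decoded_insn& lookup(std::uint32_t addr) {
        decoded_insn& entry = m_cache[index_of(addr)];
        if (entry.handler == nullptr || entry.addr != addr)
            decode_into(entry, addr);           // miss or collision: re-decode
        return entry;
    }

    void invalidate() {                         // called on insn cache flushes
        for (auto& entry : m_cache)
            entry.handler = nullptr;
    }

private:
    static constexpr std::size_t CACHE_SIZE = 1u << 14;

    static std::size_t index_of(std::uint32_t addr) {
        return (addr >> 2) & (CACHE_SIZE - 1);  // instructions are word aligned
    }

    // fetches the raw instruction at addr and fills in operand pointers and
    // handler callback; stands in for the conventional interpreter
    void decode_into(decoded_insn& entry, std::uint32_t addr);

    std::array<decoded_insn, CACHE_SIZE> m_cache{};
};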

Figure 8.3: SYSTEMC wrapper for one OpenRISC processor: a driver SC_THREAD steps the ISS (e.g. 100 cycles at a time), memory upcalls are served either through the DMI cache via memcpy or through TLM blocking transport on the initiator socket, and interrupt set/clear requests are forwarded to the ISS by a SYSTEMC port handler

8.2.2 SystemC Wrapper

To be able to use the OpenRISC ISS within a SYSTEMC environment, a SYSTEMC wrapper is needed. It embeds the ISS into a module and provides a simulation process, called driver, to drive the execution of its fetch, decode and execute stages whenever simulation time advances. To improve performance, the wrapper takes advantage of temporal decoupling and runs ahead of simulation time before synchronising again with SYSTEMC time. This reduces timing resolution from cycle granularity (e.g. 100 MHz ≡ 10 ns) down to quantum granularity (e.g. ∆tq = 5 µs), but yields significant performance increases by reducing the number of costly wait calls. Since it is a commonly used technique in TLM development, it is also adopted here. Figure 8.3 gives an overview of the whole SYSTEMC wrapper for the OpenRISC ISS.

The wrapper also provides a memory upcall mechanism to allow interaction with SYSTEMC based I/O devices via the TLM blocking transport communication interface. Memory read and write operations are handled using TLM DMI, which is initialised via the memory upcall. DMI pointers are cached to prevent the overhead of repeated retrieval in the case of incoherent memory regions with frequent invalidations. If a DMI pointer is available from the DMI cache, instruction fetches, loads and stores are realised using plain memcpy calls directly from within the ISS. Otherwise, requests are handled using transactions via memory upcalls. The use of DMI reduces the timing accuracy of the wrapped model, since interconnect components are skipped over; however, at the same time it significantly increases simulation performance.

Interrupts are modelled using a combination of SYSTEMC ports and processes, which react to a change in the interrupt signals on each of the 32 ports supported by OpenRISC per core. Once the corresponding handler is invoked, it detects the new level of its associated interrupt line and then reports either a set or a cleared interrupt to the ISS. Interrupt masking is subsequently handled by the ISS internally.
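The read path through the memory upcall can be sketched as follows using standard TLM-2.0 types. The wrapper class and its member names are illustrative and do not mirror the actual or1kiss wrapper; registration of the DMI invalidation callback is omitted for brevity.

// Illustrative memory-upcall read path of the wrapper: fast path via a
// cached DMI pointer, slow path via a blocking TLM transaction.
#include <cstring>
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>

struct cpu_wrapper : public sc_core::sc_module {
    tlm_utils::simple_initiator_socket<cpu_wrapper> socket;
    tlm::tlm_dmi dmi;                // cached DMI descriptor
    bool dmi_valid = false;
    sc_core::sc_time offset;         // local time offset (temporal decoupling)

    SC_CTOR(cpu_wrapper) : socket("socket") {}

    void read(sc_dt::uint64 addr, unsigned char* data, unsigned int size) {
        if (dmi_valid && addr >= dmi.get_start_address() &&
            addr + size - 1 <= dmi.get_end_address()) {
            // fast path: plain memcpy through the DMI pointer
            std::memcpy(data, dmi.get_dmi_ptr() +
                        (addr - dmi.get_start_address()), size);
            return;
        }

        // slow path: blocking transport, e.g. for I/O registers
        tlm::tlm_generic_payload tx;
        tx.set_read();
        tx.set_address(addr);
        tx.set_data_ptr(data);
        tx.set_data_length(size);
        tx.set_streaming_width(size);
        socket->b_transport(tx, offset);

        // try to obtain a DMI pointer for subsequent accesses
        if (tx.is_dmi_allowed())
            dmi_valid = socket->get_direct_mem_ptr(tx, dmi);
    }
};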

8.3 Processor Sleep Models

Unlike ARM or x86, the OpenRISC architecture does not have a dedicated instruction for entering a low power idle state. Instead, power management is handled via the PMU, which exposes a set of registers that control processor clock frequency, clock gating and a so-called doze mode. Once enabled via the PMU, the doze mode suspends all components of the core except the internal tick timer and interrupt controller. The core only leaves doze mode and wakes up again when a tick timer or external interrupt arrives.

The OpenRISC architecture explicitly allows requests to enter doze mode to be ignored by the processor to reduce design complexity at the cost of higher energy consumption. In a virtual environment, such a doze mode might be considered unnecessary, since there is no energy to be saved. Consequently, many implementations currently omit it, e.g. or1ksim [18] and QEMU [16]. However, in a multi-core platform, such as ORVP, secondary cores are often put into an idle or low power state to conserve energy. Simulation of these idle cores degrades simulation speed, prolonging the time it takes for the working processors to finish their tasks. Thus, utilising doze mode as a hint to skip the simulation of idle cores can be expected to yield significant performance gains in frequently encountered low or medium load scenarios.

In the following, two different approaches to sleep model design are proposed. The ISS-based approach places the model inside the ISS and therefore simplifies integration. This approach is discussed in Section 8.3.1. Subsequently, Section 8.3.2 presents a more elaborate approach that attempts to increase simulation performance during idle phases even further by drawing on the underlying DES engine of SYSTEMC. Finally, Section 8.4.1 introduces a tracing mechanism for overall system CPU utilisation, which is based on the previously proposed sleep models. When applied in combination with simulation performance analysis, it allows the efficiency of the models to be assessed.

8.3.1 ISS Sleep Model

The ISS sleep model takes advantage of OS sleep hints to reduce the workload during idle simulation phases. It operates exclusively from inside the ISS, so that no changes to the embedding SYSTEMC wrapper are required. Keeping the interface between wrapper and ISS the same as shown in Figure 8.3 has the advantage of keeping the integration cost of the model low. The downside is, however, that the model does not have access to any of the SYSTEMC facilities, such as event notification and process scheduling, which would be beneficial for efficiently skipping over idle simulation phases. Moreover, SYSTEMC simulation time is not available from within the ISS either, making a synchronisation mechanism necessary.

ISS and wrapper employ a budgeting system in order to handle this task. At the beginning of each TLM quantum ∆tq, the wrapper calculates the number of cycles the model is supposed to simulate, denoted as cbudget, based on the CPU clock frequency fcpu, i.e., cbudget = fcpu · ∆tq. This value is then passed by the driver process to the ISS, allowing it to execute instructions until the budget is depleted. For every instruction i executed, cbudget is reduced by the cycle count ci of that instruction. Simultaneously, the ISS maintains a cycle counter ctotal to keep track of the total number of cycles executed since simulation start. This cycle counter is used by the ISS sleep model as a replacement for simulation time. The ISS stops once its budget reaches or drops below zero. It then returns the number of cycles actually executed, denoted as cactual, to allow the wrapper to synchronise with SYSTEMC time. The entire procedure is shown in Algorithm 8.1.

Algorithm 8.1: Time synchronisation between ISS and SYSTEMC wrapper

 1  Function DRIVER()
 2      cbudget ← fcpu · ∆tq
 3      cactual ← STEP(cbudget)
 4      wait(cactual · fcpu⁻¹)

 5  Function STEP(cbudget)
 6      cstart ← ctotal
 7      while cbudget > 0 do
 8          i ← get next instruction from decode cache
 9          EXECUTE(i)
10          ctotal ← ctotal + ci
11          cbudget ← cbudget − ci
12      end
13      return ctotal − cstart

With ctotal in place as a notion of time, the sleep model additionally requires access to the power management register of the PMU. An OpenRISC processor enters its doze mode once the DME bit is set in that register. Once this bit is detected, the model begins skipping instruction execution (c.f. Algorithm 8.1, line 9) and returns early, thereby immediately consuming its entire budget. Regular operation is only resumed if one of the following events is detected; a sketch of the resulting skip logic is given after the list:

• External interrupt: external interrupt notifications are forwarded to the ISS using a dedicated SYSTEMC process. According to the cooperative scheduling semantics of SYSTEMC, this process can only run in between quanta. Therefore, it is sufficient to check for external interrupts only at the beginning of each quantum. If one is pending, the DME bit is cleared to indicate that regular operation should be resumed.

• Tick Timer interrupt: if the Tick Timer is active during sleep mode, it is possible for its interrupt to occur during a quantum. Fortunately, Tick Timer interrupts are predictable, since they are timed using ctotal as a reference. If one is scheduled to appear during a quantum, the model skips only the cycles up to it and then resumes normal operation. Otherwise, the quantum is skipped entirely, updating ctotal in the meantime.
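The resulting modification to the STEP function of Algorithm 8.1 can be sketched as follows. The state and helper functions are illustrative only; execute_one() stands for the regular fetch/decode/execute path and raise_tick_irq() for the injection of the Tick Timer interrupt.

// Illustrative doze-mode handling inside STEP (names do not mirror or1kiss).
// While the DME bit is set, the budget is consumed without executing
// instructions, except for the cycles up to a pending tick timer interrupt.
#include <algorithm>
#include <cstdint>

struct iss_state {
    std::uint64_t cycles_total = 0;   // c_total
    std::uint32_t pmr = 0;            // power management register
    bool tick_active = false;         // tick timer enabled?
    std::uint64_t cycles_to_tick = 0; // cycles until the next tick interrupt
    // ... program counter, registers, decode cache, etc.
};

constexpr std::uint32_t PMR_DME = 0x10;          // doze mode enable bit

std::uint64_t execute_one(iss_state& s);         // regular fetch/decode/execute
void raise_tick_irq(iss_state& s);               // injects the tick interrupt

std::uint64_t step(iss_state& s, std::uint64_t budget) {
    const std::uint64_t start = s.cycles_total;

    while (budget > 0) {
        if (s.pmr & PMR_DME) {                           // doze mode requested
            std::uint64_t skip = budget;
            if (s.tick_active)                           // predictable wakeup
                skip = std::min(budget, s.cycles_to_tick);

            s.cycles_total += skip;                      // advance time only
            s.cycles_to_tick -= std::min(skip, s.cycles_to_tick);
            budget -= skip;

            if (s.tick_active && s.cycles_to_tick == 0) {
                raise_tick_irq(s);                       // wake the core up
                s.pmr &= ~PMR_DME;                       // resume execution
            }
            continue;
        }

        std::uint64_t c = execute_one(s);                // Algorithm 8.1, line 9
        s.cycles_total += c;
        budget = (c < budget) ? budget - c : 0;
    }

    return s.cycles_total - start;                       // c_actual
}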

Figure 8.4: Processor activity of CPU0–CPU3 over simulation time using the ISS sleep model (∆tq = 5 µs, 100 MHz clock), distinguishing active operation, sleep hints, doze mode, idle activations and wakeups via external interrupts and the Tick Timer

Figure 8.4 presents how the ISS model operates on a per-quantum basis. Quanta where the processors are fully active are drawn as wide arrows, those where doze mode is active as narrow ones. It can be seen how Tick Timer interrupts wake up processors during quanta, while external interrupts are only recognised at quantum boundaries. However, Figure 8.4 also highlights one of the disadvantages of the ISS sleep model. Even if a processor remains in doze mode for the entirety of its quantum, the driver process is still activated by the SYSTEMC process scheduler. This effect is denoted as idle activation and it creates a performance bottleneck due to unnecessary event notification and context switching overhead. It is the goal of the DES sleep model to improve simulation performance in such cases by skipping over multiple quanta, thereby preventing process activations of idle processors.

8.3.2 DES Sleep Model

The DES sleep model aims to overcome the shortcomings of the ISS sleep model by drawing on the DES subsystem of SYSTEMC. The main problem of the ISS-based approach is that it retains costly activations of the driver process at the beginning of each quantum, even if the corresponding processor is in doze mode. Because this process exists in the wrapper, it cannot be controlled by the ISS. Consequently, the interface between wrapper and ISS must be extended in order to support the exchange of information on sleep hints and idle phase durations between both entities.

For this reason, the DES sleep model adds another upcall for doze mode, as illustrated by Figure 8.5. Similar to the upcalls for instruction and data memory access, the sleep upcall hands control over to the wrapper whenever the ISS receives a sleep hint. Using this upcall, the wrapper can prevent the driver process from being invoked from SYSTEMC until a wakeup condition has occurred. To model this situation, the wrapper employs a regular SYSTEMC event ewakeup. During the sleep upcall, the wrapper waits on ewakeup in order to suspend the execution of the driver process, thereby immediately finishing its current quantum and allowing other processes to run and simulation time to pass.

Figure 8.5: Extended OpenRISC wrapper with sleep upcall: in addition to the memory and TLM blocking transport upcalls, the ISS issues a sleep(ctimeout) upcall upon a sleep hint; the driver SC_THREAD then waits on ewakeup (or until ctimeout elapses), and the interrupt handler notifies ewakeup on external interrupts

Notification of ewakeup signals the end of an idle phase and is initiated by the wrapper whenever a wakeup condition is encountered:

• External interrupt: once an external interrupt is received, the wrapper first forwards to the ISS which interrupt line is affected and whether it was raised or lowered. Subsequently, it uses a delta notification on ewakeup in order to allow the driver process to resume operation during the next delta cycle and cause the processor to wake up.

• Tick Timer interrupt: an additional challenge is posed by the Tick Timer firing its internal interrupt while the driver process is suspended, which would cause the processor to miss it if no precautions are taken. Since this kind of interrupt is ISS-internal and therefore not reported to the wrapper, the mechanism used for external interrupts cannot be applied. To work around this issue, the ISS passes an extra timeout argument ctimeout to the sleep upcall whenever invoking it while the Tick Timer is active. It states the maximum number of cycles the driver process is allowed to yield before it must resume stepping the ISS again. Since the ISS has no notion of SYSTEMC time, it passes ctimeout in cycles, which are converted to simulation time by the wrapper based on the CPU clock frequency. If ctimeout is specified, the wrapper must employ a SYSTEMC mechanism that allows waiting on ewakeup with a timeout. This guarantees that the driver process can resume operation when ewakeup is triggered or the timeout elapses, whichever happens first.

A final issue that must be taken care of when using a sleep upcall is time resynchronisation after a wakeup. In order to ensure correct operation of the internal cycle counters and the Tick Timer, the wrapper must report the time elapsed while sleeping back to the ISS once the upcall completes. Similar to how the timeout parameter is handled, the wrapper therefore first converts SYSTEMC time to processor cycles based on fcpu before returning it to the ISS. Finally, the ISS updates ctotal and its internal Tick Timer accordingly.
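SYSTEMC directly supports waiting on an event with a timeout via wait(timeout, event), which is essentially all the sleep upcall handler needs, as the following illustrative sketch shows. The class and member names are not taken from the actual implementation, and the handler must be invoked from within the driver SC_THREAD.

// Illustrative sleep upcall handler in the extended wrapper. Returns the
// number of cycles that elapsed while the driver process was suspended.
#include <systemc>

struct sleep_host {
    sc_core::sc_event wakeup;                              // e_wakeup
    sc_core::sc_time  clock_period{10, sc_core::SC_NS};    // 100 MHz

    // Called by the ISS on a sleep hint; must run inside the driver thread.
    // timeout_cycles == 0 means "no tick timer pending", i.e. wait for an
    // external wakeup only.
    sc_dt::uint64 sleep_upcall(sc_dt::uint64 timeout_cycles) {
        sc_core::sc_time start = sc_core::sc_time_stamp();

        if (timeout_cycles > 0)
            sc_core::wait(timeout_cycles * clock_period, wakeup);
        else
            sc_core::wait(wakeup);

        // convert elapsed simulation time back into processor cycles
        sc_core::sc_time elapsed = sc_core::sc_time_stamp() - start;
        return static_cast<sc_dt::uint64>(elapsed / clock_period);
    }
};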

Figure 8.6: Processor activity of CPU0–CPU3 over simulation time using the DES sleep model (∆tq = 5 µs, 100 MHz clock), based on the same scenario as Figure 8.4

Using the DES sleep model, Figure 8.6 presents the processor activity for the same scenario as depicted in Figure 8.4. It can be seen that the DES model suffers no idle process activations and efficiently handles longer idle phases, even those bridging multiple quanta. By removing the bottleneck of idle activations, the DES sleep model can be expected to achieve better performance than the ISS-based approach, though the latter can be realised with less implementation effort.

8.4 Experimental Results

To evaluate and compare the performance benefits of the processor sleep models, a realistic VP capable of running a modern OS kernel is required. Given that the proposed approach has been designed around the OpenRISC architecture, the ORVP is an attractive target. In order to keep performance results comparable to those of previous chapters, the system configuration parameters have been kept the same: the number of OpenRISC ISSs has been fixed to four and a TLM quantum of ∆tq = 5 µs has been selected. All memory communication, including exclusive access, is performed using DMI, while other peripherals are accessed using transactions.

Besides comparing both sleep models, this section also aims at assessing their applicability in parallel simulation scenarios. While sleep models are capable of cutting away unnecessary ISS operation during idle phases, they become ineffective in situations where the guest software occupies all OpenRISC processors at the same time. Consequently, the SCOPE SYSTEMC kernel has been selected as the main simulation driver. As with previous experiments, the lookahead matches the quantum, i.e., ∆tla = ∆tq = 5 µs. The following variants of ORVP are deployed for experimentation:

• ORVP/NONE: this variant is the regular ORVP as described in Chapter 4, without a sleep model. It is used as a baseline reference.

• ORVP/ISS: this variant uses the ISS sleep model as described in Section 8.3.1. It embeds the OpenRISC ISS into the VP using the wrapper shown in Figure 8.3.

• ORVP/DES: this variant employs the extended wrapper shown in Figure 8.5 in order to support the DES sleep model as introduced in Section 8.3.2.

The remainder of the experimentation section is structured as follows. First, Section 8.4.1 introduces processor activity tracing in ORVP. Next, the two sleep model designs are compared during sequential operation using a set of industry standard benchmarks, including Linux boot and coremark. Finally, the benefits of a combined effort of sleep models and parallel simulation techniques are studied in Section 8.4.3. Each experiment was repeated a fixed number of times and only averages are reported here. Detailed information about the runtimes of individual experiment iterations, the repetition count and the simulation host can be found in Appendix C.

8.4.1 Processor Activity Tracing

In order to be able to evaluate the proposed sleep models, merely benchmarking VP execution speed is insufficient. Processor utilisation by the guest SW running on the VP must also be taken into account. For example, no benefit can be expected from sleep models when parallel software is capable of continuously exhausting all cores in the VP. Consequently, a processor activity tracing mechanism has been added to ORVP in order to provide context on processor utilisation during benchmarking.

Activity tracing has been implemented according to the SYSTEMC standard for trace files [80]. A new tracing variable indicating the number of active OpenRISC processors has been introduced. It is automatically updated whenever an ISS enters or exits doze mode. At the end of every delta cycle, the SYSTEMC kernel stores any changes of the traced variables to disk. However, given that a new delta cycle is produced at the end of every quantum ∆tq = 5 µs and that the average duration of the selected benchmarks is on the order of minutes, this straightforward approach produces excessive amounts of data and significantly degrades VP simulation performance.

To overcome this issue, a two-tiered tracing approach has been developed. In the first tier, a trace module accumulates the number of active processors at the end of every delta cycle. Once a sampling interval of ∆tsamp = 200 · ∆tq = 1 ms has elapsed, the average number of active processors is reported to the second tier and the process restarts. The second tier is responsible for timekeeping and for storing the values reported from the first tier to disk. Sampling time is stored as multiples of ∆tsamp. Values are stored in differential form, i.e., as the number of processors that have become active since the end of the previous sampling interval.

All activity graphs shown in the following sections have been generated based on these trace files. The graphs represent processor activity relative to the total number of processors in the system, i.e., a utilisation of 100% means that no processor entered an idle phase during the corresponding sampling interval. Note that due to the averaging in tier one, activity values do not need to be discrete and values between 0%, 25%, 50%, 75% and 100% may appear. This happens if one or more processors became active or idle in the middle of a sampling interval. Finally, it should be noted that tracing always starts at system power-on. Consequently, idle phases during command line operation before and after executing the benchmark have been cut away manually.
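The two-tiered accumulation can be sketched as follows. The class is illustrative and not the actual trace module; for clarity it stores averaged samples directly instead of the differential encoding described above.

// Illustrative two-tier CPU activity tracer: tier one accumulates the number
// of active processors every delta cycle, tier two keeps one sample per
// sampling interval (here 200 quanta = 1 ms).
#include <cstddef>
#include <vector>

class activity_tracer {
public:
    activity_tracer(unsigned num_cpus, unsigned deltas_per_sample = 200)
        : m_num_cpus(num_cpus), m_deltas_per_sample(deltas_per_sample) {}

    // tier one: called at the end of every delta cycle / quantum
    void sample_delta(unsigned active_cpus) {
        m_accumulated += active_cpus;
        if (++m_deltas == m_deltas_per_sample)
            flush();
    }

    // utilisation of sample k relative to the total number of processors
    double utilisation(std::size_t k) const {
        return m_samples[k] / (double)(m_num_cpus * m_deltas_per_sample);
    }

private:
    // tier two: store the accumulated value and restart the interval
    void flush() {
        m_samples.push_back(m_accumulated);
        m_accumulated = 0;
        m_deltas = 0;
    }

    unsigned m_num_cpus;
    unsigned m_deltas_per_sample;
    unsigned m_deltas = 0;
    unsigned long m_accumulated = 0;
    std::vector<unsigned long> m_samples;
};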

8.4.2 Sleep Model Comparison

The comparison between the two sleep model approaches is driven by a selection of benchmark applications that represent realistic industry workloads. First, the Linux boot benchmark is considered, given that early OS porting efforts as well as driver development are prime use cases for a VP. Performance gains in this scenario directly translate to increased SW development and debugging productivity and are therefore highly desirable. Second, the coremark program has been established as an industry standard benchmark for evaluating processor performance. Its instruction mix resembles workloads frequently encountered in real-life applications. Finally, the ocean-cp application from the SPLASH2 [217] benchmarking suite represents parallel workloads from the domain of scientific computing. Though sleep models cannot be expected to yield large performance boosts under such circumstances, this scenario serves as a baseline reference. Results for each scenario are discussed in the following:

• Boot scenario: simulation runtime and the processor activity trace for this scenario are shown in Figures 8.7a and 8.7b, respectively. As expected, both sleep models achieve significant performance gains, given that the kernel booting procedure is well understood to be a mostly sequential and control-flow dominated workload. Figure 8.7b confirms this assumption: with the exception of the first milliseconds, where all cores set up the caches and TLBs in parallel, no more than two processors are active at the same time on average. Here, the ISS sleep model reduces simulation time until login by 64%, which corresponds to a speedup of 2.77×. The best performance and shortest simulation time are achieved using the DES sleep model. It accelerates VP execution speed by 3.38× and reduces simulator runtime by 70%.

• Coremark scenario: the activity trace of the coremark benchmark in Figure 8.7d shows how the application operates in two phases. During the first 5 seconds of operation, coremark employs only a single processor to set up data structures and initialise memory to zero, while the other cores remain idle. Subsequently, the actual benchmarking routine starts in parallel on all four cores. The resulting simulation runtime is presented in Figure 8.7c. Due to the high parallel activity in the second phase of the benchmark, the sleep models cannot achieve performance increases as significant as in the previous scenario. However, the sequential initialisation phase still allows VP runtime reductions of 22% and 23% for the ISS and DES sleep models, respectively.

• Ocean-cp scenario: the final scenario selected for the evaluation of the sleep models represents a highly parallel workload from the domain of scientific computing. As the activity trace in Figure 8.7f shows, the application efficiently takes advantage of all processors present in the system, only allowing cores to briefly enter doze mode while waiting at synchronisation barriers. Additionally, the short idle phases cause the heuristics of the underlying OS scheduler to keep the cores active for longer in order to avoid costly and slow processor wakeup phases. Consequently, the VP can only marginally take advantage of the sleep models, which is also reflected by the minimal performance improvements shown in Figure 8.7e. Using the ISS sleep model, simulator runtime is shortened by 0.9%, while the DES approach achieves a reduction of 2.2%.

Figure 8.7: VP runtime and CPU activity in selected mixed-to-high load scenarios:
(a) VP runtime, Linux boot: ORVP/NONE 26.84 s, ORVP/ISS 9.66 s, ORVP/DES 7.94 s; (b) CPU activity, Linux boot;
(c) VP runtime, coremark: ORVP/NONE 209.46 s, ORVP/ISS 163.03 s, ORVP/DES 160.6 s; (d) CPU activity, coremark;
(e) VP runtime, ocean-cp: ORVP/NONE 66.8 s, ORVP/ISS 66.22 s, ORVP/DES 65.35 s; (f) CPU activity, ocean-cp

In summary, the DES model on average achieves better performance than the ISS one, provided that the deployment scenario allows for enough idle processor time. The reason for the increased simulation speed lies in the reduced simulation activity thanks to the avoidance of idle activations, as discussed previously. Especially during the important Linux boot benchmark, the difference between both approaches becomes clearly visible. Here, the speedups of the sleep models rival those achieved by parallel simulation alone, as presented in Chapter 7. Therefore, a study on whether both technologies combine well is presented next.

Benchmark        Type   Problem Size
boot             s      Kernel boot from reset to login prompt
fibonacci        s      Iterative calculation of fib(10000)
dhrystone        s      10⁶ iterations of the dhrystone [200] benchmark
dhrystone-4x     p      Four concurrent instances of dhrystone
coremark         p      coremark [45] benchmark using pthreads
mandelbrot       p      Parallel calculation of the mandelbrot set
barnes           p      Barnes-Hut hierarchical N-body method algorithm
fmm              p      Body interaction using Fast Multipole Method, 2¹⁴ particles
ocean-c/nc       p      Ocean current simulator, 256 × 256 grid, four threads
radiosity        p      Computation of light distribution, default scene
raytrace         p      Raytracing using teapot mesh
volrend          p      Volumetric rendering using head-scaleddown4 mesh
water-nsquared   p      Current and force simulation of 512 water molecules
water-spatial    p      Same as above, but uses spatial separation algorithm
cholesky         p      Cholesky factorisation on sparse matrix; input file tk29.O
fft              p      Fast Fourier Transform, 2¹⁵ − 1 complex data points
lu-c/nc          p      LU matrix factorisation, 512 × 512 matrix size
radix            p      Integer radix sort with 2¹⁸ keys and a radix of 1024

Table 8.2: Sequential (s) and parallel (p) benchmark applications

8.4.3 Parallel Performance Analysis

The interplay between parallel simulation kernels and sleep models presents certain intricacies to VP designers striving for optimal performance. Sleep models reduce the amount of parallelizable simulation workload: by omitting the simulation of idle processors, a heterogeneous load distribution is created that is dependent on the sleep hints issued by the target SW. In other words, when using sleep models, the parallelization potential of a running VP becomes coupled to the parallelization potential of the SW running within it. To allow a fair assessment, the selection of 20 benchmark applications is therefore classified into sequential (s) and parallel (p) workloads, as shown in Table 8.2.

Performance gains are stated in the form of speedups, derived using the runtime of ORVP/NONE as a baseline. For the parallel variant, the number of threads for ORVP/NONE has been increased to four, matching the number of OpenRISC processors present in the system. Similarly, the combined variant uses either ORVP/ISS or ORVP/DES, also with four threads. Parallel simulation results for both sleep models are presented in Figure 8.8 and Figure 8.9.

Figure 8.8: Combined speedup with parallel simulation and the ISS sleep model across all benchmarks (iss-sleep, parallel and combined configurations)

Figure 8.9: Combined speedup with parallel simulation and the DES sleep model across all benchmarks (des-sleep, parallel and combined configurations)

The experimental results exhibit three noteworthy cases of how sleep models affect simulation performance. These are investigated in the following:

1. Parallel performance increases with sleep models.

2. Parallel performance decreases with sleep models.

3. Sequential sleep models outperform parallel ones.

The first group of benchmarks comprises coremark, mandelbrot, barnes, raytrace, volrend, water-spatial and cholesky. They all benefit from parallel simulation and sleep models, resulting in the shortest simulation time being achieved only when both techniques are used together. On average, sleep models improve the performance of this benchmark group by 7.5%. When also taking parallel simulation into account, an average speedup of 3.3× is reached. The reason for these performance improvements lies in the actual CPU utilisation of the applications. All have a sequential setup phase, which is accelerated by the sleep model, followed by a computationally intensive parallel phase where all cores are active (c.f. the coremark activity in Figure 8.7d).

The second group of benchmarks consists of the applications dhrystone-4x, fmm, ocean-c/nc, lu-c and radix. They expose only short sequential phases or none at all, so no benefit from sleep models can be expected (c.f. the ocean-cp activity in Figure 8.7f). Due to the long CPU-intensive phases, parallel simulation still achieves an average speedup of 3.2×. The overhead of the sleep models reduces this speedup to 3.1×.

Finally, the third group consists entirely of the sequential applications. The benefits of sleep models are most visible here, as generally only one processor is active while the others are idle. For this group, the DES sleep model achieves an average speedup of 3.9× using the single-threaded simulation kernel and only 3.8× with the parallel one. In such sequential workloads, the parallel simulator struggles to deliver optimal performance due to its synchronisation overhead.

8.5 Limitations and Outlook

Based on the results of the performance evaluation, a key limitation of the sleep models can be identified and used as a basis for future improvements. The speedups achieved with the DES model show that it frequently performs worse in a parallel simulation scenario than its ISS counterpart. Additionally, with the exception of purely sequential scenarios, the benefits of the DES approach appear marginal at best. This becomes especially visible in the ocean-c/nc, raytrace and fft scenarios, where the performance of the ISS sleep model is superior. The common aspect of these scenarios is the short idle time between active parallel phases. It is not uncommon for processors to be put into doze mode for merely a single quantum before being woken up again. A deeper investigation shows that the reason for this behaviour is barrier synchronisation, which switches processors into an idle state until all other processors have also reached the barrier. Since the work of all processors is evenly balanced, only a few quanta of difference exist between the first and the last processor entering the barrier.

Such short idle durations leave the DES sleep model struggling to keep up with the ISS one. Even though the latter suffers from a few idle activations of its driver process, the overhead induced by notifying a wakeup event consumes the potential gains of skipping an idle activation, lowering the overall performance of the DES approach. Consequently, the ISS sleep model should be preferred whenever short idle phases are expected, while the DES one should be employed when it can be ascertained that the processor will remain idle for at least multiple quanta. Unfortunately, the duration of idle phases lies solely under the control of the target SW and can therefore not be determined exactly by the VP. Future work should therefore be directed towards finding a heuristic that efficiently predicts the length of future idle phases. A straightforward approach could, for example, employ the ISS sleep model for the first couple of idle quanta. Once a fixed number of quanta without processor activity has elapsed, the heuristic could switch to the DES model, expecting the situation to stay the same and the processor to remain idle for longer.

Finally, the assumed interface between sleep models and the ISS calls for additional investigation. The sleep models have proven to be efficiently integrable into interpretative simulators, given that dedicated registers are used to signal sleep hints. However, many architectures employ specialised instructions for the same purpose, which must be intercepted unless the source code of the ISS is available. Most models offer a register holding the currently executing instruction, which can be exploited for this purpose. Simulators that employ DBT pose a special challenge in this domain, as the dynamically generated host code for sleep signal instructions would need to be modified. Future work should therefore investigate whether the assumed sleep model interface is applicable to other ISSs employed within the EDA industry today.
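Such a heuristic could take a shape similar to the following purely illustrative sketch, where the threshold of four quanta is an arbitrary assumption.

// Purely illustrative sketch of the proposed idle-phase heuristic: use the
// ISS sleep model for the first few idle quanta and hand over to the DES
// model once the idle phase looks long enough to amortise the event overhead.
enum class sleep_mode { ISS, DES };

class idle_heuristic {
public:
    explicit idle_heuristic(unsigned threshold_quanta = 4) // assumed threshold
        : m_threshold(threshold_quanta) {}

    // called once per quantum with the processor's idle status
    sleep_mode update(bool idle) {
        m_idle_quanta = idle ? m_idle_quanta + 1 : 0;
        return (m_idle_quanta >= m_threshold) ? sleep_mode::DES
                                              : sleep_mode::ISS;
    }

private:
    unsigned m_threshold;
    unsigned m_idle_quanta = 0;
};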

8.6 Synopsis

The introduction of sleep models into multi-core VPs enables the exploitation of OS sleep hints to skip over idle simulation phases, which can lead to significant performance improvements in all but extremely parallel workloads. Using sleep models, it is possible to reduce the runtime of a VP booting the Linux kernel – an important use case within the EDA industry – by 70%. To ease integration efforts with existing VPs, two approaches have been presented. The ISS approach can be applied locally within the CPU models, but may suffer from performance losses in case of frequent idle activations. This problem is overcome by the DES sleep model, which utilises SYSTEMC mechanisms to prevent model activations during sleep phases of extended length. However, because of the required interface between ISS and SYSTEMC wrapper, the integration effort is higher.

For optimal simulation performance, sleep models should be deployed in conjunction with parallel simulation technologies, which are capable of efficiently handling the parallel workloads where sleep models naturally do not perform well. The combination of both technologies links the performance of a VP directly to the workload induced by the SW running within it. Previous VPs have only considered always-on processors, resulting in a homogeneous and static load distribution. With sleep models in place, parallel simulation techniques are now challenged by dynamic, heterogeneous workloads. The SCOPE simulation kernel has proven to deal efficiently with this challenge, reducing simulation runtime for most evaluated application scenarios. Even in corner cases, where VP performance is reduced compared to pure parallel simulation without sleep models, the runtime losses are only minor.

Chapter 9

Parallel SystemC using Time-Decoupled Segments

Parallel simulation is one of the main drivers for achieving high performance with multi-core VPs. So far, this work has presented two parallel simulation technologies, i.e., time-decoupling (c.f. Chapter 5) and flexible time-decoupling (c.f. Chapter 6), which are based on SYSTEMC. Their actual implementation is the parallel SYSTEMC kernel SCOPE. By building the time-decoupling directly into the simulation kernel, the integration of existing VPs is facilitated and fine-grained dynamic load balancing becomes possible. However, the approach also suffers some shortcomings. First, each parallel VP may be subject to race conditions: while SCOPE can efficiently detect and prevent shared access to SYSTEMC channels, inter-model communication using global variables bypasses these checks. Moreover, non-blocking TLM IMCs, such as the debug and direct memory interfaces, are not protected from races by SCOPE. Secondly, VP designers wishing to take advantage of time-decoupling must replace their original, potentially customised SYSTEMC kernel and thereby lose its other functionality. Consequently, alternative approaches that do not suffer from these shortcomings must be investigated.

SYSTEMC-LINK [204] has been developed as a tool to overcome these issues and offer developers an alternative way to harness parallel simulation, even for legacy code that uses shared global variables excessively. Fundamentally, a SYSTEMC-LINK based simulation is composed of multiple segments, each featuring its own copy of the SYSTEMC kernel as well as models for processors and other components, such as memories, buses and I/O peripherals. A simulation controller links these segments together and orchestrates their simulation, thereby forming a complete VP. For optimal parallel performance, each segment receives its own time zone, so that an asynchronous synchronisation scheme based on time-decoupling can be employed. Furthermore, every SYSTEMC-LINK segment provides a virtual sequential environment with global state replication. This enables the use of legacy models by allowing operation on global state without risking race conditions during parallel operation.

The remainder of this chapter is structured as follows. First, Section 9.1 gives an overview of the fundamental concepts of SYSTEMC-LINK and introduces its core components, which are elaborated by subsequent sections. Next, Section 9.2 describes various aspects of the simulation controller, before Section 9.3 presents the communication infrastructure used for inter-segment communication. Both concepts are combined in Section 9.4 to give a big-picture overview. Subsequently, Section 9.5 presents experimental results, including those of the contemporary multi-core ORVP based on the OpenRISC architecture. Finally, Section 9.6 presents limitations and future work for SYSTEMC-LINK, before Section 9.7 summarises the results and gives a conclusion.


Figure 9.1: SYSTEMC-LINK simulation architecture: segments A–D, each containing its own models, global state and SYSTEMC kernel and running on its own thread (threads 0–3), communicate via channels and are aggregated by the SYSTEMC-LINK simulation controller

9.1 The SystemC-Link Concept

A key motivation behind the design of SYSTEMC-LINK has been to remedy certain shortcomings of SCOPE that became apparent at later design stages. A VP designer wishing to take advantage of parallel simulation must replace the previous SYSTEMC kernel with SCOPE and thereby renounce any extra benefits that a custom kernel might have offered, such as the advanced tracing and introspection facilities commonly provided by EDA vendors. As a consequence, SYSTEMC-LINK was developed to be agnostic of the simulation kernel that ultimately drives the simulation. Its design combines a hypervisor-based architecture, as often encountered in virtualised environments such as server farms, with a co-simulation approach that is typically employed to link simulators from different application domains.

Figure 9.1 gives an overview of the simulation architecture of SYSTEMC-LINK. Fundamentally, it consists of simulation segments, which are interconnected by communication channels. A simulation controller aggregates all segments and orchestrates their simulation in parallel. In this context, a segment represents a self-contained simulation of a subsystem of the entire VP, for example an application, network or graphics processor with its local memories or caches. Each segment therefore needs to provide its own simulation kernel, models and other – potentially global – state. This architecture brings with it the following benefits for parallel VPs:

• Vendor-independent simulation kernel. As depicted in Figure 9.1, each segment must provide its own simulation kernel and is therefore free to choose the regular OSCI implementation or an augmented custom variant. Should compatibility constraints for simulation models require diverging SYSTEMC versions, only the individual segments concerned need to be adapted while others remain unaffected.

• Enhanced virtual sequential environments. While SCOPE offers simulation models protection from race conditions on regular SYSTEMC communication channels, it cannot prevent races that bypass those channels, e.g., races on global variables. The enhanced virtual sequential environments of SYSTEMC-LINK automatically replicate all such global state of the models contained within a segment and prevent access from other threads unless explicitly allowed, e.g., for DMI.

• Race-free non-blocking IMCs. The enhanced virtual sequential environment furthermore supports cross-thread IMCs that guarantee race-free execution, even for those that SYSTEMC forbids from blocking, i.e., from calling wait. This relaxes the implementation cautiousness required for various TLM communication interfaces, such as DMI, debug and non-blocking transactions.

• Fine-grained lookahead specification. Channels between segments allow the annotation of individual communication latencies, enabling more fine-grained time-decoupling than the single global lookahead employed by SCOPE. The amount of time a segment is allowed to simulate ahead is automatically derived from the channels it holds to its peers. Fine-grained control of the lookahead enables a trade-off between simulation performance and timing accuracy down to the individual component models.

While bringing these benefits to the table, the SYSTEMC-LINK approach also suffers some shortcomings compared to SCOPE. Because SCOPE moves the time-decoupled parallel scheduler into the kernel itself, integration into existing VPs is possible merely by recompiling and relinking against the SCOPE sources. SYSTEMC-LINK, however, requires extra steps before a VP can be constructed. First, the designer must decide on its partitioning, i.e., which subsystems should be represented as segments for parallel execution. Next, specialised communication primitives must be inserted that allow cross-segment communication and stub disconnected TLM ports. Finally, a platform configuration must be created that tells the simulation controller how to assemble and interconnect the individual segments into a fully operable VP. It is reasonable to expect these steps to not only impact VP programming, development and operation, but also organisational aspects, since the reorganisation into segments affects source code organisation and version control.
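The concrete format of the SYSTEMC-LINK platform configuration is not reproduced here; purely to illustrate the information such a description has to carry, the following sketch (all struct and field names are assumptions) captures a two-segment platform as plain C++ data:

    // Illustrative only: a hypothetical, minimal representation of a platform
    // description listing segments and latency channels. The real SYSTEMC-LINK
    // configuration format may differ; struct and field names are assumptions.
    #include <string>
    #include <vector>

    struct segment_desc {
        std::string id;             // unique segment ID, e.g. "primary"
        std::string shared_object;  // individually compiled segment, e.g. "segment0.so"
    };

    struct channel_desc {
        std::string slave_block;    // hierarchical name, prefixed with the segment ID
        std::string master_block;
        double      latency_ns;     // channel latency, i.e. the lookahead on this link
    };

    struct platform_desc {
        std::vector<segment_desc> segments;
        std::vector<channel_desc> channels;
    };

    // Two segments, one latency channel from the secondary CPU to the primary memory.
    const platform_desc example = {
        { { "primary",   "segment0.so" },
          { "secondary", "segment1.so" } },
        { { "secondary.cpu.out", "primary.mem.in", 40.0 } },
    };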

9.2 Simulation Controller

It is the task of the simulation controller to combine the individual simulation segments to form a complete VP. This process is called elaboration and is based on a platform configuration file, which lists all segments that make up the VP including their interconnections. After successful elaboration, the controller begins the simulation phase. During this phase, time-decoupling is employed in order to synchronise the local simulation times of each segment and prevent causality violations.

This section presents the central aspects of the simulation controller. First, Section 9.2.1 investigates the time-decoupled synchronisation scheme used for the segment co-simulation and introduces the concept of channel latency networks. Subsequently, Section 9.2.2 discusses the scheduler employed by the SYSTEMC-LINK controller, its interface to the segments as well as its supported scheduling modes. Finally, the enhanced virtual sequential environment including its IMC race protection and global state replication concepts is presented in Section 9.2.3.

[Figure: segment 0 "primary" (CPU0, bus and MEM) at local time t0 = 80 ns and segment 1 "secondary" (CPU1) at local time t1 = 100 ns, connected by a channel from CPU1 to MEM with latency ∆c1,0 = 40 ns.]

Figure 9.2: Time-decoupled segments interconnected via latency channels

9.2.1 Time-Decoupled Co-Simulation

As already identified in Chapter 5, synchronous simulation schemes do not offer enough parallelization potential to significantly accelerate ESL simulators. SYSTEMC-LINK therefore makes use of time-decoupling [207] to allow individual segments to run ahead in time for a certain duration before synchronising again. However, this complicates inter-segment communication, since each segment i now operates in its own local time zone with a local time ti. In order to bridge time zones between interconnected segments, channels are assigned a latency, denoted as ∆ci,j for a channel between segments i and j. Conceptually, a channel latency describes the amount of time allowed to pass after putting a communication message token into the channel and before it must be fetched by the receiver.

$t_{\text{lim},i} = \min_{j \in \text{peers}(i)} \left( t_j + \Delta c_{j,i} \right)$ (9.1)

Segments and latency channels form a channel latency network. It is used by the simulation controller to determine the amount of time a segment may simulate ahead of time before risking to miss a channel token from its peers. It is the task of the simulation controller to make sure that a segment never advances too far ahead in time. To that extent, the controller computes a limit time tlim,i for each segment i as shown in Equation 9.1. Segments that have not reached their limit time, i.e., ti < tlim,i, are considered ready to simulate by the SYSTEMC-LINK controller, while others are considered waiting for their peers to catch up.

Figure 9.2 presents a SYSTEMC-LINK simulation consisting of two segments: a primary segment containing models for processor, bus and memory components, and a secondary segment holding only a processor. The connection between the secondary processor and the memory on the primary segment is made using a channel with a latency of ∆c1,0 = 40 ns. Assuming that the local time stamp of the second segment t1 is currently at 100 ns, the first segment does not need to check for new channel tokens earlier than tlim,0 = t1 + ∆c1,0 = 140 ns. Given its local time is t0 = 80 ns, segment 0 may simulate uninterruptedly for a duration of tlim,0 − t0 = 60 ns.
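As a small illustration (assumed helper code, not taken from the SYSTEMC-LINK sources), the limit time of Equation 9.1 can be computed by scanning the incoming channels of a segment:

    #include <systemc>
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using sc_core::sc_time;

    // One directed latency channel: delta c_{from,to} (sender "from", receiver "to").
    struct channel { std::size_t from, to; sc_time latency; };

    // Equation 9.1: the limit time of segment i follows from the local times t[j]
    // of all peers j that send to i, plus the latency of the respective channel.
    sc_time limit_time(std::size_t i, const std::vector<sc_time>& t,
                       const std::vector<channel>& channels) {
        sc_time t_lim = sc_core::sc_max_time();
        for (const channel& c : channels)
            if (c.to == i)                                    // incoming channel j -> i
                t_lim = std::min(t_lim, t[c.from] + c.latency);
        return t_lim;
    }

    // Figure 9.2: t0 = 80 ns, t1 = 100 ns, delta c_{1,0} = 40 ns
    // => limit_time(0, ...) yields 140 ns, so segment 0 may run ahead for 60 ns.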

9.2.2 Segment Scheduling

SYSTEMC-LINK uses a cooperative scheduling approach for simulating segments using coroutines similar to those used by SYSTEMC for thread processes. Once started, the scheduler picks the first segment from the ready-to-simulate queue and switches to its execution context. If the scheduler is allowed to utilise multiple host threads for parallel simulation, each thread picks a segment until the queue is empty or every thread is busy. Within its execution context, a segment is first initialised before simulation is started, using the interface functions init and step. A default implementation for step is provided by SYSTEMC-LINK. It is presented in Algorithm 9.1. The init routine must be added by the VP developer in order to construct the module hierarchy. Once the simulation finishes (either by reaching the end time stamp or by calling stop in any segment), the simulation controller invokes the exit interface function to allow for module hierarchy deconstruction and cleanup.

Algorithm 9.1: Default step routine for SYSTEMC-LINK segments (simplified)

 1  Function STEPSEGMENT(tlim,i)
 2    while ti < tlim,i do
 3      while pending activity at current time do
 4        simulate one delta cycle
 5      end
 6      e ← next(EQi);
 7      tnext ← min(tlim,i, te);
 8      simulate until ti = tnext;
 9      report new segment time ti to controller;
10    end
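Purely as an illustration (this is not the SYSTEMC-LINK implementation), the routine above could be realised on top of the standard SYSTEMC 2.3 API roughly as sketched below; the report_time callback towards the controller is an assumed hook.

    #include <systemc>
    #include <functional>

    using namespace sc_core;

    // report_time: assumed hook that forwards the new local time to the controller.
    void step_segment(const sc_time& t_lim,
                      const std::function<void(const sc_time&)>& report_time) {
        while (sc_time_stamp() < t_lim) {
            // lines 3-5: drain delta/immediate notifications at the current time
            while (sc_pending_activity_at_current_time())
                sc_start(SC_ZERO_TIME);             // runs exactly one delta cycle

            // lines 6-7: time of the next pending event, clipped to the limit time
            sc_time t_next = sc_time_stamp() + sc_time_to_pending_activity();
            if (t_next > t_lim)
                t_next = t_lim;

            // line 8: advance the local SystemC kernel up to t_next
            sc_start(t_next - sc_time_stamp(), SC_RUN_TO_TIME);

            // line 9: report the updated segment time back to the controller
            report_time(sc_time_stamp());
        }
    }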

Algorithm 9.1 presents the default implementation for the step interface function. While it may be replaced with custom designs in order to address VP-specific intricacies, it is convenient to use and sufficient for the general case. The function receives the limit time tlim,i that its segment is allowed to run up to and executes the following routine until it reaches this timestamp and returns. First, it processes all delta and immediate notifications at its current time (c.f. lines 3–5), ensuring that simulation time does not advance. Next, it retrieves the next pending event e from the event queue EQi of the SYSTEMC kernel embedded in its segment (c.f. line 6). Its future trigger timestamp te is compared to tlim,i in order to derive the next timestamp tnext it is safe to simulate up to without missing events or causing causality violations (c.f. lines 7 and 8). Once simulation has advanced to tnext, the updated local segment time ti = tnext is reported back to the SYSTEMC-LINK controller.

The scheduler can be configured with different operation modes. These modes control the conditions under which the coroutine that executes the simulation loop and step function of a segment yields control back to the scheduler. Two different scheduling modes are available, called As Soon As Possible (ASAP) and As Late As Possible (ALAP), indicating when a segment should yield control.

• ASAP scheduling: a segment yields control back to the scheduler once it wants to advance time to a different timestamp, i.e., while reporting its new simulation time to the controller (c.f. line 9 in Algorithm 9.1). This results in shorter channel communication delays since segment timestamps are generally closer together, but also increases the number of coroutine context switches, reducing overall simulation performance.

• ALAP scheduling: a segment yields control back to the scheduler only once it reaches its limit time tlim,i and its step routine returns (c.f. Algorithm 9.1). This reduces the number of context switches throughout the simulation and thereby increases simulation speed, but local timestamps of segments generally differ by a larger amount, leading to increased communication latencies.

The choice of scheduling mode presents a trade-off between simulation perfor- mance and timing accuracy. While currently scheduling modes can only be specified globally, future work might enable per-segment configuration to support fine-tuning.

9.2.3 Virtual Sequential Environment

Segments form virtual sequential environments for all included models, meaning that their simulation processes do not execute concurrently within one segment. Compared to SCOPE, the SYSTEMC-LINK variant of these environments is further enhanced by additionally offering support for race-free IMCs and per-segment global state replication. This frees the programmer from protecting shared state using locks and enables the use of non-thread-safe, off-the-shelf models in parallel simulators. Thanks to the global state replication technique, singleton models that utilise global state (e.g., an ISS using global variables for storing its register values) may be instantiated multiple times in a SYSTEMC-LINK simulation without costly redesign, simply by moving each instance into its own segment.

The simulation controller conducts the simulation of all segments within one host process and a single address space. A shared address space allows segments to exchange memory pointers, which is a fundamental requirement for technologies enabling high simulation performance, such as DMI and DBT based ISSs. However, for conventional parallel simulators, the combination of a unified address space and multi-threading frequently leads to race conditions, usually when shared data is accidentally accessed in an unsynchronized fashion. In SYSTEMC-LINK, such accidents are effectively avoided, since each segment is individually compiled and therefore exclusively controls and references only its own state. Pointers to memory regions of other segments must be explicitly requested, for example using TLM DMI.

Segments are implemented as shared object files. All software that uses global variables which should be replicated for each segment must be statically linked. For example, the OSCI SYSTEMC kernel uses the global variable simcontext to store status information, such as local simulation time and the event queue. Since this data needs to be replicated for each segment, a copy of the chosen SYSTEMC library must be statically linked to the segment. However, it is not necessary to choose the same copy or version for each one.

Memory Address             Module Name
0x00200000 - 0x0020c000    libc.so, libm.so, libxml.so, ...
0x0020c000 - 0x003a0000    SYSTEMC-LINK Controller (main.o, schedule.o, channel.o)
0x003a0000 - 0x0063c000    segment0.so (bus.o, iss.o, libscl.a, libsystemc.a)
0x0063c000 - 0x008d8000    segment1.so (iss.o, libscl.a, libsystemc.a)
0x01000000 - 0xc0000000    Heap & Stack

Figure 9.3: Host memory layout for the VP from Figure 9.2

When a SYSTEMC-LINK simulation is started, the simulation controller loads the shared objects associated with the individual segments into distinct memory locations. For the VP given in Figure 9.2, an exemplary memory layout is depicted in Figure 9.3. It can be seen how heap, stack and dynamically linked libraries, such as libc, libm and libxml, are shared among all segments in the given example. Global SYSTEMC and model state is replicated, since two copies of libsystemc and iss are placed at different memory locations. Finally, one may notice the addition of an extra library to each segment called libscl. This library provides an interface between its segment and the simulation controller. For example, it implements the step interface routine as well as other communication primitives.

In some situations, it might be convenient to load a segment multiple times. Considering the example in Figure 9.2, the VP could easily be extended to a quad-core system by instantiating the secondary segment multiple times and creating new channel interconnections. However, the dynamic loader of a Linux based host computer does not normally load the same shared object twice. Instead, it only provides a reference to the first one whenever it detects that a shared object has already been loaded before. Since this violates the global state replication guarantee for enhanced virtual sequential environments, a workaround must be deployed. For Linux based hosts, it is possible to load the same shared object into a different namespace and access it natively using a wrapper. The detection mechanism may also be circumvented by renaming the shared object file for each segment instance to be loaded.
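On glibc-based Linux hosts, the namespace workaround can be illustrated with dlmopen(); the sketch below is an assumption about how such a loader could look rather than the actual SYSTEMC-LINK code, and the segment file name and the init entry point are placeholders.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE           // needed for dlmopen() on glibc
    #endif
    #include <dlfcn.h>
    #include <cstdio>

    int main() {
        // Load two independent copies of the same segment into separate link-map
        // namespaces; each copy gets its own instance of all statically linked
        // globals (e.g. the SystemC simcontext or an ISS register file).
        void* seg_a = dlmopen(LM_ID_NEWLM, "./segment1.so", RTLD_NOW | RTLD_LOCAL);
        void* seg_b = dlmopen(LM_ID_NEWLM, "./segment1.so", RTLD_NOW | RTLD_LOCAL);
        if (!seg_a || !seg_b) {
            std::fprintf(stderr, "dlmopen failed: %s\n", dlerror());
            return 1;
        }

        // Resolve the (assumed) segment entry point in each namespace; the two
        // pointers refer to distinct copies of code and data.
        using init_fn = void (*)();
        init_fn init_a = reinterpret_cast<init_fn>(dlsym(seg_a, "init"));
        init_fn init_b = reinterpret_cast<init_fn>(dlsym(seg_b, "init"));
        if (init_a && init_b) {
            init_a();             // builds the module hierarchy of the first copy
            init_b();             // builds an independent hierarchy for the second
        }
        return 0;
    }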

[Figure: a CPU in segment 0 connects via a TLM socket to a slave connector block, which communicates over a channel with a master connector block in segment 1 that forwards transactions to MEM.]

Figure 9.4: Cross segment communication via connector blocks

9.3 Communication Infrastructure

Inter-segment communication is handled using channels. To access these channels, specialised connector blocks must be placed into the simulation and connected using regular SYSTEMC signals or TLM ports, as illustrated by Figure 9.4. Slave connector blocks receive transactions and send them over the channel. Master connector blocks receive those transactions and forward them to their destination using regular TLM communication interfaces. During simulation elaboration, a channel entry point must always be linked to one slave connector block and, similarly, an exit point to one master block. In order to identify connector blocks simulation-wide, their hierarchical name is prefixed with the unique ID of the segment they were instantiated in.

A key task of these connector blocks is to operate deterministically and in a thread-safe manner in order to maintain the guarantees of the enhanced virtual sequential environments of SYSTEMC-LINK. Furthermore, temporally correct transmission of messages must be ascertained to prevent causality errors. In this context, a message encapsulates a TLM transaction, timing information as well as the chosen TLM communication interface. The required connector block modelling primitives are provided in a dedicated SYSTEMC-LINK support library, which must be statically linked to each segment, similar to how it is done with the SYSTEMC kernel as shown in Figure 9.3.

Channels support two different methods for transmitting messages: the queue-based approach is presented in Section 9.3.1, while Section 9.3.2 introduces the IMC-based one. The choice of transmission method is based on the TLM communication interface that the transaction originates from. In order to relay blocking transport calls, the queue-based communication approach is used. DMI or debug calls employ the IMC-based approach for message transmission.

9.3.1 Queue-based Communication

The queue-based communication flow is used exclusively for transactions transmitted using the TLM blocking transport interface. It respects the timing semantics by annotating the time spent to forward the transaction through the channel to the receiver using the local time offset parameter ∆ttx of the TLM blocking transport interface. Race conditions are prevented by forwarding all transactions within the context of the virtual sequential environment of the receiver. Similar to the enhanced TLM target sockets of SCOPE, this approach employs per-channel transaction queues for this task.

[Figure: slave connector block (TLM target socket, transaction queue, wakeup event etx) and master connector block (poll process, thread pool, TLM initiator socket); the numbered steps (1)–(10) of the queue-based flow are referenced in the text below.]

Figure 9.5: Queue-based communication flow

The full queue-based transmission procedure is outlined in Figure 9.5. Internally, the channel uses two queues for the forward and return directions. Once a transaction tx is received by the target socket of the slave connector block (1), it is put into the forward queue (2). The calling SYSTEMC simulation process is then suspended, waiting for a wakeup event etx (3). On the receiver side, a SYSTEMC process continuously polls the forward queue for new messages (4). Polls must be carefully timed in order to avoid wasting execution time while not missing messages. The scheduler defines a limit time tlim,j for the receiving segment j up to which it is safe to run without missing messages. Consequently, polls must be scheduled to run no later than tlim,j.

$\Delta t_{tx} \leftarrow \begin{cases} \Delta t_{tx} + t_i - t_j & \text{if } \Delta t_{tx} > t_j - t_i \\ 0 & \text{otherwise} \end{cases}$ (9.2)

Timing annotation is performed when the transaction is removed from the queue (4). Equation 9.2 shows how the local time offset ∆ttx is adjusted by the time difference between the sending segment at the time stamp ti, when the transaction was put into the queue (2), and the receiving segment at the time stamp tj, when it took the transaction out of the queue (4). Note that time-decoupling guarantees that the time difference between two segments never exceeds the channel latency ∆ci,j (c.f. Equation 9.1). Therefore, transactions must be stated sufficiently ahead of time, so that ∆ttx ≥ ∆ci,j. Otherwise, a timing error ∆εtx = tj − ti − ∆ttx ≤ ∆ci,j is incurred.

Once a message has been received by the poll process and its timing has been adjusted to the local time zone of the receiving segment, the transaction must be passed on to the receiving TLM target port. However, transactions cannot be forwarded directly, since receivers might call wait and thereby block the poll process, stopping it from fetching any further messages. Instead, a SYSTEMC thread pool is utilised to assign each transaction to its own simulation process (5), which then forwards it to its target using a regular TLM initiator socket (6).
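The adjustment of Equation 9.2 amounts to a few sc_time operations on the receiver side; the helper below is a sketch under the assumption that ti and tj are carried along with the message, and is not the actual SYSTEMC-LINK code:

    #include <systemc>

    using sc_core::sc_time;
    using sc_core::SC_ZERO_TIME;

    // dt_tx: local time offset annotated by the sender; t_i: sender time when the
    // transaction was enqueued; t_j: receiver time when it is dequeued.
    sc_time adjust_local_offset(const sc_time& dt_tx,
                                const sc_time& t_i, const sc_time& t_j) {
        // Equation 9.2, rearranged as dt_tx + t_i > t_j to avoid negative
        // intermediate sc_time values (sc_time is unsigned).
        if (dt_tx + t_i > t_j)
            return dt_tx + t_i - t_j;   // shift offset into the receiver's time zone
        return SC_ZERO_TIME;            // intended time already passed: timing error
    }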

A similar approach is taken on the return path. Once the TLM initiator socket receives the transaction response (7), it is put into the backward queue (8), before the associated simulation process is returned to the thread pool for efficient re-use. On the sending side, a collect process continually polls the backward queue for new incoming responses. Its timing is identical to that of the poll process on the receiver side: polling attempts are performed once simulation time reaches tlim,i and messages are retrieved until the queue is empty. Once a transaction tx has been collected from the backward queue in such a way (9), its associated wakeup event etx is retrieved and notified (10). Given that this operation cannot block, no thread pool is necessary for receiving responses. The notification allows the initiating process to resume operation, fetching the transaction response and returning the result to the caller.

9.3.2 IMC-based Communication

The IMC-based communication flow is used for TLM communication interfaces that cannot normally yield control, such as the debug and direct memory interfaces. Consequently, the wakeup event notification procedure utilised for the queue-based flow cannot be applied, since waiting on etx is forbidden (c.f. Figure 9.5, step 3). Instead, master connector blocks propagate an IMC interface to their peering channels at the beginning of the simulation. This interface holds function pointers with the call signatures of both considered TLM protocols, i.e., DMI and debug. Using this interface, slave blocks can forward IMCs directly into the connected segment.

When the TLM target socket of a slave connector block receives an IMC, it first queries the channel interface for access to the function pointers. Access may not be granted if the master block of the peered segment does not support the requested TLM communication interface. Otherwise, the call is forwarded by invoking the IMC that is pointed to by the channel interface. The call invokes upcall methods within the master connector block, which in turn forward the request using the regular TLM protocol methods provided by the local TLM initiator socket. Note, however, that since caller and callee of the IMC reside in different segments, race conditions on internal model state may appear if it is exposed to other threads in this way.

To avoid those, channels also support exclusive IMCs. A channel marked for exclusive IMC propagation ensures that calls into peer segments are guaranteed to be race free. Fundamentally, this is achieved by allowing an exclusive IMC of a segment to execute only when all other segments are paused by the SYSTEMC-LINK simulation controller. The mechanism used for this purpose is illustrated in Figure 9.6. Once an exclusive IMC is issued (1), it is received by the TLM target socket of the slave connector block. The block then continues to retrieve the IMC interface from the channel (2). Should the channel not have a cached copy of the interface, it may re-request it from the peered master connector block (3), which responds accordingly if the interface is supported by the receiver (4). The result of this operation is subsequently reported to the initiating slave connector block (5), which may either abort the operation if no suitable interface was returned, or prepare the execution of the actual IMC. This procedure so far has been identical to that of a non-exclusive IMC.
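As an illustration of such a function-pointer interface (struct and function names are assumptions, not the actual SYSTEMC-LINK types), a master connector block could publish something along the following lines, which the slave block then invokes to forward debug and DMI calls:

    #include <tlm>

    // Interface a master connector block could publish through its channel.
    struct imc_interface {
        void* target = nullptr;   // opaque handle to the master connector block

        // call signatures mirror the TLM-2.0 debug and DMI interfaces
        unsigned int (*transport_dbg)(void* target,
                                      tlm::tlm_generic_payload& tx) = nullptr;
        bool (*get_direct_mem_ptr)(void* target, tlm::tlm_generic_payload& tx,
                                   tlm::tlm_dmi& dmi) = nullptr;
    };

    // Slave-side forwarding: invoke the upcall only if the peer supports it.
    inline unsigned int forward_dbg(const imc_interface& imc,
                                    tlm::tlm_generic_payload& tx) {
        return imc.transport_dbg ? imc.transport_dbg(imc.target, tx) : 0u;
    }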

[Figure: slave connector block (TLM target socket) and master connector block (TLM initiator socket, shared model state) on top of the SYSTEMC-LINK simulation controller; steps (1)–(5) retrieve the IMC interface, steps (6)–(8) yield and resume the involved segments, and steps (9)–(11) perform the transport_dbg call and return its result, as referenced in the text below.]

Figure 9.6: IMC-based communication flow

At this point, a non-exclusive IMC would just perform the actual call and return the result to the initiator on its segment. For exclusive IMCs, however, a different route must be taken. To ensure atomic execution, all other currently executing segments must be paused, so as to prevent all active simulation processes from interfering. A corresponding request is issued by the initiating segment to the SYSTEMC-LINK simulation controller. Since the controller employs a cooperative scheduling approach (see Section 9.2.2), it cannot directly halt actively running segments and must instead wait until they yield control voluntarily. Given that the request cannot be fulfilled immediately by the simulation controller, the execution of the initiating segment must first be paused (6). All currently active segments are allowed to finish simulation of their current quantum, i.e., until each segment i reaches the tlim,i timestamp. Afterwards, they are not rescheduled and instead put into an inactive state (7).

Once all other segments that could potentially cause race conditions have been paused, the initiating segment is allowed to resume operation (8). While all other segments are waiting, it is now allowed to execute its IMC by invoking the function pointer from the channel interface (9). Note that should the call be further forwarded to another segment, the exclusivity guarantee remains intact, even if the connecting channels have not been marked for exclusive IMCs. Once this call returns (10), the end of the exclusive IMC is signalled to the simulation controller, which in turn allows all other segments to resume execution. Finally, the response is returned to the initiating component via the local TLM target socket of the slave connector block (11).

Note that it is technically not required to pause all simulation segments. Instead, it would be sufficient to only halt those which interact with the segment the exclusive IMC is directed at. However, since TLM interface call chains may span multiple segments, it is difficult to keep track of which segments are involved and may therefore be subject to race conditions. Until more elaborate analysis techniques become available that can keep track of function pointer based IMC call chains, the decision has been made to assume that all other segments may be affected and must therefore be paused.
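The pause-and-resume handshake can be pictured with a small synchronisation primitive; the class below is a rough sketch (all names invented, and the cooperative coroutine scheduling of the real controller is not modelled), not the actual SYSTEMC-LINK code:

    #include <condition_variable>
    #include <mutex>

    // Counts segments that are currently executing a quantum. run_exclusive()
    // waits until all of them have yielded (step 7) and keeps new quanta from
    // starting while the exclusive IMC runs (steps 8-10).
    class exclusive_imc_gate {
        std::mutex mtx;
        std::condition_variable cv;
        int running = 0;

    public:
        void quantum_begin() {
            std::lock_guard<std::mutex> lock(mtx);   // blocks while an exclusive IMC runs
            ++running;
        }

        void quantum_end() {
            std::lock_guard<std::mutex> lock(mtx);
            --running;
            cv.notify_all();
        }

        // Called from the initiating segment, which itself counts as running.
        template <typename Fn>
        void run_exclusive(Fn&& fn) {
            std::unique_lock<std::mutex> lock(mtx);
            --running;                                // step 6: pause the initiator
            cv.wait(lock, [&] { return running == 0; });
            fn();                                     // the IMC executes in isolation
            ++running;                                // initiator resumes its quantum
        }
    };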

[Figure: segment 0 "primary" (cpu.o, bus.o, SYSTEMC library, libscl) and segment 1 "secondary" (cpu.o, SYSTEMC library, libscl) expose the init, step and exit interface to the SYSTEMC-LINK controller, which contains the task queue, the thread pool (threads 0 and 1) and the channel interfaces used by the communication blocks.]

Figure 9.7: Composition of a fully featured VP based on SYSTEMC-LINK

9.4 Simulation Structure and Composition

Figure 9.7 gives an overview of the composition of a fully featured VP utilising SYSTEMC-LINK to achieve maximum parallel performance. It is based on the example dual-core platform depicted in Figure 9.2. Fundamentally, it shows the main components of a SYSTEMC-LINK simulator: the simulation controller, the simulation segments and their interconnecting TLM communication channels. However, Figure 9.7 additionally depicts the interactions carried out between those components, enabling a holistic perspective for a detailed explanation:

• The Simulation Controller lies at the heart of the design. As described in Section 9.2, its main task lies in driving the simulation by orchestrating the execution of segments using a cooperative scheduling scheme with coroutine semantics. Internally, each segment is modelled as a task, which has to be executed. In this context, task execution refers to either simulating a segment up to its limit timestamp or only until it wants to advance its time, depending on the chosen scheduling mode (c.f. Section 9.2.2).

• Segment tasks are organised within a task queue, which draws upon a thread pool for parallel execution; a minimal sketch of such a task queue is given after this list. By separating tasks from threads, it is possible to exactly control the amount of host resources that should be allocated towards the simulation. For example, even if running a simulator composed of multiple segments, it is still possible to only utilise a single thread, thereby eliminating race conditions and producing a deterministic scheduling behaviour. Otherwise, the thread pool can be configured to spawn one thread per task to achieve optimal parallel performance.

• The simulation controller also holds channel interfaces for each channel present between segments. Using these interfaces, communication blocks can not only transmit data, but also directly talk to the simulation controller and scheduler, similar to a system call in a modern OS.

• Within the simulation segments, the support library libscl serves as the counterpart to the channel interfaces of the simulation controller. It contains the actual implementations of the master and slave connector blocks, supporting both the queue and IMC based communication flows as introduced in Section 9.3.1 and Section 9.3.2, respectively. Once instantiated, these blocks automatically register themselves with the simulation controller, which in turn grants access to the internal forward and backward queues as well as the IMC call interface of the associated channel.

• The second important task of libscl is to provide the control interface used by the tasks within the controller to advance simulation. Fundamentally, SYSTEMC-LINK requires its segments to implement the init, step and exit functions. For example, step is called whenever a task is executed. Unless a user provided alternative is specified, libscl includes a default implementation for this purpose, which directly interfaces with the SYSTEMC kernel as described in Algorithm 9.1.

• Finally, each segment provides an enhanced virtual sequential environment, enabling multiple use of models that rely on global state, e.g. the processor model depicted in Figure 9.7, thanks to global state replication. The design of simulation segments as individually compiled shared objects renders race conditions impossible, unless state gets explicitly shared via the standard TLM communication channels, such as DMI.
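The task queue and thread pool mentioned in the list above can be pictured as follows; this is a minimal, generic sketch with assumed names rather than the controller's actual implementation:

    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class task_pool {
        std::queue<std::function<void()>> tasks;
        std::mutex mtx;

    public:
        void submit(std::function<void()> task) {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push(std::move(task));
        }

        // Run all queued tasks on n worker threads; n == 1 yields deterministic,
        // strictly sequential execution of the segments.
        void run(unsigned n) {
            auto worker = [this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::lock_guard<std::mutex> lock(mtx);
                        if (tasks.empty())
                            return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();   // e.g., step a segment up to its limit time
                }
            };
            std::vector<std::thread> threads;
            for (unsigned i = 0; i < n; i++)
                threads.emplace_back(worker);
            for (auto& t : threads)
                t.join();
        }
    };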

It should be noted that despite SYSTEMC-LINK being designed for use with SYSTEMC based VPs, its design is flexible enough to also incorporate simulators employing a different SLDL, e.g., SPECC. In this case, a custom support library must be implemented that provides the interface functions init, step and exit for that specific domain. For inter-segment communication, custom connector blocks must be provided as well. Afterwards, it is possible for simulation models from different domains to communicate with each other, thanks to the common channel interface.
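A custom support library essentially has to expose the segment control interface; the stub below sketches what that might look like, with the exact SYSTEMC-LINK ABI, the time unit and the renaming of exit being assumptions made purely for illustration:

    #include <cstdint>

    extern "C" {

    // construct the module hierarchy of this segment
    void init() {
        // e.g., instantiate the foreign simulator's top-level design here
    }

    // advance the local simulator up to t_lim (given here in picoseconds, an
    // assumed unit) and return the new local time of the segment
    std::uint64_t step(std::uint64_t t_lim_ps) {
        // a non-SystemC SLDL would pump its own event loop until t_lim_ps
        return t_lim_ps;
    }

    // counterpart of the exit interface function; renamed in this sketch to
    // avoid clashing with the C library's exit()
    void exit_segment() {
        // tear down the module hierarchy and release resources
    }

    }  // extern "C"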

9.5 Experimental Results

The experimental evaluation of SYSTEMC-LINK is performed in three parts. First, a synthetic VP is employed in Section 9.5.1 to study the impact of the scheduling mode on performance and timing accuracy. Next, Section 9.5.2 studies the potential benefits of the channel latency network compared to a static global lookahead as it is employed by SCOPE. Lastly, Section 9.5.3 adapts the OpenRISC multicore platform ORVP for use with SYSTEMC-LINK. Its parallel performance results are assessed using the industry standard OSCI SYSTEMC kernel as a baseline reference.

[Figure: five segments connected by latency channels. Segments 1 and 5 form one tightly coupled group (channel 1 → 5 with ∆c1,5 = 10 ns), while segments 2, 3 and 4 form a second group (channels 2 → 3, 3 → 4 and 4 → 2, each with a latency of 10 ns). The two groups are linked by the channels 2 → 1 with ∆c2,1 = 1 µs and 5 → 3 with ∆c5,3 = 1 µs. Each segment contains sender (send), receiver (recv) and worker (work) modules as described below.]

Figure 9.8: Channel latency network experiment setup

Each experiment was repeated a fixed number of times and only averages are reported here. Measurement was performed on a quad-core Intel i7 workstation PC clocked at 2.67 GHz with 12 GiB RAM. To ensure consistent benchmarking results, temperature based dynamic overclocking (Intel Turbo-Boost) was disabled. More detailed information about the simulation host, as well as repetition count and the runtime of single experiment iterations can be found in Appendix C.

9.5.1 Scheduling Mode Analysis

The first set of experiments analyzes the impact of the choice of scheduling mode when sending cross-segment transactions insufficiently ahead of time, i.e., ∆ttx < ∆ci,j for a transaction tx sent from segment i to segment j. First, a simulation is constructed that consists of five segments, which are interconnected with latency channels as shown in Figure 9.8. For each channel, the corresponding channel latency is annotated next to it. The scenario modelled in this figure corresponds to a system consisting of two loosely coupled subsystems, which feature low-latency internal communication. Such setups are frequently encountered in multi-processor systems featuring a DSP subsystem for audio or video processing, for example. Furthermore, Figure 9.8 also illustrates the three main components residing inside each segment:

• Sender module: for each outgoing channel, a segment has a sender module, denoted as send. Its task is to continuously send regular TLM transactions at a fixed frequency fsend with ∆ttx = 0. Setting the local time offset to zero allows straightforward extraction of the transaction timing error ∆εtx, which occurs due to the time-decoupling employed between sending and receiving module.

• Receiver module: in order to receive transactions transmitted by sender modules, segments contain a reception module (recv) for each incoming channel connection. After validating correct transmission of the transaction, it immediately replies to the sender with a success response. Reception of this response by the sender indicates the end of a transaction.

Channel    (a) ASAP mode    (b) ALAP mode
1 → 5      1.3 ∆c1,5        1.47 ∆c1,5
5 → 3      1.95 ∆c5,3       0.07 ∆c5,3
2 → 1      0.06 ∆c2,1       1.96 ∆c2,1
2 → 3      1.4 ∆c2,3        1.57 ∆c2,3
3 → 4      1.4 ∆c3,4        1.57 ∆c3,4
4 → 2      1.4 ∆c4,2        1.56 ∆c4,2

Figure 9.9: Average transaction timing error ∆εtx per channel

• Worker module: a worker module is used to model simulation load incurred during simulation of active components, such as processors or accelerators. The worker module (work) executes NOP instructions in a loop until the desired amount of host cycles cwork has been consumed. This process is repeated continuously at a fixed frequency of fwork.
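A possible realisation of such a worker module is sketched below; module and member names are assumptions, and the busy loop only approximates the host cycle count:

    #include <systemc>

    SC_MODULE(worker) {
        sc_core::sc_time period;          // 1 / f_work, e.g. 10 ns for 100 MHz
        unsigned long long c_work;        // host cycles to burn per activation

        void run() {
            while (true) {
                wait(period);             // advance simulated time by one period
                // busy loop approximating c_work host cycles of simulation load
                for (volatile unsigned long long i = 0; i < c_work; i++)
                    ;
            }
        }

        SC_CTOR(worker) : period(10, sc_core::SC_NS), c_work(10000) {
            SC_THREAD(run);
        }
    };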

Similar to the channel latencies, the actual values chosen for cwork, fwork and fsend differ segment by segment in order to accurately model the intended scenario. The exact values selected for this set of experiments are shown in Table 9.1.

Segment    cwork              fwork      fsend
1          10k host cycles    100 MHz    1 MHz
2          10k host cycles    100 MHz    10 MHz
3          10k host cycles    100 MHz    10 MHz
4          10k host cycles    100 MHz    10 MHz
5          1M host cycles     1 MHz      1 MHz

Table 9.1: Experiment segment configuration

The experiment consists of running the presented simulation setup for a total duration of 1 ms using two host threads. Afterwards, the average transmission time the transaction spent in the channel is extracted by measuring the time until a response from the peered segment has been received. Given that the local time offset used by the sending modules is ∆ttx = 0, the measured time delta directly yields the transaction timing error ∆εtx. Its average is derived from repeated experiment iterations. Figure 9.9 shows the average ∆εtx for each channel i → j as a multiple of its channel latency ∆ci,j for both ASAP and ALAP scheduling modes.

In general, it can be observed that 0 ≤ ∆εtx < 2∆ci,j. The upper bound is given as the transactions need to pass through a channel twice: once on the forward path

[Figure: (a) simulation runtime in seconds over one to five threads for the global and local variants; (b) speedup of the local variant over the global baseline.]

Figure 9.10: Performance results for the global and local channel latency network experiment variants

and once on the return path, thereby being delayed by up to ∆ci,j each time. The lower bound takes effect when, upon sending the transaction, the sender is ahead of the receiver and, upon responding, the receiver is ahead of the sender. In this case, no extra delays need to be added to ∆ttx to ensure the intended reception timestamp has not yet elapsed in the context of the receiving segment. This preserves causality and prevents future transactions from affecting past state.

Furthermore, it can be seen that ASAP scheduling yields an 8.3% lower ∆εtx compared to ALAP mode, resulting in smaller overall simulation time alterations. Since segments yield execution each time before advancing to their next timestamp, their simulation times are kept closer together. However, due to the reduced amount of context switches, simulation in ALAP mode ran 15% faster than in ASAP mode.

9.5.2 Channel Latency Network Analysis

The next set of experiments analyses the benefits of a channel latency network by comparing it to a static global lookahead approach, such as the one employed by SCOPE. For this purpose, the experiment setup described by Figure 9.8 and Table 9.1 is reused. This regular setup is denoted as the local variant, because it features local, per-channel latencies. To emulate a global lookahead and ensure comparable timing behaviour, these individual channel latencies must be set to their overall minimum, i.e., ∆cglobal = min ∆ci,j = 10 ns. The resulting experiment setup is henceforth denoted as the global variant.

The experiment consists of measuring the simulator runtime of both variants for a simulation duration of 1 ms. The experiment is repeated while increasing the number of threads available to the controller from one to five. Average runtimes of both variants are depicted in Figure 9.10a, while Figure 9.10b presents the speedup of the local variant using the global lookahead approach as a baseline.

Overall, performance improves when using a channel latency network versus a static global lookahead for the presented scenario. The improvement varies between 1% (1.01×) in the single thread case and 22% (1.28×) when using three threads. Using a global lookahead, the simulation has to wait for the worker module of segment 5 every microsecond to finish execution of its comparably long NOP loop (c.f. Table 9.1). Since all other segments may only simulate ahead by up to 10 ns as dictated by the channel latency towards their peers, not enough parallelizable workload is available to utilise more than a single thread on the host machine.

Permitting different latencies on each channel causes the formation of tightly coupled segment groups as illustrated by Figure 9.8: the first group consists of segments 1 and 5, while the remaining segments 2, 3 and 4 form a second group. Those groups may simulate ahead of each other by 1 µs according to the latencies of channels 2 → 1 and 5 → 3. While one thread is busy simulating the worker module of segment 5, other threads are free to advance the simulation of segments from the second group independently, yielding a better utilisation of the available simulation threads.

Situations with loosely coupled groups of tightly coupled modules are frequently encountered in typical ESL scenarios: load spikes such as those produced by the worker of segment 5 can be caused by I/O peripheral components, for example a virtual hard disk that stores its content infrequently to the file system of the host computer. Another example is a virtual network adapter causing load spikes when sending or receiving Internet packets via the host network.

9.5.3 OpenRISC Multi-Core Platform

The final set of experiments studies the applicability of SYSTEMC-LINK within a realistic VP from the EDA domain. The OpenRISC multi-core platform ORVP presents itself as an interesting candidate here for multiple reasons: first, it is detailed enough to run an unmodified version of the Linux kernel and therefore allows performance studies of the Linux boot scenario, which is a prime use-case for VPs targeting software development. Furthermore, given its multi-core nature, enough parallelizable workload in the form of ISSs is available to effectively utilise parallel simulation. Finally, the reference instruction set simulator for OpenRISC, or1ksim [18], relies heavily on global state and therefore presents an opportunity to evaluate the global state replication feature of the enhanced virtual sequential environment of SYSTEMC-LINK.

Before the experiment can be conducted, the reference ORVP must first be converted for use with SYSTEMC-LINK. This process consists of three stages: first, it must be decided which components of the platform should be grouped into segments. Given that the majority of runtime is spent in the ISSs, it is best to have one ISS per segment as a rule of thumb. After partitioning, the second stage establishes the communication channels and their latencies between the segments. This is done by placing connector blocks at segment boundaries, where the regular TLM communication channels have been cut. Finally, the last stage recomposes the VP using a platform description file, which describes the individual segments as well as their channel interconnections. Here, a benefit of the SYSTEMC-LINK approach is that the number of processors to be simulated can easily be increased by adding another ISS segment to the platform description.

[Figure: Segment 1 contains OpenRISC core #1 (or1ksim) together with RAM, ROM, MPIC, UART, ETH, FB and SPI peripherals, interconnected via TLM connections and interrupt signals; Segments 2, 3 and 4 each contain one additional OpenRISC core (or1ksim, #2–#4) attached through connector blocks (CB).]

Figure 9.11: ORVP split into segments for use with SYSTEMC-LINK

The resulting SYSTEMC-LINK variant of the ORVP is presented in Figure 9.11 after applying these three stages. Based on this partitioning approach, four different variants of ORVP have been constructed:

• OSCI: this variant of the ORVP represents the OSCI reference implementation using only standard, sequential SYSTEMC. Because the chosen ISS, or1ksim, uses global state, this variant only features a single processor. The remainder of this platform is composed of the interconnects and peripheral components shown in Segment 1 of Figure 9.11.

• SCL/UP: a uniprocessor variant featuring only Segment 1 from Figure 9.11. Since the simulated components match those considered by the OSCI variant, this variant can serve as a baseline for assessing the simulation overhead introduced by the SYSTEMC-LINK simulation controller.

• SCL/SMP2 and SCL/SMP4: thanks to the global state separation of SYSTEMC-LINK, the construction of multi-core platforms based on or1ksim becomes possible. Consequently, both variants feature secondary processors, which are modelled by instantiating CPU segments (e.g., Segment 2 in Figure 9.11) multiple times.

[Figure: bar chart of simulator performance in MIPS (0–20 MIPS) over the maximum number of threads (1, 2 and 4) for the OSCI/UP, SCL/UP, SCL/SMP2 and SCL/SMP4 variants.]

Figure 9.12: Simulation speed of ORVP using OSCI and SYSTEMC-LINK

The Linux boot scenario has been chosen as the application benchmark. This is a prime use-case for VPs, given that most early work during HW/SW codesign addresses kernel porting and adaption to the new target design. During system boot, most architecture-dependent code is executed, further underlining the need for a fast VP in absence of a physical prototype. Faster simulation speeds in this scenario directly lead to increased programmer productivity, since debug cycles are shortened.

This set of experiments consists of booting the OpenRISC Linux kernel on the previously introduced variants of ORVP while recording simulation runtime using one, two, and four threads. In order to allow the VP to finish the booting procedure, simulation duration is set to 2 s. The channel latencies are set to match the TLM quantum of 5 µs and the scheduler uses ALAP scheduling for optimal performance. However, it should be noted that because or1ksim does not support SYSTEMC sleep models, all cores remain active and produce parallelizable simulation load.

Given that the SCL/SMP2 and SCL/SMP4 variants simulate significantly more processors, a direct comparison of the runtimes of all variants appears unfair, though. Instead, assessment of the different variants is done in terms of Million Instructions Per Second (MIPS) of all processors in the system combined: the more OpenRISC instructions can be processed by the VP per wall-clock second, the higher its absolute performance. Consequently, Figure 9.12 presents the total MIPS achieved by each variant over the maximum number of threads it is allowed to use.

In the single threaded case, all variants perform similarly, showing only a marginal maximum performance difference of 6.8% between the uniprocessor and quad-core variants. In this case, the highest performance of 5.42 MIPS is achieved by the OSCI variant, while the SCL/SMP4 variant only reaches the minimum of 5.05 MIPS. Naturally, the overhead of a simulation controller interfacing with the simulation of four segments is expected to show a performance reduction of this magnitude for the given scenario. The direct competitor of the OSCI variant, i.e., the SCL/UP variant, achieves 5.34 MIPS, a reduction of only 1.5%. This comparably low overhead indicates that further optimisations in the design of the simulation controller would yield little additional benefit.

When increasing the number of threads available to the simulation, the performance of the OSCI and SCL/UP variants remains the same. Since the former variant employs sequential simulation technology, it is not capable of scaling with the number of threads. While this is not the case for the second variant, it does not have enough parallelizable workload to utilise the additional threads effectively, since it only simulates a single segment. However, performance improves for the SCL/SMP2 and SCL/SMP4 variants, showing a speedup over OSCI of 1.83× and 1.75×, respectively. The maximum performance of 17.53 MIPS is achieved by the SCL/SMP4 variant using four threads. At this point, all other variants have stopped scaling as their worker threads starve from a lack of work, i.e., not enough segments are present. Using four threads, SCL/SMP4 achieves a speedup of 3.23× over regular SYSTEMC.

9.6 Limitations and Outlook

SYSTEMC-LINK has proven to be an effective approach for combating the performance bottleneck that threatens the viability of VPs as SW development tools. However, its design also exposes deficits of the approach that might require additional research in order to be tackled effectively. In this context, a fundamental issue is the separation of a fully featured VP into segments, which can currently only be done based on expert experience. Multiple decisions need to be made while considering the conservation of functional and temporal correctness of the parallel simulator: which components can be grouped into segments in order to produce a balanced load distribution? Which channel latencies must be chosen to reduce the temporal error while retaining enough scheduling flexibility to avoid thread starvation? Clearly, more work is required in this domain in order to bring forth tools and methodologies that provide VP designers with sufficient insight into their designs to tackle those challenges. For example, a SYSTEMC profiling tool could be employed to identify performance bottlenecks, i.e., hotspots such as ISSs. Conventional profilers for C/C++ programs lack the ability to provide context for the SYSTEMC domain, such as the runtimes of simulation processes or the detection of an excessive amount of delta cycles for a single timestamp. However, such information plays a crucial role in the identification of performance bottlenecks and consequently must be taken into account when deciding on the optimal segment partitioning for SYSTEMC-LINK.

However, even after the segmentation decision has been made, further work is required. The actual partitioning procedure is still a manual process and can only reasonably be completed by a VP expert. While model source code is not required, the approach still depends on the availability of and insight into the high level sources, e.g., SYSTEMC wrappers and interconnect modules, in order to place connector blocks. Separation of the monolithic VP into multiple segments usually also requires a reorganisation of its source code, which has further implications on project management, development and operation. A tool like Synopsys PlatformArchitect [182] can help in this context, because it provides a graphical user interface for the placement and interconnection of SYSTEMC models. This tool already offers a mechanism for interconnecting components that use different TLM protocols for communication by providing automatically generated bridge components. A similar approach could be used to place connector blocks automatically at locations where TLM connections have been cut by segment boundaries.

Besides conceptual limitations, the prototype implementation of SYSTEMC-LINK also suffers from a set of practical limitations. For example, connector blocks currently only support cross-segment communication via single valued SYSTEMC signals or a subset of the standard TLM interfaces, i.e., blocking, debug and DMI. Models that rely on the non-blocking transport protocol of TLM currently need to use an adaptor mechanism to bridge from the non-blocking to the blocking transport interface, thereby sacrificing timing accuracy in the process. Given the availability of exclusive IMCs in SYSTEMC-LINK, support for non-blocking communication becomes feasible, even for legacy models. Future implementations for this interface may be done analogously to the present designs of the debug and DMI protocols.
However, it is questionable whether the timing inaccuracies caused by the time-decoupling employed between segments produce acceptable results. Further research is required in this domain.

Moreover, several opportunities present themselves for improving the performance while retaining timing accuracy. For example, experiment results from Section 9.5.1 indicate that the ASAP scheduling mode produces simulations with more accurate timing, while the ALAP mode increases overall simulation performance. Currently, the scheduling mode is fixed and defined at simulation start, which forces VP users that require high timing accuracy during a short time interval to globally forgo the performance benefits of the ALAP mode. A better solution would be to allow users to dynamically adapt the selected scheduling mode according to their needs. For example, during Linux boot, exact timing behaviour might not be necessary, and simulation should be sped up using the ALAP scheduling mode. When a developer debugs an issue that only occurs during an interrupt and requires high accuracy while it executes, the simulation controller could temporarily adapt and switch to ASAP scheduling. Once the interrupt routine has completed, it could be switched back automatically.

Finally, it could be investigated whether SYSTEMC-LINK can be effectively used as a foundation for multi-domain simulators. Currently, EDA tool providers witness the trend to embed a VP into its physical or algorithmic environment by combining its simulator with those of other domains. A popular example of this is the automotive industry: a VP models an engine control unit, receives sensor input and controls actuators, which are themselves part of a physical simulation and are typically designed with model based design tools, such as Matlab Simulink. Given that those simulators are also C/C++ based – just like SYSTEMC – native interoperability seems feasible. In order to tap into those new simulation domains, future work may include new support libraries, i.e., new libscls, which translate the control and communication messages sent by the SYSTEMC-LINK controller to that specific domain.

9.7 Synopsis

This chapter has presented SYSTEMC-LINK, a new concept for parallel SYSTEMC simulation that works on a hypervisor level and allows VP designers to use their SYSTEMC kernel of choice. Previous parallel SYSTEMC approaches have generally embedded themselves deeply into the simulation kernel, requiring a kernel replacement before one can take advantage of the parallel performance boost. This can cause compatibility issues and may render any user added value of augmented kernel implementations, such as profiling or tracing functionality, unusable. SYSTEMC-LINK circumvents this issue by modelling a VP as a collection of segments, each of which embeds its own kernel. Moreover, segments are interconnected by a channel latency network, which employs time-decoupling to further boost simulation performance by allowing minor timing alterations.

Besides being kernel agnostic, SYSTEMC-LINK offers the benefits of an enhanced virtual sequential environment. This allows the use of legacy models within a parallel environment by automatically protecting them from race conditions that would otherwise occur when concurrent components communicate via unprotected IMCs. Moreover, SYSTEMC-LINK allows multiple instantiation of modules that rely on global state. Segments offer global state replication, which creates per-segment copies of all global variables. This way it was possible to construct a multi-core OpenRISC VP based on the reference ISS or1ksim. This has been impossible before, since or1ksim stores all processor state for a single core globally and therefore cannot be instantiated multiple times. Starting from a single core design at 5.4 MIPS, the resulting multi-core simulation showed a speedup of 3.2×, achieving a maximum performance of 17.5 MIPS thanks to SYSTEMC-LINK.

Chapter 10

Conclusion

VPs have a scalability problem. Unlike conventional hardware, VPs do not gain performance the more processors are added, but rather become slower. This is a huge problem for EDA tool providers, given that most computing platforms – embedded and high performance computing – are designed today with multi-core processors. In order for VPs to remain viable as SW development tools, this performance bottleneck must be overcome. At the root of the problem lies the DES approach itself. Its sequential nature cannot easily be parallelized and therefore does not scale with today's multi-core host computers that are used to run VPs.

The research domain of PDES has identified various ways to parallelize simulators, but all proposed solutions have so far been domain specific. Yet, no approach has been identified that can be applied universally to accelerate SYSTEMC based VPs. Their nature introduces new challenges that render conventional parallelization approaches difficult, e.g., nondeterministic model behaviour, a lack of a priori communication knowledge and race-prone legacy source code.

This work has presented various approaches that apply the concepts of PDES, while also providing solutions to the aforementioned problems. Special attention has been paid to ensure their applicability in realistic scenarios by testing them with realistic VPs as employed within the EDA industry. The contributions are briefly summarised in Section 10.1. Afterwards, Section 10.2 concludes this work with potential future research directions for fast simulation technologies for ESL design.

10.1 Summary

Hardware inherently operates in parallel. Consequently, SLDLs aiming to model hardware must also incorporate features tailored specifically to describe this inherent concurrency. SYSTEMC employs the concept of simulation processes, which are defined for each model by its designers. Utilisation of such explicit concurrency information appears as an efficient strategy to construct a parallel simulator. Chapter 5 has followed this route and presented the time-decoupled parallel SYSTEMC kernel SCOPE. It employs a novel synchronisation algorithm that relaxes timing constraints and allows individual processes to run ahead of simulation time. Reducing synchronisation overhead proved an efficient approach, accelerating the simulation of the VEP by up to 7.6× using 8 threads and outperforming comparable DES approaches, such as PARSC, by 24 – 63%. Moreover, SCOPE takes special care to prevent race conditions within model code by offering a virtual sequential environment. This enables efficient reuse of legacy source code and facilitates integration of SCOPE into an existing VP.


The implications of the timing alterations of SCOPE are further investigated in Chapter 6. During parallel operation, SCOPE normally requires all cross-thread communication to be stated sufficiently ahead of time in order to bridge the time difference between the simulation processes. However, such a requirement poses a hurdle for the integration of conventional VPs that typically do not have such a priori communication knowledge. Moreover, since SCOPE ensures deterministic and race free operation within its virtual sequential environment, certain TLM protocols must be disabled. For example, the DMI protocol cannot be used, since it implies accessing shared memory via pointers from concurrent threads, which is a prime source of nondeterminism. However, without DMI, maximum simulation performance appears unlikely. To work around this issue, Chapter 6 introduces new operation modes for SCOPE that trade determinism and timing accuracy for a boost in parallel performance and easier integration. These new modes enabled the parallelization of the GVP, gaining a performance boost of 3.5× over regular SYSTEMC using four threads.

In order to provide useful solutions for fast ESL simulation, it is not sufficient to provide parallel simulation engines alone. Problems raised by introducing parallelism into a VP can sometimes only be solved by combining support from those engines with novel modelling primitives. One such primitive is introduced by Chapter 7, which tackles the issue of increased requirements of exclusive memory access operations in a parallel VP. In such a VP, it is no longer sufficient for an atomic operation to execute uninterruptedly only in simulation time. Instead, it must now also be atomic in real time, since concurrently operating processes of the simulation engine might otherwise interfere with the RMW sequence. Unfortunately, embedded exclusive access operations like LL/SC do not have equivalent counterparts on the x86 host architecture, so a simple one-to-one mapping is impossible. However, using the primitives introduced in Chapter 7, it became possible to construct a parallel version of ORVP, which models an architecture that heavily relies on LL/SC to synchronise memory access. In this context, the combination of SCOPE and the new exclusive memory access model achieved a performance boost of 3.1× while running relevant industry benchmarks, such as booting Linux and Coremark.

The processor models used in the previous chapters produced a homogeneous load distribution due to the fact that they were always active. Realistic systems, however, frequently employ processor sleep states in order to conserve energy during low load situations. Chapter 8 presents how these states can be exploited using sleep models in order to skip simulation of idle processors. On the one hand, this speeds up simulation as there is less activity to be processed by the SYSTEMC kernel. On the other hand, the models cause the amount of parallelizable work for SCOPE to be linked to the actual workload of the SW executed within the VP. With the load distribution now defined by the target SW, Chapter 8 studies whether the application of both technologies – sleep models and parallel simulation – remains feasible. Results show that in general, both approaches can be seen as orthogonal and augment each other.
This claim is further substantiated by the fact that the best simulation performance of ORVP running mixed or parallel workloads can only be achieved when using both approaches in combination.

Chapter  Simulation Technology           VEP   GVP   ORVP
5        SCOPE SYSTEMC kernel            3.9×  N/A   N/A
6        Flexible Time-Decoupling        –     3.5×  N/A
7        Exclusive Memory Access Models  –     –     3.3×
8        Processor Sleep Models          –     –     3.8×
9        SYSTEMC-LINK Approach           –     –     3.2×

Table 10.1: Overview of simulation technologies and their peak speedups achieved for realistic VPs when using four host threads

Finally, Chapter 9 has addressed the practical issue of achieving parallel SYSTEMC without replacing the simulation kernel. As previously introduced, most PDES approaches in EDA directly target the simulation kernel, given its explicit parallel description in the form of simulation processes. However, a kernel replacement is not a favourable strategy for many EDA vendors, because of extra functionality embedded into their old sequential kernel. The SYSTEMC-LINK approach has been conceived to provide a solution for this problem. It splits the VP into multiple segments, each of which carries its own SYSTEMC kernel. A simulation controller orchestrates their simulation in parallel and handles cross-segment communication using channels, which offer regular TLM protocol interfaces. For optimal simulation performance, the time-decoupling approach of SCOPE has been reused. Special care has been taken to improve the virtual sequential environment for legacy models. The enhanced environment also guarantees race free IMCs as well as global state replication. Using SYSTEMC-LINK, a new multi-core variant of ORVP has been constructed based on the reference ISS that achieves a total of 17.5 MIPS – a speedup of 3.2× over regular SYSTEMC.

Table 10.1 summarises the speedups achieved for the realistic VPs considered in this work. To allow a fair assessment, only results from experiments that utilise four host threads are presented. While the table does not show the intricacies of the individual approaches, such as determinism and thread-safety, it still allows the conclusion that parallel simulation techniques are an effective means to counter the performance bottleneck that is threatening the usefulness of VPs today.

10.2 Outlook

Parallel SYSTEMC is a difficult nut to crack. Consequently, most chapters in this work have already outlined potential future research directions that build on top of the presented tools and fill their feature gaps. These can generally be of two types: conceptual and implementation improvements. Because the field of SYSTEMC based EDA tools is vast, neither SCOPE nor SYSTEMC-LINK can possibly claim to be perfectly suitable for every application scenario. While focus has been directed towards their use as fast SW development tools, features usually employed during DSE, such as non-blocking TLM, have received less attention. In this context, the demand for parallel simulation is less dramatic than for SW development, given that DSE inherently offers a higher potential for scalability, e.g., by running individual test configurations in parallel. Nevertheless, extending the implementation coverage of both simulation engines should still prove beneficial and facilitate model integration.

Overcoming the conceptual issues of time-decoupled SYSTEMC simulation can be expected to require significant additional research. One problem introduced with this concept is nondeterminism, which is a consequence of the use of DMI from parallel threads. However, reproducible behaviour is an often desired feature, especially while debugging race conditions of the modelled target HW. Currently, developers have to pick one: deterministic execution or parallel simulation. Future work therefore might investigate analysis tools and methods that identify concurrent access to shared memory regions and enforce a strict ordering upon them to retain deterministic execution.

Another issue posed by time-decoupled simulators is the tradeoff between timing accuracy and optimal simulation speed. Similar to determinism, exact timing cannot be guaranteed when components that reside in different time zones exchange messages with each other without a priori communication knowledge, i.e., without stating them ahead of time. However, these timing alterations are necessary to retain causality as discussed in Chapters 2 and 6. Optimistic DES approaches might pose an option here, since the detection of causality violations is trivial and could be used to roll the simulation back to a valid state. Rollback functionality is already present in many SW emulators, but is absent from SYSTEMC model libraries. Given that contemporary VPs are a conglomerate of models from various EDA vendors, new tools and methodologies for state-save and state-restore of black box models are needed. Only once this is done can one realistically evaluate optimistic parallel SYSTEMC approaches.

Other opportunities beyond parallel SYSTEMC for fast and scalable simulation at the ESL may also warrant the attention of researchers. One such opportunity is to move the focus of optimisation into the simulation hotspot and directly parallelize within the ISSs. Given that a typical VP spends the overwhelming majority of its runtime within its ISSs, they have traditionally been a prime candidate for optimisation. A parallel ISS could offload the execution of its cores to individual host threads. For DBT based engines the extra opportunity exists to offload the conversion of target code to host code from the main simulation thread to a secondary worker thread. As the industry moves forward, it also claims new application domains.
Recently, full system simulators have become popular within the automotive industry, opening up multi-domain full system simulation as a new VP application field. Such simulators combine simulation of the physical domain with ESL simulators, e.g., by coupling a brake or gear shift model directly with a VP. They offer straightforward parallelization potential by offloading the simulation of the algorithm and physical domains to other threads. Moreover, processing systems in modern cars usually share a distributed nature, with control units being connected to individual components, such as an engine control unit, and communicating over a shared bus infrastructure. Such designs may favour similarly distributed simulation approaches that capitalise on domain specific intricacies to achieve optimal simulation performance, in a similar way to how it has been done for traditional ESL design in this thesis.

Appendix A

SystemC/TLM Simulation Overview

This chapter gives technical background information on SYSTEMC and TLM. It is intended as an update or refresher for readers who already have experience with simulator design or have previously worked in the VP domain, but does not replace a comprehensive introduction for novices. For that purpose, further resources exist that provide an in-depth tutorial and an extensive reference for SYSTEMC and TLM [79, 80, 67, 19, 44, 26]. Subsequently, Section A.1 presents fundamental aspects of SYSTEMC, before Section A.2 introduces common TLM communication interfaces.

A.1 SystemC Core Concepts

Fundamentally, SYSTEMC is a C/C++ class library containing data types, interfaces and algorithms that are useful and required for constructing simulators. A keystone of SYSTEMC is its Discrete Event Simulation (DES) algorithm, which uses events and processes to describe model behaviour as presented in Section A.1.1. All simulation state in SYSTEMC must be assigned to modules, which in turn are organised in a module hierarchy. This hierarchy is briefly discussed in Section A.1.2, before Section A.1.3 depicts SYSTEMC facilities for realising inter-module communication.

A.1.1 Events and Processes

Events represent discrete points in simulation time, where simulation state is updated. For this purpose, events may be notified, which causes them to trigger at a designated time stamp. However, events do not execute code and are on their own incapable of affecting or altering any simulation state. This task is handled by simulation processes, which are linked to events by means of sensitivity lists, similar to how it is done in VHDL [111] and Verilog [189]. If a process is sensitive to an event, this process will be executed whenever the event is triggered. In SYSTEMC, there are two fundamental types of processes: methods and threads. They are explained in the following:

• Methods are essentially functions, which are called whenever the process should execute, i.e., when one of the events it is sensitive to has been triggered. The called function is then always run until it returns. Methods generally incur low overhead and are most efficient when their runtime is short and their model state alterations are minimal. Many VPs employ method processes to react to interrupts and report the change of an interrupt line to the processor model, as this generally only boils down to setting the interrupt flag of the ISS.


• Threads differ from methods in the fact that they are allowed to pause for a user definable amount of time and thereby voluntarily yield execution to another process. When a thread yields, all its local data, i.e. its stack, is saved and restored once it resumes. Because this procedure is expensive, threads generally suffer lower performance compared to methods. However, the ability to yield and resume execution is unique to threads only.

Finally, SYSTEMC offers two kinds of sensitivity that can be used to bind the execution of aforementioned processes to the triggering of events: static and dynamic sensitivity. Static sensitivity must be declared upon process creation, usually during simulation elaboration. Once a process has received its static sensitivity, it can no longer be changed and remains active until the simulation terminates. Dynamic sensitivity can be altered at any time during execution, but remains only valid until the event triggers next time. Afterwards, dynamic sensitivity is cleared and must be restored manually. Whenever a process is both dynamically and statically sensitive to different events, the dynamic sensitivity receives priority and the statically bound event is ignored until the dynamic one has triggered once.
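To make these concepts concrete, the following minimal sketch uses the standard SYSTEMC macros to declare one method process with static sensitivity and one thread process that notifies the shared event; all module, event and timing values are chosen freely for this example and do not correspond to any model used in this thesis.

#include <systemc.h>
#include <iostream>

SC_MODULE(event_demo) {
    sc_event ev;                       // discrete point in simulation time

    SC_CTOR(event_demo) {
        SC_METHOD(on_event);           // method process
        sensitive << ev;               // static sensitivity, fixed at elaboration
        dont_initialize();             // run only when ev actually triggers

        SC_THREAD(producer);           // thread process, may call wait()
    }

    void on_event() {                  // runs to completion every time ev triggers
        std::cout << sc_time_stamp() << ": event received" << std::endl;
    }

    void producer() {
        for (int i = 0; i < 3; i++) {
            wait(10, SC_NS);           // suspend this thread for 10 ns
            ev.notify(SC_ZERO_TIME);   // trigger ev in the next delta cycle
        }
    }
};

int sc_main(int argc, char** argv) {
    event_demo top("top");
    sc_start(100, SC_NS);
    return 0;
}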

A.1.2 Module Hierarchy

SYSTEMC organises simulation models in so-called modules. Any aforementioned event or process can only be instantiated within a module and simulation processes can only execute code that has been defined as a member function of a module or a derived class. Modules may contain further modules, forming a module hierarchy. In such cases, the outer module is commonly referred to as parent module, while the contained one is called child module. Modules without a parent are called top level modules and are usually instantiated within the main routine of the simulator. Note that SYSTEMC generally allows multiple top level modules to coexist.

VPs typically employ so-called wrapper modules in order to incorporate C/C++ models, such as ISSs, into a SYSTEMC simulation. This is possible, because SYSTEMC is also based on C/C++ and therefore binary compatible. Wrapper modules are consequently only needed as an interface between the models and the rest of the VP. They can be assembled swiftly, since it is usually sufficient to embed an instance of the ISS into the wrapper and create driver and interrupt processes to receive SYSTEMC signals and forward them as needed to the model.
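The wrapper pattern can be illustrated with the following sketch, in which the iss_model class and its step()/raise_irq() interface are hypothetical stand-ins for an arbitrary C/C++ ISS; only the structure of parent, child and wrapper modules is of interest here.

#include <systemc.h>

class iss_model {                       // plain C/C++ model, no SystemC inside
public:
    void step() { /* execute one target instruction (placeholder) */ }
    void raise_irq(bool level) { irq = level; }
private:
    bool irq = false;                   // interrupt flag polled by the (omitted) core loop
};

SC_MODULE(iss_wrapper) {                // wrapper: interface between ISS and VP
    sc_in<bool> irq_in;
    iss_model iss;                      // embedded C++ instance

    SC_CTOR(iss_wrapper) {
        SC_THREAD(drive);               // driver process advancing the ISS
        SC_METHOD(update_irq);          // interrupt process forwarding signal changes
        sensitive << irq_in;
        dont_initialize();
    }

    void drive() {
        while (true) {
            iss.step();
            wait(10, SC_NS);            // crude timing: one instruction per 10 ns
        }
    }

    void update_irq() { iss.raise_irq(irq_in.read()); }
};

SC_MODULE(top) {                        // parent module
    sc_signal<bool> irq;
    iss_wrapper cpu;                    // child module

    SC_CTOR(top) : cpu("cpu") {
        cpu.irq_in(irq);
    }
};

int sc_main(int argc, char** argv) {
    top t("top");
    sc_start(100, SC_NS);
    return 0;
}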

A.1.3 Communication Infrastructure

To allow different modules to communicate with each other in a standardised way, SYSTEMC offers the port and the signal modelling primitives. Ports must be placed within a module and can then either be used for sending or receiving simple data, such as integer or Boolean values. Ports used for sending are called out-ports, while receiving ports are denoted as in-ports.

Ports are connected with signals, similar to how it is handled in VHDL. During elaboration, SYSTEMC checks that every out-port is connected to at least one in-port via a signal and that every in-port is connected to exactly one out-port. Out-ports offer a write method that allows propagation of the written value to all connected in-ports. In-ports offer events that trigger whenever the signal value changes. Processes can be made sensitive to these events in order to get notified about these changes.

Signals are clocked resources, which means that their value only updates after the next simulation delta cycle. If multiple values have been written during a single cycle, only the last one is considered. The update happens after all processes have finished execution, but before simulation time updates. This allows updated signals to notify their events and immediately produce another delta cycle at the current time stamp, where processes can be executed in response to a changed signal.
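A minimal example of this port/signal mechanism is sketched below using the standard SYSTEMC primitives; the module names and timing values are illustrative only.

#include <systemc.h>
#include <iostream>

SC_MODULE(sender) {
    sc_out<int> out;                     // out-port

    SC_CTOR(sender) { SC_THREAD(run); }

    void run() {
        for (int i = 1; i <= 3; i++) {
            out.write(i);                // value becomes visible after the delta cycle
            wait(10, SC_NS);
        }
    }
};

SC_MODULE(receiver) {
    sc_in<int> in;                       // in-port

    SC_CTOR(receiver) {
        SC_METHOD(on_change);
        sensitive << in;                 // triggered by the value-changed event
        dont_initialize();
    }

    void on_change() {
        std::cout << sc_time_stamp() << ": received " << in.read() << std::endl;
    }
};

int sc_main(int argc, char** argv) {
    sc_signal<int> wire;                 // signal connecting the two ports
    sender tx("tx");
    receiver rx("rx");
    tx.out(wire);
    rx.in(wire);
    sc_start(50, SC_NS);
    return 0;
}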

A.2 Transaction Level Modelling

Transaction Level Modelling (TLM) accelerates SYSTEMC simulations by abstracting away unneeded communication detail. Instead of simulating every signal and wire according to the bus protocol, data transmission is modelled in the form of transactions. A transaction is essentially an object that stores information on the communication request, such as operation type (read/write), target address and access width.

The communication flow starts at the so-called initiator, which creates the transaction object and fills in the required information about the operation it wishes to perform. The transaction is then passed along to its intended receiver, denoted as target, using Interface Method Calls (IMCs). On its way there, it may additionally pass through multiple interconnect components, such as buses or bridges. Finally, once the transaction reaches its target, a user provided upcall function is invoked, allowing the model to service the request.

To facilitate deployment of TLM, SYSTEMC provides default implementations and interfaces that are shared among EDA vendors. Instead of ports and signals, TLM uses initiator and target sockets placed within models to allow inter-module communication on transaction level. By default, four communication interfaces are provided by these sockets, catering to the various needs of different application fields. These interfaces are discussed in the following.
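In the standard TLM-2.0 base protocol, the transaction object is a tlm_generic_payload; the following sketch shows how an initiator might fill one in for a four-byte read. The address and buffer are arbitrary example values and the helper function is not part of any library discussed in this thesis.

#include <tlm.h>

void prepare_read(tlm::tlm_generic_payload& tx, unsigned char* buf) {
    tx.set_command(tlm::TLM_READ_COMMAND);    // operation type
    tx.set_address(0x1000);                   // target address (example value)
    tx.set_data_ptr(buf);                     // where the read data should be stored
    tx.set_data_length(4);                    // access width in bytes
    tx.set_streaming_width(4);
    tx.set_byte_enable_ptr(nullptr);
    tx.set_dmi_allowed(false);
    tx.set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);
}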

A.2.1 Blocking Transport Interface

The TLM blocking transport interface offers unidirectional communication between an initiator module and its target using a straightforward protocol. The corresponding IMC is shown in Listing A.1.

void i_socket::b_transport(tlm_transaction& tx, sc_time& offset);

Listing A.1: TLM Blocking Transport Interface

Besides a transaction object, this IMC furthermore requires a local time offset, denoted as ∆t_tx, which is used to annotate communication timing. Unless the initiator is operating ahead of simulation time, ∆t_tx must be initialised to zero and is subsequently incremented by the target and potential interconnects to indicate the time consumed to service the request.

The communication protocol is as follows: first, the initiator initialises the transaction tx and local time offset ∆t_tx as previously described. It then invokes the blocking transport IMC and passes the function arguments accordingly. On the target side, a user defined upcall will be invoked, which receives tx and ∆t_tx as parameters. The target may now either act upon the request or forward it accordingly. Should this require simulation time to pass, the target may either annotate this by incrementing ∆t_tx, or it may call wait to suspend the driving simulation process until the service time has elapsed.

Completion of the communication is indicated by setting a response status flag within tx and returning from the upcall. The underlying communication infrastructure subsequently causes the IMC invoked by the initiator to return with the updated values for tx and ∆t_tx. This allows the initiator to resume execution and to check whether its request succeeded by investigating the response status flag of tx. In order to retain temporal correctness, it must furthermore acknowledge any increments to ∆t_tx, either by calling wait to compensate for the elapsed time or by noting it for future transactions as an initial time offset. In the latter case, the component is henceforth denoted as operating ahead of simulation time.

VPs making use of the blocking transport interface are often called loosely timed. While this way of communication fundamentally allows timing annotation, it is too simplistic to accurately represent complex transactions, e.g., out of order operations. Timing accuracy of loosely timed VPs is further reduced as they frequently use the blocking transport interface in combination with DMI and quantum based simulation.
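The following compact sketch illustrates this protocol using the standard TLM-2.0 convenience sockets rather than the abbreviated signatures of Listing A.1; the memory size, address and latency are arbitrary example values.

#include <systemc.h>
#include <tlm.h>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>
#include <cstring>
#include <cassert>

struct memory : sc_module {
    tlm_utils::simple_target_socket<memory> socket;
    unsigned char data[1024];

    SC_CTOR(memory) : socket("socket") {
        socket.register_b_transport(this, &memory::b_transport);
    }

    // target-side upcall: service the request and annotate timing
    void b_transport(tlm::tlm_generic_payload& tx, sc_time& offset) {
        unsigned char* ptr = data + tx.get_address();
        if (tx.is_read())
            memcpy(tx.get_data_ptr(), ptr, tx.get_data_length());
        else
            memcpy(ptr, tx.get_data_ptr(), tx.get_data_length());
        offset += sc_time(10, SC_NS);               // time consumed by this access
        tx.set_response_status(tlm::TLM_OK_RESPONSE);
    }
};

struct initiator : sc_module {
    tlm_utils::simple_initiator_socket<initiator> socket;

    SC_CTOR(initiator) : socket("socket") { SC_THREAD(run); }

    void run() {
        tlm::tlm_generic_payload tx;
        unsigned int value = 42;
        sc_time offset = SC_ZERO_TIME;              // not operating ahead of time
        tx.set_command(tlm::TLM_WRITE_COMMAND);
        tx.set_address(0x100);
        tx.set_data_ptr(reinterpret_cast<unsigned char*>(&value));
        tx.set_data_length(4);
        socket->b_transport(tx, offset);            // blocking transport IMC
        assert(tx.is_response_ok());
        wait(offset);                               // compensate the annotated delay
    }
};

int sc_main(int argc, char** argv) {
    initiator cpu("cpu");
    memory mem("mem");
    cpu.socket.bind(mem.socket);
    sc_start();
    return 0;
}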

A.2.2 Non-blocking Transport Interface

In case higher timing accuracy is required, TLM also offers the non-blocking transport interface. It relies on a more complex communication protocol that separates transaction transmission into multiple phases and decouples a response from its corresponding request. The necessary IMCs are depicted in Listing A.2.

enum tlm_sync  {TLM_ACCEPTED, TLM_UPDATED, TLM_COMPLETED};
enum tlm_phase {BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP};

tlm_sync i_socket::nb_transport_fw(tlm_transaction& tx, tlm_phase& phase, sc_time& offset);
tlm_sync t_socket::nb_transport_bw(tlm_transaction& tx, tlm_phase& phase, sc_time& offset);

Listing A.2: TLM Non-blocking Transport Interface

From those two IMCs, initiator sockets may only call the fw variant to send transactions to a target. Targets must call the bw variant to inform initiators about an update or fulfilment of their requests. By separating requests from responses, the non-blocking transport interface facilitates modelling the out of order behaviour of modern processors and bus interconnects.

The communication protocol revolves around the four phases shown in Listing A.2. First, the initiator must send its request during the begin request phase using the fw IMC. The request is considered accepted, when the target issues its own IMC on the backward path during the end request phase. Afterwards, the target is free to prepare the response, while the initiator either waits or continues operation asynchronously. Once the target has finished computing its response, it must indicate that to the initiator by invoking the bw IMC and specifying begin response as the phase argument. The entire transaction is considered complete, once the initiator responds during the end response phase, acknowledging proper reception.

To ease implementation effort, the target is allowed to signal early completion during the begin request phase by immediately returning its response on the return path. Furthermore, TLM offers standard implementations for interoperability between blocking and non-blocking communication interfaces, so that both designs can be combined within a single VP.

VPs taking advantage of the non-blocking transport interface are called approximately timed. Because of the performance penalty introduced by the complexity of the communication protocol, approximately timed VPs are rarely encountered in SW development and are instead deployed more frequently in DSE scenarios.
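As a minimal illustration, the sketch below shows only the target side of the protocol, using the standard TLM-2.0 convenience socket and the early-completion shortcut described above; it is not a complete approximately timed model and the initiator side is omitted.

#include <systemc.h>
#include <tlm.h>
#include <tlm_utils/simple_target_socket.h>

struct at_target : sc_module {
    tlm_utils::simple_target_socket<at_target> socket;

    SC_CTOR(at_target) : socket("socket") {
        socket.register_nb_transport_fw(this, &at_target::nb_transport_fw);
    }

    // forward-path IMC: called by the initiator with BEGIN_REQ
    tlm::tlm_sync_enum nb_transport_fw(tlm::tlm_generic_payload& tx,
                                       tlm::tlm_phase& phase, sc_time& offset) {
        if (phase == tlm::BEGIN_REQ) {
            // ... service the request here ...
            tx.set_response_status(tlm::TLM_OK_RESPONSE);
            phase = tlm::BEGIN_RESP;             // early completion on the return path
            offset += sc_time(20, SC_NS);        // annotated response delay
            return tlm::TLM_COMPLETED;           // no separate backward call needed
        }
        return tlm::TLM_ACCEPTED;                // e.g. END_RESP from the initiator
    }
};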

A.2.3 Direct Memory Interface

Performance-wise, the retrieval of memory words via IMCs as suggested by the previous two communication interfaces is suboptimal. To keep an ISS running, at least one memory access is required to fetch the next instruction word. For DBT-based ISSs, the situation is even worse, given that they usually operate on translation blocks covering tens of target instructions to achieve optimal speed. The overhead introduced by the previous two TLM protocols stands diametrically opposed to this goal, which is why another interface, called TLM Direct Memory Interface (DMI), is presented in Listing A.3.

bool i_socket::get_direct_mem_ptr(tlm_transaction& tx, tlm_dmi& dmi);
bool t_socket::invalidate_direct_mem_ptr(uint start, uint end);

Listing A.3: TLM Direct Memory Interface

The central idea of the DMI protocol is to allow an initiator to directly access a memory array owned by a target by means of regular C/C++ pointers. These pointers are provided with the dmi parameter shown above, including the corresponding target address range as well as the type of operation (read or write) it is valid for. To orchestrate the exchange of DMI pointers in a standardised manner, DMI provides two IMCs for pointer requests and invalidation. Because this protocol favours simulation speed over accuracy, it is commonly only applied in loosely timed VPs.

The protocol works as follows: a target may indicate its support for DMI by setting a DMI hint flag during a regular transaction using the blocking or non-blocking transport protocols. Upon receiving the hint, the initiator may request and acquire a DMI pointer using its IMC. Once the target receives this request, it is obliged to return a pointer that is at least valid for the target address range specified in the transaction parameter, but may optionally provide a larger range and higher access privileges.

During simulation, situations may occur that require invalidation of DMI pointers, e.g., after requesting exclusive memory access (c.f. Chapter 7). For this purpose, DMI provides a second IMC that can be used by a target to request invalidation of all previously handed out pointers that span a given target address range. Initiators are obliged to immediately cease usage of these DMI pointers and must instead use the blocking or non-blocking protocols until the pointer can be reacquired.
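An initiator-side sketch of this protocol, written against the standard TLM-2.0 tlm_dmi descriptor and convenience socket, might look as follows; the address and the surrounding module are illustrative only and the binding to a target is omitted.

#include <systemc.h>
#include <tlm.h>
#include <tlm_utils/simple_initiator_socket.h>
#include <cstring>

struct dmi_initiator : sc_module {
    tlm_utils::simple_initiator_socket<dmi_initiator> socket;
    tlm::tlm_dmi dmi;                      // cached DMI descriptor
    bool dmi_valid = false;

    SC_CTOR(dmi_initiator) : socket("socket") {
        socket.register_invalidate_direct_mem_ptr(this, &dmi_initiator::invalidate_dmi);
        SC_THREAD(run);
    }

    // backward-path IMC: the target revokes pointers covering [start, end]
    void invalidate_dmi(sc_dt::uint64 start, sc_dt::uint64 end) {
        dmi_valid = false;                 // conservatively drop the cached pointer
    }

    void run() {
        tlm::tlm_generic_payload tx;
        tx.set_command(tlm::TLM_READ_COMMAND);
        tx.set_address(0x1000);
        if (socket->get_direct_mem_ptr(tx, dmi) && dmi.is_read_allowed())
            dmi_valid = true;

        unsigned int word;
        if (dmi_valid) {
            // fast path: plain host pointer access instead of a transaction
            memcpy(&word, dmi.get_dmi_ptr() + (0x1000 - dmi.get_start_address()), 4);
            wait(dmi.get_read_latency());  // optional timing annotation
        }
    }
};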

A.2.4 Debug Interface

A major benefit of VPs is that they allow non-intrusive debugger access, enabling complete system introspection ranging from processor states, through I/O register values, down to the signal values of individual interrupt wires. The TLM debug interface is provided as a central means for constructing this non-intrusive debugger access functionality. It is based around a single IMC shown in Listing A.4.

unsigned int i_socket::transport_dbg(tlm_transaction& tx);

Listing A.4: TLM Debug Interface

The IMC is similar to the blocking transport interface, but omits the local time offset parameter. This is due to the fact that debugger accesses in SYSTEMC must be non-intrusive and may consequently not alter simulation time. Similarly, receivers of this IMC are also forbidden from calling wait. The transaction object tx encodes the type of operation the initiating debugger component wishes to perform, including the target address and size. The access range usually exceeds regular bus widths, as debuggers commonly retrieve entire virtual memory pages of 4 KiB size in one go.

If desired by the user, the TLM debug interface can also be used to modify the contents of memories or I/O registers to facilitate debugging. However, it is the task of the model designers to decide whether or not to reflect these debug changes within their component model, given that support for the debug protocol is not mandatory in TLM.
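For illustration, a debugger-side helper built on the standard TLM-2.0 API could issue such accesses as sketched below; the function is an example written for this appendix and is not part of any tool described in this thesis.

#include <tlm.h>
#include <tlm_utils/simple_initiator_socket.h>

// Read 'len' bytes of target memory without side effects or time consumption;
// returns the number of bytes the target actually provided.
template <typename MODULE>
unsigned int debug_read(tlm_utils::simple_initiator_socket<MODULE>& socket,
                        sc_dt::uint64 addr, unsigned char* buf, unsigned int len) {
    tlm::tlm_generic_payload tx;
    tx.set_command(tlm::TLM_READ_COMMAND);
    tx.set_address(addr);
    tx.set_data_ptr(buf);
    tx.set_data_length(len);                 // e.g. a full 4 KiB page at once
    return socket->transport_dbg(tx);        // debug IMC: no wait(), no timing
}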

Appendix B

The Virtual Components Library

The Virtual Components Library (VCL) is a modelling library that contains frequently needed design primitives and programming utilities, but also features a set of fully modelled virtual components. It is written in C/C++ using SYSTEMC and its components are designed for use within VPs that employ a loosely timed modelling style. The platforms presented in Chapter 4 rely on the VCL to provide a comprehensive set of virtual components and sustain their claim to realism. This chapter first gives an overview of the fundamental design primitives of VCL in Section B.1, before enumerating the set of fully modelled components in Section B.2.

B.1 Modelling Primitives

The TLM communication protocols presented in Appendix A rely on IMC interfaces, which must receive a concrete implementation by the VP designer. Furthermore, given that TLM interfaces need to be as generic as possible in order to be widely applicable, glue code is required to convert regular transactions into user-definable upcalls that facilitate modelling of peripheral components. The concrete primitives provided by VCL for this purpose are outlined in the following. First, Section B.1.1 introduces VCL ports and sockets, which have been optimised for use within a loosely timed transaction level VP. Next, Section B.1.2 presents the modelling framework that facilitates design of slave components, such as UARTs, storage controllers and interface devices. Properties are introduced in Section B.1.3 to facilitate runtime configuration of a VP without the need to recompile. Finally, Section B.1.4 addresses the issue of logging for virtual components, including functionality for tracing transactions and errors that assists in identifying bugs in the hardware models or their corresponding driver source code.

B.1.1 Ports and Sockets

VCL ports implement the SYSTEMC port interface and are used to connect to regular, simple valued SYSTEMC signals. Ports are commonly used to model interrupt lines connecting I/O components to the main processors. Beyond the regular implementation of SYSTEMC, VCL ports optionally allow multiple writers by deferring the actual write operation to a single driver process. While regular SYSTEMC ports must always be connected to a signal, VCL ports can be auto-stubbed. If auto-stubbing is enabled for a port, it is no longer considered an error if the port is left unconnected. Instead, write operations are ignored and a warning message is generated only when the port is actually read. This allows VP designers to swiftly state that they are not interested in interrupts generated by a component instead of falling back to a manual workaround. Finally, VCL ports can be aggregated into port lists, allowing C-array like access to an arbitrary number of ports with non-consecutive indices. These port lists are generally used to model the incoming interrupt ports for processors that support a large number of physical interrupt lines, e.g. OpenRISC [184, 97, 105].

VCL sockets implement the TLM blocking transport, debug and DMI interfaces. A master socket provides two methods for reading and writing data to a specified target address. An additional flag parameter specifies the preferred communication protocol to use. If no preference is stated, the socket first attempts to transfer the data via DMI. If no suitable pointer can be retrieved, the data is converted to a transaction and sent via the blocking transport interface. Master sockets can optionally send an extended transaction, which additionally holds extra information about the issuing processor. This extension can then be used by targets to implement specific functionality that depends on the sender, such as banked registers and LL/SC.

A VCL slave socket is designed to receive transactions sent via the blocking transport or debug routes. Furthermore, it can be instructed to provide DMI pointers for accesses to a user-specified target address range. Reception of a transaction causes the slave socket to invoke a virtual upcall function within its parent SYSTEMC module. Components wishing to react to transactions may override this method to implement their own handling logic. Finally, VCL slave sockets also handle byte order conversion, if their parent module and a received transaction specify a different endianness.

B.1.2 Peripherals and Registers

A VCL peripheral is a specialised SYSTEMC module that can contain an arbitrary number of VCL registers. Every peripheral module must contain at least one VCL slave socket that receives transactions and forwards them internally to the contained register with matching target address. To assign registers to a peripheral, it is sufficient to instantiate them within its constructor, e.g. by defining them as private or public class members. Upon their instantiation, registers must be assigned a name, target address and size. Optionally, users can specify read and write upcalls, which are invoked whenever a transaction accesses the register, to update its model state. Finally, registers can be marked as read- or write-only, causing transactions with invalid access privileges to be ignored and an error flag to be returned to the issuing processor.
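The exact VCL class declarations are not reproduced in this appendix; the following plain SYSTEMC/TLM sketch merely illustrates the described dispatch pattern, namely a slave socket forwarding transactions to registers by offset with optional read/write upcalls, using names invented for this example.

#include <systemc.h>
#include <tlm.h>
#include <tlm_utils/simple_target_socket.h>
#include <cstdint>
#include <cstring>
#include <functional>
#include <map>

// Illustrative only: a peripheral that dispatches incoming transactions to
// registers by offset, roughly following the pattern described for VCL.
struct demo_peripheral : sc_module {
    struct reg {
        uint32_t value = 0;
        std::function<void(uint32_t&)> on_read;    // optional read upcall
        std::function<void(uint32_t)>  on_write;   // optional write upcall
    };

    tlm_utils::simple_target_socket<demo_peripheral> in;
    std::map<sc_dt::uint64, reg> regs;             // offset -> register

    SC_CTOR(demo_peripheral) : in("in") {
        in.register_b_transport(this, &demo_peripheral::b_transport);
        regs[0x0] = reg();                         // e.g. a control register
        regs[0x4] = reg();                         // e.g. a status register
    }

    void b_transport(tlm::tlm_generic_payload& tx, sc_time& offset) {
        auto it = regs.find(tx.get_address());
        if (it == regs.end() || tx.get_data_length() != 4) {
            tx.set_response_status(tlm::TLM_ADDRESS_ERROR_RESPONSE);
            return;
        }
        reg& r = it->second;
        if (tx.is_read()) {
            if (r.on_read) r.on_read(r.value);     // let the model update state first
            memcpy(tx.get_data_ptr(), &r.value, 4);
        } else {
            uint32_t v;
            memcpy(&v, tx.get_data_ptr(), 4);
            r.value = v;
            if (r.on_write) r.on_write(v);
        }
        tx.set_response_status(tlm::TLM_OK_RESPONSE);
    }
};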

B.1.3 Properties

VCL properties represent simple valued data types, such as Booleans and integers, but also strings and SYSTEMC time, which are automatically initialised by VCL during simulation elaboration according to a user specified value. By replacing simple valued model parameters with their corresponding property representation, VP designers can quickly experiment with various component configurations without the need for time-consuming recompilation.

Properties receive their values from VCL property value providers. These providers identify properties globally via their hierarchical SYSTEMC name as keys and assign their initial values based on user provided key/value pairs. Currently, the providers can draw on the following three sources:

• Configuration Files are text files with one key/value pair per line. Comments are supported to facilitate documentation of configuration files for large VPs.

• Environment Variables that match the hierarchical SYSTEMC name of a property are used to initialise it.

• Command Line arguments in the form of -c name=value can also be used to specify the initialisation value for a VCL property.

Should two or more sources specify different initialisation values for the same property, VCL uses a source priority system to resolve this conflict. Configuration files receive the lowest priority in this system and may therefore be overwritten by either of the two other sources. Initialisation values specified via the command line enjoy the highest priority and overwrite the other two.

B.1.4 Logging

The logging subsystem of VCL provides the central means for analysis and debugging of virtual models, but also of the software that interacts with them. A VCL log message is a tuple consisting of a message string, log level (error, warning, information, debug and trace), a timestamp and the issuing module. These messages are produced for certain events by default, e.g., the reception of an erroneous transaction, missing access privileges or address errors. To issue custom log messages, VCL modules provide API calls that automatically generate log messages based on user-provided string messages.

Once a log message has been created, it is forwarded to the log sinks, which form the backend of the logging subsystem and handle filtering and reporting. The VCL supports a range of log sinks that output prefiltered log messages to terminals, text files or via the network. To reduce the performance impact of log message creation and transmission, rigorous filtering is performed. Log sinks can specify the log level range they are interested in receiving. Before creation of a log message, it is first checked whether at least one sink is interested in the specified log level. Otherwise, logging of the message is immediately aborted so that the performance impact remains minimal.

A special application field of the logging subsystem is transaction tracing. To that extent, every VCL master and slave socket is equipped with attributes that control tracing verbosity. By default, tracing is disabled, but it can be turned on to trace all or only erroneous transactions on a per socket basis. When it is activated, tracing makes it possible to follow the route a transaction takes from its initiator until it reaches its target, as well as all modifications it undergoes while passing through interconnects. This allows VP users to identify the exact location and time a transaction is modified in an undesirable way, e.g., when an error flag is set.

B.2 Component Models

This section presents a set of fully modelled components available with the VCL that allow swift construction of VPs based on off-the-shelf components. It begins with the presentation of generic TLM models for memories and buses in Section B.2.1 and Section B.2.2, respectively. These models do not have real HW counterparts as their design goal is fast simulation and functional correctness. Subsequently, Sections B.2.3–B.2.6 present various virtual components that offer non-essential, but advantageous capabilities to VPs, such as serial I/O, VGA video and Ethernet networking. Those models are based on concrete HDL designs with openly available specifications.

B.2.1 Memory Model

The VCL provides an abstract model for memory components, such as RAM, ROM or scratchpad memories. While its focus lies on functional correctness and adaptability, it provides rudimentary access timing annotation for requests originating from the TLM blocking transport interface. However, for optimal simulation performance, the model should be used with DMI, ideally directly accessed from the ISS or, if that is not possible, via the VCL master sockets.

The size of the memory as well as read and write latencies are statically configurable via VCL properties. Furthermore, a Boolean property exists that allows switching the memory into read-only mode. In this mode, write transactions will be ignored and DMI pointers are only handed out with read access permissions. Tracing functionality exists to swiftly identify processors that erroneously attempt to write to such a read-only memory.

B.2.2 Memory-mapped Bus

The memory-mapped bus component of VCL provides a fundamental interconnect between initiator and target sockets. The bus provides a set of TLM target sockets that initiators, such as ISSs, can connect to, as well as a series of outgoing TLM initiator sockets that can be bound to slave components, such as memories or I/O peripherals. Transactions received on the incoming side are forwarded based on their specified target memory address to one of the outgoing ports. VP developers can specify a memory map that assigns memory address ranges to specific outgoing sockets. Address lookup and translation are then performed by the bus automatically, as illustrated by the sketch below. If no matching map entry exists for a specific address, an address error is signalled back to the initiator and reported to the VCL tracing subsystem.

In its generic version, VCL buses do not annotate timing and instead forward all transactions to their target instantaneously. Consequently, congestion is not modelled. While this does not pose a problem in practice, since most memory accesses are expected to be performed using DMI, the bus also offers support for timed simulation. In this operation mode, it uses a transaction queue and an arbiter process in order to forward only one transaction at a time with an annotated delay.
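The address decoding step can be illustrated with the following sketch; the map_entry structure and the decode function are invented for this example and do not reproduce the actual VCL bus implementation.

#include <cstdint>
#include <vector>

// Illustrative address decoder for a memory-mapped bus: each map entry assigns
// an address range to an outgoing socket index; lookup also rebases the address.
struct map_entry {
    uint64_t start;      // first address of the range
    uint64_t end;        // last address of the range (inclusive)
    unsigned int port;   // index of the outgoing initiator socket
};

// Returns true and fills 'port'/'offset' if 'addr' falls into a mapped range,
// false if no entry matches (the bus would then signal an address error).
bool decode(const std::vector<map_entry>& map, uint64_t addr,
            unsigned int& port, uint64_t& offset) {
    for (const map_entry& e : map) {
        if (addr >= e.start && addr <= e.end) {
            port = e.port;
            offset = addr - e.start;   // address translation into the target's space
            return true;
        }
    }
    return false;
}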

Register  Offset  Width  Access  Description
THR       0x0     8      R/W     Transmit Hold Register
IER       0x1     8      R/W     Interrupt Enable Register
IIR       0x1     8      R/W     Interrupt Identification Register
LCR       0x3     8      R/W     Line Control Register
MCR       0x4     8      R/W     Modem Control Register
LSR       0x5     8      R       Line Status Register
MSR       0x6     8      R/W     Modem Status Register
SCR       0x7     8      R/W     Scratch Register

Table B.1: VCL UART 8250 model registers

B.2.3 Universal Asynchronous Receiver/Transmitter 8250

The 8250 UART [132] is a Universal Asynchronous Receiver Transmitter (UART) built by National Semiconductor and first introduced with the IBM PC in 1981. It is the progenitor of the UART devices still found in modern PCs and many embedded devices. The UART is a peripheral component that allows a processor to communicate with external I/O devices, such as terminal screens and keyboards, via a serial connection. Because it offers a straightforward register interface, the driver code that is necessary to access the UART is minimal. Consequently, it is often the first choice of SW developers for outputting debug data and system status information during early system boot, but sometimes also during regular operation.

The VCL contains a virtual model of the traditional 8250 UART. Table B.1 shows the register interface that has been modelled according to specifications using the peripheral subsystem of VCL. The concrete functions of each register can be found in its datasheet [132]. A VCL slave port has been added to allow access via the TLM blocking and debug transport interfaces. Transactions directed to the UART are dispatched to the registers according to their offset addresses as detailed in Table B.1. Since all registers are a single byte wide, processors must ensure that load and store operations addressing these registers are of matching size. Attempts to access any register with a 16 or 32 bit operation result in a bus error.

When a single byte is written to the THR register, its value is forwarded to the transmission FIFO, ultimately resulting in the character being output via the serial connection. The virtual UART supports various ways for outputting data as well as multiple sources to read input from. By default, the UART uses the standard input and output streams for this purpose. As this approach can quickly become confusing when multiple devices are being used, the output of each UART can optionally be directed into unique text files. Finally, UART input and output can also be made available over the network, allowing terminal screens to connect to a simulation running on a remote host, for example a server in a computing centre.

Register  Offset  Width  Access  Description
SPCR      0x0     8      R/W     SPI Control Register
SPSR      0x1     8      R/W     SPI Status Register
SPDR      0x2     8      R/W     SPI Data Register
SPER      0x3     8      R/W     SPI Extensions Register
SPCS      0x4     8      R/W     SPI Chip Select Register

Table B.2: VCL SPI controller registers

B.2.4 OpenCores SPI Controller

The Serial Peripheral Interface (SPI) bus offers a serial communication interface between a single master and multiple slave components. Since its inception in the late 1980s, it has become the de facto standard for accessing memory cards, such as Multimedia Cards (MMCs) and Secure Digital (SD) cards. Because these cards offer a cost effective way for adding exchangeable persistent storage in the gigabyte range, a corresponding SPI controller can be found in many embedded systems.

The VCL offers a virtual SPI controller model that is based on an HDL design from OpenCores [73]. The register interface for the device is shown in Table B.2. Using the SPCS register, this controller can support up to eight SPI slaves simultaneously. Furthermore, the device has a VCL slave socket to receive transactions sent via the TLM blocking or debug interfaces, as well as a VCL port to generate interrupts. These interrupts are used to signal operation fulfilment back to the initiating processor. They can optionally be ignored, in which case the device status must be polled from the SPSR register. If the latter approach is desired, the auto-stubbing feature of VCL ports may be used to immediately stub the interrupt line of the SPI controller.

In essence, the SPI controller works as a bus bridge that translates between memory mapped TLM transactions and the SPI bus protocol. The controller therefore defines a TLM-like interface function that must be implemented by all slave components wishing to communicate with the controller. This interface is shown in Listing B.1. It is called whenever the controller has received a data word from the initiating processor, which is then forwarded to the currently selected slave component.

void spi_slave::transfer(uint8_t data);

Listing B.1: SPI Slave Interface
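Assuming spi_slave is an abstract interface class behind the signature of Listing B.1 (its actual VCL declaration is not shown here), a minimal slave implementation could look as follows; the spi_logger class is invented purely for illustration.

#include <cstdint>
#include <vector>

// Assumed shape of the interface behind Listing B.1: an abstract base class
// with a virtual transfer() upcall (the exact VCL declaration may differ).
class spi_slave {
public:
    virtual ~spi_slave() {}
    virtual void transfer(uint8_t data) = 0;
};

// Illustrative slave: collects every byte shifted out by the SPI controller.
class spi_logger : public spi_slave {
public:
    void transfer(uint8_t data) override { received.push_back(data); }
    std::vector<uint8_t> received;
};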

The VCL also offers a model for a slave memory card that can be connected to the controller. It has been modelled according to the Micron M25P80 family serial flash embedded memory component [122]. All data stored by this device is represented as a binary file on the host computer and may also be inspected and modified using external tools, such as libguestfs [89].

Register  Offset  Width  Access  Description
CTRL      0x00    32     R/W     Control Register
STAT      0x04    32     R/W     Status Register
HTIM      0x08    32     R/W     Horizontal Timing Register
VTIM      0x0c    32     R/W     Vertical Timing Register
VBARA     0x14    32     R/W     Video Memory Base Register A

Table B.3: VCL VGA/LCD controller registers

B.2.5 OpenCores VGA/LCD 2.0 Core

The OpenCores VGA/LCD 2.0 core [74] is a peripheral component that enables embedded systems to output video data via VGA to LCD displays. Its virtual representation from VCL models only a subset of the capabilities of the original device, but is sufficient to achieve video output on VPs running bare-metal applications or the Linux kernel. It supports output at various display resolutions with a colour depth of 32 bit, 24 bit or 16 bit, as well as 8 bit grayscale and 8 bit colour palette modes. It does not bring its own video memory, but must rather be programmed to utilise part of the system memory for the framebuffer. The subset of the registers that have been modelled for the VCL component is shown in Table B.3.

Beyond the register set, the VGA controller also models a master and a slave VCL socket. The slave socket is required for accessing the registers from an ISS via the TLM blocking and debug transport interfaces. The controller utilises DMI to access the memory it has been assigned for its framebuffer through its VCL master socket. Output of the framebuffer happens at a fixed rate of 60 Hz. The VGA controller begins this process by first converting the contents of the framebuffer into a default format with 24 bit colour depth per pixel. The resulting two dimensional integer array can then optionally be compressed before forwarding it to the actual output backend. Two such backends are supported:

• Bitmap: this backend outputs the preconverted contents of the framebuffer into a bitmap file, which can be opened with regular image manipulation programs. By default, the backend overwrites the old image whenever new output is generated. Users can optionally choose to keep each individual image. In this case, the simulation timestamp, at which the image was generated, is appended to its filename in order to make it distinguishable.

• Stream: this backend sends the image contents via the network to a streaming application. This streaming application is provided as a utility with the VCL. It receives the images at the rate they are sent by the VP and displays them immediately in a video frame. Consequently, the fluidity of a video directly depends on simulation performance.

Figure B.1: ORVP (left) rendering Mandelbrot set on VGA (right)

Figure B.1 shows the VGA controller in action. It presents the ORVP in the middle of rendering the Mandelbrot set from within a running Linux environment. The left-hand side shows a terminal window used for interfacing with the UART of the system, while the right-hand side presents the VCL streaming application displaying the current contents of the framebuffer. The refresh rate of 119 Frames Per Second (FPS) indicates that the VP is currently operating at almost twice its real time speed.

B.2.6 OpenCores 10/100 Mbps Ethernet

Using the virtual model of the OpenCores Ethernet IP core [126], VPs can be equipped with Ethernet connectivity, enabling them to participate in local networks as well as providing them with Internet access. The register interface modelled for this core is shown in Table B.4. Furthermore, the component is equipped with a VCL slave socket for providing access to these registers via the TLM blocking and debug transport interfaces, as well as a master socket for retrieving outbound data and storing received Ethernet frames to main memory.

Conceptually, the Ethernet core is split into two modules: the host interface and the physical backend. The host interface receives commands and transmission data from the controlling processor via the memory mapped bus. Ethernet packet transmission is then performed by the physical backend. For the virtual variant, only the host interface has been modelled. Beyond the aforementioned registers, it also has 1 KiB of local memory, allowing the storage of 256 buffer descriptors. Each buffer descriptor encodes information necessary for transmitting its contents via Ethernet, as well as pointers to the actual payload located in main system memory. The Ethernet core handles packet transmission at a default operation frequency of 15 MHz. However, if a different configuration is desired, this clock can easily be modified using the properties subsystem of VCL.

When transmitting, the Ethernet core chooses the next ready buffer descriptor from its local memory and fetches the associated payload data from main memory using its VCL master socket and DMI. Once that operation has completed successfully, the core proceeds to create an Ethernet packet by combining a packet header derived from the buffer descriptor and the payload.

Register        Offset  Width  Access  Description
MODER           0x00    32     R/W     Mode Register
INT_SOURCE      0x04    32     R/W     Interrupt Source Register
INT_MASK        0x08    32     R/W     Interrupt Mask Register
IPGT            0x0C    32     R/W     Back to Back Inter Packet Gap Register
IPGR1           0x10    32     R/W     Non Back to Back Inter Packet Register 1
IPGR2           0x14    32     R/W     Non Back to Back Inter Packet Register 2
PACKETLEN       0x18    32     R/W     Packet Length Register
COLLCONF        0x1C    32     R/W     Collision Retry Configuration Register
TX_BD_NUM       0x20    32     R/W     Transmit Buffer Descriptor Number Register
CTRLMODER       0x24    32     R/W     Control Module Mode Register
MIIMODER        0x28    32     R/W     MII Mode Register
MIICOMMAND      0x2C    32     R/W     MII Command Register
MIIADDRESS      0x30    32     R/W     MII Address Register
MIITX_DATA      0x34    32     R/W     MII Transmit Data Register
MIIRX_DATA      0x38    32     R/W     MII Receive Data Register
MIISTATUS       0x3C    32     R/W     MII Status Register
MAC_ADDR0       0x40    32     R/W     Lower MAC Address Register
MAC_ADDR1       0x44    32     R/W     Upper MAC Address Register
ETH_HASH0_ADR   0x48    32     R/W     HASH0 Register
ETH_HASH1_ADR   0x4C    32     R/W     HASH1 Register
ETH_TXCTRL      0x50    32     R/W     Transmission Control Register

Table B.4: VCL Ethernet model registers

Normally, the fully assembled packet would then be passed on to the physical backend for regular transmission. However, the VCL representation of the Ethernet core replaces this backend with a different output subsystem. Users may choose between two variants:

• logfiles: packets scheduled for transmission are written to a text-based logfile. This file includes the simulation timestamp at which the packet was output by the core as well as its destination address and payload.

• virtual network: a virtual network interface [101] is created on the host computer, receiving all Ethernet packets generated by the VCL Ethernet core. This enables the creation of a virtual network between the VP and its host.

Reception of Ethernet packets as well as Internet access is only possible when the virtual network backend is chosen. Furthermore, network address translation must be enabled on the host, effectively making it act like a router for the VP.

Appendix C

Experimental Data

This chapter holds detailed measurement data that form the foundation for the experimental evaluation of the tools presented in this thesis. Table C.1 summarizes the measurements by chapter, linking figures shown in the respective experimental evaluation sections to the data they are based on. Each table presents measurement data for individual experiment runs as well as averaged results used in the creation of the aforementioned figures. Speedups are derived by accumulating the baseline simulator runtimes and dividing them by the sum of simulation times for a sub-benchmark. For EURETILE experiments, clean runtimes are reported, which correspond to the actual simulation time without elaboration time. This distinction has been made to provide a fair comparison of simulation speedup, as the elaboration is only parallelized by SCOPE and not by PARSC or regular OSCI SYSTEMC.

The simulation hosts used for driving these experiments are presented in Table C.2. Both hosts share the same software environment, i.e., operating system kernel, compiler and C library. The hosts used for each experiment are outlined in Table C.1.

Chapter    Figure                    Data Tables           Simulation Host
Chapter 5  Figures 5.9, 5.10         Tables C.3, C.4       serious
           Figures 5.11, 5.12        Tables C.5, C.6, C.7
Chapter 6  Figures 6.8a, 6.8c, 6.9a  Tables C.8, C.9       adamus
           Figures 6.8b, 6.8d, 6.9b  Tables C.10, C.11
Chapter 7  Figure 7.4                Table C.12            serious
           Figures 7.6, 7.5          Table C.13
Chapter 8  Figures 8.7a, 8.7c, 8.7e  Table C.14            serious
           Figure 8.8                Tables C.15, C.16
           Figure 8.9                Tables C.15, C.17
Chapter 9  Figure 9.9                Table C.18            serious
           Figure 9.10               Table C.19
           Figure 9.12               Table C.20

Table C.1: Measurement data overview


Hostname    serious                               adamus
Processor   Intel® Core™ i7 5960X                 Intel® Core™ i7 920
Cores       8 (+8 via hyperthreading)             4 (+4 via hyperthreading)
Clock       3 GHz                                 2.67 GHz
L1-DCache   8 × 32 KiB 8-way set associative      4 × 32 KiB 8-way set associative
L1-ICache   8 × 32 KiB 8-way set associative      4 × 32 KiB 4-way set associative
L2-Cache    8 × 256 KiB 8-way set associative     4 × 256 KiB 8-way set associative
L3-Cache    20 MiB 20-way set associative         8 MiB 16-way set associative
Memory      16 GiB RAM                            12 GiB RAM
Kernel      Linux 2.6.32-642.13.1.el6.x86_64      Linux 2.6.32-642.13.1.el6.x86_64
LibC        GNU libc 2.12                         GNU libc 2.12
Compiler    GCC 4.4.7                             GCC 4.4.7

Table C.2: Simulation hosts referred to by Table C.1

Table C.3: EURETILE measurement data: fft application scenario

Table C.4: EURETILE measurement data: presto application scenario
Table C.5: EURETILE measurement data: presto lookahead (one and two threads)
529.39 s 529.78 s 529.41 s 529.17 s 28.76 reads) vrg la Speedup Clean Average × × × × × × × × × × × × × × × × × × × × × × × × 171 × × × × × × × × × × × × × × × × × × × × × × × × Average Clean Speedup t threads) ss 83.99 s 72.01 s 83.53 s 71.34 s 84.31 s 72.13 s 84.38 s 71.91 s 83.14 s 70.68 s 6.04 7.11 55.06 s 255.17 s 254.87 s 254.89 s 255.07 s 253.78 s 1.98 59.31 s 257.36 s 261.88 s 257.67 s 258.82 s 257.64 s 1.95 63.25 s 262.00 s 261.69 s 262.18 s 262.56 s 261.27 s 1.92 79.97 s 280.00 s 280.66 s 279.47 s 280.26 s 279.09 s 1.80 75.80 s 277.58 s 278.56 s 277.23 s 277.53 s 276.25 s 1.82 25.41 s 324.65 s 324.10 s 325.96 s 324.77 s 323.48 s 1.55 09.86 s 310.20 s 318.62 s 310.09 s 311.67 s 310.39 s 1.62 20.22 s 421.42 s 420.12 s 419.54 s 420.30 s 419.11 s 1.20 03.84 s 402.43 s 404.84 s 402.74 s 403.21 s 401.92 s 1.25 91.43 s 688.58 s 694.54 s 688.58 s 690.57 s 689.39 s 0.73 235.24 s193.36 s 236.24 s172.44 s 193.19 s 236.34 s166.73 s 172.59 s 193.09 s 236.64 s159.96 s 166.51 s 172.41 s 193.34 s 235.82 s 160.31 s 166.54 s 173.09 s 193.47 s 234.53 s 160.88 s 166.28 s 172.66 s 192.19 s 2.14 160.60 s 166.43 s 171.37 s 2.61 160.36 s 165.15 s 2.93 159.07 s 3.04 3.16 203.58 s144.39 s 203.38 s120.26 s 144.10 s 203.60 s110.24 s 121.96 s 144.73 s 203.31 s103.13 s 110.46 s 120.76 s 144.48 s 204.21 s 102.77 s 111.07 s 120.73 s 144.93 s 202.99 s 103.44 s 110.64 s 121.15 s 143.80 s 2.48 103.21 s 110.66 s 119.95 s 3.49 103.11 s 109.49 s 4.19 101.89 s 4.59 4.93 ss 144.14 s 130.71 s 144.70 s 130.45 s 145.14 s 130.17 s 144.16 s 130.44 s 144.51 s 130.75 s 143.22 s 129.47 s 3.51 3.88 Experiment Iteration EURETILE measurement data: presto lookahead (four and eigh 1 2 3 4 5 6 7 8 910 Table C.6: 5 ns 1.29 s 255.12 s 255.39 s 254.62 s 254.77 s 255.20 s 255.61 s 2 5 ns 1.18 s 260.42 s 257.95 s 258.35 s 258.75 s 258.14 s 258.36 s 2 4 ns 1.29 s 262.24 s 262.12 s 264.08 s 262.64 s 262.47 s 262.93 s 2 4 ns 1.17 s 281.04 s 279.60 s 281.33 s 279.53 s 280.45 s 280.54 s 2 3 ns 1.28 s 278.01 s 277.78 s 277.49 s 277.27 s 277.49 s 278.04 s 2 3 ns 1.29 s 325.78 s 326.70 s 324.63 s 322.75 s 322.99 s 324.75 s 3 2 ns 1.28 s 310.81 s 309.32 s 308.56 s 317.93 s 310.74 s 310.60 s 3 2 ns 1.19 s 420.37 s 421.09 s 419.51 s 418.46 s 422.22 s 420.01 s 4 1 ns 1.29 s 404.98 s 400.84 s 403.69 s 402.90 s 403.36 s 402.52 s 4 1 ns 1.18 s 691.83 s 689.20 s 690.59 s 693.33 s 689.55 s 688.07 s 6 10 ns20 ns30 ns40 1.29 s ns 1.2850 s ns 1.29 235.93 s s 1.28 193.72 s s 235.11 s 1.29 172.44 s s 193.69 s 235.72 s 166.12 s 172.42 s 193.80 s 235.68 s 160.07 s 165.95 s 173.49 s 193.47 s 235.72 s 160.22 s 166.82 s 172.55 s 193.03 s 235.62 s 160.08 s 166.44 s 172.81 s 193.96 s 160.7610 s 166.54 ns s 172.37 s 20 160.56 ns s 166.39 s 30 ns 160.15 s 40 1.22 s ns 1.1350 s ns 1.20 204.67 s s 1.17 144.76 s s 204.26 s 1.22 120.79 s s 147.32 s 204.70 s 110.17 s 122.39 s 145.21 s 204.83 s 102.36 s 111.95 s 121.29 s 144.15 s 205.19 s 103.08 s 110.50 s 121.19 s 144.83 s 204.59 s 103.22 s 110.76 s 121.77 s 145.30 s 102.93 s 110.49 s 120.36 s 103.22 s 110.32 s 103.69 s 100 ns200 ns 1.29 s 1.28 s 144.63 s 130.19 s 144.01 s 130.83 s 144.58 s 130.52 s 144.53 s 130.18 s 144.51 s 132.79 s 144.69 131.25 100 ns200 ns 1.24 s 1.23 s 83.97 s 72.10 s 85.11 s 72.64 s 85.33 s 71.53 s 84.42 s 71.33 s 84.40 s 72.10 s 84.10 s 71.14 s 84.59 72.74 4 8 Threads Lookahead Elaboration 172 Appendix C. Experimental Data

Lookahead  Elaboration  Iteration 1  Iteration 2  Iteration 3  Average  Clean  Speedup

Threads = 1:
1 ns 15.74 s 15 619.98 s 15 593.09 s 15 582.63 s 15 598.57 s 15 582.83 s 0.90×
2 ns 15.76 s 15 547.12 s 15 499.18 s 15 566.21 s 15 537.50 s 15 521.74 s 0.91×
3 ns 15.74 s 15 516.35 s 15 523.84 s 15 495.14 s 15 511.78 s 15 496.04 s 0.91×
4 ns 15.73 s 15 512.30 s 15 549.93 s 15 530.68 s 15 530.97 s 15 515.24 s 0.91×
5 ns 15.75 s 15 595.34 s 15 493.16 s 15 533.05 s 15 540.52 s 15 524.77 s 0.91×
10 ns 15.73 s 15 534.65 s 15 465.85 s 15 450.87 s 15 483.79 s 15 468.06 s 0.91×
20 ns 15.74 s 15 477.36 s 15 486.68 s 15 509.23 s 15 491.09 s 15 475.35 s 0.91×
30 ns 15.86 s 15 476.88 s 15 486.87 s 15 487.51 s 15 483.75 s 15 467.89 s 0.91×
40 ns 15.73 s 15 516.60 s 15 483.39 s 15 491.21 s 15 497.07 s 15 481.34 s 0.91×
50 ns 15.89 s 15 566.21 s 15 477.96 s 15 460.17 s 15 501.45 s 15 485.56 s 0.91×
100 ns 15.73 s 15 644.48 s 15 490.89 s 15 476.83 s 15 537.40 s 15 521.67 s 0.91×
200 ns 15.77 s 15 479.69 s 15 482.06 s 15 499.69 s 15 487.15 s 15 471.38 s 0.91×

Threads = 2:
1 ns 8.37 s 8976.94 s 8949.24 s 8985.73 s 8970.64 s 8962.27 s 1.57×
2 ns 8.36 s 8614.49 s 8578.98 s 8607.76 s 8600.41 s 8592.05 s 1.64×
3 ns 8.36 s 8503.47 s 8474.85 s 8464.34 s 8480.89 s 8472.53 s 1.66×
4 ns 8.35 s 8436.74 s 8453.78 s 8403.35 s 8431.29 s 8422.94 s 1.67×
5 ns 8.38 s 8390.16 s 8461.24 s 8435.71 s 8429.04 s 8420.66 s 1.67×
10 ns 8.37 s 8326.78 s 8358.02 s 8408.72 s 8364.51 s 8356.14 s 1.68×
20 ns 8.36 s 7666.13 s 7661.27 s 7663.68 s 7663.69 s 7655.33 s 1.84×
30 ns 8.35 s 7642.61 s 7640.30 s 7606.30 s 7629.74 s 7621.39 s 1.85×
40 ns 8.36 s 7611.01 s 7666.56 s 7611.28 s 7629.62 s 7621.26 s 1.85×
50 ns 8.35 s 7622.14 s 7618.49 s 7601.26 s 7613.96 s 7605.61 s 1.85×
100 ns 8.35 s 7592.82 s 7588.25 s 7604.05 s 7595.04 s 7586.69 s 1.85×
200 ns 8.36 s 7590.99 s 7603.31 s 7590.24 s 7594.85 s 7586.49 s 1.85×

Threads = 4:
1 ns 4.64 s 7021.85 s 7052.60 s 6999.57 s 7024.67 s 7020.03 s 2.00×
2 ns 4.65 s 5659.98 s 5677.54 s 5646.43 s 5661.32 s 5656.67 s 2.49×
3 ns 4.64 s 5150.82 s 5168.08 s 5158.01 s 5158.97 s 5154.33 s 2.73×
4 ns 4.63 s 4943.48 s 4933.14 s 4968.88 s 4948.50 s 4943.87 s 2.85×
5 ns 4.61 s 4837.72 s 4812.85 s 4812.78 s 4821.12 s 4816.51 s 2.92×
10 ns 4.63 s 4541.19 s 4531.67 s 4529.06 s 4533.97 s 4529.34 s 3.11×
20 ns 4.63 s 3612.20 s 3601.63 s 3604.87 s 3606.23 s 3601.60 s 3.91×
30 ns 4.64 s 3555.23 s 3557.72 s 3524.96 s 3545.97 s 3541.33 s 3.97×
40 ns 4.62 s 3500.96 s 3513.86 s 3526.04 s 3513.62 s 3509.00 s 4.01×
50 ns 4.63 s 3494.38 s 3509.18 s 3504.41 s 3502.66 s 3498.03 s 4.02×
100 ns 4.63 s 3445.77 s 3450.00 s 3489.72 s 3461.83 s 3457.20 s 4.07×
200 ns 4.63 s 3450.76 s 3470.13 s 3438.49 s 3453.13 s 3448.50 s 4.08×

Threads = 8:
1 ns 2.80 s 17 133.49 s 17 313.41 s 17 261.52 s 17 236.14 s 17 233.34 s 0.82×
2 ns 2.82 s 8792.22 s 8835.66 s 8841.37 s 8823.08 s 8820.26 s 1.60×
3 ns 2.79 s 5827.54 s 5876.58 s 5827.90 s 5844.01 s 5841.22 s 2.41×
4 ns 2.79 s 4611.79 s 4570.40 s 4621.73 s 4601.31 s 4598.52 s 3.06×
5 ns 2.82 s 4109.22 s 4073.93 s 4095.42 s 4092.86 s 4090.04 s 3.44×
10 ns 2.80 s 2852.64 s 2814.84 s 2830.00 s 2832.49 s 2829.69 s 4.97×
20 ns 2.80 s 1853.27 s 1858.27 s 1860.41 s 1857.32 s 1854.52 s 7.59×
30 ns 2.82 s 1743.57 s 1738.68 s 1732.90 s 1738.38 s 1735.56 s 8.11×
40 ns 2.78 s 1679.61 s 1692.23 s 1689.35 s 1687.06 s 1684.28 s 8.35×
50 ns 2.80 s 1667.11 s 1659.82 s 1666.35 s 1664.43 s 1661.63 s 8.47×
100 ns 2.79 s 1609.82 s 1617.36 s 1620.74 s 1615.97 s 1613.18 s 8.72×
200 ns 2.81 s 1603.71 s 1597.14 s 1627.01 s 1609.29 s 1606.48 s 8.76×

Table C.7: EURETILE measurement data: fft lookahead

Table C.8: GEMSCLAIM measurement data: fast simulation mode runtime; columns: Threads, Lookahead, Experiment Iterations 1–10, Average, Speedup

Table C.9: GEMSCLAIM measurement data: deterministic simulation mode runtime; columns: Threads, Lookahead, Experiment Iterations 1–10, Average, Speedup

Table C.10: GEMSCLAIM measurement data: fast simulation mode timing; columns: Threads, Lookahead, Experiment Iterations 1–10, Average Error (abs., rel.)

Table C.11: GEMSCLAIM measurement data: deterministic simulation mode timing; columns: Threads, Lookahead, Experiment Iterations 1–10, Average Error (abs., rel.)

Variant  Operation  Core 1  Core 2  Core 3  Core 4  Total

ORVP/DMI  LL  1450671  1990977  1467735  1895364  6804747
ORVP/DMI  SC  1450636  1990908  1467657  1895303  6804504

ORVP/Mix  LL  1170865  1895603  1823888  1918654  6809010
ORVP/Mix  SC  1170718  1895492  1823813  1918592  6808615

ORVP/BT  LL  1154778  1574276  2031779  2053242  6814075
ORVP/BT  SC  1154723  1574183  2031715  2053178  6813799

Table C.12: LL/SC operations performed by ORVP during Linux boot
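The LL total is slightly larger than the SC total in every configuration. Subtracting the two totals (my own arithmetic on the numbers above; reading the difference as LL operations whose matching SC never completed is an assumption, not a statement from the table) gives:

\[ 6\,804\,747 - 6\,804\,504 = 243 \;\text{(ORVP/DMI)}, \qquad 6\,809\,010 - 6\,808\,615 = 395 \;\text{(ORVP/Mix)}, \qquad 6\,814\,075 - 6\,813\,799 = 276 \;\text{(ORVP/BT)} \]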

Application  OSCI  SCOPE (2 Threads)  SCOPE (4 Threads)

boot 26 530 ms 14 526 ms 8404 ms
fibonacci 7115 ms 4032 ms 2421 ms
mandelbrot 35 334 ms 19 127 ms 10 958 ms
dhrystone 62 941 ms 35 378 ms 19 594 ms
dhrystone_4x 64 507 ms 35 148 ms 20 228 ms
coremark 203 887 ms 116 972 ms 69 575 ms

barnes 909 975 ms 530 856 ms 328 522 ms
fmm 3860 ms 2198 ms 1326 ms
ocean_c 64 875 ms 35 633 ms 20 133 ms
ocean_nc 61 062 ms 33 045 ms 18 700 ms
radiosity 883 093 ms 517 129 ms 302 932 ms
raytrace 117 066 ms 68 326 ms 40 889 ms
volrend 46 163 ms 29 235 ms 19 268 ms
water-nsquared 236 633 ms 139 866 ms 82 562 ms
water-spatial 316 154 ms 186 571 ms 111 587 ms

cholesky 143 935 ms 78 721 ms 44 349 ms
fft 1668 ms 966 ms 563 ms
lu_c 37 431 ms 20 312 ms 11 465 ms
lu_nc 1529 ms 873 ms 551 ms
radix 15 145 ms 8565 ms 4836 ms

Table C.13: ORVP/DMI simulation runtime

Table C.14: ORVP/NONE measurement data: single thread runtime; columns: Application, Experiment Iterations 1–10, Average

Table C.15: ORVP/NONE measurement data: four threads runtime; columns: Application, Experiment Iterations 1–10, Average

Table C.16: ORVP/ISS measurement data: four threads runtime; columns: Application, Experiment Iterations 1–10, Average

Table C.17: ORVP/DES measurement data: four threads runtime; columns: Application, Experiment Iterations 1–10, Average

Scheduling Mode  Channel i → j  Latency ci,j  Error ∆ (Iterations 1–5)  abs.  rel.

ASAP:
1 → 5  10 ns  13 ns  13 ns  13 ns  13 ns  13 ns  13 ns  1.30
2 → 1  1 µs  61 ns  62 ns  62 ns  63 ns  62 ns  62 ns  0.06
2 → 3  10 ns  14 ns  14 ns  14 ns  14 ns  14 ns  14 ns  1.40
3 → 4  10 ns  14 ns  14 ns  14 ns  14 ns  14 ns  14 ns  1.41
4 → 2  10 ns  14 ns  14 ns  14 ns  14 ns  14 ns  14 ns  1.40
5 → 3  1 µs  1951 ns  1949 ns  1950 ns  1949 ns  1950 ns  1.95 µs  1.95

ALAP:
1 → 5  10 ns  14 ns  15 ns  15 ns  15 ns  15 ns  14.7 ns  1.47
2 → 1  1 µs  1960 ns  1960 ns  1959 ns  1960 ns  1960 ns  1.96 µs  1.96
2 → 3  10 ns  16 ns  16 ns  16 ns  16 ns  16 ns  15.7 ns  1.57
3 → 4  10 ns  16 ns  16 ns  16 ns  16 ns  16 ns  15.7 ns  1.57
4 → 2  10 ns  16 ns  16 ns  16 ns  16 ns  16 ns  15.6 ns  1.56
5 → 3  1 µs  70 ns  69 ns  70 ns  72 ns  70 ns  70 ns  0.07

Table C.18: SYSTEMC-LINK channel latency network timing

Scheduling Mode  Iteration 1  Iteration 2  Iteration 3  Iteration 4  Iteration 5  Average  Speedup (vs. other)

ASAP  8.19 s  8.18 s  8.17 s  8.17 s  8.18 s  8.18 s  0.85×
ALAP  6.96 s  6.94 s  6.95 s  6.96 s  6.95 s  6.95 s  1.18×

Table C.19: SYSTEMC-LINK channel latency network runtime
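The speedup column of Table C.19 can be cross-checked directly from the two average runtimes; each mode is compared against the other (my own arithmetic, shown only to make the "vs. other" convention explicit):

\[ \frac{6.95\,\text{s}}{8.18\,\text{s}} \approx 0.85 \;\text{(ASAP)}, \qquad \frac{8.18\,\text{s}}{6.95\,\text{s}} \approx 1.18 \;\text{(ALAP)} \]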

Benchmark  Threads  Runtime  Instruction Count (Core 1 / Core 2 / Core 3 / Core 4 / Total)  MIPS

OSCI/UP 1 36.87 s 200 000 500 – – – 200 000 500 5.42 MIPS

SCL/UP 1 37.47 s 200 001 000 – – – 200 001 000 5.34 MIPS

SCL/SMP2  1  37.97 s  200 001 000  195 152 000  –  –  395 153 000  5.20 MIPS
SCL/SMP2  2  19.94 s  200 001 000  195 738 000  –  –  395 739 000  9.92 MIPS

SCL/SMP4  1  78.84 s  200 000 500  198 684 500  198 496 000  199 424 500  796 605 500  5.05 MIPS
SCL/SMP4  2  41.91 s  200 000 500  196 493 500  198 671 500  199 494 500  794 660 000  9.48 MIPS
SCL/SMP4  4  19.94 s  200 000 500  196 758 000  198 787 500  199 507 000  795 053 000  17.53 MIPS

Table C.20: SYSTEMC-LINK measurement data: Linux boot

Glossary

Acronyms

ADL  Architecture Description Language
AIC  Advanced Interrupt Controller
ALAP  As Late As Possible
ALU  Arithmetic Logic Unit
API  Application Programming Interface
ASAP  As Soon As Possible
ASIP  Application Specific Instruction Set Processor
CAS  Compare-and-Swap
CISC  Complex Instruction Set Computer
CPU  Central Processor Unit
DBT  Dynamic Binary Translation
DES  Discrete Event Simulation
DMI  Direct Memory Interface
DNP  Distributed Network Processor
DRAM  Dynamic Random Access Memory
DSE  Design Space Exploration
DSP  Digital Signal Processor
EDA  Electronic Design Automation
ESL  Electronic System Level
EURETILE  European Reference Tiled Architecture Experiment
FFT  Fast-Fourier Transform
FIFO  First In – First Out
FPGA  Field Programmable Gate Array
FPS  Frames Per Second
GCC  GNU Compiler Collection
GEMSCLAIM  Greener Mobile Systems by Cross Layer Integrated Energy Management
GPU  Graphics Processing Unit
GVP  GEMSCLAIM Virtual Platform


HDL  Hardware Description Language
HMP  Heterogeneous Multiprocessing
HPC  High Performance Computing
IMC  Interface Method Call
IoT  Internet of Things
IP  Intellectual Property
ISA  Instruction Set Architecture
ISS  Instruction Set Simulator
ITTB  Intel Thread Building Blocks
ITRS  International Technology Road-map for Semiconductors
KPN  Kahn Process Network
LEM  Load and Energy Monitor
LISA  Language for Instruction Set Architectures
LL  Load-Linked
MIPS  Million Instructions Per Second
MMU  Memory Management Unit
MMC  Multimedia Card
MPI  Message Passing Interface
MPIC  Multi-Processor Interrupt Controller
MPSoC  Multi-Processor System-on-Chip
NoC  Network-on-Chip
NOP  No Operation
NRE  Non-Recurring Engineering
ORVP  OpenRISC Virtual Platform
OS  Operating System
OSCI  Open SystemC Initiative
PDES  Parallel Discrete Event Simulation
PMU  Power Management Unit
PRNG  Pseudo Random Number Generator
pthreads  POSIX Threads
RISC  Reduced Instruction Set Computer
RMW  Read-Modify-Write
RTL  Register Transfer Level
SC  Store-Conditional
SCC  Single-Chip Cloud Computer
SD  Secure Digital
SDL  Simple DirectMedia Library
SLDL  System Level Description Language

SMP  Symmetric Multiprocessing
SOAP  Simple Object Access Protocol
SoC  System-on-Chip
SPI  Serial Peripheral Interface
STOC  SpecC Open Technology Consortium
TAS  Test-and-Set
TLB  Translation Lookaside Buffer
TLM  Transaction Level Modelling
UART  Universal Asynchronous Receiver Transmitter
VCL  Virtual Components Library
VEP  Virtual EURETILE Platform
VLIW  Very-Long Instruction Word
VP  Virtual Platform
WFI  Wait-For-Interrupt

Notation

ti Local timestamp of thread i (SCOPE) or segment i (SYSTEMC-LINK)

tlim,i  Timestamp of thread i or segment i for resynchronization
∆tla  Maximum time difference between two threads in SCOPE
∆tnotify  Time delta between event notification and trigger

∆tsamp Sampling interval for processor activity tracing

∆tq TLM quantum duration

∆ttx  Local time offset annotated during blocking TLM calls
ci,j  SYSTEMC-LINK channel latency between segments i and j

∆εe Notification delay of event e due to time-decoupling

∆εs Update delay of signal s due to time-decoupling

∆εtx Timing error of transaction tx w.r.t. sequential SYSTEMC

∆εtx  Average timing error of all transactions in a simulator
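The last two entries differ only in that the second is averaged over a simulation run (the overline that distinguishes the averaged symbol is lost in this rendering). As a minimal sketch of the relationship (my own formulation; whether magnitudes or signed errors are averaged is an assumption here):

\[ \overline{\Delta\varepsilon_{tx}} = \frac{1}{N}\sum_{k=1}^{N}\left|\Delta\varepsilon_{tx,k}\right| \]

where N is the number of transactions observed by the simulator and ∆εtx,k is the timing error of the k-th transaction with respect to sequential SYSTEMC.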

List of Figures

1.1 ITRS forecast: predicted processor count (CPU and GPU) in embedded consumer electronics (adapted from [85]) ... 3
1.2 HW/SW Codesign: Virtual Platforms enable concurrent HW and SW development, HW design feedback and produce earlier results ... 5
1.3 Anatomy of a VP ... 9
1.4 Synopsys Virtualizer [183] (left) and Windriver Simics [46] (right) ... 10

2.1 Simplified SYSTEMC simulation loop ... 17
2.2 Nondeterministic simulation exposing a process execution order dependency between processes p1 and p2 ... 18
2.3 Data race: lack of synchronisation between intended atomic regions of two threads sometimes leads to erroneous results ... 19
2.4 General race: without enforcing a specific order between threads accessing shared data, program execution may become nondeterministic ... 20
2.5 Causality Error: in order to trigger event A2 at timestamp 30 ns, process A would need to go back in time ... 22
2.6 Conservative parallel simulation using synchronous event processing ... 23
2.7 Conservative parallel simulation using asynchronous event processing ... 23
4.1 The EURETILE platform in a 4 × 2 × 2 configuration ... 36
4.2 The EURETILE software generation toolchain ... 37
4.3 The GEMSCLAIM platform using two RISCs and VLIWs ... 39
4.4 OpenRISC MPSoC in a quad-core configuration ... 42
5.1 Performance of synchronous and asynchronous simulation approaches for two ISSs with fluctuating execution times ... 46
5.2 SCOPE parallel simulation loop with extended notification phase ... 48
5.3 Remote events and remote notification ... 52
5.4 Remote event queue notifications and cancellations ... 54
5.5 Classic and time decoupled TLM transaction timing ... 56
5.6 Augmented TLM target socket in SCOPE ... 57
5.7 Thread partitioning for a 4x4x4 VEP configuration ... 60
5.8 DNP transaction timing ... 61
5.9 Sequential simulation performance in presto (left) and fft (right) ... 62
5.10 Parallel simulation runtime in presto (left) and fft (right) ... 63
5.11 Parallel simulation speedup in presto (left) and fft (right) ... 64
5.12 Speedup with varying lookahead ∆tla ... 65
6.1 Modified transaction timing in deterministic mode ... 71
6.2 Modified transaction timing in fast mode when t0 > t1 ... 72


6.3 Modified transaction timing in fast mode when t0 < t1 ... 73
6.4 Extended notification phase of SCOPE for zero-delay notifications ... 75
6.5 Remote signal including driver stage between processor and clock model ... 76
6.6 VP used for synthetic experiments ... 80
6.7 Relative timing errors for synthetic loads ... 81
6.8 GVP runtime and timing error for SPLASH2 ocean-ncp ... 84
6.9 GVP speedup and relative error for SPLASH2 ocean-ncp ... 85
7.1 Snapshot of the DMI cache model at runtime ... 92
7.2 Transaction-based model using interconnect monitors ... 96
7.3 LL/SC monitor placement for mixed operation with DMI ... 98
7.4 ORVP runtime and LL/SC operations during Linux boot ... 101
7.5 Parallel performance speedup using four threads ... 103
7.6 Parallel performance speedup using two threads ... 103
8.1 Main Components modelled in the OpenRISC ISS ... 110
8.2 Decoded instruction for an addition operation with an immediate value ... 111
8.3 SYSTEMC wrapper for one OpenRISC processor ... 112
8.4 Processor activity using ISS sleep model ... 115
8.5 Extended OpenRISC wrapper with sleep upcall ... 116
8.6 Processor activity using DES sleep model ... 117
8.7 VP runtime and CPU activity in selected mixed-to-high load scenarios ... 120
8.8 Combined speedup with parallel simulation and ISS sleep models ... 122
8.9 Combined speedup with parallel simulation and DES sleep models ... 122

9.1 SYSTEMC-LINK simulation architecture ... 126
9.2 Time-decoupled segments interconnected via latency channels ... 128
9.3 Host memory layout for the VP from Figure 9.2 ... 131
9.4 Cross segment communication via connector blocks ... 132
9.5 Queue-based communication flow ... 133
9.6 IMC-based communication flow ... 135
9.7 Composition of a fully featured VP based on SYSTEMC-LINK ... 136
9.8 Channel latency network experiment setup ... 138
9.9 Average transaction timing error ∆εtx per channel ... 139
9.10 Performance results for the global and local channel latency network experiment variants ... 140
9.11 ORVP split into segments for use with SYSTEMC-LINK ... 142
9.12 Simulation speed of ORVP using OSCI and SYSTEMC-LINK ... 143
B.1 ORVP (left) rendering Mandelbrot set on VGA (right) ... 164

List of Tables

1.1 Summary of SLDLs from industry and academia ... 6

3.1 Overview of related work in parallel SYSTEMC simulation ... 31
3.2 Explanation of abstraction levels used in Table 3.1 ... 32
4.1 Benchmark programs for ORVP including problem size ... 44
6.1 Cross-thread timing errors with flexible time-decoupling ... 78
6.2 Experiment parameters for the synthetic VP ... 80
7.1 LL/SC in embedded RISC ISAs (adapted from [201]) ... 89
7.2 ORVP memory access configurations ... 99
7.3 Benchmark description and problem size ... 102
8.1 Sleep signal instructions in popular embedded architectures ... 108
8.2 Sequential (s) and parallel (p) benchmark applications ... 121
9.1 Experiment segment configuration ... 139
10.1 Overview of simulation technologies and their peak speedups achieved for realistic VPs when using four host threads ... 149
B.1 VCL UART 8250 model registers ... 161
B.2 VCL SPI controller registers ... 162
B.3 VCL SPI controller registers ... 163
B.4 VCL Ethernet model registers ... 165
C.1 Measurement data overview ... 167
C.2 Simulation hosts referred to by Table C.1 ... 168
C.3 EURETILE measurement data: presto application scenario ... 169
C.4 EURETILE measurement data: fft application scenario ... 169
C.5 EURETILE measurement data: presto lookahead (one and two threads) ... 170
C.6 EURETILE measurement data: presto lookahead (four and eight threads) ... 171
C.7 EURETILE measurement data: fft lookahead ... 172
C.8 GEMSCLAIM measurement data: fast simulation mode runtime ... 173
C.9 GEMSCLAIM measurement data: deterministic simulation mode runtime ... 174
C.10 GEMSCLAIM measurement data: fast simulation mode timing ... 175
C.11 GEMSCLAIM measurement data: deterministic simulation mode timing ... 176
C.12 LL/SC operations performed by ORVP during Linux boot ... 177
C.13 ORVP/DMI simulation runtime ... 177
C.14 ORVP/NONE measurement data: single thread runtime ... 178
C.15 ORVP/NONE measurement data: four threads runtime ... 179
C.16 ORVP/ISS measurement data: four threads runtime ... 180


C.17 ORVP/DES measurement data: four threads runtime ... 181
C.18 SYSTEMC-LINK channel latency network timing ... 182
C.19 SYSTEMC-LINK channel latency network runtime ... 182
C.20 SYSTEMC-LINK measurement data: Linux boot ... 182

List of Algorithms

5.1 Dynamic load rebalancing procedure in SCOPE ... 49
5.2 Trigger decision algorithm for remote events ... 53
5.3 Trigger decision algorithm for remote event queues ... 55
7.1 Thread-safe LL/SC implementation using CAS ... 94

8.1 Time synchronisation between ISS and SYSTEMC wrapper ... 114

9.1 Default step routine for SYSTEMC-LINK segments (simplified) ... 129

Bibliography

[1] Accellera Systems Initiative, “SystemC 2.3.1,” 2016. [Online]. Available: http:// accellera.org/downloads/standards/systemc

[2] O. Almer, I. Böhm, T. E. von Koch, B. Franke, S. Kyle, V. Seeker, C. Thompson, and N. Topham, “Scalable multi-core simulation using parallel dynamic binary translation,” in 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, Jul. 2011, pp. 190–199. [Online]. Available: https://doi.org/10.1109/SAMOS.2011.6045461

[3] R. Ammendola, A. Biagioni, O. Frezza, F. L. Cicero, A. Lonardo, P. S. Paolucci, D. Rossetti, A. Salamon, G. Salina, F. Simula, L. Tosoratto, and P. Vicini, “APEnet+: high bandwidth 3d torus direct network for petaflops scale commodity clusters,” Computing Research Repository (CoRR), vol. abs/1102.3796, 2011. [Online]. Available: http://arxiv.org/abs/1102.3796

[4] J. H. Anderson and M. Moir, “Universal constructions for multi-object operations,” in Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, ser. PODC ’95, Aug. 1995, pp. 184–193. [Online]. Available: http://doi.acm.org/10.1145/224964.224985

[5] ARM Holdings, “ARM Launches Cortex-A50 Series, the World’s Most Energy-Efficient 64-bit Processors,” 2012. [Online]. Available: http://www.arm.com/about/newsroom/arm-launches-cortex-a50-series-the-worlds-most-energy-efficient-64-bit-processors.php

[6] ARM Holdings, “ARMv8-A architecture,” 2016. [Online]. Available: http:// www.arm.com/products/processors/armv8-architecture.php

[7] ARM Holdings, “big.LITTLE Technology,” 2016. [Online]. Available: https:// www.arm.com/products/processors/technologies/biglittleprocessing.php

[8] ARM Holdings, “Virtual Prototypes: Fast Models,” 2016. [Online]. Available: http://www.arm.com/products/tools/models/fast-models.php

[9] armdevices.net, “ARM Keynote: ARM Cortex-A53 and ARM Cortex-A57 64bit ARMv8 processors launched,” 2012. [Online]. Available: http://armdevices.net/2012/10/31/arm-keynote-arm-cortex-a53-and-arm-cortex-a57-64bit-armv8-processors-launched/


[10] Atmel Corporation, “DIOPSIS 940HF Datasheet,” 2011. [Online]. Available: http://www.atmel.com/Images/doc7010.pdf

[11] AT&T Laboratories Cambridge, “Virtual Network Computing,” 2002. [Online]. Available: http://www.cl.cam.ac.uk/research/dtg/attarchive/software.html

[12] D. August, C. Jonathan, S. Girbal, D. Gracia-Perez, G. Mouchard, D. A. Penry, O. Temam, and N. Vachharajani, “UNISIM: An Open Simulation Environment and Library for Complex Architecture Design and Collaborative Development,” IEEE Computer Architecture Letters, 2007.

[13] J. Aycock, “A brief history of just-in-time,” ACM Computing Surveys, vol. 35, no. 2, pp. 97–113, Jun. 2003. [Online]. Available: http://doi.acm.org/10.1145/ 857076.857077

[14] R. L. Bagrodia, “Language support for parallel discrete-event simulations,” in Proceedings of the 26th Conference on Winter Simulation, ser. WSC ’94. San Diego, CA, USA: Society for Computer Simulation International, 1994, pp. 1324–1331. [Online]. Available: http://dl.acm.org/citation.cfm?id=193201.194884

[15] F. Balarin, Y. Watanabe, H. Hsieh, L. Lavagno, C. Passerone, and A. Sangiovanni- Vincentelli, “Metropolis: an integrated electronic system design environment,” Computer, 2003.

[16] F. Bellard, “QEMU, a fast and portable dynamic translator,” in USENIX Annual Technical Conference, ser. ATEC ’05, 2005, pp. 41–46. [Online]. Available: http:// www.usenix.org/events/usenix05/tech/freenix/bellard.html

[17] J. Bennett, “Softcores for FPGA: the free and open source alternatives,” EMBECOSM Blog, Nov. 2013. [Online]. Available: http://www.embecosm. com/2013/11/20/softcores-for-fpga-the-free-and-open-source-alternatives/

[18] J. Bennett, J. Chen, D. Lampret, R. Prescott, and J. Rydberg, “Or1ksim: The OpenRISC 1000 Architectural Simulator,” 2016. [Online]. Available: https:// github.com/openrisc/or1ksim

[19] D. C. Black and J. Donovan, SystemC: From the Ground Up. Springer US, 2004.

[20] F. Blanqui, C. Helmstetter, V. Joloboff, J.-F. Monin, and X. Shi, “Designing a cpu model: from a pseudo-formal document to fast code,” in Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, 2011.

[21] I. Böhm, B. Franke, and N. Topham, “Cycle-accurate performance modelling in an ultra-fast just-in-time dynamic binary translation instruction set simulator,” in 2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, Jul. 2010, pp. 1–10. [Online]. Available: https://doi.org/10.1109/ICSAMOS.2010.5642102

[22] I. Böhm, T. J. Edler von Koch, S. C. Kyle, B. Franke, and N. Topham, “Generalized just-in-time trace compilation using a parallel task farm in a dynamic binary translator,” in Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’11. New York, NY, USA: ACM, 2011, pp. 74–85. [Online]. Available: http://doi.acm.org/10. 1145/1993498.1993508

[23] N. Bombieri, S. Vinco, V. Bertacco, and D. Chatterjee, “SystemC Simulation on GP-GPUs: CUDA vs. OpenCL,” in Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, ser. CODES+ISSS ’12. New York, NY, USA: ACM, 2012, pp. 343–352. [Online]. Available: http://doi.acm.org/10.1145/2380445.2380500

[24] D. Bovet and M. Cesati, Understanding The Linux Kernel. Oreilly & Associates Inc, 2005.

[25] G. Braun, A. Hoffmann, A. Nohl, and H. Meyr, “Using static scheduling techniques for the retargeting of high speed, compiled simulators for embedded processors from an abstract machine description,” in Proceedings of the 14th International Symposium on Systems Synthesis, ser. ISSS ’01, 2001, pp. 57–62. [Online]. Available: http://doi.acm.org/10.1145/500001.500014

[26] L. Cai and D. Gajski, “Transaction level modeling: an overview,” in Proceedings of the 1st IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, ser. CODES+ISSS, R. Gupta, Y. Nakamura, A. Orailoglu, and P. H. Chou, Eds. ACM, Oct. 2003, pp. 19–24. [Online]. Available: http:// doi.acm.org/10.1145/944645.944651

[27] Center of Embedded Computer Systems, University of California, Irvine, “SpecC reference compiler,” 2014. [Online]. Available: http://www.cecs.uci. edu/~specc/reference/

[28] C. Cernazanu-Glavan, M. Marcu, A. Amaricai, S. Fedeac, M. Ghenea, Z. Wang, A. Chattopadhyay, J. H. Weinstock, and R. Leupers, “Direct FPGA-based power profiling for a RISC processor,” in 2015 IEEE International Instrumentation and Measurement Technology Conference Proceedings, ser. I2MTC ’15, May 2015, pp. 1578–1583. [Online]. Available: https://doi.org/10.1109/I2MTC.2015.7151514

[29] K. M. Chandy and R. Scherman, “The conditional-event approach to distributed simulation,” SCS Multiconference on Distributed Simulation, 1988. [Online]. Available: www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA212009

[30] K. M. Chandy and J. Misra, “Distributed simulation: A case study in design and verification of distributed programs,” IEEE Transactions on Software Engineering, vol. 5, no. 5, pp. 440–452, 1979. [Online]. Available: http://dx.doi.org/10.1109/TSE.1979.230182

[31] A. Chattopadhyay, H. Meyr, and R. Leupers, “LISA: A Uniform ADL for Em- bedded Processor Modeling, Implementation and Software Toolsuite Genera- tion,” in Processor Description Languages, P. Mishra and N. Dutt, Eds. Morgan Kaufmann, 2008, vol. 1.

[32] W. Chen and R. Dömer, “Optimized out-of-order parallel discrete event simulation using predictions,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’13. San Jose, CA, USA: EDA Consortium, 2013, pp. 3–8. [Online]. Available: http://dl.acm.org/citation.cfm? id=2485288.2485293

[33] W. Chen, X. Han, C. Chang, and R. Dömer, “Advances in parallel discrete event simulation for electronic system-level design,” IEEE Design & Test, vol. 30, no. 1, pp. 45–54, 2013. [Online]. Available: http://dx.doi.org/10.1109/MDT. 2012.2226015

[34] W. Chen, X. Han, and R. Dömer, “Out-of-order parallel simulation for ESL design,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’12. San Jose, CA, USA: EDA Consortium, 2012, pp. 141–146. [Online]. Available: http://dl.acm.org/citation.cfm?id=2492708.2492743

[35] W. Chen, X. Han, and R. Dömer, “May-happen-in-parallel analysis based on segment graphs for safe ESL models,” in Proceedings of the Conference on Design, Automation & Test in Europe, ser. DATE ’14. 3001 Leuven, Belgium, Belgium: European Design and Automation Association, 2014, pp. 287:1–287:6. [Online]. Available: http://dl.acm.org/citation.cfm?id=2616606.2617025

[36] CHIST-ERA, “The GEMSCLAIM project,” 2014. [Online]. Available: http:// www.chistera.eu/projects/gemsclaim/

[37] B. Chopard, P. Combes, and J. Zory, “A conservative approach to SystemC parallelization,” in Computational Science - ICCS 2006, 6th International Conference, Proceedings, Part IV, ser. Lecture Notes in Computer Science, V. N. Alexandrov, G. D. van Albada, P. M. A. Sloot, and J. Dongarra, Eds., vol. 3994. Springer, May 2006, pp. 653–660. [Online]. Available: http://dx.doi.org/10. 1007/11758549_89

[38] M. Chung, J. Kim, and S. Ryu, “SimParallel: A high performance parallel SystemC simulator using hierarchical multi-threading,” in IEEE International Symposium on Circuits and Systems, ser. ISCAS ’14, Jun. 2014, pp. 1472–1475. [Online]. Available: http://dx.doi.org/10.1109/ISCAS.2014.6865424

[39] B. Cmelik and D. Keppel, “Shade: A fast instruction-set simulator for execution profiling,” in Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS ’94. New York, NY, USA: ACM, 1994, pp. 128–137. [Online]. Available: http://doi.acm.org/10.1145/183018.183032

[40] P. Combes, E. Caron, F. Desprez, B. Chopard, and J. Zory, “Relaxing synchronization in a parallel SystemC kernel,” in IEEE International Symposium on Parallel and Distributed Processing with Applications, ser. ISPA ’08. IEEE Computer Society, Dec. 2008, pp. 180–187. [Online]. Available: http://dx.doi. org/10.1109/ISPA.2008.124

[41] D. R. Cox, “RITSim: distributed SystemC simulation,” Ph.D. dissertation, Rochester Institute of Technology, 2005. [Online]. Available: http://scholarworks.rit.edu/theses/5504/

[42] D. Dechev, P. Pirkelbauer, and B. Stroustrup, “Understanding and effectively preventing the ABA problem in descriptor-based lock-free designs,” in 13th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, ser. ISORC 2010, 2010, pp. 185–192. [Online]. Available: https://doi.org/10.1109/ISORC.2010.10

[43] R. Dömer, A. Gerstlauer, and D. D. Gajski, SpecC Language Reference Manual, 2002.

[44] Doulos Ltd, “SystemC Guide,” 2017. [Online]. Available: https://www.doulos. com/knowhow/systemc/

[45] EEMBC, “CoreMark - Processor Benchmark,” 2017. [Online]. Available: http:// www.eembc.org/coremark/

[46] J. Engblom, “Wind river simics for software development,” 2015. [Online]. Available: http://www.windriver.com/whitepapers/simics-for-software-development/

[47] J. Engblom, D. Aarno, and B. Werner, Processor and System-on-Chip Simulation. Boston, MA: Springer US, 2010, ch. Full-System Simulation from Embedded to High-Performance Systems, pp. 25–45. [Online]. Available: http://dx.doi.org/ 10.1007/978-1-4419-6175-4_3

[48] R. Felker et al., “musl libc,” 2017. [Online]. Available: https://www.musl-libc. org

[49] Free Software Foundation, Inc., “GDB: The GNU Project Debugger,” 2016. [Online]. Available: https://www.gnu.org/software/gdb/

[50] Free Software Foundation, Inc., “Atomic Builtins - Using the GNU Compiler Collection (GCC),” 2017. [Online]. Available: https://gcc.gnu.org/onlinedocs/ gcc-4.1.0/gcc/Atomic-Builtins.html

[51] R. M. Fujimoto, “Parallel discrete event simulation,” in Proceedings of the 21st Conference on Winter Simulation, ser. WSC ’89. New York, NY, USA: ACM, 1989, pp. 19–28. [Online]. Available: http://doi.acm.org/10.1145/76738.76741

[52] R. M. Fujimoto, “Parallel discrete event simulation,” Communications of the ACM, vol. 33, no. 10, pp. 30–53, Oct. 1990. [Online]. Available: http://doi.acm.org/10.1145/84537.84545

[53] R. M. Fujimoto, Parallel and Distributed Simulation Systems, 1st ed. New York, NY, USA: John Wiley & Sons, Inc., 1999.

[54] R. M. Fujimoto, “Parallel simulation: Distributed simulation systems,” in Proceedings of the 35th Conference on Winter Simulation: Driving Innovation, ser. WSC ’03. Winter Simulation Conference, 2003, pp. 124–134. [Online]. Available: http://dl.acm.org/citation.cfm?id=1030818.1030836

[55] D. D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, and S. Zhao, SpecC: specification language and methodology. Springer Science & Business Media, 2012.

[56] L. Gao, K. Karuri, S. Kraemer, R. Leupers, G. Ascheid, and H. Meyr, “Mul- tiprocessor performance estimation using hybrid simulation,” in 2008 45th ACM/IEEE Design Automation Conference, Jun. 2008, pp. 325–330.

[57] D. Gasparovski and K. Price, “Slirp, the PPP/SLIP-on-terminal emulator,” 2016. [Online]. Available: http://slirp.sourceforge.net/

[58] B. Gatliff, “Embedding with GNU: the GDB remote serial protocol,” Embedded Systems Programming, vol. 12, pp. 108–113, 1999.

[59] G. Georgakoudis, D. S. Nikolopoulos, and S. Lalis, “Fast dynamic binary rewriting to support thread migration in shared-ISA asymmetric multicores,” in Proceedings of the First International Workshop on Code Optimisation for Multi and Many Cores, ser. COSMIC@CGO ’13, Z. Wang and H. Leather, Eds. ACM, Feb. 2013, p. 4. [Online]. Available: http://doi.acm.org/10.1145/2446920. 2446924

[60] A. Gerstlauer, R. Dömer, J. Peng, and D. D. Gajski, System Design - A Practical Guide with SpecC. Springer, 2001. [Online]. Available: http://dx.doi.org/10. 1007/978-1-4615-1481-7

[61] GNU Project, “C++ Standards Support in GCC,” 2017. [Online]. Available: https://gcc.gnu.org/projects/cxx-status.html

[62] GNU Project, “GCC, the GNU Compiler Collection,” 2017. [Online]. Available: https://gcc.gnu.org/

[63] GNU Project, “The GNU C Library (glibc),” 2017. [Online]. Available: https:// www.gnu.org/software/libc/

[64] GNU Project, “Using the GNU Compiler Collection (GCC): Standards,” 2017. [Online]. Available: https://gcc.gnu.org/onlinedocs/gcc/Standards.html

[65] A. Graf, “QEMU – Aarch64 translation,” 2013. [Online]. Available: https:// github.com/qemu/qemu/blob/c4a6a8887c1b2a669e35ff9da9530824300bdce4/ target/arm/translate-a64.c#L1843

[66] R. Grisenthwaite, “ARM TechCon 2011: ARMv8 Technology Preview,” 2011. [Online]. Available: http://www.arm.com/files/downloads/ARMv8_Architecture.pdf

[67] T. Grötker, S. Liao, G. Martin, and S. Swan, System Design with SystemC. Springer US, 2002.

[68] X. Guerin and F. Pétrot, “A system framework for the design of embedded software targeting heterogeneous multi-core socs,” in 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ser. ASAP ’09. IEEE Computer Society, Jul. 2009, pp. 153–160. [Online]. Available: http://dx.doi.org/10.1109/ASAP.2009.9

[69] Z. Hao, L. Qian, H. Li, X. Xie, and K. Zhang, “A Parallel SystemC Environment: ArchSC,” in 2009 15th International Conference on Parallel and Distributed Systems, Dec. 2009, pp. 617–623. [Online]. Available: https://doi.org/10.1109/ICPADS.2009.28

[70] Z. Hao, L. Qian, H. Li, X. Xie, and K. Zhang, “A parallel logic simulation framework: Study, implementation, and performance,” in Proceedings of the 2010 Spring Simulation Multiconference, ser. SpringSim ’10. San Diego, CA, USA: Society for Computer Simulation International, 2010, pp. 150:1–150:10. [Online]. Available: https://doi.org/10.1145/1878537.1878694

[71] C. Helmstetter and V. Joloboff, “SimSoC: A SystemC TLM integrated ISS for full system simulation,” in Asia Pacific Conference on Circuits and Systems (APCCAS), 2008.

[72] J. L. Hennessy and D. A. Patterson, Computer Architecture - A Quantitative Approach, 5th Edition. Morgan Kaufmann, 2012.

[73] R. Herveille, “SPI core specifications,” Jan. 2003. [Online]. Available: https://opencores.org/project,simple_spi

[74] R. Herveille, “VGA/LCD Core v2.0 Specifications,” Mar. 2003. [Online]. Available: https://opencores.org/project,vga_lcd

[75] A. Hoffmann, H. Meyr, and R. Leupers, Architecture exploration for embedded processors with LISA. Kluwer, 2002.

[76] J. Howard, S. Dighe, S. R. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V. K. De, and R. V. D. Wijngaart, “A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling,” IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 173–183, Jan. 2011.

[77] K. Huang, I. Bacivarov, F. Hugelshofer, and L. Thiele, “Scalably distributed SystemC simulation for embedded applications,” in 2008 International Symposium on Industrial Embedded Systems, Jun. 2008, pp. 271–274. [Online]. Available: https://doi.org/10.1109/SIES.2008.4577715

[78] IEEE Computer Society, “IEEE Standard for SystemVerilog–Unified Hardware Design, Specification, and Verification Language,” IEEE Std 1800-2005, 2005.

[79] IEEE Computer Society, “IEEE Standard SystemC Language Reference Manual,” IEEE Std 1666-2005, 2005.

[80] IEEE Computer Society, “IEEE Standard SystemC Language Reference Manual,” IEEE Std 1666-2011 (Revision of IEEE Std 1666-2005), 2011.

[81] IEEE Computer Society, “IEEE Standard for SystemVerilog–Unified Hardware Design, Specification, and Verification Language,” IEEE Std 1800-2012 (Revision of IEEE Std 1800-2009), 2013.

[82] Imperas Ltd., “Open Virtual Platforms,” 2017. [Online]. Available: http://www.ovpworld.org

[83] Intel Corporation, Intel(R) 64 and IA-32 Architectures Software Developer Manuals, 2016. [Online]. Available: https://software.intel.com/en-us/articles/intel-sdm

[84] International Technology Roadmap for Semiconductors (ITRS), “Design,” 2001. [Online]. Available: http://www.itrs2.net

[85] International Technology Roadmap for Semiconductors (ITRS), “System integration focus team,” 2015. [Online]. Available: http://www.itrs2.net

[86] D. R. Jefferson, “Virtual time,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 7, no. 3, pp. 404–425, Jul. 1985. [Online]. Available: http://doi.acm.org/10.1145/3916.3988

[87] E. H. Jensen, G. W. Hagensen, and J. M. Broughton, “A new approach to exclusive data access in shared memory multiprocessors,” Lawrence Livermore National Laboratory, Tech. Rep. UCRL-97663, Nov. 1987. [Online]. Available: https://e-reports-ext.llnl.gov/pdf/212157.pdf

[88] D. Jones and N. Topham, High Speed CPU Simulation Using LTU Dynamic Binary Translation. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 50–64. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-92990-1_6

[89] R. Jones, “libguestfs, library for accessing and modifying VM disk images,” 2014. [Online]. Available: http://libguestfs.org/

[90] S. Jones, “Optimistic parallelisation of SystemC,” M2R Placement Report, 2011. [Online]. Available: http://www-verimag.imag.fr/~moy/IMG/pdf/report-2.pdf

[91] H. Jordan, “Insieme - A compiler infrastructure for parallel programs,” Ph.D. dissertation, Faculty of Mathematics, Computer Science and Physics of the University of Innsbruck, Aug. 2014. [Online]. Available: http://www.dps.uibk.ac.at/~csaf7445/pub/phd_thesis_jordan.pdf

[92] H. Jordan, S. Pellegrini, P. Thoman, K. Kofler, and T. Fahringer, “INSPIRE: the insieme parallel intermediate representation,” in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, C. Fensch, M. F. P. O’Boyle, A. Seznec, and F. Bodin, Eds. IEEE Computer Society, Sep. 2013, pp. 7–17. [Online]. Available: http://dx.doi.org/10.1109/PACT.2013.6618799

[93] J. Jovic, S. Yakoushkin, L. Murillo, J. Eusse, R. Leupers, and G. Ascheid, “Hybrid simulation for extensible processor cores,” in 2012 Design, Automation Test in Europe Conference Exhibition (DATE), Mar. 2012, pp. 288–291. [Online]. Available: https://doi.org/10.1109/DATE.2012.6176480

[94] L. Kaouane, D. Houzet, and S. Huet, “SysCellC: SystemC on Cell,” in 2008 International Conference on Computational Sciences and Its Applications, ser. ICCSA ’08, Jun. 2008, pp. 234–244. [Online]. Available: https://doi.org/10.1109/ICCSA.2008.63

[95] B. W. Kernighan, The C Programming Language, 2nd ed., D. M. Ritchie, Ed. Prentice Hall Professional Technical Reference, 1988.

[96] O. Kindgren, S. Kristiansson, F. Jullien et al., “ORPSoC core description files for FuseSoC,” 2017. [Online]. Available: https://github.com/openrisc/orpsoc-cores

[97] O. Kindgren and J. McCrone, “OpenRISC 1200 implementation,” 2017. [Online]. Available: https://github.com/openrisc/or1200

[98] J. Knoble, “Almost internet with SLiRP and PPP,” Linux Journal, 1996.

[99] S. Kraemer, L. Gao, J. Weinstock, R. Leupers, G. Ascheid, and H. Meyr, “HySim: A fast simulation framework for embedded software development,” in 2007 5th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Sep. 2007, pp. 75–80. [Online]. Available: https://doi.org/10.1145/1289816.1289837

[100] M. Krasnyansky, “VTUN – Virtual Tunnels over TCP/IP networks,” 2012. [Online]. Available: http://vtun.sourceforge.net/

[101] M. Krasnyansky, M. Yevmenkin, and F. Thiel, “Universal TUN/TAP device driver,” 2002. [Online]. Available: http://www.kernel.org/doc/Documentation/networking/tuntap.txt

[102] G. D. Krebs, “TechEdSat,” Jan. 2012. [Online]. Available: http://space.skyrocket.de/doc_sdat/techedsat.htm

[103] D. Krikun, “GitHub - dkrikun/syscpar,” 2010. [Online]. Available: http://github.com/dkrikun/syscpar

[104] D. Krikun, U. Polina, and A. Efrati, “Multi-threading of SystemC scheduler,” 2010. [Online]. Available: http://webee.technion.ac.il/vlsi/Projects/Archive/2010/Danniel_Paulina.pdf

[105] S. Kristiansson, J. Baxter, S. Wallentowitz et al., “mor1kx - an OpenRISC 1000 processor IP core,” 2017. [Online]. Available: https://github.com/openrisc/mor1kx

[106] S. Kristiansson et al., “OpenRISC Linux,” 2017. [Online]. Available: https://github.com/openrisc/linux

[107] S. Lantinga et al., “Simple DirectMedia Layer – Homepage,” 2017. [Online]. Available: https://www.libsdl.org

[108] Lauterbach GmbH, “Lauterbach Development Tools,” 2016. [Online]. Available: http://www.lauterbach.com

[109] R. Leupers, J. Elste, and B. Landwehr, “Generation of interpretive and compiled instruction set simulators,” in Design Automation Conference, 1999. Proceedings of the ASP-DAC ’99. Asia and South Pacific, Jan. 1999, pp. 339–342 vol.1. [Online]. Available: https://doi.org/10.1109/ASPDAC.1999.760028

[110] X. Liao and T. Srikanthan, “Accelerating UNISIM-based cycle-level microarchitectural simulations on multicore platforms,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 16, no. 3, pp. 26:1–26:25, Jun. 2011. [Online]. Available: http://doi.acm.org/10.1145/1970353.1970359

[111] R. Lipsett, C. A. Ussery, and C. F. Schaefer, VHDL, Hardware Description and Design. Kluwer Academic Publishers, 1993.

[112] G. Liu, T. Schmidt, and R. Dömer, “Out-of-order Parallel Simulation of SystemC Models using Intel MIC Architecture,” in Proceedings of the 20th North American SystemC User’s Group Meeting, Jun. 2014. [Online]. Available: http://nascug.org/events/20th/4-NASCUG20-OOParallel-RainerDomer.pdf

[113] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner, “Simics: A full system simulation platform,” IEEE Computer, vol. 35, no. 2, pp. 50–58, 2002. [Online]. Available: https://doi.org/10.1109/2.982916

[114] M. Marcu, O. Boncalo, M. Ghenea, A. Amaricai, J. H. Weinstock, R. Leupers, Z. Wang, G. Georgakoudis, D. S. Nikolopoulos, C. Cernazanu-Glavan, L. Bara, and M. Ionascu, “Low-cost hardware infrastructure for runtime thread level energy accounting,” in Architecture of Computing Systems - ARCS 2016 - 29th International Conference, Proceedings, ser. Lecture Notes in Computer Science, F. Hannig, J. M. P. Cardoso, T. Pionteck, D. Fey, W. Schröder-Preikschat, and J. Teich, Eds., vol. 9637. Springer, 2016, pp. 277–289. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-30695-7_21

[115] C. Marinas, “[patch 00/36] aarch64 linux kernel port,” 2012. [Online]. Available: http://thread.gmane.org/gmane.linux.kernel/1324121

[116] S. Meftali and J.-L. Dekeyser, “An optimal charge balancing model for fast distributed SystemC simulation in IP/SoC design,” in Proceedings of the 4th IEEE International Workshop on System-on-Chip for Real-Time Applications, Jul. 2004, pp. 55–58. [Online]. Available: http://dx.doi.org/10.1109/IWSOC.2004.1319849

[117] S. Meftali, A. Dziri, L. Charest, P. Marquet, and J.-L. Dekeyser, “SOAP Based Distributed Simulation Environment for System-on-Chip (SoC) Design,” in Forum on Specification and Design Languages, ser. FDL ’05, Mar. 2005, pp. 283–291. [Online]. Available: http://www.ecsi-association.org/ecsi/main.asp?l1=library&fn=def&id=465

[118] S. Meftali, J. Vennin, and J.-L. Dekeyser, “Automatic generation of geographically distributed, SystemC simulation models for IP/SoC design,” in 2003 46th Midwest Symposium on Circuits and Systems, vol. 3, Dec. 2003, pp. 1496–1498 Vol. 3. [Online]. Available: https://doi.org/10.1109/MWSCAS.2003.1562579

[119] A. Mello, I. Maia, A. Greiner, and F. Pecheux, “Parallel simulation of SystemC TLM 2.0 compliant MPSoC on SMP workstations,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’10. 3001 Leuven, Belgium, Belgium: European Design and Automation Association, 2010, pp. 606–609. [Online]. Available: http://dl.acm.org/citation.cfm?id=1870926.1871069

[120] S. Meyers, More Effective C++: 35 New Ways to Improve Your Programs and Designs. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1995.

[121] S. Meyers, Effective C++: 55 Specific Ways to Improve Your Programs and Designs, 3rd ed. Addison-Wesley Professional, 2005.

[122] Micron Technology, Inc., “Micron M25P80 Serial Flash Embedded Memory,” 2011. [Online]. Available: https://www.micron.com/~/media/documents/products/data-sheet/nor-flash/serial-nor/m25p/m25p80.pdf

[123] Microsoft Corporation, “Any Developer, Any App, Any Platform – Visual Studio,” 2016. [Online]. Available: https://www.visualstudio.com/

[124] C. Mills, S. C. Ahalt, and J. Fowler, “Compiled instruction set simulation,” Software: Practice and Experience, vol. 21, no. 8, pp. 877–889, 1991. [Online]. Available: https://dx.doi.org/10.1002/spe.4380210807

[125] J. Misra, “Distributed discrete-event simulation,” ACM Computing Surveys (CSUR), vol. 18, no. 1, pp. 39–65, Mar. 1986. [Online]. Available: http://doi.acm.org/10.1145/6462.6485

[126] I. Mohor, Knguyen, O. Kindgren, T. Markovic, and M. Unneback, “OpenCores Ethernet MAC 10/100 Mbps Overview,” Nov. 2002. [Online]. Available: https://opencores.org/project,ethmac,overview

[127] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics, vol. 38, 1965.

[128] M. Moy, “sc-during: Parallel Programming on Top of SystemC,” 2012. [Online]. Available: http://www-verimag.imag.fr/~moy/?sc-during-Parallel-Programming-on

[129] M. Moy, “Parallel programming with SystemC for loosely timed models: A non-intrusive approach,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’13. San Jose, CA, USA: EDA Consortium, 2013, pp. 9–14. [Online]. Available: http://dl.acm.org/citation.cfm?id=2485288.2485294

[130] M. Nanjundappa, A. Kaushik, H. D. Patel, and S. K. Shukla, “Accelerating SystemC simulations using GPUs,” in 2012 IEEE International High Level Design Validation and Test Workshop, ser. HLDVT ’12, Nov. 2012, pp. 132–139. [Online]. Available: https://doi.org/10.1109/HLDVT.2012.6418255

[131] M. Nanjundappa, H. D. Patel, B. A. Jose, and S. K. Shukla, “SCGPSim: A Fast SystemC Simulator on GPUs,” in Proceedings of the 2010 Asia and South Pacific Design Automation Conference, ser. ASPDAC ’10. Piscataway, NJ, USA: IEEE Press, 2010, pp. 149–154. [Online]. Available: http://dl.acm.org/citation.cfm?id=1899721.1899753

[132] National Semiconductor, “PC16450C/NS16450, PC8250A/INS8250A Universal Asynchronous Receiver/Transmitter,” Jul. 1990. [Online]. Available: http://archive.pcjs.org/pubs/pc/datasheets/8250A-UART.pdf

[133] R. H. B. Netzer and B. P. Miller, “What are race conditions?: Some issues and formalizations,” ACM Letters on Programming Languages and Systems, 1992.

[134] S. H. A. Niaki and I. Sander, “An automated parallel simulation flow for heterogeneous embedded systems,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’13. San Jose, CA, USA: EDA Consortium, 2013, pp. 27–30. [Online]. Available: http://dl.acm.org/citation.cfm?id=2485288.2485297

[135] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable Parallel Programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008. [Online]. Available: http://doi.acm.org/10.1145/1365490.1365500

[136] D. Nicol and R. Fujimoto, “Parallel simulation today,” Annals of Operations Research, vol. 53, no. 1, pp. 249–285, 1994. [Online]. Available: http://dx.doi.org/10.1007/BF02136831

[137] D. Nicol and P. Heidelberger, “Parallel execution for serial simulators,” ACM Transactions on Modeling and Computer Simulation (TOMACS), vol. 6, no. 3, pp. 210–242, Jul. 1996. [Online]. Available: http://doi.acm.org/10.1145/235025.235031

[138] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and A. Hoffmann, “A universal technique for fast and flexible instruction-set architecture simulation,” in Proceedings 2002 Design Automation Conference (IEEE Cat. No.02CH37324), Jun. 2002, pp. 22–27. [Online]. Available: https://doi.org/10.1109/DAC.2002.1012588

[139] A. Nohl, F. Schirrmeister, and D. Taussig, “Application specific architectures, design methods and tools,” in Proceedings of the International Conference on Computer-Aided Design, ser. ICCAD ’10. Piscataway, NJ, USA: IEEE Press, 2010, pp. 349–352. [Online]. Available: http://dl.acm.org/citation.cfm?id=2133429.2133503

[140] NVIDIA Corporation, “CUDA C Programming Guide,” 2017. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide

[141] F. O’Brien, The Apollo Guidance Computer: Architecture and Operation, ser. Springer Praxis Books. Praxis, 2010.

[142] Open SystemC Initiative (OSCI), “OSCI TLM-2.0 Language Reference Manual,” 2009. [Online]. Available: https://accellera.org/images/downloads/standards/systemc/TLM_2_0_LRM.pdf

[143] E. P, P. Chandran, J. Chandra, B. P. Simon, and D. Ravi, “Parallelizing SystemC kernel for fast hardware simulation on SMP machines,” in Proceedings of the 2009 ACM/IEEE/SCS 23rd Workshop on Principles of Advanced and Distributed Simulation, ser. PADS ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 80–87. [Online]. Available: http://dx.doi.org/10.1109/PADS.2009.25

[144] P. S. Paolucci, I. Bacivarov, G. Goossens, R. Leupers, F. Rousseau, C. Schumacher, L. Thiele, and P. Vicini, “EURETILE 2010-2012 summary: first three years of activity of the european reference tiled experiment,” Computing Research Repository (CoRR), vol. abs/1305.1459, 2013. [Online]. Available: http://arxiv.org/abs/1305.1459

[145] P. S. Paolucci, A. Biagioni, L. G. Murillo, F. Rousseau, L. Schor, L. Tosoratto, I. Bacivarov, R. L. Bücs, C. Deschamps, A. E. Antably, R. Ammendola, N. Fournel, O. Frezza, R. Leupers, F. L. Cicero, A. Lonardo, M. Martinelli, E. Pastorelli, D. Rai, D. Rossetti, F. Simula, L. Thiele, P. Vicini, and J. H. Weinstock, “Dynamic many-process applications on many-tile embedded systems and HPC clusters: The EURETILE programming environment and execution platforms,” Journal of Systems Architecture - Embedded Systems Design, vol. 69, pp. 29–53, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.sysarc.2015.11.008

[146] J. Peeters, N. Ventroux, T. Sassolas, and L. Lacassagne, “A SystemC TLM framework for distributed simulation of complex systems with unpredictable communication,” in Proceedings of the 2011 Conference on Design Architectures for Signal Image Processing, ser. DASIP ’11, Nov. 2011. [Online]. Available: https://doi.org/10.1109/DASIP.2011.6136847

[147] Qualcomm, “Qualcomm Snapdragon Processors,” 2016. [Online]. Available: https://www.qualcomm.co.uk/products/snapdragon

[148] D. I. Rich, “The evolution of SystemVerilog,” IEEE Design & Test, vol. 20, no. 4, pp. 82–84, Jul. 2003. [Online]. Available: http://dx.doi.org/10.1109/MDT.2003.1214355

[149] C. Roth, H. Bucher, S. Reder, F. Buciuman, O. Sander, and J. Becker, “A SystemC modeling and simulation methodology for fast and accurate parallel MPSoC simulation,” in 26th Symposium on Integrated Circuits and Systems Design, ser. SBCCI ’13, Sep. 2013, pp. 1–6. [Online]. Available: http://dx.doi.org/10.1109/SBCCI.2013.6644853

[150] C. Roth, H. Bucher, S. Reder, O. Sander, and J. Becker, “Improving parallel MPSoC simulation performance by exploiting dynamic routing delay prediction,” in 2013 8th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip, ser. ReCoSoC ’13, Jul. 2013, pp. 1–8. [Online]. Available: http://dx.doi.org/10.1109/ReCoSoC.2013.6581524

[151] C. Roth, S. Reder, G. Erdogan, O. Sander, G. M. Almeida, H. Bucher, and J. Becker, “Asynchronous parallel MPSoC simulation on the Single-Chip Cloud Computer,” in 2012 International Symposium on System on Chip (SoC), Oct. 2012, pp. 1–8. [Online]. Available: https://doi.org/10.1109/ISSoC.2012.6376364

[152] C. Roth, S. Reder, O. Sander, M. Hübner, and J. Becker, “A framework for exploration of parallel SystemC simulation on the single-chip cloud computer,” in Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques, ser. SIMUTOOLS ’12. ICST, Brussels, Belgium, Belgium: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2012, pp. 202–207. [Online]. Available: http://dl.acm.org/citation.cfm?id=2263019.2263046

[153] R. Salimi Khaligh, “Transaction level modeling and high performance simulation of embedded systems,” Ph.D. dissertation, Faculty of Computer Science, Electrical Engineering and Information Technology, University of Stuttgart, Germany, 2012.

[154] R. Salimi Khaligh and M. Radetzki, Efficient Parallel Transaction Level Simulation by Exploiting Temporal Decoupling. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 149–158. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-04284-3_14

[155] R. Salimi Khaligh and M. Radetzki, “Modeling constructs and kernel for parallel simulation of accuracy adaptive TLMs,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’10. 3001 Leuven, Belgium, Belgium: European Design and Automation Association, 2010, pp. 1183–1188. [Online]. Available: http://dl.acm.org/citation.cfm?id=1870926.1871212

[156] Samsung, “Samsung Exynos Processors,” 2016. [Online]. Available: http://www.samsung.com/semiconductor/minisite/Exynos/w/

[157] I. Sander and A. Jantsch, “System modeling and transformational design refine- ment in ForSyDe [formal system design],” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2004.

[158] C. Sauer, H.-M. Bluethgen, and H.-P. Loeb, “Distributed, loosely-synchronized SystemC/TLM simulations of many-processor platforms,” in Proceedings of the 2014 Forum on Specification and Design Languages, ser. FDL ’14, Oct. 2014. [Online]. Available: https://doi.org/10.1109/FDL.2014.7119360

[159] N. Savoiu, S. Shukla, and R. Gupta, “Automated concurrency re-assignment in high level system models for efficient system-level simulation,” in Proceedings of the Conference on Design, Automation and Test in Europe, ser. DATE ’02. Washington, DC, USA: IEEE Computer Society, 2002, pp. 875–. [Online]. Available: http://dl.acm.org/citation.cfm?id=882452.874566

[160] N. Savoiu, S. K. Shukla, and R. K. Gupta, “Concurrency in system level design: Conflict between simulation and synthesis goals,” in Proceedings of the 11th IEEE/ACM Workshop on Logic & Synthesis, ser. IWLS ’02, Jan. 2002, pp. 407–411.

[161] J. Schnerr, O. Bringmann, and W. Rosenstiel, “Cycle accurate binary translation for simulation acceleration in rapid prototyping of SoCs,” in Proceedings of the Conference on Design, Automation and Test in Europe - Volume 2, ser. DATE ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 792–797. [Online]. Available: http://dx.doi.org/10.1109/DATE.2005.106

[162] L. Schor, I. Bacivarov, L. G. Murillo, P. S. Paolucci, F. Rousseau, A. E. Antably, R. Bücs, N. Fournel, R. Leupers, D. Rai, L. Thiele, L. Tosoratto, P. Vicini, and J. H. Weinstock, “EURETILE design flow: Dynamic and fault tolerant mapping of multiple applications onto many-tile systems,” in IEEE International Symposium on Parallel and Distributed Processing with Applications, ser. ISPA ’14. IEEE Computer Society, Aug. 2014, pp. 182–189. [Online]. Available: http://dx.doi.org/10.1109/ISPA.2014.32

[163] L. Schor, I. Bacivarov, D. Rai, H. Yang, S. Kang, and L. Thiele, “Scenario-based design flow for mapping streaming applications onto on-chip many-core systems,” in Proceedings of the 15th International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, ser. CASES ’12, A. Jerraya, L. P. Carloni, V. J. M. III, and R. M. Rabbah, Eds. ACM, Oct. 2012, pp. 71–80. [Online]. Available: http://doi.acm.org/10.1145/2380403.2380422

[164] C. Schumacher, “Construction of parallel and distributed SystemC simulators,” Ph.D. dissertation, Fakultät für Elektrotechnik und Informationstechnik, Rheinisch-Westfälische Technische Hochschule Aachen, 2015.

[165] C. Schumacher, R. Leupers, D. Petras, and A. Hoffmann, “parSC: Synchronous parallel SystemC simulation on multi-core host architectures,” in Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, ser. CODES/ISSS ’10. New York, NY, USA: ACM, 2010, pp. 241–246. [Online]. Available: http://doi.acm.org/10.1145/1878961.1879005

[166] C. Schumacher, J. Weinstock, R. Leupers, and G. Ascheid, “SCandal: SystemC analysis for nondeterminism anomalies,” in Proceedings of the 2012 Forum on Specification and Design Languages, ser. FDL ’12. IEEE, Sep. 2012, pp. 112–119. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=6336995

[167] C. Schumacher, J. H. Weinstock, R. Leupers, and G. Ascheid, “Cause and effect of nondeterministic behavior in sequential and parallel SystemC simulators,” in 2012 IEEE International High Level Design Validation and Test Workshop, ser. HLDVT ’12. IEEE Computer Society, Nov. 2012, pp. 124–131. [Online]. Available: http://dx.doi.org/10.1109/HLDVT.2012.6418254

[168] C. Schumacher, J. H. Weinstock, R. Leupers, G. Ascheid, L. Tosoratto, A. Lonardo, D. Petras, and T. Grötker, “legaSCi: Legacy SystemC model integration into parallel SystemC simulators,” in 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, ser. VIPES ’13. IEEE, May 2013, pp. 2188–2193. [Online]. Available: http://dx.doi.org/10.1109/IPDPSW.2013.34

[169] C. Schumacher, J. H. Weinstock, R. Leupers, G. Ascheid, L. Tosoratto, A. Lonardo, D. Petras, and A. Hoffmann, “legaSCi: Legacy SystemC model integration into parallel simulators,” ACM Transactions on Embedded Computing Systems (TECS), vol. 13, no. 5s, pp. 165:1–165:24, 2014. [Online]. Available: http://doi.acm.org/10.1145/2678018

[170] S. Schürmans, D. Zhang, D. Auras, R. Leupers, G. Ascheid, X. Chen, and L. Wang, “Creation of ESL power models for communication architectures using automatic calibration,” in The 50th Annual Design Automation Conference, ser. DAC ’13. ACM, Aug. 2013, pp. 58:1–58:58. [Online]. Available: http://doi.acm.org/10.1145/2463209.2488804

[171] S. Schürmans, D. Zhang, R. Leupers, G. Ascheid, and X. Chen, “Improving ESL power models using switching activity information from timed functional models,” in 17th International Workshop on Software and Compilers for Embedded Systems, ser. SCOPES ’14, H. Corporaal and S. Stuijk, Eds. ACM, Jun. 2014, pp. 89–97. [Online]. Available: http://doi.acm.org/10.1145/2609248.2609250

[172] R. Sinha, A. Prakash, and H. D. Patel, “Parallel simulation of mixed-abstraction SystemC models on GPUs and multicore CPUs,” in 17th Asia and South Pacific Design Automation Conference, ser. ASPDAC ’12, Jan. 2012, pp. 455–460. [Online]. Available: https://doi.org/10.1109/ASPDAC.2012.6164991

[173] S. Sirowy, C. Huang, and F. Vahid, “Dynamic acceleration management for SystemC emulation,” SIGBED Rev., vol. 6, no. 3, pp. 3:1–3:4, Oct. 2009. [Online]. Available: http://doi.acm.org/10.1145/1851340.1851345

[174] S. Sirowy, C. Huang, and F. Vahid, “Online SystemC emulation acceleration,” in Proceedings of the 47th Design Automation Conference, ser. DAC ’10. New York, NY, USA: ACM, 2010, pp. 30–35. [Online]. Available: http://doi.acm.org/10.1145/1837274.1837284

[175] S. S. Sirowy, B. Miller, and F. Vahid, “Portable SystemC-on-a-chip,” in Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, ser. CODES+ISSS ’09. New York, NY, USA: ACM, 2009, pp. 21–30. [Online]. Available: http://doi.acm.org/10.1145/1629435.1629439

[176] R. Smith, “More on Apple’s A9X SoC,” 2015. [Online]. Available: http://www.anandtech.com/show/9824/more-on-apples-a9x-soc

[177] R. Stallman, R. Pesch, and S. Shebs, Debugging with GDB: The GNU source-level debugger (GNU manuals). GNU Press, 2011.

[178] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems,” Computing in Science & Engineering, vol. 12, no. 3, pp. 66–73, May 2010. [Online]. Available: http://dx.doi.org/10.1109/MCSE.2010.69

[179] B. Stroustrup, The C++ Programming Language, 4th ed. Addison-Wesley Professional, 2013.

[180] B. Stroustrup, A Tour of C++, ser. C++ in-depth series. Addison-Wesley, 2014.

[181] S. Sutherland and T. Fitzpatrick, “Keeping Up with Chip – the Proposed SystemVerilog 2012 Standard Makes Verifying Ever-increasing Design Complexity More Efficient,” in Design & Verification Conference and Exhibition (DVCON), 2012.

[182] Synopsys Inc., “Platform Architect MCO,” 2016. [Online]. Available: http://www.synopsys.com/Prototyping/ArchitectureDesign/pages/platform-architect.aspx

[183] Synopsys Inc., “Synopsys Virtualizer,” 2016. [Online]. Available: http://www.synopsys.com/Prototyping/VirtualPrototyping/Pages/virtualizer.aspx

[184] J. Tandon, “The OpenRISC Processor: Open Hardware and Linux,” Linux Journal, vol. 2011, no. 212, Dec. 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=2123870.2123876

[185] The Khronos OpenCL Working Group, “The open standard for parallel programming of heterogeneous systems,” 2017. [Online]. Available: https://www.khronos.org/opencl/

[186] The Linux Kernel Organization, Inc., “The linux kernel archives,” 2016. [Online]. Available: https://kernel.org

[187] L. Thiele, I. Bacivarov, W. Haid, and K. Huang, “Mapping applications to tiled multiprocessor embedded systems,” in Seventh International Conference on Application of Concurrency to System Design (ACSD 2007), 2007.

[188] P. Thoman, H. Jordan, and T. Fahringer, “Adaptive granularity control in task parallel programs using multiversioning,” in Euro-Par 2013 Parallel Processing - 19th International Conference, Proceedings, ser. Lecture Notes in Computer Science, F. Wolf, B. Mohr, and D. an Mey, Eds., vol. 8097. Springer, Aug. 2013, pp. 164–177. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-40047-6_19

[189] D. Thomas and P. Moorby, The Verilog Hardware Description Language. Springer US, 2002.

[190] N. Topham and D. Jones, “High speed CPU simulation using JIT binary trans- lation,” in Third Annual Workshop on Modeling, Benchmarking and Simulation, ser. MoBS’07, Jun. 2007.

[191] Toshiba Corporation, “SpecC Consortium Announced at Embedded Technology Conference and Exhibition MST ’99,” 1999. [Online]. Available: https://www.toshiba.co.jp/about/press/1999_11/pr1001.htm

[192] P. University, “UNISIM: UNIted SIMulation environment,” 2015. [Online]. Available: http://unisim.org/site/

[193] N. Ventroux, J. Peeters, T. Sassolas, and J. C. Hoe, “Highly-parallel special-purpose multicore architecture for SystemC/TLM simulations,” in XIVth International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, ser. SAMOS ’14. IEEE, Jul. 2014, pp. 250–257. [Online]. Available: http://dx.doi.org/10.1109/SAMOS.2014.6893218

[194] N. Ventroux and T. Sassolas, “A new parallel SystemC kernel leveraging manycore architectures,” in 2016 Design, Automation & Test in Europe Conference & Exhibition, ser. DATE ’16, L. Fanucci and J. Teich, Eds. IEEE, Mar. 2016, pp. 487–492. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=7459359

[195] E. Viaud, F. Pêcheux, and A. Greiner, “An efficient TLM/T modeling and simulation environment based on conservative parallel discrete event principles,” in Proceedings of the Conference on Design, Automation and Test in Europe, Mar. 2006, pp. 94–99. [Online]. Available: http://dx.doi.org/10.1109/DATE.2006.244003

[196] S. Vinco, D. Chatterjee, V. Bertacco, and F. Fummi, “SAGA: SystemC Acceleration on GPU Architectures,” in Proceedings of the 49th Annual Design Automation Conference, ser. DAC ’12. New York, NY, USA: ACM, 2012, pp. 115–120. [Online]. Available: http://doi.acm.org/10.1145/2228360.2228382

[197] C. Vinschen, J. Johnston et al., “The newlib homepage – sourceware.org,” 2017. [Online]. Available: https://sourceware.org/newlib/

[198] D. Vlasenko, “Busybox: The swiss army knife of embedded linux,” 2016. [Online]. Available: https://www.busybox.net

[199] Z. Wang, L. Wang, H. Xie, and A. Chattopadhyay, “Power modeling and estimation during ADL-driven embedded processor design,” in 2013 4th Annual International Conference on Energy Aware Computing Systems and Applications, ser. ICEAC ’13, Dec. 2013, pp. 97–102. [Online]. Available: https://doi.org/10.1109/ICEAC.2013.6737645

[200] R. Weicker, “DHRYSTONE: A synthetic systems programming benchmark,” Communications of the ACM, vol. 27, no. 10, pp. 1013–1030, 1984. [Online]. Available: http://doi.acm.org/10.1145/358274.358283

[201] J. H. Weinstock, R. Leupers, and G. Ascheid, “Modeling exclusive memory access for a time-decoupled parallel SystemC simulator,” in Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, ser. SCOPES ’15. New York, NY, USA: ACM, 2015, pp. 129–132. [Online]. Available: http://doi.acm.org/10.1145/2764967.2771929

[202] J. H. Weinstock, R. Leupers, and G. Ascheid, “Parallel SystemC simulation for ESL design using flexible time decoupling,” in 2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, ser. SAMOS ’15, D. Soudris and L. Carro, Eds. IEEE, Jul. 2015, pp. 378–383. [Online]. Available: http://dx.doi.org/10.1109/SAMOS.2015.7363702

[203] J. H. Weinstock, R. Leupers, and G. Ascheid, “Accelerating MPSoC Simulation Using Parallel SystemC and Processor Sleep Models,” in Proceedings of the 9th Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, ser. RAPIDO’17. New York, NY, USA: ACM, Jan. 2017, pp. 2:1–2:6. [Online]. Available: http://dl.acm.org/citation.cfm?id=3023975

[204] J. H. Weinstock, R. Leupers, G. Ascheid, D. Petras, and A. Hoffmann, “SystemC-Link: Parallel SystemC simulation using time-decoupled segments,” in 2016 Design, Automation & Test in Europe Conference & Exhibition, ser. DATE ’16, L. Fanucci and J. Teich, Eds. IEEE, Mar. 2016, pp. 493–498. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=7459360

[205] J. H. Weinstock, L. G. Murillo, R. Leupers, and G. Ascheid, “Parallel SystemC simulation for ESL design,” ACM Transactions on Embedded Computing Systems (TECS), vol. 16, no. 1, pp. 27:1–27:25, Oct. 2016. [Online]. Available: http://doi.acm.org/10.1145/2987374

[206] J. H. Weinstock, C. Schumacher, R. Leupers, and G. Ascheid, SCandal: SystemC Analysis for Nondeterminism Anomalies. Cham: Springer International Publishing, 2014, pp. 69–88. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-01418-0_5

[207] J. H. Weinstock, C. Schumacher, R. Leupers, G. Ascheid, and L. Tosoratto, “Time-decoupled parallel SystemC simulation,” in Proceedings of the Conference on Design, Automation & Test in Europe, ser. DATE ’14. 3001 Leuven, Belgium, Belgium: European Design and Automation Association, Mar. 2014, pp. 191:1–191:4. [Online]. Available: http://dl.acm.org/citation.cfm?id=2616606.2616840

[208] Wikipedia, “Apple A9X,” 2016. [Online]. Available: https://en.wikipedia.org/wiki/Apple_A9X

[209] Wikipedia, “ARM architecture,” 2016. [Online]. Available: https://en.wikipedia.org/wiki/ARM_architecture#AArch64

[210] Wikipedia, “ARM big.LITTLE,” 2016. [Online]. Available: https://en.wikipedia.org/wiki/ARM_big.LITTLE

[211] Wikipedia, “Embedded system,” 2016. [Online]. Available: https://en.wikipedia.org/wiki/Embedded_system

[212] Wikipedia, “Microsoft Visual Studio Debugger,” 2016. [Online]. Available: https://en.wikipedia.org/wiki/Microsoft_Visual_Studio_Debugger

[213] Wikipedia, “Moore’s law,” 2016. [Online]. Available: https://en.wikipedia.org/wiki/Moore’s_law

[214] Wikipedia, “Qualcomm Snapdragon SoC,” 2016. [Online]. Available: https://en.wikipedia.org/wiki/Qualcomm_Snapdragon

[215] Wikipedia, “Samsung Exynos SoC,” 2016. [Online]. Available: https://en.wikipedia.org/wiki/Exynos

[216] E. Witchel and M. Rosenblum, “Embra: Fast and flexible machine simulation,” in Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS ’96. New York, NY, USA: ACM, 1996, pp. 68–79. [Online]. Available: http://doi.acm.org/10.1145/233013.233025

[217] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-2 programs: Characterization and methodological considerations,” in Proceedings of the 22nd Annual International Symposium on Computer Architecture, ser. ISCA ’95, D. A. Patterson, Ed. ACM, Jun. 1995, pp. 24–36. [Online]. Available: http://doi.acm.org/10.1145/223982.223990

[218] G. Yang, “Parallel Simulation in Metropolis.” [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.6973

[219] V. Zivojnovic and H. Meyr, “Compiled HW/SW co-simulation,” in 33rd Design Automation Conference Proceedings, 1996, Jun. 1996, pp. 690–695. [Online]. Available: https://doi.org/10.1109/DAC.1996.545662

[220] V. Zivojnovic, S. Tjiang, and H. Meyr, “Compiled simulation of programmable DSP architectures,” in VLSI Signal Processing, VIII, Oct. 1995, pp. 187–196. [Online]. Available: https://doi.org/10.1109/VLSISP.1995.527490