Hardware Accelerated Design Space Exploration for Application Specific MPSoCs

Isuru Nawinne

A thesis in fulfillment of the requirements for the degree of

Doctor of Philosophy

School of Computer Science and Engineering

Faculty of Engineering

The University of New South Wales

June 2016

THE UNIVERSITY OF NEW SOUTH WALES
Thesis/Dissertation Sheet

Surname or Family name: Nawinne

First name: Isuru    Other name/s: Bandara

Abbreviation for degree as given in the University calendar: PhD

School: School of Computer Science and Engineering
Faculty: Faculty of Engineering
Title: Hardware Accelerated Cache Design Space Exploration for Application Specific MPSoCs

Abstract (350 words maximum)

The performance of a computing system heavily depends on the memory hierarchy. Fast but expensive cache memories are commonly employed to bridge the increasing performance gap between processors and memory devices. Benefits drawn from a cache vary significantly with the diverse memory access patterns of software application programs, especially in the domain of embedded systems. Modern embedded processors acknowledge this relation between applications and caches, by incorporating cache memories which are configurable at design-time.

Design space exploration of caches in an application specific system is a difficult problem, which typically takes days to solve, if not weeks, using software-based techniques. The problem becomes more complex for multiprocessor systems with hierarchical caches, executing many application programs. A typical such design space can be of vast proportions containing up to several trillions of unique design points, which is infeasible to be accurately explored using existing techniques. This dissertation presents a design space exploration framework which uses hardware accelerated simulation to quickly determine the best set of cache configurations for a multiprocessor cache hierarchy. The proposed framework was able to achieve up to 456 times faster simulation times compared to the fastest known software-based simulator, with similar accuracy in cache access time. Further, a novel exploration algorithm is presented, which was able to improve the cache access times by up to 18.9%, while reducing total cache size by up to 74.15% at the same time. A new run-time concept is introduced, called switchable cache, where a processor can select from multiple pre-defined cache configurations, leveraging the abundant transistors available due to what is known as the dark silicon phenomenon. An architecture to enable seamless integration of multiple cache configurations is described. A novel design space exploration algorithm is presented to rapidly pre-determine the optimal set of configurations at design-time, for a given group of applications. The use of Answer Set Programming, which guarantees optimal solutions for NP-Hard problems, is explored to reliably solve the switchable cache tuning problem.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed ……………………………………………......

Date ……………………………………………......

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.'

Signed ……………………………………………......

Date ……………………………………………......

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………………………………......

Date ……………………………………………......

Abstract

The performance of a computing system heavily depends on the memory hierarchy. Fast but expensive cache memories are commonly employed to bridge the increasing performance gap between processors and memory devices. Benefits drawn from a cache vary significantly with the diverse memory access patterns of software application programs, especially in the domain of embedded systems. Modern embedded processors acknowledge this relation between applications and caches, by incorporating cache memories which are configurable at design-time.

Design space exploration of caches in an application specific system is a difficult problem, which typically takes days to solve, if not weeks, using software-based techniques. The problem becomes more complex for multiprocessor systems with hierarchical caches, executing many application programs. A typical such design space can be of vast proportions containing up to several trillions of unique design points, which is infeasible to be accurately explored using existing techniques.

This dissertation presents a design space exploration framework which uses hardware accelerated simulation to quickly determine the best set of cache configurations for a multiprocessor cache hierarchy. The proposed framework was able to achieve up to 456 times faster simulation times compared to the fastest known software-based simulator. Further, a novel exploration algorithm is presented, which was able to improve the cache access times by up to 18.9%, while reducing total cache size by up to 74.15% at the same time.

A new run-time concept is introduced, called switchable cache, where a processor can select from multiple pre-defined cache configurations, leveraging the abundant transistors available due to what is known as the dark silicon phenomenon. An architecture to enable seamless integration of multiple cache configurations is described. A novel design space exploration algorithm is presented to rapidly pre-determine the optimal set of configurations at design-time, for a given group of applications. The use of Answer Set Programming, which guarantees optimal solutions for NP-Hard problems, is explored to reliably solve the switchable cache tuning problem.

Publications

Isuru Nawinne, Haris Javaid, Roshan Ragel and Sri Parameswaran. Switchable Cache: Utilizing Dark Silicon for Application Specific Cache Optimizations. IET Computers & Digital Techniques, IET, 2016

Isuru Nawinne, Haris Javaid, Roshan Ragel, Swarnalatha Radhakrishnan and Sri Parameswaran. Exploring Multi-Level Cache Hierarchies in Application Specific MPSOCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 34(12), Pages 1991-2003, IEEE, 2015.

Isuru Nawinne, Josef Schneider, Haris Javaid and Sri Parameswaran. Hardware-Based Fast Exploration of Cache Hierarchies in Application Specific MPSoCs. In Proceedings of the International Conference on Design, Automation & Test in Europe (DATE‘14), Article No. 283, European Design and Automation Association (EDAA), 2014.

Isuru Nawinne and Sri Parameswaran. A Survey on Exact Cache Design Space Exploration Methodologies for Application Specific SoC Memory Hierarchies. In Proceedings of the 8th IEEE International Conference on Industrial and Information Systems (ICIIS‘13), Pages 332-337, IEEE, 2013.

Acknowledgements

My journey as a graduate student started nearly four years ago. When I took my first steps the end seemed so far away, illusive and an intimidating prospect. The path I had to tread was treacherous and full of obstacles. Time and again I saw golden glimpses of my destination, much like Bilbo Baggins saw the Lonely Mountain over the tops of Mirkwood trees. Now that the journey has reached its conclusion with the writing of this thesis dissertation, I have many a person to be thankful to who aided me along the way.

First and foremost, my most sincere gratitude goes to my supervisor Prof. Sri Parameswaran for his continuous support throughout my term as a PhD student under him, for his patience, motivation, enthusiasm, and profound knowledge. He guided me to be a researcher with integrity and strengthened my belief in not giving up. His enthusiasm in effectively communicating ideas taught me to be a good presenter. I watched him carefully nurture and inspire the extraordinary in every student, just like a good gardener patiently cares for his plants, enjoying watching them grow and blossom. Sri gave me the opportunity and guidance to become a good teacher, which, considering it is my career of choice, I'm confident will be immensely helpful to me in the future. I appreciate the support he gave every time I needed it. He was a great mentor to me in my studies as well as in life in general. I enjoyed every little chat we had sharing his immense experience, which opened my eyes to a great many things. I could not have asked for a better supervisor, and I hope I would be just as good to my students in the days to come.

I shall convey my heartfelt gratitude to Dr. Roshan Ragel, who has been a caring mentor and an inspiration to me in the past and in the present. He gave direction to the young researcher in me, and has been there every time I needed guidance. It is from him that I learnt that there is always another answer than the ones given to you, which is such a simple yet liberating perception of reality. And the numerous hours spent discussing university education with him helped me become a more insightful academic. I will always be thankful to him and his wife Dr. Swarnalatha Radhakrishnan for all the care bestowed on me.

One of the compelling reasons I decided to pursue my graduate studies at UNSW was the vibrant constitution of personalities in the embedded systems research group. I first joined the group as a research associate soon after graduating with my Bachelor’s degree, thanks to the incredible opportunity given by Dr. Jude Angleo Ambrose. Since then I have had the privilege of working alongside several brilliant young engineers and scientists of the finest calibre. To name a few who positively affected my time in the group: Haris Javaid who helped me complement my own work from different perspectives and guided me to hone my skills in scientific writing; Jorgen Peddersen who inspired me to think critically and whom I shared many enjoyable board game sessions with; Josef Schneider whom I depended on in my foray into FPGA design and whose zestfulness and good humour kept the group uplifted; and Mahanama and Pasindu who helped me in so many ways as great friends, and shared all those lunch time chats where we envisioned great plans to make the world a better place, which I earnestly hope will bear fruit as time unfolds. My sincere thanks go to all members of the research group, the time spent with whom will be cherished.

I would like to express my gratitude to Terasic Technologies who provided the FPGA equipment, and to Sam Swiss for helping me with the development of the parameterized cache component, both of which were instrumental in performing the experiments presented in this thesis.

Most importantly, I dedicate this thesis to my family who were there for me throughout my time as a graduate student and in life in general, especially the moments when I needed them the most. I’m grateful to my parents and my sister who always made me believe in myself. I lovingly thank my mother for being the greatest source of courage and inspiration to me, and my father for teaching me to face life with a smile. My partner, Sumudu, deserves special gratitude from me for being a colossal strength to me through these four years which we had to spend apart, without whom my journey would not have been a reality, let alone a success. I’m truly grateful to her for bearing with me when I was stressed, and for her loving care and understanding, which helped me persevere with my work. I dearly hope I have made them all proud.

I thank my aunt Freda and uncle Stanley for everything they did for me when I first moved to Sydney for my studies. Thanks go to all my good friends, especially Hiranya, Gihan and Dhanushka for taking care of me when I came down with sickness, and for all the wonderful times we shared at No 4 Botany street, the memories of which will always be cherished.

Lastly, I shall end the acknowledgements by quoting a set of verses by William Ernest Henley titled Invictus, which I turn to for inspiration.

Out of the night that covers me,
Black as the pit from pole to pole,
I thank whatever gods may be
For my unconquerable soul.

In the fell clutch of circumstance
I have not winced nor cried aloud.
Under the bludgeonings of chance
My head is bloody, but unbowed.

Beyond this place of wrath and tears
Looms but the Horror of the shade,
And yet the menace of the years
Finds, and shall find me, unafraid.

It matters not how strait the gate,
How charged with punishments the scroll,
I am the master of my fate:
I am the captain of my soul.

Abbreviations

SoC System on Chip

MPSoC Multiprocessor System on Chip

FPGA Field Programmable Gate Array

NP-Hard Non-deterministic Polynomial Time - Hard

DRAM Dynamic Random Access Memory

WCET Worst Case Execution Time

ILP Integer Linear Programming

RTL Register Transfer Level

HDL Hardware Description Language

MRU Most Recently Used

LRU Least Recently Used

FIFO First In First Out

MRA Most Recently Accessed

MRE Most Recently Evicted

CLT Central Lookup Table

GPU Graphics Processor Unit

CUDA Compute Unified Device Architecture

VHDL Very-High-Speed-Integrated-Circuit Hardware Description Language

PCIe Peripheral Component Interconnect Express

SMP Symmetric Multiprocessor

USB Universal Serial Bus

CPU Central Processing Unit

DSE Design Space Exploration

FP Forward Pass

BP Backward Pass

ALM Adaptive Logic Module

ASP Answer Set Programming

CP Constraint Programming

Nomenclature

B - Block Size

S - Set Size

A - Associativity

Pe - e-th Processor Core

P - Number of Processor Cores

Li - i-th Cache Level

N - Number of Cache Levels

Cij - j-th Cache on Level Li

Mi - Number of Caches on Level Li

Kijk - k-th Configuration for Cache Cij

Dij - Number of Configurations in the Sub-Design-Space of Cache Cij

HRijk - Hit Rate (Application Dependent) of Cache Configuration Kijk

HLijk - Hit Latency of Cache Configuration Kijk

MLijk - Miss Latency of Cache Configuration Kijk

ULijk - Update Latency of Cache Configuration Kijk

HEijk - Hit Energy of Cache Configuration Kijk

UEijk - Update Energy (after a miss) of Cache Configuration Kijk

CTOT - Total Number of Caches in the System

DTOT - Total Number of Cache Hierarchy Design Points

Tijk - Average Cache Access Time (Per Access and Application Dependent) of Cache Configuration Kijk

TTOT - Total of Average Cache Access Times over all L1 Caches

Kijkmin - Configuration with minimum Tijk in the Sub-Design-Space of Cache Cij

NS - Number of Allowed Switchable Cache Configurations

Ai - i-th Application Program in the Pool of Applications

NA - Number of Application Programs in the Pool

Kj - j-th Candidate Cache Configuration

NCC - Number of Candidate Cache Configurations

HRij - Hit Rate for Application Ai on Cache Configuration Kj

HLj - Hit Latency of Cache Configuration Kj

ULj - Update Latency of Cache Configuration Kj

Tij - Average Cache Access Time (Per Access) for Application Ai on Cache Configuration Kj

Tm - Access Time for Main Memory

fi - Normalized Frequency of Occurrence for Application Ai

Tavg - Average Tij over all Applications

NU - Number of Unique Cache Configurations in a Selection

Tclock - Clock Cycle Time

HEj - Hit Energy of Cache Configuration Kj

UEj - Update Energy of Cache Configuration Kj

Eij - Average Cache Access Energy (Per Access) for Application Ai on Cache Configuration Kj

Em - Access Energy for Main Memory

Eavg - Average Eij over all Applications

Contents

1 Introduction 1

1.1 Cache Design Space Exploration ...... 3

1.2 Design Space Exploration of Multiprocessor Multi-Level Cache Hierarchies ...... 6

1.3 Application Specific Cache Optimizations by Exploiting Dark Silicon ...... 9

2 Cache Basics 13

2.1 Structure of a Cache ...... 13

2.2 Hierarchical and Shared Cache Organizations ...... 16

2.3 Cache Design Spaces ...... 17

3 Literature Review 19

3.1 Types of Cache Simulation Methods ...... 22

3.2 Trace-Driven Cache Simulation Techniques ...... 25

3.3 Hardware Assistance in Cache Simulation ...... 39

3.4 Exploring Multiprocessor Cache Hierarchies ...... 43

3.5 Cache Optimizations in Multi-Programmed Environments ...... 50

4 Hardware Acceleration for Multiprocessor Cache Simulation 53

4.1 Target Multiprocessor System Architecture ...... 55

4.2 Design Space Exploration Methodology ...... 56

4.2.1 Hybrid Simulation Framework ...... 57

4.2.2 Selection of Cache Configurations ...... 58

4.3 Implementation of hSim ...... 63

4.4 Experimental Setup ...... 69

4.5 Test Results ...... 71

4.6 Summary ...... 79

5 Iterative Design Space Exploration of Multi-Level Caches 80

5.1 Target Multiprocessor Multi-Level Cache Hierarchy ...... 83

5.2 Problem Formulation ...... 84

5.3 Iterative DSE Methodology ...... 85

5.3.1 Cache Analysis ...... 86

5.3.2 Algorithm ...... 87

5.3.3 Convergence Criteria ...... 92

5.3.4 Case for Hardware-Accelerated Simulation ...... 93

5.3.5 Hardware-Accelerated Simulation Process ...... 94

5.4 Experiments ...... 96

5.4.1 Fairness of Comparison ...... 100

5.5 Test Results ...... 100

5.5.1 Convergence ...... 101

5.5.2 Simulation Times ...... 112

5.5.3 Stability and Empirical Optimality ...... 113

5.5.4 Alternative Iteration Policies ...... 118

5.6 Summary ...... 120

6 Dark Silicon and Application Specific Cache Optimizations 121

6.1 Introduction ...... 121

6.2 Switchable Cache Architecture ...... 125

6.3 Switchable Cache Tuning ...... 132

6.3.1 Problem Formulation ...... 133

6.3.2 Analysis ...... 134

6.3.3 Exploration Algorithm ...... 136

6.4 Experiments & Results ...... 141

6.5 Discussion ...... 148

6.5.1 Optimizing for Energy ...... 148

6.5.2 Extended Usage Scenarios for Switchable Cache ...... 149

6.6 Summary ...... 154

7 Answer Set Programming in Cache Design Space Exploration 155

7.1 Introduction ...... 155

7.2 Related Applications of ASP ...... 157

7.3 Problem Formulation ...... 159

7.4 Answer Set Programming (ASP) ...... 160

7.4.1 Overview ...... 160

7.4.2 Problem Encoding in ASP ...... 162

7.5 Experiments & Results ...... 166

7.5.1 Comparison of ASP & Heuristic Searches ...... 167

7.5.2 ASP Search Strategies & Parallelism ...... 171

7.6 Summary ...... 176

8 Conclusion 177

8.1 Future Work ...... 181

8.1.1 Simulating Cache Coherency ...... 181

8.1.2 Future of Run-Time Cache Switching ...... 183

Bibliography 185

List of Figures

1.1 Profiles of: (a) Execution time; and (b) Energy consumption for G721 encoder on a Tensilica Xtensa processor using different cache configurations, by Shwe et al. [SJP13]. Maximum hits correspond to the largest configuration...... 5

1.2 An example MPSoC with seven caches arranged in three levels, forming a vast design space...... 6

2.1 Structure and organization within a cache memory...... 14

2.2 Hierarchical cache organizations in uniprocessor and multiprocessor systems...... 16

3.1 Overview and flow of memory access trace driven simulation methods. 25

3.2 Simulation data structures used by Janapsatya et al. in [JIS06]. . . . 27

3.3 Example CLT data structure used by Haque et al. in SCUD algorithm [HPJP12]: A CLT contains an entry for each memory block in the system. Every CLT entry is associated with records which represent the different cache set sizes in the simulation. Records indicate the availability of the memory block in a group of configurations with the same set size and varying associativities...... 33

3.4 Overview of SPCE algorithm by Viana et al. in [VGRBV08]...... 36

3.5 T-SPaCS algorithm by Zang et al. in [ZGR11] (Si - set size for level i, B - block size, Ai - associativity for level i, C - conflict tables, b - number of simulated block sizes)...... 38

3.6 Operation of the FPGA cache simulator in [SPP14a]...... 41

3.7 DIMSim algorithm by Haque et al. in [HRA+12] ...... 46

4.1 Shared memory multiprocessor architecture with private L1 caches and a shared L2 cache ...... 55

4.2 Multiprocessor memory hierarchy with four processors (P1 to P4), four L1 caches (C1,1 to C1,4) and one L2 cache (C2,1) ...... 56

4.3 Hybrid simulation platform where cache hit rates are calculated on FPGA...... 57

4.4 Graphical overview of the simulation methodology flow, described in Algorithm 1, used to explore the design space of a two-level multi- processor cache hierarchy and determine suitable configurations. . . . 62

4.5 Connection interfaces and operation overview of the hardware simulator (hSim) module...... 63

4.6 Internal implementation of the cache simulator core: (a) example top level of the simulator core, with a maximum set size of eight and maximum associativity of four; (b) complete pipeline inside a simulator core for set sizes 8-to-2 and associativities 4-to-1...... 65

4.7 Detailed schematic symbol showing all signals for the hSim module as implemented in Altera Qsys system integration tool [Altb]. Widths of the address and data signals are configurable...... 67

4.8 Connection and usage of hSim in an MPSoC on FPGA: (a) multiple hSim modules connected in the positions of private L1 caches; (b) a single hSim module connected in place of a shared L2 cache, to simulate the corresponding sub-design-spaces ...... 68

4.9 Energy Consumption against Access Time for private L1 cache configurations, in Experiment 1...... 72

4.10 Energy Consumption against Access Time for private L1 cache configurations, in Experiment 2...... 73

4.11 Energy Consumption against Access Time for shared L2 cache configurations, in Experiments 1 and 2, based on the selected L1 configurations ...... 75

5.1 Effects of changing a cache’s configuration on the explorations in adjacent cache levels...... 81

5.2 An example architecture for the target MPSoC memory hierarchy. P1 to P4 represent processors and C1,1 to C3,1 represent caches organized in three levels ...... 83

5.3 Overview of the forward pass (FP), where assistance of FPGA hardware is used for parallel design space explorations on each cache level ...... 94

5.4 Example use of hardware simulators (hSim) in level L2 of a cache hierarchy. Components hSim2,1 and hSim2,2 work in parallel to simulate sub-design-spaces of the two shared L2 caches ...... 95

5.5 Interface and structure of the hardware simulator (hSim) module. . . 96

5.6 Altera DE5-NET FPGA board used in the experimental setup. . . . . 98

5.7 Results from Test A1. Changes in selected configuration sizes for the caches Ci,j at the design point reached in each iteration step ...... 101

5.8 Results from Test A1. Changes in resulting Ti,j,kmin for the caches Ci,j as seen by the algorithm, at the design point reached in each iteration step...... 103

5.9 Results from Test A1. Changes in TTOT at the design point reached in each iteration step...... 103

5.10 Results from Test B1. Changes in selected cache configuration sizes for the caches Ci,j at the design point reached in each iteration step ...... 105

5.11 Results from Test B1. Changes in resulting Ti,j,kmin for the caches Ci,j as seen by the algorithm, at the design point reached in each iteration step...... 106

5.12 Results from Test B1. Changes in TTOT at the design point reached in each iteration step...... 106

5.13 Changes in selected cache configuration sizes for the caches Ci,j at each iteration step, in (a) Test A2 and (b) Test B2 where exploration starts at level LN. Final design points reached are the same as those of Tests A1 and B1 respectively ...... 110

5.14 Number of iterations taken to re-stabilize when an offset is manually introduced to the originally selected design point for System A..... 114

5.15 Results from Test C1. Changes in: (a) selected configuration sizes for the caches Ci,j; (b) resulting Ti,j,kmin for the caches Ci,j as seen by the algorithm, at the design point reached in each iteration step...... 116

5.16 Design space in System C, showing the optimal design point. Vertical axis denotes TTOT. Horizontal axis denotes the total cache size of the hierarchy...... 117

5.17 Changes in selected cache configuration sizes for the caches Ci,j at each iteration step, in Test A10 where Round Robin traversal of cache levels is used. Final design point reached is the same as that of Test A1...... 119

6.1 Average cache access time for a group of four applications (adpcm, bzip2, fft, fdct) when using variable and fixed cache configurations. . 123

6.2 Implementation of the switchable cache...... 126

6.3 Example switchable cache use cases. Each application uses its optimal cache configuration...... 127

6.4 Detailed schematic symbol showing all signals for the switchable cache as implemented in Altera Qsys...... 129

6.5 Example scenario with eight application programs and four switchable cache configurations. More than one application sharing the same cache configuration (Applications B and E share cache configuration 2 to achieve best performance)...... 132

6.6 (a) Search tree node structure. (b) Example of tree level expansion. . 138

6.7 Average cache access time against chip area for the switchable cache in Group A. Each design point represents a set of selected cache configurations. (a) Complete design space. (b) Optimal and Pareto-optimal points. (c) Speed-up for a given area budget, over using largest fixed cache out of all applications’ individual optimal configurations. . . . . 144

6.8 Average cache access time against average cache access energy for the switchable cache in Group A. Each design point represents a set of selected cache configurations. (a) Complete design space. (b) Optimal and Pareto-optimal points. (c) Speed-up for a given energy budget, over using largest fixed cache out of all applications’ individual optimal configurations...... 146

6.9 Energy-Delay-Product per cache access against chip area for the switchable cache in Group A. Each design point represents a set of selected cache configurations. (a) Complete design space. (b) Optimal and Pareto-optimal points...... 147

6.10 Potential usage of a multi-port switchable cache in multiprocessor system. Application B migrates from CPU 2 to CPU 4, while still using the same cache configuration 3...... 150

6.11 Overview of a switchable cache with multiple data/address ports. . . 151

6.12 Example of switching caches between different phases in an applica- tion’s execution...... 151

6.13 Example of using cache switching in a pipelined multiprocessor system ...... 152

7.1 Comparison of search times for application groups 8, 9 and 10. . . . . 170

7.2 Comparison of search times for application group 8, using different ASP search strategies and multiple threads...... 173

7.3 Search times spent to: (a) find the optimal solution; (b) verify the optimality of the solution...... 174

List of Tables

4.1 Applications used in the Experiments ...... 70

4.2 Simulated Configurations for Private L1 Caches and Shared L2 Cache 70

4.3 L1 Cache Configurations with Minimum E and T from Experiment 1 74

4.4 L1 Cache Configurations with Minimum E and T from Experiment 2 74

4.5 L2 Cache Configurations with Minimum T and E from Experiments 1 and 2 ...... 75

4.6 Total Estimated Memory Access Energy and Time for Applications . 76

4.7 Simulation Times to Calculate Hit Rates in Hardware ...... 77

5.1 Applications Executed in System A ...... 97

5.2 Applications Executed in System B ...... 97

5.3 Design Space Parameters for Systems A and B ...... 99

5.4 Selected Design Point in Test A1 ...... 104

5.5 Selected Design Point in Test B1 ...... 107

5.6 Results in Comparison ...... 108

5.7 Explored Portion of Design Space ...... 109

5.8 Simulation Times when using Hardware Assistance ...... 112

5.9 Offset Design Points from Latin Hypercube Sampling ...... 114

5.10 Design Space Parameters for System C ...... 115

5.11 Optimal Design Point for System C ...... 118

6.1 Overheads of Switchable Cache ...... 128

6.2 Candidate Cache Configurations ...... 128

6.3 Average Cache Access Times ...... 130

6.4 Application Groups in Experiments ...... 141

6.5 Design Space Exploration Results - Solutions ...... 141

6.6 Design Space Exploration Results - Statistics ...... 142

7.1 Candidate Cache Configurations ...... 166

7.2 Application Groups ...... 167

7.3 Search Times and Optimality ...... 168

7.4 ASP - Search Times (in Minutes) for Multiple Threads & Search Strategies ...... 172

Chapter 1

Introduction

Recent advancements in digital electronics technology have benefitted computer processors in many ways. Modern day processors are able to operate at extreme clock frequencies and low voltage levels, and draw minute amounts of energy. They are able to incorporate increasing numbers of transistors on board, which enables various optimizations that enhance performance. However, the performance of memory systems associated with processors has not quite scaled in a similar manner. Hence, the memory is often identified as a performance bottleneck in most systems. Over the years, caching of data and instructions has emerged as a popular and reliable solution to alleviate the impact of slow memory devices, and bridge the performance gap between processor and memory.

Caches are expensive but fast memory devices that can hold a subset of data close to the processor for efficient access. Conceptually, caches provide enhanced memory access performance through principles of temporal and spatial locality, which derive from the notions that recently used data are likely to be reused, and that adjacent data blocks are likely to be used in sequence. The organization of data within a cache is determined by a set of parameters: block size, set size and associativity. Different values of these parameters create unique structures which are known as cache configurations. How fast a cache can be accessed or updated is directly related to the cache’s configuration. Moreover, whether an access to a cache memory is a hit (data being available in the cache) or a miss (data not being available in the cache) depends on the cache’s configuration.

The sequence of memory accesses generated by a processor depends not only on the architecture of the processor itself, but also on the software application program that is being executed. Numerous research works [GRZVD04, JIS06, SJP13] have shown that the performance benefits drawn from the same cache configuration differ significantly between different application programs, especially in the domain of embedded systems. Conversely, the cache hit rate sustained by a given program varies between different cache configurations. The application dependent behaviour of cache memories has given rise to various design optimizations. Complementing many microarchitectural enhancements on caches, finding the cache configuration that caters best for a given application program stands as a dominant design decision for application specific embedded computing systems.

Modern day off-the-shelf Systems-on-Chip (SoCs) recognize the need to customize various on-chip components to suit the embedded application program under design. Most SoC and processor manufacturers, such as ARM [ARM] and Cadence Tensilica Xtensa [XTE], provide designers with the facility to select the values for the cache parameters block size, set size and associativity. Hence the designer is tasked with analysing the application program to determine which cache configuration would allow the optimal performance.


1.1 Cache Design Space Exploration

The design space of a cache consists of hundreds of different configurations, each providing a unique hit latency (time to determine whether an access is a hit). Furthermore, each cache configuration sustains a unique hit rate for a given application program. Cache design space exploration involves assessing the suitability of a collection of cache configurations for a given application, with the aim of determining the best configuration. Finding out the hit rates for all candidate cache configurations is an essential and integral part of the cache design space exploration process.
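As an illustration, the following is a minimal Python sketch of how such a sub-design-space can be enumerated before simulation (the parameter ranges below are hypothetical, not the exact candidate sets used in the experiments of later chapters):

```python
from itertools import product

# Hypothetical parameter ranges for one cache's sub-design-space.
block_sizes = [4, 8, 16, 32, 64]                # bytes per block
set_sizes = [2 ** n for n in range(1, 12)]      # 2 .. 2048 sets
associativities = [1, 2, 4, 8]                  # direct mapped .. 8-way

# Every (B, S, A) triple is one candidate configuration; each candidate
# must be simulated against the application's memory accesses to obtain
# its hit rate.
sub_design_space = list(product(block_sizes, set_sizes, associativities))
print(len(sub_design_space))  # 5 * 11 * 4 = 220 candidate configurations
```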

Testing an application program with all the cache configurations in a design space, in actual hardware, is an inefficient exercise with respect to time and resources. Simulating the behaviour of individual cache configurations to calculate the hit rates is indeed possible, and is a common approach in practice as well as in research, albeit a tedious process. While software based simulation techniques [Hil, JIS06, HPJP10, VGRBV08] tend to be heavily time consuming, state-of-the-art hardware accelerated simulation techniques [SPP14a, SPP14b] provide a much faster alternative.

The major obstacle for accurately performing such simulations is the extraction of memory access traces of an application. A significant length of time is required to generate the sequence of memory accesses through an instruction set simulation, and traces generated as such require massive storage space. For example, 72 hours were spent on extracting the memory access trace [NSJP14] for encoding 24 low resolution images into one second long MPEG2 format video, using a software encoder running on a Tensilica Xtensa processor [XTE]. The generated trace contained over 12 billion memory accesses (each access associated with a memory address and access type - read or write), which took 129.4GB worth of storage space.
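A rough consistency check on those figures (assuming each trace record stores approximately a word-sized address plus an access-type flag) is:

$$\frac{129.4\ \text{GB}}{12 \times 10^{9}\ \text{accesses}} \approx 11\ \text{bytes per access}$$

which is why traces for even modest embedded workloads quickly reach hundreds of gigabytes.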


Specifically due to the issue of lengthy trace extraction, designers in practice often resort to using fractional samples of memory accesses for the simulations, in order to maintain a reasonable time-to-market. Memory access trace samples are obtained only from critical sections of application execution. However, using sampled memory access traces directly affects the accuracy of the design space exploration, as the behaviour of a cache over the entirety of application execution is not covered in the simulations; more often than not the exploration ends with sub-optimal results.

At this point, it is worthwhile noting a common misconception, which states that a larger cache will always provide better performance. This claim rarely holds true, as Shwe et al. show in [SJP13]. Figure 1.1 shows execution time and energy consumption profiles for the G721 encoder application running on a Tensilica Xtensa processor using different cache configurations. The configuration providing the fastest execution time is marked by a red triangle, and the configuration consuming the least energy is marked by an orange diamond in Figure 1.1. Neither of the above configurations coincides with the largest cache configuration in the design space, which sustains the most cache hits and is marked by a yellow square.

Larger cache configurations always sustain more cache hits than smaller configurations, as more data can be accommodated in the cache. However, more cache hits do not necessarily guarantee better performance. The latency taken to make a single cache access grows with increasing cache sizes, which ultimately leads to degraded overall performance as well as high energy and chip area costs. The experimental data presented in this dissertation show that relatively smaller cache configurations are more likely to provide optimal access times on average for application dependent embedded systems. Therefore, accurately exploring cache design spaces is crucial in achieving optimal memory access performance at low costs.
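The trade-off can be made concrete with a common average-access-time model (a hedged sketch in the notation of the Nomenclature; the exact cost model used in later chapters also accounts for update latencies):

$$T = HR \cdot HL + (1 - HR)\cdot ML$$

A larger configuration typically raises the hit rate $HR$, but it also raises the hit latency $HL$ (and the energy per access), so $T$ can grow even as the number of hits improves.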


[Figure 1.1 legend: unique cache configuration; configuration with minimum execution time; configuration with maximum cache hits; configuration with minimum energy consumption.]

Figure 1.1: Profiles of: (a) Execution time; and (b) Energy consumption for G721 encoder on a Tensilica Xtensa processor using different cache configurations, by Shwe et al. [SJP13]. Maximum hits correspond to the largest configuration.


1.2 Design Space Exploration of Multiprocessor Multi-Level Cache Hierarchies

Multiprocessor Systems on Chip (MPSoCs) are becoming increasingly common in embedded devices. MPSoCs typically feature multiple cache memories organized in a hierarchical manner. Commonly used hierarchical memory architectures involve private first level caches (one for each processor core) and shared caches in subsequent levels, as depicted in Figure 1.2.

[Figure 1.2 annotations: four CPUs with private L1 caches, two shared L2 caches, one shared L3 cache and main memory. No. of caches: 7; candidate configurations per cache: 10; design points for the hierarchy: 10^7; traces required for simulation per design point: 7; total no. of traces: 7 × 10^7.]

Figure 1.2: An example MPSoC with seven caches arranged in three levels, forming a vast design space.

The performance of any given cache in a hierarchy is affected by the configurations of other caches, due to the relationships between connected caches. Therefore, the complete design space of an MPSoC cache hierarchy is the cross product of the individual sub-design-spaces of all the caches in the hierarchy, thus containing billions or even trillions of unique design points. For example, for the cache hierarchy shown in Figure 1.2 with seven caches, if only 10 candidate configurations are considered per each cache's sub-design-space, the overall cache hierarchy design space contains a total of 10^7 unique design points to select from. The typical sized design spaces explored in this dissertation contain more than 10 trillion design points. As pointed out in Section 1.1, exploring the sub-design-space of an individual cache can be time consuming by itself; therefore exploring a cache hierarchy design space becomes a far more tedious exercise.
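In the notation of the Nomenclature, the cross product can be written as:

$$D_{TOT} = \prod_{i=1}^{N} \prod_{j=1}^{M_i} D_{ij}$$

so the seven-cache hierarchy above, with $D_{ij} = 10$ for every cache, yields $D_{TOT} = 10^{7}$ design points.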

When exploring a multiprocessor multi-level cache hierarchy, a more difficult challenge presents itself at the trace extraction process. For the cache hierarchy shown in Figure 1.2, seven different memory access traces (one for each cache) are required to simulate one design point. This means 7 × 10^7 different access traces (each of which can typically be several hundred gigabytes in size and take several hours to be generated) in total are needed for the complete design space. Thus, simulating each and every design point is impractical as it would take years. For the same reason, thorough explorations on such design spaces have seldom been attempted. Even state-of-the-art methods [HRA+12, HKH+13] resort to using a single memory access trace (combined from all processors) to evaluate all cache sub-design-spaces, which severely compromises accuracy for design time and inevitably leads to sub-optimal results.

Hardware accelerated simulation is a recent advent in research across many domains [CPN+09], where time intensive software components in a process can be substituted with specialized hardware. The most recent advancements in cache simulation research [SPP14a, SPP14b] present fast hardware accelerators to perform cache hit counting simulations on individual cache design spaces.


This dissertation presents the first hardware-based framework to rapidly perform cache hit counting simulations on multiprocessor multi-level cache hierarchies, with improved accuracy over state-of-the-art software-based methods. The proposed framework involves using an FPGA (Field Programmable Gate Array) device in the design process, which accommodates the specialized hardware components. Importantly, the presented hardware-based approach completely eliminates the need to pre-extract memory access traces, which is a major limiting factor for existing methods, by seamlessly integrating hardware simulator components into the MPSoC under investigation. Moreover, the access patterns on shared caches are effectively captured in real-time simulation. In the experiments, the proposed framework was able to achieve up to 456 times faster simulation times compared to the fastest known software-based multiprocessor multi-level cache simulator [HRA+12].

Furthermore, a novel design space exploration algorithm is presented in this dissertation, which, combined with the hardware simulation framework, can achieve higher exploration accuracy. The algorithm enables an unprecedented portion of the vast multiprocessor multi-level cache hierarchy design space to be explored, through a carefully crafted set of stages. The experiments show that the new exploration algorithm was able to improve an MPSoC’s average cache access time by up to 18.9%, while reducing total cache size by up to 74.15% at the same time, compared to previous techniques. Provided experimental data include an extensive set of tests in order to assess the optimality of the proposed algorithm.

While enabling improved results in terms of cache access time and cache size, perhaps a more important aspect of the proposed hardware-based framework and algorithm is that a thorough and accurate exploration of a generic multiprocessor multi-level cache hierarchy design space is made practically feasible, without compromising time-to-market. The methodology allows cache hierarchies with more than two levels of caches and with typical cache parameter ranges to be explored in reasonable time, which is highly desirable but was lacking in existing methods.

1.3 Application Specific Cache Optimizations by Exploiting Dark Silicon

Improvements in manufacturing technologies have enabled smaller sized transistors over the years. Even though the operating voltage thresholds have also reduced, the power consumption per transistor has not scaled down as much. Today's technology allows billions of transistors to be placed on the same silicon die. It is widely expected that future silicon chips will contain transistors in such abundance that keeping a significant portion of them, let alone a whole chip, powered at the same time may not be possible without encountering concentrated overheating. Thus, the majority of a chip would have to be kept powered down, for the chip to function safely. This phenomenon is referred to as Dark Silicon.

Taylor [Tay12] has predicted that, by the year 2020, a staggering 93.75% of a silicon chip design will have to be kept dark (powered off) at a time. In the light of this phenomenon, researchers have proposed various techniques to exploit the Dark Silicon on a chip to perform application specific optimizations [BJS+14, CX13, CMP+14, TRGM13].

As discussed in Section 1.1, memory access performance of the same cache configuration varies between different application programs, due to contrasting memory access patterns. In embedded systems that execute a number of different applications on the same processor, using a fixed cache configuration prevents all applications from achieving optimal memory access performance. Contrastingly, having the ability to use distinct cache configurations for different applications executed on the same processor can allow significant performance gains.

Run time re-configurable caches provide the facility to change the internal organization of a cache memory while the system is in operation. However, re-configurable caches can only change between a handful of inter-dependent cache configurations, thus failing to include the optimal configurations for many application programs within the re-configurable design space. Moreover, the extra logic circuitry required to enable run time re-configuration creates overheads in terms of critical path delays and excess power consumption, which are not desirable in the context of Dark Silicon.

This dissertation describes an architecture, called switchable cache, where a single cache memory device can consist of several configurations separately within itself, by leveraging available Dark Silicon. The presented architecture proposes to keep only one cache configuration in Bright Silicon (i.e. powered on) and the rest in Dark Silicon (i.e. powered off), based on the application program under execution. Therefore, every application could use the pre-determined optimal cache configuration each time the application is executed. Experimental data show that switchable caches can significantly improve overall memory access performance while imposing negligible overheads.

The Dark Silicon budget available for caching purposes will be limited when system-wide optimizations aim to collectively exploit the benefits. In realistic situations where the number of different configurations in the switchable cache is limited by the availability of Dark Silicon, but a higher number of application programs are to be executed by the processor, selecting the optimal set of configurations for the switchable cache becomes a new design problem. For instance, if eight applications are to be executed on the system, but only four cache configurations can be accommodated by the switchable cache, the ideal four configurations should be identified at design time based on the memory access behaviour of the group of eight applications.

The design space for such an optimization problem could easily grow to massive proportions, containing several trillions of design points and growing exponentially with the number of programs and the number of candidate cache configurations. A problem instance with eight application programs, four switchable cache configurations and a pool of 315 candidate cache configurations to be selected from forms a design space with 26.38 trillion design points. The tuning of the switchable cache can be identified as an NP-Hard optimization problem, since the knapsack problem can be understood as a special case of it, and solutions by means of conventional methods are extremely time intensive.
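Using the notation of the Nomenclature, the tuning problem can be sketched as choosing an assignment $\sigma$ of applications to candidate configurations (an illustrative formulation only; the problem is formalized in Chapter 6):

$$\min_{\sigma}\; T_{avg} = \sum_{i=1}^{N_A} f_i \, T_{i\,\sigma(i)} \quad \text{subject to} \quad \bigl|\{\sigma(i) : 1 \le i \le N_A\}\bigr| \le N_S,\qquad \sigma(i) \in \{1,\dots,N_{CC}\}$$

that is, every application $A_i$ is mapped to one of the $N_{CC}$ candidate configurations, while at most $N_S$ distinct configurations may be instantiated in the switchable cache.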

A new design time algorithm is presented in this dissertation to rapidly pre-determine the optimal or a near-optimal set of switchable cache configurations, which substantially improves the overall cache access performance for a given group of application programs. Using the data provided by the hardware cache simulators, the proposed heuristic algorithm could quickly find the solution, in under two seconds for most experiments. The presented work is the very first in the direction of switchable caches and their associated design space exploration problem.


To enhance the robustness of the solution for switchable cache tuning, the design tool should be able to guarantee the optimality of the design space exploration. Thus, an alternative search methodology is presented by employing Answer Set Programming, which is a declarative logic programming technique that is primarily aimed at solving difficult NP-Hard problems with guaranteed optimality.

Design optimization of cache memories is a broad topic which encompasses a wide range of problems, most of which involve exploring design spaces of massive proportions. Satisfactorily solving such problems in reasonable time requires innovative design space exploration methods, and the solutions enable application-specific processing systems to gain improved performance and alleviate memory-related bottleneck issues. Moreover, the ability to quickly explore vast cache design spaces in a thorough manner can allow designers and researchers to gain invaluable insight. In the following chapters, this dissertation presents novel design space exploration methodologies and associated hardware implementations, which make quickly solving difficult cache design optimization problems practical and feasible.

Chapter 2 provides a brief overview of cache memory fundamentals. A comprehensive survey of literature on cache design space exploration is presented in Chapter 3. Chapter 4 presents the implementation details of the hardware-based multiprocessor cache simulation framework, while Chapter 5 describes a novel design space exploration algorithm which uses the simulation framework. In Chapter 6, the switchable cache architecture is presented and the associated design space exploration problem is solved using a heuristic algorithm. Using answer set programming to solve the switchable cache tuning problem is discussed in Chapter 7. Finally, Chapter 8 concludes the dissertation and provides directions for furthering cache design space exploration research.

Chapter 2

Cache Basics

A cache memory is intended to hold a subset of data from memory close to the processor for efficient access. Caches become beneficial due to the temporal and spatial localities within application programs, where most recently used data are likely to be reused and adjacent data elements are likely to be accessed in sequence. In the following sections, basic concepts and terminology with regard to cache structure, organizations and hierarchical systems will be discussed, which will be used in describing the design space exploration methods in upcoming chapters.

2.1 Structure of a Cache

The structure and size of a cache memory, as mentioned in Chapter 1, are governed by the parameters Block Size (B), Set Size (S) and Associativity (A). Figure 2.1 depicts the basic internal organization of a cache memory device, with respect to the above parameters.


[Figure 2.1 shows a cache organized as a set size of S = 8 sets, each containing A = 4 blocks (one per associative way), with the block size, set size, associativity, sets and associative ways labelled.]

Figure 2.1: Structure and organization within a cache memory.

In caching, data is fetched, stored and evicted as units called blocks. Block Size denotes the size of a data block in number of bytes. Typical block sizes range from four bytes (as most systems use four byte words) up to and over 256 bytes in certain cases. Larger block sizes can better exploit spatial locality, as more adjacent data words will be fetched to the cache anticipating future use, although causing the bus traffic and energy consumption to increase.

As caches are designed to hold a subset of data from memory, several memory block addresses are mapped to the same cache block location, and any given memory block address can be mapped to either one or a small fixed number of cache locations. Associativity defines the number of such locations within a cache that a given mem- ory block can be mapped to. An associativity of one (A = 1) means every memory

14 2. Cache Basics block maps to one fixed location in the cache, known as a direct mapped cache. Higher levels of associativity (as depicted in Figure 2.1) allows one-to-many map- ping for memory blocks, and essentially decreases the chance of one cached block to be evicted to make space for a new one. Associativities of two, four or eight are commonly used, while much higher levels can also be seen in practice, although rare. The storage arrays for the additional degrees of associativity is referred to as ways (e.g. 4-way set associative). Associative ways are searched in parallel to find a matching address tag for a cache access, therefore higher degrees of associativity impose heavy logic overheads, delays and higher energy consumption.

A set is the location (or the collection of locations) that a given memory block may be mapped to in a cache. The parameter Set Size denotes the number of such sets in the cache structure, which essentially determines the size of the cache. Set size can range from one up to and over 1024 for general purpose use, depending on the system. A set size of one with a very high degree of associativity is known as a fully associative cache, where the mapping of memory blocks to cache locations is any-to-any.
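For a byte address $a$, the mapping implied by these parameters can be sketched as follows (assuming $B$ and $S$ are powers of two):

$$\text{offset} = a \bmod B, \qquad \text{index} = \lfloor a / B \rfloor \bmod S, \qquad \text{tag} = \lfloor a / (B \cdot S) \rfloor$$

and the data capacity of the configuration is $B \cdot S \cdot A$ bytes.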

Since there are several locations available for a given memory block to reside in within set associative caches, a decision has to be made on which location is to be evicted to make space for a fetched data block. A common policy for this decision is for the Least Recently Used (LRU) block to be evicted, which supports temporal locality, and requires the chronological order of accesses to the cached blocks to be maintained in records. The other most common block replacement policy is First In First Out (FIFO), which does not exploit temporal locality as well as LRU, but is far simpler to implement as much less record keeping is involved.
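The following is a minimal Python sketch of this behaviour, in the style of a software trace-driven simulator (illustrative only; the hardware-based simulator described in Chapter 4 counts hits very differently):

```python
class CacheSet:
    """One set of an A-way set-associative cache with LRU replacement."""

    def __init__(self, associativity):
        self.associativity = associativity
        self.tags = []  # ordered from most recently used to least recently used

    def access(self, tag):
        """Return True on a hit, False on a miss; update LRU order either way."""
        if tag in self.tags:
            self.tags.remove(tag)        # hit: promote to most recently used
            self.tags.insert(0, tag)
            return True
        if len(self.tags) == self.associativity:
            self.tags.pop()              # miss on a full set: evict the LRU block
        self.tags.insert(0, tag)         # place the newly fetched block
        return False


def count_hits(addresses, block_size, set_size, associativity):
    """Trace-driven hit counting for one configuration (B, S, A)."""
    sets = [CacheSet(associativity) for _ in range(set_size)]
    hits = 0
    for addr in addresses:
        block = addr // block_size                 # memory block number
        index, tag = block % set_size, block // set_size
        if sets[index].access(tag):
            hits += 1
    return hits
```

Implementing FIFO instead only requires leaving the recency order untouched on a hit.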


2.2 Hierarchical and Shared Cache Organizations

More than one cache memory device can be employed between a processor and main memory to improve memory access performance. Figure 2.2 depicts typical hierarchical cache organizations in uniprocessor and multiprocessor systems. The cache levels closer to the processor are referred to as upper levels and the cache levels closer to the memory are referred to as lower levels. The lowermost cache is sometimes called the last level cache. It is common for the upper level caches to be embedded into the same chip die as the processor, hence achieving very low access latencies. Last level caches are typically much larger than upper level caches, and may therefore be placed off chip.

[Figure 2.2 shows a uniprocessor hierarchy with private L1, L2 and L3 caches alongside a multiprocessor hierarchy with per-processor private L1 caches, shared L2 caches and a shared L3 cache, both backed by main memory.]

Figure 2.2: Hierarchical cache organizations in uniprocessor and multiprocessor systems.

Hierarchical caches may implement either inclusive or exclusive relationships between adjacent cache levels. In inclusive hierarchies, all data cached in an upper level cache forms a subset of the data cached in the connected lower level cache. Thus, a cache miss from an upper level always results in data being requested from the next lower level, which in turn could either be a hit or a miss. In exclusive hierarchies, which are less commonly implemented, two connected cache levels hold mutually exclusive sets of data. Therefore, a miss at either level triggers an access to the main memory.
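For an inclusive two-level hierarchy, the effective access time seen by the processor can be sketched as follows (a hedged approximation using per-level hit rates and hit latencies, and the main-memory access time $T_m$ from the Nomenclature; the detailed per-configuration model appears in Chapter 5):

$$T \approx HL_{L1} + (1 - HR_{L1})\left[ HL_{L2} + (1 - HR_{L2})\, T_m \right]$$

which is one reason the configurations chosen for adjacent levels cannot be tuned in isolation.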

Shared memory multiprocessor systems use generic cache hierarchies as shown in Figure 2.2, where processors can use private caches as well as share access to lower level caches. Shared caches result in contention between processors (or upper level caches) to gain concurrent access, which is dependent on the memory access behaviour of the application programs being executed. Systems which implement Von Neumann or Modified Harvard architectures can employ caches which hold both data and instruction blocks, which are known as unified caches. In Modified Harvard machines, unified caches can typically be seen providing shared access at lower levels, while upper level caches tend to be private.

2.3 Cache Design Spaces

The design space of a single cache memory is composed of the parameters block size, set size and associativity. Every design point in such a space represents a unique cache configuration with a particular combination of values for the above parameters. A typical design space can have several hundred such configurations, each of which exhibits unique access and update latencies as well as a unique hit rate for a given memory access sequence.

Due to the interactions between connected caches, the design space of a generic inclusive hierarchy is the cross product of the sub-design-spaces of all individual caches. A design point in such a space gives a unique cache hierarchy where every cache has a fixed configuration from its sub-design-space. With each cache's sub-design-space consisting of three dimensions, the design space of the cache hierarchy is composed of a massive number of dimensions and usually contains billions or even trillions of unique design points.
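For illustration only (the per-cache count of roughly 300 configurations and the five-cache hierarchy are assumptions, not figures from the thesis), the cross product already reaches the trillions for a modest hierarchy:

```latex
\left| \mathcal{D}_{\text{hierarchy}} \right| \;=\; \prod_{c \,\in\, \text{caches}} \left| \mathcal{D}_{c} \right|
\;\approx\; 300^{5} \;\approx\; 2.4 \times 10^{12}
```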

Chapter 3

Literature Review

The speed difference between processor and memory is an age-old problem in computer systems. Improvements in DRAM (Dynamic Random Access Memory) technology have always evolved at a slower rate than those of processor logic [HP11, MV99, KGB96], making computation increasingly faster and cheaper relative to storage and communication. Therefore, processors could operate at much higher frequencies, and the gap with memory speed only continued to increase over the years. This essentially meant a processor has to wait a long time (several processor clock cycles) for a memory access request to be serviced. Even though processor frequency scaling has eventually reached technological limits, the advent of multiprocessors and many-core systems only increased the pressure on the memory side. Systems with more processor cores suffer more from additional memory contention issues, making the memory an even more severe performance bottleneck.

A myriad of solutions have been proposed throughout the years to improve memory access performance. The vast majority of the solutions involve either increasing memory bandwidth [MV99] or making efficient use of the available bandwidth, using techniques such as Dynamic Access Ordering [MWL95], Logic-DRAM integration [PAC+97] and Address/Data Compression [FP91, LDK99, BM96]. Caching of data is by far the most prominent solution among all, where small but fast memory devices are used ahead of main memory in order to provide efficient access to a subset of data. Caching becomes significantly beneficial due to the temporal and spatial localities present in application programs (see Chapter 2), cutting down the number of accesses to the main memory by over 99% in some cases.

Cache memories are in widespread use, covering a broad spectrum of computing systems including general purpose computer processors, high performance systems and embedded systems. Accordingly, research over the past few decades contains numerous works focusing on cache related optimizations [SD95, KCDM98, GG04].

Due to the power and performance critical nature of modern embedded systems, the necessity to find the optimal (or near optimal) cache configuration has become an important issue in computer systems design. Modern embedded systems can have one or more processor cores on board and execute different classes of application programs including multi-media, security, compression, control and productivity among others. Sequences of memory accesses made by application programs are often highly diverse. Therefore, different applications sustain different cache hit rates from the same cache configuration, and the same application can achieve varying hit rates from different cache configurations [GRZVD04, JIS06, SJP13]. Finding the cache hit rate of a particular configuration for a given application program, known commonly as cache simulation, is therefore an important component when evaluating a cache memory. As discussed in the following sections, the process of cache simulation is a lengthy exercise which requires massive amounts of storage space as well as time.

Even though the hit rate is a major constituent of cache performance, the hit rate alone is not a sufficient performance measure. Attributes such as cache access latency, update latency, etc. also vary between different cache configurations, complicating the matter of selecting an optimal cache. For example, a set associative cache usually provides more cache hits than a direct mapped cache does, but the access latency of a set associative cache is comparatively higher. In [SJP13], Shwe et al. demonstrate that the cache configurations achieving the most cache hits provide neither the best application execution time nor the best energy efficiency (see Figure 1.1). Thus, proper design space explorations are required to identify cache configurations which provide optimal or near optimal cache access times, taking all attributes into account. The same holds true when the objective of optimization is cache power consumption.

A typical design space of a cache memory is formed based on the cache parameters Block Size, Set Size and Associativity, while other factors such as the replacement policy may also be considered. Definitions of the above parameters can be found in Chapter 2. The size of a cache block usually ranges from 4 bytes up to 256 bytes, in increasing powers of two, in various systems. The set size can typically vary between 1 (for fully associative caches) and 256, in increasing powers of two, and can be even higher in some cases. The value range of associativity is usually between 1 and 16, but fully associative caches may have higher degrees of associativity. With the above three cache parameters, a typical design space of a single cache memory can contain several hundred unique cache configurations.
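As an illustration of the scale involved, the short sketch below enumerates a design space built from the parameter ranges quoted above (the exact ranges in any particular system may of course differ):

```python
from itertools import product

# Parameter ranges quoted in the text, as powers of two.
block_sizes = [2**i for i in range(2, 9)]      # 4 B .. 256 B
set_sizes = [2**i for i in range(0, 9)]        # 1 .. 256 sets
associativities = [2**i for i in range(0, 5)]  # 1-way .. 16-way

design_space = list(product(block_sizes, set_sizes, associativities))
print(len(design_space))  # 7 * 9 * 5 = 315 unique cache configurations
```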


3.1 Types of Cache Simulation Methods

The literature contains a rich body of work on cache design space exploration, of which the majority is on performance estimation of configurations for a single cache. The prior work can be broadly categorized into two classes: analytical methods; and exact simulation methods. Analytical methods such as [BCB74, Rao78, Aga87, LMW96, CR09] involve mathematical modeling and analysis of cache behaviour, for example to estimate the WCET (worst case execution time). Exact simulation methods are based around counting cache hits through simulations. Owing to the high computational demands of simulation techniques, most early works are predominantly analytical.

Works such as [LMW99] by Li et al. use Integer Linear Programming (ILP) to model the behaviour of generic instruction caches. However, ILP-based techniques typically do not scale favourably in terms of analysis time. Theiling et al. [TFW00] propose to perform an analysis of instruction caches based on compiled program executables. The authors use an abstract interpretation approach and classify instructions as all-hit (every access is a cache hit), all-miss (every access is a cache miss), persistent (never evicted from the cache) or non-classified (belonging to none of the other three classes). However, a similar analysis for a data or unified cache would require the data access sequence to be generated a priori.

Timing analysis of data caches has been studied in [WMH+97, FW98]. Ferdinand et al. make use of an abstract interpretation approach for data cache analysis in [FW98]. A major difficulty in data cache analysis using instructions from a compiled application program is precisely predicting the range of data memory addresses accessed by a particular instruction (for example in execution loops). Sen et al. [SS07] attempt to address this issue by partially unrolling loops.


Analytical methods as a whole can provide cache performance estimates reasonably quickly. However, it is particularly difficult to consider the different memory access patterns of application programs through model analysis. Thus, analytical methods fail to properly capture the application-dependent behaviour of caches. In contrast, exact cache simulation techniques focus on providing a more realistic estimate of cache hits by explicitly considering the cache access patterns of application programs.

State-of-the-art exact cache design space exploration techniques can be broadly categorised into three sub-classes: 1) System Simulations [XTM]; 2) Instruction Set Simulations (for example, in SystemC) [LPB06]; and 3) Trace-driven Simulations [GRZVD04, JIS06, TTYO09]. In full system simulation tools such as [XTM], the complete system under investigation (including processors, peripherals, memory, etc. in addition to caches) is evaluated through software simulations. These simulations are typically cycle accurate and often occur at gate level to achieve utmost accuracy, or can be performed at RTL (Register Transfer Level) or behavioural level. Methods such as performance counters are used to record data related to cache behaviour. However, system simulations are very costly to conduct, requiring high end computing resources, and still incur significant simulation times (even several days to complete). Moreover, most system simulation tools are difficult to customize and hard to program with respect to cache configurations.

Instruction set simulators [LPB06] are somewhat similar in operation, and generally work at behavioural level using tools such as SystemC or HDLs (Hardware Description Languages). Only the processor architecture is simulated along with the cache models, instead of the complete system, which allows relatively faster simulation times compared to full system simulations.


While system simulations and instruction set simulators offer precise estimations of application-dependent cache performance, the major drawback of using such simulation tools for cache design space exploration is the need for multiple repetitions. Only a single cache configuration can be evaluated in a single simulation, therefore multiple simulations are required to cover a cache design space in order to identify the optimal configuration. As an individual simulation is costly by itself, having to perform multiple repetitions is not an appealing solution at design time.

In contrast, trace-driven cache simulators do not require processor architecture related information and are the fastest of all, and therefore can be used early in the design process. In trace-driven cache simulations, the cache sub-system is isolated from all other components. Therefore, the simulation itself does not depend on other factors such as the type of the processor used, and can be performed faster compared to the other classes of simulations mentioned earlier. The memory footprint of the application program, in the form of an access trace, is the only necessary input. Even though generating a memory access trace incurs heavy time costs similar to system simulations, only one trace needs to be generated per application program. Once the trace is obtained, the same trace can be used to simulate hundreds of different cache configurations, thereby saving a significant amount of design time.


3.2 Trace-Driven Cache Simulation Techniques

Early trace-driven cache simulation tools such as Dinero IV [Hil] by Hill were widely used in practice to calculate accurate cache hit rates through simulation. The Dinero IV algorithm allows the designer to simulate the behaviour of a specified cache configuration using an extracted memory access trace of a given application executed on a given processor. The configuration could be for either an instruction cache or a data cache, or both. The relevant instruction and/or data access traces should be fed into the simulator accordingly as input, and the simulator provides the cache hit/miss rate for the specified cache configuration, as illustrated in Figure 3.1.

However, Dinero IV simulates one cache configuration at a time, therefore the designer has to run the simulator repeatedly for different cache configurations to find the most suitable configuration for a given application's trace. Typically, a memory access trace for a few seconds of an application can consist of millions or even billions of memory accesses. Therefore, repeatedly executing the simulator algorithm for different cache configurations using such a trace can consume a lengthy amount of time, several hours and even days for large design spaces.
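The style of computation performed by a single-configuration trace-driven simulator can be sketched as below; this is a simplified, hypothetical re-implementation (one configuration per pass, LRU replacement), not the actual Dinero IV tool.

```python
from collections import deque

def simulate(trace, block_size, num_sets, associativity):
    """Count hits/misses for one cache configuration over an address trace (LRU)."""
    sets = [deque() for _ in range(num_sets)]
    hits = misses = 0
    offset_bits = block_size.bit_length() - 1
    for addr in trace:
        block = addr >> offset_bits          # block address
        ways = sets[block % num_sets]        # set the block maps to
        if block in ways:
            ways.remove(block)
            ways.append(block)               # refresh recency on a hit
            hits += 1
        else:
            if len(ways) == associativity:
                ways.popleft()               # evict the least recently used block
            ways.append(block)
            misses += 1
    return hits, misses

# Exploring a design space this way means re-running simulate() once per configuration.
```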


Figure 3.1: Overview and flow of memory access trace driven simulation methods.


Different methods have been proposed to alleviate the high time and space costs of trace-driven simulation. Wang and Baer's approach in [WB91] was to reduce the size of the memory access trace used in the simulation, which lessens the simulation time and greatly reduces the storage space and time required to generate the trace, at the cost of simulation accuracy. Other approaches, such as [HS90] by Heidelberger and Stone, proposed to use the high correlation of activity between different cache sets to reduce the number of re-simulations required.

The literature on cache design space exploration marked a significant milestone when parallel simulation of cache configurations was introduced by Sugumar et al. [SA95]. There, the fundamental idea is to emulate cached address tags for multiple cache configurations simultaneously, reducing the time spent to explore the design space. Sugumar's work [SA95] proposed to efficiently count cache hits for a range of set-associative caches with the same block size, but varying associativities and varying set sizes. Compared to previous simulation techniques, parallel simulation covers a number of cache configurations in a single pass over a given memory access trace. Moreover, storing of the actual cached data is not emulated, in contrast to early works. Instead, cache hits are counted for the configurations under simulation, using generalized binomial tree structures which keep track of address tags.

Extending Sugumar's work, Janapsatya et al. introduced a trace-driven cache simulation method [JIS06] for first level (L1) caches, based on the formulations presented earlier in [HS89] by Hill et al. It is a single-pass simulation method, in the sense that the memory access trace is read just once to evaluate all the configurations. A forest of binomial tree data structures, linked lists and an array are used to model the space of all cache configurations that should be explored, as illustrated in Figure 3.2.



Figure 3.2: Simulation data structures used by Janapsatya et al. in [JIS06].

The array shown in Figure 3.2 contains hit/miss counters for each configuration under simulation and stores pointers to the tree structures for the relevant block size. Each level in a binomial tree structure corresponds to a set of cache configurations with the same block size and set size but varying associativities, and each individual tree node represents a single cache set. Linked lists are connected to each tree node, representing the set-associative cache ways and therefore the different associativities. The linked lists store the address tags which are compared against the access addresses from the input memory access trace, to assess whether each access is a hit or a miss. The address tags stored in the linked lists are sorted according to the access history, from most recently used (MRU) to least recently used (LRU), as shown in Figure 3.2. It should be noted that cached data is not stored in the simulation, as the tags are sufficient to determine whether an access is a hit or a miss.

Visiting all of the tree nodes for each memory address in the access trace would be an exhaustive task which can consume a lengthy amount of time, especially with a vast number of cache configurations. In order to remedy this issue and improve simulation efficiency, Janapsatya et al. make use of two critical observations, initially introduced by Mattson et al. [MGST70] and Hill et al. [HS89].

Property 1: The first observation states that when a cache hit occurs for an address MA in a cache configuration K (block size B, set size S, associativity A), all other configurations K' (block size B, set size S', associativity A) where S' > S, with LRU replacement policy, can also be guaranteed to have hits for address MA.

Property 2: The second observation states that when a cache hit occurs for an address MA in a cache configuration K (block size B, set size S, associativity A), all other configurations K' (block size B, set size S, associativity A') where A' > A, with LRU replacement policy, can also be guaranteed to have hits for address MA.

The above two correlations between cache configurations were first used by Mattson et al. in [MGST70] to find the frequency of accesses to different levels in a memory hierarchy, and by Hill et al. in [HS89] to analyse the effects of associativity on cache miss rate. Janapsatya et al. [JIS06] use the correlations from Property 1 and Property 2 in a manner that allows cache hits/misses to be assessed for a group of configurations at once, enabling rapid evaluation of hit/miss rates: simulating a subset of cache configurations from the design space is sufficient to accurately evaluate the complete design space. This improvement reduces the time complexity of Janapsatya's algorithm considerably, enabling up to 45 times faster simulation on average compared to Dinero IV.
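A minimal way to exploit the two properties is sketched below (an illustrative fragment written for this discussion, assuming LRU replacement and a fixed block size): once a hit is established for one configuration, hits are credited to the configurations implied by Properties 1 and 2 without searching them. Applying both properties transitively also covers configurations that are larger in both dimensions.

```python
def credit_hits(hit_counters, set_sizes, assocs, hit_set_size, hit_assoc):
    """Apply Properties 1 and 2 after a hit in configuration (hit_set_size, hit_assoc):
    the hit also holds for every configuration with a larger set size (same
    associativity) or a larger associativity (same set size), assuming LRU
    replacement and a fixed block size.
    `hit_counters` maps (set_size, associativity) -> hit count."""
    for s in set_sizes:
        for a in assocs:
            if (s >= hit_set_size and a == hit_assoc) or \
               (s == hit_set_size and a >= hit_assoc):
                hit_counters[(s, a)] += 1
```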

Janapsatya’s work employs analytical cache models for timing and energy consump- tion, to quantify performance and energy measures and assess suitability of the ex- plored cache configurations. The equations incorporate the calculated exact cache hit/miss rates in to finding the memory access latencies and consumed energy.

The work in [TTYO09] by Tojo et al. later proposed additional improvements to the approach of Janapsatya et al. [JIS06]. Tojo's work utilizes the cache inclusion property presented in [MGST70] to define a new heuristic. The cache inclusion property states that a cache configuration K' is a subset of a cache K if all the contents of K' are included in the contents of K. It can be observed that any cache configuration can automatically be a subset of another configuration with a higher number of sets. Therefore, the heuristic is constructed as follows:

Property 3: When a cache hit occurs for an address MA in a direct mapped cache configuration K (block size B, set size S, associativity 1), all other configurations K' (block size B, set size S', associativity A) where S' > S can also be guaranteed to have hits for address MA.

Based on Property 3, Tojo et al. proposed a modified algorithm [TTYO09] called CRCB1, which further reduces the subset of cache configurations that need to be simulated in order to explore the complete design space, without compromising the accuracy. Further, the authors of [TTYO09] extend their heuristic to cover additional ground, using the observation described below:

Property 4: Consecutive accesses to the most recently accessed memory address MA are guaranteed to result in hits for all the cache configurations K (block size B, set size S, associativity A) where S ≥ 1, B ≥ 1 and A ≥ 1.

Property 4 can essentially be viewed as a generalization of Property 3. This observation is used in the CRCB2 algorithm, which is added on top of CRCB1, and reduces the number of cache hit/miss assessments in the simulation by a significant amount. The CRCB2 approach is claimed to provide, on average, 1.8 times faster trace-driven cache simulation compared to Janapsatya's method [JIS06].

Haque, Janapsatya et al. proposed enhancements to the original algorithm from [JIS06] in their subsequent work [HJP09] called SuSeSim (Super Set Simulator). There, two additional correlations among cache configurations were observed in the design space, which could be used to further reduce the total simulation time.

Property 5: When a cache miss occurs for an address MA in a cache configuration K (block size B, set size S, associativity A), all other configurations K' (block size B, set size S', associativity A), where S' < S, are also guaranteed to have misses for address MA.


Property 6: For an address MA, a cache hit in a configuration K (block size B, set size S, associativity A) implies that cache misses will occur in the more recently used cache ways of all configurations K' (block size B, set size S', associativity A) where S' < S.

In contrast to Properties 1 to 4, which are used to evaluate cache hits, Properties 5 and 6 aid in evaluating cache misses in a group of configurations. Consequently, the simulation algorithm in SuSeSim [HJP09] counts cache misses for each configuration rather than cache hits. Haque's method therefore takes a bottom-up approach, where the cache configurations with a higher number of sets are evaluated first, which enables smaller set sizes to be covered automatically. Haque et al. make use of doubly linked lists to store cache tags and incorporate forward and reverse search functions when searching for tag matches. This allows the tag searching to be 16% faster on average and the overall algorithm to be 33% faster than the method in [TTYO09].

When designing embedded processor systems, the FIFO (First In First Out) cache replacement policy is generally preferred over the LRU (Least Recently Used) replacement policy for set associative caches. This is largely because FIFO replacement is comparatively simpler to implement and consumes less chip area as well as less energy. Building on the previous works, Haque et al. proposed a single-pass cache simulation method [HPJP10] named DEW for caches using FIFO replacement. Several data structure enhancements were introduced in [HPJP10] compared to the previous method [HJP09].

Each cache tag stored in the linked list in [HPJP10] is associated with a wave pointer which points to the corresponding tag in the cache with the next larger set size.


The additional information from the wave pointers allows the simulation algorithm to directly access the location where a cache entry should exist, without having to search through a list. Thus the search time is dramatically reduced compared to the previous approaches.

Also in [HPJP10], a binomial tree node, which represents a cache set in a configuration, is associated with details about the most recently accessed address (MRA) and the most recently evicted address (MRE). Property 4 discussed above states that consecutive accesses to the most recently accessed address are always hits. Therefore it follows that:

Property 7: Access to the most recently evicted memory address MA is always a miss for all the cache configurations K (block size B, set size S, associativity A) with S ≥ 1, B ≥ 1 and A ≥ 1.

Thus, storing the MRA and MRE tags for each cache set allows faster assessment of cache hits and misses, respectively, in the simulation. According to temporal locality, which states that recently accessed addresses are more likely to be accessed again, the MRA address is the most likely to be re-accessed out of all the resident blocks, whereas the MRE address is the most likely to be re-accessed out of all the evicted blocks. The use of Property 7 potentially reduces the search time even further for the simulator. Haque et al. claim that the DEW algorithm is up to 40 times faster than Dinero IV, and at least 8 times faster in the worst case, for MediaBench applications. However, it is worth noting that the above improvements to the simulation time are achieved at the expense of storage space for the simulator.
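The effect of the MRA/MRE bookkeeping can be illustrated with the following fragment (a hypothetical simplification of the DEW fast path, not the published algorithm): the two stored tags are checked before any list traversal or wave-pointer lookup is attempted.

```python
from dataclasses import dataclass, field

@dataclass
class SimSet:
    tags: list = field(default_factory=list)  # resident tags, MRU first
    mra: int | None = None                    # most recently accessed tag
    mre: int | None = None                    # most recently evicted tag

def classify_fast(tag: int, s: SimSet):
    """Fast-path classification for one simulated cache set.

    Assumes `mre` is cleared whenever the evicted block is fetched again.
    Returns "hit", "miss", or None when a full tag search is still required.
    """
    if tag == s.mra:
        return "hit"    # Property 4: repeated access to the MRA is always a hit
    if tag == s.mre:
        return "miss"   # Property 7: access to the MRE is always a miss
    return None         # fall back to searching the tag list / wave pointers
```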



Figure 3.3: Example CLT data structure used by Haque et al. in the SCUD algorithm [HPJP12]: A CLT contains an entry for each memory block in the system. Every CLT entry is associated with records which represent the different cache set sizes in the simulation. Records indicate the availability of the memory block in a group of configurations with the same set size and varying associativities.


The subsequent work of Haque et al. [HPJP12], named SCUD, presented a different approach from their previous methods. It is still a trace-driven exact cache simulation, however the simulation space is observed from the perspective of memory blocks, as opposed to cache locations. Therefore the authors use a data structure named the Central Lookup Table (CLT), as depicted in Figure 3.3.

The simulator in [HPJP12] consists of one CLT for each cache block size simulated, and CLTs are sorted by the block addresses. A CLT contains entries for all memory blocks present in the cache configurations. Each memory block entry is attributed with records, which serve to indicate the availability of the said memory block in different cache configurations, and a count (number of configurations where the block is available). In a single block entry, there are as many records as the number of different cache set sizes in the simulation. Each record contains information about which configurations with different associativities contain the memory block, and the associated count.

In SCUD, Haque et al. use a binomial tree of cache set nodes in association with the new CLT. The binomial tree is similar to the ones in [JIS06, HJP09], and is used to update the CLTs while reading through the memory access trace. Using the CLTs provides the SCUD simulator with the ability to quickly determine whether a memory access is a hit or a miss in all configurations under simulation. This is made possible by the Count value associated with each CLT entry. A count of 0 for a particular block indicates a miss in all configurations, and the count being the highest possible value (which depends on the simulated design space) indicates hits in all configurations.
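The role of the count can be illustrated as follows; this is a simplified, hypothetical rendering of the CLT idea written for this discussion, not the published data structure.

```python
def classify_with_clt(clt_entry: dict, total_configs: int):
    """Classify one access using a CLT-style entry for the accessed block.

    `clt_entry` is assumed to hold a per-configuration availability map
    {(set_size, associativity): bool} and a precomputed 'count' of the
    configurations currently holding the block.
    """
    if clt_entry["count"] == 0:
        return "miss in all configurations"
    if clt_entry["count"] == total_configs:
        return "hit in all configurations"
    # Otherwise the per-record availability flags decide configuration by configuration.
    return {cfg: ("hit" if present else "miss")
            for cfg, present in clt_entry["availability"].items()}

# Example entry for a block resident in 2 of 4 simulated configurations.
entry = {"count": 2, "availability": {(1, 1): False, (1, 2): False,
                                      (2, 1): True,  (2, 2): True}}
print(classify_with_clt(entry, total_configs=4))
```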


Haque et al. claim that the SCUD simulator is on average 19 times faster than Dinero IV for MediaBench applications and 10 times faster for SPEC CPU2000 applications. The downside is that the simulation speedups are obtained at a considerable cost of storage space. Since the simulator stores a vast amount of information associated with each memory block in the CLT for all the configurations, the space complexity increases exponentially when simulating a large number of cache configurations, especially for systems with relatively large memories.

Most of the correlation properties studied above are based on the contents of a smaller configuration being a subset of the contents of a larger configuration. This made it possible to draw conclusions about larger configurations when simulating a smaller one, or vice versa. Haque et al. proposed a set of intersection properties [HPP11] for caches with FIFO replacement, which predict the availability of memory blocks in other configurations subject to certain conditions.

Viana et al. formulated a different trace-driven cache simulator, called SPCE, in [VGRBV08]. Viana’s approach closely resembles Janapsatya’s method, but uses different data structures to evaluate cache hits and misses.

In the SPCE algorithm, Viana et al. determine whether a memory access is a hit or a miss by keeping track of how many unique addresses, mapping to the same cache set, were accessed after the previous reference of the current address. The term Conflicts is used for this count. In other words, once a block is fetched into the cache, the number of conflicts on the same set determines when that block will be evicted from the cache due to a conflict. If the associativity of the concerned configuration is larger than the number of conflicts recorded, then the currently accessed block must still be available in the cache. A set of structures named Conflict Tables is used to analyse the conflicts. One table is created for each degree of associativity under consideration, containing entries for different block sizes and set sizes. Considering all the tables together, there are as many entries as the number of cache configurations explored.

The SPCE algorithm uses a stack structure to keep track of previously accessed memory blocks (Figure 3.4). If an address is not found in the stack, it is pushed onto the top of the stack, and the access is deemed a miss for all configurations. Once an address is found, the stack is scanned to see how many conflicts occurred after it was previously accessed. This determines which levels of associativity allow that block to remain in the cache, and the conflict tables are updated accordingly. The cache inclusion property is used to determine the cache set sizes where hits could occur. The address is then removed and pushed back onto the top of the stack. The final hit and miss rates are calculated from the values in the conflict tables at the end of the simulation.
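The core of the stack analysis can be sketched as below (an illustrative fragment for a single block size and a single set size; the real algorithm considers all block and set sizes and maintains one conflict table per associativity):

```python
def conflicts_since_last_access(stack: list, block: int, num_sets: int):
    """Count blocks mapping to the same set as `block` that were accessed
    since `block` was last referenced (stack is most recent first).
    Returns None if `block` has never been accessed before (a compulsory miss).
    """
    if block not in stack:
        return None
    conflicts = 0
    for other in stack:
        if other == block:
            break
        if other % num_sets == block % num_sets:   # maps to the same cache set
            conflicts += 1
    # The access is a hit in every configuration whose associativity exceeds
    # the conflict count; the corresponding conflict table entries are updated.
    return conflicts
```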


Figure 3.4: Overview of SPCE algorithm by Viana et al. in [VGRBV08].


The formulation of the SPCE algorithm provides the benefit of single-pass simulation of a memory access trace without consuming excessive storage space, albeit with similar space complexity to the binomial tree based methods. However, the approach results in a large number of operations being carried out on the stack structure, which consumes the majority of the simulation time. The results show that the SPCE simulator obtains the miss rates for a given trace 14.88 times faster than Dinero IV on average, for the applications in the Motorola PowerStone benchmark suite. Therefore this method is not as fast as the other simulators discussed above, but is efficient in terms of storage space.

Extending the work by Viana et al., Zang et al. [ZGR11] proposed a stack-based single-pass cache simulator for two-level caches. The major challenge in two-level cache simulation is producing the filtered access trace for the L2 cache. The L2 cache's access trace is comprised of the missed accesses from the L1 cache. A single-pass simulator analyses a vast space of L1 cache configurations simultaneously, and the L2 access trace for each of the L1 configurations is unique. This results in n different L2 cache simulations, where n is the number of simulated L1 cache configurations. Therefore the storage space and simulation time consumption could increase exponentially beyond practical bounds.

In order to avoid this complication with two cache levels, Zang et al. limit their scope to exclusive two-level caches with LRU replacement for L1 and FIFO replacement for L2. In exclusive caches, the content of each cache level is a disjoint set from the other; a cache block in one level is guaranteed not to exist in the other level. This enables the simulator to view the two cache levels as one single cache using the original access trace, with only a minimal loss of accuracy in the L2 miss rate estimation. Figure 3.5 depicts the two-level cache simulator, named T-SPaCS.

37 3. Literature Review


Figure 3.5: T-SPaCS algorithm by Zang et al. in [ZGR11] (Si - set size for level i, B - block size, Ai - associativity for level i, C - conflict tables, b - number of simulated block sizes).

However, the combination of two caches enlarges the stack structure dramatically, which degrades the simulator's performance considerably. In order to remedy this, the authors associate tree and array data structures to determine conflicts faster for different set sizes and different associativities, which in turn increases the space complexity of the algorithm.

Zang et al. continued their work in [ZGR12] by modifying T-SPaCS to simulate unified two-level caches. In unified cache architectures, the instruction and data caches in the first level are separate while the second-level cache hosts both types of blocks. The modified simulator in [ZGR12] is called U-SPaCS. The memory access trace is divided into two separate instruction and data traces, and two stacks are used for these accordingly. Separate analyses are performed for the two L1 instruction and data caches, and the L2 analysis occurs in the event of a miss from either of the L1 caches. Both the T-SPaCS [ZGR11] and U-SPaCS [ZGR12] simulators support only exclusive two-level caches, and do not possess the ability to explore inclusive cache hierarchies.

3.3 Hardware Assistance in Cache Simulation

The major limitation of trace-driven cache design space exploration is the significantly high simulation time consumption, which occurs in two forms. The first is the time taken to extract the memory access trace of an application program executed by a processor, which is a painstakingly slow process. Typically, trace extraction is done by simulating the instruction set of the processor in software, which takes up to several days to generate the access trace for a few seconds of execution of the program. For instance, encoding 24 low-resolution frames with MPEG2 (one second worth of video) [NSJP14] takes 72 hours to extract the memory access trace, which contains 12 billion address references. Extracting the address traces through specialized hardware is faster, but such devices incur extremely high costs and can only be used to extract very small portions of an access trace. The second contribution to the high simulation time consumption comes from the software simulation itself, which could take up to several hours depending on the number of simulated cache configurations and the size of the access trace.


Modern design optimization methodologies involve Hybrid Simulation techniques [CPN+09], where the repetitive and time intensive portions in the design process are accelerated using the assistance of hardware components. Such acceleration methods can be found in the literature of cache simulation and design space exploration. In [LL03] Lu et al. discuss a method to listen to the memory access bus in real-time and use the retrieved trace data to evaluate a cache. Interfacing with the memory bus is implemented using an FPGA (Field Programmable Gate Array) device in [LL03], and additional FPGAs are used to emulate up to four cache configurations.

Similarly, the work in [HNL06] by Hong et al. presents an FPGA-based emulator that models a last-level cache actively and in real-time by listening to the memory accesses generated by a host processor system. Hong's work emulates a single cache configuration at a time, requiring multiple re-simulations for a complete design space exploration. Both Hong's [HNL06] and Lu's [LL03] approaches fall under the category of system simulation, thus requiring repetitive re-simulations to cover a cache design space. Even though the hardware assistance in such methods helps avoid having to extract memory access traces a priori, simulation using multiple FPGA devices connected to the memory bus incurs heavy costs in terms of resources as well as synthesis time.

Han et al. [HXXY11] proposed an acceleration scheme which uses a Graphics Processing Unit (GPU) for the simulation of a cache. CUDA (Compute Unified Device Architecture) processors of NVIDIA [NVI] GPUs are used in [HXXY11] to efficiently emulate cache behaviour. However, Han's method can only simulate up to 6 cache configurations in a single pass over a pre-extracted memory access trace.


The methods in [LL03, HNL06, HXXY11] only analyse a single configuration or a few configurations for a cache in a single simulation run. The need for multiple re-simulations, combined with the high resource costs of hardware acceleration, makes such methods unattractive for efficient cache design space exploration.

In the state-of-the-art work [SPP14a], Schneider et al. proposed a novel solution to alleviate the high time penalties of trace-driven cache simulation, by introducing hardware acceleration to parallel simulation techniques such as [JIS06, HJP09]. Schneider's simulator [SPP14a] is a hardware component designed in VHDL and implemented on FPGA logic fabric. As depicted in Figure 3.6, the simulator operates on an FPGA device and receives the memory access trace from the host computer attached to the FPGA through a PCIe (Peripheral Component Interconnect Express) connection.

The major benefit of Schneider's approach, over previous hardware accelerations, is that a complete design space with several hundred cache configurations can be simulated with a single pass over a memory access trace. The hardware in [SPP14a] is designed to use the LRU replacement policy for set-associative cache configurations.


Figure 3.6: Operation of the FPGA cache simulator in [SPP14a].


The simulation is done in parallel for different degrees of associativity, and the operation of the simulation is pipelined in hardware with respect to different cache set sizes, which significantly improves simulation speed. The structure of Schneider's FPGA simulation core adopts most features of the prior binomial-tree-based simulation methods, especially utilizing Property 2 and Property 5 described in Section 3.2 to reduce the logic footprint used by the simulator on the FPGA, which allows a large number of cache configurations to be encompassed in the design space. Therefore, only the cache sets for the largest simulated set size (the first pipeline stage of the simulation) store the tag information, and decisions for smaller set sizes are inferred using the above inclusion properties.

Schneider et al. were able to demonstrate simulation speedups of up to 53 times using the FPGA cache simulator, compared to the fastest software simulators available, which makes it the fastest design tool presented to date for exploring the cache configuration design space. The presented simulation times are in the order of seconds and milliseconds, which are substantial improvements over the previous cache simulation methods. Additionally, the simulator core itself operating in hardware leaves open the possibility of real-time extraction of memory access information from a processor system working in the same FPGA, eliminating the tedious process of extracting the memory access trace beforehand.

In their subsequent work [SPP14b], Schneider et al. presented a variation of the original FPGA cache simulator, which implements FIFO cache replacement policy for set associative caches. A new set of cache inclusion properties were used in [SPP14b] to reduce the number of cache sets to be simulated in order to cover the given design space. Speedups up to 11 times were reported by the authors, compared to the fastest software based FIFO cache simulator [HPP11].


3.4 Exploring Multiprocessor Cache Hierarchies

Many computing systems have adopted multiprocessors in order to cope with emerging parallel workloads. Modern embedded systems are a forerunner in this respect, where a major shift has occurred towards using Multi-Processor Systems on Chip (MPSoC), which allow overlapped and parallel execution of programs to achieve higher throughputs. Sharing memory address spaces is a preferred way of facilitating communication between programs on a multiprocessor. Among the many shared memory models, the Symmetric Multi-Processor (SMP) model is the most widely used architecture. There, all the processors in the system share a single memory, with partitioned address spaces. The unique feature of this model is that each processor has similar memory access times, as opposed to distributed memory models. Consequently, the recent literature looks at methods to explore the design space of cache hierarchies for such multiprocessor systems.

Multiprocessor caches involve additional complications due to the caching of shared data. When one processor writes to a shared memory block which is already cached by another processor, the entry in the second processor's cache becomes invalid (or stale). Different cache coherence techniques are used to make sure that the contents of all the caches are up to date [Mar08]. Capturing the effects of coherence management in cache simulation is a difficult prospect.

Typical multiprocessor cache hierarchies contain private caches for individual processors (usually at L1), as well as shared caches in lower levels (such as L2 and L3). Such cache hierarchies are mostly inclusive, meaning data present in upper levels is a subset of the data available in lower levels. Therefore, it is virtually impossible to simulate cache configurations for a multiprocessor cache hierarchy in a single pass using a memory access trace. The sub-design-space of each individual cache is dependent on the configurations of the other caches because: the access trace seen by lower-level caches depends on the upper-level cache configurations; and the miss latency experienced by upper-level caches depends on the lower-level cache configurations. Thus, the overall design space of a multiprocessor cache hierarchy is the cross product of the sub-design-spaces of all individual caches, which is of massive proportions. Such a design space can contain several trillions of unique design points, which makes exhaustive explorations practically infeasible.

The majority of the available multiprocessor cache design exploration techniques are system simulations. Iyer presented a simulation tool [Iye03] named CASPER to evaluate a multiprocessor cache hierarchy using extracted memory access traces, where each cache has a fixed configuration. CASPER provides a rich set of cache performance measures in addition to implementing the MESI coherence mechanism [PP84]. Access traces should be pre-extracted from the system under investigation, at every cache in the hierarchy, and fed to the simulator. However, since [Iye03] is a system simulation tool, multiple re-simulations are required to assess different cache hierarchy design points. As discussed earlier, extraction of a single trace is a painstakingly slow process, therefore obtaining multiple traces several times over is a daunting prospect which is practically infeasible.

Similarly, the work of Han et al. [HXXY11], which uses GPU processing to accelerate the simulation process, suffers from having to pre-extract multiple memory access traces. Han's method provides the ability to simulate a set of cache configurations for an MPSoC quickly using a GPU device, once the traces are available.


Lu et al. [LL03] presented a simulation framework where a given multiprocessor cache hierarchy can be emulated in FPGA hardware. The strength of Lu's work is not having to extract access traces, as multiple FPGA devices are used to interface with the memory subsystem of an MPSoC in real-time. Each FPGA hosts a hardware model which acts as a cache and actually stores the data, so the access traces to the lower-level caches can be produced dynamically. However, the framework in [LL03] can only support up to four different caches in a system. Moreover, being a system simulation tool, [LL03] requires multiple re-simulations to assess different design points in the cache hierarchy design space.

Due to the sheer size and complexity of multiprocessor cache design spaces, single-pass trace-driven simulation of such cache hierarchies has seldom been addressed. The literature contains approximate methods, where a minimal number of traces is used and small portions of the design space are explored in order to produce a near-optimal result.

Haque et al. proposed a single-pass trace-driven cache simulation framework for SMP MPSoC architectures in [HRA+12], considering cache coherency at the L1 caches. The method in [HRA+12] assumes a two-level inclusive cache hierarchy, with a shared L2 cache and private L1 caches per processor. The simulator core employed here is derived from CIPRASim [HPP11], using FIFO replacement policy. The aim of Haque's simulator, named DIMSim, is to find a reasonable set of cache configurations which allows the MPSoC to meet required memory access timing constraints.



Figure 3.7: DIMSim algorithm by Haque et al. in [HRA+12]


Figure 3.7 illustrates the simulation flow of the DIMSim algorithm, which is comprised of three distinct stages. A single memory access trace, from the memory's point of view, is used to configure the shared L2 cache at the first stage. The original access trace is comprised of a time-ordered sequence of memory accesses to the main memory.

In order to simulate the configurations for the L1 caches in the system, separate access traces for each individual L1 cache need to be derived from the original memory access trace. This is shown as the secondary trace generation step in Figure 3.7, at the second stage of the simulation flow.

Additional pieces of information are recorded in the secondary traces, which allow the simulator to consider cache coherence for the L1 simulation. For example, every access in the secondary traces is attributed with whether the access was a hit in the L2 configuration selected in Stage 1. Assuming that the two cache levels are inclusive, the misses in the selected L2 configuration are considered to be misses by default in all L1 configurations. Making use of the inclusiveness of the cache hierarchy along with the additional information recorded in the secondary traces, simulations for each L1 cache are carried out in Stage 3.

However, once the L1 caches are selected and in place, the accesses seen by the L2 cache are in reality composed of the cache misses from the L1 caches. Therefore, the access trace used for the L2 simulation in Stage 1 is no longer valid under the final selection of L1 caches. Thus, in this method, the L2 cache configuration selected using the original memory access trace is essentially non-optimal.

The other point worth noting is that the generated secondary traces may not be in correct chronological order when parallel execution of different applications is considered. This means that, with the L1 caches present, the order of accesses to the shared L2 cache could potentially differ from the original memory access trace, owing to the varying hit latencies of the different L1 caches. Thus, the method in [HRA+12] explores only a fragment of the overall design space, and hence cannot guarantee optimal cache performance.

In [HKH+13], Haque et al. perform a similar simulation, where the flow starts at the L1 caches and then combines the L1 access traces to simulate configurations for a shared L2 cache. Optimality still cannot be guaranteed, as the combined access trace used for the L2 simulation is not necessarily the one observed in reality.

Obtaining accurate memory access traces is a vital part of the exact simulation of multiprocessor cache configurations. Thus, extracting the memory access trace from actual hardware is preferable to software simulation of the execution. Wilson et al. based their work in [WJ90] on multiprocessor cache simulation for bus traffic analysis, by obtaining traces from hardware. Rawlins and Gordon-Ross proposed a run-time tuning methodology [RGR11] for reconfigurable data caches in a dual-core processor system. The main objective of the tuner is to reduce the energy consumption of the data caches. It uses a simple algorithm and heuristics where the caches are initialized with the smallest values for all parameters, which are periodically incremented until no further decrease in energy is observed. [RGR11] is a run-time approach and does not provide the designer with information on all concerned cache configurations.

Mariani et al. [MPZS12] evaluated evolutionary algorithms based on various heuristics to explore the multiprocessor cache design space and find an approximated Pareto front. Each design point is individually evaluated in a full system simulation, in contrast to trace-driven cache analysis. Therefore, several hundreds of such simulations are performed in the course of the evolutionary algorithm, even for the smallest of design spaces. While having the potential to find reasonable solutions, performing several hundred system simulations is a slow process taking weeks to complete. As described in [MPZS12], exploring a small design space of 128000 points by assessing only 550 points through system simulations takes 165 hours (at 18 minutes per simulation). Results show that the optimality of the achieved approximated solution varies with the explored fraction of the design space, where accuracies of 83.6% and 97.9% were achieved by exploring respectively 0.28% of a small design space (80 points out of 27648) and 1.95% of a smaller design space (80 points out of 4096).

Due to the high time cost of simulation and trace extraction, all of the above-mentioned works regarding optimization of multi-level multiprocessor cache hierarchies are predominantly limited to exploring two-level hierarchies and to exploring each cache level only once, using a limited number of memory access traces. Therefore, opportunities exist for improving the design space exploration process by speeding up the simulations, while covering generic cache hierarchies with several levels (such as L3) for application specific MPSoCs.


3.5 Cache Optimizations in Multi-Programmed Environments

Many computing systems in the embedded domain are implemented as multi-programmed systems, where the same processor and memory sub-system are used by several different application programs. Due to the application-dependent behaviour of cache performance, achieving the best performance for all programs using the same cache memory is a difficult prospect. Two types of run-time solutions can be found to alleviate this difficulty and obtain better memory access performance. The first is using run-time re-configurable cache memories, where the internal set organization can be dynamically altered. The second is using switchable caches, which house a number of different cache configurations of which only one is kept active at any given time.

Re-configurable caches [GRLC08, ZV03, ZVL04] have been proposed to perform application-dependent cache optimizations at run-time. Reconfiguration may alter the structures of the data and tag arrays to achieve performance gains with various applications. However, re-configurable caches can only represent a limited set of inter-dependent configurations. For example, the re-configurable cache presented in [ZV03] has a maximum size of 8KB and four associative ways (each way being 2KB). A way shut-down technique is used to turn off unwanted cache ways (shutting down three ways gives a 2KB direct mapped cache, shutting down two ways can give 4KB caches, either direct mapped or two-way associative), and a logical way concatenation technique is employed to combine cache ways (combining four ways gives an 8KB direct mapped cache, combining two ways can give either an 8KB two-way associative cache or a 4KB direct mapped cache). While the physical cache block size is 16 bytes, logical cache blocks are used to emulate block sizes of 32 bytes and 64 bytes. However, changing one cache parameter essentially limits the possible values for the other parameters (i.e. if the block size is 16 bytes, then the set size can only be 128, 256 or 512; if the block size is 32 bytes, then the set size can only be 64, 128 or 256; if the associativity is four-way, then the set size can only be 32, 64 or 128). Therefore, not all configurations up to a total size of 8KB are implementable through re-configuration. For example, a block size of 16 bytes cannot co-exist with a set size of 64, and a cache size of 4KB cannot co-exist with four-way associativity. As a result, the re-configurable cache in [ZV03] can only represent 18 different cache configurations. Contrastingly, a typical cache design space can contain over 300 configurations to select from. Hence, a given re-configurable cache may not be able to provide significant performance gains for a wide range of applications.

On the other hand, run-time cache switching provides a simpler way to change between configurations by leveraging the available Dark Silicon on future chips. The number of cache configurations that a switchable cache may hold is limited. However, any configuration from the cache design space may be included, in contrast to re-configurable caches, thereby enabling comparatively higher cache access performance.

For either re-configurable or switchable caches, identifying the optimal set of cache configurations that provides the best memory access performance for a given group of application programs is a vital step in the design process, especially when the number of applications using the cache is higher than the number of allowable configurations. In [VGRK+06], Viana et al. discuss similar analytical methods to determine a configuration subset, out of the small design space of a re-configurable cache, such that design-time tuning tools can be efficiently used to explore the smaller subset.


Subset selection is performed with the aim of reducing energy consumption. The algorithm in [VGRK+06], based on Keogh's heuristic [KCHP01], takes approximately a minute for a small re-configurable cache design space of 40 configurations, and would take approximately 10 minutes for a regular-sized design space.

Since architectures such as switchable caches, which exploit Dark Silicon, have only recently been proposed, the literature lacks robust exploration methods to optimally select groups of cache configurations from a vast design space. Existing hardware-based simulation techniques such as [SPP14b] can rapidly provide applications' cache performance measures for a large space of cache configurations at design-time. With such information at the disposal of designers, better tuning methods could be developed which can quickly determine the optimal cache configurations at design-time, to take better advantage of the available run-time cache optimizations.

Chapter 4

Hardware Acceleration for Multiprocessor Cache Simulation

Performing trace-driven simulations on multiprocessor cache hierarchies has thus far been difficult, due to the high time costs associated with software simulators. However, using specialized hardware accelerators in the simulation process has the potential to allow significantly faster simulation times. The work presented in this chapter describes the first ever hardware-accelerated simulation framework to rapidly perform design space exploration for a multiprocessor multi-level cache hierarchy containing private and shared caches.

As discussed in Chapter 3, existing software based design space exploration methods such as [HRA+12] and [HKH+13] are hindered by two major limiting factors. First, extracting multiple memory access traces from different points in a cache hierarchy imposes significant costs, time-wise as well as storage-wise (several days and several terabytes). Second, software simulation of cache design spaces using the extracted traces is itself a slow process, taking several hours of design time. Compared to the existing methods, the new hardware-accelerated simulation framework aims to achieve two key goals:

• Eliminate the need to extract multiple memory access traces by obtaining the memory access traces in real-time from different points of an MPSoC memory subsystem, exactly as experienced by individual caches;

• Provide much faster simulation times by simultaneously processing the observed memory access traces using specialized hardware in real-time, parallel to the MPSoC execution.

The rest of this chapter is organized as follows: Section 4.1 identifies the target MPSoC architecture for the proposed design space exploration method; the methodology is detailed in Section 4.2; hardware implementation details are presented in Section 4.3; Sections 4.4 and 4.5 present demonstrations of the proposed hardware based method, targeting a two-level cache hierarchy with four private L1 data caches and a shared L2 data cache for a quad-core system executing non-communicating applications.


4.1 Target Multiprocessor System Architecture

This chapter focuses on shared memory multiprocessor architectures with multi-level cache hierarchies containing private and shared caches, such as the example depicted in Figure 4.1. The memory hierarchy itself can be illustrated as in Figure 4.2. The system can contain a set of processors (P), sharing the same main memory. Accesses to the memory go through a hierarchy of caches linked by a shared interconnect.

The system may contain N many levels of caches (Li: 1 ≤ i ≤ N). Each level Li can contain Mi many caches (Cij: 1 ≤ j ≤ Mi). Each such cache Cij is associated with a sub-design-space containing up to Dij different cache configurations (Kijk: 1 ≤ k ≤ Dij).


Figure 4.1: Shared memory multiprocessor architecture with private L1 caches and a shared L2 cache.

The system used for demonstrations contains four processor cores with four private L1 caches and one shared L2 cache. It is assumed that the memory accesses produced by the processors are blocking, and that the caches used in the system do not implement advanced techniques such as block pre-fetching, similar to the prior works [SA95, JIS06, ZGR11, HRA+12]. These assumptions allow deterministic simulation of cache hits and misses. It should be noted that this work considers a cache hierarchy where no coherency control is performed.


Figure 4.2: Multiprocessor memory hierarchy with four processors (P1 to P4), four L1 caches (C1,1 to C1,4) and one L2 cache (C2,1).

4.2 Design Space Exploration Methodology

This section presents the proposed methodology to explore the design space of a multiprocessor cache hierarchy, highlighting the use of hardware modules to accelerate the simulation process. Details of the exploration process and the hardware-software collaboration paradigm will be established in Subsections 4.2.1 and 4.2.2. Thereafter, Section 4.3 will comprehensively describe the implementation of the simulation hardware.


4.2.1 Hybrid Simulation Framework

Many modern design optimization methodologies tend to use Hybrid Simulation where the repetitive and time consuming portions in the design process are accelerated using assisting hardware components [CPN+09]. The hybrid simulation framework presented in this work utilizes an FPGA device connected to a host computer, as illustrated in Figure 4.3.


Figure 4.3: Hybrid simulation platform where cache hit rates are calculated on FPGA.

The target MPSoC exists in the FPGA device, along with hardware cache simulation modules (hSim). Input data for the application programs being executed on the MPSoC are provided by the host computer, through a USB (Universal Serial Bus) connection (other fast connections such as PCI Express could alternatively be used here). The responsibilities of an hSim module are: to unobtrusively observe the memory accesses generated by the MPSoC in real-time; to decide whether the observed accesses will be cache hits or misses in all cache configurations in the design space; and to calculate the hit rates for all cache configurations in the design space, in parallel to the execution of the application programs. The resulting cache hit rates are sent to the host computer where analytical models are applied to estimate the timing and energy measures for different cache configurations. Further details about the hSim modules and how the memory access extraction is done in real-time are presented in Section 4.3.

Initially, the MPSoC in the FPGA device does not contain any actual caches. The cache hierarchy is explored in N different stages, starting from the L1 caches and moving down the hierarchy until the last cache level LN. In an arbitrary i-th stage of exploration, the individual sub-design-spaces of all Mi caches in level Li of the hierarchy are explored in parallel (simultaneously) to calculate cache hit rates. After the results (the hit rates for all the configurations in the design space, for each cache in level i) are sent to the host computer, timing and energy values are estimated for all configurations and a selection decision is made based on minimum energy or maximum performance as per the design requirement. Afterwards, actual caches with the selected configurations are inserted into level Li in the cache hierarchy and the system is re-synthesized before exploration moves on to the next stage (level Li+1).

4.2.2 Selection of Cache Configurations

The hit rates for different cache configurations that are provided by the hSim modules are used for the calculation of Average Cache Access Time (T) and Average Access Energy Consumption (E). Values of T and E are normalized per single access to a cache. These two measures are used to select a suitable cache configuration, depending on the design requirement. The model for T is described in Equation (4.1) and provides a time estimate for accessing a cache with a given configuration.


T = HL + (1 − HR) × ML (4.1)

E = HE + (1 − HR) × ME (4.2)

The term HL (Hit Latency) represents the time required to make a single access to the cache (i.e. time to determine whether the access was a hit or a miss and load data to the bus), which encompasses the effects of design space parameters of the cache’s configuration such as block size, set size and associativity. Hit rate for the cache configuration is given by HR and the time penalty for a cache miss is given by ML (Miss Latency). Similarly, Equation (4.2) provides an average energy measure for accessing a cache with a given configuration. Energy required to make a single cache access (i.e. determine hit/miss status and load data to the bus) is given by HE (Hit Energy), representing the effects of the configuration itself, and the energy penalty in the event of a cache miss is given by ME (Miss Energy).

For all the configurations in the design space, the values of HL and HE are obtained by using the detailed cache analysis tool CACTI 6.5 [MBJ07] by Muralimanohar et al. It should be noted that any suitable analytical model may be used for this purpose depending on the designer’s requirement, and the performance and accuracy of the model used is beyond the scope of this work. Since HR is provided by the simulation done in hardware, the only unknown terms are ML and ME.


ML = Tf + UL (4.3)

ME = Ef + UE (4.4)

Equations (4.3) and (4.4) describe the time and energy penalties incurred in a cache miss. Terms Tf and Ef respectively represent the time and energy spent on fetching the missing cache block from the next level in the memory subsystem, while UL (Update Latency) and UE (Update Energy) respectively denote the time and energy to write the fetched block to the cache. For the caches in the i-th level of the hierarchy (Li), miss penalties depend on the properties of the next cache level (Li+1). Penalties for the misses occurring at the last cache level (LN) depend on the properties of the main memory (usually DRAM - Dynamic Random Access Memory). Values for the parameters in Equations (4.3) and (4.4) are obtained through the CACTI 6.5 tool, which takes the contention when accessing a shared memory device into account. Since this work is aimed at speeding up the design space exploration process, Equations 4.1, 4.2, 4.3 and 4.4 are formulated to closely match the model used in [HRA+12] so that a reasonable comparison between the methods can be provided.
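To make the combination of Equations (4.1)-(4.4) concrete, the following C sketch evaluates T and E for a single cache configuration. It is only a minimal model of the analytical step performed on the host computer; the structure fields, function names and numeric values are hypothetical illustrations and are not taken from the framework's actual implementation.

#include <stdio.h>

/* Hypothetical per-configuration parameters: HL, HE, UL and UE would come
 * from an analytical model such as CACTI, and HR from the hSim counters. */
typedef struct {
    double hl;   /* hit latency (ns)              */
    double he;   /* hit energy (pJ)               */
    double ul;   /* update latency on a miss (ns) */
    double ue;   /* update energy on a miss (pJ)  */
    double hr;   /* hit rate, between 0.0 and 1.0 */
} cache_cfg;

/* Equations (4.1) and (4.3): T = HL + (1 - HR) * ML, with ML = Tf + UL,
 * where tf is the average access time of the next level (or DRAM). */
double avg_access_time(const cache_cfg *c, double tf)
{
    double ml = tf + c->ul;
    return c->hl + (1.0 - c->hr) * ml;
}

/* Equations (4.2) and (4.4): E = HE + (1 - HR) * ME, with ME = Ef + UE,
 * where ef is the average access energy of the next level (or DRAM). */
double avg_access_energy(const cache_cfg *c, double ef)
{
    double me = ef + c->ue;
    return c->he + (1.0 - c->hr) * me;
}

int main(void)
{
    /* Illustrative numbers only; they are not values from this thesis. */
    cache_cfg k = { 1.2, 5.0, 0.8, 3.0, 0.992 };
    printf("T = %.3f ns, E = %.3f pJ\n",
           avg_access_time(&k, 20.0),     /* 20 ns next-level access */
           avg_access_energy(&k, 50.0));  /* 50 pJ next-level energy */
    return 0;
}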

The complete flow of the selection process is described in Algorithm 1, which explores the cache hierarchy in stages. In a single stage of exploration (lines 1-16) the algorithm finds the suitable configuration for all Mi caches in the level Li. Parallel exploration of the sub-design-spaces of all caches in a given level, to calculate cache hit rates for different configurations, is given in lines 2 and 3. This step is carried out on the FPGA device with the aid of hSim modules, simultaneous to the execution of


Algorithm 1: Configuring N-level Cache Hierarchy with Mi caches at any level Li

1 for each cache level Li where i:=1 to N do

2 for each individual cache Cij in level Li where j:=1 to Mi do

3 ∀ configurations Kijk, calculate hit rates (HR) using real-time extracted memory access traces. (Done in parallel on the FPGA)

4 for each individual cache Cj in level Li where j:=1 to Mi do

5 for each configuration Kijk where k:=1 to Dij do

6 if i < N then

7 Estimate ML and ME for Kijk, using UL and UE values of Li+1

8 else if i = N then

9 Estimate ML and ME for Kijk, using UL and UE values of DRAM

10 Estimate T and E for configuration Kijk

// Find configuration Kijk−selected for the cache Cj

11 if best performance then

12 Select configuration Kijk with minimum T

13 else if least energy consumption then

14 Select configuration Kijk with minimum E

15 Include a cache Cj with configuration Kijk−selected, into the MPSoC

16 Re-synthesize the MPSoC on the FPGA

application programs on the MPSoC. Lines 5-10 describe the estimation of energy and performance measures (T and E) using analytical models. When assessing the miss penalties, the access time and energy of the next cache level is considered. In the case of the last level cache, the access time and energy of the main memory is used. Making the selection of a cache configuration is described in lines 11 through 14. Line 15 represents the event of configuring a cache using the selected configuration (Kijk−selected) in the actual MPSoC on the FPGA. Once all the caches in the level Li are configured in this manner, the system is re-synthesized and the exploration process moves to the level Li+1.
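The per-cache selection in lines 11-14 of Algorithm 1 then reduces to an argmin over the cache's sub-design-space once T and E have been evaluated. A hedged sketch of that step is shown below; it reuses the hypothetical cache_cfg structure and helper functions assumed in the earlier sketch, and the design-goal flag is an illustrative stand-in for the designer's requirement.

/* Pick the best configuration index for one cache in level Li.
 * 'cfgs' holds the Dij candidate configurations with hit rates already
 * filled in from the hSim results; 'tf'/'ef' describe the next level
 * (a lower cache level or DRAM). 'minimize_energy' chooses between the
 * two design goals of lines 11-14 in Algorithm 1. Names are illustrative. */
int select_configuration(const cache_cfg *cfgs, int d_ij,
                         double tf, double ef, int minimize_energy)
{
    int best = 0;
    double best_metric = 1e300;
    for (int k = 0; k < d_ij; k++) {
        double metric = minimize_energy
                            ? avg_access_energy(&cfgs[k], ef)
                            : avg_access_time(&cfgs[k], tf);
        if (metric < best_metric) {
            best_metric = metric;
            best = k;
        }
    }
    return best;  /* index of the configuration Kijk-selected */
}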

Figure 4.4 further illustrates the flow of the exploration process for a 2-level cache hierarchy with four private L1 caches and one shared L2 cache (as in Figure 4.2). The process starts at the top-left corner of Figure 4.4. The top half of the figure depicts the first stage of exploration where the four L1 caches are configured, while the bottom half shows the second stage where the shared L2 cache is configured.


Figure 4.4: Graphical overview of the simulation methodology flow, described in Algorithm 1, used to explore the design space of a two-level multiprocessor cache hierarchy and determine suitable configurations.


4.3 Implementation of hSim

Exploring a multi-level cache hierarchy in an MPSoC requires memory accesses to be extracted from different points in the memory subsystem in order to carry out the simulation of cache configurations. Therefore, the hardware cache simulator module (hSim) in this chapter was specifically designed such that it can be connected to different positions in the memory subsystem of the MPSoC operating on the FPGA device. The module was designed using VHDL (VHSIC Hardware Description Language), and based around the simulation circuitry developed by Schneider et al. [SPP14a], which assumes the LRU (Least Recently Used) replacement policy.

The hSim module is designed to be connected in the place of a cache memory, to simulate many different configurations for that particular cache in the hierarchy. The memory accesses, which come from the processor (or from the previous cache(s) in the hierarchy), are passed through to the next level cache (or to the main memory). When hSim is connected in the place of an L1 cache, it receives the memory access addresses from the corresponding processor. When the module is connected in the place of a last level cache, it passes the access addresses through to DRAM.

Figure 4.5: Connection interfaces and operation overview of the hardware simulator (hSim) module.

Figure 4.5 illustrates the connection interfaces and internal operation details of hSim. The module consists of three ports for external connections. The first port connects to the previous cache level of the memory hierarchy (to receive memory addresses) and the second port connects to the next cache level. A third port is used for control signals such as enabling and disabling of the hSim module, reading results, etc. The control signals can be sent from any processor in the MPSoC. Address and data widths for the ports are parameterized and hence customizable upon requirement. Using these three ports, a number of hSim modules can be flexibly connected to different points in the memory hierarchy.

A copy of every memory access that passes through the hSim module is unobtrusively obtained and fed into the simulator core. This chapter uses the hardware cache simulator circuit introduced by Schneider et al. in [SPP14a], which is illustrated in Figure 4.6. The simulator core circuitry is designed as a pipeline, with each stage simulating cache configurations with a fixed set size and varying associativity. Figure 4.6(a) depicts an example top level (1st pipeline stage) with a simulated set size of eight and associativities ranging from four to one. The set size of the top level is the maximum set size simulated. Only the top level of the simulator contains a tag array, which is used to compare against incoming address tokens. [SPP14a] uses the associativity inclusion property (a given cache configuration is always a subset of another cache configuration with the same set size and block size, but higher associativity) to avoid having to use multiple tag arrays at the top level. A set of registers is used as hit counters corresponding to the simulated associativity levels.



Figure 4.6: Internal implementation of the cache simulator core: (a) example top level of the simulator core, with a maximum set size of eight and maximum associativity of four; (b) complete pipeline inside a simulator core for set sizes 8-to-2 and associativities 4-to-1.


Figure 4.6(b) shows a complete pipeline inside a full simulator core, for set sizes eight to two and associativity levels four to one. [SPP14a] uses the set size inclusion property (a given cache configuration is always a subset of another cache configuration with the same block size and associativity, but higher set size) to avoid having to store address tags in pipeline stages other than the top level. Instead, hit information is passed from one stage to the next through pipeline registers. The simulated cache block size is parameterized and can be set through the control port. Readers can refer to [SPP14a] for further details of the simulator core.
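As a software analogue of the associativity inclusion property used by the top level, the following C sketch keeps one LRU stack per set: the depth at which an address is found is the smallest associativity that would still have produced a hit, so one lookup updates the counters of every simulated associativity at once. The parameters (16-byte blocks, eight sets, maximum associativity four) are assumptions chosen to mirror the example in Figure 4.6(a); the real simulator core is a pipelined VHDL circuit and additionally forwards hit information to the smaller set sizes, which this sketch does not model.

#include <stdint.h>
#include <string.h>

#define MAX_ASSOC  4   /* maximum simulated associativity            */
#define SETS       8   /* set size of the (top-level) pipeline stage */
#define BLOCK_BITS 4   /* 16-byte cache blocks                       */

/* One LRU stack per set, holding the MAX_ASSOC most recently used tags. */
static uint32_t lru[SETS][MAX_ASSOC];
static int      filled[SETS];
static uint64_t hits[MAX_ASSOC + 1];   /* hits[a] = hits of an a-way cache */

/* Process one memory access address. A hit found at LRU depth d (0-based)
 * is a hit in every simulated cache with associativity greater than d:
 * this is the associativity inclusion property exploited in [SPP14a]. */
void simulate_access(uint32_t addr)
{
    uint32_t blk = addr >> BLOCK_BITS;
    uint32_t set = blk % SETS;
    uint32_t tag = blk / SETS;

    int depth = -1;
    for (int d = 0; d < filled[set]; d++)
        if (lru[set][d] == tag) { depth = d; break; }

    if (depth >= 0)                                  /* hit at depth d */
        for (int a = depth + 1; a <= MAX_ASSOC; a++)
            hits[a]++;

    /* LRU update: move (or insert) the tag at the most-recently-used slot. */
    int shift = (depth >= 0) ? depth
                             : (filled[set] < MAX_ASSOC ? filled[set]++
                                                        : MAX_ASSOC - 1);
    memmove(&lru[set][1], &lru[set][0], shift * sizeof(uint32_t));
    lru[set][0] = tag;
}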

There are two different clock signal inputs associated with the hSim module: the main system clock for interface and control operations (typically the same clock that the MPSoC operates on); and a separate clock for the simulator core. A clock crossing buffer is used to feed the observed memory accesses into the simulator core. The pipelined design of the cache simulator core allows it to operate at a relatively high frequency (up to 200MHz in tests). Due to the sparse nature of memory accesses in application programs, hSim can be integrated into a system working at a much higher frequency than the simulator core itself, without the buffer overflowing.

Reading the counter values at the end of the simulation is handled through the control port. A connected CPU can send a read request to the control port, specifying the required counter. Relevant firmware functions for control and communication with hSim have been developed in the C language.
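The exact interface of those firmware functions is not reproduced here, so the following read-out routine is only a hypothetical sketch of what a memory-mapped control-port driver for hSim could look like. The base address, register offsets and command encoding are invented for illustration and do not describe the real register map.

#include <stdint.h>

/* Hypothetical hSim control-port register map (illustrative only). */
#define HSIM_BASE        0x80001000u
#define HSIM             ((volatile uint32_t *)HSIM_BASE)
#define HSIM_REG_CTRL    0u   /* enable / disable the simulator core */
#define HSIM_REG_SELECT  1u   /* index of the counter to read        */
#define HSIM_REG_DATA    2u   /* value of the selected counter       */

#define HSIM_CTRL_ENABLE  0x1u
#define HSIM_CTRL_DISABLE 0x0u

void hsim_start(void) { HSIM[HSIM_REG_CTRL] = HSIM_CTRL_ENABLE;  }
void hsim_stop(void)  { HSIM[HSIM_REG_CTRL] = HSIM_CTRL_DISABLE; }

/* Read one hit (or total access) counter after the simulation has ended. */
uint32_t hsim_read_counter(uint32_t counter_id)
{
    HSIM[HSIM_REG_SELECT] = counter_id;  /* specify the required counter */
    return HSIM[HSIM_REG_DATA];          /* fetch its current value      */
}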

The schematic symbol of the hSim module as implemented in the Altera Qsys system integration tool [Altb] (compatible with the Altera Avalon bus architecture) is shown in Figure 4.7. The ports on hSim can be easily adapted to other interconnect architectures using bridging components.


Figure 4.7: Detailed schematic symbol showing all signals for the hSim module as implemented in the Altera Qsys system integration tool [Altb]. Widths of the address and data signals are configurable.

Multiple instances of the hSim module can be connected to the MPSoC memory hierarchy. Figure 4.8(a) shows four hSim modules being connected in place of the L1 caches, simulating the corresponding sub-design-spaces (as per the first stage in Algorithm 1). Memory accesses pass through the hSim into the shared L2 cache, while hSim uses a copy of each access address for the simulation.

Accesses coming from multiple different sources can be connected in combination to the input port of the hSim module, as shown in Figure 4.8(b), which enables hSim to simulate sub-design-spaces of shared caches taking realistic bus contention into account. All L1 cache misses generate accesses to main memory, which pass through the hSim module connected in place of the shared L2 cache.

Figure 4.8: Connection and usage of hSim in an MPSoC on FPGA: (a) multiple hSim modules connected in the positions of private L1 caches; (b) a single hSim module connected in place of a shared L2 cache, to simulate the corresponding sub-design-spaces.


4.4 Experimental Setup

The target system of Figure 4.1 was used for the experiments presented in this chapter, to demonstrate the process of cache design space exploration. Since a hardware cache coherence mechanism is not assumed to be present in the final system, and hence cache misses occurring due to maintaining coherency are not counted, separate (non-communicating) application programs were executed on the four processors. Even though data are not shared between the programs, the sharing of the L2 cache affects the hit rates of different cache configurations for L2.

The MPSoC was built using the Altera Qsys system integration tool [Altb], with four Nios II/f embedded processor cores [Nio] operating at 200MHz, and was deployed in a Stratix V GX FPGA device on an Altera DE5-NET board [Alta]. One gigabyte of DDR3 SDRAM (Double Data Rate type 3 Synchronous Dynamic Random Access Memory) operating at 800MHz on the DE5-NET board was used as the main memory for the MPSoC.

We used six benchmark application programs, from the SPEC2006 benchmark suite (bzip2 compression, bzip2 de-compression) and the MiBench suite (lame mp3-encoding, lame mp3-decoding, rijndael aes-encryption, jpeg), to create two groups of four applications each. Table 4.1 shows the two groups of applications. Column one gives the processor core number. Columns two, three and four respectively contain the application program, the size of the data input used and the number of memory accesses generated by the application, in Experiment 1. Columns five, six and seven respectively contain the application program, the size of the data input used and the number of memory accesses generated by the application, in Experiment 2.


Table 4.1: Applications used in the Experiments

Core | Experiment 1: Application, Input size (KB), Memory Accesses | Experiment 2: Application, Input size (KB), Memory Accesses
0    | rijndael aes, 17, 99,299,728      | lame encode, 3, 259,070,294
1    | bzip2 decompress, 18, 66,365,689  | bzip2 compress, 131, 143,691,935
2    | jpeg, 769, 53,597,119             | rijndael aes, 41, 238,329,483
3    | lame decode, 25, 70,702,236       | jpeg, 769, 53,597,119

Table 4.2: Simulated Configurations for Private L1 Caches and Shared L2 Cache

                      Private L1 Cache sub-design-spaces           Shared L2 Cache sub-design-space
                      (27 configurations for each L1 cache)        (180 configurations for the shared L2 cache)
                      Block Sizes (Bytes) | Set Sizes | Assoc.     Block Sizes (Bytes) | Set Sizes | Assoc.
                      4, 16, 32           | 1 - 256   | 1          32 - 256            | 1 - 256   | 1 - 16

Four instances of the hSim module (with the simulator core of each operating at 100MHz) were initially connected to the cache-less MPSoC in order to simulate the cache hits in the L1 cache sub-design-spaces for the four processors in parallel. Each of the L1 hSim modules was parameterized to simulate 27 different cache configurations as described in Table 4.2 (27 configurations were used since the Nios II data cache module is direct mapped only). The first three columns in Table 4.2 respectively present the block sizes, set sizes and associativities considered in the L1 sub-design-spaces. Energy and performance measures were calculated as described in Section 4.2 using the cache hit rates obtained from the hSim modules, and L1 caches were put into the MPSoC based on the selected configurations (minimum E for Experiment 1 and minimum T for Experiment 2 were considered). Thereafter, a single hSim module was connected to the MPSoC to simulate the cache hits in the shared L2 cache sub-design-space. The hSim module for L2 was parameterized to simulate 180 different cache configurations as described in Table 4.2. The last three columns in Table 4.2 respectively present the block sizes, set sizes and associativities considered in the L2 sub-design-space.
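The configuration counts in Table 4.2 can be reconstructed by assuming power-of-two steps for the set sizes and associativities (the table gives only the ranges): each L1 sub-design-space then contains 3 block sizes × 9 set sizes (1, 2, ..., 256) × 1 associativity = 27 configurations, and the L2 sub-design-space contains 4 block sizes (32, 64, 128, 256) × 9 set sizes × 5 associativities (1, 2, 4, 8, 16) = 180 configurations.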

4.5 Test Results

The results obtained from the simulations are presented in Figures 4.9, 4.10 and 4.11, with the Average Cache Access Energy (Ecache) on the vertical axis plotted against the Average Cache Access Time (Tcache) on the horizontal axis. Each plot displays a subset of configurations in the respective sub-design-space, with low energy and low access time values. A cross (+) represents a unique cache configuration in the subset. The configuration giving the least energy in the sub-design-space is marked with a red triangle (▲), whereas the configuration giving the fastest access time is marked with a green circle (•). The largest cache configuration in the sub-design-space is marked with a yellow square (■).

Figures 4.9 and 4.10 report the results for all private L1 caches in Experiments 1 and 2 respectively. A designer can decide which configuration to select depending on the requirement and design constraints. For example, for the jpeg application, the configuration with block size = 32 bytes and set size = 256 (8KB cache) gives the fastest access time with a 99.2% hit rate, while the configuration with block size = 16 bytes and set size = 32 (512B cache) gives the lowest energy consumption, with only a 90.3% hit rate. A couple of observations are worth noting from the plots in Figures 4.9 and 4.10: that the two instances of the rijndael aes application, using two different sized input data, achieve minimum E from two different cache configurations; and that the largest L1 cache configuration provides neither the fastest access time nor the least energy consumption (except in the case of jpeg).

Figure 4.9: Energy Consumption against Access Time for private L1 cache configurations, in Experiment 1 (one panel per application: rijndael_aes, bzip2_decompress, jpeg, lame_decode; axes: Average Cache Access Energy (pJ) against Average Cache Access Time (ns)).

Figure 4.10: Energy Consumption against Access Time for private L1 cache configurations, in Experiment 2 (one panel per application: lame_encode, bzip2_compress, rijndael_aes, jpeg; axes: Average Cache Access Energy (pJ) against Average Cache Access Time (ns)).

When choosing suitable L1 cache configurations in the two experiments, configurations displaying minimum energy were selected for the applications in Experiment 1, and configurations displaying minimum access times were selected for the applications in Experiment 2. Details of the selected L1 configurations are shown in Tables 4.3 and 4.4. The first two columns state the processor core number and the application program respectively. Columns three to seven describe the properties of the L1 configuration with minimum energy (block size, set size, associativity, resultant cache size and achieved hit rate respectively). Columns eight to twelve describe the properties of the L1 configuration with minimum access time (block size, set size, associativity, resultant cache size and achieved hit rate respectively).


Table 4.3: L1 Cache Configurations with Minimum E and T from Experiment 1

Core | Application | L1 Config. with minimum E: Blk Size, Set Size, Assoc, Cache Size, Hit Rate | L1 Config. with minimum T: Blk Size, Set Size, Assoc, Cache Size, Hit Rate

0 rijndael aes 32B 128 1 4KB 99.5% 32B 128 1 4KB 99.5%

1 bzip2 decompress 16B 32 1 512B 93.9% 32B 32 1 1KB 95.6%

2 jpeg 16B 32 1 512B 90.3% 32B 256 1 8KB 99.2%

3 lame decode 32B 16 1 512B 96.6% 32B 16 1 512B 96.6%

Table 4.4: L1 Cache Configurations with Minimum E and T from Experiment 2

Core | Application | L1 Config. with minimum E: Blk Size, Set Size, Assoc, Cache Size, Hit Rate | L1 Config. with minimum T: Blk Size, Set Size, Assoc, Cache Size, Hit Rate

0 lame encode 32B 16 1 512B 98.3% 32B 16 1 512B 98.3%

1 bzip2 compress 16B 32 1 512B 92.7% 32B 64 1 2KB 96.9%

2 rijndael aes 32B 4 1 128B 86.8% 32B 128 1 4KB 99.4%

3 jpeg 16B 32 1 512B 90.3% 32B 256 1 8KB 99.2%

Based on the selected configurations for the private L1 caches in the MPSoC, Figure 4.11 reports the results obtained for the shared L2 cache. Details of the L2 cache configurations with minimum energy and minimum access times are shown in Table 4.5. The first column states the experiment number. Columns two to six describe the properties of the L2 configuration with minimum energy (block size, set size, associativity, resultant cache size and achieved hit rate respectively). Columns seven to eleven describe the properties of the L2 configuration with minimum access time (block size, set size, associativity, resultant cache size and achieved hit rate respectively).

Figure 4.11: Energy Consumption against Access Time for shared L2 cache configurations, in Experiments 1 and 2 (one panel per experiment; axes: Average Cache Access Energy (pJ) against Average Cache Access Time (ns)), based on the selected L1 configurations.

Table 4.5: L2 Cache Configurations with Minimum T and E from Experiments 1 and 2

Experiment | L2 Config. with minimum E: Block Size, Set Size, Assoc, Cache Size, Hit Rate | L2 Config. with minimum T: Block Size, Set Size, Assoc, Cache Size, Hit Rate
1          | 256B, 4, 16, 16KB, 99.5%                                                     | 256B, 4, 16, 16KB, 99.5%
2          | 128B, 8, 16, 16KB, 99.8%                                                     | 128B, 1, 16, 2KB, 98.9%

The MPSoC in Experiment 1 observed both minimum energy and minimum access time using a 16KB L2 cache with a 99.5% hit rate. The MPSoC in Experiment 2 obtained minimum energy using a 16KB cache with a 99.8% hit rate, while minimum access time was achieved by a 2KB cache with a 98.9% hit rate. In either experiment, the largest L2 cache configuration in the sub-design-space is well outside the plotted region in Figure 4.11, further underlining its unsuitability.


Table 4.6: Total Estimated Memory Access Energy and Time for Applications

Core | Experiment 1: Application, Input size (KB), Estimated Energy (µJ) | Experiment 2: Application, Input size (KB), Estimated Time (s)
0    | rijndael aes, 17, 248.35        | lame encode, 3, 57.32
1    | bzip2 decompress, 18, 182.84    | bzip2 compress, 131, 36.85
2    | jpeg, 769, 144.36               | rijndael aes, 41, 64.56
3    | lame decode, 25, 182.07         | jpeg, 769, 14.19

The total energy and time spent on memory accesses by the applications in both experiments are estimated in Table 4.6, based on the total memory access counts from Table 4.1 and the average cache access energy/time of the selected cache configurations. Column one gives the processor core number. Columns two, three and four respectively report the application program, the size of the data input used and the total estimated memory access energy for the application, in Experiment 1. Columns five, six and seven respectively contain the application program, the size of the data input used and the total estimated memory access time for the application, in Experiment 2.

Since this work involves counting hit rates using specialized hardware for the cache configuration design space, it is important to assess the time consumed by the hSim modules to produce the results. Simulation times and memory access trace size details from Experiments 1 and 2 are tabulated in Table 4.7. The first column gives the simulation stage. Total simulation time is recorded in the second column. The third and fourth columns respectively present the total number of memory accesses processed by the hSim modules and the time spent per million memory accesses.


Table 4.7: Simulation Times to Calculate Hit Rates in Hardware

Simulation Stage | Total Time (s) | Memory Accesses | Time per Million Accesses (s)

Experiment 1: Four L1 Caches 30.6 289,964,772 0.106

Experiment 1: Shared L2 Cache 6.3 42,360,645 0.149

Experiment 2: Four L1 Caches 84.3 694,688,826 0.121

Experiment 2: Shared L2 Cache 3.2 23,943,761 0.133

In Experiment 1, simulation of 27 configurations for each of the four private L1 caches took 30.6 seconds, with 289.9 million memory accesses processed in total by the four hSim modules. With the selected L1 configurations (with minimum energy estimates) put into the MPSoC, the combined trace of L1 misses was 42.3 million accesses. The hSim module simulating 180 shared L2 cache configurations took 6.3 seconds to process this trace. In Experiment 2, four hSim modules (each simulating 27 L1 configurations) took 84.3 seconds to process a total of 694.7 million memory accesses. After the L1 caches with minimum access time estimates were put into the MPSoC, the hSim module simulating 180 shared L2 cache configurations observed 23.9 million accesses. The L2 simulation in Experiment 2 took 3.2 seconds to calculate the hit rates.

These values represent the time taken purely for simulation, as reported in the prior works, excluding the overheads of re-synthesizing the system on the FPGA. The observed time per million memory accesses taken by the hSim modules ranges from 0.106 seconds to 0.149 seconds. The software based method in [HRA+12] takes 68 seconds per million accesses to explore the design space of a similar multiprocessor cache hierarchy, with a total time of 9468 seconds for 139 million memory accesses. In comparison, our hardware based method took 87.5 seconds and 36.9 seconds for the two experiments demonstrated here, even when processing much larger memory access traces. Therefore the hardware based calculation of cache hit rates using hSim modules in the experiments is up to 456 times faster than software based simulation, owing to the parallel simulation in hardware.
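Expressed as a ratio of the per-million-access times reported above, the speedup follows directly from the figures: 68 seconds per million accesses for the software simulator against 0.149 seconds per million accesses for the slowest hSim stage gives 68 / 0.149 ≈ 456, and against the fastest stage (0.106 seconds per million accesses) the ratio is approximately 642.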

The time overhead for synthesizing the hSim for the FPGA was in the range of one to two hours (depending on the size of the simulated design space and the capacity of the FPGA device used). Partial re-synthesis could be used to avoid having to re-synthesize the hSim and to only re-synthesize the MPSoC when caches are included in the system. This overhead is still significantly smaller than that of extracting a memory access trace: as demonstrated in Chapter 1, extracting a memory access trace from a mere one-second execution of the MPEG2 encoder took 72 hours.

Additionally, simulation using hSim uses the real memory access traces observed by the caches in a multi-level hierarchy, as opposed to software simulation where L1 memory access traces are derived from the extracted L2 trace (due to high extraction times). It should also be noted that simulation accuracy depends on the trace size, and simulation times are directly proportional to the number of memory accesses processed, either in hardware or software. Hence the several orders of magnitude faster simulation in hardware proves valuable to achieve better accuracy. In hardware based simulation, increasing the number of cache configurations to explore requires more FPGA logic elements, rather than increasing the simulation time.


4.6 Summary

A hardware based design space exploration methodology was presented in this chapter, to rapidly determine the cache configurations for a multi-level cache hierarchy in an application specific MPSoC. The proposed method significantly reduces the simulation time taken during the MPSoC design stages by rapidly calculating the cache hits for all configurations using specialized hardware. The hSim modules designed for this purpose can be flexibly connected to multiple different points in an MPSoC cache hierarchy on an FPGA device, to extract and analyse the memory access information at those points in real-time. With such fast and flexible simulation made possible, designers can explore the cache hierarchy design space with ease and gain better insight into application-dependent cache behavior. An additional benefit offered by using the proposed hSim module is that the actual memory accesses generated by the processors can be observed in real-time.

Chapter 5

Iterative Design Space Exploration of Multi-Level Caches

Extending the fast hardware simulator components designed in Chapter 4, this chapter presents an improved design space exploration algorithm to optimize multi-level cache hierarchies for application specific multiprocessors. While the hardware acceleration in Chapter 4 achieves much faster simulation times compared to previous works [HRA+12, HKH+13], limitations on optimality from the exploration algorithm are still present. The work presented in this chapter describes an iterative algorithm which selectively explores an extended portion of the design space to find an optimal (or near-optimal) solution.

Figure 5.1: Effects of changing a cache's configuration on the explorations in adjacent cache levels (Effect 1: on the access trace seen by the level below; Effect 2: on the miss penalty seen by the level above).

Due to the interlinked relationships between connected cache levels, individually optimizing each cache in the hierarchy once does not guarantee that an optimal or near-optimal cache hierarchy is found. The selected configuration of a particular cache affects the exploration of the lower and upper level connected caches, as shown in Figure 5.1. When a configuration is selected for the L2 cache C2,1, the change affects the design space explorations in:

• the connected upper level (L1) caches C1,1 and C1,2 through the miss penalty;

• the connected lower level (L3) cache C3,1 through the access trace.

In turn, a cache configuration selection for either the L1 or L3 caches affects the connected L2 cache. This cycle of effects cannot be wholly captured by performing a single optimization pass over the cache hierarchy, in either direction. The improved algorithm presented in this chapter takes multiple iterations over a generic cache hierarchy containing inclusive non-coherent caches, with the goal of minimizing the overall time spent on memory accesses by the system. Each iteration of the algorithm captures the effects of any change in a selected cache configuration and propagates those effects to all relevant connected caches in adjacent levels through a carefully crafted sequence of simulation steps. Demonstrations are provided to show that the iterative algorithm converges after a few iterations (in other words, after exploration of only a fraction of the design space) on the final design point for the cache hierarchy.

The rest of this chapter is organized as follows: Section 5.1 identifies the target MPSoC architecture and generic memory hierarchy for the proposed design space exploration method; a comprehensive formulation of the design space exploration problem is provided in Section 5.2; the iterative DSE methodology is detailed in Section 5.3; Sections 5.4 and 5.5 present demonstrations of the proposed iterative algorithm, using two- and three-level cache hierarchies, and illustrate the algorithm's convergence, stability and optimality through extensive testing.


5.1 Target Multiprocessor Multi-Level Cache Hierarchy

This chapter concerns an application specific MPSoC architecture that consists of P processor cores and an inclusive cache hierarchy with N levels. Each cache in the hierarchy can either be private or shared. An instance of the memory hierarchy is depicted in Figure 5.2, where P = 4 and N = 3 (processors such as ARM Cortex [ARM] and Tensilica Xtensa [XTE] have been used in similar commercial architectures).


Figure 5.2: An example architecture for the target MPSoC memory hierarchy. P1 to P4 represent processors and C1,1 to C3,1 represent caches organized in three levels.

Similar to prior works on trace-driven simulation [JIS06, SA95, ZGR11, HRA+12], assumptions are made that in-order processors are used producing blocking memory accesses, and that the caches used in the system do not implement advanced techniques such as block pre-fetching. These assumptions allow deterministic simulation of cache hits and misses. Lower level caches are assumed to be inclusive, where the cached data form a superset for all connected upper levels.


5.2 Problem Formulation

Given an MPSoC such as the one described in Section 5.1, with:

• a known set of target application programs;

• a known set of processors Pe (1 ≤ e ≤ P);

• an established cache hierarchy, with:

– a cache in the hierarchy denoted by Ci,j and a level denoted by Li, where i is the level number and j is the cache number in the particular level;

– a fixed number of levels (N) and therefore i ≤ N;

– a fixed number of caches per level (Mi for level Li) and therefore, for level Li, j ≤ Mi; and

– predetermined one-to-one or many-to-one connections between caches in levels Li and Li+1 and therefore Mi ≥ Mi+1.

• a known set of candidate cache configurations in the design space for each cache Ci,j, with:

– a configuration in the set denoted by Ki,j,k, where k is the configuration number;

– a fixed number of configurations (Di,j) and therefore k ≤ Di,j; and

– known hit latency (HLi,j,k) and update latency (ULi,j,k) in case of a miss, for each configuration Ki,j,k;

find the set of cache configurations Ki,j,kmin for all caches Ci,j in the hierarchy which minimizes the average memory access time of the system.


5.3 Iterative DSE Methodology

When exploring an N-level cache hierarchy, there are CTOT caches to be configured, as given by Equation 5.1. Consequently the total number of design points for the given hierarchy (DTOT) can be expressed by Equation 5.2, which can grow exponentially with N, Mi and Di,j.

CTOT = Σ_{i=1}^{N} Mi    (5.1)

DTOT = Π_{i=1}^{N} Π_{j=1}^{Mi} Di,j    (5.2)
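As a concrete instance, for the example hierarchy of Figure 5.2 with four L1 caches, two L2 caches and one L3 cache, Equation 5.1 gives CTOT = 4 + 2 + 1 = 7 caches to be configured.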

As outlined in Chapter 1, CTOT many access traces are required to evaluate a single design point for a cache hierarchy. Evaluating the complete design space will require CTOT × DTOT many access traces, which can be up to several trillions in number. Considering the extreme lengths of time required for trace extraction as well as the magnitude of the design space, one cannot afford to evaluate all the design points one by one (or even a significant portion of the design space) in reasonable time. To tackle the problem, this section:

• presents an algorithm which visits a handful of selected cache hierarchy design points;

• combines hardware-assisted simulation with the proposed algorithm to speed up the process of counting cache hits, in parallel for all caches in a level.


The following subsections provide a comprehensive description of the proposed methodology.

5.3.1 Cache Analysis

The term Ti,j,k is defined in this work as the average cache access time for a cache with configuration Ki,j,k. For a cache, Ti,j,k is the combined average of the time spent on cache hits and the time spent on cache misses. The value of Ti,j,k, which is normalized per single access to the cache, can be described by Equations 5.3 and 5.4.

Ti,j,k = HLi,j,k + [1 − HRi,j,k] × MLi,j,k    (5.3)

MLi,j,k = Ti+1,j′,k′ + ULi,j,k    (5.4)

The term HRi,j,k refers to the hit rate sustained by a cache with configuration Ki,j,k. For the same configuration, the hit latency is given by the term HLi,j,k and the miss latency (penalty) is represented by MLi,j,k. The term ULi,j,k specifies the time to update the cache entry after a miss is detected and the data is fetched from a lower level. Further, Ci+1,j′ refers to the connected cache in the lower level for any particular cache Ci,j (which can be the main memory in the case of i = N). Hence the term Ti+1,j′,k′ represents the average time taken to access the next level cache.

The values of HLi,j,k and ULi,j,k can be obtained from an analytical model of the cache being used (taking the bus contention into account), since they are properties of the particular cache configuration Ki,j,k and are therefore assumed to be provided in the problem formulation. However, HRi,j,k depends on the cache configuration as well as the memory access trace seen by the cache. Hence, HRi,j,k needs to be obtained through a trace-driven simulation for each cache configuration Ki,j,k.

Equations 5.3 and 5.4 point to the fact that evaluating Ti+1,j′,k′ requires a similar evaluation to be performed in the next cache level.

TTOT = Σ_{j=1}^{M1} T1,j,k    (5.5)

For each cache in level L1, T1,j,k represents the average time spent on a memory access by the corresponding processor Pe (1 ≤ e ≤ P). Thus, minimizing the average memory access time can be expressed as minimizing TTOT in Equation 5.5.

5.3.2 Algorithm

In Algorithms 2 and 3, the iterative process is described where each iteration explores every cache in the hierarchy at least once. Rather than visiting the cache levels in a Round-Robin manner (Section 5.5.4 points out that Round-Robin is slower), an iteration is divided into a forward pass (FP) and a backward pass (BP). The cache hierarchy is traversed from the upper-most level to the lowest level in the forward pass, propagating the effects of cache configuration changes to the levels below via updated memory access traces (Effect 1 in Figure 5.1). The backward pass then traverses the hierarchy from bottom to top, propagating the effects of cache configuration changes (made in the FP) to the levels above via updated miss latencies (Effect 2 in Figure 5.1). Initially, there are no caches physically present in the hierarchy. Optimization starts at the first cache level. The optimization process is constructive, in the sense that caches with selected configurations are added to each level in the initial forward pass, and then modified in the subsequent passes to achieve a lower Ti,j,k. Hence, the algorithm iteratively updates the selected cache hierarchy design point, until a stable state is reached.

An implicit upper bound to the size of the selected caches is provided by carefully setting the maximum values for the design space parameters (i.e. maximum block size, set size and associativity). Therefore, the combined chip area of selected caches is bounded, which prevents the cache hierarchy from being unnecessarily large.

The memory access traces observed by the first level of caches are the same in every iteration as per the stated assumptions in Section 5.1. Thus, the trace-driven simulation of the design spaces of L1 caches is performed only once (pre-pass) at the beginning, as shown in lines 1-7 in Algorithm 2, which reduces the overhead of counting cache hits. The cache hits are simultaneously calculated for all the configurations in each L1 cache's sub-design-space (line 1) in parallel. T1,j,k is evaluated for every configuration as given by lines 4-5. The value of T2,j′,k′ is obtained using the access time of Dynamic Random Access Memory (DRAM), since there are no

Algorithm 2: Pre-pass

// Pre-pass: hardware simulation for L1 needs to be done only once

1 ∀ j:= 1 to M1 and ∀ k:= 1 to D1,j: Calculate HR1,j,k

2 for j:=1 to M1 do

3 for k:= 1 to D1,j do

4 Calculate ML1,j,k for K1,j,k, by using the DRAM access time for T2,j′,k′

5 Evaluate T1,j,k

6 Select K1,j,kmin | T1,j,kmin = min(T1,j,∗)
7 Include K1,j,kmin into the MPSoC for C1,j

caches present in the lower levels as yet. Then the configurations with the minimum access times for every cache are selected and included in the hierarchy (lines 6-7). Real-time simulation of cache accesses ensures that the different memory access patterns of applications and the contention for shared caches are reflected in the Ti,j,k values.

Algorithm 3 describes the iterative portion of the optimization process. In the forward pass (FP), the algorithm visits one cache level at a time starting from level L2. A configuration change in any level above would change the access traces received by lower levels necessitating a re-calculation of hits for every configuration in each cache’s design space, as given by lines 4-6.

The hardware simulator components (hSim, described in Chapter 4) are integrated into the MPSoC in line 4 and the system is re-synthesized in line 5 (the use of hardware simulators is illustrated in Section 5.3.5). Then lines 8-13 re-evaluate Ti,j,k for all configurations. The value of Ti+1,j′,k′ is taken from DRAM parameters only if the examined level is the last one, or for all levels during the first iteration. Otherwise, Ti+1,j′,k′ is obtained from the cache level below. The configurations with the minimum access times are then selected and included in the MPSoC (lines 14-15), moving from the previous design point to a new one. The FP ends after optimizing the last cache level (LN).

The backward pass (BP) begins at the penultimate cache level. Any configuration change in the level below would affect the miss penalty of the level currently visited by the BP. Thus, MLi,j,k needs to be updated (line 19), consequently having to update the access times for every cache configuration (line 20). Finally, the fastest set of cache configurations for each cache in the level is selected and included in the system as given by lines 21-22, updating the design point of the cache hierarchy.


Algorithm 3: Iteratively optimizing an N-level cache hierarchy.

1 Iteration number r = 1

2 repeat // Forward Pass (FP):

3 for i:=2 to N do

4 Replace Ci,j with hardware cache simulators ∀ j:= 1 to Mi in the MPSoC
5 Re-synthesize the MPSoC

6 ∀ j:=1 to Mi and ∀ k:= 1 to Di,j: Calculate HRi,j,k

7 for j:=1 to Mi do

8 for k:= 1 to Di,j do
9 if i = N OR r = 1 then

10 Calculate MLi,j,k by substituting the DRAM access time for Ti+1,j′,k′

11 else

12 Calculate MLi,j,k using Ti+1,j′,k′ from the next cache level

13 Evaluate Ti,j,k

14 Select Ki,j,kmin | Ti,j,kmin = min(Ti,j,∗)
15 Include Ki,j,kmin into the MPSoC for Ci,j

// Backward Pass (BP):

16 for i:=N − 1 to 1 do

17 for j:=1 to Mi do

18 for k:= 1 to Di,j do

19 Calculate MLi,j,k using Ti+1,j′,k′ from the next cache level

20 Evaluate Ti,j,k

21 Select Ki,j,kmin | Ti,j,kmin = min(Ti,j,∗)
22 Include Ki,j,kmin into the MPSoC for Ci,j

23 r := r + 1

24 until no change made in any selected configuration Ki,j,kmin in the backward pass;


One FP and one BP conclude a single iteration in the algorithm. If the BP does not incur any cache configuration change, the algorithm terminates and the set of configurations Ki,j,kmin for every cache in the hierarchy is returned.

When the configuration design space of a particular cache is explored at a given stage of algorithm progression, the algorithm fixes the configurations of other caches in the hierarchy at the current selections. The number of design points visited by the algorithm (DALG) can therefore be expressed by Equation 5.6 (R = the number of iterations taken to achieve convergence).

DALG = Σ_{j=1}^{M1} D1,j + R × ( Σ_{i=2}^{N} Σ_{j=1}^{Mi} Di,j + Σ_{i=N−1}^{1} Σ_{j=1}^{Mi} Di,j )    (5.6)

Note that DALG grows linearly, in contrast to DTOT in Equation 5.2 which grows exponentially with respect to N, Mi and Di,j. Hence, the number of design points visited by Algorithm 3 is a mere fraction of the total space when R is small, which is the case as observed in Section 5.5. As an example, if there are only ten candidate cache configurations per cache for the seven-cache hierarchy of Figure 5.1, there will be a total of 10^7 design points, and Algorithm 3 visits 130 points out of those.
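As a check of that count: with M1 = 4, M2 = 2, M3 = 1 and ten configurations per cache, Equation 5.6 gives 4 × 10 = 40 points for the pre-pass, (2 + 1) × 10 = 30 points for a forward pass and (2 + 4) × 10 = 60 points for a backward pass, so DALG = 40 + R × 90; with R = 1 this is exactly the 130 visited design points quoted above.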


5.3.3 Convergence Criteria

Consider the following two observations on the exploration process given in Algorithm 3.

1. In an FP, a cache's selected configuration may be changed due to an updated memory access trace seen by the said cache, when one or more upper level cache configuration changes have occurred. Even if there were no configuration changes made in a complete FP, the values of Ti,j,k may still have changed in one or more caches. A subsequent BP may propagate the changes in Ti,j,k values as updated miss penalties to upper levels, therefore potentially causing configuration changes.

2. In a BP, a cache's selected configuration may be altered due to the change in the miss latency/penalty seen by the said cache, when a lower level cache configuration change has occurred. Even if there were no configuration changes made in a complete BP, the values of Ti,j,k may still have changed in one or more caches. However, having no configuration changes made implies that a subsequent FP will see the same memory access traces as before and consequently the same Ti,j,k values and the same selected configurations.

Therefore, when there are no changes seen in any of the cache configurations during a BP, it can be concluded that there will not be any further changes to the cache configurations in a subsequent FP. Hence, this condition is used as the termination criterion for Algorithm 3.


5.3.4 Case for Hardware-Accelerated Simulation

As empirically shown with experimental results in Section 5.5, up to three iterations are taken to reach the stable state. This means that up to 13 trace extractions and simulations are performed in total (as per Algorithm 3) for a cache hierarchy with four L1 caches, two L2 caches and one L3 cache: four L1 traces in the pre-pass, plus three traces (two L2 and one L3) in the forward pass of each of the three iterations. A completely software based process would become infeasible in such a scenario due to extreme time consumption, as highlighted in Chapters 1 and 3.

Hardware accelerated simulation is used here to simultaneously count hits for a cache's configuration sub-design-space, similar to the design presented in Chapter 4. This enables the speeding up of simulations, and eliminates the time spent on trace extraction. The parallel simulation of the design spaces for all the caches in each level (given by line 1 in Algorithm 2 and line 6 in Algorithm 3) overlaps the simulation time, thus significantly speeding up the overall process. In contrast to the framework in Chapter 4, the hSim modules used in this chapter have the capability to integrate either the Least Recently Used (LRU) or First In First Out (FIFO) replacement policy for associative caches in the hardware simulator core.


5.3.5 Hardware-Accelerated Simulation Process

The assistance of hardware is incorporated into the FP, as that is where the effects of memory access traces are considered for subsequent lower cache levels. Figure 5.3 provides an overview of how an FPGA device connected to a host PC provides this assistance. At a given cache level Li in the FP, the hardware simulators do the following simultaneously and in parallel for each cache Ci,j:

• observe the memory access trace received by cache Ci,j;

• count hits sustained by each candidate cache configuration Ki,j,k in the cache’s sub-design-space.


Figure 5.3: Overview of the forward pass (FP), where assistance of FPGA hardware is used for parallel design space explorations on each cache level.



Figure 5.4: Example use of hardware simulators (hSim) in level L2 of a cache hierarchy. Components hSim2,1 and hSim2,2 work in parallel to simulate sub-design-spaces of the two shared L2 caches.

After the configurations Ki,j,kmin are selected at level Li, the MPSoC is updated to reflect those changes and the FP moves on to the next cache level Li+1.

An example of how the hardware simulators (hSim modules) are connected to the MPSoC in the FP is depicted in Figure 5.4. The hSim module can be connected in place of a cache memory, to simulate different configurations for that particular cache in the hierarchy.

Figure 5.5 illustrates the interfacing details of hSim, which consists of three ports. The first port connects to the previous cache level of the memory hierarchy (to receive memory addresses) and the second port connects to the next level. The third port is used for control signals. The memory accesses coming from the processor, or the previous cache(s) in the hierarchy, are passed through to the next level cache. A copy of each address is non-intrusively obtained and fed into the simulator core. Accesses coming from different sources can be connected in combination to the input port of the hSim module, which enables it to simulate configurations for shared caches.

Figure 5.5: Interface and structure of the hardware simulator (hSim) module.

Further details about the hardware-assisted simulation framework are presented in Chapter 4. The experiments presented in this chapter were conducted using the FIFO replacement policy, which is widely used in embedded systems. Interested readers are referred to the recent work [SPP14b] for a detailed description of the hardware implementation of the simulator core using the FIFO replacement policy.

5.4 Experiments

To rigorously evaluate the convergence and stability of the algorithm, a series of comprehensive experiments is presented in this section. Algorithm 3 is applied on two different quad-core MPSoCs, System A and System B, each executing a different group of four application programs. In System A, the MPSoC contains a three-level cache hierarchy with four private L1 caches, two shared L2 caches and one shared L3 cache, similar to Figure 5.2. In System B, the MPSoC contains a two-level cache hierarchy with four private L1 caches and one shared L2 cache.


Tests A1, A2, B1 and B2 are presented to demonstrate the convergence of Algorithm 3 for System A and System B. Tests A3 to A9 evaluate the stability of the final design point achieved for System A. An alternative iteration policy is investigated in Test A10. Test C1 examines the optimality of the design point found by the algorithm, using a dual-core MPSoC (System C) containing a two-level cache hierarchy with two private L1 caches and one shared L2 cache.

The 12 tests (A1 to A10, B1 and B2) were performed using the quad-core MPSoCs. Since coherence between caches is not considered in this work, non-communicating application programs were used for the experiments. A set of benchmark applications from the SPEC2006 benchmark suite (bzip2 compression, bzip2 de-compression) and the MiBench suite (lame mp3-encoding, lame mp3-decoding, rijndael aes-encryption, jpeg encoding) were used in groups of four for the experiments, as shown in Table 5.1 and Table 5.2.

Table 5.1: Applications Executed in System A

Processor Application Input Size Memory Accesses

1 lame encode 3KB 259,070,294

2 bzip2 compress 131KB 143,691,935

3 rijndael aes 41KB 238,329,483

4 jpeg encode 769KB 53,597,114

Table 5.2: Applications Executed in System B

Processor Application Input Size Memory Accesses

1 rijndael aes 17KB 99,299,728

2 bzip2 de-compress 18KB 66,365,689

3 jpeg encode 769KB 53,597,119

4 lame decode 25KB 70,702,236

First and second columns respectively give the processor core and the assigned application program. The size of the input data used is given in the third column. The last column reports the total number of memory accesses generated by the processors for the corresponding applications.

The MPSoCs in all tests were created using the Altera Quartus and Qsys tools [Altb], using Nios II/f [Nio] embedded processor instances. An in-house developed parameterized set-associative cache component was used to include caches with the desired configurations in the MPSoC at various stages of Algorithm 3. The cache component implements FIFO as the replacement policy and write-through as the write policy (it should be noted that the presented algorithm does not assume a particular write policy; the algorithm is independent of the write policy used by the system designer). An Altera DE5-NET evaluation board [Alta], depicted in Figure 5.6, containing a Stratix V GX FPGA and 512 megabytes of DDR3 SDRAM, was used to deploy the MPSoC and the hardware simulator components. CACTI 6.5 [MBJ07] was used as the analytical model to obtain timing values for different cache configurations.

Figure 5.6: Altera DE5-NET FPGA board used in the experimental setup.


Table 5.3: Design Space Parameters for Systems A and B

Caches | Block Sizes (Bytes) | Set Sizes (Powers of 2) | Associativities | No. of Configs | Design Points
System A L1 | 4, 8, 16 | 0 to 6 | 1, 2, 4 | 63 | 10.4 trillion (System A total)
System A L2 | 16, 32, 64 | 0 to 7 | 1, 2, 4 | 72 |
System A L3 | 32 to 256 | 0 to 7 | 1, 2, 4, 8 | 128 |
System B L1 | 4, 8, 16 | 0 to 7 | 1, 2, 4 | 72 | 5.4 billion (System B total)
System B L2 | 16 to 256 | 0 to 7 | 1 to 16 | 200 |

Table 5.3 reports the design space parameters for each cache in the hierarchies explored in the tests. The first column lists the cache levels of the hierarchies used in System A and System B. Block sizes in Bytes are given in column two for caches in different levels. The third column gives the set sizes as powers of two, and the fourth column gives the associativities. The fifth column reports the number of cache configurations in each cache's sub-design-space. The last column states the total number of design points in each system's cache hierarchy. With the given parameters, the three-level hierarchy in System A has 10.4 trillion design points (63^4 × 72^2 × 128 from Equation 5.2) and the two-level hierarchy in System B has 5.4 billion design points (72^4 × 200 from Equation 5.2).
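These totals follow directly from multiplying the per-cache sub-design-space sizes, as the following worked check shows (a sanity check, not part of the tool flow itself):

    # System A: four L1 caches (63 configurations each), two L2 (72 each), one L3 (128).
    system_a = 63**4 * 72**2 * 128   # ~1.04e13, i.e. about 10.4 trillion design points

    # System B: four L1 caches (72 configurations each), one shared L2 (200 configurations).
    system_b = 72**4 * 200           # ~5.4e9, i.e. about 5.4 billion design points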


5.4.1 Fairness of Comparison

Works such as [HRA+12, HKH+13, ZGR11] used software-based trace-driven simulation and were the most thorough explorations of two-level multiprocessor cache hierarchies prior to [NSJP14] (presented here in Chapter 4). The work in [NSJP14] was the first to combine the power of hardware-accelerated trace-driven simulation of a complete cache design space with generalized multi-level multiprocessor cache hierarchy exploration. Prior works used a single memory access trace to simulate configurations for every cache in the hierarchy. The framework in [NSJP14] allowed each cache to be simulated with the memory access trace actually observed by that cache, which leads to a more accurate design space exploration. Further, the method in [NSJP14] achieved improved simulation speeds, making such simulations practical in reasonable time. Hence [NSJP14] is considered the state-of-the-art and used as the reference point for comparisons, whereas the work in this chapter further improves the design space exploration to yield better results in terms of memory performance.

5.5 Test Results

The primary goal of the experiments was to evaluate the convergence of Algorithm 3 to a stable set of cache configurations, and to compare the achieved result for TTOT with that of the state-of-the-art work [NSJP14]. Additionally, the stability and optimality of the algorithm are demonstrated in the following sections.


5.5.1 Convergence

A. Test A1

A three-level cache hierarchy for the quad-core system was explored in Test A1. Figures 5.7, 5.8 and 5.9 present an overview of the results. Changes in selected configurations and average access times at different passes of the algorithm are reported for every cache. The horizontal axes mark the steps in forward pass (FP) and backward pass (BP) of each iteration.

Figure 5.7 reports changes in selected configurations. Initially, the MPSoC does not have any caches. In the FP of the first iteration, the caches are added to the system. At this stage, the miss penalty ML for each cache level is calculated using the DRAM access latency. Misses are therefore costly and high hit rates are desirable, which leads to large cache configurations being selected. During the BP of the first iteration, the existence of lower level caches is taken into account, which makes upper level misses less costly. When selecting configurations, the importance of keeping the miss penalty low diminishes while the importance of reducing hit latency increases. Therefore, smaller configurations are selected with faster average access times, albeit with reduced hit rates.

Figure 5.7: Results from Test A1. Changes in selected configuration sizes for the caches Ci,j at the design point reached in each iteration step.

The changes in average access times (Ti,j,kmin ) observed by the algorithm are reported in Figure 5.8, which correspond to the configuration changes shown in Figure 5.7.

For example, a configuration change happens for C2,1 at the 1-BP L2 step, resulting in a reduced T2,1,kmin. The value of Ti,j,k is calculated considering the currently selected cache configurations in the lower cache levels, and depending on the access traces received from upper levels, as described in Section 5.3. When a configuration change is made to a cache, the previously calculated Ti,j,kmin values for caches in other levels become obsolete. Subsequent iterations over the hierarchy update these values until no further changes need to be made.
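To see why a change in one level invalidates the values computed for the others, the per-configuration metric can be thought of as hit latency plus miss rate times miss penalty, where the miss penalty at level Li is the access time of the configuration currently selected at level Li+1 (or the DRAM latency below the last level). The numbers in the sketch below are illustrative only; the exact model is the one described in Section 5.3.

    def access_time(hit_latency, hit_rate, miss_penalty):
        # Average time per access for one candidate configuration.
        return hit_latency + (1.0 - hit_rate) * miss_penalty

    # If the selected L2 configuration changes and its own access time drops
    # from 4.0ns to 3.0ns, every L1 candidate's metric changes with it, which
    # is why previously computed Ti,j,k values become obsolete.
    t_l1_before = access_time(hit_latency=1.0, hit_rate=0.95, miss_penalty=4.0)  # 1.20ns
    t_l1_after  = access_time(hit_latency=1.0, hit_rate=0.95, miss_penalty=3.0)  # 1.15ns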

As shown in Figure 5.9, the configuration changes made at each step result in a continued reduction in TTOT as the algorithm progresses. The algorithm converges to a stable state after three iterations. At the termination of the algorithm, there are no further changes to be propagated from any cache level to another (i.e. no selection of Ki,j,k can be made with a lesser Ti,j,k for any cache). T1,j,k at this design point gives the final set of average times spent on a memory access by each processor core in the multiprocessor environment.

Figure 5.8: Results from Test A1. Changes in resulting Ti,j,kmin for the caches Ci,j as seen by the algorithm, at the design point reached in each iteration step.


Figure 5.9: Results from Test A1. Changes in TTOT at the design point reached in each iteration step.


Table 5.4: Selected Design Point in Test A1

Cache Block Size (Bytes) Set Size Associativity Cache Size (Bytes)

C1,1 8 32 1 256

C1,2 8 16 4 512

C1,3 16 16 1 256

C1,4 8 16 4 512

C2,1 32 32 2 2048

C2,2 32 32 4 4096

C3,1 256 128 8 262144

Figure 5.7 and Figure 5.9 highlight the differences between the design points reached by Algorithm 3 and the state-of-the-art [NSJP14], where only a single pass is performed. It can be observed that the sizes of the selected cache configurations have reduced in the new iterative method (by up to 93.75%), while reducing TTOT by 16% at the same time. Using Equation 5.6, the number of explored design points is 2256, which amounts to a mere 2.2 × 10^-8 % of the design space.
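The quoted fraction is simply the ratio of explored points to the total size of the design space, as a quick check confirms:

    explored = 2256
    total = 1.04e13                      # design points in System A's hierarchy
    fraction = explored / total * 100    # ~2.2e-8 percent of the design space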

Table 5.4 reports the set of selected cache configurations at the final design point. First column gives the cache identifier. Second, third and fourth columns present the Block Size, Set Size and Associativity of the selected configurations respectively. The last column gives the size of the selected configuration.


B. Test B1

Similarly, Figures 5.10, 5.11 and 5.12 present the algorithm's convergence for the two-level cache hierarchy in Test B1. The design points visited at each iteration step and the corresponding values of TTOT are shown in Figure 5.10 and Figure 5.12 respectively. Figure 5.11 gives the successive Ti,j,kmin values calculated for each cache in the hierarchy.

Figure 5.10: Results from Test B1. Changes in selected cache configuration sizes for the caches Ci,j at the design point reached in each iteration step.

Figure 5.11: Results from Test B1. Changes in resulting Ti,j,kmin for the caches Ci,j as seen by the algorithm, at the design point reached in each iteration step.

Figure 5.12: Results from Test B1. Changes in TTOT at the design point reached in each iteration step.


Table 5.5: Selected Design Point in Test B1

Cache Block Size (Bytes) Set Size Associativity Cache Size (Bytes)

C1,1 16 64 2 2048

C1,2 8 32 1 256

C1,3 8 16 4 512

C1,4 8 32 1 256

C2,1 128 16 8 16384

After three iterations, the algorithm converges to the design point reported in Table 5.5. It can be observed that the sizes of the selected cache configurations have reduced in the new iterative method (by up to 93.75%), while reducing TTOT by 18.9% at the same time. Using Equation 5.6, the number of explored design points is 1952, which is a mere 3.6 × 10^-5 % of the total design space.

In contrast to the new iterative method, the state-of-the-art work [NSJP14] presented in Chapter 4 optimizes each individual cache only once. Table 5.6 reports the resulting design points from the new iterative method in comparison with the state-of-the-art. The first column indicates the test number and the second column indicates the measure to be compared (size of caches and value of TTOT). The third and fourth columns respectively report the values of the corresponding measure at the final design point from [NSJP14] and from Algorithm 3. The last column presents the achieved reduction as a percentage, for the corresponding measure. Algorithm 3 was able to find design points with smaller caches in both Tests A1 and B1. For example, the size of C1,3 was reduced from 4KB to 256B in Test A1, which is a 93.75% reduction. The iterative method was able to reduce the total size of the caches used by 10.37% in Test A1 and by a significant 74.15% in Test B1. At the same


Table 5.6: Results in Comparison

Test | Measure | [NSJP14] | Algorithm 3 | Reduction [%]
A1 | C1,1 size | 1KB | 256B | 75.00
A1 | C1,2 size | 1KB | 512B | 50.00
A1 | C1,3 size | 4KB | 256B | 93.75
A1 | C1,4 size | 1KB | 512B | 50.00
A1 | C2,1 size | 16KB | 2KB | 87.50
A1 | C2,2 size | 16KB | 4KB | 75.00
A1 | C3,1 size | 256KB | 256KB | 0.00
A1 | Total cache size | 294KB | 263.5KB | 10.37
A1 | TTOT | 3.32ns | 2.79ns | 16.0
B1 | C1,1 size | 4KB | 2KB | 50.00
B1 | C1,2 size | 1KB | 256B | 75.00
B1 | C1,3 size | 8KB | 512B | 93.75
B1 | C1,4 size | 512B | 256B | 50.00
B1 | C2,1 size | 64KB | 16KB | 75.00
B1 | Total cache size | 73.5KB | 19KB | 74.15
B1 | TTOT | 3.81ns | 3.09ns | 18.9

time, the overall average memory access time was reduced by 16.0% in Test A1 and by 18.9% in Test B1, achieving faster MPSoC performance. These observations evidently reinforce the hypothesis that carefully selected smaller caches have the potential for better performance than larger caches in application specific systems.


Table 5.7: Explored Portion of Design Space

Test | Total Design Points | Explored in [NSJP14] | Explored in Algorithm 3 | Increase in No. of Design Points
A1 | 1.04 × 10^13 | 524 | 2256 | 4.31×
B1 | 5.4 × 10^9 | 488 | 1952 | 4×

Table 5.7 reports the increased number of design points evaluated by Algorithm 3, compared to the state-of-the-art [NSJP14]. Column one indicates the test number, while the total size of the design space is given in column two. The third column gives the number of design points explored in [NSJP14]. The fourth column reports the total number of design points explored by Algorithm 3, and the last column reports the increase compared to column three. The results show that the new iterative method explores 4.31× and 4× larger portions of the design spaces in Tests A1 and B1 respectively, in order to find better design points for the cache hierarchies. However, the absolute increase in design points is in the thousands, which remains practically feasible in reasonable time since the power of hardware-accelerated simulation is exploited.

C. Tests A2 and B2

The goal of Tests A2 and B2 is to investigate whether the algorithm converges to the same design point regardless of whether the exploration starts from the top of the cache hierarchy or the bottom (i.e. either L1 or LN). This essentially changes the starting point of the optimization in the design space and makes the algorithm take a completely different path of design points. Either approach was taken in the previous non-iterative methods [HRA+12] and [HKH+13].



Figure 5.13: Changes in selected cache configuration sizes for the caches Ci,j at each iteration step, in (a) Test A2 and (b) Test B2 where exploration starts at level LN .

Final design points reached are the same as those of Tests A1 and B1 respectively.


For this, the same cache hierarchies were explored as in Tests A1 and B1, starting from the bottom cache level. To achieve this, the pre-pass of the algorithm has to be altered such that cache levels from the bottom to the top are traversed once prior to the first FP.

An overview of the results is presented in Figure 5.13. The pre-pass performs three simulations in Test A2 (Figure 5.13(a)), at levels L3, L2 and L1 in sequence, before the first FP takes over the iterations. Similarly, in Test B2 (Figure 5.13(b)), two simulation steps are performed in the pre-pass. Selected configurations are reported for every cache, at different iteration steps in the algorithm. The process follows different points in the design space when compared to Tests A1 and B1, as shown in Figure 5.13. However, the algorithm terminates at the same design points given in Table 5.4 and Table 5.5, empirically showing that convergence can be achieved regardless of whether the optimization starts at the top or the bottom of the hierarchy.

As in the previous tests, the initially selected configurations and their associated access times (Ti,j,kmin) are made obsolete by successive changes made in other cache levels. The size increase in selected cache configurations happens due to the sharing of lower level caches. For a given shared L2 cache, the observed accesses coming from a particular L1 cache (denoted L1*) reduce when the size of that L1 cache is increased. Depending on the interleaving pattern of accesses received from the other L1 caches, the reduced accesses coming from L1* are likely to incur more misses than before. Therefore, large configurations with high hit rates generally tend to be favourable for shared lower level caches.


5.5.2 Simulation Times

In order to propagate the effects of access trace changes to the lower levels in a hierarchy, a number of cache design space simulations are performed in the FP. In each of these simulations, cache hits for a number of configurations are simultane- ously counted. At each cache level, simulations are performed by using hardware simulators in parallel for all caches in the level.

The overall time taken by the hardware simulators to count cache hits is reported in Table 5.8. Columns 2 to 5 respectively report the measurements for Tests A1, A2, B1 and B2. Row one gives the number of simulation steps encountered by the algorithm, where HRi,j,k is calculated. Each simulation step requires the MPSoC to

Table 5.8: Simulation Times when using Hardware Assistance

| Test A1 | Test A2 | Test B1 | Test B2
No. of Simulation Steps (No. of Re-synthesis) | 7 | 7 | 4 | 4
Time for Re-synthesis (hours) | 10.5 | 10.5 | 6 | 6
Time for Simulation (s) | 1595 | 2681 | 303 | 476
Total Time (hours) | 10.94 | 11.25 | 6.08 | 6.13
Total Memory Accesses (millions) | 2600 | 4305 | 572 | 799
Sim. Time per million Accesses (s) | 0.61 | 0.62 | 0.53 | 0.59

be re-synthesized. The time taken for re-synthesis of the MPSoC is reported in row two. Time spent on simulations is reported in row three. The total times for re-synthesis and simulation are presented in the fourth row. The fifth and sixth rows present the total number of memory accesses consumed by the hardware simulators, and the respective simulation time per million memory accesses.

As shown in Algorithm 3, each simulation step is associated with integrating the hardware simulators into the MPSoC, including caches with the selected configurations, and re-synthesizing. Re-synthesis takes several minutes to a few hours, depending on the size of the selected caches, the size of the configuration design spaces and the size of the FPGA. However, the use of hardware simulators is still appealing when the ability to simulate several design spaces in parallel is considered, and when compared to the lengthy time needed to extract several traces if hardware simulators are not used.

5.5.3 Stability and Empirical Optimality

A. Tests A3 to A9

Due to the vast proportions of the design space, exhaustively finding a proof of optimality for the final design point is practically infeasible. Instead, a set of tests (A3 to A9) was conducted in order to assess the stability of the final design point reached by Algorithm 3. In each test, an offset was manually introduced to the final design point after the algorithm converged for System A. Then the algorithm was allowed to continue iterating, to check whether and when it returns to the originally selected design point.


To select random offset design points, Latin Hypercube Sampling [IDZ80] was used, which is a standard statistical method to select a plausible collection of parameter values covering a multidimensional design space. The selected offset points include the ones with maximum and minimum cache sizes in the design space. Table 5.9 reports the total cache sizes for the random offset design points.

In each test A3 to A9, the algorithm was first applied on System A to find the original convergence. Thereafter, a random offset from Table 5.9 was applied to the selected design point, and the exploration process was continued. The algorithm succeeded

Table 5.9: Offset Design Points from Latin Hypercube Sampling

Test Offset Design Point (Total cache size in Bytes)

A4 40064B

A5 14848B

A6 9088B

A7 29696B

A8 311296B

A9 1024B


Figure 5.14: Number of iterations taken to re-stabilize when an offset is manually introduced to the originally selected design point for System A.

in returning to the original design point in all seven tests, underlining its stability. Figure 5.14 presents a summary of the results. An average of 1.29 iterations were taken for the algorithm to re-stabilize. These results also indicate that the final design point is potentially a global optimum and not a local one.

B. Test C1

It is practically infeasible to conduct an exhaustive search due to the sheer size of a typical cache hierarchy design space. However, for the sake of comparison and completeness, a minute instance of the problem can be explored using Algorithm 3 as well as an exhaustive method. Such a comparison enables further investigation of the optimality of the final design point.

In Test C1, Algorithm 3 was compared with an exhaustive exploration of a small design space instance for a dual-core MPSoC (System C) executing lame mp3 encoding and jpeg encoding. In System C, the MPSoC contains a two-level cache hierarchy with two private L1 caches and one shared L2 cache. Table 5.10 reports the design space parameters used in Test C1. The first column states the cache level, and columns two to four respectively show the ranges of block size, set size and associativity. Column five reports the number of configurations in each cache's sub-design-space, while the last column gives the total number of design points.

Table 5.10: Design Space Parameters for System C

Caches | Block Sizes (Bytes) | Set Sizes | Associativities | No. of Configs | Design Points
L1 | 8, 16 | 32, 64 | 2, 4 | 8 | 768
L2 | 8, 16 | 64, 128 | 2, 4, 8 | 12 |



Figure 5.15: Results from Test C1. Changes in: (a) selected configuration sizes for the caches Ci,j; (b) resulting Ti,j,kmin for the caches Ci,j as seen by the algorithm, at the design point reached in each iteration step.


As illustrated in Figure 5.15, after two iterations, Algorithm 3 was able to converge to the design point described in Table 5.11. The first column in Table 5.11 specifies the cache identifier, and columns two to four respectively show the block size, set size and associativity of the selected configurations. The last column reports the size of the selected caches. The exhaustive search consumed several days (in the computing environment reported in Section 5.4) to find the absolute optimal design point depicted in Figure 5.16, which is the exact same point described in Table 5.11. This result further strengthens the case for the optimality of the proposed iterative algorithm.


Figure 5.16: Design space in System C, showing the optimal design point. Vertical axis denotes TTOT . Horizontal axis denotes the total cache size of the hierarchy.


Table 5.11: Optimal Design Point for System C

Cache Block Size (Bytes) Set Size Associativity Cache Size (Bytes)

C1,1 8 32 2 512

C1,2 8 64 2 1024

C2,1 16 128 8 16384

5.5.4 Alternative Iteration Policies

As an alternative to the back-and-forth style iterations of Algorithm 3, Round Robin traversal of cache levels was investigated in Test A10 for System A. The optimization starts at cache level L1 and proceeds one level at a time down to LN, as before. After level LN, the optimization moves directly back to L1, instead of performing a backward traversal through all cache levels.

Figure 5.17 presents a summary of the Test A10 results. The final design point reached through this method is the same as that of Test A1, as given in Table 5.4. However, the Round Robin method took four iterations to reach the final point, as opposed to the three iterations in Test A1. A total of nine simulation steps (one for L1 and eight for L2 and L3 in the four iterations) were required, as opposed to only seven simulation steps in Test A1 using Algorithm 3.



Figure 5.17: Changes in selected cache configuration sizes for the caches Ci,j at each iteration step, in Test A10 where Round Robin traversal of cache levels is used. Final design point reached is the same as that of Test A1.

Discussion: Generic evolutionary algorithms such as [MPZS12] need the objective function (i.e. cache access time or execution time) to be evaluated individually for each design point in the population. Hence, such algorithms cannot exploit the capabilities of fast hardware cache simulators [NSJP14], which quickly explore the sub-design-spaces of individual caches in a level. The algorithm presented in this chapter is designed to traverse the cache hierarchy level by level, exploring individual cache sub-design-spaces, and hence employs the assistance of hardware cache simulation to perform fast design space exploration.


5.6 Summary

This chapter presented an algorithm which, for the first time, traverses a multi-level MPSoC cache hierarchy iteratively and finds a suitable design point (set of cache configurations) improving the average memory access speed. The assistance of hardware simulation was used for fast calculation of cache hits and for real-time trace extraction, which made it feasible to perform multiple iterations over a cache hierarchy. The convergence of the algorithm and the stability of the final design point were demonstrated using several comprehensive tests. The tests show that the same design point can be reached regardless of the starting point and also with different iteration policies.

Cache configuration simulation for each cache's design space can be performed quickly due to the use of hardware simulators. The multiple syntheses of the MPSoC required by the algorithm may still consume considerable time. However, it is several orders of magnitude lower compared to spending days or weeks on trace extraction for software-based simulators. Advanced techniques such as partial reconfiguration and incremental synthesis of FPGA hardware are available for designers to greatly reduce the time consumed by repeated synthesis.

Chapter 6

Dark Silicon and Application Specific Cache Optimizations

6.1 Introduction

As technology nodes continue to shrink, future Silicon chips are predicted to have transistors in such abundance that the whole chip cannot be powered simultaneously as the power consumption per transistor does not continue to scale (known as the Dark Silicon phenomenon). According to Taylor [Tay12], a staggering 93.75% of a chip design has to be kept dark (powered off) by the year 2020 in order to maintain safe operating temperatures. Many researchers have proposed methods to leverage the Dark Silicon on a chip to perform application specific optimizations [CX13, CMP+14, TRGM13].


Cache memories are consistently used to bridge the performance gap between processor and memory, and to reduce the energy consumption of memory accesses [GRZVD04, HS12]. Assuming that the largest cache configuration provides the fastest memory access time is a common misconception, as explained in Chapter 1. As shown in the literature [Kha14, LM08, SPP14b], different applications executing on the same processor often secure the best performance with different cache configurations. Configurable caches being readily available to designers further motivates the application-specific tuning of cache parameters.

In the domain of embedded systems, a typical processor repeatedly executes a set of applications. Using a fixed cache configuration in such a system will often yield suboptimal performance. For example, Figure 6.1 shows the estimated average cache access times for a processor executing four application programs (adpcm, bzip2, fft, fdct) with different cache compositions. Data point C1 represents the scenario with all applications using their optimal cache configurations. Data points C2-C9 represent a set of scenarios where a fixed cache configuration is used for all four applications. From the experiments, C2-C9 perform between 42% and 179% slower compared to C1. Therefore, having the ability to use distinct cache configurations for different applications can allow significant memory access performance gains. To exploit this benefit, a simple cache architecture (called switchable cache) is described in this chapter, which can change between different pre-determined configurations at run-time, by leveraging the Dark Silicon area offered by future chips.

When the available Dark Silicon budget limits the number of different configurations in the switchable cache, and a higher number of application programs are to be executed by the processor (say only four cache configurations can be accommodated while eight applications execute on the system), selecting an optimal set of configurations for the switchable cache becomes a new design problem. As described in Section 6.3.2, the design space for such a problem can easily grow to vast proportions (several trillions of design points). A new design-time algorithm is presented to rapidly pre-determine the optimal or a near-optimal set of switchable cache configurations, which maximizes the cache access performance for a given group of application programs. This work is the very first in the direction of switchable caches and their associated design space exploration problems.

C1: Each application using its optimal cache configuration

Optimal Cache Configuration | Block Size (Bytes) | Set Size | Associativity | Cache Size (Bytes)
adpcm | 4 | 64 | 8 | 2048
bzip2 | 4 | 64 | 16 | 4096
fft | 8 | 16 | 4 | 512
fdct | 8 | 64 | 2 | 1024

C2-C9: Fixed cache configuration

Fixed Cache Configuration | Block Size (Bytes) | Set Size | Associativity | Cache Size (Bytes)
C2 | 4 | 128 | 1 | 512
C3 | 4 | 64 | 4 | 1024
C4 | 8 | 64 | 16 | 8192
C5 | 8 | 64 | 4 | 2048
C6 | 16 | 64 | 4 | 4096
C7 | 16 | 32 | 2 | 1024
C8 | 32 | 128 | 2 | 8192
C9 | 32 | 16 | 8 | 4096

Figure 6.1: Average cache access time for a group of four applications (adpcm, bzip2, fft, fdct) when using variable and fixed cache configurations.

Highlights:

• This chapter presents the first work to perform cache design optimization in the context of Dark Silicon.

• A cache architecture with minimal overheads is described, which can switch at run-time between different configurations pre-determined at design-time, by leveraging the chip area due to Dark Silicon offered by future chips.

• A fast design space exploration algorithm is presented, to find the optimal or near-optimal set of switchable cache configurations for a given group of applications along with the cache-to-application mapping, when there are more applications than the number of cache configurations that can be accommodated in the system. The presented DSE method is applicable to switchable caches as well as traditional reconfigurable caches.

The rest of this chapter is organized as follows: Section 6.2 presents the design and implementation of the proposed switchable cache architecture; the DSE problem and methodology associated with the switchable cache are detailed in Section 6.3; Section 6.4 presents demonstrations of the proposed DSE algorithm through extensive testing. Extended use cases for the switchable cache architecture and the proposed design space exploration algorithm are discussed in Section 6.5.


6.2 Switchable Cache Architecture

Enabling application programs to use distinct cache configurations, as opposed to a fixed configuration, allows for greater performance (as seen in the example shown in Figure 6.1). The proposed architecture provides the facility to encapsulate a set of cache cores, each with a unique configuration, and a mechanism to change between the configurations at run-time.

Traditional reconfigurable caches employ dynamic re-organization of tag and data arrays to share hardware between different configurations. Complexity is therefore increased with additional hardware to perform reconfiguration at run-time, thereby increasing the timing delays and the amount of Bright Silicon. Reconfiguration may alter the structures of tag and data arrays to achieve performance gains with various applications. However, reconfigurable caches can only represent a limited set of configurations, as increasing one cache dimension necessitates another dimension being reduced. For example, the 8KB reconfigurable cache presented in [ZV03], with a maximum of four associative ways and two logical block sizes, can only represent 18 specific configurations. In contrast, a typical cache design space can consist of over 300 configurations to select from. Hence, a given reconfigurable cache is unlikely to provide significant performance gains for a wide range of applications. With the switchable cache, a simple multiplexing over conventional cache cores is proposed, which imposes negligible overheads, is not limited to a small subset of configurations, and enables legacy cache hardware to be used.

Implementation of the switchable cache is described in Figure 6.2. From a system perspective, the overall module appears as a regular cache, with the exception of an additional control port. The Switchable Cache Control port can be used by the CPU to activate a cache core with the desired configuration, or to check the currently used cache core. The configuration selection unit contains a single register and provides simple functionality to select a cache configuration, by writing to the register, or to check the current selection. Signals generated by the selection unit are used to power up only the desired cache core and to multiplex the memory accesses among the cache cores. Two multiplexer layers are used, one at the CPU-side interface and the other at the memory-side interface. Each layer contains circuitry to multiplex signals both ways (to and from the cache). The selection unit generates the corresponding control signals for the multiplexers, based on the value of the selection register.

Figure 6.2: Implementation of the switchable cache.


Figure 6.3: Example switchable cache use cases. Each application uses its optimal cache configuration.

Figure 6.3 illustrates a usage scenario for the switchable cache. The system executes four application programs, and contains cache cores with optimal configurations for each application. Only the cache core corresponding to the active application is kept powered. All other cache cores are powered off, or kept dark.

To evaluate the overheads of the two-way multiplexer circuitry and control logic, a switchable cache and a fixed cache were implemented with the same configuration (block size = 16B, set size = 256, associativity = 1, size = 4KB). In Table 6.1, columns two and three respectively report the path delay and power from Synopsys Design Compiler using 45nm technology. Column four gives the logic utilization in Adaptive Logic Modules (ALMs) from Altera Quartus II. The two-way multiplexer logic was shown to impose an additional 0.13ns delay on cache accesses.


Table 6.1: Overheads of Switchable Cache

Path Delay Power Logic Utilization

(ns) (mW) (ALMs)

Fixed Cache 1.19 0.3755 3373

Switchable Cache 1.32 0.3930 3507

Overhead 0.13 (10.9%) 0.0175 (4.7%) 134 (3.9%)

Table 6.2: Candidate Cache Configurations

Block Sizes (Bytes) | Set Sizes | Associativities | No. of Configs
4 to 256 | 1 to 256 | 1, 2, 4, 8, 16 | 315

A set of experiments were performed to evaluate the performance of the switchable cache as opposed to using a fixed cache. Memory access patterns of four individual applications (adpcm, bzip2, fft, fdct) were analysed to find the optimal cache configuration for each application, out of the 315 candidate configurations described in Table 6.2. The first three columns in Table 6.2 present the ranges of block size, set size and associativity, in that order. Column four gives the total number of configurations with the given parameters. Simulation techniques from [SPP14b] were used to obtain the cache hit rates for every combination of application and cache configuration, using simultaneous hardware simulation. CACTI 6.5 [MBJ07] with 32nm technology was used for the analysis and calculation of cache access times. The experiments were carried out using an SoC containing a Nios II/f [Nio] embedded processor, on a Stratix V GX FPGA with 512 megabytes of DDR3 SDRAM. All candidate cache configurations were simultaneously simulated in hardware using real-time memory access traces of all four applications, to calculate cache hit rates and average access times. Figure 6.4 presents a detailed schematic symbol of the switchable cache as implemented in the Altera Qsys system integration tool [Altb].

Figure 6.4: Detailed schematic symbol showing all signals for the switchable cache as implemented in Altera Qsys.


Table 6.3: Average Cache Access Times

Cache Composition | Average Cache Access Time (ns)
C1 | 0.66
C2 | 1.58
C3 | 0.99
C4 | 1.18
C5 | 0.94
C6 | 1.41
C7 | 1.85
C8 | 1.53
C9 | 1.57

The results from the experiments are reported in Figure 6.1 and Table 6.3. The first column of Table 6.3 gives the cache composition (described in Figure 6.1) and the second column reports the average time per cache access. In the first experiment, cache cores with the optimal configuration for each application were included in the switchable cache (composition C1). The corresponding average time per cache access was calculated to be 0.66ns. In subsequent experiments, fixed cache cores with a random candidate configuration (compositions C2-C9) were used for all applications. The calculated average cache access times ranged from 0.94ns to 1.85ns. The experiments show that having the ability to change between the optimal cache configurations for individual applications can improve memory access speed by 42% (compared to fixed configuration C5) to 179% (compared to fixed configuration C7).
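The 42%-179% range follows directly from Table 6.3 by comparing each fixed composition against C1; the small check below uses the rounded table values, so the largest ratio comes out marginally above the reported figure:

    c1 = 0.66                                           # ns, switchable cache (C1)
    fixed = {"C2": 1.58, "C3": 0.99, "C4": 1.18, "C5": 0.94,
             "C6": 1.41, "C7": 1.85, "C8": 1.53, "C9": 1.57}
    slowdown = {name: (t / c1 - 1.0) * 100 for name, t in fixed.items()}
    # Smallest slowdown: C5 at about 42%; largest: C7 at about 180%.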

When deciding the set of unique cache configurations to put into the switchable cache, the optimal configuration for every application is selected. This requires a priori knowledge about the applications at design-time. Design space exploration methods such as [SPP14b] can be used to identify the optimal cache configuration for each application. The application-to-cache mapping is not hardwired in the proposed switchable cache mechanism. Using the provided control port, the application program itself has the capability to activate the required cache configuration at run-time. In the case where a new program (which was not known at design-time) is to be executed on the system, an analysis such as that in [SPP14b] can be performed to identify the most suitable cache configuration out of the ones already available in the fabricated switchable cache. The program can then activate the selected cache configuration at run-time.


6.3 Switchable Cache Tuning

A new problem arises when there are more applications than the number of switchable cache configurations affordable with the available Dark Silicon budget. In such a scenario, application programs will invariably have to share cache configurations, as depicted in Figure 6.5. Thus, a set of configurations has to be selected for the switchable cache and assigned to the programs in such a way that the memory access performance of all the applications as a group is optimized. The same optimization problem is applicable to designing reconfigurable caches as well.

It should be noted that to achieve an optimal solution for the complete group of applications, certain application programs may need to use sub-optimal cache con- figurations.

Figure 6.5: Example scenario with eight application programs and four switchable cache configurations. More than one application shares the same cache configuration (Applications B and E share cache configuration 2 to achieve the best performance).


Subsections 6.3.1 and 6.3.2 respectively define the problem and provide a mathematical analysis.

6.3.1 Problem Formulation

Given an SoC containing a switchable cache, as described in Section 6.2, with:

• a Dark Silicon budget, in number of switchable configurations NS;

• a set of known application programs Ai

(Ai|1 ≤ i ≤ NA, NA > NS) to be executed, with:

– known frequency of occurrence fi for each application Ai (normalized to a scale of 0 to 1); and

• a set of known candidate cache configurations Kj

(Kj|1 ≤ j ≤ NCC , NCC > NS) with:

– known hit latency HLj for each configuration Kj; and

– known update latency ULj for each configuration Kj.

select the set of NS configurations from the candidates for the switchable cache which minimizes the average cache access time for the set of NA application programs.

Note: NA = NS = NCC gives the simplest form of the problem, which has a trivial solution.


6.3.2 Analysis

The term Tij is defined here as the cache access time for application program Ai using cache configuration Kj. Equation 6.1 describes the system of equations representing the cache access times of configuration Kj for all applications Ai. HRij represents the cache hit rate achieved by application Ai with cache configuration Kj. Tm denotes the access time for the main memory, which is accessed in case of a cache miss.

Tij = HLj + (1 − HRij) × (Tm + ULj),   for i = 1, …, NA        (6.1)

The goal is to find a set of cache configurations such that the weighted average cache access time over all application programs is minimized. In other words, Tavg as given in Equation 6.2 needs to be minimized, where the coefficient fi is the normalized frequency of occurrence for application Ai. Constant values for fi may be determined at design-time based on the expected system behavior, or assumed to be fi = 1 when the occurrence frequencies are indeterminable.


Tavg = ( Σ_{i=1}^{NA} fi × Tij ) / ( Σ_{i=1}^{NA} fi )        (6.2)
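Equations 6.1 and 6.2 transcribe directly into code, which is convenient for checking a candidate assignment by hand. All numeric inputs below are placeholders:

    def t_access(hl, hr, ul, tm):
        # Equation 6.1: access time of one application on one configuration.
        return hl + (1.0 - hr) * (tm + ul)

    def t_avg(times, freqs):
        # Equation 6.2: frequency-weighted average access time over all applications,
        # where times[i] is Tij for application Ai under its assigned configuration.
        return sum(f * t for f, t in zip(freqs, times)) / sum(freqs)

    times = [t_access(hl=0.9, hr=0.97, ul=0.4, tm=20.0),   # application A1
             t_access(hl=1.1, hr=0.99, ul=0.5, tm=20.0),   # application A2
             t_access(hl=0.9, hr=0.93, ul=0.4, tm=20.0)]   # application A3
    freqs = [1.0, 0.6, 0.4]
    average = t_avg(times, freqs)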

Without the restriction of NS (Dark Silicon budget), NA cache configurations could be selected from the candidates, each of which minimizes Tij for at least one ap- plication. Let NU be defined as the size of the set of unique configurations, which minimize Tij for all applications (note that NU < NA). When the number of config- urations that can be put into the switchable cache is limited (i.e. NS < NU ), a set of NS configurations need to be carefully selected from the set of candidates, to be shared between applications according to a particular assignment.

Size of design space = C(NCC, NS) × (NS)^NA        (6.3)

Equation 6.3 gives the total size of the design space. There are C(NCC, NS) possible ways to select NS configurations for the switchable cache out of NCC candidates. For any selected NS configurations, there are (NS)^NA possible ways to map the NA applications (i.e. assign applications to selected cache configurations). For example, if there are eight application programs (NA = 8), 315 candidate cache configurations (NCC = 315) and four configurations need to be selected for the switchable cache (NS = 4), Equation 6.3 gives over 26 trillion (26.38 × 10^12) design points.

The size of the design space increases exponentially with NA and NCC .
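A one-line check of the quoted figure using Equation 6.3:

    from math import comb

    N_A, N_CC, N_S = 8, 315, 4
    design_points = comb(N_CC, N_S) * N_S ** N_A   # ~2.638e13, i.e. over 26 trillion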

Actual cache access time is used as the objective function instead of clock cycles, since clock cycle time depends on other components in the system. However if a clock cycle time Tclock can be provided, ceil(Tavg/Tclock) can be used as the objective.


6.3.3 Exploration Algorithm

An overview of the proposed design space exploration method is presented in Algorithm 4. The first task is to obtain the cache access timing Ti,j for all NA application programs when using all NCC cache configurations. Simulation techniques from [SPP14b] and CACTI 6.5 [MBJ07] were used to obtain Ti,j, as given by lines 1-3 in Algorithm 4. All cache configurations are simultaneously simulated using a hardware simulator [SPP14b] connected to a processor which executes the applications.

Algorithm 4: Selecting optimal NS cache configurations for switchable cache

// Simulate: obtain Ti,j for all NA programs on all NCC cache configurations.
1  for i := 1 to NA do
2      ∀ j := 1 to NCC : simultaneously evaluate HRi,j
3      ∀ j := 1 to NCC : calculate Ti,j
// Preprocess: sort Ti,j for all NA programs.
4  for i := 1 to NA do
5      Sort the list of Ti,j in ascending order.
// Search: determine the optimal set of NS cache configurations by tree-search.
6  Create root node with the set of top (0th) cache configurations in the sorted lists as the selected set.
7  for i := 1 to NA do
8      root→selected_conf[i] := 0
9  root→NU := determine_NU(root→selected_conf)       // number of unique cache configurations in the selection
10 root→Tavg := determine_Tavg(root→selected_conf)   // Tavg for the selected set of cache configurations
11 min_Tavg := ∞                                     // currently found best Tavg
12 optimal := find_minimum(root)                     // recursive search function


The hardware simulator provides the hit rates for every cache configuration, which are then combined with the cache timing data from [MBJ07] to obtain the Ti,j values. Each application's list of Ti,j is then sorted in ascending order (lines 4 and 5).

To efficiently explore the vast design space, the exploration problem is formulated as a search tree. Fast search times are achieved through careful design of the tree nodes and the tree expansion procedure, which are explained in detail in the following paragraphs. Figure 6.6(a) depicts the structure of a tree node. A tree node represents a selected set of cache configurations assigned to the set of applications in a particular manner. Hence, each tree node is attributed with: NU, the number of unique cache configurations in the selected set; and Tavg, the average cache access time when using the selected set of configurations. NU in a selected set may range from one configuration (being shared among all applications) to as many configurations as the number of applications.

At the beginning of the search, a root node is created (line 6 of Algorithm 4). In the root node, each application is assigned its optimal cache configuration, obtained from the top of the sorted lists (lines 7 and 8). NU and Tavg attributes of the root are then calculated (lines 9 and 10). Tavg at the root node is the lowest in the search tree. However, it should be noted that the root node may not represent a valid design point because NU at the root is almost always greater than NS, except in rare cases where many applications share the same optimal cache configuration.

Current best min Tavg is initialized to a large number (line 11), and the recursive search function is called on the root node (line 12).



Figure 6.6: (a) Search tree node structure. (b) Example of tree level expansion.

Algorithm 5 describes the recursive tree-search function to find the optimal (or near-optimal) design point. First, the function expands the next level of tree nodes based on the provided root node (lines 1-9). Figure 6.6(b) illustrates the expansion. NA new nodes are created from the root node. Each new node has a single application's cache configuration changed compared to the root. The change is made by selecting the next configuration in order from the sorted list of Ti,j (lines 3-5), which aims to reduce NU by letting Tavg increase slightly. The target is to find a node with NU equal to (or less than) NS, which denotes a valid design point.


Algorithm 5: Recursive tree-search to find minimal Tavg.
Function: find_minimum
Input: root node

   // Expand next tree level, with only one application's cache config changed in each node.
 1 for i := 1 to NA do
 2     Create new node.
 3     for k := 1 to NA do
 4         new→selected_conf[k] := root→selected_conf[k]
 5     new→selected_conf[i] := root→selected_conf[i] + 1
       // Find number of unique cache configurations in the selection.
 6     new→NU := determine_NU(new→selected_conf)
       // Find Tavg for the selected set of cache configurations.
 7     new→Tavg := determine_Tavg(new→selected_conf)
       // Heuristic: NU must be non-increasing in the next tree level.
 8     if new→NU ≤ root→NU then
 9         root→next[i] := new
   // Find the node with minimum Tavg in the new level.
10 min_node := determine_min(root→next)
   // Base case: the minimum node on the next level satisfies the condition on NS.
11 if min_node→NU ≤ NS then
12     if min_node→Tavg < min_Tavg then
13         min_Tavg := min_node→Tavg          // Current best Tavg.
14     return(min_node)
   // Recursive case: the minimum node on the next level does not satisfy the condition on NS.
   // Sub-trees under all new (non-discarded) nodes should be searched.
15 else
16     for i := 1 to NA do
17         if root→next[i]→Tavg < min_Tavg then
18             branch_min_node := find_minimum(root→next[i])
19             if branch_min_node→Tavg < min_Tavg then
20                 min_Tavg := branch_min_node→Tavg
21                 min_node := branch_min_node
22     return(min_node)


To improve the efficiency of the algorithm, a heuristic is applied which restricts the new nodes to have non-increasing NU (lines 8 and 9). The tree is not expanded further along new nodes that do not meet this criterion: once NU and Tavg are calculated for the new nodes (lines 6 and 7), any node with an increased NU compared to the root is immediately discarded. The first node on the second level in Figure 6.6(b) is an example of a discarded node.

The node with the minimum Tavg out of the new nodes is then found in line 10, called min_node. If min_node satisfies the condition on NS (i.e. NU ≤ NS), it is a valid design point and no better solution could be found within the newly created nodes or in their sub-trees (lines 11-14). The third node on the third level in Figure 6.6(b) is an example of a valid design point. Since the exploration started with the best cache configuration for each application, the min_node described above is a potential candidate to be the optimal design point.

Otherwise, if min_node does not satisfy NU ≤ NS, the sub-trees under all next-level nodes have to be searched (lines 15-22) to find a valid node. The search is bounded by the best min_Tavg currently found: only nodes with Tavg less than the current best are expanded (line 17).

Without the heuristic, the worst-case time complexity of Algorithm 5 is $O((N_A)^{N_{CC}})$ and the worst-case space complexity is $O(N_A \times N_{CC})$. However, applying the heuristic, along with starting the exploration at the individual optimal cache configurations, drastically reduces the actual average-case complexities. Moreover, all assessed nodes except the one with min_Tavg are cleared from memory in every recursive step, to further minimize the algorithm's memory footprint.
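Putting the pieces together, the following compact sketch follows the structure of Algorithm 5 (level expansion, the non-increasing NU heuristic and bounding on min_Tavg) using the Node helpers sketched earlier. It is an illustrative reading under those assumptions rather than the thesis implementation, and adds a guard for applications that have no slower configuration left.

import math

def find_minimum(root, sorted_lists, times, NS, state):
    """Recursive tree search in the spirit of Algorithm 5.
    `state` carries the running bound, e.g. state = {"min_Tavg": math.inf}."""
    children = []
    for i in range(len(root.selected_conf)):
        conf = list(root.selected_conf)
        if conf[i] + 1 >= len(sorted_lists[i]):
            continue                                  # no slower configuration left for app i
        conf[i] += 1                                  # next configuration in sorted order
        new = Node(conf,
                   determine_NU(conf, sorted_lists),
                   determine_Tavg(conf, sorted_lists, times))
        if new.NU <= root.NU:                         # heuristic: NU must not increase
            children.append(new)
    root.next = children
    if not children:
        return None
    min_node = min(children, key=lambda n: n.Tavg)
    if min_node.NU <= NS:                             # base case: valid design point found
        state["min_Tavg"] = min(state["min_Tavg"], min_node.Tavg)
        return min_node
    for child in children:                            # recursive case: search bounded sub-trees
        if child.Tavg < state["min_Tavg"]:
            branch = find_minimum(child, sorted_lists, times, NS, state)
            if branch is not None and branch.Tavg < state["min_Tavg"]:
                state["min_Tavg"] = branch.Tavg
                min_node = branch
    return min_node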


6.4 Experiments & Results

A set of experiments was carried out to evaluate the algorithm. Seven groups of eight applications were used, as reported in Table 6.4, with twelve benchmarks from SPEC2006, MiBench and the WCET project (1:jpeg 2:aes 3:adpcm-encode 4:adpcm-decode 5:lms 6:mp3-encode 7:mp3-decode 8:fft 9:fir 10:fdct 11:bzip-compress 12:bzip2-decompress). The first row gives the group number and the second row lists the applications. In all experiments, the number of applications in a group is eight (NA=8) with equal frequency of occurrence (fi=1 ∀i), the number of candidate cache configurations is 315 (NCC=315 from Table 6.2), and the number of switchable configurations in the cache is four (NS=4).

Table 6.4: Application Groups in Experiments

Group Group Group Group Group Group Group A B C D E F G 6 7 3 11 3 10 9 12 10 5 7 9 5 7 12 1 8 7 3 2 5 8 12 9 3 8 9 12 4 1 2 12 2 7 1 4 2 12 3 4 4 11 6 10 5 6 9 10 11 1 6 10 6 2 7 11

Table 6.5: Design Space Exploration Results - Solutions

Group    Tavg (ns)    Chip Area (mm2)    Eavg (pJ)    Speed Improvement
A        1.165        0.0458             3.8          17.8%
B        1.025        0.0335             3.1          16.5%
C        0.932        0.0455             3.7          21.5%
D        1.104        0.0355             3.1          14.9%
E        0.809        0.0455             3.7          26.2%
F        1.002        0.0363             3.6          20.3%
G        1.058        0.0455             3.7          4.7%


Table 6.6: Design Space Exploration Results - Statistics

Group    Nodes Visited    Valid Design Points    Search Time (s)    Exhaustive Search Time (hrs)    Speed-up
A        2,342,817        2,866                  1.085              103.9                           3.4×10^5
B        110,945          752                    0.078              103.8                           4.8×10^6
C        23,945           64                     0.024              103.8                           1.6×10^7
D        6,261,377        7,491                  1.964              103.8                           1.9×10^5
E        10,279           20                     0.016              103.7                           2.3×10^7
F        449              36                     0.009              103.8                           4.1×10^8
G        19,681           100                    0.021              103.8                           1.8×10^7

Table 6.5 presents a summary of the design space exploration results for the seven experiments. The first column gives the application group. Columns two to five respectively present the attributes of the selected design point: average cache access time Tavg; chip area for the caches; average energy consumption per access Eavg; and cache access speed improvement compared to using a fixed cache (with the largest configuration out of all the applications' individual optimal caches in the group). Group E sustained the maximum speed improvement (26.2%).

Statistics regarding the algorithm's performance are reported in Table 6.6. The first column gives the application group. Columns two and three respectively give the total number of tree nodes visited by the algorithm and the number of valid design points considered as potential optimal solutions. The search time taken by the algorithm is reported in column four, and the time taken by the exhaustive search is given in column five. The last column gives the search speed-up of the proposed method compared to the exhaustive search.


The experimentation was done on a machine with a 2.2GHz Intel Xeon processor and 256GB of memory. In the majority of the experiments the algorithm consumed only a fraction of a second to find the solution, except for groups A and D which took approximately one and two seconds respectively. The search starts with each application using its optimal cache configuration, and at each algorithm step Tavg is allowed to degrade slightly by sharing cache configurations among applications. The algorithm finishes faster when the solution has a Tavg relatively close to that of the starting point.

In experiments A and D, Tavg of the final solution is comparatively higher, which makes the algorithm take a slightly longer time.

To assess the optimality of the solutions, exhaustive explorations were conducted on the same design spaces. With the given design space parameters, each exhaustive search took over 103 hours, on the same machine described above. The solutions found by our algorithm were compared with the optimal solutions from the exhaus- tive search and verified to be exactly matching.

It should be noted that a re-configurable cache is unlikely to cover all of the selected unique configurations, as its small design space contains only a limited number of inter-dependent configurations, and would therefore have to settle for higher average memory access times.

Figure 6.7 depicts the design space of application group A with respect to chip area budget (using 32nm technology). Horizontal axes mark the chip area in square millimetres and the vertical axes mark Tavg in nanoseconds. Each design point represents a selected set of configurations. Figure 6.7(a) presents the complete design space, and Figure 6.7(b) focuses on the section of interest containing the absolute optimal design point as well as the Pareto-optimal points (which denote


Figure 6.7: Average cache access time against chip area for the switchable cache in Group A. Each design point represents a set of selected cache configurations. (a) Complete design space. (b) Optimal and Pareto-optimal points. (c) Speed-up for a given area budget, over using largest fixed cache out of all applications’ individual optimal configurations.


the design points where Tavg cannot be further reduced without an increase in chip area). A suitable Pareto point can be chosen when the chip area is constrained to an upper limit. Design points with large caches (area > 0.0458mm2 for group A) yield higher Tavg, as the array look-up time to determine a cache hit increases with cache size. Figure 6.7(c) shows the speed-ups achieved by using a Pareto-optimal configuration for a given chip area budget, compared to using a fixed cache (with the largest configuration out of all applications’ individual optimal configurations). In application group A, speed-ups of up to 17.8% were achieved.

Similarly, Figure 6.8(a) presents the design space of application group A with respect to average energy per cache access, and Figure 6.8(b) focuses on the section of interest containing the absolute optimal design point and Pareto-optimal points. Speed-ups of up to 17.8% were achieved for given energy budgets over using a fixed cache with the largest configuration out of all applications' individual optimal configurations, as shown in Figure 6.8(c).

Figure 6.9(a) depicts the energy-delay-product of the design space with respect to the chip area, whereas Figure 6.9(b) shows the optimal design point and the Pareto front.


Figure 6.8: Average cache access time against average cache access energy for the switchable cache in Group A. Each design point represents a set of selected cache configurations. (a) Complete design space. (b) Optimal and Pareto-optimal points. (c) Speed-up for a given energy budget, over using largest fixed cache out of all applications’ individual optimal configurations.


Figure 6.9: Energy-Delay-Product per cache access against chip area for the switch- able cache in Group A. Each design point represents a set of selected cache config- urations. (a) Complete design space. (b) Optimal and Pareto-optimal points.


6.5 Discussion

6.5.1 Optimizing for Energy

The design space exploration illustrated above minimizes Tavg, treating energy and area as costs. Alternatively, the cache access energy Eavg of the applications can be minimized with the same algorithm, trading off area and performance, as outlined below.

The term Eij can be defined as the cache access energy (per access) for application program Ai using cache configuration Kj. Equation 6.4 describes the system of equations representing the cache access energy of configuration Kj for all applications Ai. HRij represents the cache hit rate achieved by application Ai with cache configuration Kj, and HEj and UEj denote the hit energy and update energy of configuration Kj. Em denotes the access energy of the main memory, which is accessed in case of a cache miss.

\[
E_{ij} = HR_{ij} \times HE_j + (1 - HR_{ij}) \times (E_m + UE_j), \qquad 1 \leq i \leq N_A
\tag{6.4}
\]


\[
E_{avg} = \frac{\sum_{i=1}^{N_A} f_i \times E_{ij}}{\sum_{i=1}^{N_A} f_i}
\tag{6.5}
\]

The goal is to find a set of cache configurations such that the weighted average cache access energy over all application programs is minimized. In other words,

Eavg as given in Equation 6.5 needs to be minimized, where the coefficient fi is the normalized frequency of occurrence for application Ai. Constant values for fi may be determined at design-time based on the expected system behavior, or assumed to be fi=1 when the occurrence frequencies are indeterminable.
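A brief numeric sketch of Equations 6.4 and 6.5; the hit rate, HE, UE and Em values below are placeholder numbers, not measured data.

def access_energy(hr, he, em, ue):
    """E_ij = HR_ij * HE_j + (1 - HR_ij) * (E_m + UE_j)   (Equation 6.4)."""
    return hr * he + (1.0 - hr) * (em + ue)

def weighted_avg_energy(freqs, energies):
    """E_avg weighted by the normalized frequencies f_i   (Equation 6.5)."""
    return sum(f * e for f, e in zip(freqs, energies)) / sum(freqs)

# Two applications sharing configuration j, with placeholder energies in pJ.
energies = [access_energy(0.95, 2.0, 40.0, 1.0),   # app 1: high hit rate
            access_energy(0.80, 2.0, 40.0, 1.0)]   # app 2: lower hit rate
print(weighted_avg_energy([1.0, 1.0], energies))   # 6.875 pJ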

6.5.2 Extended Usage Scenarios for Switchable Cache

Outside the context of Dark Silicon, the concept of switching cache configurations can be extended to many intriguing usage scenarios. This section provides brief insights on several such opportunities.

In multiprocessor systems

In multiprocessor systems where application programs (or tasks) can migrate between processor cores [BABP06, CHC+04], cache switching may be used to prevent loss of cached data and cold re-starts. In such a scenario, more than one configuration in the switchable cache may be active at the same time to serve concurrent applications.


Figure 6.10 illustrates an example of a switchable cache being used in a multiprocessor system. The switchable cache needs to employ multiple data ports to support several processors at the same time. The design of the switchable cache control may be modified in such a way that the port-to-cache-configuration mapping can be altered through control signals, initiated by individual processors or a centralized controller (see Figure 6.11).

The design space exploration problem, when using a switchable cache with a multiprocessor system, is much the same as described in Section 6.3. However, considering concurrent execution of multiple applications can add further complexity to the design problem.


Figure 6.10: Potential usage of a multi-port switchable cache in multiprocessor system. Application B migrates from CPU 2 to CPU 4, while still using the same cache configuration 3.


Figure 6.11: Overview of a switchable cache with multiple data/address ports.

In phase & data dependent optimizations

Many research works have targeted hardware optimizations to exploit the different execution phases present within application programs [ICM06, IKGC11, SIKN07]. Most such works aim to reduce energy consumption by the processor through various optimizations. It is also possible to employ cache switching between different execution phases of the same application program, in order to improve performance as well as reduce energy consumption.

Figure 6.12: Example of switching caches between different phases in an application's execution.


Figure 6.12 shows a simple example of an application's execution timeline with different phases. Optimal cache configurations for each phase may be identified through trace-driven simulations on each phase. Cache switching can be done at identified points between phases, to always obtain optimal cache performance. The immediate issue here is the loss of cached data between phases, if only one configuration is kept active at a time. To mitigate this issue, either efficient data migration is required, or all cache configurations could be kept active in parallel to get the best possible performance at the cost of power and chip area.

Furthermore, similar optimizations could be performed on highly data-dependent streaming pipeline processing systems, such as jpeg image compression. Similar to works such as [JSPH11], pre-processing can be used to identify the nature of input data blocks at the beginning, and the information could be used to select suitable cache configurations for the processors in the system.

Figure 6.13 shows an example scenario where data-dependent cache switching may be employed.

Figure 6.13: Example of using cache switching in a pipelined multiprocessor system.

The system contains six processors, each with a switchable cache, arranged in four pipeline stages. Data blocks enter the pipeline from the left end and leave from the right end. The pre-processing stage determines the suitable cache configuration for each of the processors for the incoming data block, and records that information. CPUs in the pipeline may then use the recorded information to perform cache switching before processing each data block.

In critical systems, to improve reliability

Another potential extended application is in critical/secure systems, as a mode of redundant caching to improve reliability and protection against attacks targeting cached data. More than one cache configuration can be kept active and used by the same processor, with the cached data duplicated between the caches. The redundant data may potentially be used to detect system failures or malicious attacks, and subsequently for recovery purposes as well.


6.6 Summary

In this chapter, a cache architecture was proposed where individual applications can use different cache configurations which optimize all applications' memory access performance by leveraging the unavoidable Dark Silicon in future chips. Experimental data were provided to show that having the ability to switch between cache configurations can provide substantial performance gains in multi-programmed environments.

Further, a rapid design space exploration algorithm was presented to identify the set of optimal cache configurations for a switchable cache, based on the group of applications that is expected to be executed on the system. The proposed formulation of the search tree allowed significantly fast search times to be achieved. Through rigorous testing, the algorithm was shown to be able to quickly find the solution, and the accuracy of the results was verified through comparison against exhaustive search. The concept of the switchable cache can be further extended to multiprocessor systems, phase-and-data-dependent systems, and critical and secure systems, which opens up many new avenues to be explored.

Chapter 7

Answer Set Programming in Cache Design Space Exploration

7.1 Introduction

Design optimization of caches consists of a diverse set of problems, which include: optimizing individual caches [LM08]; optimizing hierarchical caches [ZGR11]; configuring multiprocessor caches [NJR+]; and tuning reconfigurable and switchable caches [NJRP15]. Due to the number of configurable cache parameters and the associated value ranges, typical cache design optimization problems concern large design spaces of up to trillions of design points. Exploring such design spaces within reasonable search times requires specialized algorithms.

The tuning of switchable caches (or reconfigurable caches) is such a design problem that presents a vast design space to be explored, as discussed in Chapter 6. As using

large cache configurations rarely provides the best performance [SJP13], the switchable cache architecture allows several pre-determined configurations to be on board for a given cache, and dynamically switches between them according to the application under execution. Only one configuration is active at a given time while the others are kept powered off, to exploit the additional area available in chips with dark silicon. Programs have the ability to activate the assigned optimal cache configuration using a control signal, allowing each application to gain the best cache access performance.

The design problem calls for selecting the best set of configurations for a switchable cache, to be used in a system executing a group of application programs. With more applications in the system than the number of switchable cache configurations, an optimal set of cache configurations has to be selected for the complete application group. A problem instance with eight application programs, four switchable configurations and a pool of 315 candidate configurations to select from forms a design space with 26.38 trillion design points (see Section 6.3.2). Switchable cache tuning can be identified as an NP-Hard optimization problem, since the knapsack problem can be understood as a special case of it.

A brute-force exhaustive search for the above design space instance takes in excess of 100 hours to find the optimal solution, as per the experimental data provided in Section 6.4. The heuristic tree search algorithm presented in Chapter 6 can quickly minimize the cache access times for the group of applications. However, the search does not guarantee the optimal solution to the problem, and search times can vary from seconds to hours. A robust design tool should employ a more consistent exploration methodology, where the optimal solution can be guaranteed within a reasonable time for every problem instance.


Answer Set Programming (ASP) is a declarative programming technique that is primarily aimed at solving difficult NP-Hard problems while guaranteeing optimality. This chapter explores the use of ASP to guarantee the optimal solution for the switchable cache tuning problem within reasonable search time, as opposed to fast but sub-optimal problem-specific heuristics and slow exhaustive searches.

Highlights:

• This chapter proposes an ASP-based problem encoding for the NP-Hard switchable cache tuning problem, which guarantees optimal solutions and consistent search times.

• The performance and optimality of the ASP-based method are compared against problem-specific heuristic and exhaustive search algorithms.

• ASP search and optimization strategies are investigated as reliable cache design space exploration methods, through extensive experiments.

The rest of the chapter is organized as follows: Section 7.2 briefly reviews prior applications of ASP in related problem domains; Section 7.3 defines the target design problem; an overview of Answer Set Programming and the problem encoding are presented in Section 7.4; and Section 7.5 presents experimental data and results.

7.2 Related Applications of ASP

Answer Set Programming has been used in a number of design space exploration problems in recent years. Coban et al. [CTE08] represented a wire routing problem


(in circuit placement and routing) using three different declarative techniques: Answer Set Programming (ASP); Constraint Programming (CP); and Integer Linear Programming (ILP), and then compared the formulations for knowledge representation and computational efficiency. The authors of [CTE08] have shown that the ASP formulation performs up to 96 times faster than the ILP formulation. In their experiments, CP struggled to solve the problems within a reasonable time. A similar finding was made by Ishebabi et al. [Ish09] for solving multiprocessor synthesis problems. They have shown that an ASP formulation is faster by up to three orders of magnitude compared to an ILP formulation (a few seconds versus up to 8 hours).

Muhlbauer et al. [MGB11] presented an automatic system synthesis in a hardware/software co-design problem on an FPGA that focuses on handling streaming data. Cilardo et al. [CSM14] proposed an automated ASP-based method to perform an optimized mapping in a Simulink to MPSoC translation. The constraints optimized in the DSE are resource utilization and execution time. The authors show that ASP not only converges to the optimal solution, but also does so within reasonable execution time.

Yonga et al. [YMB15] presented an ASP formulation for the synthesis of a heterogeneous system-on-chip based distributed camera network. The authors have concluded that the use of ASP has helped them to overcome the size explosion and exponential synthesis time problems encountered in other synthesis approaches, such as ILP or CP.

None of the works mentioned above focus on ASP-based cache design space exploration. In other domains, ASP-based formulations have shown guaranteed optimality within reasonable search times, as an alternative to approximate approaches such

as application-specific heuristics. Moreover, ASP is found to be more efficient than ILP, as it has been shown to overcome the size explosion and exponential synthesis time problems. Therefore, for the first time, this chapter proposes and studies an ASP-based method to solve the cache design space exploration problem and investigates the suitability of different ASP search and optimization strategies.

7.3 Problem Formulation

Given a switchable cache system with a constrained number NS of switchable configurations, NCC candidate cache configurations (each configuration represented by Kj with a known hit latency HLj and a known update latency ULj, where 1 ≤ j ≤ NCC and NCC > NS), and NA application programs (each program represented by Ai, where 1 ≤ i ≤ NA and NA > NS, executed at a normalized frequency fi with 0 ≤ fi ≤ 1), select the set of NS cache configurations such that the average cache access time for the given set of application programs is optimal.

Note: An instance with $N_{CC} = 315$, $N_A = 8$ and $N_S = 4$ has a design space of 26.38 trillion points (${}^{315}C_{4} \times 4^{8}$).
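The quoted design space size can be reproduced directly from these parameters; a one-off check (the variable names are illustrative):

from math import comb

N_CC, N_A, N_S = 315, 8, 4
# Choose N_S of the N_CC candidate configurations, then assign one of the
# N_S selected configurations to each of the N_A applications.
design_points = comb(N_CC, N_S) * N_S ** N_A
print(design_points)    # 26,375,932,477,440  (about 26.38 trillion)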

The term Tavg, given in Equation 7.1, is defined as average cache access time over the group of applications, which is the objective function to be minimized.

\[
T_{avg} = \frac{\sum_{i=1}^{N_A} T_{ij}}{N_A}
\tag{7.1}
\]


\[
T_{ij} = HR_{ij} \times HL_j + (1 - HR_{ij}) \times (T_m + UL_j), \qquad 1 \leq i \leq N_A
\tag{7.2}
\]

The term Tij represents the cache access time for application program Ai using cache configuration Kj. System of equations 7.2 describes Tij for all applications and all cache configurations. HRij denotes the cache hit rate achieved by application Ai with cache configuration Kj. Tm denotes the access time for the main memory, which is accessed in case of a cache miss.

7.4 Answer Set Programming (ASP)

7.4.1 Overview

Answer Set Programming provides a declarative framework for representing and reasoning about logical problems. It has a well defined formal set of semantics based on the logic programming stable model semantics [GL88].

While ASP was originally developed for modeling problems in the artificial intelligence (AI) sub-field of knowledge representation and reasoning, it has gained broad

interest due to the development of high-performance ASP reasoners. Two of the most prominent of these are the Potassco collection of solvers [GKK+11] and the DLV system [LPF+06].

An important feature of ASP is that it allows problems to be represented in a compact and intuitive manner using a syntax very similar to that of a Prolog [CM03] program. An ASP program consists of a set of rules and facts. Rules can contain first-order logic variables, and consequently, can be used to compactly represent relations between categories of objects. Facts are a set of literals describing the state of the problem space.

Given an ASP program, its answer sets consist of minimal sets of facts that are con- sistent with the program. Intuitively, these answer sets represent possible solutions to the problem. Furthermore, an ASP program can contain optimization statements that express cost functions over the facts in the answer sets [SNS02]. Through such statements, preferences over the possible solutions can be specified and an optimal solution can be identified.

Computing the answer sets of a logic program is undertaken in two stages. In the first stage the rules of the logic program are grounded, such that all logic variables are replaced with their ground instances based on the provided facts. The second stage consists of solving, where the answer sets of the grounded logic program are generated and any optimization statements are applied. It is particularly important to keep the distinction between these two stages in mind when modeling a problem. The reasoner is most effective during the solving stage; when the many optimization techniques and search strategies provided by the solver can be employed. However, a poorly chosen problem representation can result in a blow out during grounding,

where the grounder fails to terminate in a timely manner or may generate a combinatorial explosion in the size of the resulting ground program. For a discussion of the practical aspects of modeling problems with ASP, the interested reader is referred to [GKKS12].

7.4.2 Problem Encoding in ASP

The switchable cache tuning problem can be encoded as an ASP logic program, as described below. There are three distinct components in the problem encoding: input data to the problem (facts) ; logic to generate the candidate solutions (answer sets); and logic to declare the optimization criteria. Input data to the switchable cache tuning problem can be defined as a list of facts in the following format:

fact(application, cache_config, access_time).    (7.3)

Facts are provided in the format given by Equation 7.3, containing cache access times (Tij) for every application using every cache configuration. For example, the literal fact(app05,cache202,12301) says application program app05 using cache configuration cache202 achieves an average access time of 1.2301 nanoseconds. A total of NCC × NA literals are required to describe the problem space.
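A small sketch of how such a fact base might be generated from the measured access times. The file name, the appXX/cacheYYY naming scheme and the fixed-point scaling (1.2301 ns encoded as 12301, since the solver's aggregates operate on integers) follow the example literal above but are otherwise assumptions.

def write_facts(times, path="facts.lp", scale=10_000):
    """Emit one fact/3 literal per (application, configuration) pair."""
    with open(path, "w") as fp:
        for i, row in enumerate(times, start=1):
            for j, t in enumerate(row, start=1):
                fp.write(f"fact(app{i:02d},cache{j:03d},{round(t * scale)}).\n")

# times[i][j] would come from the hardware simulations combined with CACTI data.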

The answer set generation is specified in such a way that the objective function (Tavg) can be evaluated for each stable model. The rule given in Equation 7.4 generates the candidate solutions (i.e., the answer sets); it states that every solution must consist of between one and four (inclusive) selected cache configurations. The objective function (Tavg in Equation 7.1) needs to be evaluated on each solution.


1 { selected_cache(K) : fact(_, K, _) } 4.    (7.4)

Next, the logic for the optimization criteria is formulated. The rules stated in Equations 7.5-7.7 declare the logical criteria for this evaluation. A predicate is defined that encodes the minimum access time per application for a given cache configuration. This minimum time can be defined with respect to the set of all access times for that application, by finding the time that is not dominated by a lower access time.

performance(A, T) :- fact(A, K, T), selected_cache(K).    (7.5)

non_minimal_time(A, T2) :- performance(A, T1), performance(A, T2), T1 < T2.    (7.6)

minimal_time(A, T) :- performance(A, T), not non_minimal_time(A, T).    (7.7)

Equation 7.5 defines a predicate for the performance of an application over all selected caches in a solution. Equation 7.6 says that out of any two instances of the performance predicate, the one with the higher access time is non-minimal. The rule in Equation 7.7 creates a double negation to find the instance of the performance predicate with minimal access time.


When selecting $N_S$ cache configurations, ${}^{N_{CC}}C_{N_S} \times N_S \times N_A$ propositions will be enumerated at the grounding stage for the rule minimal_time(A, T) given in Equation 7.7. This amounts to over 12 billion propositions with $N_{CC} = 315$, $N_A = 8$ and $N_S = 4$. In the experiments done in this chapter, the ASP grounding stage took 5.2 seconds on average to ground the program and create a propositional database of size 35.9MB.

#minimize { T,A : minimal_time(A, T) }.    (7.8)

Finally, the optimization criterion is given in terms of a cost function that accumulates the value of the minimum application access times for each cache configuration (Equation 7.8). The optimal solution will be the set of cache configurations that is minimal over all answer sets.

Alternatively, Equation 7.8 may be replaced with Equations 7.9 and 7.10, where the minimal Tavg could be explicitly calculated for each selection of caches, before searching for the optimal solution.

minimal_avg_time(Tavg) :- Tavg = #avg { T,A : minimal_time(A, T) }.    (7.9)

#minimize { T : minimal_time(A, T) }.    (7.10)

However, the averaging statement creates ${}^{N_{CC}}C_{N_S} \times N_S^{N_A}$ propositions at the grounding stage for the predicate minimal_avg_time(Tavg), which is a staggering 26.38 trillion with $N_{CC} = 315$, $N_A = 8$ and $N_S = 4$. This essentially leads to a blow-out situation, as described in Section 7.4.1, consuming excessive time and memory during the grounding stage. This difference between encodings highlights the importance of careful formulation of the problem representation.


7.5 Experiments & Results

This section presents experimental data in order to compare the use of an ASP solver against the heuristic tree search algorithm and an exhaustive brute-force search. The state-of-the-art ASP solver Clasp [GKS12] version 4.5.3 was used in all the experiments presented in this chapter. The execution environment used for the experiments consists of four 8-core Intel Xeon processors (32 physical cores / 64 virtual cores) working at 2.2GHz, and 256GB of memory. Twelve benchmark applications were used from the suites SPEC2006, MiBench, and the WCET project (jpeg, aes, adpcm-encode, adpcm-decode, lms, mp3-encode, mp3-decode, fft, fir, fdct, bzip-compress, bzip2-decompress).

In all experiments, the number of applications in a group was set to eight (NA = 8), the number of candidate cache configurations was set to 315 (NCC = 315 from Table 7.1) and the number of switchable configurations in the cache was set to four (NS = 4). With these parameters, the design space consists of 26.38 trillion design points. Columns one to three in Table 7.1 respectively present the ranges of cache block sizes, set sizes and associativities as powers of two.

Values for the terms HLj and ULj for all candidate configurations were obtained from the CACTI 6.5 [MBJ07] cache analysis tool. To accurately determine the cache hit rates HRij for all applications on all candidate configurations, rapid parallel hardware simulations were performed using an Altera Stratix V GX FPGA device and NIOS II/f embedded processors.

Table 7.1: Candidate Cache Configurations

Block Sizes (Bytes)    Set Sizes          Associativities    No. of Configs
2^b (2 ≤ b ≤ 8)        2^s (0 ≤ s ≤ 8)    2^a (0 ≤ a ≤ 4)    315

7.5.1 Comparison of ASP & Heuristic Searches

Applications were grouped into sets of eight, as given in Table 7.2. To compare the performance of the Clasp ASP solver with the encoding described in Section 7.4.2

Table 7.2: Application Groups

Group    Applications
1        mp3-e, mp3-d, adpcm-e, adpcm-d, jpeg, aes, bzip2-c, bzip2-d
2        adpcm-e, fdct, fir, aes, mp3-d, jpeg, bzip2-d, adpcm-d
3        fdct, lms, mp3-d, aes, bzip2-d, adpcm-e, fir, adpcm-d
4        lms, mp3-d, bzip2-d, adpcm-d, bzip2-e, mp3-e, jpeg, fdct
5        fft, mp3-d, adpcm-e, lms, mp3-e, fir, aes, fdct
6        lms, fft, bzip2-d, bzip2-e, jpeg, mp3-e, fir, fdct
7        adpcm-e, fft, fir, mp3-e, aes, mp3-d, bzip2-d, bzip2-e
8        mp3-e, mp3-d, adpcm-e, fft, jpeg, aes, fdct, bzip2-d
9        jpeg, aes, fdct, bzip2-d, mp3-e, adpcm-d, adpcm-e, fft
10       fir, fdct, bzip2-d, adpcm-d, adpcm-e, fft, jpeg, aes

against the heuristic search algorithm, both methods were applied to each application group. Additionally, exhaustive brute-force searches were performed on all application groups to verify the optimality of the solutions achieved by the other two search methods.

Table 7.3 presents the CPU times consumed and the optimality achieved by all three search methods. The first column shows the application group, while the second column displays the time taken by the exhaustive search to find the optimal solution. Columns three and four respectively present the search time taken by the heuristic search and whether the final solution is optimal. Similarly, the search time of the ASP solver and the solution's optimality are given in columns five and six.

Table 7.3: Search Times and Optimality

Application    Exhaustive       Heuristic Search             ASP Solver
Group          Search Time      Search Time    Optimal       Search Time    Optimal
1              103.9 hrs        1.085 sec      yes           6.19 mins      yes
2              103.8 hrs        0.078 sec      yes           8.43 mins      yes
3              103.8 hrs        0.024 sec      yes           2.90 mins      yes
4              103.8 hrs        1.964 sec      yes           4.38 mins      yes
5              103.7 hrs        0.016 sec      yes           3.29 mins      yes
6              103.8 hrs        0.009 sec      yes           1.73 mins      yes
7              103.8 hrs        0.021 sec      yes           3.30 mins      yes
8              103.7 hrs        6.310 hrs      yes           11.37 mins     yes
9              103.6 hrs        4.730 min      no            6.16 mins      yes
10             103.7 hrs        1.863 hrs      no            5.61 mins      yes


From the results, the heuristic tree search performs much faster for application groups 1 to 7 (in under two seconds). However, for application groups 8 and 10, its search times are several orders of magnitude slower (6.31 hours and 1.86 hours respectively) than those of the ASP solver, and it fails to find the optimal solution at all in groups 9 and 10. To achieve fast search times, the heuristic search starts with each application assigned its optimal cache configuration, and progressively reduces the number of unique configurations (NU) in the selection until the criterion NU ≤ NS is met. An application's selected configuration is changed to the next-fastest-in-order only if that change reduces NU. This condition in the heuristic may cause the search to ignore certain tree branches. On the rare occasions where the optimal solution resides deep down a tree branch and the heuristic causes that same branch to be ignored, the search may converge to a sub-optimal solution. Failures can also occur due to the search traversing deep down certain branches before encountering bounds, in cases where the final cache access times for applications are significantly different to the starting values.

In contrast, the search times for the ASP solver are consistent, in the range of 1.73 minutes to 11.37 minutes, with an average of 5.33 minutes. The solutions found by the ASP solver are optimal in every problem instance.

Figure 7.1 presents a closer look at the search times taken for application groups 8, 9 and 10. The horizontal axis marks the application groups, and the vertical axis represents the search time on a logarithmic scale. Altogether, the results reveal that the ASP solver achieves consistent and reasonably fast search times along with better optimality, while the problem-specific heuristic falls short on both aspects in certain problem instances.


Figure 7.1: Comparison of search times for application groups 8, 9 and 10.

The Clasp ASP solver was configured to use 64 threads in the above comparisons. Parallelizing the brute-force exhaustive search in a similar manner is indeed possible, with guaranteed optimal solutions similar to ASP. However, even if a theoretically maximal parallel efficiency were assumed, an exhaustive search would still be significantly slower (approximately 1.6 hours) compared to the ASP solver.


7.5.2 ASP Search Strategies & Parallelism

The ASP solver Clasp [GKS12] inherently incorporates different search strategies and the ability to use up to 64 parallel threads to perform the solving. The available search strategy options are listed below. The interested reader is referred to [GKKS12] for further discussion on the search and optimization strategies.

• auto - strategy based on problem type

• frumpy - conservative defaults

• jumpy - aggressive defaults

• tweety - defaults geared towards ASP problems

• handy - defaults geared towards large problems

• crafty - defaults geared towards crafted problems

• trendy - defaults geared towards industrial problems

• many - default portfolio

The ASP encoding from Section 7.4.2 was tested under all 8 strategy options of Clasp and thread counts of 1, 8, 16, 32 and 64, for the switchable cache tuning problem on application group 8. The resulting search times are presented in Table 7.4 and compared in Figure 7.2. The horizontal axis in Figure 7.2 represents the thread count, while the vertical axis represents the search time. Average CPU time per thread is reported in order to make the comparison independent of the execution environment.
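For reference, a hedged sketch of how such a sweep could be scripted, assuming the standard clingo command-line front end (which embeds Clasp) with its --configuration and --parallel-mode options; the file names are placeholders for the encoding and fact files described in Section 7.4.2.

import subprocess

def run_solver(strategy="jumpy", threads=16, files=("encoding.lp", "facts.lp")):
    """Run one (strategy, thread-count) combination and return the solver output.
    Assumes a clingo binary on the PATH; --configuration selects the Clasp search
    strategy and --parallel-mode sets the number of solving threads."""
    cmd = ["clingo", *files,
           f"--configuration={strategy}",
           f"--parallel-mode={threads}"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

for strategy in ("jumpy", "handy", "trendy"):
    print(f"--- {strategy} ---")
    print(run_solver(strategy, threads=16))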


Table 7.4: ASP - Search Times (in Minutes) for Multiple Threads & Search Strategies

             Number of threads
Option       1        8        16       32       64
auto         458.1    551.9    672.5    633.3    290.7
frumpy       24h+     24h+     24h+     24h+     24h+
jumpy        113.1    28.9     20.7     17.3     11.4
tweety       343.1    119.1    82.6     53.1     25.6
handy        135.1    43.7     27.1     15.1     11.4
crafty       249.1    63.8     66.3     32.0     38.5
trendy       163.0    52.1     45.0     25.9     17.1
many         456.9    492.6    683.4    610.7    285.0

As expected, the majority of the search strategies benefit from an increasing number of threads. Strategy jumpy, with an aggressive search configuration, achieved the fastest search times, closely matched by the handy strategy (which is specifically aimed at large problems), especially at higher thread counts. Strategy trendy, which is aimed at solving industrial problems, also sustained reasonable search times. Notably, strategy frumpy, which uses a conservative search configuration, failed to converge after 24 hours in each of the tests.

The behavior of strategies auto and many contrasted with the others. Search times increased with thread counts up to 16 and reduced thereafter, as can be seen in Figure 7.2. The increases in search time are caused by different threads using different search and optimization strategies. While multiple strategies can lead to greater robustness, in some cases the information being generated by the

different optimization strategies can also result in negative interference, for example by flooding some threads with unhelpful information. This could potentially cause a thrashing scenario where extra threads become an overhead. However, with higher thread counts such as 32 and 64, enough threads in the system may share the same search and optimization strategies. Therefore, the benefits drawn from the information generated by other threads could outweigh the overheads.

Figure 7.2: Comparison of search times for application group 8, using different ASP search strategies and multiple threads.


To gain further insight, the times taken by different ASP search strategies to locate the optimal solution, as opposed to verifying the solution's optimality, were analyzed. An allocation of 16 execution threads was assigned for each of the tests. Figure 7.3(a) reports the actual search time spent by each option to arrive at the optimal solution, while Figure 7.3(b) presents the normalized times in proportion to the total search times.


Figure 7.3: Search times spent to: (a) find the optimal solution; (b) Verify the optimality of the solution.


The breakdown of search times reveals that most search strategies arrive at the optimal solution quickly, in approximately one minute. The exceptions were the crafty and trendy options, which took the longest, at 12.2 and 9.0 minutes respectively. A significantly large fraction of the search time (between 80.00% and 99.94%) was spent by all search strategies on guaranteeing the optimality of the answer.

Interestingly, option many was the fastest (0.4 minutes) to find the solution, even though it was the slowest to complete the search (683.4 minutes). The fraction of time spent by option many to arrive at the solution was a mere 0.06% of the total; the remaining 683 minutes were spent completing the search and verifying optimality. Similarly, option auto took only one minute to find the solution, which is only 0.15% of its total search time (672.5 minutes).

Further investigations were performed on the search strategies and the answers (stable models) they select in the optimization process. It was noted that the starting models for strategies many and auto were generally in the vicinity of the optimal solution (error < 50%) in most cases. However, the starting models for the other strategies contained errors several orders of magnitude larger compared to the optimal, which is likely why they take relatively longer to converge.


7.6 Summary

This chapter presented an ASP-based problem encoding to perform design space exploration for the switchable cache tuning problem, using the state-of-the-art ASP solver Clasp [GKS12]. The search performance of the presented DSE method was compared against the problem-specific heuristic tree search proposed in Chapter 6. Through extensive experiments, it was demonstrated that the proposed method is indeed reliable in the context of cache design optimization. The experimental results show that consistent and reasonable search times can be achieved using ASP, in addition to guaranteed optimal solutions. Further, the thread-level parallelism and the different search and optimization strategies of ASP were evaluated, and certain optimization strategies were demonstrated to provide far better search performance than others in cache design space exploration.

Chapter 8

Conclusion

Cache memories play a vital role in improving memory access performance for modern SoCs and MPSoCs. Cache performance and energy consumption depend not only on the cache's architecture and configuration, but also on the sequence of memory accesses observed by the cache. The combination of processor micro-architecture and the application programs executed on the system determines the memory access sequence. Consequently, cache performance becomes highly application dependent. Applications executed on embedded processing systems are typically known a priori at design-time, thus presenting an opportunity to tune a system for better cache performance. Such application specific cache optimizations can allow processor systems to achieve better memory access performance and mitigate memory related bottleneck issues.

Design-time optimizations of cache memories involve exploring vast design spaces, especially for multiprocessor systems. With the parameters block size, set size and associativity, the design space of a single cache can contain hundreds of configurations. Accurately exploring a cache design space requires the memory access trace received by the cache to be extracted and used in simulations to count the hit rates sustained by different configurations. Such cache simulations are highly time intensive, largely due to the costs of memory access trace extraction.

A design space for a generic multiprocessor cache hierarchy is the cross product of the sub-design-spaces of all individual caches. With typical cache parameter ranges, such a design space can contain several trillions of design points. Moreover, exploring such a design space requires multiple memory access traces (for different caches) to be obtained multiple times, due to the dependencies between connected caches. Thus, proper explorations of generic multiprocessor cache hierarchies have seldom been attempted. Prior research works, as well as designers in practice, use approximations such as sampled fractions of memory access traces, or design space pruning, producing sub-optimal results within feasible time frames.

This dissertation presented the first hardware-based framework to rapidly explore generic multiprocessor multi-level cache hierarchy design spaces using parallel simulation of multiple cache configurations. The framework couples an FPGA device to the design process, which houses specialized hardware modules to count cache hits for multiple cache sub-design-spaces simultaneously. The proposed framework was able to achieve up to 456 times faster simulation times compared to the fastest known software-based multiprocessor multi-level cache simulation tool. The seamless integration of the simulator modules into the MPSoC under investigation eliminates the need to pre-extract memory access traces, which was a major limitation in prior software-based methods, and also allows access contention on shared caches to be captured effectively. Real-time trace extraction and accelerated simulation in hardware, combined with the flexibility to connect hardware simulator modules at different places in the memory hierarchy, enable generic MPSoC cache hierarchies to be explored with ease, whereas prior methods were predominantly limited to two cache levels.

Using the benefits of the hardware-based simulation framework, a novel design space exploration algorithm was presented to explore an unprecedented portion of the massive multiprocessor multi-level cache hierarchy design space. The new algorithm is evolutionary in nature, and traverses a cache hierarchy in several iterations of carefully crafted steps until convergence is achieved. The experimental results show that the iterative exploration algorithm was able to improve an MPSoC's average cache access time by up to 18.9%, while simultaneously reducing the total cache size by up to 74.15% compared to state-of-the-art methods. Convergence and stability of the proposed method were thoroughly evaluated through extensive testing. The optimality of the results was empirically demonstrated, which was lacking in prior works, by comparing against an exhaustive search over a design space instance.

The switchable cache architecture was introduced for multi-programmed environments where several applications use the same processor and cache memory. The proposed architecture exploits the Dark Silicon available in future chips in order to house several configurations within the same cache, where only one configuration is activated at a time while the remaining ones are kept in Dark Silicon. The processor is given the ability to switch between the available cache configurations at run-time, in order to achieve better application dependent cache performance. With minute overheads, the switchable cache concept can provide improved cache performance across all applications in the system, as opposed to using a fixed cache. Compared to run-time re-configurable caches, which impose high overheads and cover a limited selection of cache configurations, the proposed method can provide memory access performance gains to a large number of application programs.


A design space exploration problem presents itself when many applications share a switchable cache with a limited number of concurrent configurations. To this end, a new design-time algorithm was presented to rapidly pre-determine the optimal or a near-optimal set of cache configurations for a switchable (or re-configurable) cache, for a given group of applications. The presented algorithm is the very first in the direction of design space exploration associated with switchable caches, and performs a heuristic tree search starting with each application using its individual optimal cache configuration. Using the data provided by the hardware cache simulators, the proposed heuristic algorithm could rapidly find the solution, in under two seconds in most experiments. Alternative design space exploration methods were examined by employing the declarative logic programming language Answer Set Programming. ASP provides the benefit of not having to declare the method to find the solution, requiring only that the problem be described in an efficient manner. The presented work is the first use of ASP to solve large cache design optimization problems, where optimal solutions are guaranteed reliably within reasonable search times.

In conclusion, this dissertation presented a selection of novel application specific cache design optimization techniques along with a hardware acceleration framework and a cache architecture to improve memory access performance in multiprocessor multi-level cache hierarchies and multi-programmed systems. In addition to significant performance improvements, the presented contributions enable accurate cache design space explorations to be practical and feasible without compromising the time-to-market of modern embedded computing devices. By enabling fast and thorough exploration of vast cache design spaces, designers and researchers are granted the ability to gain invaluable insight into application dependent cache behaviour, furthering the domain of cache memory optimization.


8.1 Future Work

With the proposed technology and concepts in cache design optimization, many new avenues open up which present intriguing prospects for further improvements. Such extensions include accommodating advanced concepts such as block pre-fetching, cache partitioning, etc. in the simulation, which can affect the hit rate and access/update latencies, in addition to the possibility of the proposed search algorithms being used in other problem domains. One of the most prominent avenues to explore is the inclusion of the effects of cache coherency management in simulation.

8.1.1 Simulating Cache Coherency

Coherency management comes into the picture when multiple processors use individual private caches but share access to common data blocks. In such environments, scenarios may arise where two or more processors simultaneously hold copies of a given data block in their private caches. One processor updating or writing to such a data block renders the copies present in the other caches obsolete or stale.

Practical systems implement various mechanisms to maintain coherence among cached shared data, such as directories, snooping and snarfing, all of which require additional logic circuitry and communication buses, and introduce delays on the critical path. In general, coherency management protocols can be categorized into two classes: write invalidate and write update. In write invalidate protocols, the remaining copies are marked as invalid once one copy is updated, by listening to all memory access addresses. Write update protocols use both address and data information to update all the remaining cached copies as soon as one copy is written. In comparison, write update protocols promise better utilization of cache space, albeit with high communication and logic overheads, while write invalidate protocols are far simpler to implement yet suffer from higher cache miss rates. The latter is usually preferred in embedded devices, where simplicity is highly desirable.
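
As a toy illustration of the two protocol classes (and not of any specific protocol such as MESI), the sketch below models each private cache as a dictionary and shows how a write either invalidates or updates the peer copies; all names are invented for illustration.

```python
# Illustrative sketch only: protocol semantics, not a full coherence protocol.
# Each private cache is modelled as a dict {block_address: data}.

def write(caches, writer, addr, value, policy="invalidate"):
    """Apply a write in `writer`'s cache and keep peer copies coherent."""
    caches[writer][addr] = value
    for owner, cache in caches.items():
        if owner == writer or addr not in cache:
            continue
        if policy == "invalidate":
            del cache[addr]          # peer copy becomes stale -> later miss
        else:                        # "update": push new data to every copy
            cache[addr] = value      # extra bus traffic, but no future miss

caches = {"cpu0": {0x100: 1}, "cpu1": {0x100: 1}}
write(caches, "cpu0", 0x100, 42, policy="invalidate")
assert 0x100 not in caches["cpu1"]
```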

Including the effects of a given cache coherence mechanism in the simulation of a multiprocessor cache design space is a challenging task, especially prior to the introduction of hardware assisted simulation. With real-time parallel simulation of multiprocessor caches now possible in hardware, an opportunity opens up to use the real-time memory access sequences from parallel processors to simulate coherency management. Such functionality could potentially be achieved through either centralized or distributed evaluation of the access addresses received by the parallel simulator modules. A centralized control body could monitor every memory access generated by every processor in the MPSoC and produce control signals directing individual cache simulator modules to either update or invalidate the corresponding address tags in all simulated configurations. In a distributed scheme, all collaborating simulator modules could share access addresses among themselves to achieve similar functionality. If a unique and known portion of the address space is used for shared data in the multiprocessor system, monitoring the addresses and generating the signals would become considerably less complex.
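
A minimal software sketch of the centralized alternative follows, purely to illustrate the idea; the class names, the shared-address range and the per-module interface are assumptions, not part of the proposed hardware framework. A monitor observes every simulated access and, for writes falling in a known shared-data region, signals the other simulator modules to invalidate the matching tags.

```python
# Hedged sketch of a centralised coherency monitor for simulator modules.
# All names and the shared-data address range are invented for illustration.

SHARED_RANGE = range(0x8000_0000, 0x8010_0000)  # assumed shared-data region

class CacheSimModule:
    """Stand-in for one per-processor cache simulator module."""
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.tags = set()            # addresses currently held (all configurations)

    def access(self, addr):
        self.tags.add(addr)

    def invalidate(self, addr):
        self.tags.discard(addr)

class CoherencyMonitor:
    """Observes every access and issues write-invalidate style control signals."""
    def __init__(self, modules):
        self.modules = modules

    def observe(self, cpu_id, addr, is_write):
        self.modules[cpu_id].access(addr)
        if is_write and addr in SHARED_RANGE:
            for m in self.modules.values():
                if m.cpu_id != cpu_id:
                    m.invalidate(addr)   # direct peer modules to drop the tag

mods = {i: CacheSimModule(i) for i in range(2)}
mon = CoherencyMonitor(mods)
mon.observe(0, 0x8000_0040, is_write=False)
mon.observe(1, 0x8000_0040, is_write=True)
assert 0x8000_0040 not in mods[0].tags
```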

8.1.2 Future of Run-Time Cache Switching

The switchable cache architecture proposed in Chapter 6 exploits Dark Silicon in order to perform application dependent cache optimizations. Outside the context of Dark Silicon, the concept of switching cache configurations may potentially be extended to many other use cases, as discussed in Section 6.5.2.

For example, multiple configurations in a switchable cache could be kept active simultaneously in order to cater for multiprocessor systems where application programs can migrate between processor cores. A migrating application could use the switching functionality to continue with the same cache configuration it used prior to the migration, retaining its optimal cache access times while avoiding the loss of cached data and a cold re-start with a new cache memory.
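
To make the idea concrete, here is a minimal sketch with invented names (Core, switch_cache and the task record are illustrative only, not interfaces from this thesis): the migration handler simply re-activates, on the destination core, the configuration the task was tuned for.

```python
# Hedged sketch: carry a task's pre-tuned switchable-cache configuration along
# when it migrates, and re-activate it on the destination core.

class Core:
    def __init__(self, name):
        self.name = name
        self.active_cfg = None

    def switch_cache(self, cfg_id):
        # In hardware this would assert the configuration-select lines.
        self.active_cfg = cfg_id

def migrate(task, dst):
    dst.switch_cache(task["cache_cfg"])   # keep the configuration tuned for the task
    task["core"] = dst.name
    return task

task = {"name": "h264_enc", "cache_cfg": "c16k_4w", "core": "cpu0"}
migrate(task, Core("cpu1"))
```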

Some application programs exhibit distinct memory access behaviours during different phases of execution within the application itself. In addition to switching caches between different applications, it is also possible to employ cache switching between different execution phases of the same application in order to improve memory access performance. Such functionality would require memory access traces to be analysed at design time to identify the distinct execution phases. Similar optimizations may be applied to data dependent streaming pipeline multiprocessor systems, such as JPEG image compression and H.264 video encoding. In such systems, pre-processing of incoming data may be used to predict the memory access behaviour once input data blocks reach the processing stages, and this information could be used to select suitable cache configurations at the separate processors in the pipelined system.
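
The sketch below, again with invented interfaces and thresholds, illustrates one simple way such design-time phase detection could work: the memory access trace is cut into fixed windows, each window is summarised by the set of cache blocks it touches, and consecutive windows whose working sets overlap strongly are grouped into a phase that can then be simulated and tuned separately.

```python
# Hedged sketch: detect execution phases from a design-time memory access trace
# by comparing simple working-set signatures of fixed-size trace windows.

def phase_signatures(trace, window=10_000, block=32):
    """Yield the set of touched cache blocks for each trace window."""
    for i in range(0, len(trace), window):
        yield {addr // block for addr in trace[i:i + window]}

def detect_phases(trace, similarity=0.7):
    """Group consecutive windows whose working sets overlap strongly."""
    phases, current = [], None
    for sig in phase_signatures(trace):
        if current and len(sig & current) / max(len(sig | current), 1) >= similarity:
            current |= sig               # same phase: grow its signature
        else:
            current = set(sig)           # new phase starts here
            phases.append(current)
    return phases

# Toy trace with two distinct working sets -> two detected phases.
phase_a = [(i * 4) % 4096 for i in range(30_000)]
phase_b = [0x8000 + (i * 4) % 8192 for i in range(30_000)]
print(len(detect_phases(phase_a + phase_b)))
```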

Cache switching could be used as a mode of redundancy, in order to improve reliability and protection against attacks targeting cached data. For example, two cache configurations can be activated and used by the same processor, where the cached data are duplicated between the two caches. The redundant data may potentially be used to detect system failures or malicious attacks, and thereafter for recovery purposes as well.

Operating systems could use cache switching at context switches in multi-threaded environments. Such applications would, however, require efficient data relocation into the cache when a thread is re-activated, in order to retain previously cached data. The trade-off between using cache switching to obtain optimal application dependent cache access times and under-utilizing the cache by losing previously cached data is an interesting research question in itself.
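
As a final illustration, the tiny sketch below (invented names; not an existing operating system interface) shows how a scheduler hook might select the configuration tuned for the incoming thread at each context switch, accepting the trade-off discussed above.

```python
# Hedged sketch: an OS hook that activates the switchable-cache configuration
# tuned (at design time) for the thread being scheduled in.

THREAD_CFG = {"audio_dec": "c8k_2w", "gui": "c16k_4w"}   # hypothetical tuning table
DEFAULT_CFG = "c16k_4w"

def on_context_switch(next_thread, switch_cache):
    """`switch_cache(cfg_id)` stands in for asserting the configuration-select lines."""
    switch_cache(THREAD_CFG.get(next_thread, DEFAULT_CFG))

on_context_switch("gui", switch_cache=lambda cfg: print("activate", cfg))
```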

Bibliography

[Aga87] A. Agarwal. Analysis of Cache Performance for Operating Systems and Multiprogramming. PhD thesis, Stanford University, Stanford, CA, 1987.

[Alta] Altera DE5 Development and Education Board.

[Altb] Altera Qsys System Integration Tool.

[ARM] ARM Cortex Processors.

[BABP06] S. Bertozzi, A. Acquaviva, D. Bertozzi, and A. Poggiali. Supporting Task Migration in Multi-Processor Systems-on-Chip: A Feasibility Study. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE’06), pages 15–20. European Design and Automation Association (EDAA), 2006.

[BCB74] J. Bell, D. Casasent, and C.G. Bell. An Investigation of Alternative Cache Organizations. IEEE Transactions on Computers, 23(4):346–351, 1974.

[BJS+14] H. Bokhari, H. Javaid, M. Shafique, J. Henkel, and S. Parameswaran. darkNoC: Designing Energy-Efficient Network-on-Chip with Multi-Vt Cells for Dark Silicon. In Proceedings of the 51st Annual IEEE/EDAC/ACM Design Automation Conference (DAC’14). IEEE, 2014.

[BM96] P. L. Bird and T. N. Mudge. An Instruction Stream Compression Technique. Technical report, The University of Michigan, USA, 1996.

[CHC+04] J. Chen, H. Hsu, K. Chuang, C. Yang, A. Pang, and T. Kuo. Multiprocessor Energy-Efficient Scheduling with Task Migration Considerations. In Proceedings of the 16th Euromicro Conference on Real-Time Systems (ECRTS’04), pages 101–108, June 2004.

[CM03] W. Clocksin and C. S. Mellish. Programming in PROLOG. Springer Science & Business Media, 2003.

[CMP+14] E. G. Cota, P. Mantovani, M. Petracca, M. R. Casu, and L. P. Carloni. Accelerator Memory Reuse in the Dark Silicon Era. Computer Architecture Letters, 13(1):2012–2015, 2014.

[CPN+09] E. S. Chung, M. K. Papamichael, E. Nurvitadhi, J. C. Hoe, and K. Mai. PROTOFLEX: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs. ACM Transactions on Reconfigurable Technology and Systems, 2(2), 2009.

[CR09] S. Chattopadhyay and A. Roychoudhury. Unified Cache Modeling for WCET Analysis and Layout Optimizations. In Proceedings of the 30th IEEE Real-Time Systems Symposium (RTSS’09), pages 47–56, Dec 2009.

[CSM14] A. Cilardo, D. Socci, and N. Mazzocca. ASP-based optimized mapping in a simulink-to-MPSoC design flow. Journal of Systems Architecture, 60:108–118, 2014.

[CTE08] E. Coban, F. Türe, and E. Erdem. Comparing ASP, CP, ILP on two Challenging Applications: Wire Routing and Haplotype Inference. In Proceedings of the 2nd International Workshop on Logic and Search (LaSh’08), pages 166–180, 2008.

[CX13] J. Cong and B. Xiao. Optimization of Interconnects Between Accelerators and Shared Memories in Dark Silicon. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design (ICCAD’13). IEEE, 2013.

[FP91] M. Farrens and A. Park. Dynamic Base Register Caching: A Technique for Reducing Address Bus Width. In Proceedings of the 18th Annual International Symposium on Computer Architecture (ISCA’91), pages 128–137. ACM, 1991.

[FW98] C. Ferdinand and R. Wilhelm. On Predicting Data Cache Behavior for Real-Time Systems. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems (LCTES’98), pages 16–30. Springer-Verlag, 1998.

[GG04] A. Ghosh and T. Givargis. Cache Optimization for Embedded Processor Cores: An Analytical Approach. ACM Transactions on Design Automation of Electronic Systems, 9(4):419–440, October 2004.

[GKK+11] M. Gebser, B. Kaufmann, R. Kaminski, M. Ostrowski, T. Schaub, and M. T. Schneider. Potassco: The Potsdam Answer Set Solving Collection. AI Communications, 24(2):107–124, 2011.

[GKKS12] M. Gebser, R. Kaminski, B. Kaufmann, and T. Schaub. Answer Set Solving in Practice. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.

[GKS12] M. Gebser, B. Kaufmann, and T. Schaub. Multi-threaded ASP Solving with clasp. Theory and Practice of Logic Programming, 12(4-5):525–545, September 2012.

[GL88] M. Gelfond and V. Lifschitz. The Stable Model Semantics for Logic Programming. In R. A. Kowalski and K. A. Bowen, editors, Proceedings of the 5th International Conference and Symposium on Logic Programming, pages 1070–1080. MIT Press, 1988.

[GRLC08] A. Gordon-Ross, J. Lau, and B. Calder. Phase-based Cache Reconfiguration for a Highly-configurable Two-level Cache Hierarchy. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI (GLSVLSI’08), pages 379–382. ACM Press, 2008.

[GRZVD04] A. Gordon-Ross, C. Zhang, F. Vahid, and N. Dutt. Chapter 6: Tuning Caches to Applications for Low-Energy Embedded Systems. In Ultra Low-Power Electronics and Design. Springer, 2004.

[Hil] M. D. Hill. Dinero IV Trace-Driven Uniprocessor Cache Simulator.

[HJP09] M. S. Haque, A. Janapsatya, and S. Parameswaran. SuSeSim: A Fast Simulation Strategy to Find Optimal L1 Cache Configuration for Embedded Systems. In Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Co-Design and System Synthesis (CODES+ISSS’09), pages 295–304, 2009.

[HKH+13] M. S. Haque, A. Kumar, Y. Ha, Q. Wu, and S. Luo. TRISHUL: A Single-Pass Optimal Two-level Inclusive Data Cache Hierarchy Selection Process for Real-time MPSoCs. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC’13), pages 320–325, 2013.

[HNL06] J. Hong, E. Nurvitadhi, and S. L. Lu. Design, Implementation, and Verification of Active Cache Emulator (ACE). In Proceedings of the 14th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA’06), pages 63–72. ACM, 2006.

[HP11] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edition, 2011.

[HPJP10] M. S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran. DEW: A Fast Level 1 Cache Simulation Approach for Embedded Processors with FIFO Replacement Policy. In Proceedings of the Design Automation & Test in Europe Conference (DATE’10), pages 496–501, 2010.

[HPJP12] M. S. Haque, J. Peddersen, A. Janapsatya, and S. Parameswaran. SCUD: A Fast Single-Pass L1 Cache Simulation Approach for Embedded Processors with Round-Robin Replacement Policy. In Proceedings of the Design Automation Conference (DAC’10), pages 356–361, 2012.

[HPP11] M. S. Haque, J. Peddersen, and S. Parameswaran. CIPARSim: Cache Intersection Property Assisted Rapid Single-Pass FIFO Cache Simulation Technique. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’11), pages 126–133. IEEE, November 2011.

[HRA+12] M. S. Haque, R. Ragel, A. Ambrose, S. Radhakrishnan, and S. Parameswaran. DIMSim: A Rapid Two-Level Cache Simulation Approach for Deadline-Based MPSoCs. In Proceedings of the 8th IEEE/ACM/IFIP International Conference on Hardware/Software Co-Design and System Synthesis (CODES+ISSS’12), pages 151–160, 2012.

[HS89] M.D. Hill and A.J. Smith. Evaluating Associativity in CPU Caches. IEEE Transactions on Computers, 38(12):1612–1630, 1989.

[HS90] P. Heidelberger and H. S. Stone. Parallel Trace-driven Cache Simula- tion by Time Partitioning. In Proceedings of the 22nd Conference on Winter Simulation (WSC’90), pages 734–737. IEEE Press, 1990.

[HS12] A. Basu, M. D. Hill, and M. M. Swift. Reducing Memory Reference Energy with Opportunistic Virtual Caching. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA’12). IEEE, 2012.

[HXXY11] W. Han, L. Xiang, G. Xiaopeng, and L. Yi. GPU Accelerating for Rapid Multi-Core Cache Simulation. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pages 1387–1396. IEEE, May 2011.

[ICM06] C. Isci, G. Contreras, and M. Martonosi. Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06), pages 359–370. IEEE Computer Society, 2006.

[IDZ80] R. L. Iman, J. M. Davenport, and D. K. Zeigler. Latin Hypercube Sampling (Program User’s Guide). Technical Report SAND79-1473, Sandia Laboratories, Albuquerque, Jan 1980.

[IKGC11] N. Ioannou, M. Kauschke, M. Gries, and M. Cintra. Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’11), pages 131–142, Oct 2011.

[Ish09] H. Ishebabi. Answer Set versus Integer Linear Programming for Automatic Synthesis of Multiprocessor Systems from Real-Time Parallel Programs. International Journal of Reconfigurable Computing, 2009.

[Iye03] R. Iyer. On Modeling and Analyzing Cache Hierarchies using CASPER. In Proceedings of the 11th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer Telecommunications Systems (MASCOTS’03). IEEE, 2003.

[JIS06] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. Finding Optimal L1 Cache Configuration for Embedded Systems. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC’06), pages 1–6, 2006.

[JSPH11] H. Javaid, M. Shafique, S. Parameswaran, and J. Henkel. Low-Power Adaptive Pipelined MPSoCs for Multimedia: An H.264 Video Encoder Case Study. In Proceedings of the 48th Design Automation Conference (DAC’11), pages 1032–1037. ACM, 2011.

[KCDM98] C. Kulkarni, F. Catthoor, and H. De Man. Hardware Cache Optimization for Parallel Multimedia Applications. In David Pritchard and Jeff Reeve, editors, Euro-Par’98 Parallel Processing, volume 1470 of Lecture Notes in Computer Science, pages 923–932. Springer Berlin Heidelberg, 1998.

[KCHP01] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An Online Algorithm for Segmenting Time Series. In Proceedings of the IEEE International Conference on Data Mining (ICDM’01), 2001.

[KGB96] A. Kagi, J.R. Goodman, and D. Burger. Memory Bandwidth Limitations of Future Microprocessors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA’96), pages 78–78, May 1996.

[Kha14] R. Khatwal. Application Specific Cache Simulation Analysis for Application Specific Instruction Set Processor. International Journal of Computer Applications, 90(13):31–41, 2014.

[LDK99] S. Liao, S. Devadas, and K. Keutzer. A Text-Compression-Based Method for Code Size Minimization in Embedded Systems. ACM Transactions on Design Automation of Electronic Systems, 4(1):12–38, January 1999.

[LL03] S. Lu and K. Lai. Implementation of HW$im: A Real-Time Configurable Cache Simulator. In Field Programmable Logic and Application, pages 638–647. Springer Berlin Heidelberg, 2003.

[LM08] Y. Liang and T. Mitra. Static Analysis for Fast and Accurate Design Space Exploration of Caches. In Proceedings of the 6th IEEE/ACM International Conference on Hardware/Software Co-Design and System Synthesis (CODES+ISSS’08), page 103. ACM Press, 2008.

[LMW96] Y. S. Li, S. Malik, and A. Wolfe. Cache Modeling for Real-Time Software: Beyond Direct Mapped Instruction Caches. In Proceedings of the 17th IEEE Real-Time Systems Symposium, pages 254–263, Dec 1996.

[LMW99] Y. S. Li, S. Malik, and A. Wolfe. Performance Estimation of Embedded Software with Instruction Cache Modeling. ACM Transactions on Design Automation of Electronic Systems, 4(3):257–279, July 1999.

[LPB06] M. Loghi, M. Poncino, and L. Benini. Cache Coherence Tradeoffs in Shared-Memory MPSoCs. ACM Transactions on Embedded Computing Systems (TECS), 2006.

[LPF+06] N. Leone, G. Pfeifer, W. Faber, T. Eiter, G. Gottlob, S. Perri, and F. Scarcello. The DLV System for Knowledge Representation and Reasoning. ACM Transactions on Computational Logic, 7(3):499–562, 2006.

[Mar08] M. R. Marty. Cache Coherence Techniques for Multicore Processors. PhD thesis, University of Wisconsin - Madison, 2008.

[MBJ07] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’07), pages 3–14, December 2007.

[MGB11] F. Mühlbauer, M. Großhans, and C. Bobda. Rapid Prototyping of OpenCV Image Processing Applications using ASP. In Proceedings of the 22nd IEEE International Symposium on Rapid System Prototyping (RSP’11), pages 16–22. IEEE, 2011.

[MGST70] R.L. Mattson, J. Gecsei, D.R. Slutz, and I.L. Traiger. Evaluation Techniques for Storage Hierarchies. IBM Systems Journal, 9(2):78–117, 1970.

[MPZS12] G. Mariani, G. Palermo, V. Zaccaria, and C. Silvano. OSCAR: An Optimization Methodology Exploiting Spatial Correlation in Multicore Design Spaces. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 31(5):740–753, 2012.

[MV99] N. R. Mahapatra and B. Venkatrao. The Processor-Memory Bottleneck: Problems and Solutions. Crossroads, 5(3es):2, 1999.

[MWL95] S. A. McKee, W. A. Wulf, and T. C. Landon. Bounds on Memory Bandwidth in Streamed Computations. Technical report, Lecture Notes in Computer Science 966: Europar’95 Parallel Processing, 1995.

[Nio] Nios II/f Core: Fast for Performance-Critical Applications.

[NJR+] I. Nawinne, H. Javaid, R. Ragel, S. Radhakrishnan, and S. Parameswaran. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), (12):1991–2003.

[NJRP15] I. Nawinne, H. Javaid, R. Ragel, and S. Parameswaran. Switchable Cache: Utilizing Dark Silicon for Application Specific Cache Optimizations. IET Computers & Digital Techniques, 2015.

[NSJP14] I. Nawinne, J. Schneider, H. Javaid, and S. Parameswaran. Hardware-Based Fast Exploration of Cache Hierarchies in Application Specific MPSoCs. In Proceedings of the Design Automation and Test in Europe Conference (DATE’14). IEEE, March 2014.

[NVI] nVIDIA CUDA Parallel Computing Platform.

[PAC+97] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2):34–44, March 1997.

[PP84] M. S. Papamarcos and J. H. Patel. A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories. In Proceedings of the 11th Annual International Symposium on Computer Architecture (ISCA’84), pages 348–354. ACM, 1984.

[Rao78] G. S. Rao. Performance Analysis of Cache Memories. Journal of the ACM (JACM), 25(3):378–395, July 1978.

[RGR11] M. Rawlins and A. Gordon-Ross. CPACT - The Conditional Parameter Adjustment Cache Tuner for Dual-Core Architectures. In Proceedings of the 29th IEEE International Conference on Computer Design (ICCD’11), pages 396–403. IEEE, October 2011.

[SA95] R. A. Sugumar and S. G. Abraham. Set-Associative Cache Simulation Using Generalized Binomial Trees. ACM Transactions on Computer Systems (TOCS), 13(February):32–56, 1995.

[SD95] C. Su and A. M. Despain. Cache Design Trade-offs for Power and Performance Optimization: A Case Study. In Proceedings of the 1995 International Symposium on Low Power Design (ISLPED’95), pages 63–68. ACM, 1995.

[SIKN07] H. Sasaki, Y. Ikeda, M. Kondo, and H. Nakamura. An Intra-Task DVFS Technique Based on Statistical Analysis of Hardware Events. In Proceedings of the 4th International Conference on Computing Frontiers (CF’07), pages 123–130. ACM, 2007.

[SJP13] S. M. M. Shwe, H. Javaid, and S. Parameswaran. RExCache: Rapid Exploration of Unified Last-Level Cache. In Proceedings of the 18th Asia and South Pacific Design Automation Conference (ASP-DAC’13), pages 582–587. IEEE, January 2013.

[SNS02] P. Simons, I. Niemelä, and T. Soininen. Extending and Implementing the Stable Model Semantics. Artificial Intelligence, 138(1-2):181–234, 2002.

[SPP14a] J. Schneider, J. Peddersen, and S. Parameswaran. A Scorchingly Fast FPGA-Based Precise L1 LRU Cache Simulator. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC’14). IEEE, January 2014.

[SPP14b] J. Schneider, J. Peddersen, and S. Parameswaran. MASHfifo: A Hardware-Based Multiple Cache Simulator for Rapid FIFO Cache Analysis. In Proceedings of the Design Automation Conference (DAC’14). IEEE, 2014.

[SS07] R. Sen and Y. N. Srikant. WCET Estimation for Executables in the Presence of Data Caches. In Proceedings of the 7th ACM & IEEE International Conference on Embedded Software (EMSOFT’07), pages 203–212. ACM, 2007.

[Tay12] M. B. Taylor. Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse. In Proceedings of the Design Automation Conference (DAC’12), pages 1131–1136. IEEE, 2012.

[TFW00] H. Theiling, C. Ferdinand, and R. Wilhelm. Fast and Precise WCET Prediction by Separated Cache and Path Analyses. Real-Time Systems, 18(2/3):157–179, May 2000.

[TRGM13] Y. Turakhia, B. Raghunathan, S. Garg, and D. Marculescu. HaDeS: Architectural Synthesis for Heterogeneous Dark Silicon Chip Multiprocessors. In Proceedings of the 50th Annual ACM/EDAC/IEEE Design Automation Conference (DAC’13), 2013.

[TTYO09] N. Tojo, N. Togawa, M. Yanagisawa, and T. Ohtsuki. Exact and Fast L1 Cache Simulation for Embedded Systems. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC’09), pages 817–822. IEEE, January 2009.

[VGRBV08] P. Viana, A. Gordon-Ross, E. Barros, and F. Vahid. A Table-based Method for Single-pass Cache Optimization. In Proceedings of the 18th ACM Great Lakes symposium on VLSI (GLSVLSI’08), page 71. ACM Press, 2008.

[VGRK+06] P. Viana, A. Gordon-Ross, E. Keogh, E. Barros, and F. Vahid. Configurable Cache Subsetting for Fast Cache Tuning. In Proceedings of the Design Automation Conference (DAC’06), pages 695–700, 2006.

[WB91] W. Wang and J. Baer. Efficient Trace-Driven Simulation Method for Cache Performance Analysis. ACM Transactions on Computer Systems (TOCS), 9(3):27–36, 1991.

[WJ90] A. W. Wilson Jr. Multiprocessor Cache Simulation Using Hardware Collected Address Traces. In Proceedings of the 23rd Annual Hawaii International Conference on System Sciences, pages 252–260, 1990.

[WMH+97] R. T. White, F. Mueller, C. A. Healy, D. B. Whalley, and M. G. Harmon. Timing Analysis for Data Caches and Set-Associative Caches. In Proceedings of the 3rd IEEE Real-Time Technology and Applications Symposium, pages 192–202, Jun 1997.

[XTE] Cadence Tensilica Xtensa Customizable Processors.

[XTM] XTMP: The XTensa Modeling Protocol and XTensa SystemC Modeling for Fast System Modeling and Simulation. http://ip.cadence.com/hwdes.

[YMB15] F. Yonga, M. Mefenza, and C. Bobda. ASP-Based Encoding Model of Architecture Synthesis for Smart Cameras in Distributed Networks. ACM Transactions on Design Automation of Electronic Systems, 2015.

[ZGR11] W. Zang and A. Gordon-Ross. T-SPaCS: A Two-Level Single-Pass Cache Simulation Methodology. IEEE Transactions on Computers, 62(2):390–403, 2011.

[ZGR12] W. Zang and A. Gordon-Ross. A Single-pass Cache Simulation Methodology for Two-level Unified Caches. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS’12), pages 168–177. IEEE, April 2012.

[ZV03] C. Zhang and F. Vahid. Cache Configuration Exploration on Prototyping Platforms. In Proceedings of the 14th IEEE International Workshop on Rapid Systems Prototyping, pages 164–170, 2003.

[ZVL04] C. Zhang, F. Vahid, and R. Lysecky. A self-tuning cache architecture for embedded systems. ACM Transactions on Embedded Computing Systems (TECS), 3(2):407–425, May 2004.
