Design and Development of a Heterogeneous Hardware Search Accelerator

This dissertation is submitted for the degree of Doctor of Philosophy

Tan, Shawn Ser Ngiap

Magdalene College

May 21, 2009

Abstract

Search is a fundamental computing problem and is used in any number of applications that are invading our everyday lives. However, it has not received as much attention as other fundamental computing problems. Historically, there have been several attempts at designing complex machines to accelerate search applications. However, with the cost of transistors falling dramatically, it may be useful to design a novel on-chip hardware accelerator for search applications.

A search application is any application that traverses a data set in order to find one or more records that meet certain fitting criteria. These applications can be broken down into several low-level operations, which can be accelerated by specialised hardware units. A special search stack can be used to visualise the different levels of a search operation. Three hardware accelerator units were designed to work alongside a host processor. A significant speed-up in performance when compared against pure software solutions was observed under ideal simulation conditions. An unconventional method for virtually saving and loading search data was developed within the simulation construct to reduce simulation time.

This method of acceleration is not the only possible solution, as search can be accelerated at a number of levels. However, the proposed architecture is unique in the way that the accelerator units can be combined like LEGO bricks, giving this solution flexibility and scalability. Search is memory intensive, but the performance of regular cache memory that exploits temporal and spatial locality was found wanting. A cache memory that exploited structural locality instead of temporal and spatial locality was therefore developed to improve performance.

As search is a fundamental computational operation, it is used in almost every application, not just obvious search applications. Therefore, the hardware accelerator units can be applied to almost every software application. Obvious examples include genetics and law enforcement, while less obvious examples include gaming and operating system software. In fact, it would be useful to integrate accelerator units with slower processors to improve general search performance.

The accelerator units can be implemented using an off-the-shelf FPGA at speeds of around 200MHz, or in ASIC for 333MHz (0.35µm) and 1.0GHz (0.18µm) operation. A regular FPGA is able to accelerate up to five parallel simple queries, two heterogeneous boolean queries, or a combination of the two when used with regular DDR2 memory. This solution is particularly low-cost for accelerating search, avoiding the need for expensive system-level solutions.

Declaration

I hereby declare that my thesis entitled Design and Development of a Heterogeneous Hardware Search Accelerator is not substantially the same as any that I have submitted for a degree or diploma or other qualification at any other University. I further state that no part of my thesis has already been or is being concurrently submitted for any such degree, diploma or other qualification. This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the limit of length prescribed by the Degree Committee of the Engineering Department. The length of my thesis is approximately 45,000 words with 41 figures and 25 listings.

Signed,

Shawn Tan

Acknowledgements

I would like to take this opportunity to express my gratitude to the following people who have helped me, in one way or another, throughout the duration of my research at Cambridge and the write-up at home in Malaysia.

Dr David Holburn, for being the nicest supervisor that one can hope for, without whom this work would be difficult to accomplish. I want to express my thanks for everything you’ve done for me in the past four years; welcoming me into your family, getting things done within the department and patiently reading through my thesis.

All the members of the department and division, for making it a nice place and easy environment to work in. Mr Stephen Mounsey, Mr John Norcott and Miss Eleanor Blair for technical assistance in setting up the various software tools that I needed. Mr Mick Furber for all the assistance in the electrical teaching lab.

Friends from college, for helping me through tough times and keeping me sane. Jack Nie for helping me print out my thesis and handling all of the administrative issues in submitting my thesis. Drs Ray Chan and Ming Yeong Lim for being my companions on my many travels. Zen Cho for being my shoulder to cry on when things were not going well.

All my friends and family in Malaysia, for their belief in me and support throughout the duration of this research. I would like to thank my sister and my parents for all the patience and tolerance that they have shown me during the final stretch of this work. My niece and nephews, Jarellynn, Jareick and Jarell for lending me their bubbling energy when I needed a boost. This thesis is dedicated to them.

Contents

1 Introduction
  1.1 Justifying Search Acceleration
  1.2 Historical Justification
  1.3 Objectives

2 Search Basics
  2.1 Search Stack
  2.2 Categorising Search
    2.2.1 Primary Search
    2.2.2 Secondary Search
  2.3 Data Structures & Algorithms
    2.3.1 Data Structures
    2.3.2 Algorithms
  2.4 Search Problems

3 Search Application
  3.1 Search Application
    3.1.1 Example Query
    3.1.2 Pipeline Breakdown
    3.1.3 Query Illustration
  3.2 Search Profile
    3.2.1 Key Search
    3.2.2 List Retrieval
    3.2.3 Result Collation
    3.2.4 Overall Profile

4 General Architecture
  4.1 Initial Considerations
  4.2 Hardware Architecture
    4.2.1 Multi-Core Processing
    4.2.2 Word Size
    4.2.3 Host Processor
  4.3 Software Architecture
    4.3.1 Software Toolchain
    4.3.2 Standard Libraries
    4.3.3 Custom Library
  4.4 Initial Architecture
    4.4.1 Stack Processors

5 Streamer Unit
  5.1 Introduction
    5.1.1 Design Considerations
  5.2 Architecture
    5.2.1 Configuration
    5.2.2 Operating Modes
    5.2.3 State Machine
  5.3 Streamer Simulation
    5.3.1 Kernel Functional Simulation
    5.3.2 Kernel Timing Simulation
    5.3.3 Kernel Performance Simulation
  5.4 Conclusion

6 Sieve Unit
  6.1 Introduction
    6.1.1 Design Considerations
  6.2 Architecture
    6.2.1 Configuration
    6.2.2 Modes
    6.2.3 Operation
  6.3 Simulation Results
    6.3.1 Kernel Functional Simulation
    6.3.2 Kernel Software Pump Timing
    6.3.3 Kernel Software Pump Performance
    6.3.4 Kernel Hardware Pipe Timing
    6.3.5 Kernel Hardware Pipe Performance
  6.4 Conclusion

7 Chaser Unit
  7.1 Introduction
    7.1.1 Design Considerations
  7.2 Chaser Architecture
    7.2.1 Configuration
    7.2.2 Operation
  7.3 Kernel Simulation Results
    7.3.1 Kernel Functional Simulation
    7.3.2 Kernel Single Key Timing
    7.3.3 Kernel Single Key Performance
    7.3.4 Kernel Multi Key Timing
    7.3.5 Kernel Multi Key Performance
  7.4 Conclusion

8 Memory Interface
  8.1 Introduction
  8.2 Cache Primer
  8.3 Cache Principles
    8.3.1 Instruction Cache
    8.3.2 Data Cache
  8.4 Cache Parameters
    8.4.1 Instruction Cache
    8.4.2 Data Cache Trends (Repeat Key)
    8.4.3 Data Cache Trends (Random Key)
  8.5 Data Cache Prefetching
    8.5.1 Static Prefetching
    8.5.2 Dynamic Prefetching
    8.5.3 Prefetched Data Cache
  8.6 Cache Integration
    8.6.1 Cache Size Ratio
    8.6.2 Structural Locality
  8.7 Conclusion

9 Search Pipelines
  9.1 Pipelines
    9.1.1 Primary Search
    9.1.2 Simple Query
    9.1.3 Range Query
    9.1.4 Boolean Query
  9.2 System Pipelining

10 Implementation
  10.1 Fabric Architectures
    10.1.1 Dynamic Fabric
    10.1.2 Static Fabric
  10.2 Integration Architectures
    10.2.1 Tight Coupling
    10.2.2 Loose Coupling
  10.3 FPGA Implementation
    10.3.1 Chaser Implementation
    10.3.2 Streamer Implementation
    10.3.3 Sieve Implementation
    10.3.4 Resource & Power
    10.3.5 Physical Limits
  10.4 ASIC Implementation
    10.4.1 Area Estimates
    10.4.2 Power Estimates
    10.4.3 Speed Estimates
  10.5 Cost Estimates
  10.6 Conclusion

11 Analysis & Synthesis
  11.1 Important Questions
  11.2 Host Processor Performance
    11.2.1 Software Optimisation
    11.2.2 Processor Architecture
  11.3 Scalability
    11.3.1 Processor Scalability
    11.3.2 Accelerator Scalability
    11.3.3 Memory Scalability
  11.4 Acceleration Cost
    11.4.1 Configuration A
    11.4.2 Configuration B
    11.4.3 Configuration Comparisons
  11.5 Alternative Technologies
    11.5.1 Improved Software
    11.5.2 Content-Addressable Memories
    11.5.3 Multicore Processors
    11.5.4 Data Graph Processors
    11.5.5 Other Processors
  11.6 Suggestions for Future Work
    11.6.1 Conjoining Arithmetic Units
    11.6.2 Conjoining Stream Buffers
    11.6.3 Memory Interface

12 Conclusion

List of Figures

2.1 Search abstraction stack
3.1 Typical search pipeline
4.1 Initial hardware search accelerator architecture
4.2 Initial stack based accelerator architecture
5.1 Streamer dataflow
5.2 Streamer block
5.3 Streamer configuration stack
5.4 Streamer operating modes
5.5 Streamer machine states
5.6 Accelerator unit simulation setup
5.7 Streamer timing diagram
5.8 Streamer performance simulation
6.1 Sieve dataflow
6.2 Sieve block
6.3 Sieve configuration register
6.4 Sieve operating modes
6.5 Sieve FSM
6.6 Sieve software pumped timing diagram
6.7 Sieve software pumped simulation
6.8 Sieve with hardware piped timing diagram
6.9 Sieve with streamer piped simulation
7.1 Chaser dataflow
7.2 Chaser unit block
7.3 Chaser configuration stack
7.4 Chaser machine states
7.5 Single key chaser timing diagram
7.6 Chaser simulation
7.7 Multiple key chase kernel timing
7.8 Chaser simulation (multi-key)
8.1 Cache simulation setup
8.2 Basic cache operation
8.3 Instruction cache hit ratio
8.4 Repetitive heap cache
8.5 Random heap cache
8.6 Random heap cache (with prefetch)
8.7 Cache structure comparison
8.8 Structural cache architecture
9.1 Search pipeline abstraction
10.1 Implementation architectures
10.2 System level implementation
10.3 ASIC area and power estimates

List of Tables

3.1 Search Profiles
10.1 ASIC area and power estimates at speed
10.2 Fabrication cost per accelerator unit
11.1 Code profile for std::set::find()
11.2 Specifications for 0.35µm CMOS DPRAM blocks

Listings

3.1 Verilog profiling construct
3.2 Key search profile kernel
3.3 List retrieval profile kernel
3.4 Result collation profile kernel
5.5 Streaming pseudocode
5.2 Software streamer kernel
5.3 Hardware streamer kernel
5.4 Streamer kernel
6.1 Sieve software kernel
6.2 Sieve hardware kernel
6.3 Sieve kernel
6.6 Hardware streamer-sieve kernel
7.2 Software chaser kernel
7.3 Hardware chaser kernel
7.4 Chaser kernel
7.6 Software multi-key chaser kernel
7.7 Hardware multi-key chaser kernel
8.1 Verilog simulation LOAD/SAVE
8.2 Cache tree fill kernel
8.3 Cache simulation kernel
11.1 AEMB disassembly (GCC 4.1.1)
11.2 ARM disassembly (GCC 4.2.3)
11.3 PPC disassembly (GCC 4.1.1)
11.4 68K disassembly (GCC 3.4.6)
11.5 disassembly (GCC 4.2.3)

List of Reports

10.1 Chaser FPGA implementation results (excerpt)
10.2 Streamer FPGA implementation results (excerpt)
10.3 Sieve FPGA implementation results (excerpt)

CHAPTER 1

Introduction

Search is a fundamental problem in computing and, as computers are increasingly invading our everyday lives, search is also becoming an everyday problem for everyone. Historically, search has received less attention than other computing problems. The main objective of this research is to design a hardware device that can offload the mundane tasks from a host processor and speed up bottlenecks in search processing.

1.1 Justifying Search Acceleration

Search is becoming increasingly important in the consumer space. Where once it was the province of massive systems owned by large corporations, search is moving downstream. This is evident with the present emphasis placed on desktop search1 and other localised search applications. Search has grown from being a fundamental computing problem into an everyday problem[RSK04] for everyone. Computing is also becoming ever more personal, with mobile devices today far exceeding the computing power of enterprise servers of the past. As a result, our personal computing devices have to juggle more information than before. Personal computing storage capacities have grown from the tens of megabytes in the 1980s to the hundreds of gigabytes of today[Por]. This is reflective of the amount of data search applications have to work on. So, it will be useful to see how modern search might be accelerated today.

1Desktop search is the name for the field of search tools which search the contents of a user’s own computer files, rather than searching the Internet[Wik09a].

With transistors being so cheap, there is good reason to explore ways of adding processor functionality to improve performance and add value. The floating-point unit has become an integral component in general-purpose computers to accelerate floating-point calculations. Graphics accelerators are also being integrated into general-purpose computers for stream processing2, which is useful for media-centric and scientific computations[HB07, LB08]. For everything else, the generic solution to every problem is to devote more general-purpose computing power to it.

According to [Knu73], search is the most time-consuming part of many programs, and the substitution of a good search method for a bad one often leads to a substantial increase in performance; in fact, it is sometimes possible to arrange the data or the data structure so that searching is eliminated entirely. However, there are many cases where search is necessary, so it is important to have efficient algorithms for searching. Although the problem of sorting received considerable attention in the earliest days of computing, less has been done about searching.

Search can be loosely defined as traversing a data space to find solutions that fit a set of criteria. As an abstract task, it is not limited to just database search and similar applications. Almost every task performed by a computer involves some form of search. Some abstract examples include chess playing and cryptography, while less obvious examples include task scheduling and language parsing. Even the simple task of creating a document with a modern word processor will cause a search to be performed many times, for both grammar and spell checking.

In less abstract applications, search is involved in almost every aspect of data manipulation, regardless of how the data is organised in a computer or for what application. Besides performing a search to find a record, searches are also performed during record insertion, updating and deletion. Therefore, search is a very fundamental computing task and any hardware acceleration of search would contribute significantly to overall programme speed.

The search problem can be solved by devoting more hardware resources to it or by designing better algorithms. In the case of the leading search engine in the world, Google, both methods are employed. Google uses a patented algorithm called PageRank3 to improve search result quality, while also employing parallelisation on a massive scale to perform complex searches with lightning speed. There are some lessons that can be taken from this approach. However, there may be a more elegant way of tackling the problem, one that involves using less, not more, hardware to support the algorithms.

This research sets out to answer the question by looking at search algorithms, how they behave and which parts of the

2Stream processing is a computer programming paradigm, related to SIMD, that allows some applications to more easily exploit a limited form of parallel processing[Wik09d].
3http://web.archive.org/web/20071114010112/http://www.google.com/corporate/tech.html

algorithm cause bottlenecks in the microprocessor. It then attempts to develop a versatile architecture that can support search acceleration in hardware and measures the potential performance of such an accelerator.

Chapters 2 and 3 address the software issues: search is defined and classified into different categories, and the problems with each are identified. Chapters 5 through 10 detail the hardware accelerators, their design considerations, functional configurations, operating modes and implementation technologies. Chapter 11 discusses various overarching issues including the validity of the results, the scalability of the architecture and other potential competing solutions.

1.2 Historical Justification

As mentioned earlier, search has received less attention than sorting [Knu73] and this is also evident in hardware. While there have been past attempts at designing a search processor, there are not many major ones. Some of these attempts initially targeted non-indexed queries, treated search as a computational problem and devoted more processing power to it. As mentioned in [Sto90], the performance gains are at the expense of sorting: in most cases, simply indexing the database would reduce the effectiveness of such solutions. However, there are also some solutions that used unique storage hardware to perform acceleration at a fine-grained level. The following traces some of the major evolutions of the past.

CASSM [GLS73, SL75] was a cellular system for very large databases. It was an early research effort that looked into hardware methods for accelerating search applications and is one that is often cited. It focused on a context addressed cellular system for information processing using a unique but inexpensive large memory device. This device allowed the creation of hardware dependent data structures that were closer to the abstraction of the data as perceived by a human, rather than a machine. Therefore, high level search queries were implemented directly in this device. This memory device was implemented using a floppy disk but could be expanded to include other storage mechanisms including electronic memory. These devices were used in a distributed fashion in order to increase parallelism, using a number of non-numeric microprocessors to process the data in a parallel and associative manner.

DIRECT [DeW78] was a multiprocessor architecture for supporting relational database management systems. It was a form of MIMD computer using a number of microprogrammable off-the-shelf PDP11 microprocessors. These processors were attached to pseudo-associative memory through a cross-point switch. The number of processors allocated to a specific query was dynamically determined, based on the complexity and

size of the query. It was software compatible, ran a modified version of Ingres and could be used as a relational database accelerator. Its operations were database specific, including such primitives as CREATEDB, DESTROY, NEXTPAGE, JOIN, INSERT, RESTRICT and other database specific operations. Therefore, it had a custom programming language that used its query primitives like assembly opcodes. The resultant programming code resembled the stored procedure4 languages used in present day databases. However, it did not use indices.

CAFS [Bab79] attempted to build an entire relational database system by means of specialised hardware. It claimed that regular computers were fundamentally unsuited to implementing relational operations and that database systems were ultimately I/O limited. Therefore, this system used content-addressable hardware, even at the disk level, to speed up relational queries. It worked on both indexed and non-indexed queries using temporary storage as a core hardware index. This allowed complex relational operations to be accelerated by manipulating the information stored in this hardware index. However, it ran on specific types of hardware and was limited to searching and filtering table rows stored in disk storage.

GAMMA [DGG+86] was a relational database machine that exploited dataflow query processing techniques. It was built as a cluster of off-the-shelf VAX11 based machines connected via a token ring network and was a direct descendant of DIRECT. However, it took into account the fact that the use of indices would improve search performance tremendously by reducing I/O transactions. In addition to the I/O bandwidth limitation, this machine tried to address the bandwidth limitation in the message passing interface of a multiprocessor system. This machine demonstrated that parallelism could be made to work in a database machine context. Moreover, it also showed how parallelism could be controlled with minimum overhead through a combination of hashing based algorithms and pipelining between processes. However, this was an expensive demonstration of how standard computing power can be scaled in a cluster to perform search acceleration.

GRACE [FKT86] was a parallel relational database machine. It was also built as a cluster of off-the-shelf machines connected in two rings: a processing ring and a staging ring. Both rings shared the same cluster of shared memory modules. While previous machines employed processing at the database page level, this machine worked on databases at a higher level of granularity: the task level. Each task used a number of primitive database oriented operations including joins, selections, sorting and relational algebra.

4A stored procedure is a subroutine available to applications accessing a relational database system[Wik09c].

The machine tried to achieve high performance for join-intensive applications by using a data stream oriented processing technique. A parallel join algorithm based on the clustering property of hashing and sorting was used to support this processing technique. In addition, it reduced the I/O bottleneck by using a combination of unique algorithms and disk systems.

RINDA [ISH+91] was a relational database machine with specialised hardware for searching and sorting. It was built as a cluster of standard computers with standard disk controllers and storage. This database processor accelerated non-indexed relational database queries. As the data was non-indexed and consequently unsorted, it had to handle both a search problem and a sorting problem at the same time. It was composed of content search processors and relational operation accelerating processors: the former searched rows stored in disk storage, while the latter sorted rows stored in main memory. The processors connected to a general-purpose host computer through channel interfaces.

GREO [FK93] was a commercial database machine based on a pipelined hardware sorter. This machine was designed for commercial usage in existing installations. As the clients were not interested in rewriting whole applications to cater to a new architecture, support for legacy data structures and algorithms was important. It was made up of a hardware merge sorter alongside a number of data stream microprocessors. The data stream microprocessors were made up of a number of MC68020 microprocessors in a board level multi-processor system. These processors performed the database primitive operations such as selections, projections, joins and other computations. The host computer compiled a given query into a sorting-oriented dataflow graph, which was executed by the hardware sorter and the 68K microprocessors.

It would seem that in practically all of these cases, the solution presented is one that employs various multi-processor arrangements, from the networked cluster level down to the system board level. The solutions were not targeted at chip level integration, including those that employed custom chips at a board level. This may be due to the nature of the industry and cost constraints on building systems on chip. However, with the low cost of transistors today, it is feasible to explore building a chip level accelerator.

Some of these solutions also employed unique storage hardware and content addressable storage to improve I/O performance. While these solutions are fast, customised storage hardware would introduce incompatibilities with existing computing architectures, and content addressable devices are notoriously expensive to build. Therefore, these are custom solutions that would be difficult to implement in today's world where computing is a commodity.

1.3 Objectives

This document is broadly organised according to the objectives of this research, which are summarised as follows:

• Justify the need for hardware search acceleration. Search is shown to be an important and common operation that is performed by a computer. Therefore, it will be beneficial to accelerate the operation, while incurring the minimum penalty of additional hardware and software.

• Categorise the types of search and the problems faced by each. Search needs to be understood and broken down into sub-problems. Each sub-problem can then be studied and accelerated in hardware.

• Design a hardware device that can be used to accelerate common present-day search operations in a cost effective manner. The term accelerate is defined as comprising two functions:

1. It will need to offload many of the mundane search processing tasks from the host processor. This will free up the host processor to perform other computational operations.

2. It will be designed to speed up the bottlenecks in the search operation. This speed-up will be obtained by performing many of the operations in hardware, at the fastest rate possible.

CHAPTER 2

Search Basics

A search stack helps visualise how the different hardware and software components work together in any search application. The survey of the software layers begins with the primary search and secondary search layers, which each exhibit different characteristics and encounter different problems. It is also important to know the basic data structures and algorithms that are used in search applications, as they are used in application software. In the end, the problems of search become evident and regular methods may not address them adequately.

2.1 Search Stack

The search stack of figure 2.1 is an abstraction framework for illustrating how different parts of the hardware and software fit together to perform search operations. Although the details of each layer in the stack may not be evident at the moment, they will be further elaborated in subsequent chapters. The search stack is separated into hardware and software abstraction layers. The hardware layers are composed of hardware devices, from the accelerator units to the host processor, while the software layers represent the different software functions performed during a search operation. The different layers are dependent upon each other and, as a result, any improvement in search performance on one layer will ultimately improve the overall performance. As the stack clearly illustrates, improvements can come from either the software or hardware domains, and each layer can be substituted with alternative technologies that perform the same functions.

Software layers:   Software Application
                   Secondary Search
                   Primary Search
                   Interface Library
Hardware layers:   Host Processor
                   Search Pipeline
                   Accelerator Units

Figure 2.1: Search abstraction stack

Software Application: This represents the user application that uses search operations and ultimately benefits from any form of search acceleration. These applications include classic examples such as database applications but are by no means limited to them. Depending on the application, it may be possible for the software application to directly use the primary search, but it will normally depend directly on secondary searches.

Secondary Search: This represents the software algorithms that perform different types of queries classified as secondary searches. These algorithms form the bulk of complex search operations used in different kinds of software applications. All these algorithms are fundamentally dependent on primary search primitives.

Primary Search: This represents the basic software algorithms that perform primitive searches directly on fundamental data structures. As mentioned above, certain applications may only depend on primary searches and not use secondary searches.

Interface Library: This represents the interface layer between the hardware and software layers. It provides the software hooks that convert the software requirements into hardware machine operations. This can take the form of a driver, a shared library, a language dependent source library or some other form dependent on application requirements.

Host Processor: This represents the main processor that runs the application software. The host processor is controlled by the software layers through the interface library and controls the hardware layers below to actually perform the search operation. Different host processor architectures may be used, as long as the interface library is changed accordingly.

Search Pipeline: This represents a specific combination of accelerator units that can be used to accelerate search operations. The pipeline is controlled by the host processor and performs the different operation stages by using the different accelerator units.

Accelerator Units: This represents the hardware primitive units that perform the basic operations of search. These are made up of the chaser, streamer and sieve units that provide the acceleration capability in hardware. These can be replaced by alternative hardware technologies, some of which are discussed later.

It may be useful to keep a mental picture of this search stack while reading the rest of this document. It will not appear again until the end of the document. The bulk of this document is loosely organised around this search stack.

2.2 Categorising Search

Different texts categorise search algorithms differently. As an example, in [Knu73], search is broadly divided into the following categories:

Internal and External searches are defined based on the data storage method. An internal search uses data stored only inside primary memory. An external search involves data stored inside disk storage.

Static and Dynamic searches are categorised based on the data structure used. A static search uses data that does not change with time. A dynamic search uses data that is subject to frequent record insertion and deletion.

Comparison and Attribute searches are divided based on the algorithm used. A comparison search accomplishes the search by selecting data based on key comparisons. An attribute search does not involve comparisons but filters out data by property flags.

But, for the purpose of this research, search is broadly organised into primary and secondary searches. This method of categorisation was chosen as the two searches present different problems.

2.2.1 Primary Search

Primary search involves searching a data space for primary keys, which uniquely identify a specific record within the data space. These keys are usually, though not necessarily, sorted into an in-memory index. The decision to sort the keys depends on how often the search is performed. If search is performed regularly, the cost of sorting the index during insertion will be minimal. The cost of sorting could potentially be further reduced with the aid of special purpose hardware sorting networks. Examples of sorting networks can be found in [CLRS01, Sto90, Shi06].

Primary search is analogous to finding local maxima along a function. In terms of computational resources, this rarely consumes complex operations, unless the values have to be computed on-the-fly. In most cases, comparison algorithms are used to traverse a tree-like structure. Operations are then limited to addition, subtraction and conditional branches: addition is used to keep track of the tree position, subtraction is used to perform a comparison and conditional branching is used for decision making.

From [Sto90], multiple processors will not be very efficient if they are used to perform a search by making multiple probes into a file ordered by a single search key. We cannot expect a multiprocessor to perform any single-key search much faster than a single processor can, but we can expect a multiprocessor to do many different searches in parallel with high efficiency. This becomes obvious when we realise that the surplus processing power provided by multi-core processors is wasted when complex computations are not used. Therefore, multiple processors should be used to conduct multiple independent searches. In turn, this causes memory to become the bottleneck, as memory bandwidth requirements increase linearly with the number of parallel search processes.

According to [Sto90], when database keys are unsorted, a serial search might have to examine the entire database. In this case, a multiprocessor search has the potential for excellent speedup. It is possible to spawn a search process whenever we hit a branch in a tree and have the processes move in opposite directions. But, as [Sto90] notes, the saving from parallelism is not truly the speedup observed; it is the saving in the overhead used to sort the database and maintain that sorted order. If this overhead is small, then the effectiveness of the parallelism is small. In many applications, the cost of sorting or building an index can be amortised over hundreds or thousands of searches, and rarely in such instances does it pay to perform parallel search. On the other hand, some problems in cryptography are essentially enormous searches that are only performed once per data space, where the equivalent of building an index is far more costly than searching the database in parallel with multiple processors.

Therefore, although multiple processors could potentially speed up a search, this approach only works for applications that cannot be indexed. However, similar to the earlier case, memory then becomes the bottleneck. The main technical problem for primary search is therefore the memory bottleneck.
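To make the operation mix concrete, the sketch below (a minimal illustration in C++, not taken from the thesis kernels; the type and field names are assumptions) shows an iterative descent of a pointer-linked binary search tree. The loop body contains only a comparison, a conditional branch and a pointer load, so its speed is dominated by how quickly each node can be fetched from memory rather than by arithmetic throughput.

    #include <cstdint>

    // Minimal binary-tree node; field names are illustrative only.
    struct Node {
        uint32_t key;
        Node*    left;
        Node*    right;
    };

    // Iterative primary-key probe: the loop is just compare, branch and
    // pointer load, so each step costs one (potentially uncached) memory access.
    const Node* probe(const Node* root, uint32_t key) {
        const Node* n = root;
        while (n != nullptr) {
            if (key == n->key)
                return n;                    // hit: unique primary key found
            n = (key < n->key) ? n->left     // comparison decides which
                               : n->right;   // pointer to chase next
        }
        return nullptr;                      // miss
    }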

2.2.2 Secondary Search

Secondary searches offer a different problem. Secondary search involves the search for secondary keys, which are generally non-unique values. Once again, the keys may or may not be sorted into an index. According to [Knu73], secondary search queries are usually restricted to at most the following three types: simple, range and boolean. The problem of discovering efficient search techniques for these three types of queries is already quite difficult, and therefore queries of more complicated types are usually not considered.

Simple Query is the search for a specified key within the search space, such as YEAR = 2008. In many ways, this may look similar to a primary key search except that the results are non-unique. If the keys are sorted, it will return a chain of references to the different data, while an unsorted index would require a complete traversal. As a result, such a search will degenerate into a linear traversal. There are software techniques available to optimise the query; one method is to batch process a few queries at a time. This suffers from the same problems as primary search: it does not scale well and will not benefit significantly from higher computational power.

Range Query is a search for values that fit within a specified range, such as YEAR in [2004:2008]. Just like the simple query, it looks similar to a primary key search and will ultimately degenerate into a linear traversal. Although it is still possible to optimise such queries in software, doing so may involve multiple traversals or more complex comparisons. Therefore, it uses more computational power than a simple query.

Boolean Query can combine any primary and secondary searches with boolean operators. Regardless of how it is optimised in software, it will still involve a large number of traversals. Combining different result sets is easy to visualise graphically, but computationally more difficult. In fact, it is suggested in [Knu73] to let people do part of the work, by providing them with suitable printed indexes to the information, but we will not consider this here. These types of queries are also complicated to optimise in software because the data is not known beforehand.

In the case of a secondary search, a larger proportion of the problem consumes computational power. The most difficult problem is the combination of multiple result sets in the boolean query, and this happens to be a very common form of query on large data sets. Therefore, a major technical problem for secondary search is computational complexity, which makes it a suitable candidate for hardware acceleration.
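As a software reference point only (the thesis kernels are built on the C++ STL, but the container choice, key type and function names below are assumptions), the three query types can be expressed against a sorted secondary index as follows. The boolean case makes the computational burden visible: it is not the index lookups but the combination of result lists that dominates.

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <vector>

    // Hypothetical secondary index: YEAR -> record identifier (non-unique keys).
    using YearIndex = std::multimap<uint32_t, uint32_t>;

    // Simple query: YEAR = 2008 returns a chain of matching record ids.
    std::vector<uint32_t> simpleQuery(const YearIndex& idx, uint32_t year) {
        std::vector<uint32_t> out;
        auto range = idx.equal_range(year);
        for (auto it = range.first; it != range.second; ++it)
            out.push_back(it->second);
        return out;
    }

    // Range query: YEAR in [lo:hi] walks every entry between the two bounds.
    std::vector<uint32_t> rangeQuery(const YearIndex& idx, uint32_t lo, uint32_t hi) {
        std::vector<uint32_t> out;
        for (auto it = idx.lower_bound(lo); it != idx.upper_bound(hi); ++it)
            out.push_back(it->second);
        return out;
    }

    // Boolean query: an AND of two sub-queries reduces to intersecting their
    // (sorted) result lists - the computationally expensive collation step.
    std::vector<uint32_t> booleanAnd(std::vector<uint32_t> a, std::vector<uint32_t> b) {
        std::sort(a.begin(), a.end());
        std::sort(b.begin(), b.end());
        std::vector<uint32_t> out;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::back_inserter(out));
        return out;
    }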

2.3 Data Structures & Algorithms

It is mentioned in [Knu69] that data is rarely simply stored as an amorphous mass of numerical values. The way in which the data is stored can also provide important structural relationships between the data elements. It is important to acquire a good understanding of the structural relationships present within the data, and of the techniques for representing and manipulating such structures within a computer.

2.3.1 Data Structures

From [Kor87], data structures are important because the way the programmer chooses to represent data significantly affects the clarity, conciseness, speed of execution, and storage requirements of the programme. Data structures are chosen so that information can be easily selected, traversed, inserted, deleted, searched and sorted. For this research, data structures are classified along two axes: static versus dynamic, and structures versus implementations. Any potential hardware acceleration of data structures would directly improve algorithm performance.

Static Structures change only their values, not their structure, and include arrays and records. Because their structure stays the same, even large structures are well defined and can benefit from hardware processing. Moreover, their layout is known during compile time and can be scheduled efficiently in software. These structures are often used in signal processing applications and are often hardware accelerated in stream processors through specialised memory addressing modes[KG05] and specialised hardware. However, these structures lack the power of dynamic structures and are rarely used to store complex relationships of data.

Dynamic Structures change their size and shape as well as their values, and include stacks, heaps, lists and trees. Dynamic structures are often used to store high-dimensional, non-linear data that may not be efficiently stored in a static structure, because resources are only allocated when needed. These structures do not have a defined layout, either during compile-time or run-time, so it is fairly difficult to accelerate them in hardware or software. From [KY05], the prevailing technique used for accelerating these structures is data pre-loading or pre-caching. These structures are often used to store large sets of data and need to be accelerated for search applications.

Static Implementations implement data structures statically in hardware and would certainly speed up all the operations on them. A content addressable memory (CAM) is fully associative and allows information to be searched and retrieved almost instantly. Certain applications, such as network routers, implement such a hardware structure to facilitate routing table lookups[PS06b]. But such structures are expensive and would not be feasible for any large data set. Other common data structures are regularly implemented in hardware, such as stacks and heaps. There has also been some work done [MHH02] on implementing complex graph structures directly in hardware. Such implementations would essentially move the search algorithm from software into hardware. However, whatever is gained in speed is sacrificed in flexibility.

Dynamic Implementations would typically be built in software as a pointer-linked structure. Memory can be dynamically and quickly allocated and freed as the structure grows and shrinks. Pointer-linked structures can usually only be traversed and searched from one direction at a time. Hence, these structures are notoriously difficult to accelerate in hardware. However, existing data sets are almost entirely implemented using this method and should be accelerated where possible.
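A minimal sketch (types and names assumed for illustration) of such a pointer-linked implementation shows why it is hard to accelerate: nodes are allocated individually on the heap, their addresses carry no structural information, and traversal is a serial chain of dependent pointer loads.

    #include <cstddef>
    #include <cstdint>

    // A dynamically allocated, pointer-linked list node: the layout of the
    // overall structure is only discovered at run-time, one node at a time.
    struct ListNode {
        uint32_t  value;
        ListNode* next;   // each step depends on the previous load completing
    };

    // Grow the structure only when needed: push a new node at the head.
    ListNode* push(ListNode* head, uint32_t v) {
        return new ListNode{v, head};
    }

    // Traversal can only proceed in one direction, one node per memory access.
    std::size_t length(const ListNode* head) {
        std::size_t n = 0;
        for (const ListNode* p = head; p != nullptr; p = p->next)
            ++n;
        return n;
    }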

2.3.2 Algorithms

From [CLRS01], informally, an algorithm is any well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output. An algorithm is thus a sequence of computational steps that transform an input into the output. Most of the algorithms studied by computer scientists that solve problems are types of search algorithms. There are many types of basic search algorithms and learning how they progress from one type to another will help us understand how to accelerate them.

Linear Search is a basic worst case search algorithm as it merely steps through the data space, one element at a time, until the key is found. The data structure could be either static or dynamic. While the worst case would take O(N) steps to finish, there are different methods to improve this search algorithm such as sorting the data based on value or frequency of access.

Binary Search can be employed if the data is stored in a sorted array. There are several variants on this method such as Fibonacci[Fer60] and interpolation search. At each iteration, the algorithm would quickly eliminate at least half the search space. The worst case would take O(log N) steps.
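For reference, a minimal C++ sketch of the halving step (an illustration, not code from the thesis):

    #include <cstdint>
    #include <vector>

    // Classic binary search over a sorted array: each iteration halves the
    // remaining search space, giving O(log N) probes in the worst case.
    int binarySearch(const std::vector<uint32_t>& sorted, uint32_t key) {
        int lo = 0, hi = static_cast<int>(sorted.size()) - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;    // midpoint without overflowing lo + hi
            if (sorted[mid] == key) return mid;
            if (sorted[mid] < key)  lo = mid + 1;   // discard the lower half
            else                    hi = mid - 1;   // discard the upper half
        }
        return -1;                            // key not present
    }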

Binary Tree Search can be employed if the data is structured in a tree. In a binary tree, one branch will contain values that are always smaller than the other branch, and each branch is a binary sub-tree. A binary tree search starts at the root of a tree and eliminates half the tree with each step, similar to the binary search. The number of entries traversed will depend on the maximum height of the tree. In the worst case, where there is only one branch at each node, it could degenerate into a linear search through a linked list.

Balanced Tree Search is the solution to a badly grown binary tree search. A balanced tree, such as a red-black tree, keeps the heights of the longest and shortest branches within a small bound of each other. Therefore, a near-minimum height tree is guaranteed and the balanced tree search will never degenerate into a linear search. However, these binary trees all depend on the entire tree being in memory and only allow entry at the root.

Multi-way Tree Search is used for large trees, where data may need to be split into multiple sub-trees and accessed individually. A multi-way tree, such as a B-tree, stores the root tree in memory but stores the large sub-trees on disk. The multi-way tree search can quickly traverse through the in-memory tree and locate a particular sub-tree that needs to be loaded. This allows the sub-trees to be entered at different points while only consuming a modest amount of memory. A survey of real world databases [Bor99, MA06, PDG05] shows that indices are often built using these trees, keeping parts of the index on disk and swapping pages when necessary. These trees can be considered a more advanced form of balanced tree; therefore, they could also benefit from balanced tree enhancements.
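A sketch of the node layout such a multi-way tree might use (the fan-out and field names are assumptions, not the layout of any particular database): each node holds many sorted separator keys, so a single node visit eliminates a large fraction of the search space, and each child slot can refer either to an in-memory child or to an on-disk page.

    #include <cstdint>

    // Illustrative multi-way (B-tree style) node with an assumed fan-out.
    constexpr int kFanout = 16;

    struct BTreeNode {
        int        count;                 // number of keys currently in use
        uint32_t   keys[kFanout - 1];     // sorted separator keys
        BTreeNode* child[kFanout];        // in-memory child, if resident
        uint64_t   pageId[kFanout];       // otherwise, on-disk page of the sub-tree
        bool       leaf;
    };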

Radix Tree Search works like any other tree search, with one critical difference: it does not rely on value comparisons to work. On certain processor architectures, the implementation of comparisons could be expensive. Instead, it checks the bit value at a specific position of the key and branches left or right depending on that value. Therefore, the amount of time it takes depends on the size of the key, rather than the number of elements. As an added benefit, it can also do lexicographical matching or wildcard matching, which is extremely powerful.
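A minimal sketch of the bit-test descent (illustrative only, over a fixed 32-bit key): the branch direction is taken from one bit of the key, so no magnitude comparison is ever performed and the depth is bounded by the key width rather than by N.

    #include <cstdint>

    // Radix (bit-test) tree node over 32-bit keys.
    struct RadixNode {
        bool       leaf;
        uint32_t   key;        // valid when leaf is true
        RadixNode* child[2];   // 0-branch and 1-branch
    };

    // Descend by testing one key bit per level - no value comparisons needed.
    const RadixNode* radixFind(const RadixNode* root, uint32_t key) {
        const RadixNode* n = root;
        for (int bit = 31; n != nullptr && !n->leaf && bit >= 0; --bit)
            n = n->child[(key >> bit) & 1u];
        return (n != nullptr && n->leaf && n->key == key) ? n : nullptr;
    }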

2.4 Search Problems

It should be evident at this point that search exhibits different problems. Most computational problems can be solved by devoting more computational hardware to the problem. Although secondary search may benefit from having additional computational resources, primary search would resist such attempts. Therefore, the current trend of increasing computational performance by means of additional processor cores would increase the throughput of multiple searches but would do almost nothing for a single search.

Additional computational units go hand in hand with increases in memory bandwidth requirements. As primary search is a memory problem, one may think that increasing on-chip cache is a useful solution. However, this is not a panacea, as the increase in memory cannot continue indefinitely. There may come a day when we have multi-gigabyte cache memories, but by then our databases will probably be terabytes large. Also, subsequent chapters will show that a larger cache size does not necessarily result in better performance.

In standard databases, dynamic structures such as trees are often used to store and sort information. They are often defined dynamically during run-time and implemented dynamically in the data memory heap. This means that their characteristics are not well defined prior to actual use. As a result, they are not easy to accelerate directly in software or hardware.

It can be argued that search is primarily an I/O limited problem. However, with the cost of present technology, it is feasible to circumvent this by storing entire databases in primary memory. Therefore, this thesis assumes that databases are stored entirely in memory, and any attempt to accelerate search need only deal with the problem of slower primary memory. Improvements in memory technology will help, but will not solve the problem until the day when whole databases fit inside fast cache memory.

CHAPTER 3

Search Application

The application layer is the top-most software layer. Almost any application that searches through a data set for records that match a number of fitting criteria is considered a search application. This task can be further broken down into several stages that work collectively as a search pipeline. An analysis of actual code profiles for each pipeline stage will reveal that they are different from regular computing code and may require special attention.

3.1 Search Application

Search is a broad problem and search applications encompass a large number of problem types that go beyond the scope of this research. Every type of computer application exploits search operations at its core, and these differ depending on the application type. Encryption cracking software performs a search for the encryption key, while a chess playing programme searches through a tree of potential positions to find the best move. Although many of these problems are unique and very difficult, they are often less commonly used and would not benefit much from hardware acceleration. An alternative type of search, performed regularly, is the one performed on a process table whenever an operating system starts or stops a process. In this type of search, the computer has to go through a finite data set, looking for one or more records that match a fitting criterion or criteria. This form of search is a generic search operation and will benefit directly from any hardware acceleration.

3.1.1 Example Query

Assume for a moment that there is a flu outbreak that only kills cats and the local authorities want to inform all cat owners of this outbreak. Also, assume that there exists a massive directory of the entire human population of the United Kingdom and it holds all kinds of information about individuals, including pets they own and their city of residence. So, if the local council wished to find one or more individuals who are resident in Cambridge and who own pet cats, an example query can be performed. This query can be characterised using the following SQL-like statement1:

SELECT person FROM population WHERE pet=cat AND city=cambridge

SELECT, FROM, WHERE and AND are all SQL keywords. The person represents the individual or individuals being searched for. It may return one or more results, depending on how many pet cat owners reside in Cambridge, or no results if no one in Cambridge owns a cat. The population represents the massive directory that needs to be searched. In search algorithms, this massive directory is called the search space or database, and the size of this search space is represented by N records. The cat and cambridge criteria represent the fitting criteria used to filter out the results. In this case, only pet cat owners residing in Cambridge should be identified.

It is easy to look at things from this perspective as it exhibits many characteristics of a common search. Viewed abstractly, this search query reduces to searching a data set for one or more records that match one or more fitting criteria. Any application that performs this type of operation regularly is classified as a search application.

3.1.2 Pipeline Breakdown

The example query above can be broken down into a number of simple sub queries, which return multiple result streams that are then combined into a final result stream. From the above description, the search operation can be broken down into a series of operations. Figure 3.1 illustrates these operations.

DATA SET → KEY SEARCH → (key) → LIST RETRIEVAL → (result streams) → RESULT COLLATION → RESULTS

Figure 3.1: Typical search pipeline

1an SQL statement is merely used for illustration purposes as this research is not SQL focused

Key Search can be performed on each criterion of the query statement. The input to this stage is the index structure and the output is a key. Indices are often stored in a balanced tree structure. Unique key searches involve a balanced tree search through the tree. There are many algorithms that can be used to search through a tree, depending on the size of the tree. For sufficiently large trees, the amount of time taken to perform the search by the best algorithms is in the order of O(log N). For the example query, depending on the processor and the number of criteria, a number of key searches may be performed in parallel to speed it up. Once the keys are located, a list of results can be retrieved.
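As a software reference point (the container choice is an assumption; the thesis kernels are built on the C++ STL), the key-search stage amounts to an O(log N) lookup in a balanced-tree index that yields the head of a result list for the next stage:

    #include <cstdint>
    #include <map>
    #include <vector>

    // Balanced-tree index (std::map is typically a red-black tree):
    // criterion value -> sorted list of matching record ids.
    using Index = std::map<uint32_t, std::vector<uint32_t>>;

    // Key search: O(log N) descent through the index; the payoff is a
    // pointer to the list that the list-retrieval stage will stream out.
    const std::vector<uint32_t>* keySearch(const Index& idx, uint32_t key) {
        auto it = idx.find(key);
        return (it == idx.end()) ? nullptr : &it->second;
    }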

List Retrieval searches are not actually searches but form part of the search pipeline. The input to this stage is the starting point of the structure and the output is a list of potential results. A list can be organised in many ways but is most commonly organised as a pointer-linked structure, such as a linked list. If the list is not in sorted order, the bottleneck will once again be in the data structure, so for most intents and purposes the list can be assumed to be sorted. In this case, list retrieval is an operation that pulls data from memory into the processor. For such an operation, the best algorithms will take on the order of O(N) time to completely retrieve each list. Once again, depending on the power of the processor and the number of criteria, retrievals may be performed in parallel. The final stage of the operation is to collate the results.
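A sketch of the retrieval stage, assuming the list is kept as a sorted, pointer-linked structure (the types are illustrative): each element costs one dependent memory read, so the stage is O(N) and memory bound.

    #include <cstdint>
    #include <vector>

    // Sorted, pointer-linked result list as produced by the key search.
    struct ResultNode {
        uint32_t    recordId;
        ResultNode* next;
    };

    // List retrieval: stream the whole list out of memory into a buffer.
    // Writing each element back out mirrors the result buffering performed
    // by the profiled kernel; if the results were consumed internally, the
    // stage would be almost purely 'read'.
    std::vector<uint32_t> retrieve(const ResultNode* head) {
        std::vector<uint32_t> out;
        for (const ResultNode* p = head; p != nullptr; p = p->next)
            out.push_back(p->recordId);
        return out;
    }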

Result Collation operations can be considered a search if they involve any form of result operation. In the example query, the two result lists need to be intersected. This is often a bottleneck in the search operation, as the number of results returned from the list retrieval may be significantly larger than the actual number of final results needed. As described earlier, this is a computationally intensive operation and can benefit from hardware acceleration. There are many possible software algorithms that can be applied to this operation. If the result lists are sufficiently short and randomly accessible, a fast intersection algorithm could take O(log N) time to complete. However, for a sufficiently large list that is not randomly accessible, a typical algorithm will take O(N) time to complete. This stage is difficult to parallelise because all the earlier results need to be fed in. After this, the bulk of the search operation is essentially complete and the results are available.
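A sketch of the collation step for the two-criterion example query, assuming both retrieved lists arrive in sorted order (an illustration, not the thesis kernel): the two-pointer merge below visits each element once, i.e. O(N) for lists that cannot be randomly accessed.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Intersect two sorted result lists with a two-pointer merge: advance
    // whichever side holds the smaller id, and emit ids present in both.
    std::vector<uint32_t> collate(const std::vector<uint32_t>& a,
                                  const std::vector<uint32_t>& b) {
        std::vector<uint32_t> out;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])       ++i;
            else if (b[j] < a[i])  ++j;
            else {                              // common record: keep it
                out.push_back(a[i]);
                ++i; ++j;
            }
        }
        return out;
    }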

Each individual stage of the pipeline can be performed in software by modern processors. However, each stage needs to wait for results from the previous stage before it is able to continue. Therefore, the above series of operations can be considered a search pipeline. This project will treat each stage of the pipeline as its own operation and accelerate each stage in hardware.

3.1.3 Query Illustration

There are many ways to organise information and just as many ways to search through them. In our example query, there could be a single file on each resident that is sorted by alphabetical order in folders, files, cabinets and rooms. Alternatively, there could be a number of ledgers that hold a list of numbers identifying specific resident files by room, cabinet, file and folder. There could be a ledger representing pet cat owners and another ledger representing residents of Cambridge and these ledgers are stored in book shelves by category. In the first case, the data is structured badly, without any index or keys built for the information. A badly organised database would not benefit from any form of hardware nor software acceleration. As a result, the only way to search through this information is to inspect each individual resident file to see if they are pet cat owners and residents of Cambridge. There is no reason in working on this scenario as no amount of hardware acceleration is going to help, when the bottleneck is the database itself. This search will take a very long time to perform, even on the fastest supercomputers and the best way to accelerate the search operation would be to reorganise the database. In the second case, the data is structured, with a number of indices built for the information. The cat and cambridge ledgers will first need to be retrieved from the pet and city book shelves. Once located, each ledger contains a list of records (assumed to be sorted) that reference specific individual files. So, the cat ledger will identify specific pet cat owners and the cambridge ledger will identify residents. An intersection operation can then be performed on both the lists to find the common records of both ledgers. The resultant references can then be used to locate the specific person from the massive population directory. Although, the types of algorithms that are involved in the above operations are beyond the scope of this research, assume that the data is already organised in an optimised manner and that the algorithms used to process them are also optimised. Therefore, any bottleneck that exists, will be caused by the processing of the search algorithm. This is the opportunity where hardware acceleration may be able to help. Theoretically, the actual number of results retrieved could be from nothing (no one in Cambridge owns cats) to the entire population of the UK (assuming that everyone is resident in Cambridge and they all owns cats). As all search operations are dependent on the size of the search space (N limited), hardware acceleration would be more beneficial for large data sets than small ones. Therefore, this research will focus on methods to

accelerate a significantly large data set.

3.2 Search Profile

Before we proceed, it is important to understand how the individual search pipeline stages are performed in a regular microprocessor. In order to do this, a basic search software kernel was written using the C++ STL library. The software was then run through a simulator and the resultant operations profiled. Listing 3.1 illustrates how this is done in the Verilog simulation construct. Profiling was turned on in the Verilog simulation just before the search functions were called. When the dump variable is set (line 100), profiling sends the status of certain important pipeline registers to the output (lines 103–127), including ASM, which represents the instruction register. The dump variable is toggled (line 186) by a virtual device mapped to memory location 0x40009000, which was activated by a memory write in software. The output was passed through a parser and statistics were taken on each type of instruction. According to [FKS97], branches account for about 20% of general-purpose code and about 5–10% of scientific code, with conditional branches accounting for about 80% of them; loads and stores are frequent operations in RISC code, making up about 25–35% load and about 10% store instructions. Therefore, general-purpose code has about 35–45% memory operations, 20% branches and 35–45% arithmetic operations. Table 3.1² shows the profile of a typical search kernel and a breakdown of the operations for the kernel codes shown in Listings 3.2, 3.3 and 3.4.
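The parsing step itself is straightforward. The sketch below is not the parser used to produce the thesis results; it only illustrates the idea of splitting the comma-separated profiler output and tallying the ASM field, with classify() standing in for a host-ISA-specific opcode decoder.

   #include <iostream>
   #include <map>
   #include <sstream>
   #include <string>

   // Crude stand-in for a proper decoder: bucket instructions by the leading
   // hex digits of the instruction word.  A real parser would map opcodes to
   // the arithmetic/branch/memory classes used in Table 3.1.
   static std::string classify(const std::string &asmHex)
   {
       return asmHex.substr(0, 2);
   }

   int main()
   {
       std::map<std::string, long> histogram;
       std::string line;
       while (std::getline(std::cin, line)) {          // one profiler record per line
           std::istringstream fields(line);
           std::string field;
           while (std::getline(fields, field, ',')) {  // split "KEY=VALUE" pairs
               if (field.compare(0, 4, "ASM=") == 0)
                   ++histogram[classify(field.substr(4))];
           }
       }
       for (std::map<std::string, long>::iterator it = histogram.begin();
            it != histogram.end(); ++it)
           std::cout << it->first << "\t" << it->second << "\n";
       return 0;
   }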

3.2.1 Key Search

Glancing at this profile, it is evident that search code has a similar profile to general-purpose computing code. However, significant differences appear on closer inspection. The number of memory operations is similar to the roughly 35% suggested for general-purpose code. However, the proportion of store operations is almost half that of general-purpose code, while the proportion of load operations is significantly higher. The large number of read operations means that key search code behaves in a mainly 'read' manner. This suggests that speeding up memory reads might be beneficial. However, reading is usually a cheaper operation than writing and any benefits gained may be insignificant. The number of conditional branches is only slightly higher than that of general-purpose code. However, the total number of branches taken is almost 50% more than that of general-purpose code.

² The total percentage is 100 ± 1% due to rounding errors.

Type              Key Search   List Retrieval   Result Collation
Arithmetic            37%            21%               47%
  Compare             44%             0%               13%
  Logic                2%             1%                0%
  Addition            39%             3%               27%
  Subtraction         15%            96%               50%
Branch                29%            20%               28%
  Conditional         84%            99%               71%
  Unconditional        9%             0%               27%
  Return               6%             1%                1%
Memory                32%            59%               26%
  Load                81%            67%               99%
  Store               19%            33%                1%
Miscellaneous          3%             1%                3%

Table 3.1: Search Profiles

This indicates that decision-making code is more common in search operations. This suggests that speeding up branch penalties or eliminating branches might be beneficial. The bulk of the key search code comprises arithmetic operations. Almost half of the arithmetic operations performed were comparisons. This is indicative of search operations, as searches mainly involve comparing values with a key. This again suggests that accelerating comparisons in hardware may be beneficial.

3.2.2 List Retrieval

Looking at the middle column of Table 3.1 immediately tells us that list retrieval is very different from general-purpose code. Although the 20% branches are as expected for general-purpose code, the proportion of memory operations, at 59%, is significantly higher at the expense of arithmetic operations, at only 21%. As its name suggests, the list retrieval operation is memory intensive, as it tries to retrieve the entire list from memory. In this case, the ratio of writes to reads is about 1:2 because each node that is read is also written back to memory by the kernel to simulate result buffering. However, if the data read in is used internally, the number of write operations drops dramatically and the operation becomes 'read-only'. No comparison instructions are used, while subtraction is used the most. Although code listing 3.3 does not employ any explicit subtraction, the compiler uses it to decrement the list counter. These results show that list retrieval is a memory intensive operation and may benefit from accelerated memory operations, but it will not benefit much from accelerating

computational operations.

3.2.3 Result Collation

Looking at the right column of Table 3.1, the profile is again significantly different from the general-purpose profile. A large number of branches are again used, most of them conditional. This is indicative of the decision making involved in result collation. There are relatively few memory operations and they are almost all reads. As the name suggests, result collation is a compute-intensive operation. Almost half the operations performed are computational. However, the majority of these are subtractions. While code listing 3.4 does not employ any explicit subtraction, it is frequently generated by decision-making code to set the sign and overflow conditions that allow microprocessors to perform conditional branches. Surprisingly, only about a quarter of the operations are memory operations. This is a very different profile from the previous two, which are memory intensive. Result collation is a computationally intensive operation and is a good candidate for hardware acceleration.

3.2.4 Overall Profile

From the overall results, it can be seen that search is indeed an expensive operation. It is well known that branches and memory operations are expensive when compared to simple arithmetic operations. Search algorithms consume significantly more branches and memory operations than general-purpose code, while only certain types of computational operation, compares and subtractions, are heavily used in search. In addition, as each stage is so different, it might be better to design stage-specific hardware accelerators than to design a universal hardware accelerator for search. The key search stage resembles a general-purpose operation but is both memory and computationally intensive. The list retrieval stage is primarily memory intensive. The result collation stage is mainly computationally intensive. Each accelerator stage can then be combined in different ways, in order to accelerate different kinds of search.

     // DUMP CYCLES
 97  reg dump;

     always @(posedge sys_clk_i)
       if (dump & core0.risc0.cpu0.dena) begin //begin
 `ifdef AEMB2_SIM_KERNEL
102       $displayh("TME=",($stime/10),
                    ",PHA=",core0.risc0.cpu0.gpha,
                    ",IWB=",{core0.risc0.cpu0.rpc_if,2'o0},
                    ",ASM=",core0.risc0.cpu0.ich_dat,
                    ",OPA=",core0.risc0.cpu0.opa_of,
107                 ",OPB=",core0.risc0.cpu0.opb_of,
                    ",OPD=",core0.risc0.cpu0.opd_of,
                    ",MSR=",core0.risc0.cpu0.msr_ex,
                    ",MEM=",{core0.risc0.cpu0.mem_ex,2'o0},
                    ",BRA=",core0.risc0.cpu0.bra_ex,
112                 ",BPC=",{core0.risc0.cpu0.bpc_ex,2'o0},
                    ",MUX=",core0.risc0.cpu0.mux_ex,
                    ",ALU=",core0.risc0.cpu0.alu_mx,
                    //",WRE=",dwb_wre_o,
                    ",SEL=",dwb_sel_o,
117                 //",DWB=",dwb_dat_o,
                    ",REG=",core0.risc0.cpu0.regs0.gprf0.wRW0,
                    //",DAT=",core0.risc0.cpu0.regs0.gprf0.regd,
                    ",MUL=",core0.risc0.cpu0.mul_mx,
                    ",BSF=",core0.risc0.cpu0.bsf_mx,
122                 ",DWB=",core0.risc0.cpu0.dwb_mx,
                    ",LNK=",{core0.risc0.cpu0.rpc_mx,2'o0},
                    ",SFR=",core0.risc0.cpu0.sfr_mx,
                    ",E"
                    );
127  `endif
       end // if (uut.dena)

     always @(posedge sys_clk_i) begin
154     // DATA WRITE
        if (dwb_stb_o & dwb_wre_o & dwb_ack_i) begin
           case (dwb_adr_o[31:28])
             4'h0: // INTERNAL MEMORY
               case (dwb_sel_o)
159              4'hF: rDLMB[dwb_adr_o[DLMB:2]] <= #1 dwb_dat_o;
                 4'hC: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {dwb_dat_o[31:16], wDLMB[15:0]};
                 4'h3: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:16], dwb_dat_o[15:0]};
                 4'h8: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {dwb_dat_o[31:24], wDLMB[23:0]};
                 4'h4: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:24], dwb_dat_o[23:16], wDLMB[15:0]};
164              4'h2: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:16], dwb_dat_o[15:8], wDLMB[7:0]};
                 4'h1: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:8], dwb_dat_o[7:0]};
               endcase // case (dwb_sel_o)
             4'h8: // EXTERNAL MEMORY
               begin
169               case (dwb_sel_o)
                    4'hF: rDOPB[dwb_adr_o[DOPB:2]] <= #1 dwb_dat_o;
                    4'hC: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {dwb_dat_o[31:16], wDOPB[15:0]};
                    4'h3: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:16], dwb_dat_o[15:0]};
                    4'h8: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {dwb_dat_o[31:24], wDOPB[23:0]};
174                 4'h4: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:24], dwb_dat_o[23:16], wDOPB[15:0]};
                    4'h2: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:16], dwb_dat_o[15:8], wDOPB[7:0]};
                    4'h1: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:8], dwb_dat_o[7:0]};
                  endcase // case (dwb_sel_o)
 `ifdef SAVEMEM
179               #1 $fdisplayh(swapmem, "\@",dwb_adr_o[DOPB:2]," ",rDOPB[dwb_adr_o[DOPB:2]]);
 `endif
               end
             4'h4: // I/O DEVICES
               case (dwb_adr_o[15:12])
184              4'h0: $write ("%c", dwb_dat_o[31:24]);
                 4'h9: dump <= !dump;
               endcase // case (dwb_adr_o[15:12])
             default: $display ("DWB@%h<=%h", {dwb_adr_o,2'o0}, dwb_dat_o);
           endcase // case (dwb_adr_o[31:28])
189        // $display ("DWB@%h<=%h", {dwb_adr_o,2'o0}, dwb_dat_o);
        end // if (dwb_stb_o & dwb_wre_o & dwb_ack_i)

Listing 3.1: Verilog profiling construct

   int swchase(std::set<int> &setA, int pkey)
   {
   #ifdef DEBUG
      iprintf("PKEY\t: 0x%X\n", pkey);
   #endif

      volatile int j = (int)&*setA.find(pkey)._M_node;

   #ifdef DEBUG
      iprintf("FIND\t: 0x%X\n", j);
   #endif

      return EXIT_SUCCESS;
   }

Listing 3.2: Key search profile kernel

   int swstream(std::list<int> &listA)
   {
      for (std::list<int>::iterator node = listA.begin();
           node != listA.end(); node++) {
         volatile int j = *node;

   #ifdef DEBUG
         iprintf("HIT\t: 0x%X\n", j);
   #endif
      }
   }

Listing 3.3: List retrieval profile kernel

   int swsieve(std::list<int> &listA, std::list<int> &listB)
   {
      std::list<int>::iterator idxA, idxB;

      idxA = listA.begin();
      idxB = listB.begin();

      while ((idxA != listA.end()) && (idxB != listB.end())) {
         if (*idxA == *idxB) {
            volatile int j = *idxA;  // HIT!!
   #ifdef DEBUG
            iprintf("HIT\t: 0x%X\n", j);
   #endif
            idxA++;
            idxB++;
         } else if (*idxA < *idxB) {
            idxA++;
         } else {
            idxB++;
         }
      }

      return EXIT_SUCCESS;
   }

Listing 3.4: Result collation profile kernel

CHAPTER 4

General Architecture

Some general architectural decisions need to be made at the very start. The accelerator units were designed to work alongside a host processor in a heterogeneous computing environment. An existing host microprocessor was used instead of designing a system from the ground up, in order to exploit existing software tools for development and testing. C++ was chosen as the primary software language, along with STL as the default library. Some initial ideas of using a stack processor were also explored but ultimately discarded.

4.1 Initial Considerations

It was clear that the work would involve studying both hardware and software operations. Working at the hardware-software boundary allows greater leeway in determining where to draw the line between functionality. On the hardware side, microprocessor cores are increasingly becoming commodities that can be allocated to a problem to solve it. Hence, an obvious method for accelerating search operations would be to spread the task across multiple processors. This is the most obvious path presently being taken by various microprocessor vendors across the desktop, server (Sun) and embedded (ARM) markets. Alternatively, an architecture can be designed to provide small and fast algorithmic support for individual search sub-operations. This is a more flexible method of addressing the problem of generic search and would benefit the most applications. It must not be an attempt at designing a hardware search engine, as a hardware search engine would

certainly be speedy, but it would not be useful for much else. On the software side, search algorithms can also be improved by changing the algorithm architecture. Any improvement in the class of algorithm can have dramatic effects on performance. If a hardware accelerator can be designed to support improvements in algorithm structure, it would prove to be doubly useful.

4.2 Hardware Architecture

From earlier considerations, it appears that the best way to accelerate search is to spread out the work across multiple hardware cores. The question is the form of multiprocessing that this should take. Independent searches can definitely be split up across multiple independent NUMA¹ machines. However, it may not be feasible to do this for secondary searches that involve data from a common data set.

4.2.1 Multi-Core Processing

Homogeneous processors use multiple replicated copies of the same hardware core to increase processing power. This is useful for software engineers as it is easy to model and distribute multiple tasks across a homogeneous hardware platform [Mer08]. Such hardware is suitable for general-purpose computing but is less suitable for special application computing, as it consumes a large amount of chip resources that may not be used for the specialised computational task. Heterogeneous processors use multiple dissimilar cores to increase processing power. This is useful for hardware engineers as it is a more efficient use of chip resources [Mer08]. However, it is more difficult to write software for, as each core has different computational and memory requirements. In fact, each core may even be of an entirely different computing architecture. However, as the focus of this research is very application specific (search only), it is possible to use heterogeneous processing as the way to accelerate computational performance. This can be accomplished by designing extensions to an existing processor instead of designing an entirely new processor. This allows some functions to be implemented in the host processor software instead of having to implement everything in accelerator hardware. It would also allow us to easily benchmark the performance of the software running with and without the accelerator. By exploiting an existing host processor architecture, it would not be necessary to design an accompanying software toolchain. This will ultimately simplify software development. Therefore, this was the path chosen for the development of the hardware search accelerator.

¹ NUMA is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor [Wik09b].

4.2.2 Word Size

Although 64-bit microprocessors are slowly becoming the norm, most applications are still overwhelmingly 32-bit. Therefore, there is little attraction in using a 64-bit word length for either the hardware accelerator or the host processor. For expediency, the internal architecture was selected to use a basic 32-bit word-length for the proof of concept. However, there is no reason that the design cannot be completely converted into a 64-bit design or higher, if necessary, for future work.

4.2.3 Host Processor

The role of the host processor is mainly to configure the accelerator and to supply it with data. Its secondary role would be to provide comparisons between the software and hardware search operations. This means that the host processor has a minor role to play and should be minimalist. The focus of the host processor should be on size and simplicity rather than raw computational performance. Using an open-source processor would provide full access to the design, which will facilitate hardware accelerator integration and simulation. There are several popular microprocessor designs available under an open-source license. Many of these [Sun06, Int07, LCM+06] microprocessors are unnecessarily complex. Therefore, a simpler soft microprocessor architecture was chosen for the host processor. An open-source Verilog implementation[Tan04] of the Microblaze[Xil04] is used as the host processor. It is a DLX-like 32-bit RISC microprocessor which is mainly designed for small embedded applications. In addition to an instruction and data memory bus, it has the advantage of having a dedicated accelerator bus. This third bus can be used as a private communication and configuration bus between the accelerator and host processor. It was also designed by the author of this thesis who has intimate knowledge of its inner workings and can easily modify it when necessary. It is also sufficiently mature and independently tested by users of the processor.

4.3 Software Architecture

Hardware development has to go hand in hand with software development. Otherwise, there will not be any way to exploit the advances made on the hardware platform. In order to accelerate software development, libraries are used where possible.

4.3.1 Software Toolchain

The chosen processor has a mature C/C++ software toolchain based on the GNU Compiler Collection (GCC version 4.1.1). This simplifies writing software for the host

processor and also leverages existing software libraries. This allows certain functionality to be emulated in software where necessary. Although many arguments may be made about the performance of C versus C++ code, a decision was made to use C++ for development. The main reason is the prevalence of high-level libraries for C++, while still being able to use C code that is closer to the hardware. Using techniques from [Sak02], the accelerator can be interfaced using low-level C in a library for the host processor. Simple code tests also show that the code generated for both languages is very similar in performance. The main factor that determines code performance is the optimisation level used and not the language. The -O2 optimisation level was used for almost all the code compiled for testing. In early tests, it was found that -O3 optimisation often resulted in a larger code size and slightly slower running code. The -O1 optimisation often resulted in code that made more memory accesses than necessary. So, the final chosen optimisation level was the best trade-off [Jon05] in terms of performance and size.

4.3.2 Standard Libraries

Initially, several third-party libraries were used during testing. However, this caused problems as varying results were encountered with the different software libraries. As a result, it was decided that a standard library had to be used for development and testing. Certain libraries have been optimised for their specific applications, but of the many popular libraries available, a decision was made to support the C++ Standard Template Library. Being a standard library, it is both robust and mature [Ste06]. It has met much success through the years, making it suitable for the widest range of applications [Str94]. As a standard C++ library, it is the first port of call for many programmers, as many are familiar with it. As a result, it would have been heavily used in existing applications. Hence, there would be more trust in the integrity and performance of its results, and a hardware accelerator that is proven to be capable of accelerating C++ STL library routines would have the widest benefit. The C++ STL presents many basic data structures (trees, lists, queues, stacks) and related algorithms that operate on these data structures [SL95]. Through its template architecture, these data structures can be easily wrapped around various data types, including user-defined data types and data structures. This makes the C++ STL a very powerful library for writing software algorithms.

4.3.3 Custom Library

Both a custom mid-level and a custom low-level accelerator library were written. This decision was taken to make the accelerator platform-agnostic. The low-level library provides primitive read and write routines to access the accelerator registers directly. These are the only routines that need to be modified for different host processor architectures and system-level designs. The mid-level library provides a user-friendly structure on top of the low-level interface library. It provides data structures that can be manipulated by external software, and mid-level software routines to access the accelerator. This allows user software to be abstracted as simple functions instead of calling the hardware routines directly.
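For illustration, a memory-mapped low-level layer of this kind typically reduces to a pair of volatile pointer accesses. The base address and the function names below are invented for this sketch and are not taken from the hsx library:

   #include <stdint.h>

   // Hypothetical base address of the accelerator register window; in
   // practice this depends on the system-level design and host bus interface.
   #define HSX_BASE 0x40000000u

   // Only these two primitives would need porting to a different host
   // processor or bus; the mid-level library sits on top of them unchanged.
   static inline void hsxWrite32(uint32_t offset, uint32_t value)
   {
       *(volatile uint32_t *)(HSX_BASE + offset) = value;
   }

   static inline uint32_t hsxRead32(uint32_t offset)
   {
       return *(volatile uint32_t *)(HSX_BASE + offset);
   }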

4.4 Initial Architecture

Figure 4.1 shows an early conception of a potential accelerator architecture. Although the final design is significantly different from this, it is beneficial to introduce this early design. It helps to understand the train of thought and also the reasons for changes that were made in the end. The accelerator was broken up into three modular sections: bus interface, accelerator and memory interface.


Figure 4.1: Initial hardware search accelerator architecture

Bus interfacing was made a modular section in order to make the accelerator platform-agnostic. This interface is used to interact with any host processor and will depend on the host architecture. For example, a HyperTransport module could be used for interfacing with AMD processors, or a PCIe module could be used for generic PC interfacing. However, this is the subject of an end-user application and is not directly pertinent to this research.

RAM interfacing was made a modular section for a similar reason. This allows the accelerator to access the popular memory technology of the day. Today, this is

DDR2-SDRAM, but there is no reason that the accelerator should not be used with any future technology. Again, this is subject to an end-user application and is not directly pertinent to this research.

Element Array actually makes up the main part of this research. This holds a number of small accelerator cores with the different interfaces to each side. The cores can be linked together to form a search pipeline.

It was important to see if this accelerator architecture was viable. In this case, it was, because there was a method to configure a number of accelerators with access from the host processor and to memory. For this exercise, the details of each block were not important. However, there were a couple of potential bottlenecks in this architecture. Firstly, memory was going to be a bottleneck, as all the accelerator units access memory through the same memory interface. However, without actually changing existing computing architecture practices, there is little that can be done. Ultimately, all the data sits in a common main memory that needs to be accessed by the accelerators. There are existing techniques to increase memory bandwidth, and such details can be handled by the modular memory interface. Secondly, the communication with the host processor through the accelerator bus is another bottleneck. This is used for configuration purposes, which should be fairly light on traffic. However, it is also used to access the results of the search operations. Depending on how the search pipelines were configured, this may be a fairly significant amount of traffic. Therefore, as much of the operation should be offloaded to the accelerators as possible. This reduces the traffic to only that which is relevant to retrieving results.

4.4.1 Stack Processors

In the past, stack processors have had limited success in mainstream applications. However, many general-purpose stack processors have been studied in [Koo89] and they are just as powerful as mainstream RISC/CISC architectures. In recent years, stack processors have started to see a revival, such as in [LaF06, Pay00], for general-purpose use. All recursive algorithms use a stacking model of operation. Although it is not strictly necessary to traverse a tree recursively, as explained in [Kor87], recursive algorithms are usually implemented as such. Hence, it made sense to use a stack architecture in the design as it was intrinsically suited. There are many advantages to using a stack processor, such as fast procedure calls and returns and reduced instruction complexity, all of which reduce the computational cost of search. So, the search accelerator design began around a stack architecture.

30 However, designing a custom stack microprocessor also required writing a custom toolchain for it. Forth is a commonly used high-level language for programming stack based machines, though it is also possible to use other languages. As the research was into the design of an accelerator, a choice was made, fairly early in the design process, to design it independent of processor architecture. Therefore, it did not make sense to design a special purpose microprocessor architecture to process it. Figure 4.2 shows an initial idea for a custom stack-based accelerator unit. Although it was ultimately decided to abandon the idea of a stack-based processor, this figure presents some interesting ideas. The figure shows the use of a pointer engine, which is an addressing device to off-load the calculation of pointers. However, as most dynamic data structures are not calculated, the pointer engine ultimately became a simple look up device.


Figure 4.2: Initial stack based accelerator architecture

Using a stack architecture also added an extra level of information available to the processor. It allowed the hardware to keep track of the level of tree traversal by keeping count of the push and pop operations. This idea was retained in order to exploit the stack level information. This will be explored later in Section 8.6.2.

CHAPTER 5

Streamer Unit

The streamer is the simplest accelerator unit to understand. Functional and timing simulation results show that the streamer off-loads work from the host processor but does not achieve any significant speed-up. However, as a simple-to-understand unit, it is used to illustrate the steps taken in writing the simulation software kernel and simulating the device.

5.1 Introduction

The design of the accelerator can start from any part of the pipeline. However, of the different stages of the search pipeline, the simplest operation to perform is list retrieval. Therefore, it can be used to illustrate the processes involved, while keeping everything else simple. In any search query, once a key has been located, the next operation is to extract one or more records that are related to the key. In STL terms, a map data structure could be used to map a key to any secondary data structure, such as a list. In the case of the example search query, once the key cat is found, a list of records that contain cat can then be retrieved from memory. Hence, the next operation is to pull results into the accelerator. This secondary structure would typically be stored in a pointer-linked data structure. Although it is by no means limited to being a linked list, a linked list is used as an example because it is the most primitive dynamically linked data structure. In order to accelerate the processing of this list, a streamer unit can be used.
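The STL shape of such an index might look like the following sketch. It is purely illustrative: the key words and record identifiers are invented, and the thesis software is not organised exactly this way.

   #include <list>
   #include <map>
   #include <string>

   typedef std::list<int> RecordList;                 // postings list of record ids
   typedef std::map<std::string, RecordList> Index;   // key -> secondary structure

   void example()
   {
       Index index;
       index["cat"].push_back(1021);         // hypothetical pet-cat owners
       index["cat"].push_back(2740);
       index["cambridge"].push_back(1021);   // hypothetical Cambridge residents
       index["cambridge"].push_back(3355);

       // Key search: locate the list associated with the key "cat"...
       Index::iterator hit = index.find("cat");
       if (hit != index.end()) {
           // ...list retrieval: walk the linked structure the key points to.
           for (RecordList::iterator it = hit->second.begin();
                it != hit->second.end(); ++it) {
               volatile int record = *it;    // candidate result for collation
               (void)record;
           }
       }
   }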

5.1.1 Design Considerations

From the beginning, it is fairly clear that a streamer unit would provide no significant improvement in performance. List retrieval operations are inherently memory bound. The software operation of processing a list is an O(N) bound function and the streamer unit is similarly bound. Therefore, the objective of the streamer unit is not to speed things up, but to offload the task from the host processor. As this is not computationally complex, there is little to differentiate the hardware and software operations - it can be implemented in either hardware or software. The only question is the differing amounts of acceleration and cost involved. Its main function is to bypass the host processor in supplying data from main memory to the other accelerator units. This task of pulling in data can be compared to a form of Direct Memory Access (DMA). However, a regular DMA engine is designed to move data between main memory and devices in large contiguous blocks; it is not data-structure-aware, as it does not treat complex data structures any differently from random memory blocks. This is not suitable, as the individual nodes of a data structure could be located anywhere within the heap and need not be contiguous, which results in bandwidth wastage. A streamer unit is designed to be data-structure-aware: it moves the data that is needed, in a set order and from the correct memory locations, into the accelerator. In addition to being data-structure-aware, it is also results-aware, and the streamer will walk through a data structure and extract potential results from it. Therefore, although its primary function is to supply the accelerator units with data, it can also be used standalone, to independently extract data for use by the host processor in any application.

5.2 Architecture

Figure 5.1 illustrates an abstract level view of the flow of data through a streamer. A streamer unit walks through a data structure and converts the data structure into a stream of data values. These data values represent the results from the list retrieval operation. All that is needed to achieve this simple operation is a simple machine structure. This simplicity means that the streamer can be easily implemented at low cost.


Figure 5.1: Streamer data flow

33 Figure 5.2 shows the architectural view for a streamer unit. It is a three port device, with one memory port, one output port and one configuration port. The memory port is connected to the data memory and cannot be accessed from the host processor directly. The configuration port is used to access the configuration stack. The retrieved data stream is available on the output port. The output port and configuration port can be accessed via the accelerator bus.


Figure 5.2: Streamer block

A note needs to be made about the memory. Memory limitations will be considered in detail at a later stage. For now, and for the next few chapters, memory can be considered an abstract device with unlimited space and bandwidth. However, for the simulation results, the host processor and accelerator units are connected to a shared memory pool via a round-robin memory arbiter. Although connected this way, this will not present any problems in our simulations as both hardware and software operations are run separately. So, the issue of memory contention is avoided.

5.2.1 Configuration

The software library hsx/stream.hh provides several software functions to access and configure the streamer unit. There are four streamer channels defined in hsx/types.hh as HSX_STREAM0 through to HSX_STREAM3 but these are not hard limits and are easily changed in software. These identifiers specify the exact streamer channel to access on the accelerator bus. This allows the streamers to be configured and accessed by the host processor. The configuration of a streamer is managed by a series of registers organised as a stack. Figure 5.3 illustrates the structure of the streamer configuration stack. There is no reason that the configuration registers cannot be organised in a different way, such as a memory mapped structure. The reason that this structure was chosen is to simplify the configuration operation in hardware as only one configuration port is needed. Only the CONF register is actually accessible on the accelerator bus and functions as the top-of-stack register. Each write to this register will push the values down the stack. To completely configure the streamer unit, the values need to be written in the order: NODE, DATA, NEXT, SIZE, CONF. All these details are managed by the hsxSetStream() function and the user need not be concerned with the details.
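To make the push ordering concrete, the following sketch shows roughly what a configuration routine has to do. It is not the implementation of hsxSetStream(); the hsxConfWrite() primitive, the base address and the channel stride are all invented for the illustration.

   #include <stdint.h>

   // Hypothetical register map: one CONF (top-of-stack) register per streamer
   // channel.  Neither address below is taken from the thesis.
   #define HSX_STREAM_BASE 0x40008000u
   #define HSX_CHAN_STRIDE 0x10u

   static inline void hsxConfWrite(unsigned channel, uint32_t value)
   {
       volatile uint32_t *conf =
           (volatile uint32_t *)(HSX_STREAM_BASE + channel * HSX_CHAN_STRIDE);
       *conf = value;   // each write pushes the previous values down the stack
   }

   static void configureStreamer(unsigned channel, uint32_t node, uint32_t data,
                                 uint32_t next, uint32_t size, uint32_t conf)
   {
       // The push order matters: NODE first, CONF (with the enable bit) last.
       hsxConfWrite(channel, node);
       hsxConfWrite(channel, data);
       hsxConfWrite(channel, next);
       hsxConfWrite(channel, size);
       hsxConfWrite(channel, conf);
   }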

         31                              3     2     1     0
  CONF  [                            | ROK | MOD | ENA | RST ]
  SIZE  [                                              0   0 ]
  NEXT  [                                              0   0 ]
  DATA  [                                              0   0 ]
  NODE  [                                              0   0 ]

Figure 5.3: Streamer configuration stack

NODE contains the pointer to the base node, which is the first data item in the data struc- ture. This can be obtained for standard STL data structures using the begin() method for each structure. Both the following offset registers specify positive offsets from this value. The two lowest bits are zero as data is assumed to be word-aligned in memory, which is a fair assumption for most 32-bit processors.

DATA contains the offset to the data value within the node. This offset is added to the NODE pointer to obtain a memory location, which holds the actual value that gets pulled in from the data structure into the results stream. Again, this offset is assumed to be word-aligned.

NEXT contains the offset to the pointer for the next node. This offset is added to the NODE pointer to obtain a memory location, which holds the link pointer to the next node in the data structure. This pointer overwrites the existing base pointer before iterating through the stream cycle. The stream cycle forms the actual pointer following operation.

SIZE contains the number of items that are to be retrieved. This value is used in a counter that is decremented after each iteration of the stream cycle. When this counter reaches zero, the stream cycle is halted. The size of a STL data structure can be obtained using the size() method.

CONF is the configuration and status register. Figure 5.3 shows the configuration bits for the CONF register. As pointers are all word-aligned in memory, the enable and reset bits cannot be set inadvertently by any of the other register values. This ensures that the streamer is cleared and enabled only when it is needed. The hsxSetStream() function also resets the unit before configuring and enabling it.
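Functionally, the four data registers drive a very small pointer-chasing loop. The following software model is only a behavioural sketch of the stream cycle described in Section 5.2.3 (it is not the RTL, and the function name is invented); NODE is the current node pointer, DATA and NEXT are byte offsets within a node, and SIZE is the number of nodes to visit.

   #include <stddef.h>
   #include <stdint.h>
   #include <vector>

   std::vector<uint32_t> streamModel(const uint8_t *node, size_t dataOff,
                                     size_t nextOff, unsigned size)
   {
       std::vector<uint32_t> out;
       while (size != 0 && node != NULL) {                      // NULL state check
           out.push_back(*(const uint32_t *)(node + dataOff));  // DATA state: read value
           node = (const uint8_t *)                             // NEXT state: follow link
                  *(const uintptr_t *)(node + nextOff);
           --size;                                              // decrement internal counter
       }
       return out;
   }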

5.2.2 Operating Modes

There are two modes for the streamer output that can be configured using the CONF:MOD register bit. These modes determine the operating mode of a streamer unit. Figure 5.4

depicts these basic modes.

MODE_PUMP stops the data stream from being automatically streamed away. Essentially, it disables the read enable signal on the output buffer, which prevents any other attached accelerator unit from streaming the data away. The host processor then needs to read the streamed values manually, using the accelerator bus. This configures the streamer for standalone use, where the streamer acts as an independent device that pulls data from memory into the processor.

MODE_PIPE tells the streamer to pipe the data stream directly through to another attached device, typically the sieve unit. This mode bypasses the host processor and provides the highest streaming speed possible, as the process flow is controlled entirely in hardware.


Figure 5.4: Streamer operating modes

5.2.3 State Machine

Figure 5.5 shows the finite state machine controlling the streamer. There are four states, each running at full clock speed. The main stream cycle consists of the NULL, DATA and NEXT states.


Figure 5.5: Streamer machine states

IDLE is the default state and is entered as a result of either a soft or hard reset. All internal control signals are reset to their default values during this stage. The value of NODE register is copied to the internal node pointer, the DATA register is copied to an internal offset register and the SIZE register is copied to the internal counter.

NULL state performs a cycle check. If the internal counter is zero or if the internal node pointer is a null pointer, the stream cycle is terminated by staying locked in this

state, blocking until a reset is received. Otherwise, the data value is addressed by adding the internal node pointer to the internal offset register. The NEXT register is then copied to the internal offset register.

DATA state performs a single data read transfer on the memory port. The appropriate memory control signals are asserted and de-asserted according to the transfer protocol and the data item is read directly into the output buffers. At the same time, the next data pointer is addressed by adding the internal node pointer value with the internal offset register. The DATA register is then copied to the internal offset register as before.

NEXT state performs another single data read transfer on the memory port. The pointer is loaded directly into the internal node pointer, over-writing the existing pointer and the internal counter is decremented by one.

From this state machine, it takes three clock cycles to stream out a single 32-bit word of data. Assuming that the streamer channel runs at a nominal 100 MHz, the theoretical maximum data streaming speed of a single channel is 1.066 Gbps (1/3 × 100 MHz × 32 bits). However, it loads a 32-bit data word and a 32-bit pointer during each iteration, giving a theoretical maximum memory bandwidth of 2.13 Gbps.
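Spelling the two figures out (this working follows directly from the three-state stream cycle, which emits one 32-bit result but fetches two 32-bit words every three clocks):

\[
  \text{output rate} = \tfrac{1}{3} \times 100\,\text{MHz} \times 32\,\text{bit}
                     \approx 1.066\,\text{Gbit/s},
  \qquad
  \text{memory demand} = \tfrac{2}{3} \times 100\,\text{MHz} \times 32\,\text{bit}
                       \approx 2.13\,\text{Gbit/s}.
\]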

5.3 Streamer Simulation

To measure the streamer performance, software simulation was used. A streamer kernel was written in C++ to compare the performance of the software and hardware streaming methods. Extracts of the source code are listed at the end of the chapter. Each method was timed and measured using a unitless tick count, which was then used to obtain the speed-up factor. The kernel first created and filled an input list (lines 72–76) with random values, the number of which was determined at compile time. The input was then sorted (line 79) and the same data set was used for both the hardware and software streams for comparison. Debug output was obtained via iprintf(), an integer-optimised version of printf() with a smaller memory footprint. Debug and non-debug builds were selectively enabled using conditional defines, and the results of both streaming operations were compared using simple text scripts. Figure 5.6 shows the simulation virtual hardware setup. This setup is also used for simulating the other accelerator units. The host processor is connected to the accelerator units via a dedicated accelerator interface, which is used for both data and control operations. The processor reads software instructions from a directly connected instruction memory. Each accelerator unit and the host processor use a shared data memory pool that is accessed through a round-robin memory arbiter. Contention in the memory arbiter between the host processor and an accelerator unit is avoided by performing each software and hardware streaming operation individually.


Figure 5.6: Accelerator unit simulation setup

5.3.1 Kernel Functional Simulation

Listing 5.1 shows the debug output for a data set of N = 30. This verifies that the same results are obtained by the software and hardware streamers. It is important to verify that the hardware operation produces the same result before useful performance measurements can be made. Initially, it seems that the streamer unit is quite worthless as an accelerator. Although the bulk of the tick count was consumed by the iprintf() function, the numbers of ticks required to perform the software and hardware streaming were similar. In fact, the hardware accelerated operation took slightly longer. However, this is not the whole picture and the situation will be explored further in subsequent chapters. Listing 5.3 shows the sample software used to configure the hardware streamer as an independent DMA device to free up the processor for other functions. Although in actual applications it would be prudent to check that the buffers are not empty before extracting a result, in this example the hardware results were not checked to exist in the buffers before being extracted (line 60). This was because the hardware streamer unit pulls in data at a much higher rate than the host processor consumes it. Checking can be done by reading the status of the CONF:ROK bit. The way in which the configuration parameters for the list were obtained may look somewhat convoluted and may require some explaining. This depends entirely on how the data structure was defined, and this was just one method of extracting the parameters. The values that were needed are the base pointer, the data and next pointer offsets and the structure size. The information was extracted from studying the C++ STL linked list data structure source file bits/stl_list.h. For a user-defined data structure, extracting the necessary offsets and pointers should be fairly straightforward. Any slow-down due to this added complexity was taken into account by considering this configuration time as a fixed overhead cost.
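For completeness, the CONF:ROK check mentioned above amounts to polling a status bit before popping the output buffer. The register addresses and the bit position in this sketch are assumptions made only to keep it self-contained; they are not taken from the hsx library.

   #include <stdint.h>

   // Assumed addresses for one streamer channel's CONF/status and output
   // registers, and an assumed position for the ROK (read-OK) bit.
   #define HSX_STREAM2_CONF ((volatile uint32_t *)0x40008020u)
   #define HSX_STREAM2_DATA ((volatile uint32_t *)0x40008024u)
   #define HSX_CONF_ROK     0x8u

   static uint32_t readOneResult(void)
   {
       while ((*HSX_STREAM2_CONF & HSX_CONF_ROK) == 0)
           ;                               // wait until the output FIFO holds data
       return *HSX_STREAM2_DATA;           // then pop one streamed value
   }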

Listing 5.2 shows the equivalent function performed by the software to stream data in. In essence, the software operation needs only a few instructions to perform the stream cycle. As mentioned earlier, this is not computationally intensive, which is reflected in the simple for-loop used. In pseudo assembly, the code assembles into:

   DO
      data    = LOAD(pointer + data_offset)
      pointer = LOAD(pointer + pointer_offset)
      STORE(data, output_location)
   LOOP

Listing 5.5: Streaming pseudocode

5.3.2 Kernel Timing Simulation

To obtain a more accurate timing measurement, a non-debug streamer kernel was used. This removes the iprintf() overhead, so all tick counts were consumed purely by the streamer operation. Listing 5.6 shows the output of the kernel timing simulation. The number of operations was still similar, with the hardware configuration overhead consuming several more memory cycles than the software operation. Figure 5.7 shows an extract from the streamer kernel simulation timing diagram for one iteration. The time values quoted are all unitless, and ten units correspond to one tick. The diagram focuses on the important signals while running the hardware streamer kernel, to detect events as they happen in hardware. There are a number of markers on the diagram, indicated by the vertical dotted lines, from left to right:

A (1067870) is an estimated mark for the start of the configuration phase of the streamer. Some parts of the configuration phase happen before this and can be considered kernel function call overhead. This point corresponds to the first transfer initiated by the hsxSetStream(HSX_STREAM2, cfg) function call. The first part of this function asserts the CONF:RST bit to reset the streamer state machine and flush the buffers, as indicated by the clr_i signal (1). After this, the configuration registers are transferred onto the configuration stack, indicated by the multiple xwb_stb_i and xwb_ack_o handshakes (2). The same technique of writing to the CONF:RST bit is used to reset, flush, and configure the other accelerator units.

B (1068880) marks the end of the configuration phase of the streamer. The final task of the hsxSetStream(HSX_STREAM2, cfg) function call is to enable the accelerator. At this point, the CONF:ENA bit is asserted, which starts streaming data from memory immediately, as indicated by the ena_i signal (3). The memory transfers are indicated by the multiple dwb_stb_o and dwb_ack_i handshakes (4). The same technique is used to start the other accelerator units.

C (1071360) shows the time at which the output buffers are full. The output buffers are 15 levels deep and are pushed and popped on each wre_i and rde_i assertion. By counting the number of buffer pushes and pops, it is evident that the buffers stall at C (5). The streamer slows down and waits for items to be extracted from its output buffer. This is because the hardware streams data in at a much higher rate than the software extracts it.

D (1071780) marks the time when the streamer stops running. The streamer does not stop until the size counter decrements to zero. The number of items read into the output buffer is 30 in this case, as defined at compile time. This can be determined by counting the number of wre_i assertions after B (6).

E (1074010) marks the estimated point when the stream kernel function ends. The number of items extracted into the host processor is also 30 in this case. This can be determined by counting the number of rde_i assertions after B (7). After this, there are a few more operations to perform before the kernel function returns control to the main process; these can be considered the kernel function return overhead.

As seen from the result, the time taken to perform the actual hardware streaming was much shorter than the total hardware operation time. Although the total hardware streamer operation time was TAE = 614 ticks, the time to extract the stream into the host processor was only TBE = 513 ticks (83.6%). This makes sense, as the hardware streamer immediately began to fetch data into the processor at point A and did not stop until it was completed. The hardware configuration overhead was TAB = 101 ticks (16.4%).

From the output, the total streamer kernel consumed THW = 852 ticks. This makes the function call and return overhead T+ = 238 ticks (+38.8%). Using a similar function call and return overhead for the software operation, the software operation took TSW = 414 ticks. Therefore, the hardware streamer operation actually ran slower than the software operation. From the timing diagram estimate, the speed-up factor is:

TSW / THW = 0.67,    N = 30

However, there are several points to note in this estimation. Inspecting the source code will show that the function call and return overhead is not 238 ticks. A large proportion of it was actually used up by the operation to extract the values for configuration, which happens prior to point A. Assuming the function call and return overhead is similar to that of the other accelerator units, at about 100 ticks, the results are then different.

The total hardware operation would require THW = 752 ticks, of which a total of TAB = 224 ticks (29.8%) is hardware configuration overhead.


Figure 5.7: Streamer timing diagram (bus and control signals xwb_stb_i, xwb_ack_o, dwb_stb_o, dwb_ack_i, clr_i, ena_i, wok_o, wre_i, rok_o and rde_i against simulation time, with markers A–E and annotations 1–7)

The software operation time would require TSW = 552 ticks. This gives a slow-down factor of 0.73, which makes more sense, but is still very slow. On closer inspection, the time taken to actually complete the hardware stream was TBD = 290 ticks, including stalling. The additional time at the end,

TDE = 223 ticks, is used by the host processor to retrieve the result values from the operation. Therefore, if the results are not retrieved in software but piped directly to the other accelerator units, there is a potential hardware speed-up of 1.90. This is a good sign, as the hardware streamer unit is mainly going to be used to stream information into another hardware accelerator unit.

5.3.3 Kernel Performance Simulation

In order to eliminate the large uncertainties of the timing diagram estimates, multiple sampling was used for simulation. The streamer kernel was compiled for different data set sizes between N = 10 and N = 150 to extrapolate a trend. Each kernel was run and sampled 50 times to obtain a range of software and hardware tick counts. In each case the data set was first prepared with random values. The number of samples chosen was a trade-off between simulation time and accuracy. Increasing the sample size would improve the statistical accuracy of the results. However, sampling 50 times for each data set already took about a day to complete the entire simulation run, and the simulation had to be re-run each time the design was modified. Figure 5.8 shows the simulation results for different data sets. The points were plotted with y-errorbars to mark the mean and standard deviation of the result set, but the errorbars are not visible as the results are fairly consistent. Both the software and hardware curves reflect the O(N) bound nature of a list retrieval operation. Extrapolating linearly from the graphs yields the following relationships:


Figure 5.8: Streamer performance simulation

Msw(N) = 22.0 N + 94.0    (5.1)

Mhw(N) = 22.6 N + 241.7 (5.2)

Equation 5.1 describes the performance of the software streamer. The intercept of 94 is similar to the timing-estimated function call and return overhead of 100. Equation 5.2 describes the performance of the hardware streamer unit. The intercept of 242 is close to the overhead estimated from the timing diagram with the complex parameter extraction. From the graph, the speed-up ratio Msw(N)/Mhw(N) of the streamer for N = 30 agrees well with the timing estimate of 0.73, and the ratio of the speed-up for a sufficiently large data set, N → ∞, is:

Mup = Msw(N) / Mhw(N) = 0.97

This is, in effect, a slight (3%) slow-down rather than a speed-up. As mentioned earlier (Section 5.3.1), streaming is a very simple operation that the host processor can, if necessary, handle fairly efficiently. The hardware state machine performs the same series of loops with the same number of memory accesses. Therefore, it does not run any faster than the software kernel. It runs slightly slower due to the difference in memory contention between the software and hardware kernels: the software kernel had the entire memory bus to itself, while the hardware kernel had to share small parts of it with the running software application. However, from Section 5.3.2, it is apparent that the streamer unit actually pulls in data at a higher rate than the software is able to remove it. Therefore, the streamer unit spends a large proportion of its time stalled, waiting for the host processor to pull data off. In addition, the hardware streamer works independently of the host processor and can consequently be used to offload the streaming operation from the host processor. In terms of using it to feed data to other acceleration units, the exact effects are not presently known but will be explored in subsequent chapters.
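The asymptotic figure can be checked numerically against the fitted models; the short sketch below simply evaluates Equations 5.1 and 5.2 and is added here for illustration only (it is not part of the simulation software).

   #include <cstdio>

   int main()
   {
       // Evaluate the fitted models for a few data set sizes.
       const double sizes[] = { 150, 1000, 10000 };
       for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); ++i) {
           double msw = 22.0 * sizes[i] + 94.0;     // Equation 5.1 (software)
           double mhw = 22.6 * sizes[i] + 241.7;    // Equation 5.2 (hardware)
           std::printf("N = %6.0f  Msw/Mhw = %.3f\n", sizes[i], msw / mhw);
       }
       // The ratio tends towards the ratio of the gradients as N grows.
       std::printf("N -> inf  Msw/Mhw = %.3f\n", 22.0 / 22.6);   // ~0.97
       return 0;
   }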

5.4 Conclusion

The streamer unit can be used to accelerate a list-retrieval operation on a dynamically linked structure in two ways. When used as a standalone unit, it can be used to offload the task of pulling data into the host processor. This will reduce the workload of the host processor but provides little additional benefit. If used in combination with other acceleration units, in addition to offloading, it can potentially pipe data through at a much higher rate than the host processor can. However, this assumes that the attached

unit would consume data at a sufficiently high rate to prevent the streamer unit from stalling, which is a limitation. The streamer unit is ultimately an O(N)-bound machine. The hardware design is fairly optimised, with multiple operations layered across the machine states. It has a speed-up factor Mup = 0.97, which is close to that of the software operation. This makes it useful, as the worst case is not significantly detrimental to the list retrieval operation while it frees up the host processor for other operations. The maximum external memory bandwidth required by each streamer unit is 2.13 Gbps at 100 MHz, while the maximum internal data transfer rate is 1.066 Gbps at 100 MHz.

Software Stream
HIT : 0x56
HIT : 0x121A
HIT : 0x1244
HIT : 0x281B
HIT : 0x2BBD
HIT : 0x2D13
HIT : 0x2E80
HIT : 0x37AE
HIT : 0x38D4
HIT : 0x3F17
HIT : 0x5147
HIT : 0x536E
HIT : 0x64C8
HIT : 0x65E6
HIT : 0x6F80
HIT : 0x7007
HIT : 0x7CB4
HIT : 0x7D4E
HIT : 0x8957
HIT : 0x9903
HIT : 0xAFF3
HIT : 0xB06A
HIT : 0xC596
HIT : 0xCD1F
HIT : 0xD3EE
HIT : 0xD8CF
HIT : 0xE89A
HIT : 0xE9B2
HIT : 0xF54E
HIT : 0xFBF4
182638 swticks
10079 swmemticks
Hardware Stream
HIT : 0x56
HIT : 0x121A
HIT : 0x1244
HIT : 0x281B
HIT : 0x2BBD
HIT : 0x2D13
HIT : 0x2E80
HIT : 0x37AE
HIT : 0x38D4
HIT : 0x3F17
HIT : 0x5147
HIT : 0x536E
HIT : 0x64C8
HIT : 0x65E6
HIT : 0x6F80
HIT : 0x7007
HIT : 0x7CB4
HIT : 0x7D4E
HIT : 0x8957
HIT : 0x9903
HIT : 0xAFF3
HIT : 0xB06A
HIT : 0xC596
HIT : 0xCD1F
HIT : 0xD3EE
HIT : 0xD8CF
HIT : 0xE89A
HIT : 0xE9B2
HIT : 0xF54E
HIT : 0xFBF4
183040 hwticks
10129 hwmemticks

Listing 5.1: Streamer kernel debug output

   int swstream(std::list<int> &listA)
   {
      for (std::list<int>::iterator node = listA.begin();
           node != listA.end(); node++) {
         volatile int j = *node;

   #ifdef DEBUG
         iprintf("HIT\t: 0x%X\n", j);
   #endif
      }
   }

Listing 5.2: Software streamer kernel

     int hwstream(std::list<int> &listA)
 44  {
        std::list<int>::iterator node;
 46     hsxStreamConfig cfg;

        cfg.conf.bits.mode = HSX_STREAM_PUMP;
        cfg.node = (int) &*listA.begin()._M_node;   // node base
        cfg.next = (int) &node._M_node->_M_next;    // next offset
 51     cfg.data = (int) &((std::_List_node<int> *)node._M_node)->_M_data;
        cfg.size = LIST_MAX;  // listA.size();

        hsxSetStream(HSX_STREAM2, cfg);

 56     // pull data
        for (int i=0; i

     int stream()
 68  {
        std::list<int> listA;

 71     // prefill lists
        for (int i=0; i

        // sort lists
        listA.sort();

        // do sieve
 81     int ticks;
        int memtick;

        // SOFTWARE STREAM
        iprintf("Software Stream\n");
 86     memtick = getmemtick();
        ticks = gettick();
        swstream(listA);
        ticks = gettick() - ticks;
        memtick = getmemtick() - memtick;
 91     iprintf("%d swticks\n", ticks);
        iprintf("%d swmemticks\n", memtick);

        // HARDWARE STREAM
        iprintf("Hardware Stream\n");
 96     memtick = getmemtick();
        ticks = gettick();
        hwstream(listA);
        ticks = gettick() - ticks;
        memtick = getmemtick() - memtick;
101     iprintf("%d hwticks\n", ticks);
        iprintf("%d hwmemticks\n", memtick);

        return EXIT_SUCCESS;
     }

Listing 5.4: Streamer kernel

Software Stream
652 swticks
98 swmemticks
Hardware Stream
852 hwticks
120 hwmemticks

Listing 5.6: Streamer simulation output (non-debug)

CHAPTER 6

Sieve Unit

The sieve unit is a primitive computational unit. It can be configured to perform multiple operations and also takes inputs from multiple sources. The functional and timing simulation results show that it can improve the performance of a simple boolean query by 5.2 times when used in conjunction with the streamer unit in a hardware pumped configuration.

6.1 Introduction

After a list of results is extracted during list retrieval, a common operation required at the end of a search is to collate the results. If the results do not need to be collated, they can be buffered and passed through to the host processor as the final results. Both of these functions are performed by a sieve unit. Of the two operations, collation is computationally intensive and is a suitable candidate for hardware acceleration. It involves traversing the results list and comparing each individual element to see if it matches one or more filter criteria. The example query would need to compare the lists of records to find those that contain both the words cat and cambridge. Unlike the streamer, a sieve works on the actual values of the results and is not data-structure-aware. As the name suggests, it is designed to filter a results list so as to extract only the results that fit a specific criterion. Common operations that need to be performed are the intersection and union of two lists. As the sieve unit is used at the end of a search pipeline, it can also be used as a results buffer to extend the output buffer capacity of any other acceleration unit. In this mode,

it can extend the existing output buffer capacity. This can help alleviate bottlenecks caused by buffer stalls, as seen previously in the streamer unit.

6.1.1 Design Considerations

To perform these functions, some factors need to be considered. The process of intersecting two lists can be expensive in software. Each item in the first list needs to be checked against the items in the second list. Assuming that the two lists are of equal length, a naïve algorithm would do this operation in O(N²) time. If the lists are unsorted, this is the best possible performance of any sieve operation. However, in this case, the problem lies with the data structure rather than the computation. For optimal operation, data items should be inserted into a list in sorted order. If the lists are in sorted order, a smarter algorithm is capable of reducing this operation to O(N) time. However, sorting is a computationally intensive operation that is O(N log N) bound. Assuming that insertion happens far less often than data selection, sorting should be performed during data insertion. For most common applications, this assumption is valid. In this case, the cost of sorting the data can easily be amortised over a number of select operations, which is more efficient than sorting the data after a select operation. If the minimum and maximum values are known, the sieve operation can be further reduced to O(log N) time. This would allow a binary search to be used to quickly eliminate blocks of items that would otherwise need to be read. However, this may need better hardware capabilities to decide where to split the range. Furthermore, it requires the entire list to be available in advance, in order to know the list range. A hybrid technique is used in the hardware sieve unit. It runs in O(N) time while being able to eliminate blocks of data at a time. The two lists are stored in input buffers and the minima and maxima of each input buffer are tracked. When possible, the input buffers are flushed to eliminate whole blocks. The result is a very fast sieve unit that provides significant acceleration.
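In software terms, the hybrid behaves like the sketch below: an ordinary O(N) merge walk, except that a whole buffer-sized block can be skipped in one step when its maximum lies below the other stream's current value. This is only a behavioural illustration (not the RTL), and the buffer depth is an assumed figure.

   #include <algorithm>
   #include <cstddef>
   #include <vector>

   static const std::size_t BUFFER_DEPTH = 15;   // assumed input FIFO depth

   std::vector<int> sieveAnd(const std::vector<int> &a, const std::vector<int> &b)
   {
       std::vector<int> out;                     // both inputs must be sorted
       std::size_t i = 0, j = 0;
       while (i < a.size() && j < b.size()) {
           // Flush: if the current block of 'a' tops out below b[j], no element
           // in it can ever match, so the whole block is dropped at once.
           std::size_t blockEnd = std::min(i + BUFFER_DEPTH, a.size());
           if (a[blockEnd - 1] < b[j]) { i = blockEnd; continue; }

           if (a[i] == b[j])     { out.push_back(a[i]); ++i; ++j; }  // duplicate passes
           else if (a[i] < b[j]) { ++i; }                            // drop single item
           else                  { ++j; }
       }
       return out;
   }

A symmetric flush on the second buffer is omitted here for brevity.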

6.2 Architecture

Figure 6.1 illustrates an abstract level data view of the sieve process. Multiple result streams, generated by streamer units or by software, are combined by the sieve unit into a final result stream. Although a sieve unit is designed to combine two streams into one, multiple sieve units can be combined in parallel and in cascade for more complex collation operations.

Figure 6.1: Sieve data flow

Figure 6.2 shows a single sieve unit organised as two individual channels. Each channel has an input and an output port, linked directly to internal input and output buffers. Unlike the streamer unit, the channels are controlled and configured in pairs because the sieve works on paired data streams. Configuring either the SIEVE_0 or the SIEVE_1 channel therefore configures the shared channel pair.


Figure 6.2: Sieve Block

6.2.1 Configuration

The software library hsx/sieve.hh provides several software functions to access and configure the sieve unit. There are four sieve channels defined in hsx/types.hh as HSX_SIEVE0 through to HSX_SIEVE3. Just as before, these identifiers specify the exact sieve channel to access on the accelerator bus.

Unlike the streamer unit, the sieve unit has a single configuration register that controls the operation of the channel pair, instead of a stack. The configuration of a sieve unit is managed through the accelerator bus using the hsxSetSieve() function, which writes the appropriate values to the configuration register. Figure 6.3 describes the configuration bits of the register.

The configuration register also doubles as a status register for each individual channel. Although a write to either sieve channel configures the same register, a read from each channel returns only the status of that channel. A read therefore returns the data-available (CONF:ROK) and buffer-available (CONF:WOK) status bits individually for HSX_SIEVE0 and HSX_SIEVE1.

CONF register fields (bits 31 down to 0): MODE, WOK, ROK, ENA, RST

Figure 6.3: Sieve configuration register

6.2.2 Modes

Figure 6.4: Sieve operating modes (MODE_PAS, MODE_SWP, MODE_AND, MODE_IOR)

The sieve unit is able to perform several functions. Figure 6.4 depicts the basic functions; other functionality can be added to the sieve unit when desired. The basic functionality is sufficient to perform the most common collation functions. These functions are selected by setting the CONF:MODE bits to one of the following values (a hedged configuration sketch for a cascaded boolean query follows the mode descriptions):

MODE PAS links the output buffers and input buffers directly in pass-through mode. This mode makes the sieve extend any results buffer of any other accelerator unit. In this mode, the sieve can stream data at maximum speed and store additional elements in each buffer.

MODE SWP crosses the output buffers with the input buffers in swap-through mode. This swaps the two channels but does not filter any of the results. In this mode, each channel can stream at maximum speed and store additional elements in each buffer. If a number of sieves are used in cascaded swap mode, it is possible to use them as routers to move data around.

MODE AND filters the two input buffers into a single output buffer in intersection mode. Only duplicate results that appear on both input streams are filtered into the output. This performs a logical AND operation on the input streams. In the figure, B2 is the same as A1. Only A1 is filtered through while B2 is dropped. In this mode, whole buffers are flushed if the values in a buffer are outside the intersection range.

MODE IOR filters and sorts the two input buffers into a single output buffer in union mode. Duplicate results that appear on both input streams are filtered into a single occurrence. At the same time, the results are sent out in sorted order. This performs a logical inclusive OR operation on the input streams. In the figure, B1 is smaller than A1, which is the same as B2. Only A1 is filtered through while B2 is dropped.
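As an illustration of how these modes combine, the sketch below configures two sieve channel pairs to collate the query (A AND B) OR C. It is only a sketch under assumptions: HSX_SIEVE_IOR is an assumed mode identifier (only HSX_SIEVE_AND appears in the listings of this chapter), and the intermediate results are forwarded by software because the exact mechanism for wiring one sieve's output into another's input is not detailed here.

#include "hsx/sieve.hh"
#include "hsx/types.hh"

// Collate (A AND B) OR C using two sieve channel pairs (hypothetical sketch).
void collate_and_or()
{
    hsxSieveConfig cfgAnd, cfgIor;
    cfgAnd.conf.bits.mode = HSX_SIEVE_AND;   // pair 0/1: A intersect B
    cfgIor.conf.bits.mode = HSX_SIEVE_IOR;   // pair 2/3: union with C (assumed identifier)
    hsxSetSieve(HSX_SIEVE0, cfgAnd);
    hsxSetSieve(HSX_SIEVE2, cfgIor);

    // Streams A and B are pushed, in sorted order, into channels 0 and 1;
    // stream C is pushed into channel 3; the intermediate (A AND B) results
    // are read back from channel 0 and forwarded into channel 2 by software:
    //
    //     while (/* results available on HSX_SIEVE0 */)
    //         hsxPutData(HSX_SIEVE2, hsxGetData(HSX_SIEVE0));
}

Because the output of a sieve is sorted, the intermediate stream can be fed straight into the second sieve without any re-sorting, which is what makes this kind of cascading possible.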

6.2.3 Operation

Although the channel pairs are configured together, each individual input and output port can be accessed through the accelerator bus using hsxGetData() and hsxPutData(). Data can be fed directly into the sieve unit by the host processor or streamed from a streamer unit. Using software to pump data allows the sieve unit to be used as a standalone data collation unit to accelerate the collation of data streams retrieved from various other sources. However, the fastest performance is achieved when the data is streamed directly from a streamer unit instead of via the host processor. The sieve unit works by having data pushed into it in sorted order. The output of a sieve is also sorted and can be used as a direct input to another sieve, which allows sieve operations to be cascaded for complicated collation operations. Figure 6.5 shows the basic finite state machine controlling the sieve. There are only two states, each running at full clock speed.


Figure 6.5: Sieve FSM

IDLE state is the default state, where the data items in the input buffers are compared. The sieve unit maintains a record of the minimum and maximum data items pushed into the input buffer. A decision is made to either flush, pop, pass, swap or stall the buffers based on the contents of each input buffer.

WORK state is when the actual sieve operation occurs. A flush clears all data items from an input buffer, a pop drops a single data item from the input buffer, a pass feeds a single data item from the input to the output buffer, a swap feeds a single data item from the input to the opposite output buffer, and a stall will wait for more data items to be made available on the input buffers.

Therefore, the maximum theoretical transfer speed of a single channel is 1.6 Gbps at 100 MHz. With two independent channels, the sieve unit can sustain a maximum data transfer rate of 3.2 Gbps at 100 MHz. This is ample bandwidth to handle the incoming data stream from either a streamer unit or the host processor: the maximum theoretical transfer rate of a streamer unit is only 1.066 Gbps, and the processor transfer speed is even slower. Therefore, there is no possibility of saturating the sieve unit by either software or hardware piping of the inputs, which makes the sieve operation bound by the streaming input operation.
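These figures follow from the two-state control loop, assuming 32-bit data items (a word width consistent with the rest of the design, though it is stated here as an assumption):

100 MHz / 2 states x 32 bits = 1.6 Gbps per channel
2 channels x 1.6 Gbps = 3.2 Gbps per sieve unit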

6.3 Simulation Results

Once again, simulation was used to measure the performance of the sieve unit. Listing 6.3 shows parts of the sieve kernel written to perform an intersect collation in both software and hardware. Two input streams were created and filled (lines 100–105) with a number of different random values; once again, the data set size was defined at compile time. Several identical random values were then inserted (lines 107–113) into both streams to ensure that intersections exist. The two streams were then sorted, an O(N log N) operation (lines 115–117), and used as the inputs for both the software and hardware sieves. As the two streams contain different values, the common values were located at different positions along each list. The results of both the hardware and software sieves are inspected visually using the debug output.

6.3.1 Kernel Functional Simulation

Listing 6.4 shows the debug output. It verifies that the same results were obtained for both software and hardware sieves. This shows that the hardware sieve can at least be used to offload the operation from the host processor. Listing 6.2 shows how to use the hardware sieve as an independent sieve configured for intersection mode under software pump. Both lists were manually pumped into the sieve through software, using the hsxPutData() function. In this example code, the software kernel did not check for the status of the read and write buffers because the software pump will never saturate the buffers. However, the CONF:WOK and CONF:ROK status bits should be checked in actual applications. Listing 6.1 shows an equivalent intersection computed in software. There are different methods to compute intersections. The method used computes the intersection in O(N) time.
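In a real application, the driver should poll the status bits before each transfer. The sketch below shows one way to do this; the read-OK mask follows the polling used in Listing 6.6 (bit 5), while the write-OK position is an assumption, so both masks should be treated as illustrative only.

#include <vector>
#include "hsx/sieve.hh"
#include "hsx/types.hh"

// Hypothetical status masks: bit 5 matches the result polling in Listing 6.6,
// the write-OK bit position is assumed.
static const int SIEVE_ROK = (1 << 5);
static const int SIEVE_WOK = (1 << 4);

// Push one sorted value into each input channel, respecting buffer status.
void pump_pair(int a, int b)
{
    while (!(hsxGetConf(HSX_SIEVE2) & SIEVE_WOK));   // wait for input space
    hsxPutData(HSX_SIEVE2, a);
    while (!(hsxGetConf(HSX_SIEVE3) & SIEVE_WOK));
    hsxPutData(HSX_SIEVE3, b);
}

// Drain any results that are ready from the output buffer.
void drain(std::vector<int>& out)
{
    while (hsxGetConf(HSX_SIEVE2) & SIEVE_ROK)
        out.push_back(hsxGetData(HSX_SIEVE2));
}

The busy-wait loops here stand in for the interrupt-driven handshaking that a production driver would more likely use.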

6.3.2 Kernel Software Pump Timing

Listing 6.5 shows the kernel output without debug messages. Unlike the streamer unit, it is evident that the hardware operation is faster than the software option, even at this stage. Figure 6.6 shows the resultant timing diagram from running the sieve with software pump input. It is a single sample with a data set size of N = 33 for each list. Again, the timing diagram is unitless and 10 units are equal to a tick. There are three markers on this timing diagram:

A (1885940) marks the point when sieve configuration began. This corresponds to the beginning of the hsxSetSieve(HSX_SIEVE2, cfg) function.

52 int swsieve(std::list &listA, std::list &listB ) 31 { std::list::iterator idxA, idxB; 34 idxA = listA.begin(); idxB = listB.begin();

while( (idxA != listA.end()) && (idxB != listB.end()) ) { 39 if (*idxA == *idxB) { volatile int j = *idxA; // HIT!! #ifdef DEBUG 44 iprintf("HIT\t: 0x%X\n", j); #endif

idxA++; idxB++; 49 } else if (*idxA < *idxB) { idxA++; } 54 else { idxB++; } } 59

return EXIT_SUCCESS; } Listing 6.1: Sieve software kernel

int hwsieve(std::list &listA, std::list &listB ) 64 { hsxSieveConfig cfg; 66 cfg.conf.bits.mode = HSX_SIEVE_AND; hsxSetSieve(HSX_SIEVE2 , cfg);

std::list::iterator idxA, idxB; idxA = listA.begin(); 71 idxB = listB.begin();

// data pump while( (idxA != listA.end()) && (idxB != listB.end()) ) { 76 hsxPutData(HSX_SIEVE2 , *idxA++); hsxPutData(HSX_SIEVE3 , *idxB++); }

// data pull 81 for (int i=0; i<(LIST_MAX/10); ++i) { // HIT!! volatile int j = hsxGetData(HSX_SIEVE2); 86 #ifdef DEBUG iprintf("HIT\t: 0x%X\n", j); #endif } 91 return EXIT_SUCCESS; } Listing 6.2: Sieve hardware kernel

int sieve()
96 {
97   std::list<int> listA, listB;

     // prefill lists (reconstructed: the original fill loop was lost in
     // extraction; filling both lists with random values is assumed, as
     // described in the text)
     for (int i=0; i<LIST_MAX; ++i) {
         listA.push_back(getrand() & 0x0000FFFF);
         listB.push_back(getrand() & 0x0000FFFF);
     }

     // plant about 10% hits
107  for (int i=0; i<(LIST_MAX/10); ++i) {
         int j = getrand() & 0x0000FFFF;
         listA.push_back(j);
         listB.push_back(j);
112  }

     // sort lists
     listA.sort();
     listB.sort();
117
     // do sieve
     int ticks;
     int memtick;
122
     // SOFTWARE SIEVE
     iprintf("Software Sieve\n");
     memtick = getmemtick();
     ticks = gettick();
     swsieve(listA, listB);
127  ticks = gettick() - ticks;
     memtick = getmemtick() - memtick;
     iprintf("%d swticks\n", ticks);
     iprintf("%d swmemticks\n", memtick);
132
     // HARDWARE SIEVE
     iprintf("Hardware Sieve\n");
     memtick = getmemtick();
     ticks = gettick();
     hwsieve(listA, listB);
137  ticks = gettick() - ticks;
     memtick = getmemtick() - memtick;
     iprintf("%d hwticks\n", ticks);
     iprintf("%d hwmemticks\n", memtick);
142
     return EXIT_SUCCESS;
}
Listing 6.3: Sieve kernel

Figure 6.6: Sieve software pumped timing diagram

Software Sieve
HIT : 0x116E
HIT : 0x22F1
HIT : 0x30B1
HIT : 0x8601
HIT : 0xD2C6
30874 ticks
1859 memticks
Hardware Sieve
HIT : 0x116E
HIT : 0x22F1
HIT : 0x30B1
HIT : 0x8601
HIT : 0xD2C6
28080 ticks
1769 memticks
Listing 6.4: Sieve kernel output (debug)

Software Sieve
2330 swticks
198 swmemticks
Hardware Sieve
1454 hwticks
141 hwmemticks
Listing 6.5: Sieve kernel output (non-debug)

The parts before marker A were considered the function call overhead; similar to the streamer, the function resets and flushes the sieve before configuring it.

B (1886800) marks the point when sieve configuration was completed. This corresponds to the end of the hsxSetSieve(HSX_SIEVE2, cfg) function. The sieve is enabled at this point. After this, data from the two lists was pumped into the sieve unit via software. Each assertion of the xwb_stb_i signal (marker 1) corresponds to writing a value to the sieve input. As both input channels are pumped by the software, 66 data items are written.

C (1899680) marks the point when the sieve operation was completed. The parts after this were considered the function return overhead. The wre_i assertions (markers 2, 3, 4) show the hits written into the output buffer. In this case, there were 3 hits, which is correct for a list size of 33 items (30 dummies and 3 planted hits) as prepared by the software. The last three xwb_stb_i assertions were used to pull the items off the sieve output.

Reading from the timing diagram, the complete operation took TAC = 1374 ticks. The configuration operation took TAB = 86 ticks (6.25%), while the data pump and pull operation took TBC = 1288 ticks (93.75%). The total tick count reported was 1454, so the overhead, T+ = 80 ticks (+5.8%), was used by the function call and return. Both the configuration and software overheads are minor, as the bulk of the time is taken up by the sieve operation itself. According to the non-debug output, the software sieve required 2330 ticks. Assuming a similar function call and return overhead, the software sieve consumes TSW = 2250 ticks. The hardware accelerated operation therefore gives a speed-up factor of:

TSW / TAC = 2250 / 1374 = 1.64

Although this is an acceleration, it is not very significant. Hence, the sieve unit would not benefit the host processor very much in standalone operation. In this configuration, the host processor is kept busy pumping data to the sieve unit, so the performance bottleneck is governed by the host processor software pumping the data. Once this bottleneck is removed, the results are much better, as shown in section 6.3.4.

6.3.3 Kernel Software Pump Performance

Figure 6.7 shows the simulation results of the software and hardware tick count for sieve operations on different data sets. Again, each data set was randomly prepared and the simulation was sampled 50 times for each data set size. The calculated averages and standard deviations are used to plot the curves.

Figure 6.7: Sieve software pumped simulation (software and hardware tick counts, and speed-up, against data set size n)

Graphically, the software and hardware timings were both linearly proportional to the size of the data sets, while the speed-up showed diminishing returns. Fitting straight lines to the software and hardware curves gives the following equations:

Vsw(N) = 78.3 N + 90.1 (6.1)

Vhw(N) = 45.9 N + 162.7 (6.2)

As before, the intercept of each line represents the fixed overhead cost while the slope gives the cost per data item. The software overhead of 90 is very close to the timing estimate of 80 ticks. The hardware overhead of 163 is similar to the timing estimate of TAB + T+ = 166 ticks.

Graphically, the speed-up factor of about 1.6 for N = 30 is similar to the timing estimate of 1.64. It is also clear that, for large data sets, the factor reaches a plateau at about 1.7. Once again, the speed-up factor is not very impressive as it is entirely limited by the software pump. Even for a sufficiently large data set, the speed-up factor merely resolves to the ratio of the two slopes:

Vup = Vsw(N) / Vhw(N) = 78.3 / 45.9 = 1.71

6.3.4 Kernel Hardware Pipe Timing

As clearly evidenced by the earlier results, the host processor pump is a bottleneck for supplying data to the sieve. It is possible to use one streamer unit to pump one data stream, alongside software pumping the second data stream, to speed things up. However, by using two streamer units to pump data into both sieve channels, the software pump bottleneck is removed entirely. Listing 6.6 shows the modified kernel that used two streamer units to pipe data into the sieve unit. In this case, unlike in Listing 6.2, the status of the output buffer needed to be checked for valid outputs. This was done by polling here, but it could also be done using interrupts. Listing 6.7 shows the output from the modified kernel software. As can be clearly seen, the software operation consumed a similar amount of time but the hardware operation was much faster than the one in section 6.3.2. In this case, the data set size was N = 30. Figure 6.8 shows the results of running the sieve with two streamer inputs. There are six markers on this timing diagram:

A (1869080) marks the beginning of the configuration overhead. This corresponds to the hsxSetSieve(HSX_SIEVE2) function. Everything before this is considered the function call overhead.

B (1871050) marks the point when HSX_STREAM2 configuration starts. This corresponds to the hsxSetStream(HSX_STREAM2, cfgA) function.

C (1872060) marks the point when HSX_STREAM2 configuration ends. At this point, the streamer unit began to read data in from memory and piped it directly into the sieve unit. The number of data items read can be counted from the dwb_stb_o assertions (marker 1). This point can be considered the starting point of the operation, as data begins to stream in here.

D (1873910) marks the point when HSX_STREAM3 configuration starts. This corresponds to the hsxSetStream(HSX_STREAM3, cfgB) function.

Figure 6.8: Sieve with hardware piped timing diagram

E (1875100) marks the point when HSX_STREAM3 configuration ends. At this point, all the configuration overhead ends. The streamer unit began to supply the second stream of information to the sieve unit (marker 2). The sieve unit now had enough information to begin the intersection operation.

F (1877300) marks the point when the sieve operation is completed. The number of hits can be counted from the wre_i assertions (markers 3, 4, 5). In this case, there were 3 hits, which is the correct number for a list size of 30. The distribution of the hits indicates that the intersections occurred towards the end of the lists, but they could occur anywhere.

A stall can be observed around E as the buffers become full. By default, the buffer depths were configured to 15 levels. As there was both an output buffer in the streamer and an input buffer in the sieve, the stall happened after 30 data items were read; at that point there were 15 data items sitting in each input and output buffer.

Reading the values directly from the timing diagram, the total time for completing the whole hardware piped operation was TAF = 822 ticks. The configuration time was TAC = 298 ticks (36.3%) and the actual sieve operation consumed TCF = 524 ticks (63.7%). The actual number of ticks reported by the terminal output was THW = 938 ticks, so the additional T+ = 116 ticks (+14.1%) are attributed to the function call and return overhead. Using a similar value for the software sieve, the software operation took TSW = 2398 ticks. Using this result, the hardware accelerated operation achieved a speed-up factor of 2.9. This is a significant acceleration and a promising result when compared to the earlier software-pumped estimate.

A point of note is that the overhead makes up a much larger portion of the hardware accelerated operation, which indicates that the actual effective search operation is much faster than before. As the number of data elements increases, the configuration overhead will ultimately become negligible. For most purposes, these overheads can be considered constant while the operation time is proportional to the number of elements in the list. In other trials, it was discovered that the .size() function is dependent on N and can increase the hardware overhead to 1400 ticks. Therefore, the setup time can also be reduced if the method for extracting the pointers and offsets is made less convoluted.

6.3.5 Kernel Hardware Pipe Performance

Figure 6.9 shows the simulation results of the software and hardware tick count for sieve operations on different data sets, as before. Immediately noticeable are the slope of the hardware line and the range of the speed-up, both of which have changed tremendously.

Figure 6.9: Sieve with streamer piped simulation (software and hardware tick counts, and speed-up, against data set size n)

Fitting straight lines to the results gives the following equations to describe the software and hardware sieves:

Vsw(N) = 78.4 N + 77.9 (6.3)

Vhw(N) = 15.0 N + 520.2 (6.4)

Equation 6.3 shows an insignificant change from the equation 6.1 software simulation results, as expected. The slopes are the same and the intercepts are very similar; the timing estimate of 116 ticks for function call and return is close enough to the intercept of 78. Equation 6.4 shows a significant difference from equation 6.2. The intercept of 520 is similar enough to the timing estimate of TAC + T+ = 414 ticks for the hardware operation overhead, considering the estimation method. The hardware configuration overhead has increased significantly because the streamer unit configurations require a significant time to extract the configuration options and push these values into the stack; this is evident when comparing the timing diagrams. The timing estimate of 2.9 for the speed-up factor corresponds to the speed-up shown on the graph at N = 30. Once again, the speed-up factor shows diminishing returns, but for a sufficiently large data set the hardware piped speed-up factor resolves to the ratio of the two slopes:

Vup = Vsw(N) / Vhw(N) = 78.4 / 15.0 = 5.23

This is a substantial speed-up value. Therefore, it can safely be said that the hardware piped sieve provides a very significant acceleration over the pure software operation.

6.4 Conclusion

A sieve unit designed as described can evidently be used as an accelerator unit to offload the filtering operation from the host processor and accelerate result collation. The speed-up of a sieve unit is 1.71 when used standalone, but it gives a Vup of 5.2 when combined with the hardware streamer unit under ideal conditions. In this situation, the performance of the sieve unit is bound by the performance of the streamer unit. However, this analysis considers only the hardware versus software trade-offs when intersecting two lists. For a more complicated operation with multiple lists, the hardware speed-up could be greater, as multiple hardware sieves can operate in parallel and in cascade without any significant problems. The sieve unit does not consume any external memory bandwidth as it does not deal with the data set directly. This contributes to a larger acceleration when compared to software methods that need to access memory regularly. The maximum internal bandwidth of each sieve channel is 1.6 Gbps, for a combined total of 3.2 Gbps per unit at 100 MHz.

62 int hwsieve(std::list &listA, std::list &listB ) 64 { // configure sieve hsxSieveConfig cfg; 67 cfg.conf.bits.mode = HSX_SIEVE_AND; hsxSetSieve(HSX_SIEVE2 , cfg);

// configure streamers std::list::iterator node; 72 hsxStreamConfig cfgA, cfgB;

cfgA.conf.bits.mode = HSX_STREAM_PIPE; cfgA.node = (int) &*listA.begin()._M_node; // node base cfgA.next = (int) &node._M_node->_M_next; // next offset 77 cfgA.data = (int) &((std::_List_node *)node._M_node)->_M_data; cfgA.size = LIST_MAX + LIST_MAX/10; // listA.size();

cfgB.conf.bits.mode = HSX_STREAM_PIPE; cfgB.node = (int) &*listB.begin()._M_node; // node base 82 cfgB.next = (int) &node._M_node->_M_next; // next offset cfgB.data = (int) &((std::_List_node *)node._M_node)->_M_data; cfgB.size = LIST_MAX + LIST_MAX/10; //listB.size();

hsxSetStream(HSX_STREAM2 , cfgA); 87 hsxSetStream(HSX_STREAM3 , cfgB);

// data pull for (int i=0; i<(LIST_MAX/10); ++i) { 92 while (!(hsxGetConf(HSX_SIEVE2) & (1<<5))); // wait for result // HIT!! volatile int j = hsxGetData(HSX_SIEVE2);

#ifdef DEBUG 97 iprintf("HIT\t: 0x%X\n", j); #endif }

return EXIT_SUCCESS; 102 } Listing 6.6: Hardware streamer-sieve kernel

Software Combi
2514 swticks
198 swmemticks
Hardware Combi
938 hwticks
189 hwmemticks
Listing 6.7: Streamer-sieve kernel output

CHAPTER 7

Chaser Unit

The chaser unit operates on the key search stage of the search pipeline. It can be configured to work with different data structures and applications. Functional and timing results of the chaser simulation show that it can accelerate multiple key searches by up to 3.43 times when compared against a pure software operation.

7.1 Introduction

The final accelerator unit is the chaser unit. The very first step of any search pipeline usually begins with a key search. This can be considered the most common search operation, as it is performed during both primary and secondary search operations; in primary search operations it is the only search operation performed. It is therefore a common computational task, and a considerable number of applications would benefit if it were accelerated in hardware. The task of chasing down a search key involves a few operations: loading data from memory, comparing it against the search key and, based on the result of the comparison, deciding what action to take next. At a machine level, this is a fairly mundane task that does not exploit the full power of a microprocessor but unnecessarily consumes valuable computational resources. This presents an excellent opportunity for offloading the operation to free up the processor for other compute intensive tasks, without compromising on raw performance.

7.1.1 Design Considerations

As mentioned previously, keys are often stored in a tree structure, so it is evident that a tree traversal algorithm would be used to search them. In many implementations, travelling down a branch and back up again involves remembering and recalling where the algorithm has been, which is essentially similar to pushing down and popping back up the procedural call stack.

Hence, a tree traversal algorithm would benefit from a stack machine architecture, which is naturally suited to efficient stack operations. The fact that the popular SQLite database engine uses a stack-based virtual machine [Hip07] to process SQL queries lends some credence to this idea. This formed the initial idea for the hardware chaser: a simple dual-stack processor that could be programmed with a few primitive operations. The basic operations needed were a memory load, a compare operation and a conditional operation. The top of stack holds a pointer to the current data node. As the device steps down the tree, the stack is pushed and the new node pointer is loaded from memory into the top of stack; when going back up the tree, the pointers merely need to be popped off the stack. The design had the distinct advantage of keeping previously loaded pointers in the stack, so they would not need to be recalculated nor reloaded from memory, saving both computational and bandwidth resources. The stack machine was also modest in its use of resources, as it merely needed a small memory block and a simple ALU.

However, after spending some time on this design, it was abandoned. It was found that a key search is more akin to a list traversal than a tree traversal. Any search that involves traversing an entire tree is better optimised by reorganising the data differently; the whole point of using a tree is to eliminate entire branches with every step of the traversal. There is no need to move back up the tree, and the path traced through the tree is one way only. As a result, it was possible to redesign the chaser unit in a simpler form.

7.2 Chaser Architecture

The function of the chaser is to process a data structure and extract part of it as a result. Figure 7.1 illustrates an abstract level view of a chaser data flow. Like the streamer, the chaser processes a data structure, not the data values directly, and extracts the data node that matches a key. To enable it to do this, the chaser unit has four ports: an input port, an output port, a configuration port and a memory port.


Figure 7.1: Chaser data flow

Figure 7.2 illustrates the structural view of the chaser. The memory port is connected directly to data memory and used to read in the data structure. The input, output and configuration ports are only accessible over the accelerator bus. The input port is used to write the primary key into the appropriate configuration register. The output port is used to retrieve results from the chaser. The configuration port is used to write values into the configuration stack.


Figure 7.2: Chaser unit block

7.2.1 Configuration

The software library hsx/chase.hh provides several software functions to access and configure the chaser unit. There are four chaser channels defined in hsx/types.hh as HSX_CHASE0 through to HSX_CHASE3. These identifiers specify the exact chaser channel to access on the accelerator bus. Like the streamer unit, the chaser holds its configuration in a stack. The only configuration register not accessed through the stack is the key register, PKEY, which is written separately from the rest of the configuration. This separates the search key from the data set configuration, which simplifies the chaser for multi-key searches on a single data set. Figure 7.3 lists the registers in the configuration stack; it works the same way as the streamer configuration stack described in section 5.2.1. All the details are managed by the hsxSetChase() function. To configure the chaser unit, the values need to be written in order: NODE, DATA, EQCC, LTCC, GTCC, CONF. A hedged configuration sketch for a plain tree node follows the register descriptions below.

NODE contains the base pointer. This will typically contain the pointer to the root node of a tree. All the following offsets are calculated as a positive offset from this base pointer.

Configuration stack registers (top to bottom): CONF (with WOK, ROK, ENA and RST bits), GTCC, LTCC, EQCC, DATA, NODE

Figure 7.3: Chaser configuration stack

DATA contains the key offset. A key will be loaded from this offset from the node pointer and compared with the primary key to decide the next operation. The operation performed will depend on the values of the following three CC registers.

EQCC contains an offset to a pointer. This offset is used when the loaded value is equal to the primary key value. All offsets are considered positive values from the base pointer. If a negative offset is supplied, this is interpreted as a hit condition and the present base pointer will be pushed into the result output buffer. Therefore, this register will normally be set to the HSX_CHASE_HIT value in software. This register can be set to a branch node offset if the search is looking for a ‘less than’ or ‘greater than’ key instead. If the pointer is a NULL pointer, the search will stop.

LTCC contains an offset to a pointer. This offset is used when the loaded key value is less than the primary key value. This generally means that the primary key exists in the right branch of the present node. The new node pointer will then be loaded from this offset. If a negative offset is supplied, this is interpreted as a hit condition and the present base pointer will be pushed into the result output buffer. If this value is a NULL pointer, the search will stop.

GTCC contains an offset to a pointer. This offset is used when the loaded key value is greater than the primary key value. This generally means that the primary key exists in the left branch of the present node. The new node pointer will then be loaded from this offset. If a negative offset is supplied, this is interpreted as a hit condition and the present base pointer will be pushed into the result output buffer. If this value is a NULL pointer, the search will stop.

CONF The only register that can be read on the configuration bus is the CONF register, which also functions as a status register. Figure 7.3 lists the bits of the configu- ration register. Most of it works the same as the configuration registers for the other accelerator units.
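The listings in this chapter extract these offsets from STL internals, but for an application-defined tree the configuration is more direct. The sketch below assumes a simple, hypothetical node layout and uses offsetof to fill the fields; the hsxChaseConfig field names and HSX_CHASE_HIT follow Listing 7.3, while the node layout itself is an assumption.

#include <cstddef>           // offsetof
#include "hsx/chase.hh"
#include "hsx/types.hh"

// A plain binary search tree node laid out in external memory (assumed layout).
struct Node {
    int   key;
    Node *left;
    Node *right;
};

// Configure chaser channel 2 to walk a tree rooted at 'root' (32-bit target assumed).
void chase_setup(Node *root)
{
    hsxChaseConfig cfg;
    cfg.node = (int) root;                    // NODE: base pointer of the root node
    cfg.data = offsetof(Node, key);           // DATA: key offset within a node
    cfg.eqcc = HSX_CHASE_HIT;                 // EQCC: negative offset signals a hit
    cfg.gtcc = offsetof(Node, left);          // GTCC: stored key > search key, go left
    cfg.ltcc = offsetof(Node, right);         // LTCC: stored key < search key, go right
    hsxSetChase(HSX_CHASE2, cfg);
}

Once configured, keys are supplied with hsxPutData(HSX_CHASE2, key) and matching node pointers are collected with hsxGetData(HSX_CHASE2), exactly as in Listing 7.3.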

7.2.2 Operation


Figure 7.4: Chaser machine states

Figure 7.4 shows the internal machine states of the chaser. It has a number of states, each capable of running in one clock cycle:

IDLE is the default state and is entered whenever the accelerator is reset by hardware or software. In this state, the registers from the configuration stack are copied into internal machine registers.

NULL is the state where the node pointer is checked for a NULL pointer. If a NULL pointer is detected, the machine returns to the IDLE state. Otherwise, the base pointer and data offset are added and written into the internal node pointer, which now points to the data value.

DATA is the state where the actual data value is read from memory via the memory port. The necessary memory signals are asserted and deasserted to complete a memory transfer. The data is read into a holding data register.

COMP is the state where the loaded data is compared with the key. The result of the comparison is stored in a condition register. This is given its own state so that it can share the same ALU as the other memory calculations.

NEXT is the state where the next pointer is calculated. The appropriate offset is selected based on the result of the earlier comparison. If the offset is a negative value, the chase is completed and the base pointer is written into the output buffer. Otherwise, the offset is then added to the base pointer and written into the internal node pointer, which now points to the next node pointer.

BASE is the state where the pointer to the next node is loaded from memory. Again, the necessary memory signals are asserted to complete a memory transfer. The pointer is loaded as the new base pointer.

The main loop of the machine runs through NULL, DATA, COMP, NEXT and BASE. This makes the theoretical maximum internal bandwidth 640 Mbps at a 100 MHz core speed. However, the loop loads both a data word and a pointer during each iteration, so the theoretical maximum external bandwidth consumption at 100 MHz is 1.28 Gbps.
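These figures follow from the five-state loop, assuming 32-bit words (stated here as an assumption, but consistent with the quoted numbers):

100 MHz / 5 states x 32 bits = 640 Mbps of results (internal)
2 loads x (100 MHz / 5 states) x 32 bits = 1.28 Gbps of memory reads (external)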

Although primarily designed for tree traversal, the chaser unit can also be used to traverse other linked data structures. This can be done by configuring the less-than and greater-than pointer offsets to point to the appropriate next node in the link. When configured this way, a chaser can be used to search for a primary key stored within a linked list or other structure. Alternatively, software can be used to translate other data structures into trees for processing.
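For example, a singly linked list can be described by pointing both branch offsets at the same next-node field, turning the chase into a linear scan for an exact match. As before, this is only a sketch under an assumed node layout.

#include <cstddef>
#include "hsx/chase.hh"
#include "hsx/types.hh"

// A singly linked list node (assumed layout).
struct ListNode {
    int       key;
    ListNode *next;
};

// Configure chaser channel 3 for an exact-match scan of a linked list.
void chase_list_setup(ListNode *head)
{
    hsxChaseConfig cfg;
    cfg.node = (int) head;                     // start of the list
    cfg.data = offsetof(ListNode, key);        // where the key lives in each node
    cfg.eqcc = HSX_CHASE_HIT;                  // equal: report the node
    cfg.ltcc = offsetof(ListNode, next);       // less than: step to the next node
    cfg.gtcc = offsetof(ListNode, next);       // greater than: also step to the next node
    hsxSetChase(HSX_CHASE3, cfg);
}

The search stops naturally when a NULL next pointer is loaded, as described for the NULL state above.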

7.3 Kernel Simulation Results

As before, simulation was used to measure the performance of a chaser. A chaser kernel was written to perform a primary key search in both software and hardware. Listing 7.4 shows the main chaser kernel. The kernel first filled an input tree (lines 82–86) with a number of different random values. As before, the size of the tree was defined at compile time. A key value was then inserted into the tree (line 87) to ensure that there is at least a single result hit. The software and hardware kernels were compiled and run, with the debug output inspected visually to confirm functionality.

7.3.1 Kernel Functional Simulation

Listing 7.1 shows the debug output. Visual inspection confirms that the software and hardware chasers both work and produce the same result. A primary key of 0x1690 is found at the node located at 0x80000680, which is in the heap memory space. This shows that the hardware chaser is capable of performing the same task as the host processor software. Hence, it can be used to offload the primary key search work from the host processor.

Software Chaser
PKEY : 0x1690
FIND : 0x80000680
11844 swticks
740 swmemticks
Hardware Chaser
PKEY : 0x1690
FIND : 0x80000680
11918 hwticks
758 hwmemticks
Listing 7.1: Chaser simulation output (debug)

Listing 7.3 shows the hardware chaser kernel. The method for extracting the configuration parameters is rather complicated owing to the structure of the C++ STL set library: the internal variable used to represent the red-black tree is a private member of the set class, and there is no easy way to access a private member directly from an external application.

Hence, the pointer to the root of the red-black tree was extracted (lines 58–59) using manual offsets, which were obtained by studying the STL tree header to determine where the root pointer is stored. The next few lines extract the relevant pointer offsets for a specific data structure node, depending on the result of the comparison operation. Listing 7.2 shows the software chaser kernel. It simply calls the built-in STL tree search function, which searches the red-black tree in O(log N) time, and returns the pointer to the base node, which is the same result that the hardware operation returns. The pointer value extracted using either method can be used to cast a node pointer in the application to access the data.

7.3.2 Kernel Single Key Timing

Listing 7.5 shows the output from the simulation of chasing a primary key in a red-black tree of size N = 50 elements. With a large N, the simulation takes a long time because building the tree is O(N log N) bound; with a small N, the period of interest is very short because the search completes in O(log N) time. This tree size was chosen as a trade-off between simulation time and observability. The listing shows the total hardware and software timing results without any debug output. There are a few extra memory accesses for the hardware operation, which are largely used during the hardware configuration. Figure 7.5 shows only the timing diagram of the important signals in the hardware portion of the above simulation. As before, the time values are unitless, but 10 units are equivalent to a clock tick. There are four visible markers on the diagram:

A (1590631) marks the beginning of the hardware configuration overhead. This is when the hsxPutData(HSX_CHASE2, pkey) operation was called. The parts before this are considered the function call and return overhead.

B (1591781) marks the end of the hardware configuration overhead. This is when the hsxSetChase(HSX_CHASE2, cfg) function was completed. The actual chaser operation began to run immediately after this, as indicated by the activity on the dwb_stb_o signal (marker 1).

C (1592401) marks the point when the primary key was found by the chaser, as indicated by the rok_o assertion (marker 2). Although the key was found, the host processor did not yet realise it; it took a few more ticks for the host processor to check the status of the output buffer and retrieve the result, as indicated by the rde_i assertion (marker 3).

70 int swchase(std::set &setA, int pkey) 31 { #ifdef DEBUG iprintf("PKEY\t: 0x%X\n", pkey); #endif 35

volatile int j = (int)&*setA.find(pkey)._M_node;

#ifdef DEBUG iprintf("FIND\t: 0x%X\n", j); 40 #endif

return EXIT_SUCCESS; } Listing 7.2: Software chaser kernel

int hwchase(std::set &setA, int pkey) 46 { #ifdef DEBUG iprintf("PKEY\t: 0x%X\n", pkey); #endif 51 // Configure the hardware chaser. hsxChaseConfig cfg; std::set::iterator node;

int *tree = (int *)&setA+2; // Manual tree header offset. 56 cfg.node = (int) *tree; // Extract pointer to tree ROOT. cfg.eqcc = HSX_CHASE_HIT; // Hit when memory = key. cfg.gtcc = (int) &node._M_node->_M_left; // Offset when memory is > key. cfg.ltcc = (int) &node._M_node->_M_right; // Offset when memory is < key. cfg.data = (int) &((std::_Rb_tree_node*)node._M_node)->_M_value_field ; // Data offset 61

hsxSetChase(HSX_CHASE2 , cfg);

hsxPutData(HSX_CHASE2 , pkey); 66 while (!(hsxGetConf(HSX_CHASE2) & (1<<2))); // wait for result volatile int j = hsxGetData(HSX_CHASE2);

#ifdef DEBUG iprintf("FIND\t: 0x%X\n", j); 71 #endif

return EXIT_SUCCESS; } Listing 7.3: Hardware chaser kernel

int chaser()
77 {
     std::set<int> setA;
     int pkey = getrand() & 0x0000FFFF;
81
     // prefill tree (reconstructed: the original fill loop was lost in
     // extraction; random insertions are assumed, as described in the text)
     for (int i=0; i<LIST_MAX; ++i) {
         setA.insert(getrand() & 0x0000FFFF);
     }
87   setA.insert(pkey);        // plant the search key (line 87 in the text)

     // sort lists
     //listA.sort();
91
     // do sieve
     int ticks;
     int memtick;

     // SOFTWARE CHASE
96   iprintf("Software Chaser\n");
     memtick = getmemtick();
     ticks = gettick();
     swchase(setA, pkey);
     ticks = gettick() - ticks;
101  memtick = getmemtick() - memtick;
     iprintf("%d swticks\n", ticks);
     iprintf("%d swmemticks\n", memtick);

     // HARDWARE CHASE
106  iprintf("Hardware Chaser\n");
     memtick = getmemtick();
     ticks = gettick();
     hwchase(setA, pkey);
     ticks = gettick() - ticks;
111  memtick = getmemtick() - memtick;
     iprintf("%d hwticks\n", ticks);
     iprintf("%d hwmemticks\n", memtick);

     return EXIT_SUCCESS;
116 }
Listing 7.4: Chaser kernel

Figure 7.5: Single key chaser timing diagram

Software Chaser
420 swticks
31 swmemticks
Hardware Chaser
460 hwticks
47 hwmemticks
Listing 7.5: Chaser simulation output

D (1592571) marks the end of the hardware chase kernel. At this point, the results were retrieved and the operation was completed. The hardware kernel then returned control to the main kernel.

The total operation took TAD = 194 ticks. Of this, TBD = 79 ticks (40.7%) was used in the actual chase operation, while the balance, TAB = 115 ticks (59.3%), was used by the hardware configuration overhead. A significantly large proportion of the operation time was therefore spent configuring the hardware parameters rather than performing the chase.

From the simulation output, another T+ = 266 ticks (+137.1%) is used by the function call and return overhead, consumed before A and after D. Assuming a similar function call overhead, the software operation completed in TSW = 154 ticks. From this single timing simulation, the timing estimate of the speed-up is:

TSW / TAD = 154 / 194 = 0.79

If the hardware configuration overhead is discounted as a fixed cost, the hardware chase operation speed-up, TSW / TBD = 154 / 79 = 1.95, is much better. This is a significant acceleration for a very basic search function.

7.3.3 Kernel Single Key Performance

Figure 7.6 shows the simulation results for the chaser software and hardware operations. The data set sizes chosen were between 10 and 300, giving tree depths of between 4 and 9 levels. The time taken for simulation grows quickly with the data set size because the insertion process is O(N lg N) bound; therefore, the data set was kept to a moderate size to keep simulation times reasonable. As the performance of the search algorithm is O(lg N) limited, the graphs are plotted against lg N. The performance of a single key search in any one trial depends greatly on the level at which the key sits in the tree. As the data set is prepared randomly, the vertical error bars on each data point reflect the standard deviation of the values obtained across 50 trials. Fitting straight lines to the points gives the following two equations to describe them:

1 Nomenclature: lg = log2

Figure 7.6: Chaser simulation (software and hardware tick counts, and speed-up, against data set size, lg n)

Csw,s(N) = 22.9 lg N + 280.6 (7.1)

Chw,s(N) = 9.04 lg N + 390.2 (7.2)

Equation 7.1 describes the relationship for the software operation. The intercept of 281 ticks agrees with the timing estimate of 266 ticks for the function call and return overhead. The single key chase is a very different operation from the other accelerator devices because a search can find a key very quickly, owing to the lg N nature of the search process; there is therefore a fairly large possible range for the intercept point.

Csw(N) Cup = = 2.54 Chw(N)

The crossover point in the graph is at about lg N = 8, which is N = 256. This means that for any tree larger than 256 elements the hardware chaser affords an acceleration, while the software method remains faster for small trees. This will still prove useful in everyday applications: most indices will benefit, because any data set worth accelerating will almost certainly be larger than 256 entries.
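The crossover figure follows directly from equating the two fitted lines (7.1) and (7.2):

22.9 lg N + 280.6 = 9.04 lg N + 390.2
lg N = (390.2 - 280.6) / (22.9 - 9.04) = 7.9, so N is approximately 2^8 = 256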

7.3.4 Kernel Multi Key Timing

The chaser has a more interesting mode, in which it can provide better acceleration than before: it can be configured to perform multiple key searches on the same data structure. In this mode, a dedicated chaser can be allocated to each data structure that

needs to be searched, simplifying the configuration process. This is useful for a structure that is searched frequently, such as an operating system process table. Listings 7.6 and 7.7 show a multiple key search kernel. In each case, the number of keys searched is N/10 (integer division) for a data set of size N. The thing to note in Listing 7.7 is the order in which the hardware is configured: the configuration registers are written and the chaser is enabled before the keys to be searched are loaded. If the order were swapped, the keys already in the input buffer would be flushed during the configuration process. Listing 7.8 shows the debug output of the simulation. The results were verified through visual inspection and shown to be the same for both the software and hardware methods. Figure 7.7 shows the results of one simulation trial for N = 30 and 3 key searches. There are six markers on the timing diagram:

A (1225782) marks the start of the hsxChaseSet() function call. The configuration parameters were all extracted before this point. Since the parameters are the same as that for a single search, it can be assumed to consume the same amount of time.

B (1226882) marks the end of the hsxChaseSet() function call. At this point, the chaser is configured but has not begun chasing the key as seen by the absence of dwb_stb_o signal assertions.

C (1227112) marks the time when the first key is written into the chaser. Almost immediately, the loading of the key is signalled by the rde_i signal. The chaser then began operating soon after this, as seen by the multiple dwb_stb_o assertions (marker 1).

D (1227462) marks the time when all of the keys have been written into the chaser, as indicated by the wre_i signal (marker 2). This operation overlaps the actual chase operation, so the keys to be searched can all be safely written in advance provided there are fewer of them than the depth of the input buffer. Otherwise, the multiple key search needs to be broken up into several phases.

E (1228202) marks the time when the last key was loaded, as indicated by the rde_i signal (marker 3). The time between C and E can be used to calculate the average key search time.

F (1228862) marks the time when all the results have been retrieved and the multiple key chase operation is completed.

76 int swchase(std::set &setA, std::vector &pkey ) 32 { for (int i=0; i<(LIST_MAX/20); ++i) { 35 volatile int j = (int)&*setA.find(pkey[i])._M_node; #ifdef DEBUG iprintf("FIND\t: 0x%X\n", j); #endif } 40

return EXIT_SUCCESS; } Listing 7.6: Software multi-key chaser kernel

int hwchase(std::set &setA, std::vector &pkey ) 45 { 46 // Configure the hardware chaser. hsxChaseConfig cfg; std::set::iterator node;

int *tree = (int *)&setA+2; // Manual tree header offset. 51 cfg.node = (int) *tree; // Extract pointer to tree ROOT. cfg.eqcc = HSX_CHASE_HIT; // Hit when memory = key. cfg.gtcc = (int) &node._M_node->_M_left; // Offset when memory is > key. cfg.ltcc = (int) &node._M_node->_M_right; // Offset when memory is < key. cfg.data = (int) &((std::_Rb_tree_node*)node._M_node)->_M_value_field ; // Data offset 56

hsxSetChase(HSX_CHASE2 , cfg);

for (int i=0; i<(LIST_MAX/20); ++i) // write multiple keys hsxPutData(HSX_CHASE2 , pkey[i]); 61

for (int i=0; i<(LIST_MAX/20); ++i) { while (!(hsxGetConf(HSX_CHASE2) & (1<<2))); // wait for result volatile int j = hsxGetData(HSX_CHASE2); 66 #ifdef DEBUG iprintf("FIND\t: 0x%X\n", j); #endif } 71 return EXIT_SUCCESS; } Listing 7.7: Hardware multi-key chaser kernel

Figure 7.7: Multiple key chase kernel timing

Software Chaser
FIND : 0x80000860
FIND : 0x80000888
FIND : 0x800008B0
20652 swticks
1198 swmemticks
Hardware Chaser
FIND : 0x80000860
FIND : 0x80000888
FIND : 0x800008B0
19974 hwticks
1198 hwmemticks
Listing 7.8: Chaser multi-key simulation output (debug)

Software Chaser
886 swticks
75 swmemticks
Hardware Chaser
574 hwticks
71 hwmemticks
Listing 7.9: Chaser multi-key simulation output

The complete hardware operation took TAF = 308 ticks in this one trial. Of this, TAC = 133 ticks (43.2%) was hardware configuration overhead and the balance, TCF = 175 ticks (56.8%), was consumed by the actual chase operation. The key searches spanned TCE = 109 ticks, roughly 36 ticks per key, which is fairly close to the single key timing estimate. From the timing figures in Listing 7.9, the function call and return overhead is estimated to be T+ = 266 ticks (+86.4%). Assuming a similar function call and return overhead, the software operation takes TSW = 620 ticks, giving an estimated speed-up factor of 2.0. However, these estimates are only an indicator as they reflect a single trial; a more accurate value is estimated in the next section.

7.3.5 Kernel Multi Key Performance

Figure 7.8 shows the results of a series of multiple key searches, each repeated 50 times. In each case, the graphs are plotted against the number of searches performed, which is set to 10% of the data set size. The performance is linearly related to the number of keys searched, with each key taking a slightly different amount of time to find. Fitting straight lines to the points gives the following two equations to describe them:

Csw,m(N) = 285.2 N + 4.6 (7.3)

Chw,m(N) = 83.2 N + 318.2 (7.4)

Figure 7.8: Chaser simulation (multi-key; tick counts and speed-up against the number of searches n)

Equation 7.3 describes the relationship for the software operation and equation 7.4 the relationship for the hardware operation. The intercept of 318 is similar to the timing estimate of TAC + T+ = 399 ticks for the total hardware overhead costs. The speed-up factor of Csw,m(N) / Chw,m(N) = 1.51 at N = 3 is close to the timing estimate of 2.0. However, for a sufficiently large N, the speed-up factor is:

Cup = Csw,m(N) / Chw,m(N) = 285.2 / 83.2 = 3.43

7.4 Conclusion

A primary key search is a common task that needs to be performed for both primary and secondary searches. The hardware chaser unit can be used to both offload and accelerate the primary key search process. However, it only provides a significant saving if the data set to be searched is larger than about N = 256, and the acceleration it provides is O(lg N) bound. If a chaser is used to search multiple keys in the same data structure repeatedly, the acceleration scales with the number of keys to be searched, even though each individual key search is still O(lg N) bound. For a sufficiently large data set, the maximum acceleration factor is Cup = 3.43 for multiple key searches. The maximum external memory bandwidth required by each chaser unit is 1.28 Gbps at 100 MHz.

CHAPTER 8

Memory Interface

As search is primarily memory limited, the cache and memory hierarchy are explored. A special cache that takes structural locality into account, instead of just temporal and spatial locality, is also designed. However, the improvement gained is only 3%, which is not significant enough to warrant its use in the search accelerator unless absolutely necessary.

8.1 Introduction

All the results thus far have been obtained with one underlying assumption: the simulations were all run without the use of any cache memory. All memory accesses were sent out through the memory arbiter to a simulated external memory device, as described in section 5.3. Typical computer architecture practice exploits a memory hierarchy, using cache, to speed up operations. However, the effect of cache memory on search needs to be studied.

8.2 Cache Primer

Search algorithms are typically limited by the number of records that have to be searched through, which translates into the size of the search space N. As main memory is slow, for search algorithms that have to traverse through in-memory data sets, this becomes a major bottleneck, which is usually alleviated by use of cache memory. The performance of existing cache architectures is fairly well understood [Han98, Gen04, van02]. Existing cache memories are designed around two core principles: temporal and spatial locality;

while cache performance is typically regulated by three basic parameters: cache size, line length and associativity [HS89]. Performance is improved by retaining more data within the cache. It is common to find more than half the die area of a modern processor taken up by cache memory, as processor speeds have outpaced memory speeds [FH05]. However, this directly increases cost by increasing chip area, when this valuable space could alternatively be used to increase the functionality of a processor; conversely, a reduction in area could lower its cost.

Moreover, improving search performance by merely increasing cache size is not sustainable, simply because there will always be a significant amount of data stored in main memory. Even if cache sizes reach gigabyte values, primary memory by then would be larger and the data set sizes potentially larger still. Therefore, the search space will always be stored in external memory instead of on-chip cache.

A cache line represents the amount of information read whenever data is fetched from main memory. Longer cache lines affect performance by bringing in larger blocks of data at a time, exploiting spatial locality. This naturally benefits linked data structures because each node generally holds multiple words of information. However, complex data structures may not have their data nodes located in contiguous locations, which reduces the effectiveness of spatial locality.

Associativity works by replicating cache blocks to reduce the problem of a single cache block mapping to multiple areas in main memory (aliasing). This improves the probability of retaining data in the cache. If information is widely scattered across memory, there is less spatial locality to exploit, so higher associativity may help by retaining multiple blocks within the cache. However, the cost increases quickly with associativity because of the replication.

For instruction cache, both temporal and spatial locality are equally important. As instructions are executed sequentially, instructions next to an existing one are likely to be used (spatial locality), and in the case of loops, recently used instructions are likely to be used repeatedly (temporal locality). However, it is less clear that a data cache benefits from the same design features, for the reasons set out below. For search applications, the data space exhibits ephemeral characteristics: data structures are usually traversed in one direction, and once a node is used it is unlikely to be used again, which reduces the effectiveness of temporal locality. For accessing data structures, structural locality may be more important, because once a node is checked against the search key the traversal will normally move on to a child node, which can be located anywhere in data memory.

8.3 Cache Principles

As microprocessor speeds outpace memory speeds, the processor spends more time waiting for data. The present trend in general-purpose microprocessors is to increase the amount of cache to reduce this penalty. There are two problems with this trend. Firstly, increasing cache improves general-purpose performance, but may not help with search operations. Secondly, this strategy is not cost-effective from area, power and efficiency points of view. As cache performance can severely affect the performance of software, how it can help a search accelerator needs to be investigated. To facilitate analysis, design and testing, a parameterisable cache memory block was first designed and tested. The cache uses a pseudo-LRU replacement mechanism for 2-way and 4-way associativity configurations. The associativity, size and line width of the cache are set by conditional defines passed on the command line by the simulation scripts.

[Simulation memory map: a 32-bit RISC CPU with an instruction L1 cache over .text (0x00000000–0x000FFFFF), a data L1 cache over .data (0x00100000–0x001FFFFF), and .heap/.stack in external memory (0x80000000–0x81FFFFFF).]

Figure 8.1: Cache simulation setup

Figure 8.1 illustrates how the caches and memory were set up for simulation. The memory transfers were monitored by simulation scripts and dumped as text, which was then post-processed using text processing tools on the host computer. The memory space was divided into the following main address spaces for easy monitoring:

.text is reserved for read-only instruction memory. This memory can either be implemented as simulated on-chip ROM or off-chip flash. In this case, it was an on-chip device. Therefore, all transfers happen at the fastest possible rate to avoid slow-downs in simulation due to excessive wait times for instructions.

.data is reserved for read-write initialised data memory. This memory was implemented as a small block of on-chip RAM. It mainly holds certain variables, constants, strings and other pre-initialised values.

.heap is reserved for the read-write heap memory. It represents an uninitialised block of external memory. This is where the dynamically allocated data (using malloc()

83 and free()) is located. Entire data structures were stored within this area in- cluding trees, lists and other dynamically linked data structures.

.stack is reserved for the software stack. This was used for function call and return overheads and for passing parameters between functions. Some parts of the data structure may also be stored within this area, such as the structural information of a tree that is located in the heap.

A software kernel was written to test the operation of the cache memory block by performing a key search in a tree. Two types of code were used: a random value search and a repeated value search. The simulation trials were conducted with a different number of loop iterations, defined as ITERS in the code. For each value of ITERS, 50 samples were collected.

Listing 8.1 shows part of the cache simulation construct in Verilog. Lines 210–225 save the external memory contents into a Verilog memory (VMEM) file. Lines 227–230 load the contents from the Verilog memory file back into external memory. These operations simulate the load and save operations on a computer. Constructing each tree takes O(N log N) time, which translates into more than a day of real-world simulation time for such a large tree. Therefore, reusing a saved tree reduces simulation time tremendously.

Listing 8.2 shows the data preparation code used to pre-build the search tree. Lines 39–44 fill a red-black tree with 2^16 records. The resultant data structure has a node that is 6 words in size, which results in a 400kbyte data set. Line 48 triggers the simulation construct that saves the external memory into a Verilog memory (VMEM) file.

Listing 8.3 shows the cache simulation kernel. Line 44 triggers the simulation construct to transfer the pre-built tree structure into the heap. Lines 50 and 62 enable and disable the data cache for simulation, which is used to limit the cache results to pure search code. Lines 52–59 iterate through a red-black tree search a number of times.

    if (dwb_stb_o & dwb_ack_i & dwb_wre_o & (dwb_adr_o[31:16] == 16'h0200)) begin
210    $strobe("SAVE MEMORY");

       fname = $fopen("dopb.vmem");
213    $fdisplayh(fname, "/* Save OPB RAM */");
       // save heap
       for (save = 0; save < 32'h00080000; save = save + 1) begin
          $fdisplayh(fname, "@", save, " ", {rDOPB[save]});
       end
218    // save stack - important!!! as some information is pushed onto the stack by the compiler
       for (save = 32'h07D0000; save < 32'h07EFFFF; save = save + 1) begin
          $fdisplayh(fname, "@", save, " ", {rDOPB[save]});
       end
       $fclose(fname);
223
    end

    if (dwb_stb_o & dwb_ack_i & dwb_wre_o & (dwb_adr_o[31:16] == 16'h0400)) begin
       $strobe("LOAD MEMORY");
228    $readmemh("dopb.vmem", rDOPB);
    end

Listing 8.1: Verilog simulation LOAD/SAVE

   #define NODE_MAX 0x010000 // 64k nodes

31 int main() // works with -O1
   {
      // declare and create a tree
35    std::set<int> *rbtree;
      rbtree = new std::set<int>();

      // pre-fill the tree
      for (int i = 0; i < NODE_MAX; ++i)
40    {
         *hsx::STDO = i;
         rbtree->insert(i << 16);
      }

45    // save/load the tree
      *hsx::STDO = rbtree->size();
      *hsx::SAVE = -1;

      rbtree->clear();
50    *hsx::STDO = rbtree->size();

      *hsx::LOAD = -1;
      *hsx::STDO = rbtree->size();

55    aemb::enableDataCache(); // start the cache test

      // list the tree
      for (std::set<int>::iterator node = rbtree->begin(); node != rbtree->end(); node++)
      {
60       *hsx::STDO = *node;
      }

      aemb::disableDataCache(); // disable the cache test

65    exit(0);
   }

Listing 8.2: Cache tree fill kernel

   int main()
37 {
      // declare and create a tree
39    std::set<int> *rbtree;
      rbtree = new std::set<int>(); // create a rbtree object in the heap

      rbtree->clear(); // !!! do not skip this step
      *hsx::LOAD = -1; // simulator heap load
44
      // search for 10 values
      std::set<int>::iterator node;

      // enable cache
49    aemb::enableDataCache();

      int j = *hsx::PRNG << 16;
      for (int i = 0; i < ITERS; ++i)
      {
         j = *hsx::PRNG << 16; // re-randomise the key (commented out for the repeated search case)
         node = rbtree->find(j);
         *hsx::STDO = *node;
      }
59
      // disable cache
      aemb::disableDataCache();

      exit(0);
64 }

Listing 8.3: Cache simulation kernel

Lines 55–56 are commented out for the repeated search case or left uncommented for the random search case.

[Four plots: instruction cache hit:miss ratio (top row) and data cache hit:miss ratio for the Data, Stack and Heap regions (bottom row), plotted against loop iteration (n), for the repetitive search and random search cases with the 2K1W2L configuration.]

Figure 8.2: Basic cache operation

8.3.1 Instruction Cache

At this point, the intention was to verify the functionality of basic data and instruction cache blocks. The specific results are less important as cache parameter effects on search operations are simulated later. Figure 8.2, however, yields some useful early considerations. Extrapolating the points for the instruction cache linearly yields the following relationships:

I_rep(N) = 1.25N + 17.3    (8.1)

I_rnd(N) = 1.21N + 16.3    (8.2)

Both equations 8.1 and 8.2 show that the instruction hit ratio improves linearly with the number of iterations through the loop for both the repetitive and random cases. The two graphs are very similar in nature, within 3.2% of each other, and any difference is insignificant. The linear trend is expected because the same code loop and the same instructions are being run each time. This will only happen when the instruction code is small enough to be loaded once and fit entirely in the cache. In the case of

the simulation kernel, the software runs only through the red-black tree search routine, which is only a sub-section of the total code. The simulation results suggest that the cache is correctly capturing and retaining needed data. Spatial and temporal locality are beneficially exploited.
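For reference, the 3.2% figure presumably corresponds to the relative difference between the fitted slopes:

   (1.25 − 1.21) / 1.25 ≈ 0.032 = 3.2%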

8.3.2 Data Cache

The situation for the data cache is very different. Both the stack and heap were cached in the data cache, and they exhibit different characteristics. The search tree was stored in the heap, while the function call and return overheads were stored in the stack. As can be seen from the individual curves, the stack exhibits a higher hit ratio than the heap. The performance of the stack cache indicates that a data cache would be a valuable addition for software function call and return operations.

However, the heap ratio for the repeat case is 0.9, while that for the random case is only 0.09. The ten-fold difference is due to temporal locality. It is expected for the random case, as different searches traverse down different branches of the tree on each iteration. The repeat case, however, might be expected to perform linearly, like the instruction cache, as it steps through the same data nodes on each iteration. The reason it does not is that the data cache is very much smaller than the data space. The 2K1W2L parameter gives a 2kbyte cache (2K), organised in a direct-mapped (1W) configuration with a 2-word cache line (2L), in 256 blocks. The task of searching a single key involves calling several C++ STL subroutines, which overwrites the heap cache with stack values. This results in the characteristic curve of diminishing returns for the data cache in each case.

It is highly unlikely that multiple search operations are going to search the exact same data every time. Although the same algorithm may be used, the data sets and keys may change. The results are an early indicator that a search data set may not be suitable for caching inside a data cache. The results from the stack performance also show that at least 30 iterations are needed to get a result that is close to the value for a large number of iterations. This helps to determine the minimum number of iterations that need to be performed in later parts of the chapter.
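As a worked example of the 2K1W2L parameters (a sketch assuming 32-bit byte addressing; not thesis code), the address decomposition below reproduces the 256-block figure quoted above:

   #include <cstdint>
   #include <cstdio>

   constexpr unsigned kCacheBytes = 2048;      // "2K": 2 kbyte cache
   constexpr unsigned kWays       = 1;         // "1W": direct mapped
   constexpr unsigned kLineBytes  = 2 * 4;     // "2L": 2 words of 4 bytes per line
   constexpr unsigned kSets       = kCacheBytes / (kWays * kLineBytes);   // = 256 blocks

   int main()
   {
       const std::uint32_t addr   = 0x80001234u;                 // an address in the heap region
       const std::uint32_t offset = addr % kLineBytes;            // byte within the line
       const std::uint32_t index  = (addr / kLineBytes) % kSets;  // which of the 256 blocks
       const std::uint32_t tag    = addr / (kLineBytes * kSets);  // remaining upper bits
       std::printf("sets=%u offset=%u index=%u tag=0x%x\n", kSets, offset, index, tag);
       return 0;
   }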

8.4 Cache Parameters

The next issue is how different cache configurations affect search operation, with particular attention to the heap cache. The cache was checked by running a software loop through 50 iterations that searches for a specific key in a tree. Each combination of cache parameters was then sampled over 50 iterations and the hit-miss ratio recorded. The cache parameters were changed one at a time and the entire simulation repeated. A

complete simulation run took several days to complete. Figures 8.3, 8.4, 8.5 and 8.6 are grouped by memory size, as this translates directly into physical cost, which is an important factor in any physical implementation. Each figure contains six sub-figures, each labelled with a different cache size. Each sub-figure is a heat-map with associativity (N-way) and cache line width (2^N words) as the parameters, while the colour represents the hit-miss ratio. The results provide some interesting trends and guidelines for search-oriented caches.

8.4.1 Instruction Cache

Figure 8.3 shows the results of the instruction hit ratio for different cache parameters. For the instruction hit ratio simulation, the entire search kernel was about 4kbytes in size. It was compiled using the -O2 optimisation level. The instruction cache performed as expected, based on the previous results. All the plots exhibit similar trends, which indicates consistency in the results. Performance improves with increased size, line width and associativity. Interpolating from the raw data provides the following approximation of the instruction hit ratio, where the cache-line width is 2^L words and 2^K is the cache size.

I_hit = 2^(0.04KL + 0.22L − 0.06K + 5.1)    (8.3)

Line Width. The most obvious improvement was seen with each line width increment. This is because instructions are stored and executed in sequence. Increasing the line width will improve the spatial locality performance. Longer lines capture more sequential data at once but also increase the line refill time. However, due to the sequential nature of instruction operation, the increased hit ratio makes up for the increased cache refill times.

Associativity. Increased associativity also increased performance, but not by a significant amount. Graphically, increasing the associativity from direct-mapped to 4-way barely changed the hit ratio. Increased associativity only helps if the kernel size is significantly larger than the cache size; if it is smaller, memory aliasing does not occur, as the entire kernel is retained within the cache block. Equation 8.3 omits the associativity term entirely as it was insignificant.

Cache Size. Figure 8.3 did not show any improvement for larger caches. Larger caches are redundant when the kernel is only 4kbytes in size. For a much larger software kernel,

an observable improvement is expected over increasing cache sizes. Equation 8.3 shows that size variability is a less significant factor in cache performance than line width.

[Figure 8.3: Instruction cache hit ratio — heat-maps of the hit-miss ratio against line width (2^N words) and associativity for cache sizes from 2K to 64K, for the random search case.]

8.4.2 Data Cache Trends (Repeat Key)

Now that the basic instruction cache trends for executing search algorithms are known, the data cache trends for search need to be observed. The investigation focuses on the heap hit ratio only, as the data structures are almost entirely stored in the heap. Figure 8.4 shows the results of the repeat simulation with different cache parameters. All the plots have a similar shape, which indicates consistency in the results.

Line Width. Once again, the most visible improvements were due to an increase in cache line width. There is a visible leap in hit ratio, when moving from a line width

of 8 words to 16 and 32 words, even though each data node is only 6 words in size. In this specific case, the nodes were inserted into the tree in order. For a tree built in order, the child nodes are located spatially close to each other, which results in hit ratio improvements for longer cache lines. For a tree built with randomly inserted data, adjacent nodes are scattered throughout the heap, losing the advantage of spatial locality for longer cache lines. In a real-world scenario, it is unlikely that the tree nodes will be spatially close to each other, due to frequent data insertions and removals. Therefore, in real-world applications, longer cache lines will be beneficial only up to about the size of a single data node.

[Figure 8.4: Repetitive heap cache — heat-maps of the hit-miss ratio against cache line (2^L words) and cache way (2^W) for cache sizes from 2K to 64K.]

Cache Size. Unlike the instruction cache, there was a visible benefit in increasing the size of the overall cache. This is reflected in the colour values, whose maximum increased from 2.8 to 8.0 across the plots. It can be safely concluded that a larger cache size

will improve the hit ratio until the size of the cache approximates the data set size of 400kbytes. However, the improvement is not linear, as doubling the cache size did not double the hit ratio.

Associativity. Associativity improvements seem a little more complex at first. Counter-intuitively, an increase in associativity worsens the hit ratio when the cache line is very short. Unless the data used are stored in multiple memory locations aliased by a single cache block, increased associativity is less effective. As associativity increases, the number of memory locations that map to a single block of cache actually increases. This increases the chance of old data being evicted when it otherwise would not have been.

8.4.3 Data Cache Trends (Random Key)

Now that a baseline data cache performance has been established, a better approximation to real-world operation can be observed. Figure 8.5 shows the simulation results for similar conditions as above, except that a different key is searched each time. All the plots have a similar shape, which indicates consistency in the results, but this gives a very different picture from that of section 8.4.2.

Line Width. Once again, the most visible improvement in hit ratio came from an increase in line width. However, the improvements are non-linear, as a doubling of line width is not accompanied by a corresponding doubling in performance. This adds weight to the earlier assertion that, for real-world applications, increasing the line width would not continue to increase the hit ratio significantly. The reasons are the same as before: the data structure will benefit from spatial locality, but only to the extent of the size of a single data node.

Cache Size. This gave a very different result compared with the previous section. Increasing the cache size does not have any significant effect on the hit ratio, as the maximum values stay within the 1.6–1.8 range, whereas they increased steadily from 2.8 to 8.0 in the repeat case. This is because the traversed data structure exhibits only limited temporal locality. The data held in cache are wasted because they are not needed again, and the cache has to continuously fetch new data from main memory.

Associativity. For the same reasons as above, changes to the associativity did not bring any visible benefit. Increasing the associativity did not change the hit ratio significantly, as evidenced by the lack of change in the heat map from 1-way to 4-way. Associativity only helps when there is a problem with cache thrashing. In this case, there

is little cache contention because there is little temporal locality to begin with, so practically all the data in the cache can be discarded.

[Figure 8.5: Random heap cache — heat-maps of the hit-miss ratio against cache line (2^L words) and cache way (2^W) for cache sizes from 2K to 64K.]

8.5 Data Cache Prefetching

The results obtained show that data caches tend to exhibit a low hit ratio for search operations. It is possible that prefetching techniques could be used to improve the performance. Prefetching techniques fall into two main categories: static and dynamic. More details can be found elsewhere [KY05].

8.5.1 Static Prefetching

Static prefetching involves making software modifications during compile time to initiate prefetching. Hardware changes are needed to implement a non-blocking method of initiating memory access, or a specialised sub-block to prefetch data into cache. This operation is then accessed through a special software prefetch instruction. There are two main approaches: latency tolerance and latency reduction. There are also two categories of data that the techniques work on: arrays and linked data.

Latency tolerance involves prefetching memory ahead of use so that miss latency is hidden rather than removed. This still incurs a heavy cost on memory bandwidth, as all the search data still need to be prefetched. As bandwidth is finite, the performance will still eventually reach a memory bandwidth limit. Latency reduction involves retaining data in the cache to reduce the number of misses. This method is more sustainable as it does not significantly increase the bandwidth cost. However, it may not help for search data.

For arrays, data are usually stored in sequential locations and benefit from spatial locality. Data arrays are also usually statically allocated during compile time, so it is trivial for the compiler to figure out which addresses to prefetch and to safely reorder instructions to insert prefetch instructions. For linked data, the next block to prefetch cannot be predicted during compile time. Furthermore, there are no guarantees that the memory allocation will be sequential or linear. Therefore, such cases are more difficult to handle than arrays. Unfortunately, this is the category of data that will be present for search applications.

Direct Arrays are the easiest data structures to prefetch. Access to these arrays is statically defined, and software operations that act on them can be thoroughly analysed during compile time. Instructions can then be accurately reordered and prefetched. However, this method is only useful for searching and sorting through arrays, and not for more complex search structures.

Indirect Arrays are slightly more difficult to handle than direct arrays. Access to these arrays depends on an index that is only available during run-time. However, the structure of the data is well defined. All that is needed is to precalculate the index value and prefetch it. The rest is similar to direct arrays.

Tiling essentially breaks down and reorders loop operations into smaller chunks. This allows work to be done on smaller chunks of the array at a time, to exploit temporal locality. This would be fairly useful for signal processing applications but not for search.

Natural Pointers involve reordering instructions and inserting a special operation to fetch the next child node before it is needed, as sketched after this list. It suffers from a few weaknesses. It is only useful if there is a significant number of other operations between the prefetch and the point where the node is needed; there is no benefit if the next node is operated on immediately after the prefetch, as in the case of streaming data. Also, it is only able to prefetch nodes directly next to the present node, which limits its effectiveness at prefetching longer paths.

Jump Pointers involve modifying the software data structure to store additional pointers that hint at nodes to prefetch. These can point further down the path, solving the problem of natural pointers. However, this method would require large modifications to existing software and hardware, which limits its utility.

Data Linearisation works by reordering the location of data nodes during data insertion or deletion to improve spatial locality. However, this involves changing the malloc() routine to re-order data structures on-the-fly. It mainly benefits data structures that rarely change; otherwise, the penalty for re-ordering will be significant.
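Returning to the natural-pointer approach, the sketch below illustrates the idea using the GCC/Clang __builtin_prefetch intrinsic (an illustration only; the Node layout and the process() work function are assumptions, not thesis code). As noted above, the prefetch only pays off if the per-node work is long enough to cover the memory latency.

   #include <cstdint>

   struct Node {
       std::uint32_t key;
       const Node*   left;
       const Node*   right;
   };

   // Placeholder for non-trivial per-node work; without it the prefetch has
   // nothing to hide behind.
   inline void process(const Node* n)
   {
       volatile std::uint32_t sink = n->key;
       (void)sink;
   }

   void search_with_prefetch(const Node* root, std::uint32_t key)
   {
       for (const Node* n = root; n != nullptr; ) {
           const Node* next = (key < n->key) ? n->left : n->right;
           if (next != nullptr)
               __builtin_prefetch(next);   // hint only: fetch the child early
           process(n);                     // latency is hidden only if this takes long enough
           if (n->key == key)
               return;
           n = next;
       }
   }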

8.5.2 Dynamic Prefetching

Dynamic prefetching is a hardware technique to detect and initiate prefetching. This can be performed at various levels of the memory hierarchy. The closer the hardware prefetcher is located to the processor, the more information it has access to, in order to calculate the prefetch location. Therefore, it would make sense to include the hardware prefetcher within the processor, where possible.

Stride prefetching works by detecting and prefetching sequential accesses that are a fixed distance apart. This can be implemented with extra hardware, using a stream buffer. This is particularly useful for signal processing applications, which often access data at a fixed distance, such as for filters. But for a scattered heap, it is less useful.

Correlation prefetching records the sequence of memory accesses and uses the information to decide which blocks to prefetch. Most linked data are often stored at random locations in memory. Although seemingly random, a repeating sequence can be detected during run time. For example, multiple searches through a tree will always access the root node, followed by one of its child nodes, and so on. However, it requires some initial time to build up the correlation tables, and a lot of additional hardware.

Content-based prefetching uses the content stored inside a data structure to predict which block to prefetch next. It examines the data structure itself to identify potential pointers stored within it. These can be identified by comparing the most significant bits of the content with those of the current data being fetched, based on the assumption that the nodes are allocated within a similar section of memory (e.g. within the heap area). This can be tricky when the data structure contains data that have similar upper bits. Also, like natural pointers, it can only look one level down the search path.
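A rough software model of the content-based idea is sketched below (illustrative only; the heap base and the number of upper bits compared are assumptions based on the memory map of figure 8.1). Every word of a newly fetched line whose upper bits match the heap region is treated as a candidate pointer to prefetch.

   #include <cstdint>
   #include <vector>

   constexpr std::uint32_t kHeapBase = 0x80000000u;   // heap region from figure 8.1
   constexpr std::uint32_t kHeapMask = 0xFE000000u;   // assumed: compare the top bits only

   std::vector<std::uint32_t> prefetch_candidates(const std::uint32_t* line, unsigned words)
   {
       std::vector<std::uint32_t> out;
       for (unsigned i = 0; i < words; ++i)
           if ((line[i] & kHeapMask) == (kHeapBase & kHeapMask))   // looks like a heap pointer
               out.push_back(line[i]);                             // candidate block to prefetch
       return out;
   }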

8.5.3 Prefetched Data Cache

Most of the static techniques do not reduce the memory transaction cost but merely hide it by performing the fetch while the data are not yet needed. Therefore, a hardware technique was investigated to see how it affects the hit ratio. Content-based dynamic prefetching was chosen as it seemed the most suitable for accelerating search through linked data. Figure 8.6 shows the simulation result of a data cache with dynamic content-based prediction hardware. A comparison of this result with the result in figure 8.5 shows that it is virtually the same. In fact, the raw numbers show a very slight decrease in hit ratio for the prefetched cache.

The reason that prefetching does not work well for search lies in the nature of search. Prefetching works on the principle of structural locality, where the data element that is linked next is prefetched; this should hide any cache misses. Then, due to temporal locality, the data that are prefetched and retained can be used again later. However, search data are needed for only a brief period of time, and temporal locality is virtually non-existent for search operations. Pointer chasing would not benefit from prefetching unless there is a significant time gap between the pointer fetch and the data use, and in the case of a chase or stream operation there is not. Therefore, any performance advantage that could be gained from prefetching is lost because the data are needed soon after.

Hence, prefetching data into the cache is more suitable for applications that require regular data access, such as signal processing. For unpredictable data access, prefetching has little advantage over a plain cache. In fact, it may even carry a small penalty due to the wasted prefetching overhead.


Figure 8.6: Random heap cache (with prefetch)

8.6 Cache Integration

For all the simulations above, the data set size is in the order of 1.6Mbytes of memory, while the different caches used are between 2kbytes (0.1%) and 64kbytes (4.0%) of the data set size. These sizes were chosen to reflect real-world L1 cache sizes, such as those on AMD processors1. But this extremely small cache ratio means that hardly any part of the data set can be stored in cache. Therefore, the performance does not change significantly even with different cache sizes. The instruction cache performs better because its size is a far larger fraction of the instruction memory. However, increasing the cache size to improve performance is extremely expensive, as static RAM is a very expensive resource on a chip. A block of single-port 64kbit memory

1http://web.archive.org/web/20080123090140/http://www.sandpile.org/impl/k8.htm

takes up 3.44mm² (€1,995) in 0.35µm and 0.21mm² (€230) in 0.13µm, according to recent prices from Europractice2. Therefore, it is important to determine a good cache size that uses minimal resources while still being able to provide some reduction in memory bandwidth. Earlier results show that associativity does not help much, while a longer cache line helps more. Therefore, a direct-mapped cache design can be used for the investigation. Of the different accelerators, the chaser would benefit from a cache if used for multi-key searches on a single data structure, which exhibit temporal locality. A data structure was constructed to fit approximately entirely into the largest cache size. A multi-key search was then performed to search a portion of the data set.

[Left: cache performance with size (multi-key) — memory transactions and speed-up ratio against cache size ratio (%). Right: cache performance with structure (multi-key) — memory transactions for the normal (CacheN) and structural (CacheS) caches and their N/S ratio against cache size ratio (%).]

Figure 8.7: Cache structure comparison

8.6.1 Cache Size Ratio

The cache was inserted between the chaser unit and memory, and the multi-key simulation performed for different cache size ratios. Each simulation was repeated 50 times and the values plotted in a graph. The left sub-figure of Figure 8.7 shows the results of the simulation against the size ratio, from 1% to 100%. The speed-up ratio was measured against the number of memory transactions of a cacheless solution. Looking at the cache performance with size, it is evident that as the cache size increased, there was also an increase in performance. However, increasing the cache size from about 1.5% to 100% only speeds up the search operation by about 50%. From the graph, the optimal cache size would be about 10% of the data structure; beyond this point, the improvement in performance diminishes relative to the increase in cost.

2http://web.archive.org/web/20080120104958/http://www.europractice-ic.com/docs/MPW2008-general-v3.htm

8.6.2 Structural Locality

In order to exploit structural locality, a structural cache was designed. Figure 8.8 shows the basic architecture of a structural cache. The basic concept behind it is a cache that is segmented based on the levels of the tree. This can be likened to a form of enforced pseudo-associativity, with the set chosen based on the level of the tree the data is in. In this design, the LEVEL of the tree is produced by the chaser unit as it branches down the tree and is not obtained from the memory address.

[Cache lookup fields TAG, LEVEL, INDEX, LINE and WORD index a tag RAM and a data RAM; a tag comparison produces the HIT signal.]

Figure 8.8: Structural cache architecture

Data nodes can be scattered randomly across the entire memory space. Therefore, it is very possible that the nodes close to the root of a tree are aliased by the branches of the tree. By associating only certain blocks of the cache with certain levels of the tree, the chance of aliasing is reduced. This should increase the probability of keeping the nodes close to the root of the tree, in cache. A structural cache was designed and simulated using the same parameters as before. As visible in the right sub-figure of Figure 8.7, the structural cache shifts the threshold ratio down, allowing smaller caches to approximate the performance of a larger cache. The use of this structural cache gives an improvement in performance of about 3% at small cache sizes, for no extra cost. However, its significance reduces greatly for larger caches, when almost all the data structure is stored in cache. Therefore, if there are extra resources, a small structural cache can be integrated with a chaser unit. This can give a slight performance boost by reducing the amount of memory transactions needed to perform a multi-key search on a single data structure.
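A minimal sketch of how the structural index might be formed is given below (the field widths are assumptions for illustration, not taken from the design). The LEVEL supplied by the chaser selects a segment of the cache and the low address bits select a block within that segment, so nodes near the root are protected from eviction by nodes deep in the tree.

   #include <cstdint>

   constexpr unsigned kLineBytes = 8;    // 2-word lines, as before
   constexpr unsigned kLevelBits = 3;    // assumed: 8 level segments
   constexpr unsigned kIndexBits = 5;    // assumed: 32 blocks per segment (256 blocks total)

   // Combine the chaser-supplied tree level with the address to form the
   // cache index: LEVEL selects the segment, low address bits the block.
   unsigned structural_index(std::uint32_t addr, unsigned level)
   {
       const unsigned segment = level & ((1u << kLevelBits) - 1);            // deep levels wrap around
       const unsigned block   = (addr / kLineBytes) & ((1u << kIndexBits) - 1);
       return (segment << kIndexBits) | block;
   }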

8.7 Conclusion

It is clear that instruction and data caches exhibit different characteristics and need to be designed and configured separately. The instruction cache for search operations would benefit from having a larger line width, associativity and size. Data caches would only benefit from having a larger line width. However, linked data structures used in search do not benefit significantly from data caching. Data locality does not take into account the actual structure of the data as

each data node is usually scattered throughout the memory space. Furthermore, caches designed to exploit temporal locality will fail when the data or key is different for each search. Spatial locality will help under certain circumstances: neighbouring nodes that need to be traversed should be stored in spatially close locations. This converts structural locality into spatial locality and benefits from increased line widths in a cache structure. However, none of these solutions should be handled in hardware; they are all better handled in software. Unfortunately, most software requires that memory be randomly accessible, which makes it difficult to enforce the necessary changes.

As the results show, for search operations the data cache size does not matter much. So, if there is a need to include any data cache for search operations, a simple and small direct-mapped cache with a larger line width is all that is needed. This cache can benefit the chaser unit slightly, but it will have minimal impact on the streamer unit. As a result, it may actually be better to allocate the limited chip resources to something other than cache.

CHAPTER 9

Search Pipelines

The hardware layer right above the accelerator units is the search pipeline layer. The different hardware units can be combined in different ways to accomplish the task of processing different types of search queries.

9.1 Pipelines

Now that the different accelerator units have been introduced and discussed, it is essential to visualise how the units work in combination to accomplish search. The different categories of search problems were described in Section 2.2. The most primitive form of search is primary search, as described in Section 2.2.1, which forms the most common search operation. Secondary searches were described in section 2.2.2 and cover other kinds of search operations. The different search problems can be accelerated using the different combinations of accelerator units described below.

Search → Retrieval → Collation

Figure 9.1: Search pipeline abstraction

9.1.1 Primary Search

A primary search is a search for primary keys, which is expected to return a single unique result from the whole data set. This means searching for an equivalence key only, as any other search criterion may return more than a single unique result. Such an application can typically be accelerated using a single chaser unit connected to memory.

Whether or not an optional cache is used would depend entirely on the nature of the search and the data set. A primary search would typically only involve the first stage of the pipeline: search. This is the most important stage, as later stages and secondary searches are dependent upon it. An analogy can be drawn with an instruction fetch stage, where the throughput of operation is dependent on the issue rate of instructions. In this case, the search pipeline throughput is dependent on the issue rate of primary key searches.

From equation 7.2, the size of the data set does not adversely affect the completion time for a chase operation. The completion time will only double if the data sets approach a size of 2^43 nodes. Since this is unlikely to be reached, the chase operation can, for most estimations, be considered to complete in O(1) time with a dominant configuration overhead. Assuming that the data set for indices is large, the key issue rate can be estimated to be C′_hw,m(N) = 83.2 ticks per key, or 0.012 keys per tick. From the assumption above, the amount of time for a single search is 390.2/83.2 ≥ 4.68 times this rate. Although it will be possible to reduce the overhead costs associated with key searching, it will not improve the issue rate, which is quickly dominated by the actual key search rather than the overhead. Therefore, in order to amortise the configuration overhead cost, a hardware chaser should be used for searching at least 5 keys on a single data structure before switching over to a different data structure.

For a single primary search, a single chaser can provide a speed-up of C′_sw,s(N)/C′_hw,s(N) ≥ 2.53 for a single key search on sufficiently large data sets. Although this is not much of an acceleration, it is fundamentally all that can be done for this type of search unless there is a fundamental change in the types of memory, data structures and algorithms used. For applications where a data set needs to be regularly searched for keys, permanently assigning a chaser to each data set would be beneficial. This would result in a multi-key acceleration of C′_sw,m(N)/C′_hw,m(N) ≥ 3.43.

These values are obtained without the use of any cache. Figure 8.7 shows the speed-up from using a cache. A small (10%) structural cache memory can increase the performance to 3.43 × 1.3 = 4.46 times, which is more significant. The key issue rate can also be slightly reduced, to 83.2/1.3 = 64.0 ticks per key.

As mentioned in section 2.2.1, the only way to increase this performance further is to assign multiple chasers to perform a number of independent searches on different data sets in parallel, preferably on parallel memory channels. In such a configuration, memory contention becomes a more significant issue, which is why it is important to have multiple memory channels to service the memory requirements of the multiple chasers.
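In software terms, the amortisation rule above amounts to a simple batching policy. The sketch below is illustrative only: hsx::chase() is a hypothetical driver wrapper standing in for the hardware path, and here simply falls back to the software search.

   #include <cstddef>
   #include <set>
   #include <vector>

   // Software tree search, the baseline path.
   inline long soft_find(const std::set<long>& tree, long key)
   {
       auto it = tree.find(key);
       return it != tree.end() ? *it : -1;
   }

   namespace hsx {
       // Hypothetical driver wrapper for a hardware chase; stubbed in software here.
       inline long chase(const std::set<long>& tree, long key) { return soft_find(tree, key); }
   }

   constexpr std::size_t kAmortiseThreshold = 5;   // from 390.2 / 83.2 = 4.68, rounded up

   std::vector<long> lookup(const std::set<long>& tree, const std::vector<long>& keys)
   {
       std::vector<long> out;
       const bool use_chaser = keys.size() >= kAmortiseThreshold;   // worth paying the setup cost?
       for (long k : keys)
           out.push_back(use_chaser ? hsx::chase(tree, k) : soft_find(tree, k));
       return out;
   }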

9.1.2 Simple Query

A simple query is the most primitive secondary search operation. It differs from the primary key search in that it is expected to return one or more results from the whole data set for a single key search. This form of query is accelerated using the most primitive pipeline, with only two stages: search and retrieval. This is implemented using a streamer unit in cascade with a chaser unit. The first part of the search is a primary key search as described above. Once a key is found, it can be mapped to a list of records that match the key. The streamer can then be used to pull the records into the accelerator. As a single stream retrieval operation does not provide any additional acceleration on top of the key search, the basic pipeline only accelerates as much as a single key search on a chaser, as described before.

As before, the nature of the problem means that performance is increased by parallelising multiple searches. The most obvious way to do this is to replicate the chaser–streamer pipeline multiple times and to run independent queries on each. This results in a linear increase in hardware cost with the number of parallel search pipelines. However, there is an alternative way of building a pipeline for simple queries. Multiple streamers can be paired with a single chaser if the chase time is assumed to be much lower than the time taken to retrieve the results stream. Knowing that the key issue rate is about 83.2 ticks per key, the size of the data set that can be streamed during the interim can be estimated. Using expression 5.2, this value is estimated to be C′_hw,m(N)/M′_hw(N) ≥ 3.68.

So, for the first four results returned, the chaser will be kept busy with a second search. But if the streamer returns more than four results, the chaser will be kept blocking until the streamer is free to service it. A simple query is likely to return more than four results per key search. Therefore, it is possible to match multiple streamer units with a single chaser instead of pairing them one to one. For sufficiently large data structures, the optimal number of streamers to service a single chaser can be estimated using:

M′_hw(N1) / C′_hw,s(N2) = 2.5 × N1 / lg N2    (9.1)

It is safe to assume that for most common cases, N2 ≫ N1 ≫ 1 but it is difficult to predict the exact or relative sizes of each data set. Therefore it will be impossible to estimate exactly how many streamers should be paired. However, expression 9.1 can provide an indication of the lower bound.

As mentioned before, a chaser would only be worth using if N2 ≥ 256, which gives a value of lg N2 ≥ 8 for the denominator. When N2 grows, the denominator grows logarithmically. In addition, it takes at least the time of N1 ≥ 4 retrievals to keep the chaser busy, and N1 ≥ 8 is easily reached in most applications; when N1 grows, the numerator grows linearly. Therefore, it is likely that N1 ≫ lg N2, so the assumption N1/lg N2 ≈ 1 is made as a minimum for practical applications.

This is a useful relationship to have, as it can guide the decision of how many streamers and chasers to place in a chip. Although no exact value is obtainable without making prior assumptions about the data, a lower bound of 2.5 streamers per chaser can be inferred from the relationship.
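As a sizing sketch (illustrative only), expression 9.1 can be turned into a simple rule of thumb, with the N1/lg N2 ≈ 1 minimum applied as a floor:

   #include <algorithm>
   #include <cmath>

   // Estimate how many streamers one chaser can keep busy, given the expected
   // number of results per key (n1) and the index size (n2).
   double streamers_per_chaser(double n1, double n2)
   {
       const double estimate = 2.5 * n1 / std::log2(n2);   // expression 9.1
       return std::max(estimate, 2.5);                     // lower bound of 2.5 from the text
   }
   // e.g. streamers_per_chaser(8.0, 256.0) == 2.5 * 8 / 8 == 2.5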

9.1.3 Range Query

A range query is similar to a simple query. There are a number of configurations that can be used to accelerate this form of search, depending on how the range is bound: the range can be bound at both ends or only at one end. For a range that is bound at both ends, there are two methods that can be used, depending on the size of the range.

A pure hardware method can be used to turn a range query into a multi-key simple query for all the values within the range. Each result can then be sent off to a streamer to be retrieved from memory. One or more sieve units can be used in union mode to combine the results into a final results list. A hybrid method can also be used, where a chaser unit is used to chase down either the lower or upper bound node. Once this value is retrieved, a software algorithm can be used to traverse the tree, retrieving the other nodes within the range. These nodes can then be fed to the streamer in a conventional way and the results combined in hardware using a sieve unit.

Which method is chosen would depend entirely on the size of the data structure and the range of items to chase. Assuming that the data structure is significantly large and the range (R) fairly large, at about 10% of the data structure size, equations 7.1 and 7.4 can be used to compare the options. Equation 7.1 needs to be modified slightly, with lg N ⇒ R, to estimate the amount of time needed to traverse a tree, by observing that lg N actually symbolises the number of nodes visited when going down a tree.

C_sw,s(R) / C_hw,m(R) = (22.9R + 280.6) / (83.2R + 318.2) = 0.275, R ≫ 1    (9.2)

Equation 9.2 shows that regardless of the size of the range, the hybrid software method will be faster than a multi-key hardware method. Although this can be improved by shifting the root of the tree once one bound is found, the multi-key hardware method would still be slower than the software traversal. Therefore, the only advantage that the hardware method has over the hybrid method is offloading and not speed-up, unless a method can be found to allow hardware traversal of trees. A slight variation of this hybrid method can also be used for a single-bound range,

which traverses all the nodes on one branch of the tree. This exploits the fact that, for any node in a tree, all the nodes on the right branch are greater while the nodes on the left branch are smaller. So, a chaser unit can be used to chase down the lower or upper bound; when this is found, all the nodes to one side of the branch can be retrieved.

A hardware method of tree traversal is to use a streamer configured such that its DATA and NEXT registers are set to only one branch. This forces the streamer to perform a depth-first traversal, traversing one branch and returning a list of pointers down that branch. To estimate the effectiveness of this solution, equation 5.2 can be used.

C_sw,s(R) / M_hw(R) = (22.9R + 280.6) / (22.6R + 241.7) = 1.013, R ≫ 1    (9.3)

Equation 9.3 tells us that there will not be any significant difference between a hardware stackless tree traversal and a software method. If the software is modified slightly to build stackless trees, this is all that is needed to retrieve an entire branch of a tree. Otherwise, a hybrid method can be used, where the software is used to configure and set off multiple streamers down different branches, or to extract the nodes entirely in software, which will do no worse.
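The software half of the hybrid range query can be sketched with the same STL container used in the simulation kernels (an illustration, not thesis code): once the lower bound is located, the tree is walked in order up to the upper bound and each hit is handed on for retrieval.

   #include <set>
   #include <vector>

   // Collect all keys in [lo, hi]; lower_bound() stands in for the chase to the
   // lower-bound node, and the in-order walk is the software traversal.
   std::vector<int> range_query(const std::set<int>& tree, int lo, int hi)
   {
       std::vector<int> hits;
       for (auto it = tree.lower_bound(lo); it != tree.end() && *it <= hi; ++it)
           hits.push_back(*it);   // each hit would be fed to a streamer for retrieval
       return hits;
   }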

9.1.4 Boolean Query

A boolean query may involve all the different forms of queries above. It involves an additional layer of result collation after any number of simple and range queries. Two or more result streams can be collated with one or more sieve units. In the most primitive applications, a sieve unit can be connected in cascade with two streamer units. As measured earlier, this configuration alone will give a 5.2 times speed-up. At this point, it may seem logical that combining this with chasers would provide additional acceleration. Unfortunately, this is not true, because the bottleneck in such a pipeline is the streamer.

However, there is a situation where a sieve unit can provide additional acceleration. For more complex collation operations, multiple sieves can be combined, in parallel and in cascade, in a logical manner. Depending on the complexity of the operation, the speed-ups obtained can be significantly more than 5.2 times. It is also potentially possible to design a slightly more advanced sieve, which collates more than two streams at a go. The only reason this was not done was to keep to primitive structures for simplicity. While such a sieve would consume similar resources to multiple cascaded sieves, it would reduce the number of stages in the search pipeline and increase throughput for complex collations.
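The collation step itself is easy to model in software (an illustration of what a sieve does, not the hardware design): two sorted result streams are merged in either intersect (AND) or union (OR) mode.

   #include <algorithm>
   #include <iterator>
   #include <vector>

   // Combine two sorted result streams, as a sieve unit would in hardware.
   std::vector<int> collate(const std::vector<int>& a, const std::vector<int>& b, bool intersect)
   {
       std::vector<int> out;
       if (intersect)
           std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
       else
           std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
       return out;
   }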

9.2 System Pipelining

With regard to throughput, the accelerator units can be combined in different ways, which are described in the next chapter. Certain combinations of accelerator units can then be constructed to address the different pipeline types expressed in this chapter. In a situation where the pipelines are all single-staged without overlapping, the amount of acceleration would be similar to the values quoted in earlier sections. If multiple pipelines are used in parallel, throughput can be increased until the bandwidth limit is reached.

However, just like any other pipeline, the different stages could also be overlapped to increase throughput. In such a situation, each accelerator unit has an output buffer treated as a pipeline buffer. The host processor will then be in charge of synchronisation and of reorganising the sequencing of data where necessary. While this has not been explored specifically, a potentially larger total acceleration could be achieved with a modest number of hardware units.

CHAPTER 10

Implementation

The accelerator units can be combined either in a dynamic or a static fashion, and they can be used as a bridge, a co-processor or even an I/O device. The accelerators have primarily been designed for FPGA implementation; however, some potential ASIC implementations (0.35µm and 0.18µm) are also explored.

10.1 Fabric Architectures

Although it is important to figure out the number of accelerator units that can be implemented, there is also the question of how the units will be interconnected. Figure 10.1 shows some possible interconnection architectures for the accelerator itself. These figures assume that there is a host interface and memory interface to each side. The main difference in these two implementations is how the pipeline is structured, either statically or dynamically. The choice of which to use would depend on the type of search operations that need to be accelerated and the physical resources available.

10.1.1 Dynamic Fabric

In the dynamic fabric, the search pipeline is dynamically structured in hardware. Depending on the type of search being accelerated, the necessary accelerator units can be allocated and linked together. The data path through this dynamic pipeline is mainly controlled by configurable routers in hardware. These routers can be simple switch fabrics that connect the outputs from one stage to the inputs of another and can be adapted from existing switch architectures and also network-on-chip architectures.


(A) Dynamic Fabric (B) Static Fabric

Figure 10.1: Implementation architectures

There is also the possibility of using the sieve unit as a form of router. Each sieve unit can be configured to perform a swap as one of its operation modes. If alternated in cascade, it is possible to build a swapping network to route signals around. It would also be trivial to couple the swap operation with the intersect or union modes to perform routing and result collation at the same time. In this case, the cost of routing can be amortised as part of the search pipeline operation.

The advantage of such a configuration is the flexibility of the pipeline and the types of problems it can solve. However, this flexibility comes at the expense of using multiple hardware router units. This increases both the hardware implementation cost and the configuration overhead, as the pipelines now need to be configured on top of the accelerator units. Therefore, this would only be useful for performing well-defined search operations on extremely large databases, such as searching a DNA database.

10.1.2 Static Fabric

For other common operations, such flexibility is expensive and may be excessive. Most common searches would involve one of the common forms, such as the example query. Therefore, it is also possible to accelerate most common search problems without needing any dynamic configurability. The pipelines can be statically configured and any additional complexity absorbed in the software. In this case, the types of pipelines laid out may need to be considered.

There is no reason why a static fabric should be composed of only one type of pipeline accelerating one type of query. It would be better to mix different types of pipelines in the accelerator and allow the software to select the pipeline to be used for acceleration. The configuration shown in the figure can be used to accelerate a fairly complex boolean query with four criteria, or four individual simple queries, or two simple queries and one boolean query. This is by no means the only pipeline configuration possible.

Although hardware routers are not used, software routing could be used as a complementary

method. Software routing could be used to link multiple serial queries, to link parallel parts of a single query, or to link complex queries that would not otherwise be accelerated. In the case of figure 10.1, the last sieve stage may be removed and replaced with a pure software sieve, or a software-pumped sieve. Software can also play the role of a router and move results between the pipelines and between stages. The only disadvantage of software routing is the slow-down factor, which is not a problem if the results are produced at a slower rate by an earlier pipeline stage, as is the case for certain accelerator units like the sieve or chaser.

10.2 Integration Architectures

There is the question of how the entire accelerator will fit into a standard computer architecture. Figure 10.2 shows some possible integration architectures. The only requirement for the accelerator is to have access to the main memory pool and the host processor. Therefore, it is possible to integrate the accelerator in any number of ways. It can be tightly integrated as a bridge or co-processor device, or loosely integrated as an external I/O device. Each method has its advantages and disadvantages. In either case, the accelerator can be an on-chip or off-chip device. An on-chip accelerator, like an FPU or PadLock device, will be very tightly integrated with the host processor and can share its main I/O ports. An off-chip accelerator will be easy to integrate and can be dropped into existing systems with only slight modifications, but will need to replicate many resources of the host processor.

[(A) Bridge: the accelerator (HSX) sits between the CPU and memory. (B) Co-processor: the HSX sits alongside the CPU, sharing the memory bus. (C) I/O device: the HSX hangs off the CPU–memory system as a peripheral.]

Figure 10.2: System level implementation

10.2.1 Tight Coupling

In a bridge mode, the accelerator is placed between the host processor and main memory. To the host processor, the accelerator should behave like a memory device while to the memory, it behaves like a host processor. When used in this mode, the accelerator is closer to the main memory pool than the host processor, which allows the accelerator to intercept the host’s access to memory. This allows the accelerator to regulate the

memory bandwidth used by the processor and block the processor when necessary. The advantage of this arrangement is that all available memory bandwidth can be consumed by the accelerator for search operations. The disadvantage is that the host processor's access to main memory will suffer from higher latency.

In a co-processor mode, the accelerator has the same priority as the host processor and is placed next to the main memory. When used in this mode, the accelerator is equally far from the main memory pool as the host. The accelerator communicates with the host processor via a dedicated accelerator bus. However, as it does not have direct control over the memory bus, it will be forced to share it with the main processor. This arrangement reduces the latency for the host processor compared with the bridge mode. Although it may still allow the accelerator to consume a large amount of memory bandwidth as necessary, the disadvantage is that the accelerator is not in control of the amount of memory bandwidth it uses.

In both these closely coupled configurations, the accelerator is platform specific. In order to sit comfortably with the host processor, it needs to conform to the appropriate bus standards used to communicate with the host processor, which typically differ between microprocessor vendors and are often proprietary. So, it further trades off platform flexibility for faster access to the main memory pool.

10.2.2 Loose Coupling

In an I/O device mode, the accelerator sits alongside other I/O devices on an I/O bus and behaves no differently from any other I/O device. Each individual accelerator unit can be mapped to an external memory or I/O space and accessed via the memory bus or a dedicated I/O bus. In this mode, the accelerator will need to include a host interface block, which could be chosen from any number of standard protocols, such as PCIe. The advantage of loose coupling is its simplicity and, unlike the tightly coupled devices, it can be totally platform agnostic and used on any platform.

However, this comes at the cost of reduced memory bandwidth. Any available bandwidth would need to be shared between the accelerator and other I/O devices. Access to main memory will also be affected, as it will have to go through the main bus. Therefore, this will have the slowest performance of the different coupling mechanisms.
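For the loosely coupled case, software would drive each unit through memory-mapped registers, much like the hsx:: pointers in the earlier kernels. The sketch below is illustrative only: the register addresses and layout are assumptions, not the thesis memory map.

   #include <cstdint>

   namespace accel {
       // Hypothetical register block of one chaser unit, mapped into I/O space.
       volatile std::uint32_t* const ROOT   = reinterpret_cast<volatile std::uint32_t*>(0xA0000000u);
       volatile std::uint32_t* const KEY    = reinterpret_cast<volatile std::uint32_t*>(0xA0000004u);
       volatile std::uint32_t* const GO     = reinterpret_cast<volatile std::uint32_t*>(0xA0000008u);
       volatile std::uint32_t* const STATUS = reinterpret_cast<volatile std::uint32_t*>(0xA000000Cu);
       volatile std::uint32_t* const RESULT = reinterpret_cast<volatile std::uint32_t*>(0xA0000010u);
   }

   // Configure the unit, start a chase and poll for completion.
   std::uint32_t chase_key(std::uint32_t root, std::uint32_t key)
   {
       *accel::ROOT = root;
       *accel::KEY  = key;
       *accel::GO   = 1;
       while ((*accel::STATUS & 1u) == 0) { /* poll until the done bit is set */ }
       return *accel::RESULT;   // pointer to the matching node, or 0 if not found
   }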

10.3 FPGA Implementation

Owing to developments from mainstream computer and FPGA vendors, the accelerator units were designed with a potential FPGA implementation in mind. The AMD

Torrenza1 and Intel QuickAssist2 programmes are two platform initiatives designed to open up standard computer systems. They both provide varying levels of support for the integration of specialised co-processors into their previously closed systems. Major FPGA vendors have embraced this development by developing custom products for it. Both Xilinx and Altera have specialised products that can plug into a co-processor socket on a suitable motherboard. These products directly connect the FPGA to the host CPU and give it access to the socket's DDR memory slots. They are designed to work alongside the host processor.

For the purpose of a sample implementation, a low-cost Xilinx Spartan3A FPGA is used as a test platform. All the implementation results quoted below are based on the Spartan3A FPGA device, which is built on 90nm technology. There are other FPGA families that can give better results, in terms of area, power and speed, than the Spartan3A; the Spartan3A was therefore chosen as a baseline representation of the worst-case performance available today.

10.3.1 Chaser Implementation

Report 10.1 shows relevant performance figures extracted from the implementation reports for a chaser. It shows that the chaser unit is capable of running at 100MHz on an FPGA, with the resource and power consumption for a single chaser unit scored as:

$C_{lut} = 515$

$C_{pow} = 9$

10.3.2 Streamer Implementation

Report 10.2 shows relevant performance figures extracted from the implementation reports for a streamer. The report shows that the streamer unit is capable of running at 100MHz on a Spartan3A. Furthermore, it shows that the resource and power consumption for a single streamer unit running at 100MHz can be scored as:

$M_{lut} = 347$

$M_{pow} = 4$

1http://web.archive.org/web/20071215011032/http://enterprise.amd.com/us-en/AMD-Business/Technology-Home/Torrenza.aspx
2http://www.intel.com/technology/platforms/quickassist

Report 10.1 Chaser FPGA implementation results (excerpt)

Design Summary
--------------
Number of errors:       0
Number of warnings:    63
Logic Utilization:
  Number of Slice Flip Flops:           359 out of  11,776    3%
  Number of 4 input LUTs:               514 out of  11,776    4%
Logic Distribution:
  Number of occupied Slices:            387 out of   5,888    6%
  Number of Slices containing only related logic:   387 out of   387  100%
  Number of Slices containing unrelated logic:        0 out of   387    0%
    *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:         515 out of  11,776    4%
    Number used as logic:               388
    Number used as a route-thru:          1
    Number used for Dual Port RAMs:     126
      (Two LUTs used per Dual Port RAM)
  Number of bonded IOBs:                210 out of     372   56%
    IOB Flip Flops:                      67
  Number of BUFGMUXs:                     1 out of      24    4%

Power summary                     | I(mA) | P(mW) |
---------------------------------------------------
Total estimated power consumption |       |    85 |
---
Total Vccint  1.20V               |    40 |    49 |
Total Vccaux  2.50V               |    14 |    35 |
Total Vcco25  2.50V               |     1 |     1 |
---
Clocks                            |     8 |     9 |
Inputs                            |     0 |     0 |
Logic                             |     7 |     9 |
Outputs                           |       |       |
  Vcco25                          |     0 |     1 |
Signals                           |     0 |     0 |
---
Quiescent Vccint 1.20V            |    26 |    31 |
Quiescent Vccaux 2.50V            |    14 |    35 |
Quiescent Vcco25 2.50V            |     0 |     1 |

Timing summary:
---------------
Timing errors: 0  Score: 0
Constraints cover 26762 paths, 0 nets, and 2247 connections
Design statistics:
  Minimum period: 9.607ns{1} (Maximum frequency: 104.091MHz)

Report 10.2 Streamer FPGA implementation results (excerpt)

Design Summary
--------------
Number of errors:       0
Number of warnings:     2
Logic Utilization:
  Number of Slice Flip Flops:           225 out of  11,776    1%
  Number of 4 input LUTs:               347 out of  11,776    2%
Logic Distribution:
  Number of occupied Slices:            255 out of   5,888    4%
  Number of Slices containing only related logic:   255 out of   255  100%
  Number of Slices containing unrelated logic:        0 out of   255    0%
    *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:         347 out of  11,776    2%
    Number used as logic:               283
    Number used for Dual Port RAMs:      64
      (Two LUTs used per Dual Port RAM)
  Number of bonded IOBs:                210 out of     372   56%
    IOB Flip Flops:                      97
  Number of BUFGMUXs:                     1 out of      24    4%

Power summary                     | I(mA) | P(mW) |
---------------------------------------------------
Total estimated power consumption |       |    71 |
---
Total Vccint  1.20V               |    29 |    35 |
Total Vccaux  2.50V               |    14 |    35 |
Total Vcco25  2.50V               |     0 |     1 |
---
Clocks                            |     0 |     0 |
Inputs                            |     0 |     0 |
Logic                             |     4 |     4 |
Outputs                           |       |       |
  Vcco25                          |     0 |     0 |
Signals                           |     0 |     0 |
---
Quiescent Vccint 1.20V            |    26 |    31 |
Quiescent Vccaux 2.50V            |    14 |    35 |
Quiescent Vcco25 2.50V            |     0 |     1 |

Timing summary:
---------------
Timing errors: 0  Score: 0
Constraints cover 5024 paths, 0 nets, and 1769 connections
Design statistics:
  Minimum period: 7.427ns{1} (Maximum frequency: 134.644MHz)

10.3.3 Sieve Implementation

Report 10.3 shows the relevant numbers from the implementation report. It shows that the sieve unit is capable of running at 100MHz on an FPGA and shows the resource and power consumption for a single sieve unit running at that speed. They can each be scored as:

$V_{lut} = 634$

$V_{pow} = 17$

Report 10.3 Sieve FPGA implementation results (excerpt)

Design Summary
--------------
Number of errors:       0
Number of warnings:     2
Logic Utilization:
  Number of Slice Flip Flops:           118 out of  11,776    1%
  Number of 4 input LUTs:               629 out of  11,776    5%
Logic Distribution:
  Number of occupied Slices:            332 out of   5,888    5%
  Number of Slices containing only related logic:   332 out of   332  100%
  Number of Slices containing unrelated logic:        0 out of   332    0%
    *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:         634 out of  11,776    5%
    Number used as logic:               373
    Number used as a route-thru:          5
    Number used for Dual Port RAMs:     256
      (Two LUTs used per Dual Port RAM)
  Number of bonded IOBs:                211 out of     372   56%
    IOB Flip Flops:                      37
  Number of BUFGMUXs:                     1 out of      24    4%

Power summary                     | I(mA) | P(mW) |
---------------------------------------------------
Total estimated power consumption |       |    95 |
---
Total Vccint  1.20V               |    45 |    54 |
Total Vccaux  2.50V               |    14 |    35 |
Total Vcco25  2.50V               |     2 |     6 |
---
Clocks                            |     5 |     5 |
Inputs                            |     0 |     0 |
Logic                             |    15 |    17 |
Outputs                           |       |       |
  Vcco25                          |     2 |     5 |
Signals                           |     0 |     0 |
---
Quiescent Vccint 1.20V            |    26 |    31 |
Quiescent Vccaux 2.50V            |    14 |    35 |
Quiescent Vcco25 2.50V            |     0 |     1 |

Timing summary:
---------------
Timing errors: 0  Score: 0
Constraints cover 13855 paths, 0 nets, and 2648 connections
Design statistics:
  Minimum period: 9.365ns{1} (Maximum frequency: 106.781MHz)

The retail price3 of a Spartan3A FPGA ranges from $5.75 for 1408 LUTs ($0.004 per LUT) to $35.60 for 22,528 LUTs ($0.0016 per LUT). For the price estimates, a mean value of $0.0028 (£0.0017±0.0003) per LUT is used.

3prices from NuHorizons online store

10.3.4 Resource & Power

From the reports above, an expression for the resource consumption and power dissipation of the hardware accelerator can be developed.

$Q_{lut} = 515 N_C + 357 N_M + 634 N_V + 126$ (10.1)

$Q_{pow} = 9 N_C + 4 N_M + 17 N_V + 75$ (10.2)

Expression 10.1 calculates the total resource consumption of the different units. The constant factor of 126 is used by the accelerator bus arbitration device, which becomes insignificant compared to the rest once the number of accelerator units starts to increase. Expression 10.2 estimates the power consumption of the different units, in mW. Furthermore, these figures only apply to the design as implemented on a Spartan3A FPGA. If a different FPGA family is used, the values will be different. Despite this, the expressions have some value as they give an idea of the relative resource and power consumption of each accelerator unit. They are useful when estimating the number of accelerator units that can be placed in an FPGA under tight resource constraints.
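As an illustration only, expressions 10.1 and 10.2 can be evaluated with a few lines of C++ to size a candidate mix of units. The struct and function names below are invented for this sketch and are not part of the prototype code; for the Configuration A mix of section 11.4.1 it reproduces the 4,882 LUT figure quoted there.

#include <cstdio>

// Resource (LUT) and power (mW) estimates for a mix of accelerator units on
// the Spartan3A, following expressions 10.1 and 10.2.  The per-unit
// coefficients and the fixed overheads (126 LUTs for the bus arbiter, 75mW
// quiescent) are taken from the implementation reports above.
struct AcceleratorMix {
    int chasers;    // N_C
    int streamers;  // N_M
    int sieves;     // N_V
};

int estimateLuts(const AcceleratorMix& m) {
    return 515 * m.chasers + 357 * m.streamers + 634 * m.sieves + 126;
}

int estimatePowerMilliwatts(const AcceleratorMix& m) {
    return 9 * m.chasers + 4 * m.streamers + 17 * m.sieves + 75;
}

int main() {
    AcceleratorMix mix = {4, 4, 2};  // the Configuration A mix of section 11.4.1
    std::printf("Q_lut = %d LUTs, Q_pow = %d mW\n",
                estimateLuts(mix), estimatePowerMilliwatts(mix));
    return 0;
}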

10.3.5 Physical Limits

From the chip resource requirements, the absolute maximum limit for each type of unit can be calculated. The largest Spartan3 FPGA available has 22,528 LUT units[Xil08], which works out to 51 chaser, 63 streamer and 35 sieve units. These numbers indicate that there are more than enough resources on a low cost FPGA to hold the accelerator units. With such high numbers of accelerators, the main limitation in the FPGA implementation is memory bandwidth. The sieve unit does not consume any external memory bandwidth, so it is purely resource-limited, but the streamer and chaser units both consume significant amounts of memory bandwidth. Assuming the maximum unit numbers above, the amount of memory bandwidth required would be 65.3Gbps and 134.2Gbps for the chasers and streamers running at 100MHz. Assuming that the system is connected to the fastest standard DDR2-1066 memory[JED07] available, the maximum memory bandwidth available to the system is only 68.2Gbps. Therefore, this memory bandwidth ultimately limits the numbers to $N_M \le 32$ streamers or $N_C \le 53$ chasers.
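Spelling out the arithmetic behind these limits, using the per-unit consumption at 100MHz implied above (1.28Gbps per chaser and 2.13Gbps per streamer, the same figures used in section 11.4):

$$N_M \le \left\lfloor \frac{68.2\,\mathrm{Gbps}}{2.13\,\mathrm{Gbps}} \right\rfloor = 32, \qquad N_C \le \left\lfloor \frac{68.2\,\mathrm{Gbps}}{1.28\,\mathrm{Gbps}} \right\rfloor = 53$$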

10.4 ASIC Implementation

Although the hardware accelerator was not fabricated as an ASIC, the accelerator units were synthesised for a standard cell ASIC implementation process for estimation purposes. The sample technologies chosen were AMS 0.35µm and UMC 0.13µm technologies.

[Figure 10.3: ASIC area and power estimates. The four panels plot area (mm²) and power (mW) against core speed (MHz) for the chaser, streamer and sieve units, in AMS 0.35µm and UMC 0.13µm technologies.]

Although 0.35µm technology is no longer used for new designs, it is useful to obtain some results for comparison purposes. The chosen 0.13µm technology is fairly recent and the best possible timings are obtained for this technology. It is also the more expensive fabrication technology. The retail fabrication prices4 for 0.35µm and 0.13µm are €720 and €1168 (£572 and £928) per mm² of area, with minimum die size requirements of 10mm² and 25mm² (0.35µm and 0.13µm respectively). The minimum quantities of chips obtained for these prices are 30 (0.35µm) and 45 (0.13µm) units. Therefore, the cost per mm² per chip is actually £19.06 and £20.62 respectively. All the estimates were obtained using the typical case library only. This is useful for getting a ballpark figure of merit for each accelerator unit, but for actual fabrication purposes, a comprehensive programme of testing is necessary using the best and worst case libraries to ensure that the accelerators work correctly. Although the area-speed graphs are drawn using straight lines, actual area-speed curves are typically non-linear. This non-linear trend can be observed in the distribution of the points in the graphs. However, a linear extrapolation is good enough to provide an area estimate within a small margin of error. Although dynamic and static power values are also available, the dynamic power estimates are used as the measure. For regular applications, the dynamic power is more

4prices from Europractice

important as it is the main source of power dissipation. The static power values are only useful for mobile and embedded battery powered applications. In either case, it is ultimately dependent upon fabrication technology. For the purpose of estimation and comparison, an estimated power figure is sufficient. Figure 10.3 shows the area and power estimates for each accelerator unit at different operating frequencies. In each graph, the highest speed plotted is the one where the operation completed with successful timing closure. These area estimates are obtained without integrated cache memories. For all cases, the estimates only apply to the core units themselves as no pad cells were used. Whether or not pad cells are used in actual implementations will depend on how the accelerator units are integrated into the host system. These integration architectures were discussed in an earlier section.

10.4.1 Area Estimates

Area size directly translates to cost as a function of fabrication process and yield. Although some care was taken during the design process to make design choices that consume fewer resources, the accelerators were designed to run at a fast speed and there is still room for some improvement in terms of area by trading off raw speed. However, this depends both on the hardware technology chosen for implementation and on the final application of the accelerator. So, the present designs are kept generic, to allow final application customisation only when necessary. Linearly extrapolating each line in the graph gives the following expressions for each unit (f in MHz):

$C_{035}(f) = 169.7 f + 527545\ \mu m^2$

$M_{035}(f) = 197.7 f + 464857\ \mu m^2$

$V_{035}(f) = 37.3 f + 593615\ \mu m^2$

$C_{013}(f) = 2.57 f + 23769\ \mu m^2$

$M_{013}(f) = 2.58 f + 14705\ \mu m^2$

$V_{013}(f) = 1.27 f + 25893\ \mu m^2$

These expressions show that there is a different rate of change for each accelerator size with respect to speed, but the effect of the speed on the area size will only become significant for very large numbers of accelerator units (in the order of thousands). However, the changes in area size of the chaser and streamer are similar in each fabrication technology, while both are significantly different from the sieve. This should be

taken into account when adding multiple chasers and streamers into a final application under tight area constraints.
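As a quick consistency check on the linear fits, evaluating the 0.35µm area expressions at the 333MHz operating point reproduces the figures later listed in Table 10.1:

$$C_{035}(333) = 169.7 \times 333 + 527545 \approx 0.584\,\mathrm{mm^2}, \quad M_{035}(333) \approx 0.531\,\mathrm{mm^2}, \quad V_{035}(333) \approx 0.606\,\mathrm{mm^2}$$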

10.4.2 Power Estimates

Both dynamic and static power dissipation are very closely linked to the fabrication technology used. Therefore, power optimisation is mainly a fabrication issue, rather than an architecture issue. Of course, some minor steps can be taken to reduce power consumption from within an architecture. For example, instead of using a 4-bit adder, the FIFO counters were implemented as an LFSR with a single XOR gate. And, instead of having multiple adders to calculate the pointers and offsets in the chaser and streamer, a single adder was shared through multiplexing. All these steps are designed to reduce the amount of resource consumption and, hence, the power dissipated. However, this does not discount the fact that power is a process issue, more than an architectural one. Extrapolating the lines in the graphs linearly gives the following expressions for each unit (f in MHz):

$C_{035}(f) = 337.8 f + 18736.8\ \mu W$

$M_{035}(f) = 305.0 f + 13612.9\ \mu W$

$V_{035}(f) = 250.6 f + 11411.4\ \mu W$

$C_{013}(f) = 11.09 f + 551.9\ \mu W$

$M_{013}(f) = 7.02 f + 413.7\ \mu W$

$V_{013}(f) = 11.02 f + 475.3\ \mu W$
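The same consistency check applies to the power fits; at 333MHz the 0.35µm expressions reproduce the dynamic power figures of Table 10.1:

$$C_{035}(333) = 337.8 \times 333 + 18736.8 \approx 131.2\,\mathrm{mW}, \quad M_{035}(333) \approx 115.2\,\mathrm{mW}, \quad V_{035}(333) \approx 94.9\,\mathrm{mW}$$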

10.4.3 Speed Estimates

Looking at the graphs, the speed limits are about 333MHz (0.35µm) and 1.0GHz (0.13µm) respectively. For regular applications, the fastest DDR2-1066 memory has a memory bandwidth of 68.2Gbps at 533MHz. On 0.35µm technology, the memory speed is still significantly higher than the accelerators. With the assumption that the memory runs at 533MHz while the core runs at 333MHz, the maximum theoretical bandwidth consumed is 4.3Gbps for the chaser and

7.1Gbps for the streamer. This gives a maximum limit of $N_C \le 15$ chasers and $N_M \le 9$ streamers per chip, assuming an unlimited area budget. On 0.13µm, the accelerators run faster than the memory. Therefore, the memory bandwidth becomes a serious bottleneck for high speed devices, as is the case for general purpose processors. The maximum theoretical bandwidth consumed at 1GHz is 12.8Gbps for a chaser and 21.3Gbps for a streamer. At these speeds, the maximum numbers of units are $N_C \le 5$ chasers and $N_M \le 3$ streamers.

Estimate        0.35µm, f = 333MHz            0.13µm, f = 1GHz
                C035     M035     V035        C013     M013     V013
Area (mm²)      0.584    0.531    0.606       0.026    0.017    0.027
Power (mW)      131.2    115.2    94.9        11.6     7.4      11.5

Table 10.1: ASIC area and power estimates at speed

10.5 Cost Estimates

For cost estimation, the frequencies are fixed at 250MHz and 667MHz (0.35µm and 0.13µm), which is the median overlap speed for each technology. From the expressions above, it is clear that the price for 0.35µm is very much higher than for an equivalent unit in 0.13µm, even after taking the fabrication cost difference into account. This difference is mainly due to the blocks of memory used in each unit, which depend on the RAM cells of the selected technology library. The 0.13µm technology library used has a compact design for the standard cells and memory.

          Chaser         Streamer       Sieve
0.35µm    £10.87         £9.81          £11.50
0.13µm    £0.53          £0.34          £0.55
FPGA      £0.88±0.15     £0.59±0.10     £1.08±0.19

Table 10.2: Fabrication cost per accelerator unit

The numbers in Table 10.2 are for small production runs and will be lower if the chip is mass produced. Although it may seem more cost effective to implement the design in 0.13µm than in an FPGA, this number does not take into account the minimum die area. As the designs have to interface directly with memory, the chip will be a pad-limited design and will definitely cost much more than this table seems to indicate. In addition, it does not take packaging costs into consideration either. Therefore, it is ultimately more cost effective to implement the design in an FPGA for custom applications.
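For reference only, the per-unit figures in Table 10.2 can be reproduced by combining the linear area fits of section 10.4.1 with the per-mm² per-chip prices derived above. The sketch below uses invented helper names; it recovers the two ASIC rows of the table at the 250MHz and 667MHz operating points.

#include <cstdio>

// Approximate per-unit fabrication cost, reproducing the ASIC rows of
// Table 10.2.  The area expressions are the linear fits of section 10.4.1
// (area in um^2 as a function of core frequency in MHz) and the prices are
// the Europractice figures quoted in section 10.4.
static double areaMm2(double slope, double intercept, double fMhz) {
    return (slope * fMhz + intercept) * 1e-6;  // um^2 -> mm^2
}

int main() {
    const double price035 = 572.0 / 30.0;     // GBP per mm^2 per chip (30-chip minimum)
    const double price013 = 928.0 / 45.0;     // GBP per mm^2 per chip (45-chip minimum)
    const double f035 = 250.0, f013 = 667.0;  // MHz, as fixed in section 10.5

    std::printf("0.35um: chaser %.2f, streamer %.2f, sieve %.2f (GBP)\n",
                areaMm2(169.7, 527545, f035) * price035,
                areaMm2(197.7, 464857, f035) * price035,
                areaMm2(37.3, 593615, f035) * price035);
    std::printf("0.13um: chaser %.2f, streamer %.2f, sieve %.2f (GBP)\n",
                areaMm2(2.57, 23769, f013) * price013,
                areaMm2(2.58, 14705, f013) * price013,
                areaMm2(1.27, 25893, f013) * price013);
    return 0;
}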

10.6 Conclusion

The FPGA implementation is cheap but not very fast. However, having many hardware accelerators running in parallel can make up for any lack in speed, as long as the FPGA is paired with suitably fast memory. Going to an ASIC implementation allows the clock speeds to breach the 1GHz mark. However, the memory bottleneck then becomes a very serious issue. From the various implementations, it is evident that there are physical limitations on how many accelerator units can be included in a chip. Although chip area and power consumption are both important factors, the overriding physical limit in each case is the memory bandwidth. However, there is still a question of how the limited bandwidth can best be utilised. It can potentially be used to run 51 parallel chasers only, or be spread out evenly between the other units. Therefore, it is important to work out how best to utilise this limited resource. Due to the fundamental nature of the accelerator units, there are a number of ways in which they can be assembled. The units can be routed dynamically or statically, via hardware or software. The accelerator can also be integrated at different distances between the host processor and the memory pool. There are advantages and disadvantages to the different configurations. However, these all depend on the end user application and do not fundamentally change the nature of this research.

CHAPTER 11

Analysis & Synthesis

Some questions need to be answered with regard to the results obtained so far, how the solutions presented can scale, and the estimated cost of implementation. Although the solution accomplished the job of accelerating search, there are other possible avenues that improve search at other layers. There is also room for improvement with regard to the actual design of the accelerator units. Suggestions are made about how these may be explored in future work.

11.1 Important Questions

Now that the accelerator units have all been presented, some very pertinent questions need to be answered. Firstly, it needs to be checked that the accelerator units actually accomplish an acceleration. Secondly, potential bottlenecks that affect scalability should be identified. Thirdly, the potential acceleration cost needs to be estimated.

11.2 Host Processor Performance

The first question is whether any actual acceleration takes place. All the comparisons have been made between the accelerator performance and the software performance of the host processor. The biggest assumption made thus far, for obtaining all the speed-up values, is that the host processor is running optimally. If the host processor performance were degraded due to sub-optimal software, it could affect the results. In addition, a

different host processor architecture may also affect the actual results. Both these issues need to be taken into account for consistency in the results.

11.2.1 Software Optimisation

The bulk of the code uses data structures and algorithms from the standard C++ STL library. As mentioned before, one of the reasons that this library was chosen is because it contains time-tested, optimised and mature code. It is unlikely that any custom code would be significantly better than the STL code. Next, the choice of compiler can also affect the quality of the code produced. All the code was compiled with GCC4, which is the latest generation of this popular optimising compiler. Once again, the compiler used is time-tested, optimised and mature[GS04]. Therefore, it is unlikely that any hand-written assembly would perform significantly better. Next, the choice of optimisation level will also affect the performance of the code produced. All the code was compiled using -O2 optimisations. This optimisation level was chosen as it reflects the most common optimisation level used in user software. According to the GCC documentation, it presents a balance of both size and performance and is the best choice for deployment of a program[GS04]. It is arguable that -O3 optimisation may produce more optimised code, but only at the expense of a larger code size. This may, in turn, slow things down due to instruction cache contention. Furthermore, -O3 optimisation only performs a few extra optimisations compared to -O2, the main one being loop unrolling, which explains the larger code size. Therefore, -O2 optimisations are good enough to present a practical indication of performance. Although the software operations were mainly written using higher level C/C++, the low level API library was mostly written using in-line functions, including in-line assembly language. Certain functions had to be written in assembly language as it was impossible to invoke the necessary instructions from within C/C++. In-line functions were chosen as they work like macros and do not incur any extra function call and return overheads, which ensures that these hardware specific operations do not create any unnecessary bottlenecks in the code. So, although the software kernels used may not technically be the fully optimised versions, they are sufficiently optimised to reflect real-world usage, and fully optimised versions may not present any significant performance improvements. Therefore, it is safe to say that the performance exhibited by the accelerator units reflects the true nature of hardware acceleration and is not due to a slow-down caused by poorly written software.
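To illustrate the style of the low level API described above, an in-line accessor typically just wraps a memory-mapped load or store so that it compiles down to a handful of instructions with no call or return overhead. The base address, register names and offsets below are invented for this sketch and do not reflect the actual prototype memory map; the code is intended for the embedded host rather than a workstation.

#include <cstdint>

// Hypothetical memory-mapped register block for one accelerator unit.
// The base address and register offsets are illustrative only.
static volatile uint32_t* const ACCEL_BASE =
    reinterpret_cast<volatile uint32_t*>(0xFFFF0000);

enum AccelReg { CONFIG = 0, STATUS = 1, FIFO_IN = 2, FIFO_OUT = 3 };

// In-line accessors: like macros, these incur no call/return overhead.
static inline void accelWrite(AccelReg reg, uint32_t value) {
    ACCEL_BASE[reg] = value;
}

static inline uint32_t accelRead(AccelReg reg) {
    return ACCEL_BASE[reg];
}

int main() {
    accelWrite(CONFIG, 0x1);           // hypothetical: enable the unit
    return accelRead(STATUS) ? 0 : 1;  // hypothetical: poll a ready flag
}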

11.2.2 Processor Architecture

It is also important to have an idea of how well the host processor performs against other processor architectures. There is no advantage in having an accelerator that is 10 times faster than a host processor if that host is 100 times slower than any other processor. So, for this purpose, the chosen host processor is compared against a mix of common processor architectures covering both RISC and CISC architectures. Relative performance can be estimated by using a code profile of standard library code. The code chosen is the STL library find() method for a set data structure. Listings 11.1, 11.2, 11.3, 11.4 and 11.5 are disassembly listings of the compiled find() for different architectures. This code was chosen as it is indicative of an optimised key retrieval operation. The code was compiled using GCC with -O2 optimisation. It is not easy to compare the performance of such a disparate group of microprocessors as the architectures vary widely, both in scale and type. Hence, a few assumptions need to be made to simplify the comparison. To ignore the effects of superscalar or multicore architectures, all processors are assumed to execute one instruction every clock cycle. To ignore the use of any cache prediction, it is assumed that memory access inflicts a memory transaction penalty during cache misses. To ignore the use of any branch prediction, it is assumed that branch instructions inflict a penalty. In addition, as the architectures are very different, some of the instructions may not fit the categories exactly and some further simplifications were made (a simple tallying sketch along these lines is shown after the list):

• The 68K TST instruction is counted as a compare instruction as it is essentially a compare against zero.

• CISC architectures do not have explicit load and store instructions. Therefore, the loads and stores are counted based on the addressing modes used instead.

• Branches were only considered for instructions that actually cause a change in the program counter. Therefore, conditional instructions on the ARM were not classified as branches but counted as the relevant arithmetic or memory instruction.

• Miscellaneous instructions are instructions that moved data between registers and instructions that provided convoluted linking and unlinking operations. Linking and unlinking instructions on CISC machines are more complicated than those on RISC machines.
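For illustration of how the percentages in Table 11.1 can be tallied, the sketch below classifies a disassembly fed to it one instruction per line. The mnemonic sets are a deliberately simplified, invented subset, and the parser assumes a simple "address opcode mnemonic operands" layout rather than the full rules listed above.

#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Tally a disassembly into broad instruction classes, one instruction per
// input line.  The mnemonic sets are an illustrative subset only.
int main() {
    const std::set<std::string> loads    = {"lwi", "lwz", "ldr"};
    const std::set<std::string> stores   = {"swi", "stw", "str"};
    const std::set<std::string> branches = {"beq", "bne", "bge", "brid", "bnei"};

    std::map<std::string, int> tally;
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream iss(line);
        std::string addr, opcode, mnemonic;
        if (!(iss >> addr >> opcode >> mnemonic)) continue;
        if (loads.count(mnemonic))         ++tally["load"];
        else if (stores.count(mnemonic))   ++tally["store"];
        else if (branches.count(mnemonic)) ++tally["branch"];
        else                               ++tally["other"];
    }
    for (const auto& kv : tally)
        std::cout << kv.first << ": " << kv.second << "\n";
    return 0;
}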

A quick glance at the code will show that the pattern is mostly similar across architectures because the same compiler was used. This is further confirmed by tabulating the profiles. The most important numbers in Table 11.1 are the memory access numbers because both primary and secondary searches face memory bandwidth problems.

122 00000000 , std::less, std::allocator >::find(int 305 const&)>: 0: e8660008 lwi r3,r6,8 4: 30c60004 addik r6,r6,4 8: be030034 beqid r3,52 // 3c c: 11460000 addk r10,r6,r0 309 10: e9270000 lwi r9,r7,0 14: b8100014 brid 20 // 28 18: 11030000 addk r8,r3,r0 1c: 11480000 addk r10,r8, r0 20: e9080008 lwi r8,r8,8 314 24: bc080018 beqi r8,24 // 3c 28: e8680010 lwi r3,r8,16 2c: 16491801 cmp r18,r9,r3 30: bcb2ffec bgei r18, -20 // 1c 34: e908000c lwi r8,r8,12 319 38: bc28fff0 bnei r8, -16 // 28 3c: 16465000 rsubk r18, r6, r10 40: bc120014 beqi r18,20 // 54 44: e8870000 lwi r4,r7,0 48: e86a0010 lwi r3,r10,16 324 4c: 16432001 cmp r18,r3,r4 50: bcb20010 bgei r18,16 // 60 54: f8c50000 swi r6,r5,0 58: b60f0008 rtsd r15,8 5c: 10650000 addk r3,r5,r0 329 60: f9450000 swi r10,r5,0 64: b60f0008 rtsd r15,8 68: 10650000 addk r3,r5,r0 Listing 11.1: AEMB disassembly (GCC 4.1.1)

00000000 , std::less, std::allocator >::find(int 153 const&)>: 0: e5903008 ldr r3,[r0,#8] 4: e2800004 add r0,r0,#4 ;0x4 155 8: e3530000 cmp r3,#0 ;0x0 c: e1a02001 mov r2,r1 10: e52de004 push {lr} ;(strlr,[sp,#-4]!) 14: e1a01000 mov r1,r0 18: 0a000008 beq 40 , std::less, 160 std::allocator >::find(int const&)+0x40> 1c: e592e000 ldr lr,[r2] 20: e1a0c003 mov ip,r3 24: e59c3010 ldr r3,[ip,#16] 28: e153000e cmp r3,lr 2c: a1a0100c movge r1, ip 165 30: b59cc00c ldrlt ip, [ip, #12] 34: a59cc008 ldrge ip, [ip, #8] 38: e35c0000 cmp ip,#0 ;0x0 3c: 1afffff8 bne 24 , std::less, std::allocator >::find(int const&)+0x24> 40: e1500001 cmp r0,r1 170 44: 0a000003 beq 58 , std::less, std::allocator >::find(int const&)+0x58> 48: e5922000 ldr r2,[r2] 4c: e5913010 ldr r3,[r1,#16] 50: e1520003 cmp r2,r3 54: a1a00001 movge r0, r1 175 58: e49de004 pop {lr} ;(ldrlr,[sp],#4) 5c: e12fff1e bx lr Listing 11.2: ARM disassembly (GCC 4.2.3)

123 00000000 , std::less, std::allocator >::find(int 164 const&)>: 0: 80030008 lwz r0,8(r3) 4: 39 43 00 04 addi r10,r3,4 166 8: 2f 80 00 00 cmpwi cr7,r0,0 c: 7d435378 mr r3,r10 10: 41 9e 00 38 beq- cr7,48 , std::less, std::allocator >::find(int const&)+0x48> 14: 81 64 00 00 lwz r11,0(r4) 18: 7c090378 mr r9,r0 171 1c: 48 00 00 14 b 30 , std::less, std::allocator >::find(int const&)+0x30> 20: 7d234b78 mr r3,r9 24: 81290008 lwz r9,8(r9) 28: 2f 89 00 00 cmpwi cr7,r9,0 2c: 41 9e 00 1c beq- cr7,48 , std::less, 176 std::allocator >::find(int const&)+0x48> 30: 80 09 00 10 lwz r0,16(r9) 34: 7f 80 58 00 cmpw cr7,r0,r11 38: 40 bc ff e8 bge- cr7,20 , std::less, std::allocator >::find(int const&)+0x20> 3c: 81 29 00 0c lwz r9,12(r9) 40: 2f 89 00 00 cmpwi cr7,r9,0 181 44: 40 9e ff ec bne+ cr7,30 , std::less, std::allocator >::find(int const&)+0x30> 48: 7f 83 50 00 cmpw cr7,r3,r10 4c: 4d9e0020 beqlr cr7 50: 80040000 lwz r0,0(r4) 54: 81 23 00 10 lwz r9,16(r3) 186 58: 7f 80 48 00 cmpw cr7,r0,r9 5c: 4c9c0020 bgelr cr7 60: 7d435378 mr r3,r10 64: 4e800020 blr Listing 11.3: PPC disassembly (GCC 4.1.1)

00000000 , std::less, std::allocator >::find(int 198 const&)>: 0: 4e56 0000 linkw %fp,#0 4: 2f0a movel %a2,%sp@- 6: 206e 0008 moveal %fp@(8),%a0 a: 246e 000c moveal %fp@(12),%a2 202 e: 2268 0006 moveal %a0@(6),%a1 12: 5488 addql#2,%a0 14: 2208 movel%a0,%d1 16: 4a89 tstl%a1 18: 6712 beqs 2c , std::less, 207 std::allocator >::find(int const&)+0x2c> 1a: 2012 movel %a2@,%d0 1c: b0a9 0010 cmpl %a1@(16),%d0 20: 6e1a bgts 3c , std::less, std::allocator >::find(int const&)+0x3c> 22: 2049 moveal%a1,%a0 24: 2269 0008 moveal %a1@(8),%a1 212 28: 4a89 tstl%a1 2a: 66f0 bnes 1c , std::less, std::allocator >::find(int const&)+0x1c> 2c: b288 cmpl%a0,%d1 2e: 6708 beqs 38 , std::less, std::allocator >::find(int const&)+0x38> 30: 2452 moveal%a2@,%a2 217 32: b5e8 0010 cmpal %a0@(16),%a2 36: 6c0e bges 46 , std::less, std::allocator >::find(int const&)+0x46> 38: 2041 moveal%d1,%a0 3a: 600a bras 46 , std::less, std::allocator >::find(int const&)+0x46> 3c: 2269 000c moveal %a1@(12),%a1 222 40: 4a89 tstl%a1 42: 66d8 bnes 1c , std::less, std::allocator >::find(int const&)+0x1c> 44: 60e6 bras 2c , std::less, std::allocator >::find(int const&)+0x2c> 46: 2008 movel%a0,%d0 48: 245f moveal%sp@+,%a2 227 4a: 4e5e unlk%fp 4c: 4e75 rts Listing 11.4: 68K disassembly (GCC 3.4.6)

124 00000000 , std::less, std::allocator >::find(int 208 const&)>: 0: 55 push %ebp 1: 89e5 mov %esp,%ebp 3: 57 push %edi 4: 56 push %esi 5: 53 push %ebx 213 6: 8b450c mov 0xc(%ebp),%eax 9: 8b7508 mov 0x8(%ebp),%esi c: 8b7d10 mov 0x10(%ebp),%edi f: 8b5008 mov 0x8(%eax),%edx 12: 8d5804 lea 0x4(%eax),%ebx 218 15: 89d9 mov %ebx,%ecx 17: 85d2 test %edx,%edx 19: 74 1b je 36 , std::less, std::allocator >::find(int const&)+0x36> 1b: 89d0 mov %edx,%eax 1d: 8b17 mov (%edi),%edx 223 1f: eb 09 jmp 2a , std::less, std::allocator >::find(int const&)+0x2a> 21: 89c1 mov %eax,%ecx 23: 8b4008 mov 0x8(%eax),%eax 26: 85c0 test %eax,%eax 28: 74 0c je 36 , std::less, 228 std::allocator >::find(int const&)+0x36> 2a: 395010 cmp %edx,0x10(%eax) 2d: 7d f2 jge 21 , std::less, std::allocator >::find(int const&)+0x21> 2f: 8b400c mov 0xc(%eax),%eax 32: 85c0 test %eax,%eax 34: 75 f4 jne 2a , std::less, 233 std::allocator >::find(int const&)+0x2a> 36: 39cb cmp %ecx,%ebx 38: 74 07 je 41 , std::less, std::allocator >::find(int const&)+0x41> 3a: 8b07 mov (%edi),%eax 3c: 3b4110 cmp 0x10(%ecx),%eax 3f: 7d 0b jge 4c , std::less, 238 std::allocator >::find(int const&)+0x4c> 41: 891e mov %ebx,(%esi) 43: 89f0 mov %esi,%eax 45: 5b pop %ebx 46: 5e pop %esi 47: 5f pop %edi 243 48: 5d pop %ebp 49: c20400 ret $0x4 4c: 890e mov %ecx,(%esi) 4e: 89f0 mov %esi,%eax 50: 5b pop %ebx 248 51: 5e pop %esi 52: 5f pop %edi 53: 5d pop %ebp 54: c20400 ret $0x4 Listing 11.5: X86 disassembly (GCC 4.2.3)

Type              AEMB   ARM    PPC    68K    X86
Arithmetic        33%    25%    27%    19%    13%
  Comparison      22%    83%    86%    83%    100%
  Addition        67%    17%    14%    17%    0%
  Subtraction     11%    0%     0%     0%     0%
Branch            33%    17%    31%    28%    20%
  Conditional     67%    75%    75%    67%    67%
  Unconditional   33%    25%    25%    33%    33%
Memory            33%    38%    27%    34%    29%
  Load            78%    89%    100%   91%    77%
  Store           22%    11%    0%     9%     23%
Miscellaneous     0%     21%    15%    19%    38%
Instruction Count 27     24     26     31     44

Table 11.1: Code profile for std::set::find()

Looking at the averages and range for each category across architectures: memory operations account for 32.2±5.5%, arithmetic operations account for 23.4±10% and branch operations account for 25.8±8%. This tells us that the bulk of the code is taken up by memory operations, and memory operations are also the most expensive in terms of time cost. Therefore, it is safe to compare the performance of these architectures using memory operations alone. Looking at the profile of the AEMB in comparison with the rest, it exhibits a memory profile (33%) that is similar to the average (32.2%) across architectures. Furthermore, memory operations are the most consistent across architectures as they have the smallest range (±5.5%) between the largest and smallest values. The ARM has about 5/33 = 15% more memory operations and the PPC has about 6/33 = 18% less memory operations.

This difference cannot turn $C_{up} = 3.43$ into $C_{up} \le 1.0$. Therefore, on the basis of memory operations, it is safe to assume that the AEMB architecture is neither significantly slower nor faster than other processor architectures. Any performance numbers gained through the use of accelerator units when compared with this host processor will be broadly similar with a different host processor architecture. Using higher-end processors that include many complex features such as multiple cores, large caches and branch prediction will improve the performance of the host processor for search applications, but at the cost of higher complexity. Accelerator units are highly specialised but are simpler in concept and implementation. The issue then becomes one of choosing between the cost and performance of a complex processor versus a simple accelerator unit.

11.3 Scalability

The next issue that needs to be addressed is whether this acceleration is scalable. Although the accelerators can accelerate a single search thread, it would be more useful if it were possible to accelerate N search threads in hardware. It should be obvious that there are a number of bottlenecks in the system and these will be the limiting factors on scalability.

11.3.1 Processor Scalability

Although the search acceleration is performed by the accelerator units, a processor bottleneck exists at two points: the available communications bandwidth between the host processor and the accelerator units, and the processing bandwidth, or the ability of the processor to allocate search threads to the different accelerator units. These are two separate issues that can be considered together as the latter is partly dependent on the former. Each accelerator unit comes with an individual host processor interface, which is used by the host processor to send configuration information, receive status information, put data items into the input buffers and get data items out of the output buffers. Depending on how each accelerator is used in the search pipeline, it may only take a few transactions to configure and retrieve a single result, or it may take a large number of transactions to configure and stream data items in and out of the accelerator unit. Hence, the communication requirements for each accelerator unit and pipeline differ depending on the application. The host processor interface in the prototype was configured as a shared bus. This was used because the accelerators were tested individually. Hence, the shared bus was essentially a dedicated bus for a specific accelerator unit, one at a time. This ensures that in each situation, the entire bandwidth is available to the application kernel. A shared bus was perfectly suited to prototyping, but is not suitable for real-world applications. Therefore, the way to scale up the communication capacity would be to adopt a different communication architecture. The host processor interface could be changed into a packet based interface [SSTN03, LZJ06, Art05] running on a number of different non-bus layouts to increase the number of channels for traffic. Alternatively, it is possible to split the host processor interface into two separate interfaces: a low traffic one for configuration, and a high traffic one for streaming data. This will further improve the use of available bandwidth. Another potential bottleneck is the processing bandwidth. The research prototype uses a single RISC processor core. As is evident from the simulation results of the

different accelerator units, the accelerator units are generally able to work faster than the processor core and to generate useful results at a faster rate than can be consumed by the software kernel. Therefore, another potential bottleneck in the search pipeline is the search issue rate and result consumption rate of the software. Again, the most obvious way to increase processing bandwidth is to increase parallelism, whether at the fine or coarse grain level. There are a number of different ways[Wal95, TEL95] to do this, such as increasing pipeline depths, hardware threads and processing cores. However, all these options are contingent upon an increase in host communications bandwidth. Otherwise, the communications bandwidth will present the major bottleneck.

11.3.2 Accelerator Scalability

Besides the physical cost constraints, the other problem that will affect the scalability of the accelerator units is the inter-accelerator communications, which involves bandwidth capacity and routing issues. Inter-accelerator bandwidth is only used to stream data from one accelerator unit to another, in a point-to-point fashion. The term streaming is used because it reflects what happens in hardware. The output buffers of a transmitting unit are directly connected to the input buffers of a receiving unit. These buffers are FIFOs and the control signals are crossed in order to provide hardware flow control, which gives a maximum bandwidth of 3.2Gbps at 100MHz for each channel. This bandwidth is more than necessary for each accelerator unit, which can only generate results at the maximum rate of 1.6Gbps per channel. Furthermore, this inter-accelerator communication bypasses the host processor and is not affected by processor scalability. Hence, the inter-accelerator communication has plenty of room to spare and is scalable, subject to physical and architectural constraints. The larger issue is that of routing data streams that are not connected directly. Routing can be achieved either by software routing or dedicated hardware routers. Software routing would suffer from the processor scalability issues highlighted earlier, while dedicated hardware routing suffers from a number of physical constraints. Even if physical constraints are discounted, hardware routing would still present issues at an architectural level. These problems stem from the fact that the complexity of a hardware router tends to increase exponentially with the number of nodes. There are different methods for reducing this problem by constraining the number of physical routes while still providing the ability to route from one node to another. However, for the search pipeline, the number of physical routes can be further reduced by considering the fact that not all units need to have access to every other unit. For example, it is

unlikely that results from a chaser unit need to be routed to another chaser unit or sieve unit. Furthermore, the routes are uni-directional as data flows from one stage to the next. However, the best way to work around this problem is to remove the necessity for routing altogether. This can be done by using the static routing architecture proposed in section 10.1.2. This will reduce both the architectural and physical constraints on the routing and, due to the well defined nature of search, this architecture can still be used to solve most common search problems. Therefore, architecturally speaking, it is possible to scale up the performance of the accelerators by replicating multiple accelerator units in fixed chains. The only issues that will limit this are physical constraints. While area size is definitely a limiting factor, it is less of a major concern. The main physical constraint will be on the layout of the interconnects that form the host processor channel. These long lines will become significant as the number of accelerator units goes up and will at least affect the overall speed of the system. Although there can be some creativity in laying out these lines and the processor cores, this will ultimately limit the scalability of the accelerator units.

11.3.3 Memory Scalability

Memory scalability is also a major bottleneck for a search application. We have shown that the memory bandwidth requirements for each accelerator unit are fairly high and that the use of cache memory will not help much in hiding this problem. Therefore, memory will prove to be the ultimate bottleneck in implementation scalability. A memory bottleneck exists at two potential locations: the actual memory bandwidth available, and the memory contention between the accelerator units. The one positive note is that memory technology is constantly improving [Lin08, CW08] and that will help alleviate the bottleneck. The actual memory bandwidth available can be increased by using faster memory technologies and multi-channel memory. While faster memory technology is able to retrieve more data from memory in each clock cycle and thus increase bandwidth, it usually comes at the cost of higher latency[Woo08]. Due to the random nature of data access for search applications, this higher latency may prove to be a hidden problem, although it can still be masked by using higher speed memory. Multi-channel memory[ZZ05] can increase the bandwidth by accessing multiple memory locations at a time and is seeing increased use in consumer level computing. This mainly involves striping different memory locations across different memory modules accessed through separate memory channels. This is a scalable solution on an architectural level but would end up consuming extra I/O and board space on a physical level,

which will again limit the scalability. Therefore, although there are work-arounds for the different scalability issues, physical limitations will ultimately limit memory scalability.

11.4 Acceleration Cost

The term cost needs to be defined for the purpose of deriving the acceleration per unit cost. In the case of this accelerator design, cost is measured in terms of monetary cost to produce the desired result. Some of the basic values have been calculated in section 10.5. However, these should be consolidated into specific values for specific configurations. A number of boundary conditions need to be assumed before the costs can be estimated.

Implementation Technology is assumed to be an off-the-shelf FPGA. With present technology, this will have a potential I/O clock of about 200MHz using a high speed FPGA. The FPGA is assumed to have the necessary I/O connections to communicate with the outside world at that speed. The accelerator is assumed to be closely coupled to the host processor and implemented as a bridge device.

Memory Technology is assumed to be regular DDR2 memory. With an I/O clock of 200MHz, this will limit the memory technology to DDR2-400 technology, which has a maximum bandwidth of 25.6Gbps1. The host processor consumption of this memory bandwidth is assumed to be negligible.

Host Communication Interface is assumed to connect directly to an x86 processor via HyperTransport. With an I/O clock of 200MHz, this has a maximum commu- nication bandwidth of 12.8Gbps2 in one direction. The aggregated bandwidth will not be used as it assumes a 50:50 bi-directional communication ratio. This band- width is more than sufficient to handle data streams coming in at the maximum rate from memory.

11.4.1 Configuration A

Using memory as the limiting factor, the absolute maximum number of chaser units and streamer units can each be easily computed.

$M_{max} = \frac{25.6\,\mathrm{Gbps}}{2.13\,\mathrm{Gbps}} \times \frac{100\,\mathrm{MHz}}{200\,\mathrm{MHz}} = 6.009 \approx 6$

$C_{max} = \frac{25.6\,\mathrm{Gbps}}{1.28\,\mathrm{Gbps}} \times \frac{100\,\mathrm{MHz}}{200\,\mathrm{MHz}} = 10$

1 200MHz × 2 transfers per clock × 64 bits
2 200MHz × 2 transfers per clock × 32 bits

Assuming that each chaser is directly paired with a streamer and they are configured in a simple query pipeline:

$M_{max} = C_{max} = \frac{25.6\,\mathrm{Gbps}}{2.13\,\mathrm{Gbps} + 1.28\,\mathrm{Gbps}} \times \frac{100\,\mathrm{MHz}}{200\,\mathrm{MHz}} = 3.75 \approx 4$

This maximum figure is rounded up for two reasons: the sieve unit works better in even channel pairs; and the chaser will not consume the maximum bandwidth as it has to be reconfigured between searches. Assuming that each streamer channel pair is then connected to a sieve unit and configured in a simple boolean query pipeline:

$V_{max} = 2$

The total resource consumption of the accelerator units can be computed from equation 10.1:

$Q_{lut} = 515 N_C + 357 N_M + 634 N_V + 126 = 4882$

This will fit inside a medium to large FPGA, but not into smaller ones. Any additional resources can be used to implement additional sieve units to enable more complicated queries. Using the figures from Table 10.2, the approximate monetary cost of such an accelerator will be:

$K_{fpga} = \pounds0.88\,N_C + \pounds0.59\,N_M + \pounds1.08\,N_V \approx \pounds8.04$

Under these conditions, the accelerator unit will be able to accelerate entirely in hardware:

• Four parallel simple queries.

• Two parallel boolean queries, each with two streams and one operand.

• A combination of the above.
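The Configuration A sizing can be reproduced with a few lines of arithmetic; the following sketch simply restates the calculation above (the variable names are invented) and prints the 4,882 LUT and roughly £8.04 figures.

#include <cmath>
#include <cstdio>

int main() {
    // Boundary conditions from section 11.4: DDR2-400 behind a 200MHz I/O
    // clock, with the accelerator cores running at 100MHz.
    const double memBandwidth = 25.6;                 // Gbps
    const double chaserBw = 1.28, streamerBw = 2.13;  // Gbps per unit at 100MHz
    const double clockRatio = 100.0 / 200.0;          // core clock / I/O clock

    // Each chaser is paired with a streamer in a simple query pipeline.
    double pairs = memBandwidth / (chaserBw + streamerBw) * clockRatio;  // 3.75
    int nc = static_cast<int>(std::ceil(pairs));      // rounded up to 4, as in the text
    int nm = nc;
    int nv = nm / 2;                                  // one sieve per streamer channel pair

    int qlut = 515 * nc + 357 * nm + 634 * nv + 126;  // expression 10.1
    double cost = 0.88 * nc + 0.59 * nm + 1.08 * nv;  // per-unit FPGA costs, Table 10.2

    std::printf("NC=%d NM=%d NV=%d  Q_lut=%d  cost=GBP %.2f\n", nc, nm, nv, qlut, cost);
    return 0;
}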

11.4.2 Configuration B

Assuming that the ratio of chasers to streamers is 1:2.5 as suggested in section 9.1.2 and they are configured with dynamic software routing instead:

$C_{max} = \frac{25.6\,\mathrm{Gbps}}{2.5 \times 2.13\,\mathrm{Gbps} + 1.28\,\mathrm{Gbps}} \times \frac{100\,\mathrm{MHz}}{200\,\mathrm{MHz}} = 1.938 \approx 2$

$M_{max} = 2.5 \times C_{max} \approx 5$

This results in an odd number of streamer units. While not very homogeneous, this oddity is acceptable if other types of queries are considered, such as boolean queries with three streams and two operands. Assuming that each streamer channel pair is connected to a sieve unit while the odd channel is joined with the output of an existing boolean sieve unit:

$V_{max} = 3$

The total resource consumption of the accelerator units can be computed from equation 10.1:

$Q_{lut} = 4843$

Using the figures from Table 10.2, the approximate monetary cost of such an accelerator will be:

$K_{fpga} \approx \pounds7.95$

Under these conditions, the accelerator unit will be able to accelerate with some software assisted routing:

• Five parallel simple queries.

• Two parallel boolean queries, one with two streams and one operand, the other with three streams and two operands.

• A combination of the above.

This configuration is more versatile than the earlier configuration and at a similar cost. As the absolute maximum number of streamer units that can be implemented is six, this configuration also represents the maximum practical number of streamers that can be implemented for a search pipeline.

11.4.3 Configuration Comparisons

There are a couple of comparisons that can be made from the above estimates. While the cost of implementing a more complex configuration is similar to the cost of implementing a simpler configuration, the complex configuration is capable of handling a different range of search pipelines. Therefore, it is possible to implement different combinations of pipelines for different applications, within the bounds of the absolute maximum number of accelerator units. Although the number of sieve units was chosen to be a minimum, a number of additional sieve units can be added to build more complex pipelines that handle more complicated boolean queries in addition to routing. Another thing to note is that, although CMOS implementations have a fundamentally faster clock speed, the maximum number of units is fundamentally bound by

memory bandwidth and is exactly the same as that for FPGAs, because the bandwidth requirements scale linearly with the core clock when paired with faster memory. Therefore, the number of pipelines that can actually be accelerated in a CMOS implementation is similar to that of the FPGA. The only difference is that they can run at a much faster clock rate and complete more search operations per unit time. One way to increase the number of search pipelines is to increase the memory bandwidth through the different methods suggested in section 11.3.3. This is ultimately the bottleneck in any search system. Another way is to redistribute the usage of the existing memory bandwidth, as shown between the two configurations above. However, this can only improve the situation slightly.

11.5 Alternative Technologies

There are, of course, other possible ways of achieving a performance boost for search applications. A dedicated hardware search accelerator may, or may not, be the best solution for an application. Therefore, it is prudent to set the solution presented in this research against the many alternative solutions, which will show the advantages and disadvantages that the present solution has against the rest. It is also a good time to have another look at the search stack presented in Figure 2.1.

11.5.1 Improved Software

The simplest way to achieve any search acceleration would be to replace search algorithms with alternatives at the primary and secondary search layers. This would allow the search operation to be accelerated with the minimum amount of problems and cost. A quick search will reveal that there is plenty of ongoing research in the area of application specific search algorithms. As this is a pure software alternative, it does not compete directly with the hardware accelerator alternatives and can actually exist in parallel with hardware options. By providing an accelerated hardware layer that helps in performing common operations, the hardware can potentially benefit a wide variety of software search algorithms. Another way to improve search without affecting any hardware is to improve the data structures used. The choice of one data structure over another can affect the performance of algorithms significantly[Knu73]. There is also ongoing research in the area of esoteric data structures for use with newer algorithms. The search accelerator presented here should not be considered the best solution for the problem. It is important to exhaust possible software alternatives in addition to any other hardware alternative.

11.5.2 Content-Addressable Memories

The content-addressable memory (CAM) compares input search data against a table of stored data and returns the address of the matching data[PS06a]. CAMs have a single clock cycle throughput, making them faster than other hardware-based and software-based search systems. However, the speed of a CAM comes at the cost of increased silicon area and power consumption. Most CAMs are implemented using expensive SRAM cells instead of DRAM cells. Furthermore, a typical CAM cell has half the capacity of an SRAM cell, which further exacerbates the problem. Due to this, the largest CAMs are only about 18Mbit in size[PS06a]. Therefore, it is a fairly expensive hardware solution to the search problem in terms of power and area. A binary CAM performs only exact-match searches, while a more powerful ternary CAM allows pattern matching with the use of "don't cares"[ACS03]. Don't cares act as wildcards during a search and are particularly attractive for implementing longest-prefix-match searches in routing tables. This makes the CAM more suitable for performing attribute searches rather than comparison searches (section 2.2). Although it can be argued that the CAM can replace a chaser unit for equality searches, it is less able to replace it for comparison searches such as greater-than or less-than searches. Therefore, while there is a place for CAMs in hardware search acceleration, they are better suited to a different class of search operations than the solution presented here.

11.5.3 Multicore Processors

The most obvious method of accelerating search, as readily suggested in [Sto90], is by performing search operations in parallel on multiple processors. This accelerates search at the host processor layer. Most popular multi-processors used today are homogeneous multi-processors such as those employed in x86 processors. Symmetric multi-processing has been the mainstay in general purpose computing acceleration over the years. The heterogeneous multi-processor system presented in this research is an alternative method to accelerate applications. Both methods suffer from the same restrictions and pitfalls caused by limited memory bandwidth. However, it has been conclusively [KTJR05] demonstrated that heterogeneous multiprocessor systems are more efficient than homogeneous systems from several perspectives. Using a heterogeneous processor can significantly reduce processor power dissipation. An increase in power consumption and heat dissipation will typically lead to higher costs for thermal packaging, fans, electricity, and even air conditioning. To reduce this, industry currently uses two broad classes of techniques for power reduction: gating-based and voltage or frequency scaling-based. Given the low core voltages of around 1 volt, there is very little more that voltage

scaling can do to improve power consumption. Any significant decrease in voltage will eat away at the noise margin, reducing the accuracy of the digital signal. Gating circuitry itself has power and area overheads, limiting its application at the lowest granularity. This means that power still gets dissipated even when dynamic blocks are gated off. It is only feasible to use gating techniques at a large block level, which is where it is principally applied today. This can easily be used for the accelerator designed here by gating off specific accelerator units and only turning them on when they are needed in the pipeline. Given a fixed circuit area, a heterogeneous processor can provide significant advantages. It can match applications to the core that best meets the performance demands and it can provide improved area-efficient coverage of various real-world work demands. For the area devoted to a single additional host processor core, it is possible to include 1 sieve, 3 streamers and 2 chasers instead. This is a better allocation of resources than the existing solution of pure homogeneous multicore processors. For a similar amount of resources, a heterogeneous hardware accelerator can accelerate search by 5 times, possibly more. Further reductions in area consumption are possible if the accelerator units are made to share resources. In the design of the accelerator units, each unit has an individual ALU unit that is not 100% exploited. The ALU units are only used during less than 50% of the machine states. Therefore, it is certainly feasible to join two accelerator units to share the same ALU unit. Others [SKV+06] have shown that sharing ALU units for general purpose processors can reduce area consumption by almost 20%. In the case of the accelerator units, the area savings will be significantly greater as the bulk of the unit is made up of the ALU device itself.

11.5.4 Data Graph Processors

Another class of heterogeneous processor that can feasibly be used to accelerate search is the class of graph processors [MHH02, NK04]. This provides an alternative at the accelerator unit layer. These processors work at an associative level by reconfiguring hardware to build the data structures physically, instead of as virtual data structures in memory. This has the advantage of manipulating data structures in hardware, which can be very fast. It has been described elsewhere [CLRS01] that a tree is merely a specific representation of a generic graph. Therefore, a graph processor can definitely represent and accelerate any tree functions. In fact, this is a very interesting class of processor as it attempts to represent complex data structures in hardware. This suggests that since primitive data structures such as stacks and queues are already explicitly implemented in hardware, there is no reason why more complex data

structures cannot be similarly treated. One can certainly appreciate the logic behind this and understand the benefits that come from representing data structures in hardware, which would allow the data structures to be quickly searched and easily manipulated, bypassing the limitations of memory bandwidth by simply not using much memory. However, hardware data structures suffer from the very physical limitations of hardware: the cost of the hardware would grow exponentially with the size of the problem set. The way to reduce this cost is to partly move the problem up to the software domain by swapping data graphs between memory and the processor. But this will reduce many of the benefits associated with implementing complex data structures in hardware. Therefore, although exciting, the real-world use of such a processor is limited. Furthermore, in search applications, only a subset of the data structure is usually needed. Although the graph processor will likely defeat any other method of graph traversal, a search has been shown earlier to be a linear traversal and will not benefit much from a graph processor.

11.5.5 Other Processors

There are many other types of hardware accelerators in use in the real world today, including media processors and physics processors. These processors provide alternatives at the hardware layers. They are typically used for computationally intensive operations and are not necessarily suited to, nor efficient for, search applications. For example, the CELL[GHF+05, CHI+05] processor is capable of performing compares and data ordering, which would allow it to perform a sieve operation. It has a load-store unit that connects to a high bandwidth memory interconnection architecture to supply it with the data it needs at a raw rate of 60 GB/s. Therefore, it can definitely be programmed to perform a search operation. However, it is a little too costly for search applications, especially with its very high power consumption.

11.6 Suggestions for Future Work

The research prototype was designed in such a way as to facilitate the collection of data. This means that it was a stripped-down design, so that it could be assembled in different configurations and tested under various conditions. To accommodate this, many helpful assumptions about the usage of the accelerators could not be incorporated. Therefore, there is still room to improve the design, particularly in the optimisation of the resources used.

11.6.1 Conjoining Arithmetic Units

Care was taken while designing each individual accelerator unit to optimise the use of resources. For example, the chaser unit uses the same ALU to perform calculations on the data and next pointer locations. Both these operations perform the addition of the node pointer with a static offset value, but at different times. As can be seen from figures 5.5, 6.5 and 7.4 and their respective descriptions, the use of the ALU has been interleaved in each accelerator. While the ALU has already been interleaved within each accelerator, there are still clock cycles where the ALU is not in use. For example, the ALU unit for a chaser unit is only used to add pointers during the NULL and NEXT states. The entire main machine loop has five states in it, which means that the ALU is not in use for the other three states. A second chaser unit could feasibly share the same ALU and interleave its calculations as well. This sort of conjoining of resources can also be applied elsewhere in the design of other accelerator units. While the use of a single ALU may not result in massive savings of resources, the ALU is one of the larger blocks of an accelerator unit (the other being memory). If a significant number of accelerator units are used in an application, the number of ALU units saved would become significant. Therefore, this is definitely one form of design optimisation that should be undertaken for real-world implementations.
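As a rough behavioural illustration of the idea (this is not the actual RTL, and the state handling is heavily simplified), two chaser-like units could time-share a single adder by taking ownership of it on alternate clock cycles:

#include <cstdint>
#include <cstdio>

// Behavioural sketch of two units time-sharing one adder.  Each unit only
// needs the ALU in a minority of its machine states, so their requests can
// be interleaved on alternate cycles without stalling either unit.
struct ChaserState {
    uint32_t nodePtr;
    uint32_t offset;
    bool     needsAdd;   // asserted only in the pointer-calculation states
    uint32_t result;
};

void sharedAluCycle(uint64_t cycle, ChaserState& a, ChaserState& b) {
    ChaserState& owner = (cycle & 1) ? b : a;         // alternate ownership
    if (owner.needsAdd) {
        owner.result = owner.nodePtr + owner.offset;  // the single shared adder
        owner.needsAdd = false;
    }
}

int main() {
    ChaserState a{0x1000, 8, true, 0}, b{0x2000, 12, true, 0};
    for (uint64_t c = 0; c < 4; ++c) sharedAluCycle(c, a, b);
    std::printf("a=%#x b=%#x\n", static_cast<unsigned>(a.result),
                static_cast<unsigned>(b.result));
    return 0;
}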

11.6.2 Conjoining Stream Buffers

The accelerator units have individual input and output data buffers that are designed as hardware FIFOs. Each FIFO contains a multi-port memory block, which is typically expensive in terms of area cost[VST03b, VST03c, VST03a]. One port is connected to the internal accelerator unit while the other port is used externally. As the number of accelerator units increases, the number of memory blocks used increases linearly. The obvious way to reduce the cost of memory is to reduce the size of the FIFOs. However, this has limited advantages, as memory blocks tend to come in fairly standard sizes and chip area does not scale linearly with memory size: halving the capacity of a block results in a block that is still more than half of the original area. Simply reducing the size of the stream buffers will therefore not save a significant amount of area. In fact, the FIFO used in the research prototype is already extremely small (15 × 32 bit). An alternative way to reduce the amount of memory used is to conjoin the stream buffers. Instead of having a separate output buffer on one accelerator unit and an input buffer on another, the two buffers can be merged into a single buffer.

While a larger memory block is needed to store the same amount of data, this proves to be an advantage because a single 2 kbit memory block is considerably smaller than two individual 1 kbit memory blocks³: two 1 kbit blocks occupy 2 × 0.34 = 0.68 mm², whereas a single 2 kbit block occupies only 0.48 mm², a saving of almost 30%, as evident from Table 11.2.

Capacity     1 kbit   2 kbit   4 kbit   8 kbit   16 kbit   32 kbit   64 kbit
Area (mm²)   0.34     0.48     0.75     1.26     2.23      3.78      7.15
Time (ns)    2.93     3.07     3.33     3.42     3.54      4.41      4.69

Table 11.2: Specifications for 0.35µm CMOS DPRAM blocks

Although this is the easiest way to reduce memory area, it complicates matters for dynamic pipeline architectures. However, it is still possible to route the data dynamically in software by treating the merged buffers as a unified buffer: data can still be pumped in at the front end and extracted from the back end via software. Alternatively, all the buffers could be treated as either input or output buffers, but not both. Another way to reduce memory resources is to conjoin the memory blocks themselves. However, this can only be done after the accelerator units have been paired up and the pipelines are well defined. As with the ALU, each port of the memory block is only accessed every other clock cycle, so it is possible to interleave multiple operations on the same memory port. This can turn a dual-port memory into a quad-port memory by time-division multiplexing the memory operations[SD02, Xil05], as sketched below. This method is more complicated than merely merging the buffers, but it can be used in tandem with buffer merging for additional savings.
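As an illustration of the time-division multiplexing idea, the hedged Verilog sketch below shows one physical port of a dual-port RAM being shared by two logical requesters, along the general lines of [SD02, Xil05]: the RAM is clocked at twice the accelerator clock and alternates between the two requesters on successive fast-clock cycles. The module and signal names are assumptions for illustration only.

// Hedged sketch: share one physical RAM port between two logical ports (A, B)
// by running the RAM at twice the accelerator clock. 'phase' toggles every
// fast-clock cycle; A is served on phase 0 and B on phase 1.
module tdm_port #(parameter AW = 10, DW = 32) (
  input  wire          clk2x,                 // 2x memory clock
  input  wire          phase,                 // alternates 0,1,0,1,...
  input  wire [AW-1:0] a_addr, b_addr,        // logical port addresses
  input  wire [DW-1:0] a_wdata, b_wdata,      // logical port write data
  input  wire          a_we, b_we,            // logical port write enables
  output reg  [DW-1:0] a_rdata, b_rdata,      // logical port read data
  output wire [AW-1:0] m_addr,                // physical port: address
  output wire [DW-1:0] m_wdata,               // physical port: write data
  output wire          m_we,                  // physical port: write enable
  input  wire [DW-1:0] m_rdata                // physical port: read data
);
  // Multiplex the two logical requests onto the one physical port.
  assign m_addr  = phase ? b_addr  : a_addr;
  assign m_wdata = phase ? b_wdata : a_wdata;
  assign m_we    = phase ? b_we    : a_we;

  // A synchronous RAM returns read data one fast cycle after the address,
  // so capture it on the opposite phase from the one that issued it.
  always @(posedge clk2x)
    if (phase) a_rdata <= m_rdata;
    else       b_rdata <= m_rdata;
endmodule

Applying the same multiplexing to both ports of a dual-port block would present four logical ports from a single physical memory.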

11.6.3 Memory Interface

For the research prototype, all memory accesses go through a central memory arbiter that handles transactions in a round-robin manner. This is adequate for the research prototype because the accelerator units are tested one at a time, which means that each accelerator unit effectively has full access to the memory. In real-world applications, this central arbiter would become a performance bottleneck and would consume significant resources as the number of accelerator units increases. Some alternatives were briefly mentioned in section 11.3.3. However, those methods mainly deal with generic memory access by multiple masters, typically a number of processor cores. The memory access patterns of each accelerator unit are neither random nor generic: each accelerator unit is controlled by a finite state machine that performs memory transactions at periodic intervals.

³ http://web.archive.org/web/20071006035159/http://asic.austriamicrosystems.com/databooks/digital/mc_dpram_c35.html

If this is taken into account, it is possible to design a memory interface that knows in advance when to process each transaction, resulting in a simpler design, as sketched below.
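A hedged sketch of such a schedule-driven interface is given below. It assumes, purely for illustration, that the accelerators can be phased so that each issues its transaction in its own fixed time slot; the interface then simply rotates a one-hot grant instead of arbitrating among asynchronous requests.

// Hedged sketch: a fixed-slot grant generator. Because each accelerator's
// state machine issues its memory transaction at a known, periodic point in
// its loop, the interface can rotate through the units on a fixed schedule
// rather than arbitrating on demand. Module and signal names are illustrative.
module slot_scheduler #(parameter N = 4) (
  input  wire         clk,
  input  wire         rst,
  output reg  [N-1:0] grant                       // one-hot: current bus owner
);
  always @(posedge clk)
    if (rst) grant <= {{(N-1){1'b0}}, 1'b1};      // start with unit 0
    else     grant <= {grant[N-2:0], grant[N-1]}; // rotate one slot per cycle
endmodule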

CHAPTER 12

Conclusion

This dissertation has proposed a solution to the search problem. Search is a fundamental problem in computing and, as computers increasingly invade our everyday lives, search is becoming an everyday problem for everyone. Historically, search has received less attention than other computing problems.

Search was first defined into different categories and characterised. In order to visualise the different components involved in a search, a novel search stack was developed. This stack links the different hardware and software components of a complex search operation together. It also serves to illustrate how search can be accelerated at different layers using alternative technologies. Furthermore, a generic search was broken down into a three-stage search pipeline. Each stage can then be individually accelerated by a different type of accelerator unit, as the stages are characterised by very different operations and problems.

The accelerator units form fundamental building blocks that are only capable of performing one task and performing it efficiently. They can be used on their own to offload some fundamental tasks from the host processor. The use of accelerator units gives added flexibility to the overall accelerator design. On top of these unit tasks, complex search acceleration can be built. The solution presented here is novel in that these accelerator units can be combined like LEGO bricks to solve various complex search problems. Different numbers and configurations of accelerator units can be used together to form various pipelines for performing different types of search, depending on the specific application.

In order to investigate the performance of these units, simulation was heavily used. Initially, a single iteration of a complex search simulation took days to run. The bulk of this time was consumed by the data set preparation process, which is O(N log N) bound. To speed this up, a novel simulation method was developed: the prepared simulation data was frozen onto a disk file using Verilog constructs and reused across multiple simulation runs.
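The exact constructs used are described in the body of the dissertation; purely as an illustration of the general freeze-and-reload idea, a testbench might dump the prepared data set with $writememh on the first run and reload it with $readmemh on subsequent runs. The array name, file name and plusarg below are illustrative assumptions.

// Hedged illustration only: the memory array, file name and plusarg are
// assumptions, not the dissertation's actual simulation environment.
module sim_freeze;
  reg [31:0] mem [0:1023];                 // behavioural model of the data set

  initial begin
    if ($test$plusargs("restore"))
      $readmemh("dataset.hex", mem);       // reload a previously frozen data set
    else begin
      // ... expensive O(N log N) data set preparation would run here ...
      $writememh("dataset.hex", mem);      // freeze it for later simulation runs
    end
  end
endmodule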

While the bulk of the work done relates to hardware design, a large part of it was software focused. A number of search kernels were written to compare the performance of hardware acceleration against pure software operation. These kernels were written in C++, exploiting the Standard Template Library (STL) for its optimised algorithms and data structures. The code was compiled with the optimising GCC compiler to produce compact and efficient code for testing purposes. As a result, the accelerators were shown to achieve a significant factor of hardware acceleration when compared to pure software solutions.

The chaser unit was designed to perform key search, which is a primary search and typically the first stage of any search pipeline. It is also a very common computer operation, used by any number of operations including result selection, insertion and deletion. A multi-key search can be accelerated by up to 3.43 times using the chaser unit compared to a pure software operation. However, it does not provide as significant an improvement when used for a single-key search.

The streamer unit was designed to offload the mundane list retrieval task, which is a supporting task used in different search applications. On its own, it does not speed up the operation when compared to a pure software operation. However, it works as an excellent offloader, extracting data values from fundamental data structures while freeing up the host processor for other tasks.

The sieve unit was designed to perform result collation, which is a secondary search task and typically the last stage of any secondary search pipeline. A number of these units can be combined to form different types of search operations, including list union and intersection. It is capable of accelerating secondary boolean queries by up to 5.2 times compared to a pure software operation. In addition to result collation, it can also be used to buffer and route results from other units.

While memory is a major search bottleneck, increasing the cache size has been shown to have little overall effect on performance. This method of increasing general-purpose microprocessor performance does not work as well for search applications, which can be easily understood from the ephemeral nature of search data. The results show that unless the cache is increased to a size matching the data set, there is little benefit in increasing it. To test a better way of constructing a cache, a structural cache, which exploits structural locality in addition to temporal and spatial locality, was developed. However, at small sizes, a structural cache only provides a small 3% boost in performance.

Therefore, there is little reason to integrate a structural cache unless its cost can be justified by the small increase in performance.

The accelerator units are a better solution than either inventing a whole new computing paradigm or a new microprocessor. Both of those alternatives, while unique, bring a whole host of other problems, including incompatibilities with present tools and platforms. The accelerators can be immediately integrated into existing computing platforms as an on-chip bridge, a co-processor or an I/O device. The accelerator units were designed for FPGA implementation; with mainstream microprocessor companies opening up their platforms to hybrid computation initiatives, this is a potentially easier path for the adoption of this technology. In addition, these units can also be targeted for ASIC implementation, which allows them to run at much higher clock speeds for a higher search throughput rate.

The accelerator design is also scalable. The units are designed to be simple and small in order to simplify implementation and reduce its cost. However, there is still room for improvement in resource usage: different parts of the design can be shared and conjoined to further reduce resource consumption. These optimisations are not dealt with directly here and are left for future work.

The most important end result of this programme of research has been the identification and development of a low-cost method of search acceleration. While using the accelerator is one possible way of accelerating search, there are many other ways of achieving acceleration. However, the solution presented in this dissertation has the advantage of being flexible, cheap and fast. It is flexible enough to be adapted for search applications and other potential uses, while still being small and simple enough to be integrated into existing designs at little extra cost.

Bibliography

[20t93] 20th International Symposium on Computer Architecture. A Case for Two-Way Skewed-Associative Caches, May 1993.

[AB05] Jeff Andrews and Nick Baker. Xbox360 system architecture. In Hot Chips, number 17, 2005.

[ABB+03] Dave Abrahams, Mike Ball, Walter Banks, Greg Colvin, Hiroshi Fukutomi, Lois Goldthwaite, Yenjo Han, John Hauser, Seiji Hayashida, Howard Hinnant, Brendan Kehoe, Robert Klarer, Jan Kristofferson, Dietmar Kühl, Jens Maurer, Fusako Mitsuhashi, Hiroshi Monden, Nathan Myers, Masaya Obata, Martin O'Riordan, Tom Plum, Dan Saks, Martin Sebor, Bill Seymour, Bjarne Stroustrup, Detlef Vollmann, and Willem Wakker. Technical report on c++ performance. Technical Report PDTR 18015, ISO/IEC, August 2003.

[ACS03] I. Arsovski, T. Chandler, and A. Sheikholeslami. A ternary content-addressable memory (tcam) based on 4t static storage and including a current-race sensing scheme. Solid-State Circuits, IEEE Journal of, 38(1):155–158, Jan 2003.

[ADS81] Sudhir K. Arora, S. R. Dumpala, and K. C. Smith. Wcrc: An ansi sparc machine architecture for data base management. In ISCA ’81: Proceedings of the 8th annual symposium on Computer Architecture, pages 373–387, Los Alamitos, CA, USA, 1981. IEEE Computer Society Press.

[Alt07] Altera, Inc. Cyclone III Device Handbook, July 2007.

[Art05] Arteris SA. A Comparison of Network-on-Chip Busses, 2005.

[ASN+99] Shinsuke Azuma, Takao Sakuma, Takashi Nakano, Takaaki Ando, and Kenji Shirai. High performance sort chip. In Hot Chips, number 11, 1999.

[Bab79] E. Babb. Implementing a relational database by means of specialized hardware. ACM Trans. Database Syst., 4(1):1–29, 1979.

[BDH03] Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The google cluster architecture. IEEE Micro, 23(2):22–28, March 2003.

[BGH+92] T. F. Bowen, G. Gopal, G. Herman, T. Hickey, K. C. Lee, W. H. Mansfield, J. Raitz, and A. Weinrib. The datacycle architecture. Commun. ACM, 35(12):71–81, 1992.

[Bor99] Borland/Inprise. Interbase 6.0 documentation, 1999.

[Bro04] Leo Brodie. Thinking Forth, chapter 4,6,7,8. Creative Commons, 2004.

[BTRS05] Florin Baboescu, Dean M. Tullsen, Grigore Rosu, and Sumeet Singh. A tree based router search engine architecture with single port memories. In ISCA ’05: Proceedings of the 32nd annual international symposium on Computer Architecture, pages 123–133, Washington, DC, USA, 2005. IEEE Computer Society.

[CHI+05] Scott Clark, Kent Haselhorst, Kerry Imming, John Irish, Dave Krolak, and Tolga Ozguner. Cell broadband engine interconnect and memory interface. In Hot Chips, number 17, 2005.

[CLRS01] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2001.

[CW08] Judy Chen and Fred Ware. The next generation of mobile memory. Presented at MEMCON’08, July 2008.

[DeW78] David J. DeWitt. Direct - a multiprocessor organization for supporting relational data base management systems. In ISCA ’78: Proceedings of the 5th annual symposium on Computer architecture, pages 182–189, New York, NY, USA, 1978. ACM.

[DG92] David DeWitt and Jim Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35(6):85–98, 1992.

[DGG+86] David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, and M. Muralikrishna. GAMMA — A high performance dataflow database machine. In Proceedings of the 12th International Conference on Very Large Data Bases, pages 228–237, 1986.

[Fer60] David E. Ferguson. Fibonaccian searching. Commun. ACM, 3(12):648, 1960.

[FFP+05] Daniel Fallmann, Helmut Fallmann, Andreas Pramböck, Horst Reiterer, Martin Schumacher, Thomas Steinmaurer, and Roland Wagner. Comparison of the enterprise functionality of open source database management systems, Apr 2005.

[FH05] Michael J. Flynn and Patrick Hung. Microprocessor design issues: Thoughts on the road ahead. IEEE Micro, 25(3):16–31, 2005.

[FK93] Shinya Fushimi and Masaru Kitsuregawa. Greo: a commercial database processor based on a pipelined hardware sorter. In SIGMOD ’93: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 449–452, New York, NY, USA, 1993. ACM.

[FKS97] Terrence Fountain, Péter Kacsuk, and Dezső Sima. Advanced Computer Architectures: A Design Space Approach, chapter 10-18. Addison-Wesley, 1st edition, 1997.

[FKT86] Shinya Fushimi, Masaru Kitsuregawa, and Hidehiko Tanaka. An overview of the system software of a parallel relational database machine grace. In VLDB ’86: Proceedings of the 12th International Conference on Very Large Data Bases, pages 209–219, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc.

[Gen04] Paul Genua. A cache primer. Technical report, Freescale Semiconductor, October 2004.

[GG00] Pierre Guerrier and Alain Greiner. A generic architecture for on-chip packet-switched interconnections. In DATE '00: Proceedings of the conference on Design, automation and test in Europe, pages 250–256, New York, NY, USA, 2000. ACM.

[GHF+05] Michael Gschwind, Peter Hofstee, Brian Flachs, Martin Hopkins, Yukio Watanabe, and Takeshi Yamazaki. A novel simd architecture for the cell heterogeneous chip-multiprocessor. IBM, 2005.

[GLS73] George P. Copeland Jr., G. J. Lipovski, and Stanley Y. W. Su. The architecture of cassm: A cellular system for non-numeric processing. SIGARCH Comput. Archit. News, 2(4):121–128, 1973.

[GS04] Brian J. Gough and Richard M. Stallman. An Introduction to GCC, chapter 6. Network Theory Ltd., 2004.

[Han98] Jim Handy. The Cache Memory Book. Academic Press, 2nd edition, 1998.

[HB07] Simon Harding and Wolfgang Banzhaf. Fast genetic programming on GPUs. In Marc Ebner, Michael O'Neill, Anikó Ekárt, Leonardo Vanneschi, and Anna Isabel Esparcia-Alcázar, editors, Proceedings of the 10th European Conference on Genetic Programming, volume 4445 of Lecture Notes in Computer Science, pages 90–101, Valencia, Spain, 11 - 13 April 2007. Springer.

[Hea95] Steve Heath. Microprocessor Architectures RISC, CISC and DSP, chapter 8. Newnes, 2nd edition, 1995.

[Hil88] Mark D. Hill. A case for direct-mapped caches. Computer, 21(12):25–40, 1988.

[Hip07] D. Richard Hipp. The virtual database engine of sqlite, 2007.

[HLW87] Gary Herman, K. C. Lee, and Abel Weinrib. The datacycle architecture for very high throughput database systems. In SIGMOD ’87: Proceedings of the 1987 ACM SIGMOD international conference on Management of data, pages 97–103, New York, NY, USA, 1987. ACM.

[HP96] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, chapter 2,5,8,C,E. Morgan Kaufmann, 2nd edition, 1996.

[HS89] Mark D. Hill and Alan Jay Smith. Evaluating associativity in cpu caches. IEEE Transactions on Computers, 38(12):1612–1630, December 1989.

[Int07] International Business Machines Corporation. Power ISA Version 2.05, October 2007.

[ISH+91] U. Inoue, T. Satoh, H. Hayami, H. Takeda, T. Nakamura, and H. Fukuoka. Rinda: a relational database processor with hardware specialized for searching and sorting. Micro, IEEE, 11(6):61–70, Dec 1991.

[JED07] JEDEC Solid State Technology Association. JEDEC Standard: Specialty DDR2-1066 SDRAM, November 2007.

[JED08] JEDEC Solid State Technology Association. JEDEC Standard: DDR2 SDRAM Specification, April 2008.

[Jon05] M. Tim Jones. Optimization in gcc. http://www.linuxjournal.com/article/7269, January 2005.

[Kan81] Gerry Kane. 68000 Microprocessor Handbook. Osborne/McGraw-Hill, 1981.

[Kan87] Gerry Kane. MIPS R2000 RISC Architecture. Prentice Hall, 1987.

[KG05] Sen M. Kuo and Woon-Seng Gan. Digital Signal Processors: Architectures, Implementations, and Applications. Pearson Education Inc, 1 edition, 2005.

[Knu69] Donald E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1 of Computer Science and Information Processing. Addison-Wesley, 2nd edition, 1969.

[Knu73] Donald E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3 of Computer Science and Information Processing. Addison-Wesley, 1st edition, 1973.

[Knu81] Donald E. Knuth. The Art of Computer Programming: Seminumerical Algorithms, volume 2 of Computer Science and Information Processing. Addison-Wesley, 2nd edition, 1981.

[Koo89] Philip J. Koopman. Stack Computers: The New Wave, chapter 1-9,B-C. Ellis Horwood, 1989.

[Kor87] James F. Korsh. Data structures, algorithms, and program style. PWS Publishers, 1987.

[KTJR05] Rakesh Kumar, Dean M. Tullsen, Norman P. Jouppi, and Parthasarathy Ranganathan. Heterogeneous chip multiprocessors. Computer, 38(11):32–38, 2005.

[KY05] David Kaeli and Pen-Chung Yew, editors. Speculative Execution in High Performance Architectures. Computer and Information Science Series. Chapman & Hall, 2005.

[LaF06] Eric LaForest. Next generation stack computing, 2006.

[Lan07] Joe Landman. The need for acceleration technologies to achieve cost-effective supercomputing performance for advanced applications. Technical report, AMD, 2007.

[LB08] William B. Langdon and Wolfgang Banzhaf. A SIMD interpreter for genetic programming on GPU graphics cards. In Michael O'Neill, Leonardo Vanneschi, Steven Gustafson, Anna Isabel Esparcia Alcazar, Ivanoe De Falco, Antonio Della Cioppa, and Ernesto Tarantino, editors, Proceedings of the 11th European Conference on Genetic Programming, EuroGP 2008, volume 4971 of Lecture Notes in Computer Science, pages 73–85, Naples, 26-28 March 2008. Springer.

[LCM+06] Damjan Lampret, Chen-Min Chen, Marko Minar, Johan Rydberg, Matan Ziv-Av, Bob Gardner, Chris Ziomkowski, Greg McGary, Rohit Mathur, and Maria Bolado. OpenRISC 1000 Architecture Manual. OpenCores.Org, April 2006.

[LFM88] K. C. Lee, O. Frieder, and V. Mak. A parallel vlsi architecture for unformatted data processing. In DPDS '88: Proceedings of the first international symposium on Databases in parallel and distributed systems, pages 80–86, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press.

[Lin08] Joseph Lin. Rambus memory technologies update. www.rambus.com, June 2008.

[LSY02] Ruby Lee, Zhijie Shi, and Xiao Yang. How a processor can permutate n bits in o(1) cycles. In Hot Chips, number 14, 2002.

[LZJ06] ZhongHai Lu, MingChen Zhong, and Axel Jantsch. Evaluation of on-chip networks using deflection routing. In GLSVLSI ’06: Proceedings of the 16th ACM Great Lakes symposium on VLSI, pages 296–301, New York, NY, USA, 2006. ACM Press.

[MA06] MySQL-AB. Mysql 5.1 reference manual, Aug 2006.

[McC07] Ian McCallum. Intel quickassist technology accelerator abstraction layer (aal). Technical report, Intel, 2007.

[McF06] Grant McFarland. Microprocessor Design. McGraw-Hill, 2006.

[Mer08] Rick Merritt. Cpu designers debate multi-core future. EE-Times, February 2008.

[MHH02] Oskar Mencer, Zhining Huang, and Lorenz Huelsbergen. Hagar: Efficient multi-context graph processors. In 12th International Conference on Field-Programmable Logic and Applications, pages 915–924. Springer, 2002.

[Mil00] Veljko Milutinović. Surviving the Design of Microprocessor and Multiprocessor Systems. John Wiley & Sons Inc, 2000.

[MK04] Morris M. Mano and Charles R. Kime. Logic and Computer Design Fundamentals, chapter 9,14. Pearson Prentice-Hall, 3rd edition, 2004.

[NK04] Anna Nepomniaschaya and Zbigniew Kokosinski. Associative graph processor and its properties. In PARELEC '04: Proceedings of the international conference on Parallel Computing in Electrical Engineering, pages 297–302, Washington, DC, USA, 2004. IEEE Computer Society.

[NP08] Wolfgang Nejdl and Raluca Paiu. I know I stored it somewhere - contextual information and ranking on our desktop. 2008.

[Okl01] Vojin G. Oklobdzija. The Computer Engineering Handbook: Electrical Engineering Handbook. CRC Press, Inc., Boca Raton, FL, USA, 2001.

[Pay00] Bernd Paysan. A four stack processor, 2000.

[PDG05] PostgreSQL-Development-Group. Postgresql 8.1 documentation, 2005.

[Pel05] Stephen Pelc. Programming Forth, chapter 2,5. Microprocessor Engineering Limited, 2005.

[PH05] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, chapter 2,7,9,C,D. Morgan Kaufmann, 2005.

[Por] James N. Porter. Five decades of disk drive industry firsts. http://www.disktrend.com/5decades2.htm.

[PS06a] K. Pagiamtzis and A. Sheikholeslami. Content-addressable memory (cam) circuits and architectures: a tutorial and survey. Solid-State Circuits, IEEE Journal of, 41(3):712–727, March 2006.

[PS06b] Kostas Pagiamtzis and Ali Sheikholeslami. Content-addressable memory (CAM) circuits and architectures: A tutorial and survey. IEEE Journal of Solid-State Circuits, 41(3):712–727, March 2006.

[Rob78] David C. Roberts. A specialized computer architecture for text retrieval. In CAW ’78: Proceedings of the fourth workshop on Computer architecture for non-numeric processing, pages 51–59, New York, NY, USA, 1978. ACM.

[RSK04] Pamela Ravasio, Sissel Guttormsen Schär, and Helmut Krueger. In pursuit of desktop evolution: User problems and practices with modern desktop systems. ACM Trans. Comput.-Hum. Interact., 11(2):156–180, 2004.

[Sak02] Dan Saks. Representing and manipulating hardware in standard c and c++. Embedded Systems Conference San Francisco, 2002.

[SB88] Gerard Salton and Chris Buckley. Parallel text search methods. Communications of the ACM, 31(2):202–215, Feb 1988.

[SD02] Nick Sawyer and Marc Defossez. Quad-Port Memories in Virtex Devices. Xilinx Inc, September 2002. XAPP228.

[SF96] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms, chapter 5-8. Addison-Wesley, 1st edition, 1996.

[Shi06] Sajjan G. Shiva. Advanced Computer Architecture. Taylor & Francis, 1st edition, 2006.

[Sil02] Silicore and Opencores. WISHBONE System-on-Chip (SOC) Interconnect Architecture for Portable IP Cores, b3 edition, Sept 2002.

[SKV+06] David Sheldon, Rakesh Kumar, Frank Vahid, Dean Tullsen, and Roman Lysecky. Conjoining soft-core fpga processors. In ICCAD ’06: Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design, pages 694–701, New York, NY, USA, 2006. ACM.

[SL75] Stanley Y. W. Su and G. Jack Lipovski. Cassm: a cellular system for very large data bases. In VLDB '75: Proceedings of the 1st International Conference on Very Large Data Bases, pages 456–472, New York, NY, USA, 1975. ACM.

[SL95] Alexander Stepanov and Meng Lee. The standard template library. Technical Report 95-11, HP Laboratories, November 1995.

[Smy03] Bill Smyth. Computing Patterns in Strings. Pearson Addison-Wesley, 1st edition, 2003.

[SSTN03] Ilkka Saastamoinen, David Sigüenza-Tortosa, and Jari Nurmi. An ip-based on-chip packet-switched network. pages 193–213, 2003.

[Sta06] William Stallings. Computer Organization & Architecture: Designing for Performance, chapter 18. Pearson Prentice-Hall, 7th edition, 2006.

[Ste06] Alexander Stepanov. Short history of stl, August 2006.

[Sto90] Harold S. Stone. High-Performance Computer Architecture. Addison-Wesley, 2nd edition, 1990.

[Str94] Bjarne Stroustrup. The Design and Evolution of C++. Addison-Wesley Pub Co, March 1994.

[Sun06] Sun Microsystems, Inc. OpenSPARC T1 Specification, August 2006.

[Tan04] Shawn Tan. AEMB: 32-bit RISC Microprocessor Core Data Sheet. OpenCores.Org, 2004.

[Tan05] Andrew S. Tanenbaum. Structured Computer Organization. Pearson Prentice-Hall, 5th edition, 2005.

[TDB+06] Xuan-Tu Tran, Jean Durupt, François Bertrand, Vincent Beroulle, and Chantal Robach. A dft architecture for asynchronous networks-on-chip. In ETS '06: Proceedings of the Eleventh IEEE European Test Symposium, pages 219–224, Washington, DC, USA, 2006. IEEE Computer Society.

[TEL95] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithreading: maximizing on-chip parallelism. In ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture, pages 392–403, New York, NY, USA, 1995. ACM.

[van02] Ruud van der Pas. Memory hierarchy in cache-based systems. Technical report, Sun Microsystems, November 2002.

[Vir03] Virtual Silicon Inc. Virtual Silicon: 0.13um High Density Standard Cell Library, 1.2 edition, Aug 2003.

[Vir04] Virtual Silicon Inc. Virtual Silicon: 0.18um VIP Standard Cell Library Tape Out Ready, 1.0 edition, Jul 2004.

[Vit01] Jeffrey Scott Vitter. External memory algorithms and data structures: deal- ing with massive data. ACM Comput. Surv., 33(2):209–271, 2001.

[VST03a] Dual-Port SRAM Compiler UMC 0.13um (L130E-HS-FSG), June 2003.

[VST03b] Single-Port SRAM Compiler UMC 0.13um (L130E-HS-FSG), March 2003.

[VST03c] Two-Port SRAM Compiler UMC 0.13um (L130E-HS-FSG), June 2003.

[VST04] Single-Port SRAM Compiler UMC 0.18um (L180 GII), August 2004.

[Wal95] David W. Wall. Limits of instruction-level parallelism. pages 432–444, 1995.

[Wik09a] Wikipedia. Desktop search — wikipedia, the free encyclopedia, 2009. [Online; accessed 17-March-2009].

[Wik09b] Wikipedia. Non-uniform memory access — wikipedia, the free encyclopedia, 2009. [Online; accessed 17-March-2009].

[Wik09c] Wikipedia. Stored procedure — wikipedia, the free encyclopedia, 2009. [Online; accessed 17-March-2009].

[Wik09d] Wikipedia. — wikipedia, the free encyclopedia, 2009. [Online; accessed 17-March-2009].

[Woo08] Steven Woo. Memory system challenges in the multi-core era. Presented at MEMCON’08, July 2008.

[Xil04] Xilinx, Inc. Microblaze Processor Reference Guide: EDK6.2i, June 2004.

[Xil05] Xilinx Inc. Using Block RAM in Spartan3 Generation FPGAs, March 2005. XAPP463.

[Xil08] Xilinx, Inc. Spartan-3A FPGA Family: Data Sheet, April 2008.

[ZZ05] Zhichun Zhu and Zhao Zhang. A performance comparison of dram memory system optimizations for smt processors. In HPCA ’05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 213–224, Washington, DC, USA, 2005. IEEE Computer Society.

