Design and Development of a Heterogeneous Hardware Search Accelerator

This dissertation is submitted for the degree of Doctor of Philosophy

Tan, Shawn Ser Ngiap

Magdalene College

May 21, 2009

Abstract

Search is a fundamental computing problem and is used in any number of applications that are invading our everyday lives. However, it has not received as much attention as other fundamental computing problems. Historically, there have been several attempts at designing complex machines to accelerate search applications. However, with the cost of transistors falling dramatically, it may be useful to design a novel on-chip hardware accelerator for search applications.

A search application is any application that traverses a data set in order to find one or more records that meet certain fitting criteria. These applications can be broken down into several low-level operations, which can be accelerated by specialised hardware units. A special search stack can be used to visualise the different levels of a search operation. Three hardware accelerator units were designed to work alongside a host processor. A significant speed-up in performance when compared against pure software solutions was observed under ideal simulation conditions. An unconventional method for virtually saving and loading search data was developed within the simulation construct to reduce simulation time.

This method of acceleration is not the only possible solution, as search can be accelerated at a number of levels. However, the proposed architecture is unique in the way that the accelerator units can be combined like LEGO bricks, giving this solution flexibility and scalability. Search is memory intensive, but the performance of regular cache memory that exploits temporal and spatial locality was found wanting. A cache memory that exploited structural locality instead of temporal and spatial locality was therefore developed to improve performance.

As search is a fundamental computational operation, it is used in almost every application, not just obvious search applications. Therefore, the hardware accelerator units can be applied to almost every software application. Obvious examples include genetics and law enforcement, while less obvious examples include gaming and operating system software. In fact, it would be useful to integrate accelerator units with slower processors to improve general search performance.

The accelerator units can be implemented using an off-the-shelf FPGA at speeds of around 200MHz, or in ASIC for 333MHz (0.35µm) and 1.0GHz (0.18µm) operation. A regular FPGA is able to accelerate up to five parallel simple queries, two heterogeneous boolean queries, or a combination of the two when used with regular DDR2 memory. This solution is particularly low-cost for accelerating search, avoiding the need for expensive system-level solutions.

Declaration

I hereby declare that my thesis entitled Design and Development of a Heterogeneous Hardware Search Accelerator is not substantially the same as any that I have submitted for a degree or diploma or other qualification at any other University. I further state that no part of my thesis has already been or is being concurrently submitted for any such degree, diploma or other qualification. This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the limit of length prescribed by the Degree Committee of the Engineering Department. The length of my thesis is approximately 45,000 words with 41 figures and 25 listings.

Signed,

Shawn Tan

Acknowledgements

I would like to take this opportunity to express my gratitude to the following people who have helped me, in one way or another, throughout the duration of my research at Cambridge and the write-up at home in Malaysia.

Dr David Holburn, for being the nicest supervisor that one can hope for, without whom this work would be difficult to accomplish. I want to express my thanks for everything you’ve done for me in the past four years; welcoming me into your family, getting things done within the department and patiently reading through my thesis.

All the members of the department and division, for making it a nice place and easy environment to work in. Mr Stephen Mounsey, Mr John Norcott and Miss Eleanor Blair for technical assistance in setting up the various software tools that I needed. Mr Mick Furber for all the assistance in the electrical teaching lab.

Friends from college, for helping me through tough times and keeping me sane. Jack Nie for helping me print out my thesis and handling all of the administrative issues in submitting my thesis. Drs Ray Chan and Ming Yeong Lim for being my companions on my many travels. Zen Cho for being my shoulder to cry on when things were not going well.

All my friends and family in Malaysia, for their belief in me and support throughout the duration of this research. I would like to thank my sister and my parents for all the patience and tolerance that they have shown me during the final stretch of this work. My niece and nephews, Jarellynn, Jareick and Jarell for lending me their bubbling energy when I needed a boost. This thesis is dedicated to them.

Contents

1 Introduction
  1.1 Justifying Search Acceleration
  1.2 Historical Justification
  1.3 Objectives

2 Search Basics
  2.1 Search Stack
  2.2 Categorising Search
    2.2.1 Primary Search
    2.2.2 Secondary Search
  2.3 Data Structures & Algorithms
    2.3.1 Data Structures
    2.3.2 Algorithms
  2.4 Search Problems

3 Search Application
  3.1 Search Application
    3.1.1 Example Query
    3.1.2 Pipeline Breakdown
    3.1.3 Query Illustration
  3.2 Search Profile
    3.2.1 Key Search
    3.2.2 List Retrieval
    3.2.3 Result Collation
    3.2.4 Overall Profile

4 General Architecture
  4.1 Initial Considerations
  4.2 Hardware Architecture
    4.2.1 Multi-Core Processing
    4.2.2 Word Size
    4.2.3 Host Processor
  4.3 Software Architecture
    4.3.1 Software Toolchain
    4.3.2 Standard Libraries
    4.3.3 Custom Library
  4.4 Initial Architecture
    4.4.1 Stack Processors

5 Streamer Unit
  5.1 Introduction
    5.1.1 Design Considerations
  5.2 Architecture
    5.2.1 Configuration
    5.2.2 Operating Modes
    5.2.3 State Machine
  5.3 Streamer Simulation
    5.3.1 Kernel Functional Simulation
    5.3.2 Kernel Timing Simulation
    5.3.3 Kernel Performance Simulation
  5.4 Conclusion

6 Sieve Unit
  6.1 Introduction
    6.1.1 Design Considerations
  6.2 Architecture
    6.2.1 Configuration
    6.2.2 Modes
    6.2.3 Operation
  6.3 Simulation Results
    6.3.1 Kernel Functional Simulation
    6.3.2 Kernel Software Pump Timing
    6.3.3 Kernel Software Pump Performance
    6.3.4 Kernel Hardware Pipe Timing
    6.3.5 Kernel Hardware Pipe Performance
  6.4 Conclusion

7 Chaser Unit
  7.1 Introduction
    7.1.1 Design Considerations
  7.2 Chaser Architecture
    7.2.1 Configuration
    7.2.2 Operation
  7.3 Kernel Simulation Results
    7.3.1 Kernel Functional Simulation
    7.3.2 Kernel Single Key Timing
    7.3.3 Kernel Single Key Performance
    7.3.4 Kernel Multi Key Timing
    7.3.5 Kernel Multi Key Performance
  7.4 Conclusion

8 Memory Interface
  8.1 Introduction
  8.2 Cache Primer
  8.3 Cache Principles
    8.3.1 Instruction Cache
    8.3.2 Data Cache
  8.4 Cache Parameters
    8.4.1 Instruction Cache
    8.4.2 Data Cache Trends (Repeat Key)
    8.4.3 Data Cache Trends (Random Key)
  8.5 Data Cache Prefetching
    8.5.1 Static Prefetching
    8.5.2 Dynamic Prefetching
    8.5.3 Prefetched Data Cache
  8.6 Cache Integration
    8.6.1 Cache Size Ratio
    8.6.2 Structural Locality
  8.7 Conclusion

9 Search Pipelines
  9.1 Pipelines
    9.1.1 Primary Search
    9.1.2 Simple Query
    9.1.3 Range Query
    9.1.4 Boolean Query
  9.2 System Pipelining

10 Implementation
  10.1 Fabric Architectures
    10.1.1 Dynamic Fabric
    10.1.2 Static Fabric
  10.2 Integration Architectures
    10.2.1 Tight Coupling
    10.2.2 Loose Coupling
  10.3 FPGA Implementation
    10.3.1 Chaser Implementation
    10.3.2 Streamer Implementation
    10.3.3 Sieve Implementation
    10.3.4 Resource & Power
    10.3.5 Physical Limits
  10.4 ASIC Implementation
    10.4.1 Area Estimates
    10.4.2 Power Estimates
    10.4.3 Speed Estimates
  10.5 Cost Estimates
  10.6 Conclusion

11 Analysis & Synthesis
  11.1 Important Questions
  11.2 Host Processor Performance
    11.2.1 Software Optimisation
    11.2.2 Processor Architecture
  11.3 Scalability
    11.3.1 Processor Scalability
    11.3.2 Accelerator Scalability
    11.3.3 Memory Scalability
  11.4 Acceleration Cost
    11.4.1 Configuration A
    11.4.2 Configuration B
    11.4.3 Configuration Comparisons
  11.5 Alternative Technologies
    11.5.1 Improved Software
    11.5.2 Content-Addressable Memories
    11.5.3 Multicore Processors
    11.5.4 Data Graph Processors
    11.5.5 Other Processors
  11.6 Suggestions for Future Work
    11.6.1 Conjoining Arithmetic Units
    11.6.2 Conjoining Stream Buffers
    11.6.3 Memory Interface

12 Conclusion

List of Figures

2.1 Search abstraction stack
3.1 Typical search pipeline
4.1 Initial hardware search accelerator architecture
4.2 Initial stack based accelerator architecture
5.1 Streamer dataflow
5.2 Streamer block
5.3 Streamer configuration stack
5.4 Streamer operating modes
5.5 Streamer machine states
5.6 Accelerator unit simulation setup
5.7 Streamer timing diagram
5.8 Streamer performance simulation
6.1 Sieve dataflow
6.2 Sieve block
6.3 Sieve configuration register
6.4 Sieve operating modes
6.5 Sieve FSM
6.6 Sieve software pumped timing diagram
6.7 Sieve software pumped simulation
6.8 Sieve with hardware piped timing diagram
6.9 Sieve with streamer piped simulation
7.1 Chaser dataflow
7.2 Chaser unit block
7.3 Chaser configuration stack
7.4 Chaser machine states
7.5 Single key chaser timing diagram
7.6 Chaser simulation
7.7 Multiple key chase kernel timing
7.8 Chaser simulation (multi-key)
8.1 Cache simulation setup
8.2 Basic cache operation
8.3 Instruction cache hit ratio
8.4 Repetitive heap cache
8.5 Random heap cache
8.6 Random heap cache (with prefetch)
8.7 Cache structure comparison
8.8 Structural cache architecture
9.1 Search pipeline abstraction
10.1 Implementation architectures
10.2 System level implementation
10.3 ASIC area and power estimates

List of Tables

3.1 Search Profiles
10.1 ASIC area and power estimates at speed
10.2 Fabrication cost per accelerator unit
11.1 Code profile for std::set::find()
11.2 Specifications for 0.35µm CMOS DPRAM blocks

Listings

3.1 Verilog profiling construct
3.2 Key search profile kernel
3.3 List retrieval profile kernel
3.4 Result collation profile kernel
5.5 Streaming pseudocode
5.2 Software streamer kernel
5.3 Hardware streamer kernel
5.4 Streamer kernel
6.1 Sieve software kernel
6.2 Sieve hardware kernel
6.3 Sieve kernel
6.6 Hardware streamer-sieve kernel
7.2 Software chaser kernel
7.3 Hardware chaser kernel
7.4 Chaser kernel
7.6 Software multi-key chaser kernel
7.7 Hardware multi-key chaser kernel
8.1 Verilog simulation LOAD/SAVE
8.2 Cache tree fill kernel
8.3 Cache simulation kernel
11.1 AEMB disassembly (GCC 4.1.1)
11.2 ARM disassembly (GCC 4.2.3)
11.3 PPC disassembly (GCC 4.1.1)
11.4 68K disassembly (GCC 3.4.6)
11.5 disassembly (GCC 4.2.3)

List of Reports

10.1 Chaser FPGA implementation results (excerpt)
10.2 Streamer FPGA implementation results (excerpt)
10.3 Sieve FPGA implementation results (excerpt)

CHAPTER 1

Introduction

Search is a fundamental problem in computing and, as computers are increasingly invading our everyday lives, search is also becoming an everyday problem for everyone. Historically, search has received less attention than other computing problems. The main objective of this research is to design a hardware device that can offload the mundane tasks from a host processor and speed up bottlenecks in search processing.

1.1 Justifying Search Acceleration

Search is becoming increasingly important in the consumer space. Where once it was the province of massive systems owned by large corporations, search is moving downstream. This is evident with the present emphasis placed on desktop search1 and other localised search applications. Search has grown from being a fundamental computing problem into an everyday problem[RSK04] for everyone. Computing is also becoming ever more personal, with mobile devices today far exceeding the computing power of enterprise servers of the past. As a result, our personal computing devices have to juggle more information than before. Personal computing storage capacities have grown from the tens of megabytes in the 1980s to the hundreds of gigabytes of today[Por]. This is reflective of the amount of data search applications have to work on. So, it will be useful to see how modern search might be accelerated today.

1Desktop search is the name for the field of search tools which search the contents of a user’s own computer files, rather than searching the Internet[Wik09a].

With transistors being so cheap, there is good reason to explore ways of adding processor functionality to improve performance and add value. The floating-point unit has become an integral component in general-purpose computers to accelerate floating-point calculations. Graphics accelerators are also being integrated into general-purpose computers for stream processing2, which is useful for media-centric and scientific computations[HB07, LB08]. For everything else, the generic solution to every problem is to devote more general-purpose computing power to it.

According to [Knu73], search is the most time-consuming part of many programs, and the substitution of a good search method for a bad one often leads to a substantial increase in performance; in fact, it is sometimes possible to arrange the data or the data structure so that searching is eliminated entirely. However, there are many cases where search is necessary, so it is important to have efficient algorithms for searching. Although the problem of sorting received considerable attention in the earliest days of computing, less has been done about searching.

Search can be loosely defined as traversing a data space to find solutions that fit a set of criteria. As an abstract task, it is not limited to just database search and similar applications. Almost every task performed by a computer involves some form of search. Some abstract examples include chess playing and cryptography, while less obvious examples include task scheduling and language parsing. Even the simple task of creating a document with a modern word processor will cause a search to be performed many times, for both grammar and spell checking.

In less abstract applications, search is involved in almost every aspect of data manipulation, regardless of how the data is organised in a computer or for what application. Besides performing a search to find a record, searches are also performed during record insertion, updating and deletion. Therefore, search is a very fundamental computing task and any hardware acceleration of search would contribute significantly to overall programme speed.

The search problem can be solved by devoting more hardware resources to it or by designing better algorithms. In the case of the leading search engine in the world, Google, both methods are employed. Google uses a patented algorithm called PageRank3 to improve search result quality, while also employing parallelisation on a massive scale to perform complex searches with lightning speed. There are some lessons that can be taken from this approach. However, there may be a more elegant way of tackling the problem, one that involves using less, not more, hardware to support the algorithms.

This research sets out to answer the question by looking at search algorithms, how they behave and which parts of the

2Stream processing is a computer programming paradigm, related to SIMD, that allows some applications to more easily exploit a limited form of parallel processing[Wik09d].
3http://web.archive.org/web/20071114010112/http://www.google.com/corporate/tech.html

algorithm cause bottlenecks in the microprocessor. It then attempts to develop a versatile architecture that can support search acceleration in hardware and measures the potential performance of such an accelerator.

Chapters 2 and 3 address the software issues: search is defined and classified into different categories, and the problems with each are identified. Chapters 5 through 10 detail the hardware accelerators, their design considerations, functional configurations, operating modes and implementation technologies. Chapter 11 discusses various overarching issues including the validity of the results, the scalability of the architecture and other potential competing solutions.

1.2 Historical Justification

As mentioned earlier, search has received less attention than sorting [Knu73] and this is also evident in hardware. While there have been past attempts at designing a search processor, there are not many major ones. Some of these attempts initially targeted non-indexed queries, treated search as a computational problem and devoted more processing power to it. As mentioned in [Sto90], the performance gains are at the expense of sorting: in most cases, simply indexing the database would reduce the effectiveness of such solutions. However, there are also some solutions that used unique storage hardware to perform acceleration at a fine-grained level. The following traces some of the major evolutions of the past.

CASSM [GLS73, SL75] was a cellular system for very large databases. It was an early research effort that looked into hardware methods for accelerating search applications and is one that is often cited. It focused on a context addressed cellular system for information processing using a unique but inexpensive large memory device. This device allowed the creation of hardware dependent data structures that were closer to the abstraction of the data as perceived by a human, rather than a machine. Therefore, high level search queries were implemented directly in this device. This memory device was implemented using a floppy disk but could be expanded to include other storage mechanisms including electronic memory. These devices were used in a distributed fashion in order to increase parallelism, using a number of non-numeric microprocessors to process the data in a parallel and associative manner.

DIRECT [DeW78] was a multiprocessor architecture for supporting relational database management systems. It was a form of MIMD computer using a number of microprogrammable off-the-shelf PDP11 microprocessors. These processors were attached to pseudo-associative memory through a cross-point switch. The number of processors allocated to a specific query was dynamically determined, based on the complexity and

size of the query. It was software compatible, ran a modified version of Ingres and could be used as a relational database accelerator. Its operations were database specific, including such primitives as CREATEDB, DESTROY, NEXTPAGE, JOIN, INSERT, RESTRICT and other database specific operations. Therefore, it had a custom programming language that used its query primitives like assembly opcodes. The resultant programming code resembled the stored procedure4 languages used in present day databases. However, it did not use indices.

CAFS [Bab79] attempted to build an entire relational database system by means of specialised hardware. It claimed that regular computers were fundamentally unsuited to implementing relational operations and that database systems were ultimately I/O limited. Therefore, this system used content-addressable hardware, even at the disk level, to speed up relational queries. It worked on both indexed and non-indexed queries using temporary storage as a core hardware index. This allowed complex relational operations to be accelerated by manipulating the information stored in this hardware index. However, it ran on specific types of hardware and was limited to searching and filtering table rows stored in disk storage.

GAMMA [DGG+86] was a relational database machine that exploited dataflow query processing techniques. It was built as a cluster of off-the-shelf VAX11 based machines connected via a token ring network and was a direct descendant of DIRECT. However, it took into account the fact that the use of indices would improve search performance tremendously by reducing I/O transactions. In addition to the I/O bandwidth limitation, this machine tried to address the bandwidth limitation in the message passing interface of a multiprocessor system. This machine demonstrated that parallelism could be made to work in a database machine context. Moreover, it also showed how parallelism could be controlled with minimum overhead through a combination of hashing based algorithms and pipelining between processes. However, this was an expensive demonstration of how standard computing power can be scaled in a cluster to perform search acceleration.

GRACE [FKT86] was a parallel relational database machine. It was also built as a cluster of off-the-shelf machines connected in two rings: a processing ring and a staging ring. Both rings shared the same cluster of shared memory modules. While previous machines employed processing at the database page level, this machine worked on databases at a higher level of granularity: the task level. Each task used a number of primitive database oriented operations including joins, selections, sorting and relational algebra.

4A stored procedure is a subroutine available to applications accessing a relational database system[Wik09c].

The machine tried to achieve high performance for join-intensive applications by using a data stream oriented processing technique. A parallel join algorithm based on the clustering property of hashing and sorting was used to support this processing technique. In addition, it reduced the I/O bottleneck by using a combination of unique algorithms and disk systems.

RINDA [ISH+91] was a relational database machine with specialised hardware for searching and sorting. It was built as a cluster of standard computers with standard disk controllers and storage. This database processor accelerated non-indexed relational database queries. As the data was non-indexed and consequently unsorted, it had to handle both a search problem and a sorting problem at the same time. It was composed of content search processors and relational operation accelerating processors: the former searched rows stored in disk storage, while the latter sorted rows stored in main memory. The processors connected to a general-purpose host computer through channel interfaces.

GREO [FK93] was a commercial database machine based on a pipelined hardware sorter. This machine was designed for commercial usage in existing installations. As the clients were not interested in rewriting whole applications to cater to a new architecture, support for legacy data structures and algorithms was important. It was made up of a hardware merge sorter alongside a number of data stream microprocessors. The data stream microprocessors were made up of a number of MC68020 microprocessors in a board level multi-processor system. These processors performed the database primitive operations such as selections, projections, joins and other computations. The host computer compiled a given query into a sorting-oriented dataflow graph, which was executed by the hardware sorter and the 68K microprocessors.

It would seem that in practically all of these cases, the solution presented is one that employs various multi-processor arrangements, from the networked cluster level down to the system board level. The solutions were not targeted at chip level integration, including those that employed custom chips at a board level. This may be due to the nature of the industry and cost constraints on building systems on chip. However, with the low cost of transistors today, it is feasible to explore building a chip level accelerator.

Some of these solutions also employed unique storage hardware and content addressable storage to improve I/O performance. While these solutions are fast, customised storage hardware would introduce incompatibilities with existing computing architectures, and content addressable devices are notoriously expensive to build. Therefore, these are custom solutions that would be difficult to implement in today's world where computing is a commodity.

1.3 Objectives

This document is broadly organised according to the objectives of this research, which are summarised as follows:

• Justify the need for hardware search acceleration. Search is shown to be an important and common operation that is performed by a computer. Therefore, it will be beneficial to accelerate the operation, while incurring the minimum penalty of additional hardware and software.

• Categorise the types of search and the problems faced by each. Search needs to be understood and broken down into sub-problems. Each sub-problem can then be studied and accelerated in hardware.

• Design a hardware device that can be used to accelerate common present-day search operations in a cost effective manner. The term accelerate is defined as comprising two functions:

1. It will need to offload many of the mundane search processing tasks from the host processor. This will free up the host processor to perform other computational operations.

2. It will be designed to speed up the bottlenecks in the search operation. This speed-up will be obtained by performing many of the operations in hardware, at the fastest rate possible.

CHAPTER 2

Search Basics

A search stack helps visualise how the different hardware and software components work together in any search application. The survey of the software layers begins with the primary search and secondary search layers, which each exhibit different characteristics and encounter different problems. It is also important to know the basic data structures and algorithms that are used in search applications, as they are used in application software. In the end, the problems of search become evident and regular methods may not address them adequately.

2.1 Search Stack

The search stack of figure 2.1 is an abstraction framework for illustrating how different parts of the hardware and software fit together to perform search operations. Although the details of each layer in the stack may not be evident at the moment, they will be further elaborated in subsequent chapters. The search stack is separated into hardware and software abstraction layers. The hardware layers are composed of hardware devices, from the accelerator units to the host processor, while the software layers represent the different software functions performed during a search operation. The different layers are dependent upon each other and, as a result, any improvement in search performance on one layer will ultimately improve the overall performance. As the stack clearly illustrates, improvements can come from either the software or hardware domains, and each layer can be substituted with alternative technologies that perform the same functions.

Software layers:   Software Application
                   Secondary Search
                   Primary Search
                   Interface Library
Hardware layers:   Host Processor
                   Search Pipeline
                   Accelerator Units

Figure 2.1: Search abstraction stack

Software Application: This represents the user application that uses search operations and ultimately benefits from any form of search acceleration. These applications include classic examples such as database applications but are by no means limited to them. Depending on the application, it may be possible for the software application to directly use the primary search, but it will normally depend directly on secondary searches.

Secondary Search: This represents the software algorithms that perform different types of queries classified as secondary searches. These algorithms form the bulk of complex search operations used in different kinds of software applications. All these algorithms are fundamentally dependent on primary search primitives.

Primary Search: This represents the basic software algorithms that perform primitive searches directly on fundamental data structures. As mentioned above, certain applications may only depend on primary searches and not use secondary searches.

Interface Library: This represents the interface layer between the hardware and software layers. It provides the software hooks that convert the software requirements into hardware machine operations. This can take the form of a driver, a shared library, a language dependent source library or some other form dependent on application requirements.

Host Processor: This represents the main processor that runs the application software. The host processor is controlled by the software layers through the interface library and controls the hardware layers below to actually perform the search operation. Different host processor architectures may be used, as long as the interface library is changed accordingly.

Search Pipeline: This represents a specific combination of accelerator units that can be used to accelerate search operations. The pipeline is controlled by the host processor and performs the different operation stages by using the different accelerator units.

Accelerator Units: This represents the hardware primitive units that perform the basic operations of search. These are made up of the chaser, streamer and sieve units that provide the acceleration capability in hardware. These can be replaced by alternative hardware technologies, some of which are discussed later.

It may be useful to keep a mental picture of this search stack while reading the rest of this document. It will not appear again until the end of the document. The bulk of this document is loosely organised around this search stack.

2.2 Categorising Search

Different texts categorise search algorithms differently. As an example, in [Knu73], search is broadly divided into the following categories:

Internal and External searches are defined based on the data storage method. An internal search uses data stored only inside primary memory. An external search involves data stored inside disk storage.

Static and Dynamic searches are categorised based on the data structure used. A static search uses data that does not change with time. A dynamic search uses data that is subject to frequent record insertion and deletion.

Comparison and Attribute searches are divided based on the algorithm used. A comparison search accomplishes the search by selecting data based on key comparisons. An attribute search does not involve comparisons but filters out data by property flags.

But, for the purpose of this research, search is broadly organised into primary and secondary searches. This method of categorisation was chosen as the two searches present different problems.

2.2.1 Primary Search

Primary search involves searching a data space for primary keys, which uniquely identify a specific record within the data space. These keys are usually, though not necessarily, sorted into an in-memory index. The decision to sort the keys depends on how often the search is performed. If search is performed regularly, the cost of sorting the index during insertion will be minimal. The cost of sorting could potentially be further reduced with the aid of special purpose hardware sorting networks. Examples of sorting networks can be found in [CLRS01, Sto90, Shi06].

Primary search is analogous to finding local maxima along a function. In terms of computational resources, this rarely consumes complex operations, unless the values have to be computed on-the-fly. In most cases, comparison algorithms are used to traverse a tree-like structure. Operations are then limited to addition, subtraction and conditional branches: addition is used to keep track of the tree position, subtraction is used to perform a comparison and conditional branching is used for decision making.

From [Sto90], multiple processors will not be very efficient if they are used to perform a search by making multiple probes into a file ordered by a single search key. We cannot expect a multiprocessor to perform any single-key search much faster than a single processor can, but we can expect a multiprocessor to do many different searches in parallel with high efficiency. This becomes obvious when we realise that the surplus processing power provided by multi-core processors is wasted when complex computations are not used. Therefore, multiple processors should be used to conduct multiple independent searches. In turn, this causes memory to become the bottleneck, as memory bandwidth requirements increase linearly with the number of parallel search processes.

According to [Sto90], when database keys are unsorted, a serial search might have to examine the entire database. In this case, a multiprocessor search has the potential for excellent speedup. It is possible to spawn a search process whenever we hit a branch in a tree and have the processes move in opposite directions. But, as [Sto90] notes, the saving from parallelism is not truly the speedup observed; it is the saving in the overhead used to sort the database and maintain that sorted order. If this overhead is small, then the effectiveness of the parallelism is small. In many applications, the cost of sorting or building an index can be amortised over hundreds or thousands of searches, and rarely in such instances does it pay to perform parallel search. On the other hand, some problems in cryptography are essentially enormous searches that are only performed once per data space, where the equivalent of building an index is far more costly than searching the database in parallel with multiple processors.

Therefore, although multiple processors could potentially speed up a search, this approach only works for applications that cannot be indexed. However, similar to the earlier case, memory then becomes the bottleneck. The main technical problem for primary search is therefore the memory bottleneck.
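To make the operation mix concrete, the sketch below (a minimal illustration in C++, not taken from the thesis kernels; the type and field names are assumptions) shows an iterative descent of a pointer-linked binary search tree. The loop body contains only a comparison, a conditional branch and a pointer load, so its speed is dominated by how quickly each node can be fetched from memory rather than by arithmetic throughput.

    #include <cstdint>

    // Minimal binary-tree node; field names are illustrative only.
    struct Node {
        uint32_t key;
        Node*    left;
        Node*    right;
    };

    // Iterative primary-key probe: the loop is just compare, branch and
    // pointer load, so each step costs one (potentially uncached) memory access.
    const Node* probe(const Node* root, uint32_t key) {
        const Node* n = root;
        while (n != nullptr) {
            if (key == n->key)
                return n;                    // hit: unique primary key found
            n = (key < n->key) ? n->left     // comparison decides which
                               : n->right;   // pointer to chase next
        }
        return nullptr;                      // miss
    }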

2.2.2 Secondary Search

Secondary searches offer a different problem. Secondary search involves the search for secondary keys, which are generally non-unique values. Once again, the keys may or may not be sorted into an index. According to [Knu73], secondary search queries are usually restricted to at most the following three types: simple, range and boolean. The problem of discovering efficient search techniques for these three types of queries is already quite difficult, and therefore queries of more complicated types are usually not considered.

Simple Query is the search for a specified key within the search space, such as YEAR = 2008. In many ways, this may look similar to a primary key search except that the results are non-unique. If the keys are sorted, it will return a chain of references to the different data, while an unsorted index would require a complete traversal. As a result, such a search will degenerate into a linear traversal. There are software techniques available to optimise the query; one method is to batch process a few queries at a time. This suffers from the same problems as primary search: it does not scale well and will not benefit significantly from higher computational power.

Range Query is a search for values that fit within a specified range, such as YEAR in [2004:2008]. Just like the simple query, it looks similar to a primary key search and will ultimately degenerate into a linear traversal. Although it is still possible to optimise such queries in software, doing so may involve multiple traversals or more complex comparisons. Therefore, it uses more computational power than a simple query.

Boolean Query can combine any primary and secondary searches with boolean operators. Regardless of how it is optimised in software, it will still involve a large number of traversals. Combining different result sets is easy to visualise graphically, but computationally more difficult. In fact, it is suggested in [Knu73] to let people do part of the work, by providing them with suitable printed indexes to the information, but we will not consider this here. These types of queries are also complicated to optimise in software because the data is not known beforehand.

In the case of a secondary search, a larger proportion of the problem consumes computational power. The most difficult problem is the combination of multiple result sets in the boolean query, and this happens to be a very common form of query on large data sets. Therefore, a major technical problem for secondary search is computational complexity, which makes it a suitable candidate for hardware acceleration.
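As a software reference point only (the thesis kernels are built on the C++ STL, but the container choice, key type and function names below are assumptions), the three query types can be expressed against a sorted secondary index as follows. The boolean case makes the computational burden visible: it is not the index lookups but the combination of result lists that dominates.

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <vector>

    // Hypothetical secondary index: YEAR -> record identifier (non-unique keys).
    using YearIndex = std::multimap<uint32_t, uint32_t>;

    // Simple query: YEAR = 2008 returns a chain of matching record ids.
    std::vector<uint32_t> simpleQuery(const YearIndex& idx, uint32_t year) {
        std::vector<uint32_t> out;
        auto range = idx.equal_range(year);
        for (auto it = range.first; it != range.second; ++it)
            out.push_back(it->second);
        return out;
    }

    // Range query: YEAR in [lo:hi] walks every entry between the two bounds.
    std::vector<uint32_t> rangeQuery(const YearIndex& idx, uint32_t lo, uint32_t hi) {
        std::vector<uint32_t> out;
        for (auto it = idx.lower_bound(lo); it != idx.upper_bound(hi); ++it)
            out.push_back(it->second);
        return out;
    }

    // Boolean query: an AND of two sub-queries reduces to intersecting their
    // (sorted) result lists - the computationally expensive collation step.
    std::vector<uint32_t> booleanAnd(std::vector<uint32_t> a, std::vector<uint32_t> b) {
        std::sort(a.begin(), a.end());
        std::sort(b.begin(), b.end());
        std::vector<uint32_t> out;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::back_inserter(out));
        return out;
    }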

2.3 Data Structures & Algorithms

It is mentioned in [Knu69] that data is rarely simply stored as an amorphous mass of numerical values. The way in which the data is stored can also provide important structural relationships between the data elements. It is important to acquire a good understanding of the structural relationships present within the data, and of the techniques for representing and manipulating such structures within a computer.

2.3.1 Data Structures

From [Kor87], data structures are important because the way the programmer chooses to represent data significantly affects the clarity, conciseness, speed of execution, and storage requirements of the programme. Data structures are chosen so that information can be easily selected, traversed, inserted, deleted, searched and sorted. For this research, data structures are classified along two axes: static versus dynamic, and structures versus implementations. Any potential hardware acceleration of data structures would directly improve algorithm performance.

Static Structures change only their values, not their structure, and include arrays and records. Because their structure stays the same, even large structures are well defined and can benefit from hardware processing. Moreover, their layout is known during compile time and can be scheduled efficiently in software. These structures are often used in signal processing applications and are often hardware accelerated in stream processors through specialised memory addressing modes[KG05] and specialised hardware. However, these structures lack the power of dynamic structures and are rarely used to store complex relationships of data.

Dynamic Structures change their size and shape as well as their values, and include stacks, heaps, lists and trees. Dynamic structures are often used to store high-dimensional, non-linear data that may not be efficiently stored in a static structure, because resources are only allocated when needed. These structures do not have a defined layout, either during compile-time or run-time, so it is fairly difficult to accelerate them in hardware or software. From [KY05], the prevailing technique used for accelerating these structures is data pre-loading or pre-caching. These structures are often used to store large sets of data and need to be accelerated for search applications.

Static Implementations implement data structures statically in hardware and would certainly speed up all the operations on them. A content addressable memory (CAM) is fully associative and allows information to be searched and retrieved almost instantly. Certain applications, such as network routers, implement such a hardware structure to facilitate routing table lookups[PS06b]. But such structures are expensive and would not be feasible for any large data set. Other common data structures are regularly implemented in hardware, such as stacks and heaps. There has also been some work done [MHH02] on implementing complex graph structures directly in hardware. Such implementations would essentially move the search algorithm from software into hardware. However, whatever is gained in speed is sacrificed in flexibility.

Dynamic Implementations would typically be built in software as a pointer-linked structure. Memory can be dynamically and quickly allocated and freed as the structure grows and shrinks. Pointer-linked structures can usually only be traversed and searched from one direction at a time. Hence, these structures are notoriously difficult to accelerate in hardware. However, existing data sets are almost entirely implemented using this method and should be accelerated where possible.
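A minimal sketch (types and names assumed for illustration) of such a pointer-linked implementation shows why it is hard to accelerate: nodes are allocated individually on the heap, their addresses carry no structural information, and traversal is a serial chain of dependent pointer loads.

    #include <cstddef>
    #include <cstdint>

    // A dynamically allocated, pointer-linked list node: the layout of the
    // overall structure is only discovered at run-time, one node at a time.
    struct ListNode {
        uint32_t  value;
        ListNode* next;   // each step depends on the previous load completing
    };

    // Grow the structure only when needed: push a new node at the head.
    ListNode* push(ListNode* head, uint32_t v) {
        return new ListNode{v, head};
    }

    // Traversal can only proceed in one direction, one node per memory access.
    std::size_t length(const ListNode* head) {
        std::size_t n = 0;
        for (const ListNode* p = head; p != nullptr; p = p->next)
            ++n;
        return n;
    }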

2.3.2 Algorithms

From [CLRS01], informally, an algorithm is any well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output. An algorithm is thus a sequence of computational steps that transform an input into the output. Most of the algorithms studied by computer scientists that solve problems are types of search algorithms. There are many types of basic search algorithms and learning how they progress from one type to another will help us understand how to accelerate them.

Linear Search is a basic worst case search algorithm as it merely steps through the data space, one element at a time, until the key is found. The data structure could be either static or dynamic. While the worst case would take O(N) steps to finish, there are different methods to improve this search algorithm such as sorting the data based on value or frequency of access.

Binary Search can be employed if the data is stored in a sorted array. There are several variants on this method such as Fibonacci[Fer60] and interpolation search. At each iteration, the algorithm would quickly eliminate at least half the search space. The worst case would take O(log N) steps.
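For reference, a minimal C++ sketch of the halving step (an illustration, not code from the thesis):

    #include <cstdint>
    #include <vector>

    // Classic binary search over a sorted array: each iteration halves the
    // remaining search space, giving O(log N) probes in the worst case.
    int binarySearch(const std::vector<uint32_t>& sorted, uint32_t key) {
        int lo = 0, hi = static_cast<int>(sorted.size()) - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;    // midpoint without overflowing lo + hi
            if (sorted[mid] == key) return mid;
            if (sorted[mid] < key)  lo = mid + 1;   // discard the lower half
            else                    hi = mid - 1;   // discard the upper half
        }
        return -1;                            // key not present
    }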

Binary Tree Search can be employed if the data is structured in a tree. In a binary tree, one branch will contain values that are always smaller than the other branch, and each branch is a binary sub-tree. A binary tree search starts at the root of a tree and eliminates half the tree with each step, similar to the binary search. The number of entries traversed will depend on the maximum height of the tree. In the worst case, where there is only one branch at each node, it could degenerate into a linear search through a linked list.

Balanced Tree Search is the solution to a badly grown binary tree search. A balanced tree, such as a red-black tree, keeps the heights of the longest and shortest branches within a small bound of each other. Therefore, a near-minimum height tree is guaranteed and the balanced tree search will never degenerate into a linear search. However, these binary trees all depend on the entire tree being in memory and only allow entry at the root.

Multi-way Tree Search is used for large trees, where data may need to be split into multiple sub-trees and accessed individually. A multi-way tree, such as a B-tree, stores the root tree in memory but stores the large sub-trees on disk. The multi-way tree search can quickly traverse through the in-memory tree and locate a particular sub-tree that needs to be loaded. This allows the sub-trees to be entered at different points while only consuming a modest amount of memory. A survey of real world databases [Bor99, MA06, PDG05] shows that indices are often built using these trees, keeping parts of the index on disk and swapping pages when necessary. These trees can be considered a more advanced form of balanced tree; therefore, they could also benefit from balanced tree enhancements.
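A sketch of the node layout such a multi-way tree might use (the fan-out and field names are assumptions, not the layout of any particular database): each node holds many sorted separator keys, so a single node visit eliminates a large fraction of the search space, and each child slot can refer either to an in-memory child or to an on-disk page.

    #include <cstdint>

    // Illustrative multi-way (B-tree style) node with an assumed fan-out.
    constexpr int kFanout = 16;

    struct BTreeNode {
        int        count;                 // number of keys currently in use
        uint32_t   keys[kFanout - 1];     // sorted separator keys
        BTreeNode* child[kFanout];        // in-memory child, if resident
        uint64_t   pageId[kFanout];       // otherwise, on-disk page of the sub-tree
        bool       leaf;
    };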

Radix Tree Search works like any other tree search, with one critical difference: it does not rely on value comparisons to work. On certain processor architectures, the implementation of comparisons could be expensive. Instead, it checks the bit value at a specific position of the key and branches left or right depending on that value. Therefore, the amount of time it takes depends on the size of the key, rather than the number of elements. As an added benefit, it can also do lexicographical matching or wildcard matching, which is extremely powerful.
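A minimal sketch of the bit-test descent (illustrative only, over a fixed 32-bit key): the branch direction is taken from one bit of the key, so no magnitude comparison is ever performed and the depth is bounded by the key width rather than by N.

    #include <cstdint>

    // Radix (bit-test) tree node over 32-bit keys.
    struct RadixNode {
        bool       leaf;
        uint32_t   key;        // valid when leaf is true
        RadixNode* child[2];   // 0-branch and 1-branch
    };

    // Descend by testing one key bit per level - no value comparisons needed.
    const RadixNode* radixFind(const RadixNode* root, uint32_t key) {
        const RadixNode* n = root;
        for (int bit = 31; n != nullptr && !n->leaf && bit >= 0; --bit)
            n = n->child[(key >> bit) & 1u];
        return (n != nullptr && n->leaf && n->key == key) ? n : nullptr;
    }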

2.4 Search Problems

It should be evident at this point that search exhibits different problems. Most computational problems can be solved by devoting more computational hardware to the problem. Although secondary search may benefit from having additional computational resources, primary search would resist such attempts. Therefore, the current trend of increasing computational performance by means of additional processor cores would increase the throughput of multiple searches but would do almost nothing for a single search.

Additional computational units go hand in hand with increases in memory bandwidth requirements. As primary search is a memory problem, one may think that increasing on-chip cache is a useful solution. However, this is not a panacea, as the increase in memory cannot continue indefinitely. There may come a day when we have multi-gigabyte cache memories, but by then our databases will probably be terabytes large. Also, subsequent chapters will show that a larger cache size does not necessarily result in better performance.

In standard databases, dynamic structures such as trees are often used to store and sort information. They are often defined dynamically during run-time and implemented dynamically in the data memory heap. This means that their characteristics are not well defined prior to actual use. As a result, they are not easy to accelerate directly in software or hardware.

It can be argued that search is primarily an I/O limited problem. However, with the cost of present technology, it is feasible to circumvent this by storing entire databases in primary memory. Therefore, this thesis assumes that databases are stored entirely in memory, and any attempt to accelerate search need only deal with the problem of slower primary memory. Improvements in memory technology will help, but will not solve the problem until the day when whole databases fit inside fast cache memory.

CHAPTER 3

Search Application

The application layer is the top-most software layer. Almost any application that searches through a data set for records that match a number of fitting criteria is considered a search application. This task can be further broken down into several stages that work collectively as a search pipeline. An analysis of actual code profiles for each pipeline stage will reveal that they are different from regular computing code and may require special attention.

3.1 Search Application

Search is a broad problem and search applications encompass a large number of problem types that go beyond the scope of this research. Every type of computer application exploits search operations at its core, and these differ depending on the application type. Encryption cracking software performs a search for the encryption key, while a chess playing programme searches through a tree of potential positions to find the best move. Although many of these problems are unique and very difficult, they are often less commonly used and would not benefit much from hardware acceleration. An alternative type of search, performed regularly, is the one performed on a process table whenever an operating system starts or stops a process. In this type of search, the computer has to go through a finite data set, looking for one or more records that match a fitting criterion or criteria. This form of search is a generic search operation and will benefit directly from any hardware acceleration.

3.1.1 Example Query

Assume for a moment that there is a flu outbreak that only kills cats and the local authorities want to inform all cat owners of this outbreak. Also, assume that there exists a massive directory of the entire human population of the United Kingdom and it holds all kinds of information about individuals, including pets they own and their city of residence. So, if the local council wished to find one or more individuals who are resident in Cambridge and who own pet cats, an example query can be performed. This query can be characterised using the following SQL-like statement1:

SELECT person FROM population WHERE pet=cat AND city=cambridge

SELECT, FROM, WHERE and AND are all SQL keywords. The person represents the individual or individuals being searched for. It may return one or more results, depending on how many pet cat owners reside in Cambridge, or no results if no one in Cambridge owns a cat. The population represents the massive directory that needs to be searched. In search algorithms, this massive directory is called the search space or database, and the size of this search space is represented by N records. The cat and cambridge criteria represent the fitting criteria used to filter out the results. In this case, only pet cat owners residing in Cambridge should be identified.

It is easy to look at things from this perspective as it exhibits many characteristics of a common search. Viewed abstractly, this search query reduces to searching a data set for one or more records that match one or more fitting criteria. Any application that performs this type of operation regularly is classified as a search application.

3.1.2 Pipeline Breakdown

The example query above can be broken down into a number of simple sub queries, which return multiple result streams that are then combined into a final result stream. From the above description, the search operation can be broken down into a series of operations. Figure 3.1 illustrates these operations.

DATA SET → KEY SEARCH → (key) → LIST RETRIEVAL → (result streams) → RESULT COLLATION → RESULTS

Figure 3.1: Typical search pipeline

1an SQL statement is merely used for illustration purposes as this research is not SQL focused

Key Search can be performed on each criterion of the query statement. The input to this stage is the index structure and the output is a key. Indices are often stored in a balanced tree structure. Unique key searches involve a balanced tree search through the tree. There are many algorithms that can be used to search through a tree, depending on the size of the tree. For sufficiently large trees, the amount of time taken to perform the search by the best algorithms is in the order of O(log N). For the example query, depending on the processor and the number of criteria, a number of key searches may be performed in parallel to speed it up. Once the keys are located, a list of results can be retrieved.
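As a software reference point (the container choice is an assumption; the thesis kernels are built on the C++ STL), the key-search stage amounts to an O(log N) lookup in a balanced-tree index that yields the head of a result list for the next stage:

    #include <cstdint>
    #include <map>
    #include <vector>

    // Balanced-tree index (std::map is typically a red-black tree):
    // criterion value -> sorted list of matching record ids.
    using Index = std::map<uint32_t, std::vector<uint32_t>>;

    // Key search: O(log N) descent through the index; the payoff is a
    // pointer to the list that the list-retrieval stage will stream out.
    const std::vector<uint32_t>* keySearch(const Index& idx, uint32_t key) {
        auto it = idx.find(key);
        return (it == idx.end()) ? nullptr : &it->second;
    }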

List Retrieval searches are not actually searches but form part of the search pipeline. The input to this stage is the starting point of the structure and the output is a list of potential results. A list can be organised in many ways but is most commonly organised as a pointer-linked structure, such as a linked list. If the list is not in sorted order, the bottleneck will once again be in the data structure, so for most intents and purposes the list can be assumed to be sorted. In this case, list retrieval is an operation that pulls data from memory into the processor. For such an operation, the best algorithms will take on the order of O(N) time to completely retrieve each list. Once again, depending on the power of the processor and the number of criteria, retrievals may be performed in parallel. The final stage of the operation is to collate the results.
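A sketch of the retrieval stage, assuming the list is kept as a sorted, pointer-linked structure (the types are illustrative): each element costs one dependent memory read, so the stage is O(N) and memory bound.

    #include <cstdint>
    #include <vector>

    // Sorted, pointer-linked result list as produced by the key search.
    struct ResultNode {
        uint32_t    recordId;
        ResultNode* next;
    };

    // List retrieval: stream the whole list out of memory into a buffer.
    // Writing each element back out mirrors the result buffering performed
    // by the profiled kernel; if the results were consumed internally, the
    // stage would be almost purely 'read'.
    std::vector<uint32_t> retrieve(const ResultNode* head) {
        std::vector<uint32_t> out;
        for (const ResultNode* p = head; p != nullptr; p = p->next)
            out.push_back(p->recordId);
        return out;
    }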

Result Collation operations can be considered a search if they involve any form of result operation. In the example query, the two result lists need to be intersected. This is often a bottleneck in the search operation, as the number of results returned from the list retrieval may be significantly larger than the actual number of final results needed. As described earlier, this is a computationally intensive operation and can benefit from hardware acceleration. There are many possible software algorithms that can be applied to this operation. If the result lists are sufficiently short and randomly accessible, a fast intersection algorithm could take O(log N) time to complete. However, for a sufficiently large list that is not randomly accessible, a typical algorithm will take O(N) time to complete. This stage is difficult to parallelise because all the earlier results need to be fed in. After this, the bulk of the search operation is essentially complete and the results are available.
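A sketch of the collation step for the two-criterion example query, assuming both retrieved lists arrive in sorted order (an illustration, not the thesis kernel): the two-pointer merge below visits each element once, i.e. O(N) for lists that cannot be randomly accessed.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Intersect two sorted result lists with a two-pointer merge: advance
    // whichever side holds the smaller id, and emit ids present in both.
    std::vector<uint32_t> collate(const std::vector<uint32_t>& a,
                                  const std::vector<uint32_t>& b) {
        std::vector<uint32_t> out;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])       ++i;
            else if (b[j] < a[i])  ++j;
            else {                              // common record: keep it
                out.push_back(a[i]);
                ++i; ++j;
            }
        }
        return out;
    }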

Each individual stage of the pipeline can be performed in software by modern processors. However, each stage needs to wait for results from the previous stage before it is able to continue. Therefore, the above series of operations can be considered a search pipeline. This project will treat each stage of the pipeline as its own operation and accelerate each stage in hardware.

3.1.3 Query Illustration

There are many ways to organise information and just as many ways to search through them. In our example query, there could be a single file on each resident that is sorted by alphabetical order in folders, files, cabinets and rooms. Alternatively, there could be a number of ledgers that hold a list of numbers identifying specific resident files by room, cabinet, file and folder. There could be a ledger representing pet cat owners and another ledger representing residents of Cambridge and these ledgers are stored in book shelves by category. In the first case, the data is structured badly, without any index or keys built for the information. A badly organised database would not benefit from any form of hardware nor software acceleration. As a result, the only way to search through this information is to inspect each individual resident file to see if they are pet cat owners and residents of Cambridge. There is no reason in working on this scenario as no amount of hardware acceleration is going to help, when the bottleneck is the database itself. This search will take a very long time to perform, even on the fastest supercomputers and the best way to accelerate the search operation would be to reorganise the database. In the second case, the data is structured, with a number of indices built for the information. The cat and cambridge ledgers will first need to be retrieved from the pet and city book shelves. Once located, each ledger contains a list of records (assumed to be sorted) that reference specific individual files. So, the cat ledger will identify specific pet cat owners and the cambridge ledger will identify residents. An intersection operation can then be performed on both the lists to find the common records of both ledgers. The resultant references can then be used to locate the specific person from the massive population directory. Although, the types of algorithms that are involved in the above operations are beyond the scope of this research, assume that the data is already organised in an optimised manner and that the algorithms used to process them are also optimised. Therefore, any bottleneck that exists, will be caused by the processing of the search algorithm. This is the opportunity where hardware acceleration may be able to help. Theoretically, the actual number of results retrieved could be from nothing (no one in Cambridge owns cats) to the entire population of the UK (assuming that everyone is resident in Cambridge and they all owns cats). As all search operations are dependent on the size of the search space (N limited), hardware acceleration would be more beneficial for large data sets than small ones. Therefore, this research will focus on methods to

accelerate a significantly large data set.

3.2 Search Profile

Before we proceed, it is important to understand how the individual search pipeline stages are performed in a regular microprocessor. In order to do this, a basic search software kernel was written using the C++ STL library. The software was then run through a simulator and the resultant operations profiled. Listing 3.1 illustrates how this is done in the Verilog simulation construct. Profiling was turned on in the Verilog simulation just before the search functions were called. When the dump variable is set (line 100), profiling sends the status of certain important pipeline registers to the output (lines 103–127), including ASM, which represents the instruction register. The dump variable is toggled (line 186) by a virtual device mapped to memory location 0x40009000, which was activated by a memory write in software. The output was passed through a parser and statistics were taken on each type of instruction. According to [FKS97], branches account for about 20% of general-purpose code and about 5–10% of scientific code, with conditional branches accounting for about 80% of them; loads and stores are frequent operations in RISC code, making up about 25–35% load and about 10% store instructions. Therefore, general-purpose code has about 35–45% memory operations, 20% branches and 35–45% arithmetic operations. Table 3.1² shows the profile of a typical search kernel and a breakdown of the operations for the kernel codes shown in Listings 3.2, 3.3 and 3.4.
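The parsing step itself is straightforward. The sketch below is not the parser used to produce the thesis results; it only illustrates the idea of splitting the comma-separated profiler output and tallying the ASM field, with classify() standing in for a host-ISA-specific opcode decoder.

   #include <iostream>
   #include <map>
   #include <sstream>
   #include <string>

   // Crude stand-in for a proper decoder: bucket instructions by the leading
   // hex digits of the instruction word.  A real parser would map opcodes to
   // the arithmetic/branch/memory classes used in Table 3.1.
   static std::string classify(const std::string &asmHex)
   {
       return asmHex.substr(0, 2);
   }

   int main()
   {
       std::map<std::string, long> histogram;
       std::string line;
       while (std::getline(std::cin, line)) {          // one profiler record per line
           std::istringstream fields(line);
           std::string field;
           while (std::getline(fields, field, ',')) {  // split "KEY=VALUE" pairs
               if (field.compare(0, 4, "ASM=") == 0)
                   ++histogram[classify(field.substr(4))];
           }
       }
       for (std::map<std::string, long>::iterator it = histogram.begin();
            it != histogram.end(); ++it)
           std::cout << it->first << "\t" << it->second << "\n";
       return 0;
   }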

3.2.1 Key Search

Glancing at this profile, it is evident that search code has a similar profile to general-purpose computing code. However, significant differences appear on closer inspection. The number of memory operations is similar to the roughly 35% suggested for general-purpose code. However, the proportion of store operations is almost half that of general-purpose code, while the proportion of load operations is significantly higher. The large number of read operations means that key search code behaves in a mainly 'read' manner. This suggests that speeding up memory reads might be beneficial. However, reading is usually a cheaper operation than writing and any benefits gained may be insignificant. The number of conditional branches is only slightly higher than that of general-purpose code. However, the total number of branches taken is almost 50% more than that of general-purpose code.

² The total percentage is 100 ± 1% due to rounding errors.

Type              Key Search   List Retrieval   Result Collation
Arithmetic            37%            21%               47%
  Compare             44%             0%               13%
  Logic                2%             1%                0%
  Addition            39%             3%               27%
  Subtraction         15%            96%               50%
Branch                29%            20%               28%
  Conditional         84%            99%               71%
  Unconditional        9%             0%               27%
  Return               6%             1%                1%
Memory                32%            59%               26%
  Load                81%            67%               99%
  Store               19%            33%                1%
Miscellaneous          3%             1%                3%

Table 3.1: Search Profiles

This indicates that decision-making code is more common in search operations. This suggests that speeding up branch penalties or eliminating branches might be beneficial. The bulk of the key search code comprises arithmetic operations. Almost half of the arithmetic operations performed were comparisons. This is indicative of search operations, as searches mainly involve comparing values with a key. This again suggests that accelerating comparisons in hardware may be beneficial.

3.2.2 List Retrieval

Looking at the middle column of Table 3.1 immediately tells us that list retrieval is very different from general-purpose code. Although the 20% branches are as expected for general-purpose code, the proportion of memory operations, at 59%, is significantly higher at the expense of arithmetic operations, at only 21%. As its name suggests, the list retrieval operation is memory intensive, as it tries to retrieve the entire list from memory. In this case, the ratio of writes to reads is about 1:2 because each node that is read is also written back to memory by the kernel to simulate result buffering. However, if the data read in is used internally, the number of write operations drops dramatically and the operation becomes 'read-only'. No comparison instructions are used, while subtraction is used the most. Although code listing 3.3 does not employ any explicit subtraction, the compiler uses it to decrement the list counter. These results show that list retrieval is a memory intensive operation and may benefit from accelerated memory operations, but it will not benefit much from accelerating

computational operations.

3.2.3 Result Collation

Looking at the right column of Table 3.1, the profile is again significantly different from the general-purpose profile. A large number of branches are again used, most of them conditional. This is indicative of the decision making involved in result collation. There are relatively few memory operations and they are almost all reads. As the name suggests, result collation is a compute-intensive operation. Almost half the operations performed are computational. However, the majority of these are subtractions. While code listing 3.4 does not employ any explicit subtraction, it is frequently generated by decision-making code to set the sign and overflow conditions that allow microprocessors to perform conditional branches. Surprisingly, only about a quarter of the operations are memory operations. This is a very different profile from the previous two, which are memory intensive. Result collation is a computationally intensive operation and is a good candidate for hardware acceleration.

3.2.4 Overall Profile

From the overall results, it can be seen that search is indeed an expensive operation. It is well known that branches and memory operations are expensive when compared to simple arithmetic operations. Search algorithms consume significantly more branches and memory operations than general-purpose code, while only certain types of computational operation, compares and subtractions, are heavily used in search. In addition, as each stage is so different, it might be better to design stage-specific hardware accelerators than to design a universal hardware accelerator for search. The key search stage resembles a general-purpose operation but is both memory and computationally intensive. The list retrieval stage is primarily memory intensive. The result collation stage is mainly computationally intensive. Each accelerator stage can then be combined in different ways, in order to accelerate different kinds of search.

     // DUMP CYCLES
 97  reg dump;

     always @(posedge sys_clk_i)
       if (dump & core0.risc0.cpu0.dena) begin //begin
 `ifdef AEMB2_SIM_KERNEL
102       $displayh("TME=",($stime/10),
                    ",PHA=",core0.risc0.cpu0.gpha,
                    ",IWB=",{core0.risc0.cpu0.rpc_if,2'o0},
                    ",ASM=",core0.risc0.cpu0.ich_dat,
                    ",OPA=",core0.risc0.cpu0.opa_of,
107                 ",OPB=",core0.risc0.cpu0.opb_of,
                    ",OPD=",core0.risc0.cpu0.opd_of,
                    ",MSR=",core0.risc0.cpu0.msr_ex,
                    ",MEM=",{core0.risc0.cpu0.mem_ex,2'o0},
                    ",BRA=",core0.risc0.cpu0.bra_ex,
112                 ",BPC=",{core0.risc0.cpu0.bpc_ex,2'o0},
                    ",MUX=",core0.risc0.cpu0.mux_ex,
                    ",ALU=",core0.risc0.cpu0.alu_mx,
                    //",WRE=",dwb_wre_o,
                    ",SEL=",dwb_sel_o,
117                 //",DWB=",dwb_dat_o,
                    ",REG=",core0.risc0.cpu0.regs0.gprf0.wRW0,
                    //",DAT=",core0.risc0.cpu0.regs0.gprf0.regd,
                    ",MUL=",core0.risc0.cpu0.mul_mx,
                    ",BSF=",core0.risc0.cpu0.bsf_mx,
122                 ",DWB=",core0.risc0.cpu0.dwb_mx,
                    ",LNK=",{core0.risc0.cpu0.rpc_mx,2'o0},
                    ",SFR=",core0.risc0.cpu0.sfr_mx,
                    ",E"
                    );
127  `endif
       end // if (uut.dena)

     always @(posedge sys_clk_i) begin
154     // DATA WRITE
        if (dwb_stb_o & dwb_wre_o & dwb_ack_i) begin
           case (dwb_adr_o[31:28])
             4'h0: // INTERNAL MEMORY
               case (dwb_sel_o)
159              4'hF: rDLMB[dwb_adr_o[DLMB:2]] <= #1 dwb_dat_o;
                 4'hC: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {dwb_dat_o[31:16], wDLMB[15:0]};
                 4'h3: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:16], dwb_dat_o[15:0]};
                 4'h8: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {dwb_dat_o[31:24], wDLMB[23:0]};
                 4'h4: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:24], dwb_dat_o[23:16], wDLMB[15:0]};
164              4'h2: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:16], dwb_dat_o[15:8], wDLMB[7:0]};
                 4'h1: rDLMB[dwb_adr_o[DLMB:2]] <= #1 {wDLMB[31:8], dwb_dat_o[7:0]};
               endcase // case (dwb_sel_o)
             4'h8: // EXTERNAL MEMORY
               begin
169               case (dwb_sel_o)
                    4'hF: rDOPB[dwb_adr_o[DOPB:2]] <= #1 dwb_dat_o;
                    4'hC: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {dwb_dat_o[31:16], wDOPB[15:0]};
                    4'h3: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:16], dwb_dat_o[15:0]};
                    4'h8: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {dwb_dat_o[31:24], wDOPB[23:0]};
174                 4'h4: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:24], dwb_dat_o[23:16], wDOPB[15:0]};
                    4'h2: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:16], dwb_dat_o[15:8], wDOPB[7:0]};
                    4'h1: rDOPB[dwb_adr_o[DOPB:2]] <= #1 {wDOPB[31:8], dwb_dat_o[7:0]};
                  endcase // case (dwb_sel_o)
 `ifdef SAVEMEM
179               #1 $fdisplayh(swapmem, "\@",dwb_adr_o[DOPB:2]," ",rDOPB[dwb_adr_o[DOPB:2]]);
 `endif
               end
             4'h4: // I/O DEVICES
               case (dwb_adr_o[15:12])
184              4'h0: $write ("%c", dwb_dat_o[31:24]);
                 4'h9: dump <= !dump;
               endcase // case (dwb_adr_o[15:12])
             default: $display ("DWB@%h<=%h", {dwb_adr_o,2'o0}, dwb_dat_o);
           endcase // case (dwb_adr_o[31:28])
189        // $display ("DWB@%h<=%h", {dwb_adr_o,2'o0}, dwb_dat_o);
        end // if (dwb_stb_o & dwb_wre_o & dwb_ack_i)

Listing 3.1: Verilog profiling construct

   int swchase(std::set<int> &setA, int pkey)
   {
   #ifdef DEBUG
      iprintf("PKEY\t: 0x%X\n", pkey);
   #endif

      volatile int j = (int)&*setA.find(pkey)._M_node;

   #ifdef DEBUG
      iprintf("FIND\t: 0x%X\n", j);
   #endif

      return EXIT_SUCCESS;
   }

Listing 3.2: Key search profile kernel

   int swstream(std::list<int> &listA)
   {
      for (std::list<int>::iterator node = listA.begin();
           node != listA.end(); node++) {
         volatile int j = *node;

   #ifdef DEBUG
         iprintf("HIT\t: 0x%X\n", j);
   #endif
      }
   }

Listing 3.3: List retrieval profile kernel

   int swsieve(std::list<int> &listA, std::list<int> &listB)
   {
      std::list<int>::iterator idxA, idxB;

      idxA = listA.begin();
      idxB = listB.begin();

      while ((idxA != listA.end()) && (idxB != listB.end())) {
         if (*idxA == *idxB) {
            volatile int j = *idxA;  // HIT!!
   #ifdef DEBUG
            iprintf("HIT\t: 0x%X\n", j);
   #endif
            idxA++;
            idxB++;
         } else if (*idxA < *idxB) {
            idxA++;
         } else {
            idxB++;
         }
      }

      return EXIT_SUCCESS;
   }

Listing 3.4: Result collation profile kernel

CHAPTER 4

General Architecture

Some general architectural decisions need to be made at the very start. The accelerator units were designed to work alongside a host processor in a heterogeneous computing environment. An existing host microprocessor was used instead of designing a system from the ground up, in order to exploit existing software tools for development and testing. C++ was chosen as the primary software language, along with STL as the default library. Some initial ideas of using a stack processor were also explored but ultimately discarded.

4.1 Initial Considerations

It was clear that the work would involve studying both hardware and software operations. Working at the hardware-software boundary allows greater leeway in determining where to draw the line between functionality. On the hardware side, microprocessor cores are increasingly becoming commodities that can be allocated to a problem to solve it. Hence, an obvious method for accelerating search operations would be to spread the task across multiple processors. This is the most obvious path presently being taken by various microprocessor vendors across the desktop, server (Sun) and embedded (ARM) markets. Alternatively, an architecture can be designed to provide small and fast algorithmic support for individual search sub-operations. This is a more flexible method of addressing the problem of generic search and would benefit the most applications. It must not be an attempt at designing a hardware search engine, as a hardware search engine would

certainly be speedy, but it would not be useful for much else. On the software side, search algorithms can also be improved by changing the algorithm architecture. Any improvement in the class of algorithm can have dramatic effects on performance. If a hardware accelerator can be designed to support improvements in algorithm structure, it would prove to be doubly useful.

4.2 Hardware Architecture

From earlier considerations, it appears that the best way to accelerate search is to spread out the work across multiple hardware cores. The question is the form of multiprocessing that this should take. Independent searches can definitely be split up across multiple independent NUMA¹ machines. However, it may not be feasible to do this for secondary searches that involve data from a common data set.

4.2.1 Multi-Core Processing

Homogeneous processors use multiple replicated copies of the same hardware core to increase processing power. This is useful for software engineers as it is easy to model and distribute multiple tasks across a homogeneous hardware platform [Mer08]. Such hardware is suitable for general-purpose computing but is less suitable for special application computing, as it consumes a large amount of chip resources that may not be used for the specialised computational task. Heterogeneous processors use multiple dissimilar cores to increase processing power. This is useful for hardware engineers as it is a more efficient use of chip resources [Mer08]. However, it is more difficult to write software for, as each core has different computational and memory requirements. In fact, each core may even be of an entirely different computing architecture. However, as the focus of this research is very application specific (search only), it is possible to use heterogeneous processing as the way to accelerate computational performance. This can be accomplished by designing extensions to an existing processor instead of designing an entirely new processor. This allows some functions to be implemented in the host processor software instead of having to implement everything in accelerator hardware. It would also allow us to easily benchmark the performance of the software running with and without the accelerator. By exploiting an existing host processor architecture, it would not be necessary to design an accompanying software toolchain. This will ultimately simplify software development. Therefore, this was the path chosen for the development of the hardware search accelerator.

¹ NUMA is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor [Wik09b].

4.2.2 Word Size

Although 64-bit microprocessors are slowly becoming the norm, most applications are still overwhelmingly 32-bit. Therefore, there is little attraction in using a 64-bit word length for either the hardware accelerator or the host processor. For expediency, the internal architecture was selected to use a basic 32-bit word-length for the proof of concept. However, there is no reason that the design cannot be completely converted into a 64-bit design or higher, if necessary, for future work.

4.2.3 Host Processor

The role of the host processor is mainly to configure the accelerator and to supply it with data. Its secondary role would be to provide comparisons between the software and hardware search operations. This means that the host processor has a minor role to play and should be minimalist. The focus of the host processor should be on size and simplicity rather than raw computational performance. Using an open-source processor would provide full access to the design, which will facilitate hardware accelerator integration and simulation. There are several popular microprocessor designs available under an open-source license. Many of these [Sun06, Int07, LCM+06] microprocessors are unnecessarily complex. Therefore, a simpler soft microprocessor architecture was chosen for the host processor. An open-source Verilog implementation[Tan04] of the Microblaze[Xil04] is used as the host processor. It is a DLX-like 32-bit RISC microprocessor which is mainly designed for small embedded applications. In addition to an instruction and data memory bus, it has the advantage of having a dedicated accelerator bus. This third bus can be used as a private communication and configuration bus between the accelerator and host processor. It was also designed by the author of this thesis who has intimate knowledge of its inner workings and can easily modify it when necessary. It is also sufficiently mature and independently tested by users of the processor.

4.3 Software Architecture

Hardware development has to go hand in hand with software development. Otherwise, there will not be any way to exploit the advances made on the hardware platform. In order to accelerate software development, libraries are used where possible.

4.3.1 Software Toolchain

The chosen processor has a mature C/C++ software toolchain based on the GNU Compiler Collection (GCC version 4.1.1). This simplifies writing software for the host

processor and also leverages existing software libraries. This allows certain functionality to be emulated in software where necessary. Although many arguments may be made about the performance of C versus C++ code, a decision was made to use C++ for development. The main reason is the prevalence of high-level libraries for C++, while still being able to use C code that is closer to the hardware. Using techniques from [Sak02], the accelerator can be interfaced using low-level C in a library for the host processor. Simple code tests also show that the code generated for both languages is very similar in performance. The main factor that determines code performance is the optimisation level used and not the language. The -O2 optimisation level was used for almost all the code compiled for testing. In early tests, it was found that -O3 optimisation often resulted in a larger code size and slightly slower running code. The -O1 optimisation often resulted in code that made more memory accesses than necessary. So, the final chosen optimisation level was the best trade-off [Jon05] in terms of performance and size.

4.3.2 Standard Libraries

Initially, several third-party libraries were used during testing. However, this caused problems as varying results were encountered with the different software libraries. As a result, it was decided that a standard library had to be used for development and testing. Certain libraries have been optimised for their specific applications, but of the many popular libraries available, a decision was made to support the C++ Standard Template Library. Being a standard library, it is both robust and mature [Ste06]. It has met much success through the years, making it suitable for the widest range of applications [Str94]. As a standard C++ library, it is the first port of call for many programmers, as many are familiar with it. As a result, it would have been heavily used in existing applications. Hence, there would be more trust in the integrity and performance of its results, and a hardware accelerator that is proven to be capable of accelerating C++ STL library routines would have the widest benefit. The C++ STL presents many basic data structures (trees, lists, queues, stacks) and related algorithms that operate on these data structures [SL95]. Through its template architecture, these data structures can be easily wrapped around various data types, including user-defined data types and data structures. This makes the C++ STL a very powerful library for writing software algorithms.

4.3.3 Custom Library

Both a custom mid-level and a custom low-level accelerator library were written. This decision was taken to make the accelerator platform-agnostic. The low-level library provides primitive read and write routines to access the accelerator registers directly. These are the only routines that need to be modified for different host processor architectures and system-level designs. The mid-level library provides a user-friendly structure on top of the low-level interface library. It provides data structures that can be manipulated by external software, and mid-level software routines to access the accelerator. This allows user software to be abstracted as simple functions instead of calling the hardware routines directly.
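For illustration, a memory-mapped low-level layer of this kind typically reduces to a pair of volatile pointer accesses. The base address and the function names below are invented for this sketch and are not taken from the hsx library:

   #include <stdint.h>

   // Hypothetical base address of the accelerator register window; in
   // practice this depends on the system-level design and host bus interface.
   #define HSX_BASE 0x40000000u

   // Only these two primitives would need porting to a different host
   // processor or bus; the mid-level library sits on top of them unchanged.
   static inline void hsxWrite32(uint32_t offset, uint32_t value)
   {
       *(volatile uint32_t *)(HSX_BASE + offset) = value;
   }

   static inline uint32_t hsxRead32(uint32_t offset)
   {
       return *(volatile uint32_t *)(HSX_BASE + offset);
   }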

4.4 Initial Architecture

Figure 4.1 shows an early conception of a potential accelerator architecture. Although the final design is significantly different from this, it is beneficial to introduce this early design. It helps to understand the train of thought and also the reasons for changes that were made in the end. The accelerator was broken up into three modular sections: bus interface, accelerator and memory interface.


Figure 4.1: Initial hardware search accelerator architecture

Bus interfacing was made a modular section in order to make the accelerator platform-agnostic. This interface is used to interact with any host processor and will depend on the host architecture. For example, a HyperTransport module could be used for interfacing with AMD processors, or a PCIe module could be used for generic PC interfacing. However, this is the subject of an end-user application and is not directly pertinent to this research.

RAM interfacing was made a modular section for a similar reason. This allows the accelerator to access the popular memory technology of the day. Today, this is

DDR2-SDRAM, but there is no reason that the accelerator should not be used with any future technology. Again, this is subject to an end-user application and is not directly pertinent to this research.

Element Array actually makes up the main part of this research. This holds a number of small accelerator cores with the different interfaces to each side. The cores can be linked together to form a search pipeline.

It was important to see if this accelerator architecture was viable. In this case, it was, because there was a method to configure a number of accelerators with access from the host processor and to memory. For this exercise, the details of each block were not important. However, there were a couple of potential bottlenecks in this architecture. Firstly, memory was going to be a bottleneck, as all the accelerator units access memory through the same memory interface. However, without actually changing existing computing architecture practices, there is little that can be done. Ultimately, all the data sits in a common main memory that needs to be accessed by the accelerators. There are existing techniques to increase memory bandwidth, and such details can be handled by the modular memory interface. Secondly, the communication with the host processor through the accelerator bus is another bottleneck. This is used for configuration purposes, which should be fairly light on traffic. However, it is also used to access the results of the search operations. Depending on how the search pipelines were configured, this may be a fairly significant amount of traffic. Therefore, as much of the operation should be offloaded to the accelerators as possible. This reduces the traffic to only that which is relevant to retrieving results.

4.4.1 Stack Processors

In the past, stack processors have had limited success in mainstream applications. However, many general-purpose stack processors have been studied in [Koo89] and they are just as powerful as mainstream RISC/CISC architectures. In recent years, stack processors have started to see a revival, such as in [LaF06, Pay00], for general-purpose use. All recursive algorithms use a stacking model of operation. Although it is not strictly necessary to traverse a tree recursively, as explained in [Kor87], recursive algorithms are usually implemented as such. Hence, it made sense to use a stack architecture in the design as it was intrinsically suited. There are many advantages to using a stack processor, such as fast procedure calls and returns and reduced instruction complexity, all of which reduce the computational cost of search. So, the search accelerator design began around a stack architecture.

30 However, designing a custom stack microprocessor also required writing a custom toolchain for it. Forth is a commonly used high-level language for programming stack based machines, though it is also possible to use other languages. As the research was into the design of an accelerator, a choice was made, fairly early in the design process, to design it independent of processor architecture. Therefore, it did not make sense to design a special purpose microprocessor architecture to process it. Figure 4.2 shows an initial idea for a custom stack-based accelerator unit. Although it was ultimately decided to abandon the idea of a stack-based processor, this figure presents some interesting ideas. The figure shows the use of a pointer engine, which is an addressing device to off-load the calculation of pointers. However, as most dynamic data structures are not calculated, the pointer engine ultimately became a simple look up device.


Figure 4.2: Initial stack based accelerator architecture

Using a stack architecture also added an extra level of information available to the processor. It allowed the hardware to keep track of the level of tree traversal by keeping count of the push and pop operations. This idea was retained in order to exploit the stack level information. This will be explored later in Section 8.6.2.

CHAPTER 5

Streamer Unit

The streamer is the simplest accelerator unit to understand. Functional and timing simulation results show that the streamer off-loads work from the host processor but does not achieve any significant speed-up. However, as a simple-to-understand unit, it is used to illustrate the steps taken in writing the simulation software kernel and simulating the device.

5.1 Introduction

The design of the accelerator can start from any part of the pipeline. However, of the different stages of the search pipeline, the simplest operation to perform is list retrieval. Therefore, it can be used to illustrate the processes involved, while keeping everything else simple. In any search query, once a key has been located, the next operation is to extract one or more records that are related to the key. In STL terms, a map data structure could be used to map a key to any secondary data structure, such as a list. In the case of the example search query, once the key cat is found, a list of records that contain cat can then be retrieved from memory. Hence, the next operation is to pull results into the accelerator. This secondary structure would typically be stored in a pointer-linked data structure. Although it is by no means limited to being a linked list, a linked list is used as an example because it is the most primitive dynamically linked data structure. In order to accelerate the processing of this list, a streamer unit can be used.
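The STL shape of such an index might look like the following sketch. It is purely illustrative: the key words and record identifiers are invented, and the thesis software is not organised exactly this way.

   #include <list>
   #include <map>
   #include <string>

   typedef std::list<int> RecordList;                 // postings list of record ids
   typedef std::map<std::string, RecordList> Index;   // key -> secondary structure

   void example()
   {
       Index index;
       index["cat"].push_back(1021);         // hypothetical pet-cat owners
       index["cat"].push_back(2740);
       index["cambridge"].push_back(1021);   // hypothetical Cambridge residents
       index["cambridge"].push_back(3355);

       // Key search: locate the list associated with the key "cat"...
       Index::iterator hit = index.find("cat");
       if (hit != index.end()) {
           // ...list retrieval: walk the linked structure the key points to.
           for (RecordList::iterator it = hit->second.begin();
                it != hit->second.end(); ++it) {
               volatile int record = *it;    // candidate result for collation
               (void)record;
           }
       }
   }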

5.1.1 Design Considerations

From the beginning, it is fairly clear that a streamer unit would provide no significant improvement in performance. List retrieval operations are inherently memory bound. The software operation of processing a list is an O(N) bound function and the streamer unit is similarly bound. Therefore, the objective of the streamer unit is not to speed things up, but to offload the task from the host processor. As this is not computationally complex, there is little to differentiate the hardware and software operations - it can be implemented in either hardware or software. The only question is the differing amounts of acceleration and cost involved. Its main function is to bypass the host processor in supplying data from main memory to the other accelerator units. This task of pulling in data can be compared to a form of Direct Memory Access (DMA). However, a regular DMA engine is designed to move data between main memory and devices in large contiguous blocks; it is not data-structure-aware, as it does not treat complex data structures any differently from random memory blocks. This is not suitable, as the individual nodes of a data structure could be located anywhere within the heap and need not be contiguous, which results in bandwidth wastage. A streamer unit is designed to be data-structure-aware: it moves the data that is needed, in a set order and from the correct memory locations, into the accelerator. In addition to being data-structure-aware, it is also results-aware, and the streamer will walk through a data structure and extract potential results from it. Therefore, although its primary function is to supply the accelerator units with data, it can also be used standalone, to independently extract data for use by the host processor in any application.

5.2 Architecture

Figure 5.1 illustrates an abstract level view of the flow of data through a streamer. A streamer unit walks through a data structure and converts the data structure into a stream of data values. These data values represent the results from the list retrieval operation. All that is needed to achieve this simple operation is a simple machine structure. This simplicity means that the streamer can be easily implemented at low cost.


Figure 5.1: Streamer data flow

33 Figure 5.2 shows the architectural view for a streamer unit. It is a three port device, with one memory port, one output port and one configuration port. The memory port is connected to the data memory and cannot be accessed from the host processor directly. The configuration port is used to access the configuration stack. The retrieved data stream is available on the output port. The output port and configuration port can be accessed via the accelerator bus.


Figure 5.2: Streamer block

A note needs to be made about the memory. Memory limitations will be considered in detail at a later stage. For now, and for the next few chapters, memory can be considered an abstract device with unlimited space and bandwidth. However, for the simulation results, the host processor and accelerator units are connected to a shared memory pool via a round-robin memory arbiter. Although connected this way, this will not present any problems in our simulations as both hardware and software operations are run separately. So, the issue of memory contention is avoided.

5.2.1 Configuration

The software library hsx/stream.hh provides several software functions to access and configure the streamer unit. There are four streamer channels defined in hsx/types.hh as HSX_STREAM0 through to HSX_STREAM3 but these are not hard limits and are easily changed in software. These identifiers specify the exact streamer channel to access on the accelerator bus. This allows the streamers to be configured and accessed by the host processor. The configuration of a streamer is managed by a series of registers organised as a stack. Figure 5.3 illustrates the structure of the streamer configuration stack. There is no reason that the configuration registers cannot be organised in a different way, such as a memory mapped structure. The reason that this structure was chosen is to simplify the configuration operation in hardware as only one configuration port is needed. Only the CONF register is actually accessible on the accelerator bus and functions as the top-of-stack register. Each write to this register will push the values down the stack. To completely configure the streamer unit, the values need to be written in the order: NODE, DATA, NEXT, SIZE, CONF. All these details are managed by the hsxSetStream() function and the user need not be concerned with the details.
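To make the push ordering concrete, the following sketch shows roughly what a configuration routine has to do. It is not the implementation of hsxSetStream(); the hsxConfWrite() primitive, the base address and the channel stride are all invented for the illustration.

   #include <stdint.h>

   // Hypothetical register map: one CONF (top-of-stack) register per streamer
   // channel.  Neither address below is taken from the thesis.
   #define HSX_STREAM_BASE 0x40008000u
   #define HSX_CHAN_STRIDE 0x10u

   static inline void hsxConfWrite(unsigned channel, uint32_t value)
   {
       volatile uint32_t *conf =
           (volatile uint32_t *)(HSX_STREAM_BASE + channel * HSX_CHAN_STRIDE);
       *conf = value;   // each write pushes the previous values down the stack
   }

   static void configureStreamer(unsigned channel, uint32_t node, uint32_t data,
                                 uint32_t next, uint32_t size, uint32_t conf)
   {
       // The push order matters: NODE first, CONF (with the enable bit) last.
       hsxConfWrite(channel, node);
       hsxConfWrite(channel, data);
       hsxConfWrite(channel, next);
       hsxConfWrite(channel, size);
       hsxConfWrite(channel, conf);
   }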

         31                              3     2     1     0
  CONF  [                            | ROK | MOD | ENA | RST ]
  SIZE  [                                              0   0 ]
  NEXT  [                                              0   0 ]
  DATA  [                                              0   0 ]
  NODE  [                                              0   0 ]

Figure 5.3: Streamer configuration stack

NODE contains the pointer to the base node, which is the first data item in the data struc- ture. This can be obtained for standard STL data structures using the begin() method for each structure. Both the following offset registers specify positive offsets from this value. The two lowest bits are zero as data is assumed to be word-aligned in memory, which is a fair assumption for most 32-bit processors.

DATA contains the offset to the data value within the node. This offset is added to the NODE pointer to obtain a memory location, which holds the actual value that gets pulled in from the data structure into the results stream. Again, this offset is assumed to be word-aligned.

NEXT contains the offset to the pointer for the next node. This offset is added to the NODE pointer to obtain a memory location, which holds the link pointer to the next node in the data structure. This pointer overwrites the existing base pointer before iterating through the stream cycle. The stream cycle forms the actual pointer following operation.

SIZE contains the number of items that are to be retrieved. This value is used in a counter that is decremented after each iteration of the stream cycle. When this counter reaches zero, the stream cycle is halted. The size of a STL data structure can be obtained using the size() method.

CONF is the configuration and status register. Figure 5.3 shows the configuration bits for the CONF register. As pointers are all word-aligned in memory, the enable and reset bits cannot be set inadvertently by any of the other register values. This ensures that the streamer is cleared and enabled only when it is needed. The hsxSetStream() function also resets the unit before configuring and enabling it.
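Functionally, the four data registers drive a very small pointer-chasing loop. The following software model is only a behavioural sketch of the stream cycle described in Section 5.2.3 (it is not the RTL, and the function name is invented); NODE is the current node pointer, DATA and NEXT are byte offsets within a node, and SIZE is the number of nodes to visit.

   #include <stddef.h>
   #include <stdint.h>
   #include <vector>

   std::vector<uint32_t> streamModel(const uint8_t *node, size_t dataOff,
                                     size_t nextOff, unsigned size)
   {
       std::vector<uint32_t> out;
       while (size != 0 && node != NULL) {                      // NULL state check
           out.push_back(*(const uint32_t *)(node + dataOff));  // DATA state: read value
           node = (const uint8_t *)                             // NEXT state: follow link
                  *(const uintptr_t *)(node + nextOff);
           --size;                                              // decrement internal counter
       }
       return out;
   }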

5.2.2 Operating Modes

There are two modes for the streamer output that can be configured using the CONF:MOD register bit. These modes determine the operating mode of a streamer unit. Figure 5.4

depicts these basic modes.

MODE_PUMP stops the data stream from being automatically streamed away. Essentially, it disables the read enable signal on the output buffer, which prevents any other attached accelerator unit from streaming the data away. The host processor then needs to read the streamed values manually, using the accelerator bus. This configures the streamer for standalone use, where the streamer acts as an independent device that pulls data from memory into the processor.

MODE_PIPE tells the streamer to pipe the data stream directly through to another attached device, typically the sieve unit. This mode bypasses the host processor and provides the highest streaming speed possible, as the process flow is controlled entirely in hardware.


Figure 5.4: Streamer operating modes

5.2.3 State Machine

Figure 5.5 shows the finite state machine controlling the streamer. There are four states, each running at full clock speed. The main stream cycle consists of the NULL, DATA and NEXT states.


Figure 5.5: Streamer machine states

IDLE is the default state and is entered as a result of either a soft or hard reset. All internal control signals are reset to their default values during this stage. The value of NODE register is copied to the internal node pointer, the DATA register is copied to an internal offset register and the SIZE register is copied to the internal counter.

NULL state performs a cycle check. If the internal counter is zero or if the internal node pointer is a null pointer, the stream cycle is terminated by staying locked in this

state, blocking until a reset is received. Otherwise, the data value is addressed by adding the internal node pointer to the internal offset register. The NEXT register is then copied to the internal offset register.

DATA state performs a single data read transfer on the memory port. The appropriate memory control signals are asserted and de-asserted according to the transfer protocol and the data item is read directly into the output buffers. At the same time, the next data pointer is addressed by adding the internal node pointer value with the internal offset register. The DATA register is then copied to the internal offset register as before.

NEXT state performs another single data read transfer on the memory port. The pointer is loaded directly into the internal node pointer, over-writing the existing pointer and the internal counter is decremented by one.

From this state machine, it takes three clock cycles to stream out a single 32-bit word of data. Assuming that the streamer channel runs at a nominal 100 MHz, the theoretical maximum data streaming speed of a single channel is 1.066 Gbps (1/3 × 100 MHz × 32 bits). However, it loads a 32-bit data word and a 32-bit pointer during each iteration, giving a theoretical maximum memory bandwidth of 2.13 Gbps.
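Spelling the two figures out (this working follows directly from the three-state stream cycle, which emits one 32-bit result but fetches two 32-bit words every three clocks):

\[
  \text{output rate} = \tfrac{1}{3} \times 100\,\text{MHz} \times 32\,\text{bit}
                     \approx 1.066\,\text{Gbit/s},
  \qquad
  \text{memory demand} = \tfrac{2}{3} \times 100\,\text{MHz} \times 32\,\text{bit}
                       \approx 2.13\,\text{Gbit/s}.
\]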

5.3 Streamer Simulation

To measure the streamer performance, software simulation was used. A streamer kernel was written in C++ to compare the performance of the software and hardware streaming methods. Extracts of the source code are listed at the end of the chapter. Each method was timed and measured using a unitless tick count, which was then used to obtain the speed-up factor. The kernel first created and filled an input list (lines 72–76) with random values, the number of which was determined at compile time. The input was then sorted (line 79) and the same data set was used for both the hardware and software streams for comparison. Debug output was obtained via iprintf(), an integer-optimised version of printf() with a smaller memory footprint. Debug and non-debug builds were selectively enabled using conditional defines, and the results of both streaming operations were compared using simple text scripts. Figure 5.6 shows the simulation virtual hardware setup. This setup is also used for simulating the other accelerator units. The host processor is connected to the accelerator units via a dedicated accelerator interface, which is used for both data and control operations. The processor reads software instructions from a directly connected instruction memory. Each accelerator unit and the host processor use a shared data memory pool that is accessed through a round-robin memory arbiter. Contention in the memory arbiter between the host processor and an accelerator unit is avoided by performing each software and hardware streaming operation individually.


Figure 5.6: Accelerator unit simulation setup

5.3.1 Kernel Functional Simulation

Listing 5.1 shows the debug output for a data set of N = 30. This verifies that the same results are obtained by the software and hardware streamers. It is important to verify that the hardware operation produces the same result before useful performance measurements can be made. Initially, it seems that the streamer unit is quite worthless as an accelerator. Although the bulk of the tick count was consumed by the iprintf() function, the numbers of ticks required to perform the software and hardware streaming were similar. In fact, the hardware accelerated operation took slightly longer. However, this is not the whole picture and the situation will be explored further in subsequent chapters. Listing 5.3 shows the sample software used to configure the hardware streamer as an independent DMA device to free up the processor for other functions. Although in actual applications it would be prudent to check that the buffers are not empty before extracting a result, in this example the hardware results were not checked to exist in the buffers before being extracted (line 60). This was because the hardware streamer unit pulls in data at a much higher rate than the host processor consumes it. Checking can be done by reading the status of the CONF:ROK bit. The way in which the configuration parameters for the list were obtained may look somewhat convoluted and may require some explaining. This depends entirely on how the data structure was defined, and this was just one method of extracting the parameters. The values that were needed are the base pointer, the data and next pointer offsets and the structure size. The information was extracted from studying the C++ STL linked list data structure source file bits/stl_list.h. For a user-defined data structure, extracting the necessary offsets and pointers should be fairly straightforward. Any slow-down due to this added complexity was taken into account by considering this configuration time as a fixed overhead cost.
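For completeness, the CONF:ROK check mentioned above amounts to polling a status bit before popping the output buffer. The register addresses and the bit position in this sketch are assumptions made only to keep it self-contained; they are not taken from the hsx library.

   #include <stdint.h>

   // Assumed addresses for one streamer channel's CONF/status and output
   // registers, and an assumed position for the ROK (read-OK) bit.
   #define HSX_STREAM2_CONF ((volatile uint32_t *)0x40008020u)
   #define HSX_STREAM2_DATA ((volatile uint32_t *)0x40008024u)
   #define HSX_CONF_ROK     0x8u

   static uint32_t readOneResult(void)
   {
       while ((*HSX_STREAM2_CONF & HSX_CONF_ROK) == 0)
           ;                               // wait until the output FIFO holds data
       return *HSX_STREAM2_DATA;           // then pop one streamed value
   }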

Listing 5.2 shows the equivalent function performed by the software to stream data in. In essence, the software operation needs only a few instructions to perform the stream cycle. As mentioned earlier, this is not computationally intensive, which is reflected in the simple for-loop used. In pseudo assembly, the code assembles into:

   DO
      data    = LOAD(pointer + data_offset)
      pointer = LOAD(pointer + pointer_offset)
      STORE(data, output_location)
   LOOP

Listing 5.5: Streaming pseudocode

5.3.2 Kernel Timing Simulation

To obtain a more accurate timing measurement, a non-debug streamer kernel was used. This removes the iprintf() overhead, so all tick counts were consumed purely by the streamer operation. Listing 5.6 shows the output of the kernel timing simulation. The number of operations was still similar, with the hardware configuration overhead consuming several more memory cycles than the software operation. Figure 5.7 shows an extract from the streamer kernel simulation timing diagram for one iteration. The time values quoted are all unitless, and ten units correspond to one tick. The diagram focuses on the important signals while running the hardware streamer kernel, to detect events as they happen in hardware. There are a number of markers on the diagram, indicated by the vertical dotted lines, from left to right:

A (1067870) is an estimated mark for the start of the configuration phase of the streamer. Some parts of the configuration phase happen before this and can be considered kernel function call overhead. This point corresponds to the first transfer initiated by the hsxSetStream(HSX_STREAM2, cfg) function call. The first part of this function asserts the CONF:RST bit to reset the streamer state machine and flush the buffers, as indicated by the clr_i signal (1). After this, the configuration registers are transferred onto the configuration stack, indicated by the multiple xwb_stb_i and xwb_ack_o handshakes (2). The same technique of writing to the CONF:RST bit is used to reset, flush, and configure the other accelerator units.

B (1068880) marks the end of the configuration phase of the streamer. The final task of the hsxSetStream(HSX_STREAM2, cfg) function call is to enable the accelerator. At this point, the CONF:ENA bit is asserted, which starts streaming data from memory immediately, as indicated by the ena_i signal (3). The memory transfers are indicated by the multiple dwb_stb_o and dwb_ack_i handshakes (4). The same technique is used to start the other accelerator units.

C (1071360) shows the time at which the output buffers are full. The output buffers are 15 levels deep and are pushed and popped on each wre_i and rde_i assertion. By counting the number of buffer pushes and pops, it is evident that the buffers stall at C (5). The streamer slows down and waits for items to be extracted from its output buffer. This is because the hardware streams data in at a much higher rate than the software extracts it.

D (1071780) marks the time when the streamer stops running. The streamer does not stop until the size counter decrements to zero. The number of items read into the output buffer is 30 in this case, as defined at compile time. This can be determined by counting the number of wre_i assertions after B (6).

E (1074010) marks the estimated point when the stream kernel function ends. The number of items extracted into the host processor is also 30 in this case. This can be determined by counting the number of rde_i assertions after B (7). After this, there are a few more operations to perform before the kernel function returns control to the main process; these can be considered the kernel function return overhead.

As seen from the result, the time taken to perform the actual hardware streaming was much shorter than the total hardware operation time. Although the total hardware streamer operation time was TAE = 614 ticks, the time to extract the stream into the host processor was only TBE = 513 ticks (83.6%). This makes sense, as the hardware streamer immediately began to fetch data into the processor at point A and did not stop until it was completed. The hardware configuration overhead was TAB = 101 ticks (16.4%).

From the output, the total streamer kernel consumed THW = 852 ticks. This makes the function call and return overhead T+ = 238 ticks (+38.8%). Using a similar function call and return overhead for the software operation, the software operation took TSW = 414 ticks. Therefore, the hardware streamer operation actually ran slower than the software operation. From the timing diagram estimate, the speed-up factor is:

TSW / THW = 0.67,    N = 30

However, there are several points to note in this estimation. Inspecting the source code will show that the function call and return overhead is not 238 ticks. A large proportion of it was actually used up by the operation to extract the values for configuration, which happens prior to point A. Assuming the function call and return overhead is similar to that of the other accelerator units, at about 100 ticks, the results are then different.

The total hardware operation would require THW = 752 ticks, of which a total of TAB = 224 ticks (29.8%) is hardware configuration overhead.


Figure 5.7: Streamer timing diagram (bus and control signals xwb_stb_i, xwb_ack_o, dwb_stb_o, dwb_ack_i, clr_i, ena_i, wok_o, wre_i, rok_o and rde_i against simulation time, with markers A–E and annotations 1–7)

The software operation time would require TSW = 552 ticks. This gives a slow-down factor of 0.73, which makes more sense, but is still very slow. On closer inspection, the time taken to actually complete the hardware stream was TBD = 290 ticks, including stalling. The additional time at the end,

TDE = 223 ticks, is used by the host processor to retrieve the result values from the operation. Therefore, if the results are not retrieved in software but piped directly to the other accelerator units, there is a potential hardware speed-up of 1.90. This is a good sign, as the hardware streamer unit is mainly going to be used to stream information into another hardware accelerator unit.

5.3.3 Kernel Performance Simulation

In order to eliminate the large uncertainties of the timing diagram estimates, multiple sampling was used for simulation. The streamer kernel was compiled for different data set sizes between N = 10 and N = 150 to extrapolate a trend. Each kernel was run and sampled 50 times to obtain a range of software and hardware tick counts. In each case the data set was first prepared with random values. The number of samples chosen was a trade-off between simulation time and accuracy. Increasing the sample size would improve the statistical accuracy of the results. However, sampling 50 times for each data set already took about a day to complete the entire simulation run, and the simulation had to be re-run each time the design was modified. Figure 5.8 shows the simulation results for different data sets. The points were plotted with y-errorbars to mark the mean and standard deviation of the result set, but the errorbars are not visible as the results are fairly consistent. Both the software and hardware curves reflect the O(N) bound nature of a list retrieval operation. Extrapolating linearly from the graphs yields the following relationships:


Figure 5.8: Streamer performance simulation

Msw(N) = 22.0 N + 94.0    (5.1)

Mhw(N) = 22.6 N + 241.7 (5.2)

Equation 5.1 describes the performance of the software streamer. The intercept of 94 is similar to the timing-estimated function call and return overhead of 100. Equation 5.2 describes the performance of the hardware streamer unit. The intercept of 242 is close to the overhead estimated from the timing diagram with the complex parameter extraction. From the graph, the speed-up ratio Msw(N)/Mhw(N) of the streamer for N = 30 agrees well with the timing estimate of 0.73, and the ratio of the speed-up for a sufficiently large data set, N → ∞, is:

Mup = Msw(N) / Mhw(N) = 0.97

This is, in effect, a slight (3%) slow-down rather than a speed-up. As mentioned earlier (Section 5.3.1), streaming is a very simple operation that the host processor can, if necessary, handle fairly efficiently. The hardware state machine performs the same series of loops with the same number of memory accesses. Therefore, it does not run any faster than the software kernel. It runs slightly slower due to the difference in memory contention between the software and hardware kernels: the software kernel had the entire memory bus to itself, while the hardware kernel had to share small parts of it with the running software application. However, from Section 5.3.2, it is apparent that the streamer unit actually pulls in data at a higher rate than the software is able to remove it. Therefore, the streamer unit spends a large proportion of its time stalled, waiting for the host processor to pull data off. In addition, the hardware streamer works independently of the host processor and can consequently be used to offload the streaming operation from the host processor. In terms of using it to feed data to other acceleration units, the exact effects are not presently known but will be explored in subsequent chapters.
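The asymptotic figure can be checked numerically against the fitted models; the short sketch below simply evaluates Equations 5.1 and 5.2 and is added here for illustration only (it is not part of the simulation software).

   #include <cstdio>

   int main()
   {
       // Evaluate the fitted models for a few data set sizes.
       const double sizes[] = { 150, 1000, 10000 };
       for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); ++i) {
           double msw = 22.0 * sizes[i] + 94.0;     // Equation 5.1 (software)
           double mhw = 22.6 * sizes[i] + 241.7;    // Equation 5.2 (hardware)
           std::printf("N = %6.0f  Msw/Mhw = %.3f\n", sizes[i], msw / mhw);
       }
       // The ratio tends towards the ratio of the gradients as N grows.
       std::printf("N -> inf  Msw/Mhw = %.3f\n", 22.0 / 22.6);   // ~0.97
       return 0;
   }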

5.4 Conclusion

The streamer unit can be used to accelerate a list-retrieval operation on a dynamically linked structure in two ways. When used as a standalone unit, it can be used to offload the task of pulling data into the host processor. This will reduce the workload of the host processor but provides little additional benefit. If used in combination with other acceleration units, in addition to offloading, it can potentially pipe data through at a much higher rate than the host processor can. However, this assumes that the attached

unit would consume data at a sufficiently high rate to prevent the streamer unit from stalling, which is a limitation. The streamer unit is ultimately an O(N)-bound machine. The hardware design is fairly optimised, with multiple operations layered across the machine states. It has a speed-up factor Mup = 0.97, which is close to that of the software operation. This makes it useful, as the worst case is not significantly detrimental to the list retrieval operation while it frees up the host processor for other operations. The maximum external memory bandwidth required by each streamer unit is 2.13 Gbps at 100 MHz, while the maximum internal data transfer rate is 1.066 Gbps at 100 MHz.

Software Stream
HIT : 0x56
HIT : 0x121A
HIT : 0x1244
HIT : 0x281B
HIT : 0x2BBD
HIT : 0x2D13
HIT : 0x2E80
HIT : 0x37AE
HIT : 0x38D4
HIT : 0x3F17
HIT : 0x5147
HIT : 0x536E
HIT : 0x64C8
HIT : 0x65E6
HIT : 0x6F80
HIT : 0x7007
HIT : 0x7CB4
HIT : 0x7D4E
HIT : 0x8957
HIT : 0x9903
HIT : 0xAFF3
HIT : 0xB06A
HIT : 0xC596
HIT : 0xCD1F
HIT : 0xD3EE
HIT : 0xD8CF
HIT : 0xE89A
HIT : 0xE9B2
HIT : 0xF54E
HIT : 0xFBF4
182638 swticks
10079 swmemticks
Hardware Stream
HIT : 0x56
HIT : 0x121A
HIT : 0x1244
HIT : 0x281B
HIT : 0x2BBD
HIT : 0x2D13
HIT : 0x2E80
HIT : 0x37AE
HIT : 0x38D4
HIT : 0x3F17
HIT : 0x5147
HIT : 0x536E
HIT : 0x64C8
HIT : 0x65E6
HIT : 0x6F80
HIT : 0x7007
HIT : 0x7CB4
HIT : 0x7D4E
HIT : 0x8957
HIT : 0x9903
HIT : 0xAFF3
HIT : 0xB06A
HIT : 0xC596
HIT : 0xCD1F
HIT : 0xD3EE
HIT : 0xD8CF
HIT : 0xE89A
HIT : 0xE9B2
HIT : 0xF54E
HIT : 0xFBF4
183040 hwticks
10129 hwmemticks

Listing 5.1: Streamer kernel debug output

   int swstream(std::list<int> &listA)
   {
      for (std::list<int>::iterator node = listA.begin();
           node != listA.end(); node++) {
         volatile int j = *node;

   #ifdef DEBUG
         iprintf("HIT\t: 0x%X\n", j);
   #endif
      }
   }

Listing 5.2: Software streamer kernel

     int hwstream(std::list<int> &listA)
 44  {
        std::list<int>::iterator node;
 46     hsxStreamConfig cfg;

        cfg.conf.bits.mode = HSX_STREAM_PUMP;
        cfg.node = (int) &*listA.begin()._M_node;   // node base
        cfg.next = (int) &node._M_node->_M_next;    // next offset
 51     cfg.data = (int) &((std::_List_node<int> *)node._M_node)->_M_data;
        cfg.size = LIST_MAX;  // listA.size();

        hsxSetStream(HSX_STREAM2, cfg);

 56     // pull data
        for (int i=0; i

     int stream()
 68  {
        std::list<int> listA;

 71     // prefill lists
        for (int i=0; i

        // sort lists
        listA.sort();

        // do sieve
 81     int ticks;
        int memtick;

        // SOFTWARE STREAM
        iprintf("Software Stream\n");
 86     memtick = getmemtick();
        ticks = gettick();
        swstream(listA);
        ticks = gettick() - ticks;
        memtick = getmemtick() - memtick;
 91     iprintf("%d swticks\n", ticks);
        iprintf("%d swmemticks\n", memtick);

        // HARDWARE STREAM
        iprintf("Hardware Stream\n");
 96     memtick = getmemtick();
        ticks = gettick();
        hwstream(listA);
        ticks = gettick() - ticks;
        memtick = getmemtick() - memtick;
101     iprintf("%d hwticks\n", ticks);
        iprintf("%d hwmemticks\n", memtick);

        return EXIT_SUCCESS;
     }

Listing 5.4: Streamer kernel

Software Stream
652 swticks
98 swmemticks
Hardware Stream
852 hwticks
120 hwmemticks

Listing 5.6: Streamer simulation output (non-debug)

CHAPTER 6

Sieve Unit

The sieve unit is a primitive computational unit. It can be configured to perform multiple operations and also takes inputs from multiple sources. The functional and timing simulation results show that it can improve the performance of a simple boolean query by 5.2 times when used in conjunction with the streamer unit in a hardware pumped configuration.

6.1 Introduction

After a list of results is extracted during list retrieval, a common operation required at the end of a search is to collate the results. If the results do not need to be collated, they can be buffered and passed through to the host processor as the final results. Both of these functions are performed by a sieve unit. Of the two operations, collation is computationally intensive and is a suitable candidate for hardware acceleration. It involves traversing the results list and comparing each individual element to see if it matches one or more filter criteria. The example query would need to compare the lists of records to find those that contain both the words cat and cambridge. Unlike the streamer, a sieve works on the actual values of the results and is not data-structure-aware. As the name suggests, it is designed to filter a results list so as to extract only the results that fit a specific criterion. Common operations that need to be performed are the intersection and union of two lists. As the sieve unit is used at the end of a search pipeline, it can also be used as a results buffer to extend the output buffer capacity of any other acceleration unit. In this mode,

it can extend the existing output buffer capacity. This can help alleviate bottlenecks caused by buffer stalls, as seen previously in the streamer unit.

6.1.1 Design Considerations

To perform these functions, some factors need to be considered. The process of intersecting two lists can be expensive in software. Each item in the first list needs to be checked against the items in the second list. Assuming that the two lists are of equal length, a naïve algorithm would do this operation in O(N²) time. If the lists are unsorted, this is the best possible performance of any sieve operation. However, in this case, the problem lies with the data structure rather than the computation. For optimal operation, data items should be inserted into a list in sorted order. If the lists are in sorted order, a smarter algorithm is capable of reducing this operation to O(N) time. However, sorting is a computationally intensive operation that is O(N log N) bound. Assuming that insertion happens far less often than data selection, sorting should be performed during data insertion. For most common applications, this assumption is valid. In this case, the cost of sorting the data can easily be amortised over a number of select operations, which is more efficient than sorting the data after a select operation. If the minimum and maximum values are known, the sieve operation can be further reduced to O(log N) time. This would allow a binary search to be used to quickly eliminate blocks of items that would otherwise need to be read. However, this may need better hardware capabilities to decide where to split the range. Furthermore, it requires the entire list to be available in advance, in order to know the list range. A hybrid technique is used in the hardware sieve unit. It runs in O(N) time while being able to eliminate blocks of data at a time. The two lists are stored in input buffers and the minima and maxima of each input buffer are tracked. When possible, the input buffers are flushed to eliminate whole blocks. The result is a very fast sieve unit that provides significant acceleration.
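In software terms, the hybrid behaves like the sketch below: an ordinary O(N) merge walk, except that a whole buffer-sized block can be skipped in one step when its maximum lies below the other stream's current value. This is only a behavioural illustration (not the RTL), and the buffer depth is an assumed figure.

   #include <algorithm>
   #include <cstddef>
   #include <vector>

   static const std::size_t BUFFER_DEPTH = 15;   // assumed input FIFO depth

   std::vector<int> sieveAnd(const std::vector<int> &a, const std::vector<int> &b)
   {
       std::vector<int> out;                     // both inputs must be sorted
       std::size_t i = 0, j = 0;
       while (i < a.size() && j < b.size()) {
           // Flush: if the current block of 'a' tops out below b[j], no element
           // in it can ever match, so the whole block is dropped at once.
           std::size_t blockEnd = std::min(i + BUFFER_DEPTH, a.size());
           if (a[blockEnd - 1] < b[j]) { i = blockEnd; continue; }

           if (a[i] == b[j])     { out.push_back(a[i]); ++i; ++j; }  // duplicate passes
           else if (a[i] < b[j]) { ++i; }                            // drop single item
           else                  { ++j; }
       }
       return out;
   }

A symmetric flush on the second buffer is omitted here for brevity.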

6.2 Architecture

Figure 6.1 illustrates an abstract level data view of the sieve process. Multiple result streams, generated by streamer units or by software, are combined by the sieve unit into a final result stream. Although a sieve unit is designed to combine two streams into one, multiple sieve units can be combined in parallel and in cascade for more complex collation operations.

Figure 6.1: Sieve data flow

Figure 6.2 shows a single sieve unit organised as two individual channels. Each channel has an input and an output port, linked directly to internal input and output buffers. Unlike the streamer unit, the channels are controlled and configured in pairs because the sieve works on paired data streams. Configuring either the SIEVE_0 or the SIEVE_1 channel therefore configures the shared channel pair.


Figure 6.2: Sieve Block

6.2.1 Configuration

The software library hsx/sieve.hh provides several software functions to access and configure the sieve unit. There are four sieve channels defined in hsx/types.hh as HSX_SIEVE0 through to HSX_SIEVE3. Just as before, these identifiers specify the exact sieve channel to access on the accelerator bus.

Unlike the streamer unit, the sieve unit has a single configuration register that controls the operation of the channel pair, instead of a stack. The configuration of a sieve unit is managed through the accelerator bus using the hsxSetSieve() function, which writes the appropriate values to the configuration register. Figure 6.3 describes the configuration bits of the register.

The configuration register also doubles as a status register for each individual channel. Although a write to either sieve channel configures the same register, a read from each channel returns only the status of that channel. A read therefore returns the data-available (CONF:ROK) and buffer-available (CONF:WOK) status bits individually for HSX_SIEVE0 and HSX_SIEVE1.

CONF register fields (bits 31 down to 0): MODE, WOK, ROK, ENA, RST

Figure 6.3: Sieve configuration register

6.2.2 Modes

Figure 6.4: Sieve operating modes (MODE_PAS, MODE_SWP, MODE_AND, MODE_IOR)

The sieve unit is able to perform several functions. Figure 6.4 depicts the basic functions; other functionality can be added to the sieve unit when desired. The basic functionality is sufficient to perform the most common collation functions. These functions are selected by setting the CONF:MODE bits to one of the following values (a hedged configuration sketch for a cascaded boolean query follows the mode descriptions):

MODE PAS links the output buffers and input buffers directly in pass-through mode. This mode makes the sieve extend any results buffer of any other accelerator unit. In this mode, the sieve can stream data at maximum speed and store additional elements in each buffer.

MODE SWP crosses the output buffers with the input buffers in swap-through mode. This swaps the two channels but does not filter any of the results. In this mode, each channel can stream at maximum speed and store additional elements in each buffer. If a number of sieves are used in cascaded swap mode, it is possible to use them as routers to move data around.

MODE AND filters the two input buffers into a single output buffer in intersection mode. Only duplicate results that appear on both input streams are filtered into the output. This performs a logical AND operation on the input streams. In the figure, B2 is the same as A1. Only A1 is filtered through while B2 is dropped. In this mode, whole buffers are flushed if the values in a buffer are outside the intersection range.

MODE IOR filters and sorts the two input buffers into a single output buffer in union mode. Duplicate results that appear on both input streams are filtered into a single occurrence. At the same time, the results are sent out in sorted order. This performs a logical inclusive OR operation on the input streams. In the figure, B1 is smaller than A1, which is the same as B2. Only A1 is filtered through while B2 is dropped.
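As an illustration of how these modes combine, the sketch below configures two sieve channel pairs to collate the query (A AND B) OR C. It is only a sketch under assumptions: HSX_SIEVE_IOR is an assumed mode identifier (only HSX_SIEVE_AND appears in the listings of this chapter), and the intermediate results are forwarded by software because the exact mechanism for wiring one sieve's output into another's input is not detailed here.

#include "hsx/sieve.hh"
#include "hsx/types.hh"

// Collate (A AND B) OR C using two sieve channel pairs (hypothetical sketch).
void collate_and_or()
{
    hsxSieveConfig cfgAnd, cfgIor;
    cfgAnd.conf.bits.mode = HSX_SIEVE_AND;   // pair 0/1: A intersect B
    cfgIor.conf.bits.mode = HSX_SIEVE_IOR;   // pair 2/3: union with C (assumed identifier)
    hsxSetSieve(HSX_SIEVE0, cfgAnd);
    hsxSetSieve(HSX_SIEVE2, cfgIor);

    // Streams A and B are pushed, in sorted order, into channels 0 and 1;
    // stream C is pushed into channel 3; the intermediate (A AND B) results
    // are read back from channel 0 and forwarded into channel 2 by software:
    //
    //     while (/* results available on HSX_SIEVE0 */)
    //         hsxPutData(HSX_SIEVE2, hsxGetData(HSX_SIEVE0));
}

Because the output of a sieve is sorted, the intermediate stream can be fed straight into the second sieve without any re-sorting, which is what makes this kind of cascading possible.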

6.2.3 Operation

Although the channel pairs are configured together, each individual input and output port can be accessed through the accelerator bus using hsxGetData() and hsxPutData(). Data can be fed directly into the sieve unit by the host processor or streamed from a streamer unit. Using software to pump data allows the sieve unit to be used as a standalone data collation unit to accelerate the collation of data streams retrieved from various other sources. However, the fastest performance is achieved when the data is streamed directly from a streamer unit instead of via the host processor. The sieve unit works by having data pushed into it in sorted order. The output of a sieve is also sorted and can be used as a direct input to another sieve, which allows sieve operations to be cascaded for complicated collation operations. Figure 6.5 shows the basic finite state machine controlling the sieve. There are only two states, each running at full clock speed.


Figure 6.5: Sieve FSM

IDLE state is the default state, where the data items in the input buffers are compared. The sieve unit maintains a record of the minimum and maximum data items pushed into the input buffer. A decision is made to either flush, pop, pass, swap or stall the buffers based on the contents of each input buffer.

WORK state is when the actual sieve operation occurs. A flush clears all data items from an input buffer, a pop drops a single data item from the input buffer, a pass feeds a single data item from the input to the output buffer, a swap feeds a single data item from the input to the opposite output buffer, and a stall will wait for more data items to be made available on the input buffers.

Therefore, the maximum theoretical transfer speed of a single channel is 1.6 Gbps at 100 MHz. With two independent channels, the sieve unit can sustain a maximum data transfer rate of 3.2 Gbps at 100 MHz. This is ample bandwidth to handle the incoming data stream from either a streamer unit or the host processor: the maximum theoretical transfer rate of a streamer unit is only 1.066 Gbps, and the processor transfer speed is even slower. Therefore, there is no possibility of saturating the sieve unit by either software or hardware piping of the inputs, which makes the sieve operation bound by the streaming input operation.
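These figures follow from the two-state control loop, assuming 32-bit data items (a word width consistent with the rest of the design, though it is stated here as an assumption):

100 MHz / 2 states x 32 bits = 1.6 Gbps per channel
2 channels x 1.6 Gbps = 3.2 Gbps per sieve unit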

6.3 Simulation Results

Once again, simulation was used to measure the performance of the sieve unit. Listing 6.3 shows parts of the sieve kernel written to perform an intersect collation in both software and hardware. Two input streams were created and filled (lines 100–105) with a number of different random values; once again, the data set size was defined at compile time. Several identical random values were then inserted (lines 107–113) into both streams to ensure that intersections exist. The two streams were then sorted, an O(N log N) operation (lines 115–117), and used as the inputs for both the software and hardware sieves. As the two streams contain different values, the common values were located at different positions along each list. The results of both the hardware and software sieves are inspected visually using the debug output.

6.3.1 Kernel Functional Simulation

Listing 6.4 shows the debug output. It verifies that the same results were obtained for both software and hardware sieves. This shows that the hardware sieve can at least be used to offload the operation from the host processor. Listing 6.2 shows how to use the hardware sieve as an independent sieve configured for intersection mode under software pump. Both lists were manually pumped into the sieve through software, using the hsxPutData() function. In this example code, the software kernel did not check for the status of the read and write buffers because the software pump will never saturate the buffers. However, the CONF:WOK and CONF:ROK status bits should be checked in actual applications. Listing 6.1 shows an equivalent intersection computed in software. There are different methods to compute intersections. The method used computes the intersection in O(N) time.
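In a real application, the driver should poll the status bits before each transfer. The sketch below shows one way to do this; the read-OK mask follows the polling used in Listing 6.6 (bit 5), while the write-OK position is an assumption, so both masks should be treated as illustrative only.

#include <vector>
#include "hsx/sieve.hh"
#include "hsx/types.hh"

// Hypothetical status masks: bit 5 matches the result polling in Listing 6.6,
// the write-OK bit position is assumed.
static const int SIEVE_ROK = (1 << 5);
static const int SIEVE_WOK = (1 << 4);

// Push one sorted value into each input channel, respecting buffer status.
void pump_pair(int a, int b)
{
    while (!(hsxGetConf(HSX_SIEVE2) & SIEVE_WOK));   // wait for input space
    hsxPutData(HSX_SIEVE2, a);
    while (!(hsxGetConf(HSX_SIEVE3) & SIEVE_WOK));
    hsxPutData(HSX_SIEVE3, b);
}

// Drain any results that are ready from the output buffer.
void drain(std::vector<int>& out)
{
    while (hsxGetConf(HSX_SIEVE2) & SIEVE_ROK)
        out.push_back(hsxGetData(HSX_SIEVE2));
}

The busy-wait loops here stand in for the interrupt-driven handshaking that a production driver would more likely use.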

6.3.2 Kernel Software Pump Timing

Listing 6.5 shows the kernel output without debug messages. Unlike the streamer unit, it is evident that the hardware operation is faster than the software option, even at this stage. Figure 6.6 shows the resultant timing diagram from running the sieve with software pump input. It is a single sample with a data set size of N = 33 for each list. Again, the timing diagram is unitless and 10 units are equal to a tick. There are three markers on this timing diagram:

A (1885940) marks the point when sieve configuration began. This corresponds to the beginning of the hsxSetSieve(HSX_SIEVE2, cfg) function.

52 int swsieve(std::list &listA, std::list &listB ) 31 { std::list::iterator idxA, idxB; 34 idxA = listA.begin(); idxB = listB.begin();

while( (idxA != listA.end()) && (idxB != listB.end()) ) { 39 if (*idxA == *idxB) { volatile int j = *idxA; // HIT!! #ifdef DEBUG 44 iprintf("HIT\t: 0x%X\n", j); #endif

idxA++; idxB++; 49 } else if (*idxA < *idxB) { idxA++; } 54 else { idxB++; } } 59

return EXIT_SUCCESS; } Listing 6.1: Sieve software kernel

int hwsieve(std::list &listA, std::list &listB ) 64 { hsxSieveConfig cfg; 66 cfg.conf.bits.mode = HSX_SIEVE_AND; hsxSetSieve(HSX_SIEVE2 , cfg);

std::list::iterator idxA, idxB; idxA = listA.begin(); 71 idxB = listB.begin();

// data pump while( (idxA != listA.end()) && (idxB != listB.end()) ) { 76 hsxPutData(HSX_SIEVE2 , *idxA++); hsxPutData(HSX_SIEVE3 , *idxB++); }

// data pull 81 for (int i=0; i<(LIST_MAX/10); ++i) { // HIT!! volatile int j = hsxGetData(HSX_SIEVE2); 86 #ifdef DEBUG iprintf("HIT\t: 0x%X\n", j); #endif } 91 return EXIT_SUCCESS; } Listing 6.2: Sieve hardware kernel

int sieve()
96 {
97   std::list<int> listA, listB;

     // prefill lists (reconstructed: the original fill loop was lost in
     // extraction; filling both lists with random values is assumed, as
     // described in the text)
     for (int i=0; i<LIST_MAX; ++i) {
         listA.push_back(getrand() & 0x0000FFFF);
         listB.push_back(getrand() & 0x0000FFFF);
     }

     // plant about 10% hits
107  for (int i=0; i<(LIST_MAX/10); ++i) {
         int j = getrand() & 0x0000FFFF;
         listA.push_back(j);
         listB.push_back(j);
112  }

     // sort lists
     listA.sort();
     listB.sort();
117
     // do sieve
     int ticks;
     int memtick;
122
     // SOFTWARE SIEVE
     iprintf("Software Sieve\n");
     memtick = getmemtick();
     ticks = gettick();
     swsieve(listA, listB);
127  ticks = gettick() - ticks;
     memtick = getmemtick() - memtick;
     iprintf("%d swticks\n", ticks);
     iprintf("%d swmemticks\n", memtick);
132
     // HARDWARE SIEVE
     iprintf("Hardware Sieve\n");
     memtick = getmemtick();
     ticks = gettick();
     hwsieve(listA, listB);
137  ticks = gettick() - ticks;
     memtick = getmemtick() - memtick;
     iprintf("%d hwticks\n", ticks);
     iprintf("%d hwmemticks\n", memtick);
142
     return EXIT_SUCCESS;
}
Listing 6.3: Sieve kernel

Figure 6.6: Sieve software pumped timing diagram

Software Sieve
HIT : 0x116E
HIT : 0x22F1
HIT : 0x30B1
HIT : 0x8601
HIT : 0xD2C6
30874 ticks
1859 memticks
Hardware Sieve
HIT : 0x116E
HIT : 0x22F1
HIT : 0x30B1
HIT : 0x8601
HIT : 0xD2C6
28080 ticks
1769 memticks
Listing 6.4: Sieve kernel output (debug)

Software Sieve
2330 swticks
198 swmemticks
Hardware Sieve
1454 hwticks
141 hwmemticks
Listing 6.5: Sieve kernel output (non-debug)

The parts before marker A were considered the function call overhead; similar to the streamer, the function resets and flushes the sieve before configuring it.

B (1886800) marks the point when sieve configuration was completed. This corresponds to the end of the hsxSetSieve(HSX_SIEVE2, cfg) function. The sieve is enabled at this point. After this, data from the two lists was pumped into the sieve unit via software. Each assertion of the xwb_stb_i signal (marker 1) corresponds to writing a value to the sieve input. As both input channels are pumped by the software, 66 data items are written.

C (1899680) marks the point when the sieve operation was completed. The parts after this were considered the function return overhead. The wre_i assertions (markers 2, 3, 4) show the hits written into the output buffer. In this case, there were 3 hits, which is correct for a list size of 33 items (30 dummies and 3 planted hits) as prepared by the software. The last three xwb_stb_i assertions were used to pull the items off the sieve output.

Reading from the timing diagram, the complete operation took TAC = 1374 ticks. The configuration operation took TAB = 86 ticks (6.25%), while the data pump and pull operation took TBC = 1288 ticks (93.75%). The total tick count reported was 1454, so the overhead, T+ = 80 ticks (+5.8%), was used by the function call and return. Both the configuration and software overheads are minor, as the bulk of the time is taken up by the sieve operation itself. According to the non-debug output, the software sieve required 2330 ticks. Assuming a similar function call and return overhead, the software sieve consumes TSW = 2250 ticks. The hardware accelerated operation therefore gives a speed-up factor of:

TSW / TAC = 2250 / 1374 = 1.64

Although this is an acceleration, it is not very significant. Hence, the sieve unit would not benefit the host processor very much in standalone operation. In this configuration, the host processor is kept busy pumping data to the sieve unit, so the performance bottleneck is governed by the host processor software pumping the data. Once this bottleneck is removed, the results are much better, as shown in section 6.3.4.

6.3.3 Kernel Software Pump Performance

Figure 6.7 shows the simulation results of the software and hardware tick count for sieve operations on different data sets. Again, each data set was randomly prepared and the simulation was sampled 50 times for each data set size. The calculated averages and standard deviations are used to plot the curves.

Figure 6.7: Sieve software pumped simulation (software and hardware tick counts, and speed-up, against data set size n)

Graphically, the software and hardware timings were both linearly proportional to the size of the data sets, while the speed-up showed diminishing returns. Fitting straight lines to the software and hardware curves gives the following equations:

Vsw(N) = 78.3 N + 90.1 (6.1)

Vhw(N) = 45.9 N + 162.7 (6.2)

As before, the intercept of each line represents the fixed overhead cost while the slope gives the cost per data item. The software overhead of 90 is very close to the timing estimate of 80 ticks. The hardware overhead of 163 is similar to the timing estimate of TAB + T+ = 166 ticks.

Graphically, the speed-up factor of about 1.6 for N = 30 is similar to the timing estimate of 1.64. It is also clear that, for large data sets, the factor reaches a plateau at about 1.7. Once again, the speed-up factor is not very impressive as it is entirely limited by the software pump. Even for a sufficiently large data set, the speed-up factor merely resolves to the ratio of the two slopes:

Vup = Vsw(N) / Vhw(N) = 78.3 / 45.9 = 1.71

6.3.4 Kernel Hardware Pipe Timing

As clearly evidenced by the earlier results, the host processor pump is a bottleneck for supplying data to the sieve. It is possible to use one streamer unit to pump one data stream, alongside software pumping the second data stream, to speed things up. However, by using two streamer units to pump data into both sieve channels, the software pump bottleneck is removed entirely. Listing 6.6 shows the modified kernel that used two streamer units to pipe data into the sieve unit. In this case, unlike in Listing 6.2, the status of the output buffer needed to be checked for valid outputs. This was done by polling here, but it could also be done using interrupts. Listing 6.7 shows the output from the modified kernel software. As can be clearly seen, the software operation consumed a similar amount of time but the hardware operation was much faster than the one in section 6.3.2. In this case, the data set size was N = 30. Figure 6.8 shows the results of running the sieve with two streamer inputs. There are six markers on this timing diagram:

A (1869080) marks the beginning of the configuration overhead. This corresponds to the hsxSetSieve(HSX_SIEVE2) function. Everything before this is considered the function call overhead.

B (1871050) marks the point when HSX_STREAM2 configuration starts. This corresponds to the hsxSetStream(HSX_STREAM2, cfgA) function.

C (1872060) marks the point when HSX_STREAM2 configuration ends. At this point, the streamer unit began to read data in from memory and piped it directly into the sieve unit. The number of data items read can be counted from the dwb_stb_o assertions (marker 1). This point can be considered the starting point of the operation, as data begins to stream in here.

D (1873910) marks the point when HSX_STREAM3 configuration starts. This corresponds to the hsxSetStream(HSX_STREAM3, cfgB) function.

Figure 6.8: Sieve with hardware piped timing diagram

E (1875100) marks the point when HSX_STREAM3 configuration ends. At this point, all the configuration overhead ends. The streamer unit began to supply the second stream of information to the sieve unit (marker 2). The sieve unit now had enough information to begin the intersection operation.

F (1877300) marks the point when the sieve operation is completed. The number of hits can be counted from the wre_i assertions (markers 3, 4, 5). In this case, there were 3 hits, which is the correct number for a list size of 30. The distribution of the hits indicates that the intersections occurred towards the end of the lists, but they could occur anywhere.

A stall can be observed around E as the buffers become full. By default, the buffer depths were configured to 15 levels. As there was both an output buffer in the streamer and an input buffer in the sieve, the stall happened after 30 data items were read; at that point there were 15 data items sitting in each input and output buffer.

Reading the values directly from the timing diagram, the total time for completing the whole hardware piped operation was TAF = 822 ticks. The configuration time was TAC = 298 ticks (36.3%) and the actual sieve operation consumed TCF = 524 ticks (63.7%). The actual number of ticks reported by the terminal output was THW = 938 ticks, so the additional T+ = 116 ticks (+14.1%) are attributed to the function call and return overhead. Using a similar value for the software sieve, the software operation took TSW = 2398 ticks. Using this result, the hardware accelerated operation achieved a speed-up factor of 2.9. This is a significant acceleration and a promising result when compared to the earlier software-pumped estimate.

A point of note is that the overhead makes up a much larger portion of the hardware accelerated operation, which indicates that the actual effective search operation is much faster than before. As the number of data elements increases, the configuration overhead will ultimately become negligible. For most purposes, these overheads can be considered constant while the operation time is proportional to the number of elements in the list. In other trials, it was discovered that the .size() function is dependent on N and can increase the hardware overhead to 1400 ticks. Therefore, the setup time can also be reduced if the method for extracting the pointers and offsets is made less convoluted.

6.3.5 Kernel Hardware Pipe Performance

Figure 6.9 shows the simulation results of the software and hardware tick count for sieve operations on different data sets, as before. Immediately noticeable are the slope of the hardware line and the range of the speed-up, both of which have changed tremendously.

Figure 6.9: Sieve with streamer piped simulation (software and hardware tick counts, and speed-up, against data set size n)

Fitting straight lines to the results gives the following equations to describe the software and hardware sieves:

Vsw(N) = 78.4 N + 77.9 (6.3)

Vhw(N) = 15.0 N + 520.2 (6.4)

Equation 6.3 shows an insignificant change from the equation 6.1 software simulation results, as expected. The slopes are the same and the intercepts are very similar; the timing estimate of 116 ticks for function call and return is close enough to the intercept of 78. Equation 6.4 shows a significant difference from equation 6.2. The intercept of 520 is similar enough to the timing estimate of TAC + T+ = 414 ticks for the hardware operation overhead, considering the estimation method. The hardware configuration overhead has increased significantly because the streamer unit configurations require a significant time to extract the configuration options and push these values into the stack; this is evident when comparing the timing diagrams. The timing estimate of 2.9 for the speed-up factor corresponds to the speed-up shown on the graph at N = 30. Once again, the speed-up factor shows diminishing returns, but for a sufficiently large data set the hardware piped speed-up factor resolves to the ratio of the two slopes:

Vup = Vsw(N) / Vhw(N) = 78.4 / 15.0 = 5.23

This is a substantial speed-up value. Therefore, it can safely be said that the hardware piped sieve provides a very significant acceleration over the pure software operation.

6.4 Conclusion

A sieve unit designed as described can evidently be used as an accelerator unit to offload the filtering operation from the host processor and accelerate result collation. The speed-up of a sieve unit is 1.71 when used standalone, but it gives a Vup of 5.2 when combined with the hardware streamer unit under ideal conditions. In this situation, the performance of the sieve unit is bound by the performance of the streamer unit. However, this analysis considers only the hardware versus software trade-offs when intersecting two lists. For a more complicated operation with multiple lists, the hardware speed-up could be greater, as multiple hardware sieves can operate in parallel and in cascade without any significant problems. The sieve unit does not consume any external memory bandwidth as it does not deal with the data set directly. This contributes to a larger acceleration when compared to software methods that need to access memory regularly. The maximum internal bandwidth of each sieve channel is 1.6 Gbps, for a combined total of 3.2 Gbps per unit at 100 MHz.

62 int hwsieve(std::list &listA, std::list &listB ) 64 { // configure sieve hsxSieveConfig cfg; 67 cfg.conf.bits.mode = HSX_SIEVE_AND; hsxSetSieve(HSX_SIEVE2 , cfg);

// configure streamers std::list::iterator node; 72 hsxStreamConfig cfgA, cfgB;

cfgA.conf.bits.mode = HSX_STREAM_PIPE; cfgA.node = (int) &*listA.begin()._M_node; // node base cfgA.next = (int) &node._M_node->_M_next; // next offset 77 cfgA.data = (int) &((std::_List_node *)node._M_node)->_M_data; cfgA.size = LIST_MAX + LIST_MAX/10; // listA.size();

cfgB.conf.bits.mode = HSX_STREAM_PIPE; cfgB.node = (int) &*listB.begin()._M_node; // node base 82 cfgB.next = (int) &node._M_node->_M_next; // next offset cfgB.data = (int) &((std::_List_node *)node._M_node)->_M_data; cfgB.size = LIST_MAX + LIST_MAX/10; //listB.size();

hsxSetStream(HSX_STREAM2 , cfgA); 87 hsxSetStream(HSX_STREAM3 , cfgB);

// data pull for (int i=0; i<(LIST_MAX/10); ++i) { 92 while (!(hsxGetConf(HSX_SIEVE2) & (1<<5))); // wait for result // HIT!! volatile int j = hsxGetData(HSX_SIEVE2);

#ifdef DEBUG 97 iprintf("HIT\t: 0x%X\n", j); #endif }

return EXIT_SUCCESS; 102 } Listing 6.6: Hardware streamer-sieve kernel

Software Combi
2514 swticks
198 swmemticks
Hardware Combi
938 hwticks
189 hwmemticks
Listing 6.7: Streamer-sieve kernel output

CHAPTER 7

Chaser Unit

The chaser unit operates on the key search stage of the search pipeline. It can be configured to work with different data structures and applications. Functional and timing results of the chaser simulation show that it can accelerate multiple key searches by up to 3.43 times when compared against a pure software operation.

7.1 Introduction

The final accelerator unit is the chaser unit. The very first step of any search pipeline usually begins with a key search. This can be considered the most common search operation, as it is performed during both primary and secondary search operations; in primary search operations it is the only search operation performed. It is therefore a common computational task, and a considerable number of applications would benefit if it were accelerated in hardware. The task of chasing down a search key involves a few operations: loading data from memory, comparing it against the search key and, based on the result of the comparison, deciding what action to take next. At a machine level, this is a fairly mundane task that does not exploit the full power of a microprocessor but unnecessarily consumes valuable computational resources. This presents an excellent opportunity for offloading the operation to free up the processor for other compute intensive tasks, without compromising on raw performance.

7.1.1 Design Considerations

As mentioned previously, keys are often stored in a tree structure, so it is evident that a tree traversal algorithm would be used to search them. In many implementations, travelling down a branch and back up again involves remembering and recalling where the algorithm has been, which is essentially similar to pushing down and popping back up the procedural call stack.

Hence, a tree traversal algorithm would benefit from a stack machine architecture, which is naturally suited to efficient stack operations. The fact that the popular SQLite database engine uses a stack-based virtual machine [Hip07] to process SQL queries lends some credence to this idea. This formed the initial idea for the hardware chaser: a simple dual-stack processor that could be programmed with a few primitive operations. The basic operations needed were a memory load, a compare operation and a conditional operation. The top of stack holds a pointer to the current data node. As the device steps down the tree, the stack is pushed and the new node pointer is loaded from memory into the top of stack; when going back up the tree, the pointers merely need to be popped off the stack. The design had the distinct advantage of keeping previously loaded pointers in the stack, so they would not need to be recalculated nor reloaded from memory, saving both computational and bandwidth resources. The stack machine was also modest in its use of resources, as it merely needed a small memory block and a simple ALU.

However, after spending some time on this design, it was abandoned. It was found that a key search is more akin to a list traversal than a tree traversal. Any search that involves traversing an entire tree is better optimised by reorganising the data differently; the whole point of using a tree is to eliminate entire branches with every step of the traversal. There is no need to move back up the tree, and the path traced through the tree is one way only. As a result, it was possible to redesign the chaser unit in a simpler form.

7.2 Chaser Architecture

The function of the chaser is to process a data structure and extract part of it as a result. Figure 7.1 illustrates an abstract level view of a chaser data flow. Like the streamer, the chaser processes a data structure, not the data values directly, and extracts the data node that matches a key. To enable it to do this, the chaser unit has four ports: an input port, an output port, a configuration port and a memory port.


Figure 7.1: Chaser data flow

Figure 7.2 illustrates the structural view of the chaser. The memory port is connected directly to data memory and used to read in the data structure. The input, output and configuration ports are only accessible over the accelerator bus. The input port is used to write the primary key into the appropriate configuration register. The output port is used to retrieve results from the chaser. The configuration port is used to write values into the configuration stack.


Figure 7.2: Chaser unit block

7.2.1 Configuration

The software library hsx/chase.hh provides several software functions to access and configure the chaser unit. There are four chaser channels defined in hsx/types.hh as HSX_CHASE0 through to HSX_CHASE3. These identifiers specify the exact chaser channel to access on the accelerator bus. Like the streamer unit, the chaser holds its configuration in a stack. The only configuration register not accessed through the stack is the key register, PKEY, which is written separately from the rest of the configuration. This separates the search key from the data set configuration, which simplifies the chaser for multi-key searches on a single data set. Figure 7.3 lists the registers in the configuration stack; it works the same way as the streamer configuration stack described in section 5.2.1. All the details are managed by the hsxSetChase() function. To configure the chaser unit, the values need to be written in order: NODE, DATA, EQCC, LTCC, GTCC, CONF. A hedged configuration sketch for a plain tree node follows the register descriptions below.

NODE contains the base pointer. This will typically contain the pointer to the root node of a tree. All the following offsets are calculated as a positive offset from this base pointer.

Configuration stack registers (top to bottom): CONF (with WOK, ROK, ENA and RST bits), GTCC, LTCC, EQCC, DATA, NODE

Figure 7.3: Chaser configuration stack

DATA contains the key offset. A key will be loaded from this offset from the node pointer and compared with the primary key to decide the next operation. The operation performed will depend on the values of the following three CC registers.

EQCC contains an offset to a pointer. This offset is used when the loaded value is equal to the primary key value. All offsets are considered positive values from the base pointer. If a negative offset is supplied, this is interpreted as a hit condition and the present base pointer will be pushed into the result output buffer. Therefore, this register will normally be set to the HSX_CHASE_HIT value in software. This register can be set to a branch node offset if the search is looking for a ‘less than’ or ‘greater than’ key instead. If the pointer is a NULL pointer, the search will stop.

LTCC contains an offset to a pointer. This offset is used when the loaded key value is less than the primary key value. This generally means that the primary key exists in the right branch of the present node. The new node pointer will then be loaded from this offset. If a negative offset is supplied, this is interpreted as a hit condition and the present base pointer will be pushed into the result output buffer. If this value is a NULL pointer, the search will stop.

GTCC contains an offset to a pointer. This offset is used when the loaded key value is greater than the primary key value. This generally means that the primary key exists in the left branch of the present node. The new node pointer will then be loaded from this offset. If a negative offset is supplied, this is interpreted as a hit condition and the present base pointer will be pushed into the result output buffer. If this value is a NULL pointer, the search will stop.

CONF The only register that can be read on the configuration bus is the CONF register, which also functions as a status register. Figure 7.3 lists the bits of the configu- ration register. Most of it works the same as the configuration registers for the other accelerator units.
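The listings in this chapter extract these offsets from STL internals, but for an application-defined tree the configuration is more direct. The sketch below assumes a simple, hypothetical node layout and uses offsetof to fill the fields; the hsxChaseConfig field names and HSX_CHASE_HIT follow Listing 7.3, while the node layout itself is an assumption.

#include <cstddef>           // offsetof
#include "hsx/chase.hh"
#include "hsx/types.hh"

// A plain binary search tree node laid out in external memory (assumed layout).
struct Node {
    int   key;
    Node *left;
    Node *right;
};

// Configure chaser channel 2 to walk a tree rooted at 'root' (32-bit target assumed).
void chase_setup(Node *root)
{
    hsxChaseConfig cfg;
    cfg.node = (int) root;                    // NODE: base pointer of the root node
    cfg.data = offsetof(Node, key);           // DATA: key offset within a node
    cfg.eqcc = HSX_CHASE_HIT;                 // EQCC: negative offset signals a hit
    cfg.gtcc = offsetof(Node, left);          // GTCC: stored key > search key, go left
    cfg.ltcc = offsetof(Node, right);         // LTCC: stored key < search key, go right
    hsxSetChase(HSX_CHASE2, cfg);
}

Once configured, keys are supplied with hsxPutData(HSX_CHASE2, key) and matching node pointers are collected with hsxGetData(HSX_CHASE2), exactly as in Listing 7.3.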

7.2.2 Operation


Figure 7.4: Chaser machine states

Figure 7.4 shows the internal machine states of the chaser. It has a number of states, each capable of running in one clock cycle:

IDLE is the default state and is entered whenever the accelerator is reset by hardware or software. In this state, the registers from the configuration stack are copied into internal machine registers.

NULL is the state where the node pointer is checked for a NULL pointer. If a NULL pointer is detected, the machine returns to the IDLE state. Otherwise, the base pointer and data offset are added and written into the internal node pointer, which now points to the data value.

DATA is the state where the actual data value is read from memory via the memory port. The necessary memory signals are asserted and deasserted to complete a memory transfer. The data is read into a holding data register.

COMP is the state where the loaded data is compared with the key. The result of the comparison is stored in a condition register. This is given its own state so that it can share the same ALU as the other memory calculations.

NEXT is the state where the next pointer is calculated. The appropriate offset is selected based on the result of the earlier comparison. If the offset is a negative value, the chase is completed and the base pointer is written into the output buffer. Otherwise, the offset is then added to the base pointer and written into the internal node pointer, which now points to the next node pointer.

BASE is the state where the pointer to the next node is loaded from memory. Again, the necessary memory signals are asserted to complete a memory transfer. The pointer is loaded as the new base pointer.

The main loop of the machine runs through NULL, DATA, COMP, NEXT and BASE. This makes the theoretical maximum internal bandwidth 640 Mbps at a 100 MHz core speed. However, the loop loads both a data word and a pointer during each iteration, so the theoretical maximum external bandwidth consumption at 100 MHz is 1.28 Gbps.
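These figures follow from the five-state loop, assuming 32-bit words (stated here as an assumption, but consistent with the quoted numbers):

100 MHz / 5 states x 32 bits = 640 Mbps of results (internal)
2 loads x (100 MHz / 5 states) x 32 bits = 1.28 Gbps of memory reads (external)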

Although primarily designed for tree traversal, the chaser unit can also be used to traverse other linked data structures. This can be done by configuring the less-than and greater-than pointer offsets to point to the appropriate next node in the link. When configured this way, a chaser can be used to search for a primary key stored within a linked list or other structure. Alternatively, software can be used to translate other data structures into trees for processing.
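For example, a singly linked list can be described by pointing both branch offsets at the same next-node field, turning the chase into a linear scan for an exact match. As before, this is only a sketch under an assumed node layout.

#include <cstddef>
#include "hsx/chase.hh"
#include "hsx/types.hh"

// A singly linked list node (assumed layout).
struct ListNode {
    int       key;
    ListNode *next;
};

// Configure chaser channel 3 for an exact-match scan of a linked list.
void chase_list_setup(ListNode *head)
{
    hsxChaseConfig cfg;
    cfg.node = (int) head;                     // start of the list
    cfg.data = offsetof(ListNode, key);        // where the key lives in each node
    cfg.eqcc = HSX_CHASE_HIT;                  // equal: report the node
    cfg.ltcc = offsetof(ListNode, next);       // less than: step to the next node
    cfg.gtcc = offsetof(ListNode, next);       // greater than: also step to the next node
    hsxSetChase(HSX_CHASE3, cfg);
}

The search stops naturally when a NULL next pointer is loaded, as described for the NULL state above.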

7.3 Kernel Simulation Results

As before, simulation was used to measure the performance of a chaser. A chaser kernel was written to perform a primary key search in both software and hardware. Listing 7.4 shows the main chaser kernel. The kernel first filled an input tree (lines 82–86) with a number of different random values. As before, the size of the tree was defined at compile time. A key value was then inserted into the tree (line 87) to ensure that there is at least a single result hit. The software and hardware kernels were compiled and run, with the debug output inspected visually to confirm functionality.

7.3.1 Kernel Functional Simulation

Listing 7.1 shows the debug output. Visual inspection confirms that the software and hardware chasers both work and produce the same result. A primary key of 0x1690 is found at the node located at 0x80000680, which is in the heap memory space. This shows that the hardware chaser is capable of performing the same task as the host processor software. Hence, it can be used to offload the primary key search work from the host processor.

Software Chaser
PKEY : 0x1690
FIND : 0x80000680
11844 swticks
740 swmemticks
Hardware Chaser
PKEY : 0x1690
FIND : 0x80000680
11918 hwticks
758 hwmemticks
Listing 7.1: Chaser simulation output (debug)

Listing 7.3 shows the hardware chaser kernel. The method for extracting the configuration parameters is rather complicated owing to the structure of the C++ STL set library: the internal variable used to represent the red-black tree is a private member of the set class, and there is no easy way to access a private member directly from an external application.

Hence, the pointer to the root of the red-black tree was extracted (lines 58–59) using manual offsets, which were obtained by studying the STL tree header to determine where the root pointer is stored. The next few lines extract the relevant pointer offsets for a specific data structure node, depending on the result of the comparison operation. Listing 7.2 shows the software chaser kernel. It simply calls the built-in STL tree search function, which searches the red-black tree in O(log N) time, and returns the pointer to the base node, which is the same result that the hardware operation returns. The pointer value extracted using either method can be used to cast a node pointer in the application to access the data.

7.3.2 Kernel Single Key Timing

Listing 7.5 shows the output from the simulation of chasing a primary key in a red-black tree of size N = 50 elements. With a large N, the simulation takes a long time because building the tree is O(N log N) bound; with a small N, the period of interest is very short because the search completes in O(log N) time. This tree size was chosen as a trade-off between simulation time and observability. The listing shows the total hardware and software timing results without any debug output. There are a few extra memory accesses for the hardware operation, which are largely used during the hardware configuration. Figure 7.5 shows only the timing diagram of the important signals in the hardware portion of the above simulation. As before, the time values are unitless, but 10 units are equivalent to a clock tick. There are four visible markers on the diagram:

A (1590631) marks the beginning of the hardware configuration overhead. This is when the hsxPutData(HSX_CHASE2, pkey) operation was called. The parts before this are considered the function call and return overhead.

B (1591781) marks the end of the hardware configuration overhead. This is when the hsxSetChase(HSX_CHASE2, cfg) function was completed. The actual chaser operation began to run immediately after this, as indicated by the activity on the dwb_stb_o signal (marker 1).

C (1592401) marks the point when the primary key was found by the chaser, as indicated by the rok_o assertion (marker 2). Although the key was found, the host processor did not yet realise it; it took a few more ticks for the host processor to check the status of the output buffer and retrieve the result, as indicated by the rde_i assertion (marker 3).

70 int swchase(std::set &setA, int pkey) 31 { #ifdef DEBUG iprintf("PKEY\t: 0x%X\n", pkey); #endif 35

volatile int j = (int)&*setA.find(pkey)._M_node;

#ifdef DEBUG iprintf("FIND\t: 0x%X\n", j); 40 #endif

return EXIT_SUCCESS; } Listing 7.2: Software chaser kernel

int hwchase(std::set &setA, int pkey) 46 { #ifdef DEBUG iprintf("PKEY\t: 0x%X\n", pkey); #endif 51 // Configure the hardware chaser. hsxChaseConfig cfg; std::set::iterator node;

int *tree = (int *)&setA+2; // Manual tree header offset. 56 cfg.node = (int) *tree; // Extract pointer to tree ROOT. cfg.eqcc = HSX_CHASE_HIT; // Hit when memory = key. cfg.gtcc = (int) &node._M_node->_M_left; // Offset when memory is > key. cfg.ltcc = (int) &node._M_node->_M_right; // Offset when memory is < key. cfg.data = (int) &((std::_Rb_tree_node*)node._M_node)->_M_value_field ; // Data offset 61

hsxSetChase(HSX_CHASE2 , cfg);

hsxPutData(HSX_CHASE2 , pkey); 66 while (!(hsxGetConf(HSX_CHASE2) & (1<<2))); // wait for result volatile int j = hsxGetData(HSX_CHASE2);

#ifdef DEBUG iprintf("FIND\t: 0x%X\n", j); 71 #endif

return EXIT_SUCCESS; } Listing 7.3: Hardware chaser kernel

int chaser()
77 {
     std::set<int> setA;
     int pkey = getrand() & 0x0000FFFF;
81
     // prefill tree (reconstructed: the original fill loop was lost in
     // extraction; random insertions are assumed, as described in the text)
     for (int i=0; i<LIST_MAX; ++i) {
         setA.insert(getrand() & 0x0000FFFF);
     }
87   setA.insert(pkey);        // plant the search key (line 87 in the text)

     // sort lists
     //listA.sort();
91
     // do sieve
     int ticks;
     int memtick;

     // SOFTWARE CHASE
96   iprintf("Software Chaser\n");
     memtick = getmemtick();
     ticks = gettick();
     swchase(setA, pkey);
     ticks = gettick() - ticks;
101  memtick = getmemtick() - memtick;
     iprintf("%d swticks\n", ticks);
     iprintf("%d swmemticks\n", memtick);

     // HARDWARE CHASE
106  iprintf("Hardware Chaser\n");
     memtick = getmemtick();
     ticks = gettick();
     hwchase(setA, pkey);
     ticks = gettick() - ticks;
111  memtick = getmemtick() - memtick;
     iprintf("%d hwticks\n", ticks);
     iprintf("%d hwmemticks\n", memtick);

     return EXIT_SUCCESS;
116 }
Listing 7.4: Chaser kernel

Figure 7.5: Single key chaser timing diagram

Software Chaser
420 swticks
31 swmemticks
Hardware Chaser
460 hwticks
47 hwmemticks
Listing 7.5: Chaser simulation output

D (1592571) marks the end of the hardware chase kernel. At this point, the results were retrieved and the operation was completed. The hardware kernel then returned control to the main kernel.

The total operation took TAD = 194 ticks. Of this, TBD = 79 ticks (40.7%) was used in the actual chase operation, while the balance, TAB = 115 ticks (59.3%), was used by the hardware configuration overhead. A significantly large proportion of the operation time was therefore spent configuring the hardware parameters rather than performing the chase.

From the simulation output, another T+ = 266 ticks (+137.1%) is used by the function call and return overhead, consumed before A and after D. Assuming a similar function call overhead, the software operation completed in TSW = 154 ticks. From this single timing simulation, the timing estimate of the speed-up is:

TSW / TAD = 154 / 194 = 0.79

If the hardware configuration overhead is discounted as a fixed cost, the hardware chase operation speed-up, TSW / TBD = 154 / 79 = 1.95, is much better. This is a significant acceleration for a very basic search function.

7.3.3 Kernel Single Key Performance

Figure 7.6 shows the simulation results for the chaser software and hardware operations. The data set sizes chosen were between 10 and 300, giving tree depths of between 4 and 9 levels. The time taken for simulation grows quickly with the data set size because the insertion process is O(N lg N) bound; therefore, the data set was kept to a moderate size to keep simulation times reasonable. As the performance of the search algorithm is O(lg N) limited, the graphs are plotted against lg N. The performance of a single key search in any one trial depends greatly on the level at which the key sits in the tree. As the data set is prepared randomly, the vertical error bars on each data point reflect the standard deviation of the values obtained across 50 trials. Fitting straight lines to the points gives the following two equations to describe them:

1 Nomenclature: lg = log2

Figure 7.6: Chaser simulation (software and hardware tick counts, and speed-up, against data set size, lg n)

Csw,s(N) = 22.9 lg N + 280.6 (7.1)

Chw,s(N) = 9.04 lg N + 390.2 (7.2)

Equation 7.1 describes the relationship for the software operation. The intercept of 281 ticks agrees with the timing estimate of 266 ticks for the function call and return overhead. The single key chase is a very different operation from the other accelerator devices because a search can find a key very quickly, owing to the lg N nature of the search process; there is therefore a fairly large possible range for the intercept point.

Csw(N) Cup = = 2.54 Chw(N)

The crossover point in the graph is at about lg N = 8, which is N = 256. This means that for any tree larger than 256 elements the hardware chaser affords an acceleration, while the software method remains faster for small trees. This will still prove useful in everyday applications: most indices will benefit, because any data set worth accelerating will almost certainly be larger than 256 entries.
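The crossover figure follows directly from equating the two fitted lines (7.1) and (7.2):

22.9 lg N + 280.6 = 9.04 lg N + 390.2
lg N = (390.2 - 280.6) / (22.9 - 9.04) = 7.9, so N is approximately 2^8 = 256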

7.3.4 Kernel Multi Key Timing

The chaser has a more interesting mode, in which it can provide better acceleration than before: it can be configured to perform multiple key searches on the same data structure. In this mode, a dedicated chaser can be allocated to each data structure that

needs to be searched, simplifying the configuration process. This is useful for a structure that is searched frequently, such as an operating system process table. Listings 7.6 and 7.7 show a multiple key search kernel. In each case, the number of keys searched is N/10 (integer division) for a data set of size N. The thing to note in Listing 7.7 is the order in which the hardware is configured: the configuration registers are written and the chaser is enabled before the keys to be searched are loaded. If the order were swapped, the keys already in the input buffer would be flushed during the configuration process. Listing 7.8 shows the debug output of the simulation. The results were verified through visual inspection and shown to be the same for both the software and hardware methods. Figure 7.7 shows the results of one simulation trial for N = 30 and 3 key searches. There are six markers on the timing diagram:

A (1225782) marks the start of the hsxChaseSet() function call. The configuration parameters were all extracted before this point. Since the parameters are the same as that for a single search, it can be assumed to consume the same amount of time.

B (1226882) marks the end of the hsxChaseSet() function call. At this point, the chaser is configured but has not begun chasing the key as seen by the absence of dwb_stb_o signal assertions.

C (1227112) marks the time when the first key is written into the chaser. Almost immediately, the loading of the key is signalled by the rde_i signal. The chaser then began operating soon after this, as seen by the multiple dwb_stb_o assertions (marker 1).

D (1227462) marks the time when all of the keys have been written into the chaser, as indicated by the wre_i signal (marker 2). This operation overlaps the actual chase operation, so the keys to be searched can all be safely written in advance provided there are fewer of them than the depth of the input buffer. Otherwise, the multiple key search needs to be broken up into several phases.

E (1228202) marks the time when the last key was loaded, as indicated by the rde_i signal (marker 3). The time between C and E can be used to calculate the average key search time.

F (1228862) marks the time when all the results have been retrieved and the multiple key chase operation is completed.

76 int swchase(std::set &setA, std::vector &pkey ) 32 { for (int i=0; i<(LIST_MAX/20); ++i) { 35 volatile int j = (int)&*setA.find(pkey[i])._M_node; #ifdef DEBUG iprintf("FIND\t: 0x%X\n", j); #endif } 40

return EXIT_SUCCESS; } Listing 7.6: Software multi-key chaser kernel

int hwchase(std::set &setA, std::vector &pkey ) 45 { 46 // Configure the hardware chaser. hsxChaseConfig cfg; std::set::iterator node;

int *tree = (int *)&setA+2; // Manual tree header offset. 51 cfg.node = (int) *tree; // Extract pointer to tree ROOT. cfg.eqcc = HSX_CHASE_HIT; // Hit when memory = key. cfg.gtcc = (int) &node._M_node->_M_left; // Offset when memory is > key. cfg.ltcc = (int) &node._M_node->_M_right; // Offset when memory is < key. cfg.data = (int) &((std::_Rb_tree_node*)node._M_node)->_M_value_field ; // Data offset 56

hsxSetChase(HSX_CHASE2 , cfg);

for (int i=0; i<(LIST_MAX/20); ++i) // write multiple keys hsxPutData(HSX_CHASE2 , pkey[i]); 61

for (int i=0; i<(LIST_MAX/20); ++i) { while (!(hsxGetConf(HSX_CHASE2) & (1<<2))); // wait for result volatile int j = hsxGetData(HSX_CHASE2); 66 #ifdef DEBUG iprintf("FIND\t: 0x%X\n", j); #endif } 71 return EXIT_SUCCESS; } Listing 7.7: Hardware multi-key chaser kernel

Figure 7.7: Multiple key chase kernel timing

Software Chaser
FIND : 0x80000860
FIND : 0x80000888
FIND : 0x800008B0
20652 swticks
1198 swmemticks
Hardware Chaser
FIND : 0x80000860
FIND : 0x80000888
FIND : 0x800008B0
19974 hwticks
1198 hwmemticks
Listing 7.8: Chaser multi-key simulation output (debug)

Software Chaser
886 swticks
75 swmemticks
Hardware Chaser
574 hwticks
71 hwmemticks
Listing 7.9: Chaser multi-key simulation output

The complete hardware operation took TAF = 308 ticks in this one trial. Of this, TAC = 133 ticks (43.2%) was hardware configuration overhead and the balance, TCF = 175 ticks (56.8%), was consumed by the actual chase operation. The key searches spanned TCE = 109 ticks, roughly 36 ticks per key, which is fairly close to the single key timing estimate. From the timing figures in Listing 7.9, the function call and return overhead is estimated to be T+ = 266 ticks (+86.4%). Assuming a similar function call and return overhead, the software operation takes TSW = 620 ticks, giving an estimated speed-up factor of 2.0. However, these estimates are only an indicator as they reflect a single trial; a more accurate value is estimated in the next section.

7.3.5 Kernel Multi Key Performance

Figure 7.8 shows the results of a series of multiple key searches, each repeated 50 times. In each case, the graphs are plotted against the number of searches performed, which is set to 10% of the data set size. The performance is linearly related to the number of keys searched, with each key taking a slightly different amount of time to find. Fitting straight lines to the points gives the following two equations to describe them:

Csw,m(N) = 285.2 N + 4.6 (7.3)

Chw,m(N) = 83.2 N + 318.2 (7.4)

Figure 7.8: Chaser simulation (multi-key; tick counts and speed-up against the number of searches n)

Equation 7.3 describes the relationship for the software operation and equation 7.4 the relationship for the hardware operation. The intercept of 318 is similar to the timing estimate of TAC + T+ = 399 ticks for the total hardware overhead costs. The speed-up factor of Csw,m(N) / Chw,m(N) = 1.51 at N = 3 is close to the timing estimate of 2.0. However, for a sufficiently large N, the speed-up factor is:

Cup = Csw,m(N) / Chw,m(N) = 285.2 / 83.2 = 3.43

7.4 Conclusion

A primary key search is a common task that needs to be performed for both primary and secondary searches. The hardware chaser unit can be used to both offload and accelerate the primary key search process. However, it only provides a significant saving if the data set to be searched is larger than about N = 256, and the acceleration it provides is O(lg N) bound. If a chaser is used to search multiple keys in the same data structure repeatedly, the acceleration scales with the number of keys to be searched, even though each individual key search is still O(lg N) bound. For a sufficiently large data set, the maximum acceleration factor is Cup = 3.43 for multiple key searches. The maximum external memory bandwidth required by each chaser unit is 1.28 Gbps at 100 MHz.

CHAPTER 8

Memory Interface

As search is primarily memory limited, the cache and memory hierarchy are explored. A special cache that takes structural locality into account, instead of just temporal and spatial locality, is also designed. However, the improvement gained is only 3%, which is not significant enough to warrant its use in the search accelerator unless absolutely necessary.

8.1 Introduction

All the results thus far have been obtained with one underlying assumption: the simulations were all run without the use of any cache memory. All memory accesses were sent out through the memory arbiter to a simulated external memory device, as described in section 5.3. Typical computer architecture practice exploits a memory hierarchy, using cache, to speed up operations. However, the effect of cache memory on search needs to be studied.

8.2 Cache Primer

Search algorithms are typically limited by the number of records that have to be searched through, which translates into the size of the search space N. As main memory is slow, for search algorithms that have to traverse through in-memory data sets, this becomes a major bottleneck, which is usually alleviated by use of cache memory. The performance of existing cache architectures is fairly well understood [Han98, Gen04, van02]. Existing cache memories are designed around two core principles: temporal and spatial locality;

while cache performance is typically regulated by three basic parameters: cache size, line length and associativity [HS89]. Performance is improved by retaining more data within the cache. It is common to find more than half the die area of a modern processor taken up by cache memory, as processor speeds have outpaced memory speeds [FH05]. However, this directly increases cost by increasing chip area, when this valuable space could alternatively be used to increase the functionality of a processor; conversely, a reduction in area could lower its cost.

Moreover, improving search performance by merely increasing cache size is not sustainable, simply because there will always be a significant amount of data stored in main memory. Even if cache sizes reach gigabyte values, primary memory by then would be larger and the data set sizes potentially larger still. Therefore, the search space will always be stored in external memory instead of on-chip cache.

A cache line represents the amount of information read whenever data is fetched from main memory. Longer cache lines affect performance by bringing in larger blocks of data at a time, exploiting spatial locality. This naturally benefits linked data structures because each node generally holds multiple words of information. However, complex data structures may not have their data nodes located in contiguous locations, which reduces the effectiveness of spatial locality.

Associativity works by replicating cache blocks to reduce the problem of a single cache block mapping to multiple areas in main memory (aliasing). This improves the probability of retaining data in the cache. If information is widely scattered across memory, there is less spatial locality to exploit, so higher associativity may help by retaining multiple blocks within the cache. However, the cost increases quickly with associativity because of the replication.

For instruction cache, both temporal and spatial locality are equally important. As instructions are executed sequentially, instructions next to an existing one are likely to be used (spatial locality), and in the case of loops, recently used instructions are likely to be used repeatedly (temporal locality). However, it is less clear that a data cache benefits from the same design features, for the reasons set out below. For search applications, the data space exhibits ephemeral characteristics: data structures are usually traversed in one direction, and once a node is used it is unlikely to be used again, which reduces the effectiveness of temporal locality. For accessing data structures, structural locality may be more important, because once a node is checked against the search key the traversal will normally move on to a child node, which can be located anywhere in data memory.

8.3 Cache Principles

As microprocessor speeds outpace memory speeds, the processor spends more time waiting for data. The present trend in general-purpose microprocessors is to increase the amount of cache to reduce this penalty. There are two problems with this trend. Firstly, increasing cache improves general-purpose performance, but may not help with search operations. Secondly, this strategy is not cost-effective from area, power and efficiency points of view. As cache performance can severely affect the performance of software, how it can help a search accelerator needs to be investigated. To facilitate analysis, design and testing, a parameterisable cache memory block was first designed and tested. The cache uses a pseudo-LRU replacement mechanism for 2-way and 4-way associativity configurations. The associativity, size and line width of the cache are set by conditional defines passed on the command line by the simulation scripts.

[Simulation memory map: a 32-bit RISC CPU with an instruction L1 cache over .text (0x00000000–0x000FFFFF), a data L1 cache over .data (0x00100000–0x001FFFFF), and .heap/.stack in external memory (0x80000000–0x81FFFFFF).]

Figure 8.1: Cache simulation setup

Figure 8.1 illustrates how the caches and memory were set up for simulation. The memory transfers were monitored by simulation scripts and dumped as text, which was then post-processed using text processing tools on the host computer. The memory space was divided into the following main address spaces for easy monitoring:

.text is reserved for read-only instruction memory. This memory can either be implemented as simulated on-chip ROM or off-chip flash. In this case, it was an on-chip device. Therefore, all transfers happen at the fastest possible rate to avoid slow-downs in simulation due to excessive wait times for instructions.

.data is reserved for read-write initialised data memory. This memory was implemented as a small block of on-chip RAM. It mainly holds certain variables, constants, strings and other pre-initialised values.

.heap is reserved for the read-write heap memory. It represents an uninitialised block of external memory. This is where the dynamically allocated data (using malloc()

83 and free()) is located. Entire data structures were stored within this area in- cluding trees, lists and other dynamically linked data structures.

.stack is reserved for the software stack. This was used for function call and return overheads and for passing parameters between functions. Some parts of the data structure may also be stored within this area, such as the structural information of a tree that is located in the heap.

A software kernel was written to test the operation of the cache memory block by performing a key search in a tree. Two types of code were used: a random value search and a repeated value search. The simulation trials were conducted with a different number of loop iterations, defined as ITERS in the code. For each value of ITERS, 50 samples were collected.

Listing 8.1 shows part of the cache simulation construct in Verilog. Lines 210–225 save the external memory contents into a Verilog memory (VMEM) file. Lines 227–230 load the contents from the Verilog memory file back into external memory. These operations simulate the load and save operations on a computer. Constructing each tree takes O(N log N) time, which translates into more than a day of real-world simulation time for such a large tree. Therefore, reusing a saved tree reduces simulation time tremendously.

Listing 8.2 shows the data preparation code used to pre-build the search tree. Lines 39–44 fill a red-black tree with 2^16 records. The resultant data structure has a node that is 6 words in size, which results in a 400kbyte data set. Line 48 triggers the simulation construct that saves the external memory into a Verilog memory (VMEM) file.

Listing 8.3 shows the cache simulation kernel. Line 44 triggers the simulation construct to transfer the pre-built tree structure into the heap. Lines 50 and 62 enable and disable the data cache for simulation, which is used to limit the cache results to pure search code. Lines 52–59 iterate through a red-black tree search a number of times.

    if (dwb_stb_o & dwb_ack_i & dwb_wre_o & (dwb_adr_o[31:16] == 16'h0200)) begin
210    $strobe("SAVE MEMORY");

       fname = $fopen("dopb.vmem");
213    $fdisplayh(fname, "/* Save OPB RAM */");
       // save heap
       for (save = 0; save < 32'h00080000; save = save + 1) begin
          $fdisplayh(fname, "@", save, " ", {rDOPB[save]});
       end
218    // save stack - important!!! as some information is pushed onto the stack by the compiler
       for (save = 32'h07D0000; save < 32'h07EFFFF; save = save + 1) begin
          $fdisplayh(fname, "@", save, " ", {rDOPB[save]});
       end
       $fclose(fname);
223
    end

    if (dwb_stb_o & dwb_ack_i & dwb_wre_o & (dwb_adr_o[31:16] == 16'h0400)) begin
       $strobe("LOAD MEMORY");
228    $readmemh("dopb.vmem", rDOPB);
    end

Listing 8.1: Verilog simulation LOAD/SAVE

   #define NODE_MAX 0x010000 // 64k nodes

31 int main() // works with -O1
   {
      // declare and create a tree
35    std::set<int> *rbtree;
      rbtree = new std::set<int>();

      // pre-fill the tree
      for (int i = 0; i < NODE_MAX; ++i)
40    {
         *hsx::STDO = i;
         rbtree->insert(i << 16);
      }

45    // save/load the tree
      *hsx::STDO = rbtree->size();
      *hsx::SAVE = -1;

      rbtree->clear();
50    *hsx::STDO = rbtree->size();

      *hsx::LOAD = -1;
      *hsx::STDO = rbtree->size();

55    aemb::enableDataCache(); // start the cache test

      // list the tree
      for (std::set<int>::iterator node = rbtree->begin(); node != rbtree->end(); node++)
      {
60       *hsx::STDO = *node;
      }

      aemb::disableDataCache(); // disable the cache test

65    exit(0);
   }

Listing 8.2: Cache tree fill kernel

   int main()
37 {
      // declare and create a tree
39    std::set<int> *rbtree;
      rbtree = new std::set<int>(); // create a rbtree object in the heap

      rbtree->clear(); // !!! do not skip this step
      *hsx::LOAD = -1; // simulator heap load
44
      // search for 10 values
      std::set<int>::iterator node;

      // enable cache
49    aemb::enableDataCache();

      int j = *hsx::PRNG << 16;
      for (int i = 0; i < ITERS; ++i)
      {
         j = *hsx::PRNG << 16; // re-randomise the key (commented out for the repeated search case)
         node = rbtree->find(j);
         *hsx::STDO = *node;
      }
59
      // disable cache
      aemb::disableDataCache();

      exit(0);
64 }

Listing 8.3: Cache simulation kernel

Lines 55–56 are commented out for the repeated search case or left uncommented for the random search case.

[Four plots: instruction cache hit:miss ratio (top row) and data cache hit:miss ratio for the Data, Stack and Heap regions (bottom row), plotted against loop iteration (n), for the repetitive search and random search cases with the 2K1W2L configuration.]

Figure 8.2: Basic cache operation

8.3.1 Instruction Cache

At this point, the intention was to verify the functionality of basic data and instruction cache blocks. The specific results are less important as cache parameter effects on search operations are simulated later. Figure 8.2, however, yields some useful early considerations. Extrapolating the points for the instruction cache linearly yields the following relationships:

I_rep(N) = 1.25N + 17.3    (8.1)

I_rnd(N) = 1.21N + 16.3    (8.2)

Both equations 8.1 and 8.2 show that the instruction hit ratio improves linearly with the number of iterations through the loop for both the repetitive and random cases. The two graphs are very similar in nature, within 3.2% of each other, and any difference is insignificant. The linear trend is expected because the same code loop and the same instructions are being run each time. This will only happen when the instruction code is small enough to be loaded once and fit entirely in the cache. In the case of

the simulation kernel, the software runs only through the red-black tree search routine, which is only a sub-section of the total code. The simulation results suggest that the cache is correctly capturing and retaining needed data. Spatial and temporal locality are beneficially exploited.
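For reference, the 3.2% figure presumably corresponds to the relative difference between the fitted slopes:

   (1.25 − 1.21) / 1.25 ≈ 0.032 = 3.2%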

8.3.2 Data Cache

The situation for the data cache is very different. Both the stack and heap were cached in the data cache, and they exhibit different characteristics. The search tree was stored in the heap, while the function call and return overheads were stored in the stack. As can be seen from the individual curves, the stack exhibits a higher hit ratio than the heap. The performance of the stack cache indicates that a data cache would be a valuable addition for software function call and return operations.

However, the heap ratio for the repeat case is 0.9, while that for the random case is only 0.09. The ten-fold difference is due to temporal locality. It is expected for the random case, as different searches traverse down different branches of the tree on each iteration. The repeat case, however, might be expected to perform linearly, like the instruction cache, as it steps through the same data nodes on each iteration. The reason it does not is that the data cache is very much smaller than the data space. The 2K1W2L parameter gives a 2kbyte cache (2K), organised in a direct-mapped (1W) configuration with a 2-word cache line (2L), in 256 blocks. The task of searching a single key involves calling several C++ STL subroutines, which overwrites the heap cache with stack values. This results in the characteristic curve of diminishing returns for the data cache in each case.

It is highly unlikely that multiple search operations are going to search the exact same data every time. Although the same algorithm may be used, the data sets and keys may change. The results are an early indicator that a search data set may not be suitable for caching inside a data cache. The results from the stack performance also show that at least 30 iterations are needed to get a result that is close to the value for a large number of iterations. This helps to determine the minimum number of iterations that need to be performed in later parts of the chapter.
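As a worked example of the 2K1W2L parameters (a sketch assuming 32-bit byte addressing; not thesis code), the address decomposition below reproduces the 256-block figure quoted above:

   #include <cstdint>
   #include <cstdio>

   constexpr unsigned kCacheBytes = 2048;      // "2K": 2 kbyte cache
   constexpr unsigned kWays       = 1;         // "1W": direct mapped
   constexpr unsigned kLineBytes  = 2 * 4;     // "2L": 2 words of 4 bytes per line
   constexpr unsigned kSets       = kCacheBytes / (kWays * kLineBytes);   // = 256 blocks

   int main()
   {
       const std::uint32_t addr   = 0x80001234u;                 // an address in the heap region
       const std::uint32_t offset = addr % kLineBytes;            // byte within the line
       const std::uint32_t index  = (addr / kLineBytes) % kSets;  // which of the 256 blocks
       const std::uint32_t tag    = addr / (kLineBytes * kSets);  // remaining upper bits
       std::printf("sets=%u offset=%u index=%u tag=0x%x\n", kSets, offset, index, tag);
       return 0;
   }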

8.4 Cache Parameters

The next issue is how different cache configurations affect search operation, with particular attention to the heap cache. The cache was checked by running a software loop through 50 iterations that searches for a specific key in a tree. Each combination of cache parameters was then sampled over 50 iterations and the hit-miss ratio recorded. The cache parameters were changed one at a time and the entire simulation repeated. A

complete simulation run took several days to complete. Figures 8.3, 8.4, 8.5 and 8.6 are grouped by memory size, as this translates directly into physical cost, which is an important factor in any physical implementation. Each figure contains six sub-figures, each labelled with a different cache size. Each sub-figure is a heat-map with associativity (N-way) and cache line width (2^N words) as the parameters, while the colour represents the hit-miss ratio. The results provide some interesting trends and guidelines for search-oriented caches.

8.4.1 Instruction Cache

Figure 8.3 shows the results of the instruction hit ratio for different cache parameters. For the instruction hit ratio simulation, the entire search kernel was about 4kbytes in size. It was compiled using the -O2 optimisation level. The instruction cache performed as expected, based on the previous results. All the plots exhibit similar trends, which indicates consistency in the results. Performance improves with increased size, line width and associativity. Interpolating from the raw data provides the following approximation of the instruction hit ratio, where the cache-line width is 2^L words and 2^K is the cache size.

I_hit = 2^(0.04KL + 0.22L − 0.06K + 5.1)    (8.3)

Line Width. The most obvious improvement was seen with each line width increment. This is because instructions are stored and executed in sequence. Increasing the line width will improve the spatial locality performance. Longer lines capture more sequential data at once but also increase the line refill time. However, due to the sequential nature of instruction operation, the increased hit ratio makes up for the increased cache refill times.

Associativity. Increased associativity also increased performance, but not by a significant amount. Graphically, increasing the associativity from direct-mapped to 4-way barely changed the hit ratio. Increased associativity only helps if the kernel size is significantly larger than the cache size; if it is smaller, memory aliasing does not occur, as the entire kernel is retained within the cache block. Equation 8.3 omits the associativity term entirely as it was insignificant.

Cache Size. Figure 8.3 did not show any improvement for larger caches. Larger caches are redundant when the kernel is only 4kbytes in size. For a much larger software kernel,

an observable improvement is expected over increasing cache sizes. Equation 8.3 shows that size variability is a less significant factor in cache performance than line width.

[Figure 8.3: Instruction cache hit ratio — heat-maps of the hit-miss ratio against line width (2^N words) and associativity for cache sizes from 2K to 64K, for the random search case.]

8.4.2 Data Cache Trends (Repeat Key)

Now that the basic instruction cache trends for executing search algorithms are known, the data cache trends for search need to be observed. The investigation focuses on the heap hit ratio only, as the data structures are almost entirely stored in the heap. Figure 8.4 shows the results of the repeat simulation with different cache parameters. All the plots have a similar shape, which indicates consistency in the results.

Line Width. Once again, the most visible improvements were due to an increase in cache line width. There is a visible leap in hit ratio, when moving from a line width

of 8 words to 16 and 32 words, even though each data node is only 6 words in size. In this specific case, the nodes were inserted into the tree in order. For a tree built in order, the child nodes are located spatially close to each other, which results in hit ratio improvements for longer cache lines. For a tree built with randomly inserted data, adjacent nodes are scattered throughout the heap, losing the advantage of spatial locality for longer cache lines. In a real-world scenario, it is unlikely that the tree nodes will be spatially close to each other, due to frequent data insertions and removals. Therefore, in real-world applications, longer cache lines will be beneficial only up to about the size of a single data node.

[Figure 8.4: Repetitive heap cache — heat-maps of the hit-miss ratio against cache line (2^L words) and cache way (2^W) for cache sizes from 2K to 64K.]

Cache Size. Unlike the instruction cache, there was a visible benefit in increasing the size of the overall cache. This is reflected in the colour values, whose maximum increased from 2.8 to 8.0 across the plots. It can be safely concluded that a larger cache size

will improve the hit ratio until the size of the cache approximates the data set size of 400kbytes. However, the improvement is not linear, as doubling the cache size did not double the hit ratio.

Associativity. Associativity improvements seem a little more complex at first. Counter-intuitively, an increase in associativity worsens the hit ratio when the cache line is very short. Unless the data used are stored in multiple memory locations aliased by a single cache block, increased associativity is less effective. As associativity increases, the number of memory locations that map to a single block of cache actually increases. This increases the chance of old data being evicted when it otherwise would not have been.

8.4.3 Data Cache Trends (Random Key)

Now that a baseline data cache performance has been established, a better approximation to real-world operation can be observed. Figure 8.5 shows the simulation results for similar conditions as above, except that a different key is searched each time. All the plots have a similar shape, which indicates consistency in the results, but this gives a very different picture from that of section 8.4.2.

Line Width. Once again, the most visible improvement in hit ratio came from an increase in line width. However, the improvements are non-linear, as a doubling of line width is not accompanied by a corresponding doubling in performance. This adds weight to the earlier assertion that, for real-world applications, increasing the line width would not continue to increase the hit ratio significantly. The reasons are the same as before: the data structure will benefit from spatial locality, but only to the extent of the size of a single data node.

Cache Size. This gave a very different result compared with the previous section. Increasing the cache size does not have any significant effect on the hit ratio, as the maximum values stay within the 1.6–1.8 range, whereas they increased steadily from 2.8 to 8.0 in the repeat case. This is because the traversed data structure exhibits only limited temporal locality. The data held in cache are wasted because they are not needed again, and the cache has to continuously fetch new data from main memory.

Associativity. For the same reasons as above, changes to the associativity did not bring any visible benefit. Increasing the associativity did not change the hit ratio significantly, as evidenced by the lack of change in the heat map from 1-way to 4-way. Associativity only helps when there is a problem with cache thrashing. In this case, there

is little cache contention because there is little temporal locality to begin with, so practically all the data in the cache can be discarded.

[Figure 8.5: Random heap cache — heat-maps of the hit-miss ratio against cache line (2^L words) and cache way (2^W) for cache sizes from 2K to 64K.]

8.5 Data Cache Prefetching

The results obtained show that data caches tend to exhibit a low hit ratio for search operations. It is possible that prefetching techniques could be used to improve the performance. Prefetching techniques fall into two main categories: static and dynamic. More details can be found elsewhere [KY05].

8.5.1 Static Prefetching

Static prefetching involves making software modifications during compile time to initiate prefetching. Hardware changes are needed to implement a non-blocking method of initiating memory access, or a specialised sub-block to prefetch data into cache. This operation is then accessed through a special software prefetch instruction. There are two main approaches: latency tolerance and latency reduction. There are also two categories of data that the techniques work on: arrays and linked data.

Latency tolerance involves prefetching memory ahead of use so that miss latency is hidden rather than removed. This still incurs a heavy cost on memory bandwidth, as all the search data still need to be prefetched. As bandwidth is finite, the performance will still eventually reach a memory bandwidth limit. Latency reduction involves retaining data in the cache to reduce the number of misses. This method is more sustainable as it does not significantly increase the bandwidth cost. However, it may not help for search data.

For arrays, data are usually stored in sequential locations and benefit from spatial locality. Data arrays are also usually statically allocated during compile time, so it is trivial for the compiler to figure out which addresses to prefetch and to safely reorder instructions to insert prefetch instructions. For linked data, the next block to prefetch cannot be predicted during compile time. Furthermore, there are no guarantees that the memory allocation will be sequential or linear. Therefore, such cases are more difficult to handle than arrays. Unfortunately, this is the category of data that will be present for search applications.

Direct Arrays are the easiest data structures to prefetch. Access to these arrays is statically defined, and software operations that act on them can be thoroughly analysed during compile time. Instructions can then be accurately reordered and prefetched. However, this method is only useful for searching and sorting through arrays, and not for more complex search structures.

Indirect Arrays are slightly more difficult to handle than direct arrays. Access to these arrays depends on an index that is only available during run-time. However, the structure of the data is well defined. All that is needed is to precalculate the index value and prefetch it. The rest is similar to direct arrays.

Tiling essentially breaks down and reorders loop operations into smaller chunks. This allows work to be done on smaller chunks of the array at a time, to exploit temporal locality. This would be fairly useful for signal processing applications but not for search.

Natural Pointers involve reordering instructions and inserting a special operation to fetch the next child node before it is needed, as sketched after this list. It suffers from a few weaknesses. It is only useful if there is a significant number of other operations between the prefetch and the point where the node is needed; there is no benefit if the next node is operated on immediately after the prefetch, as in the case of streaming data. Also, it is only able to prefetch nodes directly next to the present node, which limits its effectiveness at prefetching longer paths.

Jump Pointers involve modifying the software data structure to store additional pointers that hint at nodes to prefetch. These can point further down the path, solving the problem of natural pointers. However, this method would require large modifications to existing software and hardware, which limits its utility.

Data Linearisation works by reordering the location of data nodes during data insertion or deletion to improve spatial locality. However, this involves changing the malloc() routine to re-order data structures on-the-fly. It mainly benefits data structures that rarely change; otherwise, the penalty for re-ordering will be significant.
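Returning to the natural-pointer approach, the sketch below illustrates the idea using the GCC/Clang __builtin_prefetch intrinsic (an illustration only; the Node layout and the process() work function are assumptions, not thesis code). As noted above, the prefetch only pays off if the per-node work is long enough to cover the memory latency.

   #include <cstdint>

   struct Node {
       std::uint32_t key;
       const Node*   left;
       const Node*   right;
   };

   // Placeholder for non-trivial per-node work; without it the prefetch has
   // nothing to hide behind.
   inline void process(const Node* n)
   {
       volatile std::uint32_t sink = n->key;
       (void)sink;
   }

   void search_with_prefetch(const Node* root, std::uint32_t key)
   {
       for (const Node* n = root; n != nullptr; ) {
           const Node* next = (key < n->key) ? n->left : n->right;
           if (next != nullptr)
               __builtin_prefetch(next);   // hint only: fetch the child early
           process(n);                     // latency is hidden only if this takes long enough
           if (n->key == key)
               return;
           n = next;
       }
   }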

8.5.2 Dynamic Prefetching

Dynamic prefetching is a hardware technique to detect and initiate prefetching. This can be performed at various levels of the memory hierarchy. The closer the hardware prefetcher is located to the processor, the more information it has access to, in order to calculate the prefetch location. Therefore, it would make sense to include the hardware prefetcher within the processor, where possible.

Stride prefetching works by detecting and prefetching sequential accesses that are a fixed distance apart. This can be implemented with extra hardware, using a stream buffer. This is particularly useful for signal processing applications, which often access data at a fixed distance, such as for filters. But for a scattered heap, it is less useful.

Correlation prefetching records the sequence of memory accesses and uses the information to decide which blocks to prefetch. Most linked data are often stored at random locations in memory. Although seemingly random, a repeating sequence can be detected during run time. For example, multiple searches through a tree will always access the root node, followed by one of its child nodes, and so on. However, it requires some initial time to build up the correlation tables, and a lot of additional hardware.

Content-based prefetching uses the content stored inside a data structure to predict which block to prefetch next. It examines the data structure itself to identify potential pointers stored within it. These can be identified by comparing the most significant bits of the content with those of the current data being fetched, based on the assumption that the nodes are allocated within a similar section of memory (e.g. within the heap area). This can be tricky when the data structure contains data that have similar upper bits. Also, like natural pointers, it can only look one level down the search path.
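A rough software model of the content-based idea is sketched below (illustrative only; the heap base and the number of upper bits compared are assumptions based on the memory map of figure 8.1). Every word of a newly fetched line whose upper bits match the heap region is treated as a candidate pointer to prefetch.

   #include <cstdint>
   #include <vector>

   constexpr std::uint32_t kHeapBase = 0x80000000u;   // heap region from figure 8.1
   constexpr std::uint32_t kHeapMask = 0xFE000000u;   // assumed: compare the top bits only

   std::vector<std::uint32_t> prefetch_candidates(const std::uint32_t* line, unsigned words)
   {
       std::vector<std::uint32_t> out;
       for (unsigned i = 0; i < words; ++i)
           if ((line[i] & kHeapMask) == (kHeapBase & kHeapMask))   // looks like a heap pointer
               out.push_back(line[i]);                             // candidate block to prefetch
       return out;
   }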

8.5.3 Prefetched Data Cache

Most of the static techniques do not reduce the memory transaction cost but merely hide it by performing the fetch while the data are not yet needed. Therefore, a hardware technique was investigated to see how it affects the hit ratio. Content-based dynamic prefetching was chosen as it seemed the most suitable for accelerating search through linked data. Figure 8.6 shows the simulation result of a data cache with dynamic content-based prediction hardware. A comparison of this result with the result in figure 8.5 shows that it is virtually the same. In fact, the raw numbers show a very slight decrease in hit ratio for the prefetched cache.

The reason that prefetching does not work well for search lies in the nature of search. Prefetching works on the principle of structural locality, where the data element that is linked next is prefetched; this should hide any cache misses. Then, due to temporal locality, the data that are prefetched and retained can be used again later. However, search data are needed for only a brief period of time, and temporal locality is virtually non-existent for search operations. Pointer chasing would not benefit from prefetching unless there is a significant time gap between the pointer fetch and the data use, and in the case of a chase or stream operation there is not. Therefore, any performance advantage that could be gained from prefetching is lost because the data are needed soon after.

Hence, prefetching data into the cache is more suitable for applications that require regular data access, such as signal processing. For unpredictable data access, prefetching has little advantage over a plain cache. In fact, it may even carry a small penalty due to the wasted prefetching overhead.


Figure 8.6: Random heap cache (with prefetch)

8.6 Cache Integration

For all the simulations above, the data set size is in the order of 1.6Mbytes of memory, while the different caches used are between 2kbytes (0.1%) and 64kbytes (4.0%) of the data set size. These sizes were chosen to reflect real-world L1 cache sizes, such as those on AMD processors1. But this extremely small cache ratio means that hardly any part of the data set can be stored in cache. Therefore, the performance does not change significantly even with different cache sizes. The instruction cache performs better because its size is a far larger fraction of the instruction memory. However, increasing the cache size to improve performance is extremely expensive, as static RAM is a very expensive resource on a chip. A block of single-port 64kbit memory

1http://web.archive.org/web/20080123090140/http://www.sandpile.org/impl/k8.htm

takes up 3.44mm² (€1,995) in 0.35µm and 0.21mm² (€230) in 0.13µm, according to recent prices from Europractice2. Therefore, it is important to determine a good cache size that uses minimal resources while still being able to provide some reduction in memory bandwidth. Earlier results show that associativity does not help much, while a longer cache line helps more. Therefore, a direct-mapped cache design can be used for the investigation. Of the different accelerators, the chaser would benefit from a cache if used for multi-key searches on a single data structure, which exhibit temporal locality. A data structure was constructed to fit approximately entirely into the largest cache size. A multi-key search was then performed to search a portion of the data set.

[Left: cache performance with size (multi-key) — memory transactions and speed-up ratio against cache size ratio (%). Right: cache performance with structure (multi-key) — memory transactions for the normal (CacheN) and structural (CacheS) caches and their N/S ratio against cache size ratio (%).]

Figure 8.7: Cache structure comparison

8.6.1 Cache Size Ratio

The cache was inserted between the chaser unit and memory, and the multi-key simulation performed for different cache size ratios. Each simulation was repeated 50 times and the values plotted in a graph. The left sub-figure of Figure 8.7 shows the results of the simulation against the size ratio, from 1% to 100%. The speed-up ratio was measured against the number of memory transactions of a cacheless solution. Looking at the cache performance with size, it is evident that as the cache size increased, there was also an increase in performance. However, increasing the cache size from about 1.5% to 100% only speeds up the search operation by about 50%. From the graph, the optimal cache size would be about 10% of the data structure; beyond this point, the improvement in performance diminishes relative to the increase in cost.

2http://web.archive.org/web/20080120104958/http://www.europractice-ic.com/docs/MPW2008-general-v3.htm

8.6.2 Structural Locality

In order to exploit structural locality, a structural cache was designed. Figure 8.8 shows the basic architecture of a structural cache. The basic concept behind it is a cache that is segmented based on the levels of the tree. This can be likened to a form of enforced pseudo-associativity, with the set chosen based on the level of the tree the data is in. In this design, the LEVEL of the tree is produced by the chaser unit as it branches down the tree and is not obtained from the memory address.

[Cache lookup fields TAG, LEVEL, INDEX, LINE and WORD index a tag RAM and a data RAM; a tag comparison produces the HIT signal.]

Figure 8.8: Structural cache architecture

Data nodes can be scattered randomly across the entire memory space. Therefore, it is very possible that the nodes close to the root of a tree are aliased by the branches of the tree. By associating only certain blocks of the cache with certain levels of the tree, the chance of aliasing is reduced. This should increase the probability of keeping the nodes close to the root of the tree, in cache. A structural cache was designed and simulated using the same parameters as before. As visible in the right sub-figure of Figure 8.7, the structural cache shifts the threshold ratio down, allowing smaller caches to approximate the performance of a larger cache. The use of this structural cache gives an improvement in performance of about 3% at small cache sizes, for no extra cost. However, its significance reduces greatly for larger caches, when almost all the data structure is stored in cache. Therefore, if there are extra resources, a small structural cache can be integrated with a chaser unit. This can give a slight performance boost by reducing the amount of memory transactions needed to perform a multi-key search on a single data structure.
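A minimal sketch of how the structural index might be formed is given below (the field widths are assumptions for illustration, not taken from the design). The LEVEL supplied by the chaser selects a segment of the cache and the low address bits select a block within that segment, so nodes near the root are protected from eviction by nodes deep in the tree.

   #include <cstdint>

   constexpr unsigned kLineBytes = 8;    // 2-word lines, as before
   constexpr unsigned kLevelBits = 3;    // assumed: 8 level segments
   constexpr unsigned kIndexBits = 5;    // assumed: 32 blocks per segment (256 blocks total)

   // Combine the chaser-supplied tree level with the address to form the
   // cache index: LEVEL selects the segment, low address bits the block.
   unsigned structural_index(std::uint32_t addr, unsigned level)
   {
       const unsigned segment = level & ((1u << kLevelBits) - 1);            // deep levels wrap around
       const unsigned block   = (addr / kLineBytes) & ((1u << kIndexBits) - 1);
       return (segment << kIndexBits) | block;
   }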

8.7 Conclusion

It is clear that instruction and data caches exhibit different characteristics and need to be designed and configured separately. The instruction cache for search operations would benefit from having a larger line width, associativity and size. Data caches would only benefit from having a larger line width. However, linked data structures used in search do not benefit significantly from data caching. Data locality does not take into account the actual structure of the data as

each data node is usually scattered throughout the memory space. Furthermore, caches designed to exploit temporal locality will fail when the data or key is different for each search. Spatial locality will help under certain circumstances: neighbouring nodes that need to be traversed should be stored in spatially close locations. This converts structural locality into spatial locality and benefits from increased line widths in a cache structure. However, none of these solutions should be handled in hardware; they are all better handled in software. Unfortunately, most software requires that memory be randomly accessible, which makes it difficult to enforce the necessary changes.

As the results show, for search operations the data cache size does not matter much. So, if there is a need to include any data cache for search operations, a simple and small direct-mapped cache with a larger line width is all that is needed. This cache can benefit the chaser unit slightly, but it will have minimal impact on the streamer unit. As a result, it may actually be better to allocate the limited chip resources to something other than cache.

CHAPTER 9

Search Pipelines

The hardware layer right above the accelerator units is the search pipeline layer. The different hardware units can be combined in different ways to accomplish the task of processing different types of search queries.

9.1 Pipelines

Now that the different accelerator units have been introduced and discussed, it is essential to visualise how the units work in combination to accomplish search. The different categories of search problems were described in Section 2.2. The most primitive form of search is primary search, as described in Section 2.2.1, which forms the most common search operation. Secondary searches were described in section 2.2.2 and cover other kinds of search operations. The different search problems can be accelerated using the different combinations of accelerator units described below.

Search → Retrieval → Collation

Figure 9.1: Search pipeline abstraction

9.1.1 Primary Search

A primary search is a search for primary keys, which is expected to return a single unique result from the whole data set. This means searching for an equivalence key only, as any other search criterion may return more than a single unique result. Such an application can typically be accelerated using a single chaser unit connected to memory.

Whether or not an optional cache is used would depend entirely on the nature of the search and the data set. A primary search would typically only involve the first stage of the pipeline: search. This is the most important stage, as later stages and secondary searches are dependent upon it. An analogy can be drawn with an instruction fetch stage, where the throughput of operation is dependent on the issue rate of instructions. In this case, the search pipeline throughput is dependent on the issue rate of primary key searches.

From equation 7.2, the size of the data set does not adversely affect the completion time for a chase operation. The completion time will only double if the data sets approach a size of 2^43 nodes. Since this is unlikely to be reached, the chase operation can, for most estimations, be considered to complete in O(1) time with a dominant configuration overhead. Assuming that the data set for indices is large, the key issue rate can be estimated to be C′_hw,m(N) = 83.2 ticks per key, or 0.012 keys per tick. From the assumption above, the amount of time for a single search is 390.2/83.2 ≥ 4.68 times this rate. Although it will be possible to reduce the overhead costs associated with key searching, it will not improve the issue rate, which is quickly dominated by the actual key search rather than the overhead. Therefore, in order to amortise the configuration overhead cost, a hardware chaser should be used for searching at least 5 keys on a single data structure before switching over to a different data structure.

For a single primary search, a single chaser can provide a speed-up of C′_sw,s(N)/C′_hw,s(N) ≥ 2.53 for a single key search on sufficiently large data sets. Although this is not much of an acceleration, it is fundamentally all that can be done for this type of search unless there is a fundamental change in the types of memory, data structures and algorithms used. For applications where a data set needs to be regularly searched for keys, permanently assigning a chaser to each data set would be beneficial. This would result in a multi-key acceleration of C′_sw,m(N)/C′_hw,m(N) ≥ 3.43.

These values are obtained without the use of any cache. Figure 8.7 shows the speed-up from using a cache. A small (10%) structural cache memory can increase the performance to 3.43 × 1.3 = 4.46 times, which is more significant. The key issue rate can also be slightly reduced, to 83.2/1.3 = 64.0 ticks per key.

As mentioned in section 2.2.1, the only way to increase this performance further is to assign multiple chasers to perform a number of independent searches on different data sets in parallel, preferably on parallel memory channels. In such a configuration, memory contention becomes a more significant issue, which is why it is important to have multiple memory channels to service the memory requirements of the multiple chasers.
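In software terms, the amortisation rule above amounts to a simple batching policy. The sketch below is illustrative only: hsx::chase() is a hypothetical driver wrapper standing in for the hardware path, and here simply falls back to the software search.

   #include <cstddef>
   #include <set>
   #include <vector>

   // Software tree search, the baseline path.
   inline long soft_find(const std::set<long>& tree, long key)
   {
       auto it = tree.find(key);
       return it != tree.end() ? *it : -1;
   }

   namespace hsx {
       // Hypothetical driver wrapper for a hardware chase; stubbed in software here.
       inline long chase(const std::set<long>& tree, long key) { return soft_find(tree, key); }
   }

   constexpr std::size_t kAmortiseThreshold = 5;   // from 390.2 / 83.2 = 4.68, rounded up

   std::vector<long> lookup(const std::set<long>& tree, const std::vector<long>& keys)
   {
       std::vector<long> out;
       const bool use_chaser = keys.size() >= kAmortiseThreshold;   // worth paying the setup cost?
       for (long k : keys)
           out.push_back(use_chaser ? hsx::chase(tree, k) : soft_find(tree, k));
       return out;
   }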

9.1.2 Simple Query

A simple query is the most primitive secondary search operation. It differs from the primary key search in that it is expected to return one or more results from the whole data set for a single key search. This form of query is accelerated using the most primitive pipeline, with only two stages: search and retrieval. This is implemented using a streamer unit in cascade with a chaser unit. The first part of the search is a primary key search as described above. Once a key is found, it can be mapped to a list of records that match the key. The streamer can then be used to pull the records into the accelerator. As a single stream retrieval operation does not provide any additional acceleration on top of the key search, the basic pipeline only accelerates as much as a single key search on a chaser, as described before.

As before, the nature of the problem means that performance is increased by parallelising multiple searches. The most obvious way to do this is to replicate the chaser–streamer pipeline multiple times and to run independent queries on each. This results in a linear increase in hardware cost with the number of parallel search pipelines. However, there is an alternative way of building a pipeline for simple queries. Multiple streamers can be paired with a single chaser if the chase time is assumed to be much lower than the time taken to retrieve the results stream. Knowing that the key issue rate is about 83.2 ticks per key, the size of the data set that can be streamed during the interim can be estimated. Using expression 5.2, this value is estimated to be C′_hw,m(N)/M′_hw(N) ≥ 3.68.

So, for the first four results returned, the chaser will be kept busy with a second search. But if the streamer returns more than four results, the chaser will be kept blocking until the streamer is free to service it. A simple query is likely to return more than four results per key search. Therefore, it is possible to match multiple streamer units with a single chaser instead of pairing them one to one. For sufficiently large data structures, the optimal number of streamers to service a single chaser can be estimated using:

M′_hw(N1) / C′_hw,s(N2) = 2.5 × N1 / lg N2    (9.1)

It is safe to assume that for most common cases, N2 ≫ N1 ≫ 1 but it is difficult to predict the exact or relative sizes of each data set. Therefore it will be impossible to estimate exactly how many streamers should be paired. However, expression 9.1 can provide an indication of the lower bound.

As mentioned before, a chaser would only be worth using if N2 ≥ 256, which gives a value of lg N2 ≥ 8 for the denominator. When N2 grows, the denominator grows logarithmically. In addition, it takes at least the time of N1 ≥ 4 retrievals to keep the chaser busy, and N1 ≥ 8 is easily reached in most applications; when N1 grows, the numerator grows linearly. Therefore, it is likely that N1 ≫ lg N2, so the assumption N1/lg N2 ≈ 1 is made as a minimum for practical applications.

This is a useful relationship to have, as it can guide the decision of how many streamers and chasers to place in a chip. Although no exact value is obtainable without making prior assumptions about the data, a lower bound of 2.5 streamers per chaser can be inferred from the relationship.
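As a sizing sketch (illustrative only), expression 9.1 can be turned into a simple rule of thumb, with the N1/lg N2 ≈ 1 minimum applied as a floor:

   #include <algorithm>
   #include <cmath>

   // Estimate how many streamers one chaser can keep busy, given the expected
   // number of results per key (n1) and the index size (n2).
   double streamers_per_chaser(double n1, double n2)
   {
       const double estimate = 2.5 * n1 / std::log2(n2);   // expression 9.1
       return std::max(estimate, 2.5);                     // lower bound of 2.5 from the text
   }
   // e.g. streamers_per_chaser(8.0, 256.0) == 2.5 * 8 / 8 == 2.5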

9.1.3 Range Query

A range query is similar to a simple query. There are a number of configurations that can be used to accelerate this form of search, depending on how the range is bound: the range can be bound at both ends or only at one end. For a range that is bound at both ends, there are two methods that can be used, depending on the size of the range.

A pure hardware method can be used to turn a range query into a multi-key simple query for all the values within the range. Each result can then be sent off to a streamer to be retrieved from memory. One or more sieve units can be used in union mode to combine the results into a final results list. A hybrid method can also be used, where a chaser unit is used to chase down either the lower or upper bound node. Once this value is retrieved, a software algorithm can be used to traverse the tree, retrieving the other nodes within the range. These nodes can then be fed to the streamer in a conventional way and the results combined in hardware using a sieve unit.

Which method is chosen would depend entirely on the size of the data structure and the range of items to chase. Assuming that the data structure is significantly large and the range (R) fairly large, at about 10% of the data structure size, equations 7.1 and 7.4 can be used to compare the options. Equation 7.1 needs to be modified slightly, with lg N ⇒ R, to estimate the amount of time needed to traverse a tree, by observing that lg N actually symbolises the number of nodes visited when going down a tree.

C_sw,s(R) / C_hw,m(R) = (22.9R + 280.6) / (83.2R + 318.2) = 0.275, R ≫ 1    (9.2)

Equation 9.2 shows that regardless of the size of the range, the hybrid software method will be faster than a multi-key hardware method. Although this can be improved by shifting the root of the tree once one bound is found, the multi-key hardware method would still be slower than the software traversal. Therefore, the only advantage that the hardware method has over the hybrid method is offloading and not speed-up, unless a method can be found to allow hardware traversal of trees. A slight variation of this hybrid method can also be used for a single-bound range,

which traverses all the nodes on one branch of the tree. This exploits the fact that, for any node in a tree, all the nodes on the right branch are greater while the nodes on the left branch are smaller. So, a chaser unit can be used to chase down the lower or upper bound; when this is found, all the nodes to one side of the branch can be retrieved.

A hardware method of tree traversal is to use a streamer configured such that its DATA and NEXT registers are set to only one branch. This forces the streamer to perform a depth-first traversal, traversing one branch and returning a list of pointers down that branch. To estimate the effectiveness of this solution, equation 5.2 can be used.

C_sw,s(R) / M_hw(R) = (22.9R + 280.6) / (22.6R + 241.7) = 1.013, R ≫ 1    (9.3)

Equation 9.3 tells us that there will not be any significant difference between a hardware stackless tree traversal and a software method. If the software is modified slightly to build stackless trees, this is all that is needed to retrieve an entire branch of a tree. Otherwise, a hybrid method can be used, where the software is used to configure and set off multiple streamers down different branches, or to extract the nodes entirely in software, which will do no worse.
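The software half of the hybrid range query can be sketched with the same STL container used in the simulation kernels (an illustration, not thesis code): once the lower bound is located, the tree is walked in order up to the upper bound and each hit is handed on for retrieval.

   #include <set>
   #include <vector>

   // Collect all keys in [lo, hi]; lower_bound() stands in for the chase to the
   // lower-bound node, and the in-order walk is the software traversal.
   std::vector<int> range_query(const std::set<int>& tree, int lo, int hi)
   {
       std::vector<int> hits;
       for (auto it = tree.lower_bound(lo); it != tree.end() && *it <= hi; ++it)
           hits.push_back(*it);   // each hit would be fed to a streamer for retrieval
       return hits;
   }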

9.1.4 Boolean Query

A boolean query may involve all the different forms of queries above. It involves an additional layer of result collation after any number of simple and range queries. Two or more result streams can be collated with one or more sieve units. In the most primitive applications, a sieve unit can be connected in cascade with two streamer units. As measured earlier, this configuration alone will give a 5.2 times speed-up. At this point, it may seem logical that combining this with chasers would provide additional acceleration. Unfortunately, this is not true, because the bottleneck in such a pipeline is the streamer.

However, there is a situation where a sieve unit can provide additional acceleration. For more complex collation operations, multiple sieves can be combined, in parallel and in cascade, in a logical manner. Depending on the complexity of the operation, the speed-ups obtained can be significantly more than 5.2 times. It is also potentially possible to design a slightly more advanced sieve, which collates more than two streams at a go. The only reason this was not done was to keep to primitive structures for simplicity. While such a sieve would consume similar resources to multiple cascaded sieves, it would reduce the number of stages in the search pipeline and increase throughput for complex collations.
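The collation step itself is easy to model in software (an illustration of what a sieve does, not the hardware design): two sorted result streams are merged in either intersect (AND) or union (OR) mode.

   #include <algorithm>
   #include <iterator>
   #include <vector>

   // Combine two sorted result streams, as a sieve unit would in hardware.
   std::vector<int> collate(const std::vector<int>& a, const std::vector<int>& b, bool intersect)
   {
       std::vector<int> out;
       if (intersect)
           std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
       else
           std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
       return out;
   }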

9.2 System Pipelining

With regard to throughput, the accelerator units can be combined in different ways, which are described in the next chapter. Certain combinations of accelerator units can then be constructed to address the different pipeline types expressed in this chapter. In a situation where the pipelines are all single-staged without overlapping, the amount of acceleration would be similar to the values quoted in earlier sections. If multiple pipelines are used in parallel, throughput can be increased until the bandwidth limit is reached.

However, just like any other pipeline, the different stages could also be overlapped to increase throughput. In such a situation, each accelerator unit has an output buffer treated as a pipeline buffer. The host processor will then be in charge of synchronisation and of reorganising the sequencing of data where necessary. While this has not been explored specifically, a potentially larger total acceleration could be achieved with a modest number of hardware units.

CHAPTER 10

Implementation

The accelerator units can be combined either in a dynamic or a static fashion, and they can be used as a bridge, a co-processor or even an I/O device. The accelerators have primarily been designed for FPGA implementation; however, some potential ASIC implementations (0.35µm and 0.18µm) are also explored.

10.1 Fabric Architectures

Although it is important to figure out the number of accelerator units that can be implemented, there is also the question of how the units will be interconnected. Figure 10.1 shows some possible interconnection architectures for the accelerator itself. These figures assume that there is a host interface and memory interface to each side. The main difference in these two implementations is how the pipeline is structured, either statically or dynamically. The choice of which to use would depend on the type of search operations that need to be accelerated and the physical resources available.

10.1.1 Dynamic Fabric

In the dynamic fabric, the search pipeline is dynamically structured in hardware. Depending on the type of search being accelerated, the necessary accelerator units can be allocated and linked together. The data path through this dynamic pipeline is mainly controlled by configurable routers in hardware. These routers can be simple switch fabrics that connect the outputs from one stage to the inputs of another and can be adapted from existing switch architectures and also network-on-chip architectures.


(A) Dynamic Fabric (B) Static Fabric

Figure 10.1: Implementation architectures

There is also the possibility of using the sieve unit as a form of router. Each sieve unit can be configured to perform a swap as one of its operation modes. If alternated in cascade, it is possible to build a swapping network to route signals around. It would also be trivial to couple the swap operation with the intersect or union modes to perform routing and result collation at the same time. In this case, the cost of routing can be amortised as part of the search pipeline operation.

The advantage of such a configuration is the flexibility of the pipeline and the types of problems it can solve. However, this flexibility comes at the expense of using multiple hardware router units. This increases both the hardware implementation cost and the configuration overhead, as the pipelines now need to be configured on top of the accelerator units. Therefore, this would only be useful for performing well-defined search operations on extremely large databases, such as searching a DNA database.

10.1.2 Static Fabric

For other common operations, such flexibility is expensive and may be excessive. Most common searches would involve one of the common forms, such as the example query. Therefore, it is also possible to accelerate most common search problems without needing any dynamic configurability. The pipelines can be statically configured and any additional complexity absorbed in the software. In this case, the types of pipelines laid out may need to be considered.

There is no reason why a static fabric should be composed of only one type of pipeline accelerating one type of query. It would be better to mix different types of pipelines in the accelerator and allow the software to select the pipeline to be used for acceleration. The configuration shown in the figure can be used to accelerate a fairly complex boolean query with four criteria, or four individual simple queries, or two simple queries and one boolean query. This is by no means the only pipeline configuration possible.

Although hardware routers are not used, software routing could be used as a complementary

method. Software routing could be used to link multiple serial queries, to link parallel parts of a single query, or to link complex queries that would not otherwise be accelerated. In the case of figure 10.1, the last sieve stage may be removed and replaced with a pure software sieve, or a software-pumped sieve. Software can also play the role of a router and move results between the pipelines and between stages. The only disadvantage of software routing is the slow-down factor, which is not a problem if the results are produced at a slower rate by an earlier pipeline stage, as is the case for certain accelerator units like the sieve or chaser.

10.2 Integration Architectures

There is the question of how the entire accelerator will fit into a standard computer architecture. Figure 10.2 shows some possible integration architectures. The only requirement for the accelerator is to have access to the main memory pool and the host processor. Therefore, it is possible to integrate the accelerator in any number of ways. It can be tightly integrated as a bridge or co-processor device, or loosely integrated as an external I/O device. Each method has its advantages and disadvantages. In either case, the accelerator can be an on-chip or off-chip device. An on-chip accelerator, like an FPU or PadLock device, will be very tightly integrated with the host processor and can share its main I/O ports. An off-chip accelerator will be easy to integrate and can be dropped into existing systems with only slight modifications, but will need to replicate many resources of the host processor.

[(A) Bridge: the accelerator (HSX) sits between the CPU and memory. (B) Co-processor: the HSX sits alongside the CPU, sharing the memory bus. (C) I/O device: the HSX hangs off the CPU–memory system as a peripheral.]

Figure 10.2: System level implementation

10.2.1 Tight Coupling

In a bridge mode, the accelerator is placed between the host processor and main memory. To the host processor, the accelerator should behave like a memory device while to the memory, it behaves like a host processor. When used in this mode, the accelerator is closer to the main memory pool than the host processor, which allows the accelerator to intercept the host’s access to memory. This allows the accelerator to regulate the

memory bandwidth used by the processor and block the processor when necessary. The advantage of this arrangement is that all available memory bandwidth can be consumed by the accelerator for search operations. The disadvantage is that the host processor's access to main memory will suffer from higher latency.

In a co-processor mode, the accelerator has the same priority as the host processor and is placed next to the main memory. When used in this mode, the accelerator is equally far from the main memory pool as the host. The accelerator communicates with the host processor via a dedicated accelerator bus. However, as it does not have direct control over the memory bus, it will be forced to share it with the main processor. This arrangement reduces the latency for the host processor compared with the bridge mode. Although it may still allow the accelerator to consume a large amount of memory bandwidth as necessary, the disadvantage is that the accelerator is not in control of the amount of memory bandwidth it uses.

In both these closely coupled configurations, the accelerator is platform specific. In order to sit comfortably with the host processor, it needs to conform to the appropriate bus standards used to communicate with the host processor, which typically differ between microprocessor vendors and are often proprietary. So, it further trades off platform flexibility for faster access to the main memory pool.

10.2.2 Loose Coupling

In an I/O device mode, the accelerator sits alongside other I/O devices on an I/O bus and behaves no differently from any other I/O device. Each individual accelerator unit can be mapped to an external memory or I/O space and accessed via the memory bus or a dedicated I/O bus. In this mode, the accelerator will need to include a host interface block, which could be chosen from any number of standard protocols, such as PCIe. The advantage of loose coupling is its simplicity and, unlike the tightly coupled devices, it can be totally platform agnostic and used on any platform.

However, this comes at the cost of reduced memory bandwidth. Any available bandwidth would need to be shared between the accelerator and other I/O devices. Access to main memory will also be affected, as it will have to go through the main bus. Therefore, this will have the slowest performance of the different coupling mechanisms.
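For the loosely coupled case, software would drive each unit through memory-mapped registers, much like the hsx:: pointers in the earlier kernels. The sketch below is illustrative only: the register addresses and layout are assumptions, not the thesis memory map.

   #include <cstdint>

   namespace accel {
       // Hypothetical register block of one chaser unit, mapped into I/O space.
       volatile std::uint32_t* const ROOT   = reinterpret_cast<volatile std::uint32_t*>(0xA0000000u);
       volatile std::uint32_t* const KEY    = reinterpret_cast<volatile std::uint32_t*>(0xA0000004u);
       volatile std::uint32_t* const GO     = reinterpret_cast<volatile std::uint32_t*>(0xA0000008u);
       volatile std::uint32_t* const STATUS = reinterpret_cast<volatile std::uint32_t*>(0xA000000Cu);
       volatile std::uint32_t* const RESULT = reinterpret_cast<volatile std::uint32_t*>(0xA0000010u);
   }

   // Configure the unit, start a chase and poll for completion.
   std::uint32_t chase_key(std::uint32_t root, std::uint32_t key)
   {
       *accel::ROOT = root;
       *accel::KEY  = key;
       *accel::GO   = 1;
       while ((*accel::STATUS & 1u) == 0) { /* poll until the done bit is set */ }
       return *accel::RESULT;   // pointer to the matching node, or 0 if not found
   }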

10.3 FPGA Implementation

Owing to developments from mainstream computer and FPGA vendors, the accelerator units were designed with a potential FPGA implementation in mind. The AMD

Torrenza1 and Intel QuickAssist2 programmes are two platform initiatives designed to open up standard computer systems. They both provide varying levels of support for the integration of specialised co-processors into their previously closed systems. Major FPGA vendors have embraced this development by developing custom products for it. Both Xilinx and Altera have specialised products that can plug into a co-processor socket on a suitable motherboard. These products directly connect the FPGA to the host CPU and give it access to the socket's DDR memory slots. They are designed to work alongside the host processor.

For the purpose of a sample implementation, a low-cost Xilinx Spartan3A FPGA is used as a test platform. All the implementation results quoted below are based on the Spartan3A FPGA device, which is built on 90nm technology. There are other FPGA families that can give better results, in terms of area, power and speed, than the Spartan3A; the Spartan3A was therefore chosen as a baseline representation of the worst-case performance available today.

10.3.1 Chaser Implementation

Report 10.1 shows relevant performance figures extracted from the implementation reports for a chaser. It shows that the chaser unit is capable of running at 100MHz on an FPGA, with the resource and power consumption for a single chaser unit scored as:

$C_{lut} = 515$

$C_{pow} = 9$

10.3.2 Streamer Implementation

Report 10.2 shows relevant performance figures extracted from the implementation reports for a streamer. The report shows that the streamer unit is capable of running at 100MHz on a Spartan3A. Furthermore, it shows that the resource and power consumption for a single streamer unit running at 100MHz can be scored as:

$M_{lut} = 347$

$M_{pow} = 4$

1http://web.archive.org/web/20071215011032/http://enterprise.amd.com/us-en/AMD-Business/Technology-Home/Torrenza.aspx
2http://www.intel.com/technology/platforms/quickassist

Report 10.1 Chaser FPGA implementation results (excerpt)

Design Summary
--------------
Number of errors:       0
Number of warnings:    63
Logic Utilization:
  Number of Slice Flip Flops:           359 out of  11,776    3%
  Number of 4 input LUTs:               514 out of  11,776    4%
Logic Distribution:
  Number of occupied Slices:            387 out of   5,888    6%
  Number of Slices containing only related logic:   387 out of   387  100%
  Number of Slices containing unrelated logic:        0 out of   387    0%
    *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:         515 out of  11,776    4%
    Number used as logic:               388
    Number used as a route-thru:          1
    Number used for Dual Port RAMs:     126
      (Two LUTs used per Dual Port RAM)
  Number of bonded IOBs:                210 out of     372   56%
    IOB Flip Flops:                      67
  Number of BUFGMUXs:                     1 out of      24    4%

Power summary                     | I(mA) | P(mW) |
---------------------------------------------------
Total estimated power consumption |       |    85 |
---
Total Vccint  1.20V               |    40 |    49 |
Total Vccaux  2.50V               |    14 |    35 |
Total Vcco25  2.50V               |     1 |     1 |
---
Clocks                            |     8 |     9 |
Inputs                            |     0 |     0 |
Logic                             |     7 |     9 |
Outputs                           |       |       |
  Vcco25                          |     0 |     1 |
Signals                           |     0 |     0 |
---
Quiescent Vccint 1.20V            |    26 |    31 |
Quiescent Vccaux 2.50V            |    14 |    35 |
Quiescent Vcco25 2.50V            |     0 |     1 |

Timing summary:
---------------
Timing errors: 0  Score: 0
Constraints cover 26762 paths, 0 nets, and 2247 connections
Design statistics:
  Minimum period: 9.607ns{1} (Maximum frequency: 104.091MHz)

Report 10.2 Streamer FPGA implementation results (excerpt)

Design Summary
--------------
Number of errors:       0
Number of warnings:     2
Logic Utilization:
  Number of Slice Flip Flops:           225 out of  11,776    1%
  Number of 4 input LUTs:               347 out of  11,776    2%
Logic Distribution:
  Number of occupied Slices:            255 out of   5,888    4%
  Number of Slices containing only related logic:   255 out of   255  100%
  Number of Slices containing unrelated logic:        0 out of   255    0%
    *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:         347 out of  11,776    2%
    Number used as logic:               283
    Number used for Dual Port RAMs:      64
      (Two LUTs used per Dual Port RAM)
  Number of bonded IOBs:                210 out of     372   56%
    IOB Flip Flops:                      97
  Number of BUFGMUXs:                     1 out of      24    4%

Power summary                     | I(mA) | P(mW) |
---------------------------------------------------
Total estimated power consumption |       |    71 |
---
Total Vccint  1.20V               |    29 |    35 |
Total Vccaux  2.50V               |    14 |    35 |
Total Vcco25  2.50V               |     0 |     1 |
---
Clocks                            |     0 |     0 |
Inputs                            |     0 |     0 |
Logic                             |     4 |     4 |
Outputs                           |       |       |
  Vcco25                          |     0 |     0 |
Signals                           |     0 |     0 |
---
Quiescent Vccint 1.20V            |    26 |    31 |
Quiescent Vccaux 2.50V            |    14 |    35 |
Quiescent Vcco25 2.50V            |     0 |     1 |

Timing summary:
---------------
Timing errors: 0  Score: 0
Constraints cover 5024 paths, 0 nets, and 1769 connections
Design statistics:
  Minimum period: 7.427ns{1} (Maximum frequency: 134.644MHz)

10.3.3 Sieve Implementation

Report 10.3 shows the relevant numbers from the implementation report. It shows that the sieve unit is capable of running at 100MHz on an FPGA and shows the resource and power consumption for a single sieve unit running at that speed. They can each be scored as:

$V_{lut} = 634$

$V_{pow} = 17$

Report 10.3 Sieve FPGA implementation results (excerpt)

Design Summary
--------------
Number of errors:       0
Number of warnings:     2
Logic Utilization:
  Number of Slice Flip Flops:           118 out of  11,776    1%
  Number of 4 input LUTs:               629 out of  11,776    5%
Logic Distribution:
  Number of occupied Slices:            332 out of   5,888    5%
  Number of Slices containing only related logic:   332 out of   332  100%
  Number of Slices containing unrelated logic:        0 out of   332    0%
    *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:         634 out of  11,776    5%
    Number used as logic:               373
    Number used as a route-thru:          5
    Number used for Dual Port RAMs:     256
      (Two LUTs used per Dual Port RAM)
  Number of bonded IOBs:                211 out of     372   56%
    IOB Flip Flops:                      37
  Number of BUFGMUXs:                     1 out of      24    4%

Power summary                     | I(mA) | P(mW) |
---------------------------------------------------
Total estimated power consumption |       |    95 |
---
Total Vccint  1.20V               |    45 |    54 |
Total Vccaux  2.50V               |    14 |    35 |
Total Vcco25  2.50V               |     2 |     6 |
---
Clocks                            |     5 |     5 |
Inputs                            |     0 |     0 |
Logic                             |    15 |    17 |
Outputs                           |       |       |
  Vcco25                          |     2 |     5 |
Signals                           |     0 |     0 |
---
Quiescent Vccint 1.20V            |    26 |    31 |
Quiescent Vccaux 2.50V            |    14 |    35 |
Quiescent Vcco25 2.50V            |     0 |     1 |

Timing summary:
---------------
Timing errors: 0  Score: 0
Constraints cover 13855 paths, 0 nets, and 2648 connections
Design statistics:
  Minimum period: 9.365ns{1} (Maximum frequency: 106.781MHz)

The retail price3 of a Spartan3A FPGA ranges from $5.75 for 1408 LUTs ($0.004 per LUT) to $35.60 for 22,528 LUTs ($0.0016 per LUT). For the price estimates, a mean value of $0.0028 (£0.0017±0.0003) per LUT is used.

3prices from NuHorizons online store

10.3.4 Resource & Power

From the reports above, an expression for the resource consumption and power dissipation of the hardware accelerator can be developed.

$Q_{lut} = 515 N_C + 357 N_M + 634 N_V + 126$ (10.1)

$Q_{pow} = 9 N_C + 4 N_M + 17 N_V + 75$ (10.2)

Expression 10.1 calculates the total resource consumption of the different units. The constant factor of 126 is used by the accelerator bus arbitration device, which becomes insignificant compared to the rest once the number of accelerator units starts to increase. Expression 10.2 estimates the power consumption of the different units, in mW. Furthermore, these figures only apply to the design as implemented on a Spartan3A FPGA. If a different FPGA family is used, the values will be different. Despite this, the expressions have some value as they give an idea of the relative resource and power consumption of each accelerator unit. They are useful when estimating the number of accelerator units that can be placed in an FPGA under tight resource constraints.
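As an illustration only, expressions 10.1 and 10.2 can be evaluated with a few lines of C++ to size a candidate mix of units. The struct and function names below are invented for this sketch and are not part of the prototype code; for the Configuration A mix of section 11.4.1 it reproduces the 4,882 LUT figure quoted there.

#include <cstdio>

// Resource (LUT) and power (mW) estimates for a mix of accelerator units on
// the Spartan3A, following expressions 10.1 and 10.2.  The per-unit
// coefficients and the fixed overheads (126 LUTs for the bus arbiter, 75mW
// quiescent) are taken from the implementation reports above.
struct AcceleratorMix {
    int chasers;    // N_C
    int streamers;  // N_M
    int sieves;     // N_V
};

int estimateLuts(const AcceleratorMix& m) {
    return 515 * m.chasers + 357 * m.streamers + 634 * m.sieves + 126;
}

int estimatePowerMilliwatts(const AcceleratorMix& m) {
    return 9 * m.chasers + 4 * m.streamers + 17 * m.sieves + 75;
}

int main() {
    AcceleratorMix mix = {4, 4, 2};  // the Configuration A mix of section 11.4.1
    std::printf("Q_lut = %d LUTs, Q_pow = %d mW\n",
                estimateLuts(mix), estimatePowerMilliwatts(mix));
    return 0;
}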

10.3.5 Physical Limits

From the chip resource requirements, the absolute maximum limit for each type of unit can be calculated. The largest Spartan3 FPGA available has 22,528 LUT units[Xil08], which works out to 51 chaser, 63 streamer and 35 sieve units. These numbers indicate that there are more than enough resources on a low cost FPGA to hold the accelerator units. With such high numbers of accelerators, the main limitation in the FPGA implementation is memory bandwidth. The sieve unit does not consume any external memory bandwidth, so it is purely resource-limited, but the streamer and chaser units both consume significant amounts of memory bandwidth. Assuming the maximum unit numbers above, the amount of memory bandwidth required would be 65.3Gbps and 134.2Gbps for the chasers and streamers running at 100MHz. Assuming that the system is connected to the fastest standard DDR2-1066 memory[JED07] available, the maximum memory bandwidth available to the system is only 68.2Gbps. Therefore, this memory bandwidth ultimately limits the numbers to $N_M \le 32$ streamers or $N_C \le 53$ chasers.
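Spelling out the arithmetic behind these limits, using the per-unit consumption at 100MHz implied above (1.28Gbps per chaser and 2.13Gbps per streamer, the same figures used in section 11.4):

$$N_M \le \left\lfloor \frac{68.2\,\mathrm{Gbps}}{2.13\,\mathrm{Gbps}} \right\rfloor = 32, \qquad N_C \le \left\lfloor \frac{68.2\,\mathrm{Gbps}}{1.28\,\mathrm{Gbps}} \right\rfloor = 53$$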

10.4 ASIC Implementation

Although the hardware accelerator was not fabricated as an ASIC, the accelerator units were synthesised for a standard cell ASIC implementation process for estimation purposes. The sample technologies chosen were AMS 0.35µm and UMC 0.13µm technologies.

[Figure 10.3: ASIC area and power estimates. The four panels plot area (mm²) and power (mW) against core speed (MHz) for the chaser, streamer and sieve units, in AMS 0.35µm and UMC 0.13µm technologies.]

Although 0.35µm technology is no longer used for new designs, it is useful to obtain some results for comparison purposes. The chosen 0.13µm technology is fairly recent and the best possible timings are obtained for this technology. It is also the more expensive fabrication technology. The retail fabrication prices4 for 0.35µm and 0.13µm are €720 and €1168 (£572 and £928) per mm² of area, with minimum die size requirements of 10mm² and 25mm² (0.35µm and 0.13µm respectively). The minimum quantities of chips obtained for these prices are 30 (0.35µm) and 45 (0.13µm) units. Therefore, the cost per mm² per chip is actually £19.06 and £20.62 respectively. All the estimates were obtained using the typical case library only. This is useful for getting a ballpark figure of merit for each accelerator unit, but for actual fabrication purposes, a comprehensive programme of testing is necessary using the best and worst case libraries to ensure that the accelerators work correctly. Although the area-speed graphs are drawn using straight lines, actual area-speed curves are typically non-linear. This non-linear trend can be observed in the distribution of the points in the graphs. However, a linear extrapolation is good enough to provide an area estimate within a small margin of error. Although dynamic and static power values are also available, the dynamic power estimates are used as the measure. For regular applications, the dynamic power is more

4prices from Europractice

important as it is the main source of power dissipation. The static power values are only useful for mobile and embedded battery powered applications. In either case, it is ultimately dependent upon fabrication technology. For the purpose of estimation and comparison, an estimated power figure is sufficient. Figure 10.3 shows the area and power estimates for each accelerator unit at different operating frequencies. In each graph, the highest speed plotted is the one where the operation completed with successful timing closure. These area estimates are obtained without integrated cache memories. For all cases, the estimates only apply to the core units themselves as no pad cells were used. Whether or not pad cells are used in actual implementations will depend on how the accelerator units are integrated into the host system. These integration architectures were discussed in an earlier section.

10.4.1 Area Estimates

Area size directly translates to cost as a function of fabrication process and yield. Although some care was taken during the design process to make design choices that consume fewer resources, the accelerators were designed to run at a fast speed and there is still room for some improvement in terms of area by trading off raw speed. However, this depends both on the hardware technology chosen for implementation and on the final application of the accelerator. So, the present designs are kept generic, to allow final application customisation only when necessary. Linearly extrapolating each line in the graph gives the following expressions for each unit (f in MHz):

$C_{035}(f) = 169.7 f + 527545\ \mu m^2$

$M_{035}(f) = 197.7 f + 464857\ \mu m^2$

$V_{035}(f) = 37.3 f + 593615\ \mu m^2$

$C_{013}(f) = 2.57 f + 23769\ \mu m^2$

$M_{013}(f) = 2.58 f + 14705\ \mu m^2$

$V_{013}(f) = 1.27 f + 25893\ \mu m^2$

These expressions show that there is a different rate of change for each accelerator size with respect to speed, but the effect of the speed on the area size will only become significant for very large numbers of accelerator units (in the order of thousands). However, the changes in area size of the chaser and streamer are similar in each fabrication technology, while both are significantly different from the sieve. This should be

taken into account when adding multiple chasers and streamers into a final application under tight area constraints.
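As a quick consistency check on the linear fits, evaluating the 0.35µm area expressions at the 333MHz operating point reproduces the figures later listed in Table 10.1:

$$C_{035}(333) = 169.7 \times 333 + 527545 \approx 0.584\,\mathrm{mm^2}, \quad M_{035}(333) \approx 0.531\,\mathrm{mm^2}, \quad V_{035}(333) \approx 0.606\,\mathrm{mm^2}$$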

10.4.2 Power Estimates

Both dynamic and static power dissipation are very closely linked to the fabrication technology used. Therefore, power optimisation is mainly a fabrication issue, rather than an architecture issue. Of course, some minor steps can be taken to reduce power consumption from within an architecture. For example, instead of using a 4-bit adder, the FIFO counters were implemented as an LFSR with a single XOR gate. And, instead of having multiple adders to calculate the pointers and offsets in the chaser and streamer, a single adder was shared through multiplexing. All these steps are designed to reduce the amount of resource consumption and, hence, the power dissipated. However, this does not discount the fact that power is a process issue, more than an architectural one. Extrapolating the lines in the graphs linearly gives the following expressions for each unit (f in MHz):

$C_{035}(f) = 337.8 f + 18736.8\ \mu W$

$M_{035}(f) = 305.0 f + 13612.9\ \mu W$

$V_{035}(f) = 250.6 f + 11411.4\ \mu W$

$C_{013}(f) = 11.09 f + 551.9\ \mu W$

$M_{013}(f) = 7.02 f + 413.7\ \mu W$

$V_{013}(f) = 11.02 f + 475.3\ \mu W$
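The same consistency check applies to the power fits; at 333MHz the 0.35µm expressions reproduce the dynamic power figures of Table 10.1:

$$C_{035}(333) = 337.8 \times 333 + 18736.8 \approx 131.2\,\mathrm{mW}, \quad M_{035}(333) \approx 115.2\,\mathrm{mW}, \quad V_{035}(333) \approx 94.9\,\mathrm{mW}$$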

10.4.3 Speed Estimates

Looking at the graphs, the speed limits are about 333MHz (0.35µm) and 1.0GHz (0.13µm) respectively. For regular applications, the fastest DDR2-1066 memory has a memory bandwidth of 68.2Gbps at 533MHz. On 0.35µm technology, the memory speed is still significantly higher than the accelerators. With the assumption that the memory runs at 533MHz while the core runs at 333MHz, the maximum theoretical bandwidth consumed is 4.3Gbps for the chaser and

7.1Gbps for the streamer. This gives a maximum limit of $N_C \le 15$ chasers and $N_M \le 9$ streamers per chip, assuming an unlimited area budget. On 0.13µm, the accelerators run faster than the memory. Therefore, the memory bandwidth becomes a serious bottleneck for high speed devices, as is the case for general purpose processors. The maximum theoretical bandwidth consumed at 1GHz is 12.8Gbps for a chaser and 21.3Gbps for a streamer. At these speeds, the maximum numbers of units are $N_C \le 5$ chasers and $N_M \le 3$ streamers.

Estimate        0.35µm, f = 333MHz            0.13µm, f = 1GHz
                C035     M035     V035        C013     M013     V013
Area (mm²)      0.584    0.531    0.606       0.026    0.017    0.027
Power (mW)      131.2    115.2    94.9        11.6     7.4      11.5

Table 10.1: ASIC area and power estimates at speed

10.5 Cost Estimates

For cost estimation, the frequencies are fixed at 250MHz and 667MHz (0.35µm and 0.13µm), which is the median overlap speed for each technology. From the expressions above, it is clear that the price for 0.35µm is very much higher than for an equivalent unit in 0.13µm, even after taking the fabrication cost difference into account. This difference is mainly due to the blocks of memory used in each unit, which depend on the RAM cells of the selected technology library. The 0.13µm technology library used has a compact design for the standard cells and memory.

          Chaser         Streamer       Sieve
0.35µm    £10.87         £9.81          £11.50
0.13µm    £0.53          £0.34          £0.55
FPGA      £0.88±0.15     £0.59±0.10     £1.08±0.19

Table 10.2: Fabrication cost per accelerator unit

The numbers in Table 10.2 are for small production runs and will be lower if the chip is mass produced. Although it may seem more cost effective to implement the design in 0.13µm than in an FPGA, this number does not take into account the minimum die area. As the designs have to interface directly with memory, the chip will be a pad-limited design and will definitely cost much more than this table seems to indicate. In addition, it does not take packaging costs into consideration either. Therefore, it is ultimately more cost effective to implement the design in an FPGA for custom applications.
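For reference only, the per-unit figures in Table 10.2 can be reproduced by combining the linear area fits of section 10.4.1 with the per-mm² per-chip prices derived above. The sketch below uses invented helper names; it recovers the two ASIC rows of the table at the 250MHz and 667MHz operating points.

#include <cstdio>

// Approximate per-unit fabrication cost, reproducing the ASIC rows of
// Table 10.2.  The area expressions are the linear fits of section 10.4.1
// (area in um^2 as a function of core frequency in MHz) and the prices are
// the Europractice figures quoted in section 10.4.
static double areaMm2(double slope, double intercept, double fMhz) {
    return (slope * fMhz + intercept) * 1e-6;  // um^2 -> mm^2
}

int main() {
    const double price035 = 572.0 / 30.0;     // GBP per mm^2 per chip (30-chip minimum)
    const double price013 = 928.0 / 45.0;     // GBP per mm^2 per chip (45-chip minimum)
    const double f035 = 250.0, f013 = 667.0;  // MHz, as fixed in section 10.5

    std::printf("0.35um: chaser %.2f, streamer %.2f, sieve %.2f (GBP)\n",
                areaMm2(169.7, 527545, f035) * price035,
                areaMm2(197.7, 464857, f035) * price035,
                areaMm2(37.3, 593615, f035) * price035);
    std::printf("0.13um: chaser %.2f, streamer %.2f, sieve %.2f (GBP)\n",
                areaMm2(2.57, 23769, f013) * price013,
                areaMm2(2.58, 14705, f013) * price013,
                areaMm2(1.27, 25893, f013) * price013);
    return 0;
}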

10.6 Conclusion

The FPGA implementation is cheap but not very fast. However, having many hardware accelerators running in parallel can make up for any lack in speed, as long as the FPGA is paired with suitably fast memory. Going to an ASIC implementation allows the clock speeds to breach the 1GHz mark. However, the memory bottleneck then becomes a very serious issue. From the various implementations, it is evident that there are physical limitations on how many accelerator units can be included in a chip. Although chip area and power consumption are both important factors, the overriding physical limit in each case is the memory bandwidth. However, there is still a question of how the limited bandwidth can best be utilised. It can potentially be used to run 51 parallel chasers only, or be spread out evenly between the other units. Therefore, it is important to work out how best to utilise this limited resource. Due to the fundamental nature of the accelerator units, there are a number of ways in which they can be assembled. The units can be routed dynamically or statically, via hardware or software. The accelerator can also be integrated at different distances between the host processor and the memory pool. There are advantages and disadvantages to the different configurations. However, these all depend on the end user application and do not fundamentally change the nature of this research.

CHAPTER 11

Analysis & Synthesis

Some questions need to be answered with regard to the results obtained so far, how the solutions presented can scale, and the estimated cost of implementation. Although the solution accomplished the job of accelerating search, there are other possible avenues that improve search at other layers. There is also room for improvement with regard to the actual design of the accelerator units. Suggestions are made about how these may be explored in future work.

11.1 Important Questions

Now that the accelerator units have all been presented, some very pertinent questions need to be answered. Firstly, it needs to be checked that the accelerator units actually accomplish an acceleration. Secondly, potential bottlenecks that affect scalability should be identified. Thirdly, the potential acceleration cost needs to be estimated.

11.2 Host Processor Performance

The first question is whether any actual acceleration takes place. All the comparisons have been made between the accelerator performance and the software performance of the host processor. The biggest assumption made thus far, for obtaining all the speed-up values, is that the host processor is running optimally. If the host processor performance were degraded due to sub-optimal software, it could affect the results. In addition, a

different host processor architecture may also affect the actual results. Both these issues need to be taken into account for consistency in the results.

11.2.1 Software Optimisation

The bulk of the code uses data structures and algorithms from the standard C++ STL library. As mentioned before, one of the reasons that this library was chosen is because it contains time-tested, optimised and mature code. It is unlikely that any custom code would be significantly better than the STL code. Next, the choice of compiler can also affect the quality of the code produced. All the code was compiled with GCC4, which is the latest generation of this popular optimising compiler. Once again, the compiler used is time-tested, optimised and mature[GS04]. Therefore, it is unlikely that any hand-written assembly would perform significantly better. Next, the choice of optimisation level will also affect the performance of the code produced. All the code was compiled using -O2 optimisations. This optimisation level was chosen as it reflects the most common optimisation level used in user software. According to the GCC documentation, it presents a balance of both size and performance and is the best choice for deployment of a program[GS04]. It is arguable that -O3 optimisation may produce more optimised code, but only at the expense of a larger code size. This may, in turn, slow things down due to instruction cache contention. Furthermore, -O3 optimisation only performs a few extra optimisations compared to -O2, the main one being loop unrolling, which explains the larger code size. Therefore, -O2 optimisations are good enough to present a practical indication of performance. Although the software operations were mainly written using higher level C/C++, the low level API library was mostly written using in-line functions, including in-line assembly language. Certain functions had to be written in assembly language as it was impossible to invoke the necessary instructions from within C/C++. In-line functions were chosen as they work like macros and do not incur any extra function call and return overheads, which ensures that these hardware specific operations do not create any unnecessary bottlenecks in the code. So, although the software kernels used may not technically be the fully optimised versions, they are sufficiently optimised to reflect real-world usage, and fully optimised versions may not present any significant performance improvements. Therefore, it is safe to say that the performance exhibited by the accelerator units reflects the true nature of hardware acceleration and is not due to a slow-down caused by poorly written software.
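To illustrate the style of the low level API described above, an in-line accessor typically just wraps a memory-mapped load or store so that it compiles down to a handful of instructions with no call or return overhead. The base address, register names and offsets below are invented for this sketch and do not reflect the actual prototype memory map; the code is intended for the embedded host rather than a workstation.

#include <cstdint>

// Hypothetical memory-mapped register block for one accelerator unit.
// The base address and register offsets are illustrative only.
static volatile uint32_t* const ACCEL_BASE =
    reinterpret_cast<volatile uint32_t*>(0xFFFF0000);

enum AccelReg { CONFIG = 0, STATUS = 1, FIFO_IN = 2, FIFO_OUT = 3 };

// In-line accessors: like macros, these incur no call/return overhead.
static inline void accelWrite(AccelReg reg, uint32_t value) {
    ACCEL_BASE[reg] = value;
}

static inline uint32_t accelRead(AccelReg reg) {
    return ACCEL_BASE[reg];
}

int main() {
    accelWrite(CONFIG, 0x1);           // hypothetical: enable the unit
    return accelRead(STATUS) ? 0 : 1;  // hypothetical: poll a ready flag
}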

11.2.2 Processor Architecture

It is also important to have an idea of how well the host processor performs against other processor architectures. There is no advantage in having an accelerator that is 10 times faster than a host processor if that host is 100 times slower than any other processor. So, for this purpose, the chosen host processor is compared against a mix of common processor architectures covering both RISC and CISC architectures. Relative performance can be estimated by using a code profile of standard library code. The code chosen is the STL library find() method for a set data structure. Listings 11.1, 11.2, 11.3, 11.4 and 11.5 are disassembly listings of the compiled find() for different architectures. This code was chosen as it is indicative of an optimised key retrieval operation. The code was compiled using GCC with -O2 optimisation. It is not easy to compare the performance of such a disparate group of microprocessors as the architectures vary widely, both in scale and type. Hence, a few assumptions need to be made to simplify the comparison. To ignore the effects of superscalar or multicore architectures, all processors are assumed to execute one instruction every clock cycle. To ignore the use of any cache prediction, it is assumed that memory access inflicts a memory transaction penalty during cache misses. To ignore the use of any branch prediction, it is assumed that branch instructions inflict a penalty. In addition, as the architectures are very different, some of the instructions may not fit the categories exactly and some further simplifications were made (a simple tallying sketch along these lines is shown after the list):

• The 68K TST instruction is counted as a compare instruction as it is essentially a compare against zero.

• CISC architectures do not have explicit load and store instructions. Therefore, the loads and stores are counted based on the addressing modes used instead.

• Branches were only considered for instructions that actually cause a change in the program counter. Therefore, conditional instructions on the ARM were not classified as branches but counted as the relevant arithmetic or memory instruction.

• Miscellaneous instructions are instructions that moved data between registers and instructions that provided convoluted linking and unlinking operations. Linking and unlinking instructions on CISC machines are more complicated than those on RISC machines.
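For illustration of how the percentages in Table 11.1 can be tallied, the sketch below classifies a disassembly fed to it one instruction per line. The mnemonic sets are a deliberately simplified, invented subset, and the parser assumes a simple "address opcode mnemonic operands" layout rather than the full rules listed above.

#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Tally a disassembly into broad instruction classes, one instruction per
// input line.  The mnemonic sets are an illustrative subset only.
int main() {
    const std::set<std::string> loads    = {"lwi", "lwz", "ldr"};
    const std::set<std::string> stores   = {"swi", "stw", "str"};
    const std::set<std::string> branches = {"beq", "bne", "bge", "brid", "bnei"};

    std::map<std::string, int> tally;
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream iss(line);
        std::string addr, opcode, mnemonic;
        if (!(iss >> addr >> opcode >> mnemonic)) continue;
        if (loads.count(mnemonic))         ++tally["load"];
        else if (stores.count(mnemonic))   ++tally["store"];
        else if (branches.count(mnemonic)) ++tally["branch"];
        else                               ++tally["other"];
    }
    for (const auto& kv : tally)
        std::cout << kv.first << ": " << kv.second << "\n";
    return 0;
}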

A quick glance at the code will show that the pattern is mostly similar across architectures because the same compiler was used. This is further confirmed by tabulating the profiles. The most important numbers in Table 11.1 are the memory access numbers because both primary and secondary searches face memory bandwidth problems.

122 00000000 , std::less, std::allocator >::find(int 305 const&)>: 0: e8660008 lwi r3,r6,8 4: 30c60004 addik r6,r6,4 8: be030034 beqid r3,52 // 3c c: 11460000 addk r10,r6,r0 309 10: e9270000 lwi r9,r7,0 14: b8100014 brid 20 // 28 18: 11030000 addk r8,r3,r0 1c: 11480000 addk r10,r8, r0 20: e9080008 lwi r8,r8,8 314 24: bc080018 beqi r8,24 // 3c 28: e8680010 lwi r3,r8,16 2c: 16491801 cmp r18,r9,r3 30: bcb2ffec bgei r18, -20 // 1c 34: e908000c lwi r8,r8,12 319 38: bc28fff0 bnei r8, -16 // 28 3c: 16465000 rsubk r18, r6, r10 40: bc120014 beqi r18,20 // 54 44: e8870000 lwi r4,r7,0 48: e86a0010 lwi r3,r10,16 324 4c: 16432001 cmp r18,r3,r4 50: bcb20010 bgei r18,16 // 60 54: f8c50000 swi r6,r5,0 58: b60f0008 rtsd r15,8 5c: 10650000 addk r3,r5,r0 329 60: f9450000 swi r10,r5,0 64: b60f0008 rtsd r15,8 68: 10650000 addk r3,r5,r0 Listing 11.1: AEMB disassembly (GCC 4.1.1)

00000000 , std::less, std::allocator >::find(int 153 const&)>: 0: e5903008 ldr r3,[r0,#8] 4: e2800004 add r0,r0,#4 ;0x4 155 8: e3530000 cmp r3,#0 ;0x0 c: e1a02001 mov r2,r1 10: e52de004 push {lr} ;(strlr,[sp,#-4]!) 14: e1a01000 mov r1,r0 18: 0a000008 beq 40 , std::less, 160 std::allocator >::find(int const&)+0x40> 1c: e592e000 ldr lr,[r2] 20: e1a0c003 mov ip,r3 24: e59c3010 ldr r3,[ip,#16] 28: e153000e cmp r3,lr 2c: a1a0100c movge r1, ip 165 30: b59cc00c ldrlt ip, [ip, #12] 34: a59cc008 ldrge ip, [ip, #8] 38: e35c0000 cmp ip,#0 ;0x0 3c: 1afffff8 bne 24 , std::less, std::allocator >::find(int const&)+0x24> 40: e1500001 cmp r0,r1 170 44: 0a000003 beq 58 , std::less, std::allocator >::find(int const&)+0x58> 48: e5922000 ldr r2,[r2] 4c: e5913010 ldr r3,[r1,#16] 50: e1520003 cmp r2,r3 54: a1a00001 movge r0, r1 175 58: e49de004 pop {lr} ;(ldrlr,[sp],#4) 5c: e12fff1e bx lr Listing 11.2: ARM disassembly (GCC 4.2.3)

123 00000000 , std::less, std::allocator >::find(int 164 const&)>: 0: 80030008 lwz r0,8(r3) 4: 39 43 00 04 addi r10,r3,4 166 8: 2f 80 00 00 cmpwi cr7,r0,0 c: 7d435378 mr r3,r10 10: 41 9e 00 38 beq- cr7,48 , std::less, std::allocator >::find(int const&)+0x48> 14: 81 64 00 00 lwz r11,0(r4) 18: 7c090378 mr r9,r0 171 1c: 48 00 00 14 b 30 , std::less, std::allocator >::find(int const&)+0x30> 20: 7d234b78 mr r3,r9 24: 81290008 lwz r9,8(r9) 28: 2f 89 00 00 cmpwi cr7,r9,0 2c: 41 9e 00 1c beq- cr7,48 , std::less, 176 std::allocator >::find(int const&)+0x48> 30: 80 09 00 10 lwz r0,16(r9) 34: 7f 80 58 00 cmpw cr7,r0,r11 38: 40 bc ff e8 bge- cr7,20 , std::less, std::allocator >::find(int const&)+0x20> 3c: 81 29 00 0c lwz r9,12(r9) 40: 2f 89 00 00 cmpwi cr7,r9,0 181 44: 40 9e ff ec bne+ cr7,30 , std::less, std::allocator >::find(int const&)+0x30> 48: 7f 83 50 00 cmpw cr7,r3,r10 4c: 4d9e0020 beqlr cr7 50: 80040000 lwz r0,0(r4) 54: 81 23 00 10 lwz r9,16(r3) 186 58: 7f 80 48 00 cmpw cr7,r0,r9 5c: 4c9c0020 bgelr cr7 60: 7d435378 mr r3,r10 64: 4e800020 blr Listing 11.3: PPC disassembly (GCC 4.1.1)

00000000 , std::less, std::allocator >::find(int 198 const&)>: 0: 4e56 0000 linkw %fp,#0 4: 2f0a movel %a2,%sp@- 6: 206e 0008 moveal %fp@(8),%a0 a: 246e 000c moveal %fp@(12),%a2 202 e: 2268 0006 moveal %a0@(6),%a1 12: 5488 addql#2,%a0 14: 2208 movel%a0,%d1 16: 4a89 tstl%a1 18: 6712 beqs 2c , std::less, 207 std::allocator >::find(int const&)+0x2c> 1a: 2012 movel %a2@,%d0 1c: b0a9 0010 cmpl %a1@(16),%d0 20: 6e1a bgts 3c , std::less, std::allocator >::find(int const&)+0x3c> 22: 2049 moveal%a1,%a0 24: 2269 0008 moveal %a1@(8),%a1 212 28: 4a89 tstl%a1 2a: 66f0 bnes 1c , std::less, std::allocator >::find(int const&)+0x1c> 2c: b288 cmpl%a0,%d1 2e: 6708 beqs 38 , std::less, std::allocator >::find(int const&)+0x38> 30: 2452 moveal%a2@,%a2 217 32: b5e8 0010 cmpal %a0@(16),%a2 36: 6c0e bges 46 , std::less, std::allocator >::find(int const&)+0x46> 38: 2041 moveal%d1,%a0 3a: 600a bras 46 , std::less, std::allocator >::find(int const&)+0x46> 3c: 2269 000c moveal %a1@(12),%a1 222 40: 4a89 tstl%a1 42: 66d8 bnes 1c , std::less, std::allocator >::find(int const&)+0x1c> 44: 60e6 bras 2c , std::less, std::allocator >::find(int const&)+0x2c> 46: 2008 movel%a0,%d0 48: 245f moveal%sp@+,%a2 227 4a: 4e5e unlk%fp 4c: 4e75 rts Listing 11.4: 68K disassembly (GCC 3.4.6)

124 00000000 , std::less, std::allocator >::find(int 208 const&)>: 0: 55 push %ebp 1: 89e5 mov %esp,%ebp 3: 57 push %edi 4: 56 push %esi 5: 53 push %ebx 213 6: 8b450c mov 0xc(%ebp),%eax 9: 8b7508 mov 0x8(%ebp),%esi c: 8b7d10 mov 0x10(%ebp),%edi f: 8b5008 mov 0x8(%eax),%edx 12: 8d5804 lea 0x4(%eax),%ebx 218 15: 89d9 mov %ebx,%ecx 17: 85d2 test %edx,%edx 19: 74 1b je 36 , std::less, std::allocator >::find(int const&)+0x36> 1b: 89d0 mov %edx,%eax 1d: 8b17 mov (%edi),%edx 223 1f: eb 09 jmp 2a , std::less, std::allocator >::find(int const&)+0x2a> 21: 89c1 mov %eax,%ecx 23: 8b4008 mov 0x8(%eax),%eax 26: 85c0 test %eax,%eax 28: 74 0c je 36 , std::less, 228 std::allocator >::find(int const&)+0x36> 2a: 395010 cmp %edx,0x10(%eax) 2d: 7d f2 jge 21 , std::less, std::allocator >::find(int const&)+0x21> 2f: 8b400c mov 0xc(%eax),%eax 32: 85c0 test %eax,%eax 34: 75 f4 jne 2a , std::less, 233 std::allocator >::find(int const&)+0x2a> 36: 39cb cmp %ecx,%ebx 38: 74 07 je 41 , std::less, std::allocator >::find(int const&)+0x41> 3a: 8b07 mov (%edi),%eax 3c: 3b4110 cmp 0x10(%ecx),%eax 3f: 7d 0b jge 4c , std::less, 238 std::allocator >::find(int const&)+0x4c> 41: 891e mov %ebx,(%esi) 43: 89f0 mov %esi,%eax 45: 5b pop %ebx 46: 5e pop %esi 47: 5f pop %edi 243 48: 5d pop %ebp 49: c20400 ret $0x4 4c: 890e mov %ecx,(%esi) 4e: 89f0 mov %esi,%eax 50: 5b pop %ebx 248 51: 5e pop %esi 52: 5f pop %edi 53: 5d pop %ebp 54: c20400 ret $0x4 Listing 11.5: X86 disassembly (GCC 4.2.3)

Type              AEMB   ARM    PPC    68K    X86
Arithmetic        33%    25%    27%    19%    13%
  Comparison      22%    83%    86%    83%    100%
  Addition        67%    17%    14%    17%    0%
  Subtraction     11%    0%     0%     0%     0%
Branch            33%    17%    31%    28%    20%
  Conditional     67%    75%    75%    67%    67%
  Unconditional   33%    25%    25%    33%    33%
Memory            33%    38%    27%    34%    29%
  Load            78%    89%    100%   91%    77%
  Store           22%    11%    0%     9%     23%
Miscellaneous     0%     21%    15%    19%    38%
Instruction Count 27     24     26     31     44

Table 11.1: Code profile for std::set::find()

Looking at the averages and range for each category across architectures: memory operations account for 32.2±5.5%, arithmetic operations account for 23.4±10% and branch operations account for 25.8±8%. This tells us that the bulk of the code is taken up by memory operations, and memory operations are also the most expensive in terms of time cost. Therefore, it is safe to compare the performance of these architectures using memory operations alone. Looking at the profile of the AEMB in comparison with the rest, it exhibits a memory profile (33%) that is similar to the average (32.2%) across architectures. Furthermore, memory operations are the most consistent across architectures as they have the smallest range (±5.5%) between the largest and smallest values. The ARM has about 5/33 = 15% more memory operations and the PPC has about 6/33 = 18% less memory operations.

This difference cannot turn $C_{up} = 3.43$ into $C_{up} \le 1.0$. Therefore, on the basis of memory operations, it is safe to assume that the AEMB architecture is neither significantly slower nor faster than other processor architectures. Any performance numbers gained through the use of accelerator units when compared with this host processor will be broadly similar with a different host processor architecture. Using higher-end processors that include many complex features such as multiple cores, large caches and branch prediction will improve the performance of the host processor for search applications, but at the cost of higher complexity. Accelerator units are highly specialised but are simpler in concept and implementation. The issue then becomes one of choosing between the cost and performance of a complex processor versus a simple accelerator unit.

11.3 Scalability

The next issue that needs to be addressed is whether this acceleration is scalable. Although the accelerators can accelerate a single search thread, it would be more useful if it were possible to accelerate N search threads in hardware. It should be obvious that there are a number of bottlenecks in the system and these will be the limiting factors on scalability.

11.3.1 Processor Scalability

Although the search acceleration is performed by the accelerator units, a processor bottleneck exists at two points: the available communications bandwidth between the host processor and the accelerator units, and the processing bandwidth, or the ability of the processor to allocate search threads to the different accelerator units. These are two separate issues that can be considered together as the latter is partly dependent on the former. Each accelerator unit comes with an individual host processor interface, which is used by the host processor to send configuration information, receive status information, put data items into the input buffers and get data items out of the output buffers. Depending on how each accelerator is used in the search pipeline, it may only take a few transactions to configure and retrieve a single result, or it may take a large number of transactions to configure and stream data items in and out of the accelerator unit. Hence, the communication requirements for each accelerator unit and pipeline differ depending on the application. The host processor interface in the prototype was configured as a shared bus. This was used because the accelerators were tested individually. Hence, the shared bus was essentially a dedicated bus for a specific accelerator unit, one at a time. This ensures that in each situation, the entire bandwidth is available to the application kernel. A shared bus was perfectly suited to prototyping, but is not suitable for real-world applications. Therefore, the way to scale up the communication capacity would be to adopt a different communication architecture. The host processor interface could be changed into a packet based interface [SSTN03, LZJ06, Art05] running on a number of different non-bus layouts to increase the number of channels for traffic. Alternatively, it is possible to split the host processor interface into two separate interfaces: a low traffic one for configuration, and a high traffic one for streaming data. This will further improve the use of available bandwidth. Another potential bottleneck is the processing bandwidth. The research prototype uses a single RISC processor core. As is evident from the simulation results of the

different accelerator units, the accelerator units are generally able to work faster than the processor core and to generate useful results at a faster rate than can be consumed by the software kernel. Therefore, another potential bottleneck in the search pipeline is the search issue rate and result consumption rate of the software. Again, the most obvious way to increase processing bandwidth is to increase parallelism, whether at the fine or coarse grain level. There are a number of different ways[Wal95, TEL95] to do this, such as increasing pipeline depths, hardware threads and processing cores. However, all these options are contingent upon an increase in host communications bandwidth. Otherwise, the communications bandwidth will present the major bottleneck.

11.3.2 Accelerator Scalability

Besides the physical cost constraints, the other problem that will affect the scalability of the accelerator units is the inter-accelerator communications, which involves bandwidth capacity and routing issues. Inter-accelerator bandwidth is only used to stream data from one accelerator unit to another, in a point-to-point fashion. The term streaming is used because it reflects what happens in hardware. The output buffers of a transmitting unit are directly connected to the input buffers of a receiving unit. These buffers are FIFOs and the control signals are crossed in order to provide hardware flow control, which gives a maximum bandwidth of 3.2Gbps at 100MHz for each channel. This bandwidth is more than necessary for each accelerator unit, which can only generate results at the maximum rate of 1.6Gbps per channel. Furthermore, this inter-accelerator communication bypasses the host processor and is not affected by processor scalability. Hence, the inter-accelerator communication has plenty of room to spare and is scalable, subject to physical and architectural constraints. The larger issue is that of routing data streams that are not connected directly. Routing can be achieved either by software routing or dedicated hardware routers. Software routing would suffer from the processor scalability issues highlighted earlier, while dedicated hardware routing suffers from a number of physical constraints. Even if physical constraints are discounted, hardware routing would still present issues at an architectural level. These problems stem from the fact that the complexity of a hardware router tends to increase exponentially with the number of nodes. There are different methods for reducing this problem by constraining the number of physical routes while still providing the ability to route from one node to another. However, for the search pipeline, the number of physical routes can be further reduced by considering the fact that not all units need to have access to every other unit. For example, it is

unlikely that results from a chaser unit need to be routed to another chaser unit or sieve unit. Furthermore, the routes are uni-directional as data flows from one stage to the next. However, the best way to work around this problem is to remove the necessity for routing altogether. This can be done by using the static routing architecture proposed in section 10.1.2. This will reduce both the architectural and physical constraints on the routing and, due to the well defined nature of search, this architecture can still be used to solve most common search problems. Therefore, architecturally speaking, it is possible to scale up the performance of the accelerators by replicating multiple accelerator units in fixed chains. The only issues that will limit this are physical constraints. While area size is definitely a limiting factor, it is less of a major concern. The main physical constraint will be on the layout of the interconnects that form the host processor channel. These long lines will become significant as the number of accelerator units goes up and will at least affect the overall speed of the system. Although there can be some creativity in laying out these lines and the processor cores, this will ultimately limit the scalability of the accelerator units.

11.3.3 Memory Scalability

Memory scalability is also a major bottleneck for a search application. We have shown that the memory bandwidth requirements for each accelerator unit are fairly high and that the use of cache memory will not help much in hiding this problem. Therefore, memory will prove to be the ultimate bottleneck in implementation scalability. A memory bottleneck exists at two potential locations: the actual memory bandwidth available, and the memory contention between the accelerator units. The one positive note is that memory technology is constantly improving [Lin08, CW08] and that will help alleviate the bottleneck. The actual memory bandwidth available can be increased by using faster memory technologies and multi-channel memory. While faster memory technology is able to retrieve more data from memory in each clock cycle and thus increase bandwidth, it usually comes at the cost of higher latency[Woo08]. Due to the random nature of data access for search applications, this higher latency may prove to be a hidden problem, although it can still be masked by using higher speed memory. Multi-channel memory[ZZ05] can increase the bandwidth by accessing multiple memory locations at a time and is seeing increased use in consumer level computing. This mainly involves striping different memory locations across different memory modules accessed through separate memory channels. This is a scalable solution on an architectural level but would end up consuming extra I/O and board space on a physical level,

which will again limit the scalability. Therefore, although there are work-arounds for the different scalability issues, physical limitations will ultimately limit memory scalability.

11.4 Acceleration Cost

The term cost needs to be defined for the purpose of deriving the acceleration per unit cost. In the case of this accelerator design, cost is measured in terms of monetary cost to produce the desired result. Some of the basic values have been calculated in section 10.5. However, these should be consolidated into specific values for specific configurations. A number of boundary conditions need to be assumed before the costs can be estimated.

Implementation Technology is assumed to be an off-the-shelf FPGA. With present technology, this will have a potential I/O clock of about 200MHz using a high speed FPGA. The FPGA is assumed to have the necessary I/O connections to communicate with the outside world at that speed. The accelerator is assumed to be closely coupled to the host processor and implemented as a bridge device.

Memory Technology is assumed to be regular DDR2 memory. With an I/O clock of 200MHz, this will limit the memory technology to DDR2-400 technology, which has a maximum bandwidth of 25.6Gbps1. The host processor consumption of this memory bandwidth is assumed to be negligible.

Host Communication Interface is assumed to connect directly to an x86 processor via HyperTransport. With an I/O clock of 200MHz, this has a maximum commu- nication bandwidth of 12.8Gbps2 in one direction. The aggregated bandwidth will not be used as it assumes a 50:50 bi-directional communication ratio. This band- width is more than sufficient to handle data streams coming in at the maximum rate from memory.

11.4.1 Configuration A

Using memory as the limiting factor, the absolute maximum number of chaser units and streamer units can each be easily computed.

$M_{max} = \frac{25.6\,\mathrm{Gbps}}{2.13\,\mathrm{Gbps}} \times \frac{100\,\mathrm{MHz}}{200\,\mathrm{MHz}} = 6.009 \approx 6$

$C_{max} = \frac{25.6\,\mathrm{Gbps}}{1.28\,\mathrm{Gbps}} \times \frac{100\,\mathrm{MHz}}{200\,\mathrm{MHz}} = 10$

1 200MHz × 2 transfers per clock × 64 bits
2 200MHz × 2 transfers per clock × 32 bits

Assuming that each chaser is directly paired with a streamer and they are configured in a simple query pipeline:

$M_{max} = C_{max} = \frac{25.6\,\mathrm{Gbps}}{2.13\,\mathrm{Gbps} + 1.28\,\mathrm{Gbps}} \times \frac{100\,\mathrm{MHz}}{200\,\mathrm{MHz}} = 3.75 \approx 4$

This maximum figure is rounded up for two reasons: the sieve unit works better in even channel pairs; and the chaser will not consume the maximum bandwidth as it has to be reconfigured between searches. Assuming that each streamer channel pair is then connected to a sieve unit and configured in a simple boolean query pipeline:

$V_{max} = 2$

The total resource consumption of the accelerator units can be computed from equation 10.1:

$Q_{lut} = 515 N_C + 357 N_M + 634 N_V + 126 = 4882$

This will fit inside a medium to large FPGA, but not into smaller ones. Any additional resources can be used to implement additional sieve units to enable more complicated queries. Using the figures from Table 10.2, the approximate monetary cost of such an accelerator will be:

$K_{fpga} = \pounds0.88\,N_C + \pounds0.59\,N_M + \pounds1.08\,N_V \approx \pounds8.04$

Under these conditions, the accelerator unit will be able to accelerate entirely in hardware:

• Four parallel simple queries.

• Two parallel boolean queries, each with two streams and one operand.

• A combination of the above.
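The Configuration A sizing can be reproduced with a few lines of arithmetic; the following sketch simply restates the calculation above (the variable names are invented) and prints the 4,882 LUT and roughly £8.04 figures.

#include <cmath>
#include <cstdio>

int main() {
    // Boundary conditions from section 11.4: DDR2-400 behind a 200MHz I/O
    // clock, with the accelerator cores running at 100MHz.
    const double memBandwidth = 25.6;                 // Gbps
    const double chaserBw = 1.28, streamerBw = 2.13;  // Gbps per unit at 100MHz
    const double clockRatio = 100.0 / 200.0;          // core clock / I/O clock

    // Each chaser is paired with a streamer in a simple query pipeline.
    double pairs = memBandwidth / (chaserBw + streamerBw) * clockRatio;  // 3.75
    int nc = static_cast<int>(std::ceil(pairs));      // rounded up to 4, as in the text
    int nm = nc;
    int nv = nm / 2;                                  // one sieve per streamer channel pair

    int qlut = 515 * nc + 357 * nm + 634 * nv + 126;  // expression 10.1
    double cost = 0.88 * nc + 0.59 * nm + 1.08 * nv;  // per-unit FPGA costs, Table 10.2

    std::printf("NC=%d NM=%d NV=%d  Q_lut=%d  cost=GBP %.2f\n", nc, nm, nv, qlut, cost);
    return 0;
}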

11.4.2 Configuration B

Assuming that the ratio of chasers to streamers is 1:2.5 as suggested in section 9.1.2 and they are configured with dynamic software routing instead:

$C_{max} = \frac{25.6\,\mathrm{Gbps}}{2.5 \times 2.13\,\mathrm{Gbps} + 1.28\,\mathrm{Gbps}} \times \frac{100\,\mathrm{MHz}}{200\,\mathrm{MHz}} = 1.938 \approx 2$

$M_{max} = 2.5 \times C_{max} \approx 5$

This results in an odd number of streamer units. While not very homogeneous, this oddity is acceptable if other types of queries are considered, such as boolean queries with three streams and two operands. Assuming that each streamer channel pair is connected to a sieve unit while the odd channel is joined with the output of an existing boolean sieve unit:

$V_{max} = 3$

The total resource consumption of the accelerator units can be computed from equation 10.1:

$Q_{lut} = 4843$

Using the figures from Table 10.2, the approximate monetary cost of such an accelerator will be:

$K_{fpga} \approx \pounds7.95$

Under these conditions, the accelerator unit will be able to accelerate with some software assisted routing:

• Five parallel simple queries.

• Two parallel boolean queries, one with two streams and one operand, the other with three streams and two operands.

• A combination of the above.

This configuration is more versatile than the earlier configuration and at a similar cost. As the absolute maximum number of streamer units that can be implemented is six, this configuration also represents the maximum practical number of streamers that can be implemented for a search pipeline.

11.4.3 Configuration Comparisons

There are a couple of comparisons that can be made from the above estimates. While the cost of implementing a more complex configuration is similar to the cost of implementing a simpler configuration, the complex configuration is capable of handling a different range of search pipelines. Therefore, it is possible to implement different combinations of pipelines for different applications, within the bounds of the absolute maximum number of accelerator units. Although the number of sieve units was chosen to be a minimum, a number of additional sieve units can be added to build more complex pipelines that handle more complicated boolean queries in addition to routing. Another thing to note is that, although CMOS implementations have a fundamentally faster clock speed, the maximum number of units is fundamentally bound by

memory bandwidth and is exactly the same as that for FPGAs, because the bandwidth requirements scale linearly with the core clock when paired with faster memory. Therefore, the number of pipelines that can actually be accelerated in a CMOS implementation is similar to that of the FPGA. The only difference is that they can run at a much faster clock rate and complete more search operations per unit time. One way to increase the number of search pipelines is to increase the memory bandwidth through the different methods suggested in section 11.3.3. This is ultimately the bottleneck in any search system. Another way is to redistribute the usage of the existing memory bandwidth, as shown between the two configurations above. However, this can only improve the situation slightly.

11.5 Alternative Technologies

There are, of course, other possible ways of achieving a performance boost for search applications. A dedicated hardware search accelerator may, or may not, be the best solution for an application. Therefore, it is prudent to set the solution presented in this research against the many alternative solutions, which will show the advantages and disadvantages that the present solution has against the rest. It is also a good time to have another look at the search stack presented in Figure 2.1.

11.5.1 Improved Software

The simplest way to achieve any search acceleration would be to replace search algorithms with alternatives at the primary and secondary search layers. This would allow the search operation to be accelerated with the minimum amount of problems and cost. A quick search will reveal that there is plenty of ongoing research in the area of application specific search algorithms. As this is a pure software alternative, it does not compete directly with the hardware accelerator alternatives and can actually exist in parallel with hardware options. By providing an accelerated hardware layer that helps in performing common operations, the hardware can potentially benefit a wide variety of software search algorithms. Another way to improve search without affecting any hardware is to improve the data structures used. The choice of one data structure over another can affect the performance of algorithms significantly[Knu73]. There is also ongoing research in the area of esoteric data structures for use with newer algorithms. The search accelerator presented here should not be considered the best solution for the problem. It is important to exhaust possible software alternatives in addition to any other hardware alternative.

11.5.2 Content-Addressable Memories

The content-addressable memory (CAM) compares input search data against a table of stored data and returns the address of the matching data[PS06a]. CAMs have a single clock cycle throughput, making them faster than other hardware-based and software-based search systems. However, the speed of a CAM comes at the cost of increased silicon area and power consumption. Most CAMs are implemented using expensive SRAM cells instead of DRAM cells. Furthermore, a typical CAM cell has half the capacity of an SRAM cell, which further exacerbates the problem. Due to this, the largest CAMs are only about 18Mbit in size[PS06a]. Therefore, it is a fairly expensive hardware solution to the search problem in terms of power and area. A binary CAM performs only exact-match searches, while a more powerful ternary CAM allows pattern matching with the use of "don't cares"[ACS03]. Don't cares act as wildcards during a search and are particularly attractive for implementing longest-prefix-match searches in routing tables. This makes the CAM more suitable for performing attribute searches rather than comparison searches (section 2.2). Although it can be argued that the CAM can replace a chaser unit for equality searches, it is less able to replace it for comparison searches such as greater-than or less-than searches. Therefore, while there is a place for CAMs in hardware search acceleration, they are better suited to a different class of search operations than the solution presented here.

11.5.3 Multicore Processors

The most obvious method of accelerating search, as readily suggested in [Sto90], is by performing search operations in parallel on multiple processors. This accelerates search at the host processor layer. Most popular multi-processors used today are homogeneous multi-processors such as those employed in x86 processors. Symmetric multi-processing has been the mainstay in general purpose computing acceleration over the years. The heterogeneous multi-processor system presented in this research is an alternative method to accelerate applications. Both methods suffer from the same restrictions and pitfalls caused by limited memory bandwidth. However, it has been conclusively [KTJR05] demonstrated that heterogeneous multiprocessor systems are more efficient than homogeneous systems from several perspectives. Using a heterogeneous processor can significantly reduce processor power dissipation. An increase in power consumption and heat dissipation will typically lead to higher costs for thermal packaging, fans, electricity, and even air conditioning. To reduce this, industry currently uses two broad classes of techniques for power reduction: gating-based and voltage or frequency scaling-based. Given the low core voltages of around 1 volt, there is very little more that voltage

scaling can do to improve power consumption. Any significant decrease in voltage will eat away at the noise margin, reducing the accuracy of the digital signal. Gating circuitry itself has power and area overheads, limiting its application at the lowest granularity. This means that power still gets dissipated even when dynamic blocks are gated off. It is only feasible to use gating techniques at a large block level, which is where it is principally applied today. This can easily be used for the accelerator designed here by gating off specific accelerator units and only turning them on when they are needed in the pipeline. Given a fixed circuit area, a heterogeneous processor can provide significant advantages. It can match applications to the core that best meets the performance demands and it can provide improved area-efficient coverage of various real-world work demands. For the area devoted to a single additional host processor core, it is possible to include 1 sieve, 3 streamers and 2 chasers instead. This is a better allocation of resources than the existing solution of pure homogeneous multicore processors. For a similar amount of resources, a heterogeneous hardware accelerator can accelerate search by 5 times, possibly more. Further reductions in area consumption are possible if the accelerator units are made to share resources. In the design of the accelerator units, each unit has an individual ALU unit that is not 100% exploited. The ALU units are only used during less than 50% of the machine states. Therefore, it is certainly feasible to join two accelerator units to share the same ALU unit. Others [SKV+06] have shown that sharing ALU units for general purpose processors can reduce area consumption by almost 20%. In the case of the accelerator units, the area savings will be significantly greater as the bulk of the unit is made up of the ALU device itself.

11.5.4 Data Graph Processors

Another class of heterogeneous processor that can feasibly be used to accelerate search is the class of graph processors [MHH02, NK04]. This provides an alternative at the accelerator unit layer. These processors work at an associative level by reconfiguring hardware to build the data structures physically, instead of as virtual data structures in memory. This has the advantage of manipulating data structures in hardware, which can be very fast. It has been described elsewhere [CLRS01] that a tree is merely a specific representation of a generic graph. Therefore, a graph processor can definitely represent and accelerate any tree functions. In fact, this is a very interesting class of processor as it attempts to represent complex data structures in hardware. This suggests that since primitive data structures such as stacks and queues are already explicitly implemented in hardware, there is no reason why more complex data

structures cannot be similarly treated. One can certainly appreciate the logic behind this and understand the benefits that come from representing data structures in hardware, which would allow the data structures to be quickly searched and easily manipulated, bypassing the limitations of memory bandwidth by simply not using much memory. However, hardware data structures suffer from the very physical limitations of hardware: the cost of the hardware would grow exponentially with the size of the problem set. The way to reduce this cost is to partly move the problem up to the software domain by swapping data graphs between memory and the processor. But this will reduce many of the benefits associated with implementing complex data structures in hardware. Therefore, although exciting, the real-world use of such a processor is limited. Furthermore, in search applications, only a subset of the data structure is usually needed. Although the graph processor will likely defeat any other method of graph traversal, a search has been shown earlier to be a linear traversal and will not benefit much from a graph processor.

11.5.5 Other Processors

There are many other types of hardware accelerators in use in the real world today, including media processors and physics processors. These processors provide alternatives at the hardware layers. They are typically used for computationally intensive operations and are not necessarily suited to, nor efficient for, search applications. For example, the CELL[GHF+05, CHI+05] processor is capable of performing compares and data ordering, which would allow it to perform a sieve operation. It has a load-store unit that connects to a high bandwidth memory interconnection architecture to supply it with the data it needs at a raw rate of 60 GB/s. Therefore, it can definitely be programmed to perform a search operation. However, it is a little too costly for search applications, especially with its very high power consumption.

11.6 Suggestions for Future Work

The research prototype was designed in such a way as to facilitate the collection of data. This means that it was a stripped-down design, so that it could be assembled in different configurations and tested under various conditions. To accommodate this, many helpful assumptions about the usage of the accelerators could not be incorporated. Therefore, there is still room to improve the design, particularly in the optimisation of the resources used.

11.6.1 Conjoining Arithmetic Units

Care was taken while designing each individual accelerator unit to optimise the use of resources. For example, the chaser unit uses the same ALU to perform calculations on the data and next pointer locations. Both these operations perform the addition of the node pointer with a static offset value, but at different times. As can be seen from figures 5.5, 6.5 and 7.4 and their respective descriptions, the use of the ALU has been interleaved in each accelerator. While the ALU has already been interleaved within each accelerator, there are still clock cycles where the ALU is not in use. For example, the ALU unit for a chaser unit is only used to add pointers during the NULL and NEXT states. The entire main machine loop has five states in it, which means that the ALU is not in use for the other three states. A second chaser unit could feasibly share the same ALU and interleave its calculations as well. This sort of conjoining of resources can also be applied elsewhere in the design of other accelerator units. While the use of a single ALU may not result in massive savings of resources, the ALU is one of the larger blocks of an accelerator unit (the other being memory). If a significant number of accelerator units are used in an application, the number of ALU units saved would become significant. Therefore, this is definitely one form of design optimisation that should be undertaken for real-world implementations.
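As a rough behavioural illustration of the idea (this is not the actual RTL, and the state handling is heavily simplified), two chaser-like units could time-share a single adder by taking ownership of it on alternate clock cycles:

#include <cstdint>
#include <cstdio>

// Behavioural sketch of two units time-sharing one adder.  Each unit only
// needs the ALU in a minority of its machine states, so their requests can
// be interleaved on alternate cycles without stalling either unit.
struct ChaserState {
    uint32_t nodePtr;
    uint32_t offset;
    bool     needsAdd;   // asserted only in the pointer-calculation states
    uint32_t result;
};

void sharedAluCycle(uint64_t cycle, ChaserState& a, ChaserState& b) {
    ChaserState& owner = (cycle & 1) ? b : a;         // alternate ownership
    if (owner.needsAdd) {
        owner.result = owner.nodePtr + owner.offset;  // the single shared adder
        owner.needsAdd = false;
    }
}

int main() {
    ChaserState a{0x1000, 8, true, 0}, b{0x2000, 12, true, 0};
    for (uint64_t c = 0; c < 4; ++c) sharedAluCycle(c, a, b);
    std::printf("a=%#x b=%#x\n", static_cast<unsigned>(a.result),
                static_cast<unsigned>(b.result));
    return 0;
}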

11.6.2 Conjoining Stream Buffers

The accelerator units have individual input and output data buffers that are designed as hardware FIFOs. Each FIFO contains a multi-port memory block, which is typically expensive in terms of area cost[VST03b, VST03c, VST03a]. One port is connected to the internal accelerator unit while the other port is used externally. As the number of accelerator units increases, the number of memory blocks used increases linearly. The obvious way to reduce the cost of memory is to reduce the size of the FIFOs. However, this has limited advantages, as memory blocks tend to come in fairly standard sizes and chip area does not scale linearly with memory size: halving the capacity of a block results in a block that is still more than half of the original area. Simply reducing the size of the stream buffers will therefore not save a significant amount of area. In fact, the FIFO used in the research prototype is already extremely small (15 × 32 bit). An alternative way to reduce the amount of memory used is to conjoin the stream buffers. Instead of having a separate output buffer on one accelerator unit and an input buffer on another, the two buffers can be merged into a single buffer.

While a larger memory block is needed to store the same amount of data, this proves to be an advantage because a single 2 kbit memory block is considerably smaller than two individual 1 kbit memory blocks³: two 1 kbit blocks occupy 2 × 0.34 = 0.68 mm², whereas a single 2 kbit block occupies only 0.48 mm², a saving of almost 30%, as evident from Table 11.2.

Capacity     1 kbit   2 kbit   4 kbit   8 kbit   16 kbit   32 kbit   64 kbit
Area (mm²)   0.34     0.48     0.75     1.26     2.23      3.78      7.15
Time (ns)    2.93     3.07     3.33     3.42     3.54      4.41      4.69

Table 11.2: Specifications for 0.35µm CMOS DPRAM blocks

Although this is the easiest way to reduce memory area, it complicates matters for dynamic pipeline architectures. However, it is still possible to route the data dynamically in software by treating the merged buffers as a unified buffer: data can still be pumped in at the front end and extracted from the back end via software. Alternatively, all the buffers could be treated as either input or output buffers, but not both. Another way to reduce memory resources is to conjoin the memory blocks themselves. However, this can only be done after the accelerator units have been paired up and the pipelines are well defined. As with the ALU, each port of the memory block is only accessed every other clock cycle, so it is possible to interleave multiple operations on the same memory port. This can turn a dual-port memory into a quad-port memory by time-division multiplexing the memory operations[SD02, Xil05], as sketched below. This method is more complicated than merely merging the buffers, but it can be used in tandem with buffer merging for additional savings.
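As an illustration of the time-division multiplexing idea, the hedged Verilog sketch below shows one physical port of a dual-port RAM being shared by two logical requesters, along the general lines of [SD02, Xil05]: the RAM is clocked at twice the accelerator clock and alternates between the two requesters on successive fast-clock cycles. The module and signal names are assumptions for illustration only.

// Hedged sketch: share one physical RAM port between two logical ports (A, B)
// by running the RAM at twice the accelerator clock. 'phase' toggles every
// fast-clock cycle; A is served on phase 0 and B on phase 1.
module tdm_port #(parameter AW = 10, DW = 32) (
  input  wire          clk2x,                 // 2x memory clock
  input  wire          phase,                 // alternates 0,1,0,1,...
  input  wire [AW-1:0] a_addr, b_addr,        // logical port addresses
  input  wire [DW-1:0] a_wdata, b_wdata,      // logical port write data
  input  wire          a_we, b_we,            // logical port write enables
  output reg  [DW-1:0] a_rdata, b_rdata,      // logical port read data
  output wire [AW-1:0] m_addr,                // physical port: address
  output wire [DW-1:0] m_wdata,               // physical port: write data
  output wire          m_we,                  // physical port: write enable
  input  wire [DW-1:0] m_rdata                // physical port: read data
);
  // Multiplex the two logical requests onto the one physical port.
  assign m_addr  = phase ? b_addr  : a_addr;
  assign m_wdata = phase ? b_wdata : a_wdata;
  assign m_we    = phase ? b_we    : a_we;

  // A synchronous RAM returns read data one fast cycle after the address,
  // so capture it on the opposite phase from the one that issued it.
  always @(posedge clk2x)
    if (phase) a_rdata <= m_rdata;
    else       b_rdata <= m_rdata;
endmodule

Applying the same multiplexing to both ports of a dual-port block would present four logical ports from a single physical memory.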

11.6.3 Memory Interface

For the research prototype, all memory accesses go through a central memory arbiter that handles transactions in a round-robin manner. This is adequate for the research prototype because the accelerator units are tested one at a time, which means that each accelerator unit effectively has full access to the memory. In real-world applications, this central arbiter would become a performance bottleneck and would consume significant resources as the number of accelerator units increases. Some alternatives were briefly mentioned in section 11.3.3. However, those methods mainly deal with generic memory access by multiple masters, typically a number of processor cores. The memory access patterns of each accelerator unit are neither random nor generic: each accelerator unit is controlled by a finite state machine that performs memory transactions at periodic intervals.

³ http://web.archive.org/web/20071006035159/http://asic.austriamicrosystems.com/databooks/digital/mc_dpram_c35.html

If this is taken into account, it is possible to design a memory interface that knows in advance when to process each transaction, resulting in a simpler design, as sketched below.
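A hedged sketch of such a schedule-driven interface is given below. It assumes, purely for illustration, that the accelerators can be phased so that each issues its transaction in its own fixed time slot; the interface then simply rotates a one-hot grant instead of arbitrating among asynchronous requests.

// Hedged sketch: a fixed-slot grant generator. Because each accelerator's
// state machine issues its memory transaction at a known, periodic point in
// its loop, the interface can rotate through the units on a fixed schedule
// rather than arbitrating on demand. Module and signal names are illustrative.
module slot_scheduler #(parameter N = 4) (
  input  wire         clk,
  input  wire         rst,
  output reg  [N-1:0] grant                       // one-hot: current bus owner
);
  always @(posedge clk)
    if (rst) grant <= {{(N-1){1'b0}}, 1'b1};      // start with unit 0
    else     grant <= {grant[N-2:0], grant[N-1]}; // rotate one slot per cycle
endmodule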

CHAPTER 12

Conclusion

This dissertation has proposed a solution to the search problem. Search is a fundamental problem in computing and, as computers increasingly invade our everyday lives, search is becoming an everyday problem for everyone. Historically, search has received less attention than other computing problems.

Search was first defined into different categories and characterised. In order to visualise the different components involved in a search, a novel search stack was developed. This stack links the different hardware and software components of a complex search operation together. It also serves to illustrate how search can be accelerated at different layers using alternative technologies. Furthermore, a generic search was broken down into a three-stage search pipeline. Each stage can then be individually accelerated by a different type of accelerator unit, as the stages are characterised by very different operations and problems.

The accelerator units form fundamental building blocks that are only capable of performing one task and performing it efficiently. They can be used on their own to offload some fundamental tasks from the host processor. The use of accelerator units gives added flexibility to the overall accelerator design. On top of these unit tasks, complex search acceleration can be built. The solution presented here is novel in that these accelerator units can be combined like LEGO bricks to solve various complex search problems. Different numbers and configurations of accelerator units can be used together to form various pipelines for performing different types of search, depending on the specific application.

In order to investigate the performance of these units, simulation was heavily used. Initially, a single iteration of a complex search simulation took days to run. The bulk of this time was consumed by the data set preparation process, which is O(N log N) bound. To speed this up, a novel simulation method was developed: the prepared simulation data was frozen onto a disk file using Verilog constructs and reused across multiple simulation runs.
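The exact constructs used are described in the body of the dissertation; purely as an illustration of the general freeze-and-reload idea, a testbench might dump the prepared data set with $writememh on the first run and reload it with $readmemh on subsequent runs. The array name, file name and plusarg below are illustrative assumptions.

// Hedged illustration only: the memory array, file name and plusarg are
// assumptions, not the dissertation's actual simulation environment.
module sim_freeze;
  reg [31:0] mem [0:1023];                 // behavioural model of the data set

  initial begin
    if ($test$plusargs("restore"))
      $readmemh("dataset.hex", mem);       // reload a previously frozen data set
    else begin
      // ... expensive O(N log N) data set preparation would run here ...
      $writememh("dataset.hex", mem);      // freeze it for later simulation runs
    end
  end
endmodule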

While the bulk of the work done relates to hardware design, a large part of it was software focused. A number of search kernels were written to compare the performance of hardware acceleration against pure software operation. These kernels were written in C++, exploiting the Standard Template Library (STL) for its optimised algorithms and data structures. The code was compiled with the optimising GCC compiler to produce compact and efficient code for testing purposes. As a result, the accelerators were shown to achieve a significant factor of hardware acceleration when compared to pure software solutions.

The chaser unit was designed to perform key search, which is a primary search and typically the first stage of any search pipeline. It is also a very common computer operation, used by any number of operations including result selection, insertion and deletion. A multi-key search can be accelerated by up to 3.43 times using the chaser unit compared to a pure software operation. However, it does not provide as significant an improvement when used for a single-key search.

The streamer unit was designed to offload the mundane list retrieval task, which is a supporting task used in different search applications. On its own, it does not speed up the operation when compared to a pure software operation. However, it works as an excellent offloader, extracting data values from fundamental data structures while freeing up the host processor for other tasks.

The sieve unit was designed to perform result collation, which is a secondary search task and typically the last stage of any secondary search pipeline. A number of these units can be combined to form different types of search operations, including list union and intersection. It is capable of accelerating secondary boolean queries by up to 5.2 times compared to a pure software operation. In addition to result collation, it can also be used to buffer and route results from other units.

While memory is a major search bottleneck, increasing the cache size has been shown to have little overall effect on performance. This method of increasing general-purpose microprocessor performance does not work as well for search applications, which can be easily understood from the ephemeral nature of search data. The results show that unless the cache is increased to a size matching the data set, there is little benefit in increasing it. To test a better way of constructing a cache, a structural cache, which exploits structural locality in addition to temporal and spatial locality, was developed. However, at small sizes, a structural cache only provides a small 3% boost in performance.

Therefore, there is little reason to integrate a structural cache unless its cost can be justified by the small increase in performance.

The accelerator units are a better solution than either inventing a whole new computing paradigm or a new microprocessor. Both of those alternatives, while unique, bring a whole host of other problems, including incompatibilities with present tools and platforms. The accelerators can be immediately integrated into existing computing platforms as an on-chip bridge, a co-processor or an I/O device. The accelerator units were designed for FPGA implementation; with mainstream microprocessor companies opening up their platforms to hybrid computation initiatives, this is a potentially easier path for the adoption of this technology. In addition, these units can also be targeted for ASIC implementation, which allows them to run at much higher clock speeds for a higher search throughput rate.

The accelerator design is also scalable. The units are designed to be simple and small in order to simplify implementation and reduce its cost. However, there is still room for improvement in resource usage: different parts of the design can be shared and conjoined to further reduce resource consumption. These optimisations are not dealt with directly here and are left for future work.

The most important end result of this programme of research has been the identification and development of a low-cost method of search acceleration. While using the accelerator is one possible way of accelerating search, there are many other ways of achieving acceleration. However, the solution presented in this dissertation has the advantage of being flexible, cheap and fast. It is flexible enough to be adapted for search applications and other potential uses, while still being small and simple enough to be integrated into existing designs at little extra cost.

Bibliography

[20t93] 20th International Symposium on Computer Architecture. A Case for Two-Way Skewed-Associative Caches, May 1993.

[AB05] Jeff Andrews and Nick Baker. Xbox360 system architecture. In Hot Chips, number 17, 2005.

[ABB+03] Dave Abrahams, Mike Ball, Walter Banks, Greg Colvin, Hiroshi Fukutomi, Lois Goldthwaite, Yenjo Han, John Hauser, Seiji Hayashida, Howard Hinnant, Brendan Kehoe, Robert Klarer, Jan Kristofferson, Dietmar Kühl, Jens Maurer, Fusako Mitsuhashi, Hiroshi Monden, Nathan Myers, Masaya Obata, Martin O'Riordan, Tom Plum, Dan Saks, Martin Sebor, Bill Seymour, Bjarne Stroustrup, Detlef Vollmann, and Willem Wakker. Technical report on c++ performance. Technical Report PDTR 18015, ISO/IEC, August 2003.

[ACS03] I. Arsovski, T. Chandler, and A. Sheikholeslami. A ternary content-addressable memory (tcam) based on 4t static storage and including a current-race sensing scheme. Solid-State Circuits, IEEE Journal of, 38(1):155–158, Jan 2003.

[ADS81] Sudhir K. Arora, S. R. Dumpala, and K. C. Smith. Wcrc: An ansi sparc machine architecture for data base management. In ISCA ’81: Proceedings of the 8th annual symposium on Computer Architecture, pages 373–387, Los Alamitos, CA, USA, 1981. IEEE Computer Society Press.

[Alt07] Altera, Inc. Cyclone III Device Handbook, July 2007.

[Art05] Arteris SA. A Comparison of Network-on-Chip Busses, 2005.

[ASN+99] Shinsuke Azuma, Takao Sakuma, Takashi Nakano, Takaaki Ando, and Kenji Shirai. High performance sort chip. In Hot Chips, number 11, 1999.

[Bab79] E. Babb. Implementing a relational database by means of specialized hardware. ACM Trans. Database Syst., 4(1):1–29, 1979.

[BDH03] Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The google cluster architecture. IEEE Micro, 23(2):22–28, March 2003.

[BGH+92] T. F. Bowen, G. Gopal, G. Herman, T. Hickey, K. C. Lee, W. H. Mansfield, J. Raitz, and A. Weinrib. The datacycle architecture. Commun. ACM, 35(12):71–81, 1992.

[Bor99] Borland/Inprise. Interbase 6.0 documentation, 1999.

[Bro04] Leo Brodie. Thinking Forth, chapter 4,6,7,8. Creative Commons, 2004.

[BTRS05] Florin Baboescu, Dean M. Tullsen, Grigore Rosu, and Sumeet Singh. A tree based router search engine architecture with single port memories. In ISCA ’05: Proceedings of the 32nd annual international symposium on Computer Architecture, pages 123–133, Washington, DC, USA, 2005. IEEE Computer Society.

[CHI+05] Scott Clark, Kent Haselhorst, Kerry Imming, John Irish, Dave Krolak, and Tolga Ozguner. Cell broadband engine interconnect and memory interface. In Hot Chips, number 17, 2005.

[CLRS01] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2001.

[CW08] Judy Chen and Fred Ware. The next generation of mobile memory. Presented at MEMCON’08, July 2008.

[DeW78] David J. DeWitt. Direct - a multiprocessor organization for supporting relational data base management systems. In ISCA ’78: Proceedings of the 5th annual symposium on Computer architecture, pages 182–189, New York, NY, USA, 1978. ACM.

[DG92] David DeWitt and Jim Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35(6):85–98, 1992.

[DGG+86] David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, and M. Muralikrishna. GAMMA — A high performance dataflow database machine. In Proceedings of the 12th International Conference on Very Large Data Bases, pages 228–237, 1986.

[Fer60] David E. Ferguson. Fibonaccian searching. Commun. ACM, 3(12):648, 1960.

[FFP+05] Daniel Fallmann, Helmut Fallmann, Andreas Pramböck, Horst Reiterer, Martin Schumacher, Thomas Steinmaurer, and Roland Wagner. Comparison of the enterprise functionality of open source database management systems, Apr 2005.

[FH05] Michael J. Flynn and Patrick Hung. Microprocessor design issues: Thoughts on the road ahead. IEEE Micro, 25(3):16–31, 2005.

[FK93] Shinya Fushimi and Masaru Kitsuregawa. Greo: a commercial database processor based on a pipelined hardware sorter. In SIGMOD ’93: Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 449–452, New York, NY, USA, 1993. ACM.

[FKS97] Terrence Fountain, Péter Kacsuk, and Dezső Sima. Advanced Computer Architectures: A Design Space Approach, chapter 10-18. Addison-Wesley, 1st edition, 1997.

[FKT86] Shinya Fushimi, Masaru Kitsuregawa, and Hidehiko Tanaka. An overview of the system software of a parallel relational database machine grace. In VLDB ’86: Proceedings of the 12th International Conference on Very Large Data Bases, pages 209–219, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc.

[Gen04] Paul Genua. A cache primer. Technical report, Freescale Semiconductor, October 2004.

[GG00] Pierre Guerrier and Alain Greiner. A generic architecture for on-chip packet-switched interconnections. In DATE '00: Proceedings of the conference on Design, automation and test in Europe, pages 250–256, New York, NY, USA, 2000. ACM.

[GHF+05] Michael Gschwind, Peter Hofstee, Brian Flachs, Martin Hopkins, Yukio Watanabe, and Takeshi Yamazaki. A novel simd architecture for the cell heterogeneous chip-multiprocessor. IBM, 2005.

[GLS73] George P. Copeland Jr., G. J. Lipovski, and Stanley Y. W. Su. The architecture of cassm: A cellular system for non-numeric processing. SIGARCH Comput. Archit. News, 2(4):121–128, 1973.

[GS04] Brian J. Gough and Richard M. Stallman. An Introduction to GCC, chapter 6. Network Theory Ltd., 2004.

[Han98] Jim Handy. The Cache Memory Book. Academic Press, 2nd edition, 1998.

[HB07] Simon Harding and Wolfgang Banzhaf. Fast genetic programming on GPUs. In Marc Ebner, Michael O'Neill, Anikó Ekárt, Leonardo Vanneschi, and Anna Isabel Esparcia-Alcázar, editors, Proceedings of the 10th European Conference on Genetic Programming, volume 4445 of Lecture Notes in Computer Science, pages 90–101, Valencia, Spain, 11 - 13 April 2007. Springer.

[Hea95] Steve Heath. Microprocessor Architectures RISC, CISC and DSP, chapter 8. Newnes, 2nd edition, 1995.

[Hil88] Mark D. Hill. A case for direct-mapped caches. Computer, 21(12):25–40, 1988.

[Hip07] D. Richard Hipp. The virtual database engine of sqlite, 2007.

[HLW87] Gary Herman, K. C. Lee, and Abel Weinrib. The datacycle architecture for very high throughput database systems. In SIGMOD ’87: Proceedings of the 1987 ACM SIGMOD international conference on Management of data, pages 97–103, New York, NY, USA, 1987. ACM.

[HP96] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, chapter 2,5,8,C,E. Morgan Kaufmann, 2nd edition, 1996.

[HS89] Mark D. Hill and Alan Jay Smith. Evaluating associativity in cpu caches. IEEE Transactions on Computers, 38(12):1612–1630, December 1989.

[Int07] International Business Machines Corporation. Power ISA Version 2.05, October 2007.

[ISH+91] U. Inoue, T. Satoh, H. Hayami, H. Takeda, T. Nakamura, and H. Fukuoka. Rinda: a relational database processor with hardware specialized for searching and sorting. Micro, IEEE, 11(6):61–70, Dec 1991.

[JED07] JEDEC Solid State Technology Association. JEDEC Standard: Specialty DDR2-1066 SDRAM, November 2007.

[JED08] JEDEC Solid State Technology Association. JEDEC Standard: DDR2 SDRAM Specification, April 2008.

[Jon05] M. Tim Jones. Optimization in gcc. http://www.linuxjournal.com/article/7269, January 2005.

[Kan81] Gerry Kane. 68000 Microprocessor Handbook. Osborne/McGraw-Hill, 1981.

[Kan87] Gerry Kane. MIPS R2000 RISC Architecture. Prentice Hall, 1987.

[KG05] Sen M. Kuo and Woon-Seng Gan. Digital Signal Processors: Architectures, Implementations, and Applications. Pearson Education Inc, 1 edition, 2005.

[Knu69] Donald E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1 of Computer Science and Information Processing. Addison-Wesley, 2nd edition, 1969.

[Knu73] Donald E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3 of Computer Science and Information Processing. Addison-Wesley, 1st edition, 1973.

[Knu81] Donald E. Knuth. The Art of Computer Programming: Seminumerical Algorithms, volume 2 of Computer Science and Information Processing. Addison-Wesley, 2nd edition, 1981.

[Koo89] Philip J. Koopman. Stack Computers: The New Wave, chapter 1-9,B-C. Ellis Horwood, 1989.

[Kor87] James F. Korsh. Data structures, algorithms, and program style. PWS Publishers, 1987.

[KTJR05] Rakesh Kumar, Dean M. Tullsen, Norman P. Jouppi, and Parthasarathy Ranganathan. Heterogeneous chip multiprocessors. Computer, 38(11):32–38, 2005.

[KY05] David Kaeli and Pen-Chung Yew, editors. Speculative Execution in High Performance Architectures. Computer and Information Science Series. Chapman & Hall, 2005.

[LaF06] Eric LaForest. Next generation stack computing, 2006.

[Lan07] Joe Landman. The need for acceleration technologies to achieve cost-effective supercomputing performance for advanced applications. Technical report, AMD, 2007.

[LB08] William B. Langdon and Wolfgang Banzhaf. A SIMD interpreter for genetic programming on GPU graphics cards. In Michael O'Neill, Leonardo Vanneschi, Steven Gustafson, Anna Isabel Esparcia Alcazar, Ivanoe De Falco, Antonio Della Cioppa, and Ernesto Tarantino, editors, Proceedings of the 11th European Conference on Genetic Programming, EuroGP 2008, volume 4971 of Lecture Notes in Computer Science, pages 73–85, Naples, 26-28 March 2008. Springer.

[LCM+06] Damjan Lampret, Chen-Min Chen, Marko Minar, Johan Rydberg, Matan Ziv-Av, Bob Gardner, Chris Ziomkowski, Greg McGary, Rohit Mathur, and Maria Bolado. OpenRISC 1000 Architecture Manual. OpenCores.Org, April 2006.

[LFM88] K. C. Lee, O. Frieder, and V. Mak. A parallel vlsi architecture for unformatted data processing. In DPDS '88: Proceedings of the first international symposium on Databases in parallel and distributed systems, pages 80–86, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press.

[Lin08] Joseph Lin. Rambus memory technologies update. www.rambus.com, June 2008.

[LSY02] Ruby Lee, Zhijie Shi, and Xiao Yang. How a processor can permutate n bits in o(1) cycles. In Hot Chips, number 14, 2002.

[LZJ06] ZhongHai Lu, MingChen Zhong, and Axel Jantsch. Evaluation of on-chip networks using deflection routing. In GLSVLSI ’06: Proceedings of the 16th ACM Great Lakes symposium on VLSI, pages 296–301, New York, NY, USA, 2006. ACM Press.

[MA06] MySQL-AB. Mysql 5.1 reference manual, Aug 2006.

[McC07] Ian McCallum. Intel quickassist technology accelerator abstraction layer (aal). Technical report, Intel, 2007.

[McF06] Grant McFarland. Microprocessor Design. McGraw-Hill, 2006.

[Mer08] Rick Merritt. Cpu designers debate multi-core future. EE-Times, February 2008.

[MHH02] Oskar Mencer, Zhining Huang, and Lorenz Huelsbergen. Hagar: Efficient multi-context graph processors. In 12th International Conference on Field-Programmable Logic and Applications, pages 915–924. Springer, 2002.

[Mil00] Veljko Milutinović. Surviving the Design of Microprocessor and Multiprocessor Systems. John Wiley & Sons Inc, 2000.

[MK04] Morris M. Mano and Charles R. Kime. Logic and Computer Design Fundamentals, chapter 9,14. Pearson Prentice-Hall, 3rd edition, 2004.

[NK04] Anna Nepomniaschaya and Zbigniew Kokosinski. Associative graph processor and its properties. In PARELEC '04: Proceedings of the international conference on Parallel Computing in Electrical Engineering, pages 297–302, Washington, DC, USA, 2004. IEEE Computer Society.

[NP08] Wolfgang Nejdl and Raluca Paiu. I know I stored it somewhere - contextual information and ranking on our desktop. 2008.

[Okl01] Vojin G. Oklobdzija. The Computer Engineering Handbook: Electrical Engineering Handbook. CRC Press, Inc., Boca Raton, FL, USA, 2001.

[Pay00] Bernd Paysan. A four stack processor, 2000.

[PDG05] PostgreSQL-Development-Group. Postgresql 8.1 documentation, 2005.

[Pel05] Stephen Pelc. Programming Forth, chapter 2,5. Microprocessor Engineering Limited, 2005.

[PH05] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, chapter 2,7,9,C,D. Morgan Kaufmann, 2005.

[Por] James N. Porter. Five decades of disk drive industry firsts. http://www.disktrend.com/5decades2.htm.

[PS06a] K. Pagiamtzis and A. Sheikholeslami. Content-addressable memory (cam) circuits and architectures: a tutorial and survey. Solid-State Circuits, IEEE Journal of, 41(3):712–727, March 2006.

[PS06b] Kostas Pagiamtzis and Ali Sheikholeslami. Content-addressable memory (CAM) circuits and architectures: A tutorial and survey. IEEE Journal of Solid-State Circuits, 41(3):712–727, March 2006.

[Rob78] David C. Roberts. A specialized computer architecture for text retrieval. In CAW ’78: Proceedings of the fourth workshop on Computer architecture for non-numeric processing, pages 51–59, New York, NY, USA, 1978. ACM.

[RSK04] Pamela Ravasio, Sissel Guttormsen Schär, and Helmut Krueger. In pursuit of desktop evolution: User problems and practices with modern desktop systems. ACM Trans. Comput.-Hum. Interact., 11(2):156–180, 2004.

[Sak02] Dan Saks. Representing and manipulating hardware in standard c and c++. Embedded Systems Conference San Francisco, 2002.

[SB88] Gerard Salton and Chris Buckley. Parallel text search methods. Communications of the ACM, 31(2):202–215, Feb 1988.

[SD02] Nick Sawyer and Marc Defossez. Quad-Port Memories in Virtex Devices. Xilinx Inc, September 2002. XAPP228.

[SF96] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms, chapter 5-8. Addison-Wesley, 1st edition, 1996.

[Shi06] Sajjan G. Shiva. Advanced Computer Architecture. Taylor & Francis, 1st edition, 2006.

[Sil02] Silicore and Opencores. WISHBONE System-on-Chip (SOC) Interconnect Architecture for Portable IP Cores, b3 edition, Sept 2002.

[SKV+06] David Sheldon, Rakesh Kumar, Frank Vahid, Dean Tullsen, and Roman Lysecky. Conjoining soft-core fpga processors. In ICCAD ’06: Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design, pages 694–701, New York, NY, USA, 2006. ACM.

[SL75] Stanley Y. W. Su and G. Jack Lipovski. Cassm: a cellular system for very large data bases. In VLDB '75: Proceedings of the 1st International Conference on Very Large Data Bases, pages 456–472, New York, NY, USA, 1975. ACM.

[SL95] Alexander Stepanov and Meng Lee. The standard template library. Technical Report 95-11, HP Laboratories, November 1995.

[Smy03] Bill Smyth. Computing Patterns in Strings. Pearson Addison-Wesley, 1st edition, 2003.

[SSTN03] Ilkka Saastamoinen, David Sigüenza-Tortosa, and Jari Nurmi. An ip-based on-chip packet-switched network. pages 193–213, 2003.

[Sta06] William Stallings. Computer Organization & Architecture: Designing for Performance, chapter 18. Pearson Prentice-Hall, 7th edition, 2006.

[Ste06] Alexander Stepanov. Short history of stl, August 2006.

[Sto90] Harold S. Stone. High-Performance Computer Architecture. Addison-Wesley, 2nd edition, 1990.

[Str94] Bjarne Stroustrup. The Design and Evolution of C++. Addison-Wesley Pub Co, March 1994.

[Sun06] Sun Microsystems, Inc. OpenSPARC T1 Specification, August 2006.

[Tan04] Shawn Tan. AEMB: 32-bit RISC Microprocessor Core Data Sheet. OpenCores.Org, 2004.

[Tan05] Andrew S. Tanenbaum. Structured Computer Organization. Pearson Prentice-Hall, 5th edition, 2005.

[TDB+06] Xuan-Tu Tran, Jean Durupt, François Bertrand, Vincent Beroulle, and Chantal Robach. A dft architecture for asynchronous networks-on-chip. In ETS '06: Proceedings of the Eleventh IEEE European Test Symposium, pages 219–224, Washington, DC, USA, 2006. IEEE Computer Society.

[TEL95] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithreading: maximizing on-chip parallelism. In ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture, pages 392–403, New York, NY, USA, 1995. ACM.

[van02] Ruud van der Pas. Memory hierarchy in cache-based systems. Technical report, Sun Microsystems, November 2002.

[Vir03] Virtual Silicon Inc. Virtual Silicon: 0.13um High Density Standard Cell Library, 1.2 edition, Aug 2003.

[Vir04] Virtual Silicon Inc. Virtual Silicon: 0.18um VIP Standard Cell Library Tape Out Ready, 1.0 edition, Jul 2004.

[Vit01] Jeffrey Scott Vitter. External memory algorithms and data structures: deal- ing with massive data. ACM Comput. Surv., 33(2):209–271, 2001.

[VST03a] Dual-Port SRAM Compiler UMC 0.13um (L130E-HS-FSG), June 2003.

[VST03b] Single-Port SRAM Compiler UMC 0.13um (L130E-HS-FSG), March 2003.

[VST03c] Two-Port SRAM Compiler UMC 0.13um (L130E-HS-FSG), June 2003.

[VST04] Single-Port SRAM Compiler UMC 0.18um (L180 GII), August 2004.

[Wal95] David W. Wall. Limits of instruction-level parallelism. pages 432–444, 1995.

[Wik09a] Wikipedia. Desktop search — wikipedia, the free encyclopedia, 2009. [Online; accessed 17-March-2009].

[Wik09b] Wikipedia. Non-uniform memory access — wikipedia, the free encyclopedia, 2009. [Online; accessed 17-March-2009].

[Wik09c] Wikipedia. Stored procedure — wikipedia, the free encyclopedia, 2009. [Online; accessed 17-March-2009].

[Wik09d] Wikipedia. — wikipedia, the free encyclopedia, 2009. [Online; accessed 17-March-2009].

[Woo08] Steven Woo. Memory system challenges in the multi-core era. Presented at MEMCON’08, July 2008.

[Xil04] Xilinx, Inc. Microblaze Processor Reference Guide: EDK6.2i, June 2004.

[Xil05] Xilinx Inc. Using Block RAM in Spartan3 Generation FPGAs, March 2005. XAPP463.

[Xil08] Xilinx, Inc. Spartan-3A FPGA Family: Data Sheet, April 2008.

[ZZ05] Zhichun Zhu and Zhao Zhang. A performance comparison of dram memory system optimizations for smt processors. In HPCA ’05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 213–224, Washington, DC, USA, 2005. IEEE Computer Society.

