Toward a Hardware Accelerated Future


A dissertation presented

by

Michael John Lyons

to

The School of Engineering and Applied Sciences

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in the subject of

Computer Science

Harvard University

Cambridge, Massachusetts

June 2013

© 2013 – Michael John Lyons
All rights reserved.

Dissertation advisors: David M. Brooks, Gu-Yeon Wei
Author: Michael John Lyons

Toward a Hardware Accelerated Future

Abstract

Hardware accelerators provide a rare opportunity to achieve orders-of-magnitude performance and power improvements with customized circuit designs.

Many forms of hardware acceleration exist; the attributes and trade-offs of each approach are discussed. Single-algorithm accelerators, which maximize efficiency gains through maximum specialization, are one such approach. By combining many of these into a many-accelerator system, high specialization is possible with fewer specialization limits.

The development of one such single-algorithm hardware accelerator, for managing compressed Bloom filters in wireless sensor networks, is presented. Results from the development of the accelerator highlight scalability deficiencies in the way accelerators are currently integrated into processors, and show that the majority of accelerator area is consumed by generic SRAM memory rather than algorithm-specific logic.

These results motivate development of the accelerator store, a system architecture designed for the needs of many-accelerator systems. In particular, the accelerator store improves inter-accelerator communication and includes support for sharing accelerator SRAM memories. Using a security application as an example, the accelerator store architecture is able to reduce total area by 30% with less than 1% performance overhead.

Using the accelerator store as a base, the ShrinkFit framework allows accelerators to grow and shrink, to achieve accelerated performance within small FPGA budgets and efficiently expand for more performance when larger FPGA budgets are available. The ability to resize accelerators is particularly useful for hybrid systems combining GP-CPUs and FPGA resources, in which applications may deploy accelerators to a shared FPGA fabric. ShrinkFit performance overheads for small and large FPGA resources are found to be low using a robotic bee brain workload and FPGA prototype.

Finally, future directions are briefly discussed along with details about the production of the robotic bee helicopter brain prototype.

Contents

Title i

Copyright ii

Abstract iii

Table of contents iv

List of figures ix

List of tables xii

Previous work xiii

Acknowledgments xiv

1 The potential for accelerated computing 1

1.1 Trends in computing ...... 2

1.1.1 The end of clock scaling ...... 2

1.1.2 The limits of multicore ...... 3

1.1.3 Hardware acceleration ...... 4

1.2 Accelerated architecture taxonomy ...... 5

1.2.1 DSP...... 6

1.2.2 Graphics Processing Unit (GPU) ...... 8

1.2.3 Specialized homogeneous multicore (SHM) ...... 10

1.2.4 Customized ISA (CISA) ...... 12

1.2.5 Single-ISA heterogeneous multicore (SIHM) ...... 14

1.2.6 Mobile SoC ...... 16

1.2.7 Reconfigurable computing ...... 19

1.3 Challenges...... 21

1.3.1 Rapid accelerator development ...... 22

1.3.2 Identifying algorithms to accelerate ...... 23

1.3.3 Minimizing accelerator area overheads ...... 23

1.3.4 Scaling accelerator systems ...... 24

2 Accelerator composition 26

2.1 Accelerator design requirements ...... 26

2.2 Power and performance optimization strategies ...... 28

2.3 Bloom Filter Algorithms ...... 29

2.3.1 Bloom filters ...... 29

2.3.2 Multiply and Shift Hashing ...... 31

2.3.3 Golomb-Rice Coding ...... 31

2.4 Accelerator Architecture ...... 32

2.4.1 Bloom Filter Memory ...... 33

2.4.2 Memory Data Controller ...... 33

2.4.3 Memory Address Controller ...... 34

2.4.4 Decompressor...... 34

2.4.5 Compressor...... 35

2.5 Accelerator Evaluation ...... 35

2.5.1 Item Insertion and Querying ...... 37

2.5.2 Compressing Bloom Filters ...... 39

2.5.3 Merging Compressed Bloom Filters ...... 40

2.6 Application Evaluations ...... 41

2.6.1 Mote Status ...... 42

2.6.2 Object Tracking ...... 44

2.6.3 Duplicate Packet Removal ...... 45

2.7 Takeaways ...... 45

3 Accelerator store 46

3.1 Accelerator Characterization ...... 48

3.1.1 Accelerator composition characterization ...... 49

3.1.2 Memory access pattern characterization ...... 50

3.1.3 Shared memory selection methodology ...... 53

3.2 Accelerator store design ...... 55

3.2.1 Accelerator store features ...... 56

3.2.2 Architecture of the accelerator store ...... 58

3.2.3 Distributed accelerator store architecture ...... 60

3.2.4 Accelerator/accelerator store interface ...... 61

3.2.5 Accelerator store software interface ...... 62

3.3 Accelerator Store Evaluation ...... 64

3.3.1 Accelerator-based system model ...... 64

3.3.2 Embedded application ...... 65

3.3.3 Server application ...... 74

3.4 Related work ...... 78

4 ShrinkFit 80

4.1 Motivation ...... 82

4.2 Conceptual approach ...... 84

4.2.1 Decomposition ...... 84

4.2.2 Building ShrinkFit accelerators with VMs ...... 85

4.2.3 Module contexts ...... 86

4.2.4 Accelerator resource sharing ...... 86

4.2.5 Dynamic accelerator resizing ...... 87

4.3 Framework implementation ...... 87

4.3.1 Accelerator store ...... 88

4.3.2 Slicer module ...... 89

4.3.3 ShrinkFit wrapper ...... 91

4.3.4 ShrinkFit framework area costs ...... 93

4.4 Software Development ...... 94

4.4.1 Decomposing accelerators ...... 95

4.4.2 Configure ShrinkFit hard logic blocks ...... 95

4.4.3 Shrinklib SDK ...... 95

4.5 ShrinkFit module evaluation ...... 96

4.5.1 ShrinkFit PM implementations ...... 97

4.5.2 Evaluation methodology ...... 99

4.5.3 PM performance scalability ...... 101

4.6 RoboBee application evaluation ...... 102

4.6.1 Application evaluation overview ...... 102

4.6.2 Bandwidth impact ...... 104

4.6.3 Buffering impact ...... 105

4.6.4 Hard logic block area overheads ...... 107

4.7 Related work ...... 108

5 Future directions 109

5.1 Accelerator store scalability ...... 109

5.1.1 Subset arbitration ...... 110

5.1.2 Multistage arbitration ...... 110

5.2 Unified system + AS memory ...... 111

5.3 Dynamic handle allocation ...... 111

5.4 ShrinkFit dynamic reprogramming ...... 112

A Helicopter brain prototype 113

A.1 Brain ...... 115

A.2 Helicopters ...... 116

A.3 Objectives...... 117

A.4 System Architecture ...... 118

A.5 HBP Connectors ...... 119

A.5.1 Connection to mainboard ...... 119

A.5.2 Connection to optical flow sensor ring ...... 120

A.5.3 JTAG ...... 122

A.5.4 I2C ...... 122

A.5.5 SPI ...... 122

A.5.6 GPIO ...... 123

A.6 Components...... 123

A.6.1 FPGA...... 124

A.6.2 Flash memory ...... 125

A.6.3 1.0V+3.3V Buck Converter ...... 126

A.6.4 4.7V Boost+Buck Converter ...... 127

A.6.5 ADC...... 127

A.6.6 100 MHz Oscillator ...... 127

A.6.7 External IO pins ...... 127

A.6.8 PCB...... 128

A.7 Helicopter brain prototype implementation ...... 129

Bibliography 141

List of Figures

2.1 Bloom filter hardware accelerator hardware flow ...... 32

2.2 Decompressor information flow ...... 34

2.3 Placed-and-routed Bloom filter accelerator design ...... 36

2.4 Average power usage of Bloom filter hardware accelerator modules ...... 37

2.5 Item insertion times of application-specific hardware design logic and general purpose design logic ...... 38

2.6 Bloom filter reading and merging delay at a 1% false positive rate ...... 40

2.7 Storage cost per item for a 16KB Bloom filter and 1% false positive rate ...... 43

3.1 Comparison of architecture styles ...... 47

3.2 Total memory bandwidth utilization of several accelerators ...... 51

3.3 Accelerator SRAM memories sorted by memory size per bandwidth ...... 54

3.4 Accelerator store system architecture ...... 56

3.5 Embedded application accelerator activity ...... 66

3.6 Embedded application “top 20 to share” memory bandwidth ...... 67

3.7 Embedded application contention performance overhead ...... 68

3.8 Distributed AS architecture ...... 69

3.9 Embedded application access latency and contention performance overhead . . . . . 70

3.10 Embedded app power breakdown ...... 72

3.11 Embedded app area reduction ...... 73

3.12 JPEG server application performance overhead ...... 75

3.13 Comparison of architecture styles ...... 76

4.1 RoboBee Brain FPGA prototype ...... 83

4.2 RoboBee application accelerators ...... 83

4.3 ShrinkFit framework architecture ...... 88

4.4 ShrinkFit wrapper state machine ...... 91

4.5 ShrinkFit wrapper context handle structure ...... 92

4.6 ShrinkFit hard logic block die area overheads ...... 93

4.7 RoboBee application decomposed into VMs ...... 94

4.8 Single PM design resource-to-performance trade-off ...... 98

4.9 RoboBee application resource-to-performance trade-off bandwidth impact ...... 103

4.10 RoboBee application resource-to-performance trade-off buffering impact ...... 106

A.1 Helicopter brain prototype attached to helicopter and optical flow camera ...... 113

A.2 Helicopter brain prototype front ...... 114

A.3 Helicopter brain prototype back ...... 114

A.4 Optical flow software and hardware accelerator energy consumption ...... 115

A.5 MCX2 Toy Helicopter ...... 116

A.6 Original CentEye system architecture ...... 118

A.7 HBP system architecture ...... 119

A.8 Interface to mainboard ...... 120

A.9 Interface to optical flow sensor ring ...... 121

A.10 I2C Interface ...... 122

A.11 SPI Interface ...... 123

A.12 HBP Architecture ...... 124

A.13 XC6-SLX150 FPGA power consumption ...... 125

A.14 Helicopter battery life with XC6-SLX150 ...... 126

A.15 Helicopter brain prototype bill of materials (BOM) ...... 129

A.16 Helicopter brain prototype schematic: FPGA configuration and flash ...... 130

A.17 Helicopter brain prototype schematic: FPGA clock, I2C and SPI interfaces . . . . . 131

A.18 Helicopter brain prototype schematic: GPIOs and helicopter mainboard interface . . 132

A.19 Helicopter brain prototype schematic: FPGA voltage regulation ...... 133

A.20 Helicopter brain prototype schematic: optical flow camera interface ...... 134

A.21 Helicopter brain prototype schematic: FPGA power connections ...... 135

A.22 Helicopter brain prototype layout: all layers ...... 136

A.23 Helicopter brain prototype layout: front (top) layer ...... 137

A.24 Helicopter brain prototype layout: back (bottom) layer ...... 138

A.25 Helicopter brain prototype components: front ...... 139

A.26 Helicopter brain prototype components: back ...... 140

List of Tables

1.1 Summary of analyzed accelerated architectures ...... 6

2.1 Bloom filter configurations ...... 29

3.1 Accelerators studied ...... 48

3.2 Example handle table layout ...... 58

4.1 PM FPGA resource overheads ...... 99

4.2 PM compute logic block processing delays ...... 100

A.1 XC6-SLX150 resource utilization for RoboBee brain ...... 124

Previous work

Portions of this dissertation appeared in the following:

S. Chang, A. Kirsch, and M. Lyons. Energy and storage reduction in data intensive wireless sensor network applications. Technical Report TR-15-07, Harvard University, 2007.

M. J. Lyons and D. Brooks. The design of a bloom filter hardware accelerator for ultra low power systems. ISLPED, Aug. 2009.

M. J. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks. The accelerator store framework for high-performance, low-power accelerator-based systems. IEEE CAL, 9(2):53–56, 2010.

M. J. Lyons, D. Brooks, and G.-Y. Wei. The accelerator store: A shared memory framework for accelerator-based systems. ACM TACO, 8(4):1–22, 2012.

M. Lyons, G. Wei, and D. Brooks. Shrink-fit: A framework for flexible accelerator sizing. IEEE CAL, PP(99), 2012.

M. Lyons, G. Wei, and D. Brooks. The ShrinkFit Acceleration Framework: Simplifying Multi-Accelerator System Development. Under review.

Acknowledgements

The past seven years have hardly been spent alone. I’m thankful for the many people who have been with me on this journey.

I would like to thank my family, who have shared my ups and downs many times over. To my parents, thank you for your unwavering trust, advice, and love. To my brother and sister, thank you for being there for me, with plenty of cake. And to the rest of my family, thanks for your support and understanding throughout the many Thanksgivings where I “just had to code up one more thing.” I couldn’t have gotten this far without your love and patience.

I must thank my friends in Boston, who have kept me grounded each day. Nathan and Hallie, your friendship means more than words or ice cream can express. Sasha and Mike, thank you for the tapir-positive company, over waffles and espressos, for the past ten years. Jud and Kevin, thank you for hearing me out, hanging out, and goading me on day after day.

I certainly couldn't do this without the support of the lab. David and Gu, thank you so much for your years of guidance and advice. Mark, Ben, and Ankur, thank you for leading the way and helping me find my feet when I was new to graduate school. Glenn, thanks for your patient help at all hours! And thank you to my labmates for countless hours of discussion, opinion, and tolerance. And good luck to those of you who are still working toward your degree and those who are just starting. If you ever need anything, don't hesitate to ask.

I’d also like to thank my friends at Dropbox, who helped me power through to the end. In particular, I’d like to thank Tido, Alex, Sujay, and Arash, who gave me the energy for my last push.

And finally, I’d like to mention two people who I met while in the midst of my PhD. Nora and Ben, you fill me with hope, and I look forward to all that you will accomplish in the future.

This work is dedicated to Ben and Nora

Chapter 1

The potential for accelerated computing

For many years, the computing industry mined clock scaling for consistent performance improvements. Eventually, rising power demands put an end to clock scaling, forcing the industry to switch to multicore architectures. This pivot has not been a silver bullet: multicore performance improvements have failed to keep pace with previous clock scaling gains. The computing industry needs a third approach, not just to regain, but to exceed the performance improvements once taken for granted. A growing body of research suggests that approach is hardware acceleration.

Hardware accelerators (circuits customized for a specific workload or class of workloads) offer many advantages over multicore approaches. First, acceleration can boost performance up to 100x-1000x [30, 33], far beyond multicore's ideal linear increases. Second, acceleration is not dependent on parallelized code and offers significant single-threaded speedups. Third, acceleration performance is not threatened by hard power budgets and the "dark silicon" problem, which will force greater portions of processors to be unused.

By specializing for a specific workload, accelerators can realize orders-of-magnitude performance improvements. This specialization is a double-edged sword, however, and results in logic that can only perform its specialized task. The trade-off between performance and workload flexibility, as well as scalability, design complexity, and production cost, distinguishes the many types of accelerated computer architectures. No single accelerated architecture is best for all needs; several approaches exist, each with unique trade-offs. This chapter surveys many of these accelerated architecture families, and examines the benefits and trade-offs of each.

This chapter continues with Section 1.1, which discusses the computer industry's previous sources of performance improvements, and describes why acceleration is poised to play a significant part in the future. Section 1.2 analyzes many popular approaches to acceleration in research and in production. Section 1.3 discusses current and upcoming research challenges, critical to realizing the accelerated computing vision.

1.1 Trends in computing

Decades of computing have produced countless innovations. The computing industry once depended on process scaling to simultaneously improve performance through faster clock speeds and keep power consumption manageable. When threshold voltage scaling slowed and increased clock speeds incurred prohibitive amounts of power, the computing industry looked to multicore approaches for an alternative source of performance increases. Although the performance of individual cores does not improve much, the increasing core counts on offer provide additional performance improvements.

However, multicore performance improvements have not kept pace with the performance improvements once offered by clock scaling. A third approach, hardware acceleration, is necessary.

This section explains how the computing industry got here, why current multicore approaches do not keep pace with necessary performance improvements, and how acceleration can meet or exceed performance improvements once taken for granted.

1.1.1 The end of clock scaling

Moore's Law, which states that the size of a transistor shrinks by half roughly every eighteen months, has been the cornerstone of processor improvements for most of the industry's existence. Performance improved when process technologies shrank, since each transistor became faster and more transistors could fit on the same piece of silicon. Threshold voltages also shrank with successive process technologies, keeping power budgets from exploding despite faster clock speeds and doubling transistor counts.

Unfortunately, performance improvements ran dry near the turn of the century. Although process technologies continued to shrink, threshold voltage scaling slowed. Clock speeds could continue scaling upward, but the power cost would exceed 100W, beyond which operation and cooling costs become too great [11].
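
The connection between clock speed, supply voltage, and power follows from the dynamic power relation

    P_{dyn} \approx \alpha C V_{dd}^{2} f .

As a rough, illustrative calculation (the voltages and frequencies below are assumptions, not measurements of any particular processor): raising f from 2 GHz to 3 GHz at a fixed V_dd raises dynamic power by 1.5x, whereas the older pattern of pairing the same frequency increase with a V_dd reduction from 1.2 V to 1.0 V yields a net change of only 1.5 x (1.0/1.2)^2, or roughly 1.04x. Once threshold voltages stop falling, V_dd can no longer drop, and frequency increases translate directly into power increases.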

1.1.2 The limits of multicore

Although process technology scaling ceased to provide significant threshold voltage reductions, process scaling continued to offer more transistors for the same piece of silicon. To make the most of the increasing transistor counts, multicore architectures utilizing multiple copies of the same cores gained favor. Ideally, processor performance increases linearly with each additional core.

However, several impediments dampen the performance increases once common with clock scaling:

Parallelism. Most software cannot take advantage of multicore architectures automatically, but instead must be parallelized to use multiple cores at once. In many workloads, the additional work required to parallelize software is difficult, or parallelization will not improve performance due to multithreading overheads. Even for most parallelized software, there is a limit to parallelism, after which performance suffers from diminishing returns. For many workloads, there is a point past which more cores will not substantially increase performance (a worked example follows this list).

Energy. Despite curbing clock frequency increases, multicore architectures are beginning to hit energy limits again. Due to the slowing of threshold voltage reductions and the exponential increase in transistors, some multicore processors will not be able to utilize all of their logic simultaneously. Up to 91% of some processors may need to be turned off by the 11nm node, resulting in large amounts of unused logic called "dark silicon" [56].

Slow clocks. Before multicore approaches, successive processor designs utilized both increased transistor budgets and faster clock speeds. The additional transistors were used for more complicated logic, such as deeper pipelines, more complex branch predictors, and so on. Although multicore architectures make use of increasing transistor counts, multicore's performance gains from increased transistors (1.35x) cannot keep up with voltage-scaling-era performance improvements from faster clock speeds and more transistors (1.58x) [35].
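
The parallelism limit above can be made concrete with Amdahl's law; the 90% parallel fraction used below is an illustrative assumption, not a measured workload. If a fraction p of a program's runtime can be spread across N cores, the overall speedup is bounded by

    S(N) = \frac{1}{(1 - p) + p/N} ,

so a workload that is 90% parallel (p = 0.9) can never exceed a 10x speedup, and growing from 16 to 64 cores only improves it from roughly 6.4x to 8.8x. Quadrupling the core count, and the silicon spent on cores, buys less than a 40% gain.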

1.1.3 Hardware acceleration

Hardware acceleration (logic optimized for specific workloads) offers an alternative source of performance improvements. Rather than depending on faster transistors or more transistors, hardware acceleration uses transistors more efficiently for certain tasks. As a result, hardware acceleration does not suffer from the multicore problems mentioned above. Accelerators can provide up to and beyond 100x performance improvements [29, 33] without relying on faster clock rates or parallelized software. In addition, performance improvements can be exchanged in part or in full for reduced energy through voltage and clock scaling.

Hardware acceleration typically relies on three types of optimizations to improve performance, energy, and power:

Private memories. Accelerators often contain private memories, which are only used by that accelerator. This allows the accelerator to access memory without contention and to use wider memories for additional memory bandwidth.

Custom operations. Support for custom operations allows accelerators to do complex calculations quickly, rather than relying on a general purpose (GP) processor to execute tens or hundreds of instructions (a brief software sketch follows this list).

Fine-grained parallelism. Accelerator hardware can be explicitly designed to do multiple calculations and access multiple private memories without the rigidity of SIMD or the overheads of multithreading.
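
To make the custom-operation point concrete, consider counting the set bits in a word, an operation that shows up in Bloom filter maintenance among other tasks. The sketch below is illustrative only: the function name is hypothetical, the software version compiles to tens of instructions per word on a GP processor, and the closing comment describes what a customized datapath would do instead.

    #include <stdint.h>

    /* Software population count: roughly four instructions per loop
     * iteration, executed up to 32 times per word on a GP-CPU. */
    static int popcount_sw(uint32_t word)
    {
        int count = 0;
        while (word != 0) {
            count += word & 1u;   /* test the low bit        */
            word >>= 1;           /* shift the next bit down */
        }
        return count;
    }

    /* An accelerator with a dedicated 32-bit popcount tree produces the
     * same result as one operation, every cycle, over a word fetched
     * from its own private memory. */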

Dark silicon remains when using accelerators, but is not problematic for many accelerated systems. Each accelerator may only be used for a certain task and turned off at other times. For example, many devices currently use hardware acceleration to decode audio files, but only turn the accelerator on when playing music. Under this accelerated approach, only a small portion of the accelerators will be on at a time, so large portions of the chip silicon are already dark and will not require shutting down active regions of the processor.

The biggest downside to hardware acceleration is the trade-off between customization and workload flexibility. Greater acceleration is usually possible with more specialization, but this trade-off means that the accelerator will be usable for fewer types or classes of workloads. Unlike general purpose CPUs (GP-CPUs), which support any workload, hardware acceleration only targets a smaller subset. Therefore, acceleration is well suited for common workloads that will realize large performance, energy, or power benefits, but not for occasional tasks.

Accelerators offer the potential to regain or even exceed past performance improvements, but there is little agreement on how to build accelerated systems. Several accelerated architectures are used commercially, and research efforts produce new approaches regularly. This chapter surveys proven and proposed architectures, and identifies several promising directions for accelerated architecture research.

1.2 Accelerated architecture taxonomy

Accelerated architectures come in many forms. This section discusses several approaches for integrating acceleration into computing architectures. This discussion includes an overview of each accelerated approach and discusses six evaluation attributes:

Execution improvement considers performance, power, and energy improvements. The three improvements are closely related because performance speedups can be traded off for power and energy reductions by reducing clock frequency and supply voltage.

Customization describes the design variation within the architecture. This attribute spans from completely homogeneous designs supporting many workloads and consisting of several copies of the same core, to completely heterogeneous architectures containing many unique cores, each designed for a narrow set of workloads. Customization also reflects the portion of the processor expected to be active at any given time. Distributing work across multiple cores is much easier in homogeneous designs since each core can process each workload with equal skill. Heterogeneous systems tend to utilize less of the processor at any given time because only certain cores are capable of performing a particular workload. Due to growing power density and the dark silicon problem, reduced utilization may not be problematic.

Scalability reflects the architecture's ability to increase compute logic. The architecture's interconnect, which connects different components within the system, is particularly crucial for architecture scalability.

Table 1.1: Summary of analyzed accelerated architectures

Execution improvement is expressed in orders of magnitude. Reconfigurable computing design complexity is both low (FPGA fabric design) and high (logic implemented in FPGA).

Architecture                        Execution    Customization  Scalability  Design      Production  Target
                                    improvement                              complexity  cost        audience
Digital signal processor            10x          Low            Medium       Low         Low         Medium
Graphics processing unit            100x         Low            High         Medium      High        Medium
Specialized homogeneous multicore   100x         Medium         High         Medium      Medium      Medium
Customized ISA                      10x          Medium         High         Low         Low         Low
Single-ISA                          1x           Low            Medium       Low         Low         High
System-on-Chip                      1000x        High           Low          High        Medium      High
Reconfigurable computing            10x          High           High         Low/high    High        Low

Design complexity describes the additional design complexities created by the accelerated architecture. This includes accelerator and system hardware design complexities as well as operating system and application software design complexities.

Production cost considers the additional monetary cost to fabricate a processor designed with the accelerated architecture. This cost is highly correlated with silicon area.

Target audience indicates the audience size for a processor designed with the architecture. An architecture that is more flexible to varying workloads can target a wider audience than an architecture targeting a specific workload.

Table 1.1 summarizes these six attributes for each of the accelerator architectures discussed in the remainder of this section.

1.2.1 DSP

Digital signal processors (DSPs) are modified to efficiently analyze signals. These modifications provide:

Parallelism through widened VLIW or SIMD instructions.

Tight looping to keep the pipeline full for repetitive tasks.

Customized ALU support for common DSP operations including fast multiplication.

Multiple memory accesses in the same cycle, enabling common multiply operations every cycle [24].

When writing DSP software, designers first use the C programming language to quickly develop code and remain close to the processor’s execution semantics (pointers, heaps, stacks). Python, Java, and other higher level languages might result in faster software development, but would create too much abstraction between the programmer and the DSP hardware. In contrast, assembly language gives the programmer full access to the DSP’s hardware, but slows software development to a crawl.

DSP code optimizations are often too complex for C compilers. These optimizations require absolute control over code design, and are performed in assembly language. DSP designers will first profile their program to identify commonly run code, known as "hotspots." These hotspots run the most often, and through optimization offer the biggest performance improvements.

DSP programmers implement these optimizations by replacing hotspot C code with hand-tuned assembly code. To write optimized assembly code, DSP programmers must have a deep knowledge of the DSP processor's capabilities and timing. This knowledge usually only applies to specific DSP processors, and limits software platform independence.
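
As a concrete illustration (a generic sketch, not code from any particular DSP vendor's toolchain), the fixed-point dot product below is the kind of hotspot that profiling would flag for hand optimization. A DSP's dual memory ports, single-cycle multiply-accumulate, and zero-overhead looping let each iteration complete in roughly one cycle, while a GP-CPU spends several instructions per iteration on the same loop.

    #include <stdint.h>

    /* Q15 fixed-point dot product: a typical DSP hotspot.  The two
     * loads, the multiply-accumulate, and the loop bookkeeping all map
     * onto dedicated DSP hardware; hand-tuned assembly exposes that
     * mapping when the C compiler cannot. */
    static int32_t dot_q15(const int16_t *a, const int16_t *b, int n)
    {
        int32_t acc = 0;
        for (int i = 0; i < n; i++) {
            acc += (int32_t)a[i] * (int32_t)b[i];  /* multiply-accumulate */
        }
        return acc;
    }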

DSP processors are available from several vendors, including Analog Devices, Texas Instruments (TI), and STMicro. DSP cores are also integrated into some mobile processors, such as TI's OMAP family [7].

Execution improvement

DSP processors are known to improve performance on the order of 10x when compared to GP-CPUs [69]. Although DSPs do not typically implement any unique power saving features, DSP performance improvements can be traded off for reduced power and energy consumption via frequency and voltage scaling.

Customization

Although DSPs feature a fully functional ISA, they are optimized for algorithms making frequent use of multiplication in tight loops. This makes DSPs an excellent choice for specific workloads, such as audio and video media processing. DSPs are highly homogeneous, and share much with GP-CPU design. As a result, most if not all of the DSP's transistors are utilized when the processor is active. The biggest distinguishing factors of DSPs are multiply optimization in ALUs and parallelized (VLIW, SIMD) instructions.

Scalability

Although a single DSP could scale the number of ALUs upward, the complexity of such an approach would quickly eliminate performance gains. Instead, DSP cores could be replicated, much like general-purpose homogeneous multicore processors. This approach would closely resemble graphics processing unit (GPU) architectures.

Design complexity

DSP architecture complexity is comparable to that of general purpose microprocessors. The major differences include wider instructions, a greater emphasis on looping, multiple memory ports, and additional ALUs optimized for multiplication.

Software development is more complicated, since it requires a detailed knowledge of the DSP timing behavior and optimization in assembly language. Due to the need for applications to run "close to hardware," DSP operating systems typically resemble real time operating system (RTOS) kernels [6] and are simple compared to common GP-CPU operating systems.

Production cost

DSPs are roughly the same size as corresponding GP-CPUs. The customizations do not add much to the area cost.

Target market

DSPs are typically limited to signal processing applications. Popular signal processing applications include audio, video, wireless, and speech analysis.

1.2.2 Graphics Processing Unit (GPU)

Graphics processing units typically contain more than 100 simple, identical cores. These cores tend to be good at processing data in easily divided, parallelizable datasets. Each core has limited memory bandwidth, and applications must take care to limit the bandwidth requirements of each core. Cores are designed to efficiently perform single precision floating point calculations, which are common in graphics processing. Software is most efficient when identical copies of the same code are run on each core, processing different data sets, via SIMD. GPU core architecture is as simple as possible, and struggles with control flow-heavy workloads.

GPUs use multiple memory levels, similar to memory caching levels on GP-CPUs. At the highest level, GPUs typically provide a shared cache, which is visible to all cores in the GPU. This shared structure can quickly become a bottleneck if overused, so efficient GPU applications tend to rely on local caches (visible to one core) for most memory needs. GPU programming requires explicit control over data transfers to distribute work to these local caches, and to recover the results of computation afterward.
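
The programming model can be sketched in plain C (an illustration only; a real implementation would use a toolkit such as CUDA or OpenCL together with explicit device transfers). The loop below has no dependencies between iterations, which is exactly the property that lets each iteration become one GPU thread under SIMD, while the input and output arrays are the data that must be explicitly copied to and from per-core local memories.

    #include <stddef.h>
    #include <stdint.h>

    /* Brightness scaling over an image: every iteration is independent,
     * so the loop maps naturally onto thousands of GPU threads.  On a
     * GPU, `in` and `out` would sit in device memory, filled and drained
     * by explicit host-device transfers scheduled by the application. */
    static void scale_pixels(const uint8_t *in, uint8_t *out, size_t n,
                             int num, int den)
    {
        for (size_t i = 0; i < n; i++) {
            int v = (in[i] * num) / den;            /* per-pixel work only   */
            out[i] = (uint8_t)(v > 255 ? 255 : v);  /* saturate to 8 bits    */
        }
    }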

GPUs are most commonly used for rendering graphics for video games, and are predominantly manufactured by ATI and Nvidia. Software toolkits such as OpenCL and CUDA have been instrumental for accelerating scientific domain workloads.

Execution improvement

Performance improvements of 50x are typical when porting GP-CPU workloads to GPUs, and acceleration of 400x has been reported [68]. Of course, these improvements apply only to workloads well suited to SIMD parallelization and with low per-core memory bandwidth needs.

GPUs are typically performance focused, and currently are less concerned with energy or power savings. Power-saving approaches, such as multiple voltage domains or sub-second dynamic voltage and frequency scaling (DVFS), are limited to research and not yet commercially available.

Customization

GPUs are homogeneous, consisting of hundreds of copies of the same simple core. For this reason, all of the cores are utilized when the GPU is active.

Scalability

GPUs have already demonstrated the ability to scale into the hundreds of cores. The least scalable aspect of this architecture is the global shared memory interconnect. As the number of cores continues scaling upward, the per-core bandwidth to the global memory will proportionally decrease and become a tighter bottleneck. In addition, data must be copied between private GPU core memories and the computer's DRAMs over the PCI bus. PCI and DRAM bandwidth limitations may also limit GPU performance in the future.

Design complexity

Although each GPU contains hundreds of cores, each core is an identical copy, simplifying hardware architecture design. Software design becomes more complicated, however, since workloads must be parallelized, memory bandwidth requirements minimized, and data transfers carefully scheduled for hundreds of cores. Some of these complexities are lessened by toolkits, such as OpenCL and CUDA.

Production cost

GPU silicon area can be large, and some designs consume 3 billion transistors [26]. Roughly half of GPU area is used for cores, and the remaining half is used for graphics-specific logic.

Target market

GPUs are limited to easily parallelized workloads with low memory transfer needs. The largest markets with these workload characteristics include video game graphics and domain science.

1.2.3 Specialized homogeneous multicore (SHM)

Specialized homogeneous multicore architectures have much in common with computers utilizing a GP-CPU and GPU. In both approaches, one general purpose CPU controls a large number of identical, simple cores. The SHM architecture replaces graphics rendering acceleration with acceleration for other workloads:

Anton uses acceleration for molecular dynamics to accelerate drug discovery simulations by 100x [67].

Rigel uses a hierarchically connected set of 1000 simple, GPU-like cores to accelerate multiple algorithms with performance improvements on the order of 100x [42].

SODA uses four DSP-like cores with software defined radio (SDR) customizations to improve performance and power by roughly 100x [47]. Additional SDR customizations in the Scotch architecture update resulted in additional improvements [76].

Cell uses eight "SPE" floating-point, SIMD cores to accelerate media, gaming, and scientific workloads by roughly 200x [60].

Execution improvement

Typical performance and power improvements using the SHM architecture are in the range of 100-200x, as the examples above illustrate. Many architectures, specifically Rigel and SODA, trade off performance improvements to lower power consumption.

Customization

Like GPUs, SHMs depend on limited memory transfer bandwidth and SIMD-friendly workloads.

Some SHM processors are limited to certain domains (Anton, SODA), and others widen workload support by supporting more generic workloads (Rigel, Cell). Market demand for these processors ranges from the very limited molecular dynamics audience targeted by Anton to the wide consumer market for the Cell processor. Although SHM processors contain two types of cores (the GP-CPU controller and the highly replicated compute cores), almost all work is done by the replicated cores, which are identical. In this sense, SHM processors are highly homogeneous, so work can be freely distributed between compute cores and sustain high silicon utilization across the SHM processor when active.

Scalability

SHM design scalability varies anywhere from four to over 1000 cores. Designs with few cores use simple, flat interconnects. The shared bus and ring topologies utilized by SODA and Cell, respectively, do not scale effectively in the hundred-core regime. However, hierarchical interconnects of clusters, like those used by Anton and Rigel, offer scalability into hundreds and thousands of cores.

Design complexity

SHM architecture design complexity is on the same order as that of a GPU design. The bulk of the work remains in the design of the compute cores, which can be quite complicated when using larger cores such as Cell and SODA. SHMs utilizing greater numbers of simpler cores also require the additional system design complexity of a scalable interconnect. The controller GP-CPU is unlikely to add much hardware complexity, since most designs use off-the-shelf implementations, including ARM or PowerPC designs.

Software design complexity is also similar to that of GPUs. The programmer must support two types of processors and must also partition workloads and processing techniques to execute well on multiple, memory-bandwidth-limited cores. There may be additional complexities implementing compiler tools for the computation cores, as they are custom designed.

Production cost

SHM production cost can be quite varied, depending on the size and number of computation cores.

The Cell processor contains roughly 300M transistors, though many-core designs such as Anton and Rigel may reflect similar area attributes due to the large number of cores on chip.

Target market

The size of the target market for SHM processors varies highly from design to design. On one end of the spectrum, Anton is highly customized and useful only to the small molecular dynamics audience. On the other end of the spectrum, the Cell processor is widely used for the PlayStation 3 and scientific compute clusters.

1.2.4 Customized ISA (CISA)

The customized ISA approach integrates specialization into the GP-CPU, rather than creating distinct accelerator cores. This approach results in new instructions that are limited to finer-grained acceleration at the instruction level. Of course, the quantity of additional instructions is limited by the instruction bit width, especially for 32-bit ISAs. Due to the fine-grained nature of the customized ISA approach, adding software support for customized instructions is mostly automatic, courtesy of the compiler.

Tensilica's Xtensa processor kit is widely used for the development of customized 32-bit RISC processors and corresponding software toolchains [27]. Xtensa-based processors have been used in well-known projects, including Hameed et al. [30] and Anton. The Conservation Cores architecture takes customization one step further by automatically analyzing source code, identifying kernels ripe for acceleration, and synthesizing customizations for the core [72].
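
A small, hypothetical illustration of instruction-level customization (not taken from the Tensilica toolflow or any shipping ISA): saturating addition, which portable C expresses as an add followed by compares and selects. A customized core can expose the whole pattern as a single instruction, and a retargeted compiler can recognize the idiom and emit that instruction automatically when existing source is recompiled.

    #include <stdint.h>

    /* Saturating 16-bit add in portable C: an add, two compares, and two
     * conditional assignments on a stock ISA.  A customized ISA can
     * provide this as one instruction that the modified compiler pattern
     * matches, so applications gain the speedup simply by recompiling. */
    static int16_t sat_add16(int16_t a, int16_t b)
    {
        int32_t sum = (int32_t)a + (int32_t)b;
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        return (int16_t)sum;
    }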

Execution improvement

Tensilica reports performance improvements of 4-72x. As with other architectures, performance enhancements can be substituted for reduced energy and power consumption.

Customization

The customized ISA architecture approach is moderately flexible. Unlike the coarse-grained accelerator blocks found in SoCs, CISA's acceleration is smaller, at the instruction level. As a result, any algorithm that can benefit from the single instruction's acceleration can benefit from CISA performance improvement. However, custom instructions may still only apply to a small percentage of common computing tasks and are hardly "general purpose."

Because acceleration is built into a GP-CPU, much of the customized core consists of general purpose logic. From this perspective, the percentage of the processor spent on customization is quite low, and often only consists of a few thousand additional gates. Also note that using a CISA core to implement SHM cores, as Anton does, will result in a mostly homogeneous collection of the same replicated CISA core.

Scalability

Scalability of customized ISA architectures can be considered both in terms of custom instructions per core and in terms of multicore scalability. Scaling custom instructions within the core is currently limited by the instruction set size, although this problem will lessen with a transition to 64-bit ISAs. Although a CISA core could contain hundreds of customizations, only one customized instruction can be executed at a time. For this reason, there is little reason to scale the number of customizations in a single core.

An alternative technique is to use multiple CISA core designs. Current approaches have synthesized a homogeneous collection of the same CISA core on a single processor. This approach has proven to scale into the thousands of cores using hierarchical interconnects. A heterogeneous collection of CISA cores could also be used to increase the number of parallel accelerated instructions. However, the duplication of non-accelerated logic inherent when replicating GP-CPU cores may be a downside to homogeneous or heterogeneous multicore CISA scaling. The general purpose portion of the CISA core constitutes the majority of logic, and replicating CISA cores to increase customization comes at a significant overhead.

Design complexity

Developing a CISA core is relatively simple, because only customizations for an already existing GP-CPU core need implementing. Software development also tends to be low complexity, because the GP-CPU's compiler only needs to be modified for the new customizations. Application support for customizations can be automatic via recompilation, assuming the modified compiler recognizes opportunities to use the new instructions. Because CISA is integrated into the GP-CPU core, no explicit memory transfers are necessary unless building a multicore CISA processor.

Production cost

Core customizations typically require only thousands of extra gates, and do not require a significant increase in area or production costs.

Target market

CISA cores are typically customized for a specific workload, so each CISA core's audience is significantly limited.

1.2.5 Single-ISA heterogeneous multicore (SIHM)

Single-ISA heterogeneous multicore processors combine multiple cores that implement the same or similar ISAs. This approach allows for multiple implementations of a core that favor different strengths and weaknesses. Kumar et al. use cores from successive Alpha families to build a SIHM containing everything from simple, lightweight cores to the latest, most complex cores [44]. By switching to advanced cores only when the workload provides a benefit, the SIHM approach reduces unnecessary active logic.

Execution improvement

The SIHM approach does not attempt to increase performance, but to maximize power efficiency.

The set of cores contained in a SIHM processor is usually obtained off the shelf, rather than designed specifically for the SIHM processor. Since one core is used at a time, there is no opportunity to improve performance using existing designs. Rather, SIHM allows simpler cores, without branch prediction for example, to be used when workloads would not see performance gains from more complex cores. This approach can reduce power and energy consumption by roughly 40%.

Customization

SIHM's reliance on a single ISA limits the customization possible. Any customizations unique to a single core must not modify the ISA, which limits the amount of customization. Although an ISA customization could be included in all cores, this would result in a large design effort, and since the customization would be common to all cores, it would no longer offer any trade-off by switching cores.

Relying on off-the-shelf GP-CPUs further limits customization. By their general-purpose nature, all of the cores are designed not to optimize for any particular workload.

Scalability

SIHM processors could scale to many cores, though it is unlikely that many distinct core designs supporting the same ISA exist. Also note that the single active core prevents multicore parallelism, reducing the incentive to add many cores. If this barrier is removed, SIHM processors could utilize homogeneous multicore interconnect networks to scale cores upward.

Design complexity

By using existing GP-CPU designs, SIHM does not require much hardware design effort. If custom cores were designed for the SIHM processor, however, designing multiple unique cores for a single ISA would become a highly complex challenge.

15 SIHM processors use a common ISA, so no porting work is needed to target the multiple cores.

The only additional software design challenges are to include OS support for picking and switching to the best core for a given workload.

Production cost

If cores from successive processor families are used, the additional area overhead of SIHM is low. Over time, technology scaling has given processor designers more transistors for the same silicon area. As a result, older core designs use far fewer transistors than modern designs. When fabricated at the same process technology, the area of older core designs is insignificant compared to modern designs. Additionally, sharing the L2 cache SRAM between all cores reduces the area overhead of adding additional cores.

Target market

Due to the support for a single general purpose ISA, SIHMs can target large computing markets. If the SIHM uses an already popular ISA, the SIHM can be used as a drop-in replacement to reduce power and energy consumption in existing systems without porting. This approach could quickly provide energy savings for mobile processors, which have typically relied on several generations of ARM cores and whose workloads do not require state-of-the-art performance at all times.

1.2.6 Mobile SoC

Mobile phones typically rely on System-on-Chip (SoC) processors to efficiently handle workloads while maximizing battery life. These mobile SoCs contain a GP-CPU core (often designed by ARM) and a handful of ASIC-like accelerator cores. The accelerators are large and coarse grained, meaning that they implement large algorithm classes like 3D rendering, audio decoding, or radio basebands. The accelerators typically communicate with the GP-CPU and system memory DRAMs via a shared bus. SoCs with many accelerators and peripherals may bridge this bus to a secondary bus connecting UARTs and other low-bandwidth peripherals.

SoCs rely on DMA controllers (DMACs) to copy data between the system memory DRAM and accelerator SRAM I/O scratchpads, and many DMACs add dedicated channels to accelerators to reduce shared bus contention. DMA is crucial for SoC performance, offloading time-consuming memory transfers away from the GP-CPU.
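
The host-side control flow can be sketched as follows. The descriptor layout and register addresses are hypothetical placeholders rather than those of any particular DMAC; the point is only the shape of the interaction, in which the GP-CPU fills in a small descriptor, starts the transfer, and is then free to work or sleep until a completion interrupt arrives.

    #include <stdint.h>

    /* Hypothetical DMA descriptor: source, destination, length, and a
     * channel index giving one accelerator a dedicated path. */
    struct dma_desc {
        uint32_t src_addr;   /* system DRAM buffer         */
        uint32_t dst_addr;   /* accelerator I/O scratchpad */
        uint32_t length;     /* bytes to copy              */
        uint32_t channel;    /* dedicated channel number   */
    };

    /* Hypothetical memory-mapped DMAC registers (illustrative addresses). */
    #define DMAC_DESC_REG  ((volatile uint32_t *)0x40001000u)
    #define DMAC_START_REG ((volatile uint32_t *)0x40001004u)

    static void dma_start(const struct dma_desc *d)
    {
        *DMAC_DESC_REG  = (uint32_t)(uintptr_t)d;  /* hand descriptor to the DMAC  */
        *DMAC_START_REG = 1u;                      /* begin the copy; CPU moves on */
        /* Completion is signalled by an interrupt rather than CPU polling. */
    }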

Mobile SoCs are popular in many mobile phones and portable media players. TI's OMAP, Qualcomm's Snapdragon, and Samsung's line of SoCs are frequently used for these devices today.

Research designs have suggested relying on accelerators for frequently executed workloads and using a GP-CPU to perform less common operations when necessary. Under such an approach, the GP-CPU and unused accelerators can be turned off to save power, thus ensuring only specialized circuits are active in the common case. This approach is only feasible for domain-specific processors where "common workloads" can be identified before fabricating the processor. For example, a ULP processor utilized accelerators for wireless sensor networks to significantly reduce energy and power consumption [33], using performance improvements of up to 1000x and scaling down voltage and frequency.

Execution improvement

As noted above, the dedicated accelerators used in mobile SoCs are known to improve performance by up to 1000x. Of course, this performance improvement applies to the limited number of operations supported by the ASIC-like accelerators. These performance improvements can be, and frequently are, traded off for reduced power consumption. In addition, accelerators frequently use dedicated voltage domains that can be independently turned off (VDD-gated) when not in use.

Customization

In contrast to the previously discussed accelerated architectures, mobile SoC accelerators are highly specialized for a specific algorithm. Furthermore, SoCs contain several unique accelerators, resulting in a much more heterogeneous architecture. The SoCs are unlikely to compute all of the accelerated algorithms at the same time, so silicon area utilization at any given time will be much lower than in homogeneous systems.

Mobile SoCs provide excellent power and performance improvements for the small subset of workloads targeted by their accelerators. These accelerators are typically too workload-specific to accelerate anything beyond this working set, and are inflexible to other domains.

17 Scalability

Mobile SoCs rely on a shared bus and a DMAC to transmit data, which are centralized and will not scale to large numbers of accelerators. The shared bus will quickly become a bottleneck, because every transfer from system DRAM memory to the accelerators or GP-CPU must go over the shared bus. If the number of accelerators grows, the traffic over the shared bus will grow with it and eventually saturate. The DMAC's central control and buffering of accelerator traffic will also become a choke point. DMACs often provide dedicated channels to high bandwidth accelerators, so that each accelerator can transfer data to system DRAMs without delay from shared bus contention. However, this centralized structure is clearly not scalable in its current implementation.

Design complexity

Unlike the cores of homogeneous multicore architectures, each accelerator core is unique. As a result, the accelerators cannot be replicated from a single design but must be implemented individually. Due to the typically small number of accelerator cores, the interconnect design complexity is low, making system integration of accelerators comparatively simple.

Software must also be updated to efficiently make use of the accelerators. The operating system and shared libraries must be designed to manage memory transfers between accelerators, system DRAM memory, and the GP-CPU. Accelerators typically process workloads in batches to amortize data transfer costs. As a result, application software may need to adapt to a batched processing model, rather than individually working with small pieces of data.

Production cost

Mobile SoCs contain several accelerators in addition to the GP-CPU. As a result, the area consumed by the SoC increases with each accelerator. However, the most intensive work is performed efficiently by accelerators, reducing the need for a complex GP-CPU. Consequently, a simpler GP-CPU can be used, offsetting some of the area added by the accelerators.

18 Target market

The addition of coarse-grained accelerators limits mobile SoCs to tasks which depend on popular, preset algorithms. These algorithms typically include support for audio, video, 3D rendering, radio baseband, and signal processing. As a result, these SoCs are well suited for mobile phones and media players, but for few systems beyond portable media consoles.

1.2.7 Reconfigurable computing

Reconfigurable computing aims to map acceleration to all possible workloads. Rather than fabricating accelerators for specific workloads, reconfigurable processors contain field programmable gate array (FPGA) logic, which consists of many small look-up tables (LUTs) and D flip-flop (DFF) register memories. By configuring these components at runtime, FPGAs appear to implement different logic. The reconfigurable approach ensures that if it is possible to make an accelerator for an algorithm, an FPGA can implement it, no matter how obscure.

The downside to FPGA technologies is overhead: a 21x increase in area, a 12x increase in power, and a 3-4x decrease in performance [45]. Unlike ASIC approaches, which synthesize transistors and wires directly on silicon, FPGAs must map netlists into LUTs and DFFs. Although this intermediate mapping enables FPGAs to be flexible through reconfiguration, the mapping is also inefficient. This is not to say that FPGAs will always consume more area or use more power than a synthesized GP-CPU, because FPGAs may implement accelerators that improve performance and power by several magnitudes, more than compensating for FPGA overheads. However, custom ASICs implementing the same accelerator using the same process technology will always have smaller area and power costs, and better performance.
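
A rough, illustrative calculation shows why the trade can still pay off; the 100x figure below is an assumed gain for a workload well suited to acceleration, while the 12x factor is the published FPGA power overhead cited above. An accelerator that is 100x more power-efficient than GP-CPU software, implemented on an FPGA, still nets roughly

    \frac{100\times}{12\times} \approx 8\times

lower power than the software baseline, whereas the same accelerator fabricated as an ASIC keeps the full 100x advantage.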

The FPGA business has long since matured, and many FPGA processors are commercially available from Xilinx, Altera, and several other companies. Most FPGA processors now contain a combination of FPGA logic and ASIC-style non-reconfigurable cores, such as multipliers, small SRAMs, and GP-CPUs. This approach aims to mitigate the area, performance, and power overheads for these commonly used components, while simultaneously providing FPGA circuitry for rarer logic.

19 Execution improvement

FPGAs can provide the same ASIC-style accelerator performance improvements. However, due to the performance and power overheads of the inefficient LUT and DFF substrate, the performance and/or power improvements are substantially degraded. Therefore, FPGAs are ideal for less common logic which would be economically infeasible to fabricate as an ASIC, but a poor implementation style for commonly used logic such as GP-CPUs and commonly used accelerators.

Customization

FPGAs are uniquely homogeneous and highly customizable. From a fabrication perspective, FPGAs mostly consist of a highly homogeneous LUT and DFF fabric. From a circuit designer's perspective, FPGAs have the potential for unprecedented customization, since FPGAs can implement any arcane netlist. Further, the same FPGA can then be reconfigured for a different netlist later, as workloads change. As a result, FPGAs are uniquely flexible and highly customized. Also, the FPGA may be fully utilized, unlike other highly customized architectures, because the entire FPGA fabric can be reconfigured to efficiently target changing workloads.

Scalability

Due to the regular nature of the FPGA fabric, FPGAs are highly scalable. However, building on-chip networks using the LUT and DFF building blocks is inefficient. This may motivate optimized, non-reconfigurable network implementations to improve large-FPGA efficiency.

Design complexity

Due to the separation between FPGA fabric design and the reconfigurable circuits implemented a layer above, FPGA design complexity is both low and high.

The FPGA fabric's design complexity is relatively low, consisting mainly of simple LUTs and DFFs. Although fabrics often contain non-reconfigurable multipliers, SRAMs, and simple GP-CPUs, designs for these components are relatively simple, and replicated throughout the FPGA fabric.

Circuit designs implemented on the FPGA fabric are decidedly more complex. Because FPGAs make implementing custom hardware for arcane algorithms economically feasible, designing and integrating these custom accelerators introduces the complexities of mobile SoC processors to smaller design efforts. Adding "on-the-fly" reconfigurability (reconfiguring without processor downtime) adds a new level of complexity. Now, the processor must decide when and which fabric regions to reconfigure. This is a unique and complex challenge for reconfigurable computing, since statically fabricated technologies cannot reconfigure.

Production cost

The production cost of FPGAs is comparatively cheap for low-volume designs, but high for high-volume designs. For low-volume designs, one-time expenses such as fabrication mask creation, time and cost overruns from design mistakes, and the long time to market make ASIC fabrication prohibitive. FPGAs, although imposing a 21x area overhead, do not suffer from these overheads and make low-volume designs economically feasible.

However, the one-time overheads associated with ASIC fabrication are insignificant for high-volume designs, whereas the significant area overheads required by FPGA implementations are incurred for every chip. For this reason, the production cost of FPGAs is too large for high-volume projects.
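
A simple cost model makes the break-even explicit; the dollar figures are assumptions chosen only to illustrate the shape of the trade-off, not industry quotes. With one-time (NRE) cost N, per-unit cost u, and volume V, total cost is N + uV. If an ASIC has N = \$2{,}000{,}000 and u = \$5 while an FPGA has negligible NRE and u = \$50, the ASIC becomes cheaper once

    V > \frac{2{,}000{,}000}{50 - 5} \approx 44{,}000 \text{ units},

and below that volume the FPGA wins despite its per-chip overhead.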

Target market

Due to high flexibility, FPGAs can target virtually any low-volume market, since one FPGA processor design can be reconfigured for virtually any application domain. Due to production costs, FPGAs cannot compete on price for high-volume markets, which can afford to give up some processing flexibility given a large enough market for specific accelerated application domains.

1.3 Challenges

There are many accelerated architectures, each with a distinct approach to improving performance, energy, power, or all three. Each design has trade-offs in terms of the magnitude of execution improvements, future scalability, support for workload-specific customization, design complexity, production cost, and market potential. There is no "correct" accelerated architecture; rather, there are architectures better suited for certain expectations. However, it may be possible to combine approaches to maximize benefits.

Reconfigurable logic provides a unique combination of flexibility and specialization that can accelerate a wide swath of workloads without high-volume demand. However, FPGAs are unable to deliver the ultimate performance, power, and energy improvements found in the ASIC-style accelerators of mobile SoCs. A "many-accelerator" architecture, in which each of many ASIC-style accelerators targets a common workload, combines ASIC execution improvement with FPGA-like wide workload coverage.

In the many-accelerator vision, many accelerators for common algorithms are available on-chip. Applications will therefore benefit from ASIC-level acceleration for most of their workloads, without incurring FPGA overheads. Of course, not all algorithms will be common or mature enough to warrant fabricated accelerators. In these cases, a portion of the many-accelerator processor may contain FPGA fabric. And for algorithms that do not benefit from hardware acceleration, such as state machines and system supervision, the many-accelerator system will include a GP-CPU. However, the majority of workloads will be processed by non-reconfigurable accelerators to maximize performance, power, and energy characteristics, and FPGA or GP-CPU logic will only be used in the rare instances that the non-reconfigurable accelerators do not apply. As a result, the many-accelerator approach will combine ASIC execution improvement with FPGA-like flexibility to changing workloads.

There are many opportunities for new innovations and accelerated architectures, all of which are critical to realizing the many-accelerator vision. In this section, areas of accelerated architecture design ripe for innovation are discussed.

1.3.1 Rapid accelerator development

Above all else, the many-accelerator architecture requires a large number of accelerators. Unlike software libraries that utilize high-level languages (HLLs) such as C or Java, hardware accelerators are typically developed in the Verilog or VHDL RTL languages. As a result, accelerators are difficult to develop. Algorithms with free software implementations commonly command tens of thousands of dollars at a minimum as RTL designs. From a design perspective, implementing a many-accelerator architecture would be prohibitively costly in terms of money or time via today's approach.

However, support for HLLs is emerging in the accelerator world. Some approaches produce RTL designs from existing languages, such as C/C++ [15]. Other HLL approaches use new languages created to represent the unique design constructs of hardware design, such as Bluespec [9]. With either approach, the time to design accelerators is greatly reduced. Although the HLL compiler may not produce the most efficient hardware designs, HLLs can enable larger design projects and wider design space explorations that would not be tractable using RTL's limited expressiveness.

Extensions to the HLL approach add automated compilation tools that identify and synthesize accelerators and co-design software machine code and accelerator hardware [72, 39].

HLL hardware compilers may one day make developing hardware accelerators as easy as designing software. Further, designing hardware and software may cease to be different processes.

1.3.2 Identifying algorithms to accelerate

Before implementing an accelerated system, it is necessary to identify which algorithms to accelerate. Identifying "common algorithms" is a subjective task and will require a significant amount of workload profiling. Initially, profiling calls to system functions and shared libraries can reveal which explicitly defined algorithms are the most popular and worth accelerating. As a longer-term approach, similarity analysis of source code repositories would reveal other algorithms that are frequently used in software, enabling additional common accelerators.

1.3.3 Minimizing accelerator area overheads

The additional area incurred by adding many accelerators must be kept to a minimum. To keep accelerator area low, the granularity of accelerators must change. Coarse-grained ASIC-style accelerators found in mobile SoCs that target entire workloads, such as video and audio decoding, work well when used independently. However, these accelerators introduce redundant logic when used in the same system, since many workloads use the same common algorithms. For example, both H.264 video and MP3 audio decoders utilize Huffman decoders and IDCTs. If these coarse-grained accelerators were included in the same processor, at least two copies of the Huffman decoder and IDCT would be fabricated. Instead, medium-grained accelerators could reduce redundant logic while preserving the ASIC-style performance improvements not found in fine-grained instruction-level accelerators. In the medium-grained domain, accelerators would accelerate individual stages of workload processing, such as the Huffman decoder or IDCT. These medium-grained accelerators could then be chained together to implement the same algorithms as coarse-grained accelerators, such as video or audio decoding. As a result, each medium-grained accelerator could be used for multiple algorithms, making each medium-grained accelerator more workload-flexible than coarse-grained accelerators. Also, duplicate logic area can be eliminated without incurring the performance or energy costs of shoehorning acceleration into a restrictive ISA.

1.3.4 Scaling accelerator systems

To realize the many-accelerator vision, processors will include many more accelerators than current mobile SoCs. This increase in accelerator count is especially likely if coarse-grained accelerators are divided into several medium-grained accelerators. However, current mobile SoC interconnects that utilize shared memory for inter-core communication, centralized arbiters such as DMA controllers, and poorly scaling interconnects such as shared buses cannot support an order-of-magnitude increase in accelerators.

Although tempting, existing scalable interconnect networks for homogeneous multicore architectures will not map effectively to the world of heterogeneous accelerator systems, since many assumptions no longer hold. Because each accelerator core is unique, most cores will be idle or turned off, so only a fraction of accelerators will utilize the on-chip network at any time. This would be highly unusual in a homogeneous multicore processor, where each core is typically active and interacting with the on-chip network in a regular manner. Further, accelerators tend to send and receive larger datasets than the messages passed over homogeneous multicore on-chip networks.

A network designed for larger batches of data would be more efficient than the packet-switched designs favored for homogeneous multicore on-chip networks.

Networks designed for systems in which a small percentage of cores are active and produce bursty batch transfers would provide efficiency gains for the many-accelerator architecture. In addition, a distributed memory and memory management architecture will be necessary to avoid the bottlenecks of a single system DRAM memory and centralized DMA controller management.

The work presented in this thesis pursues the implementation of an efficient architecture for many-accelerator systems. This architecture must efficiently satisfy the needs of single-algorithm accelerators and scale to support tens or hundreds of them in the same system. To design such a system, understanding the unique behavior of accelerators, rather than general purpose CPUs, is paramount.

With this in mind, the following chapter discusses the development of an accelerator for maintaining compressed Bloom filters in a wireless sensor network.

Chapter 2

Accelerator composition

To understand the attributes and behavior of single-algorithm accelerators, this chapter details the development of one such accelerator. Although each accelerator derives much of its performance and energy improvements from specializing for a particular algorithm, the process of developing accelerators is not specific to any algorithm. By examining the development of a compressed Bloom filter accelerator for wireless sensor networks, accelerator characteristics useful for designing a many-accelerator architecture can be extrapolated. In particular, this chapter determines that much of the accelerator's silicon area is consumed by generic SRAM memory, not logic specific to the algorithm. This results in two takeaways: a many-accelerator architecture should further investigate SRAM use in accelerators for optimization, and the interconnects currently used to integrate accelerators into a larger system are lacking.

2.1 Accelerator design requirements

Battery-powered embedded systems carefully manage energy consumption to maximize system lifetime. Wireless sensor networks (WSNs), made up of many “mote” devices, are often designed to operate for months without intervention. Sensor networks are typically used to monitor an environment and may be deployed in remote and hazardous locations. WSNs can consist of a hundred motes or more, and cover wide areas. As a result, mote software and hardware must consider energy consumption at every level.

Motes are simple, pocket-sized computers. Each mote contains a small battery that powers a radio for wireless networking, a limited amount of memory, and a constrained processor. Aggregation, a widely researched technique for reducing data transmissions by combining data on motes, reduces energy use by spending additional energy on computation to save a greater amount of energy on the power-hungry radio [51]. Increasing on-mote processing complexity will require additional computational hardware, demanding more energy. As sensor networks grow and generate larger data sets, these energy costs will continue rising.

Unlike PCs, embedded systems often execute a limited set of applications and have less need for general purpose functionality. Simple bit manipulations poorly utilize a general purpose processor, and complex operations, such as multiplication, require several cycles on one. Many embedded applications require support for both of these simple and complex operations, and most existing systems must make do with a poorly utilized general purpose processor. In contrast, hardware accelerators tailor hardware to the application.

The accelerator presented in this chapter implements several operations for compressed Bloom filters, a data structure for efficiently storing set membership. These operations include support for inserting items, compression and decompression, and querying. Significant performance, power, and energy improvements over a general-purpose hardware solution are demonstrated for each custom operation. In addition, the benefits of hardware acceleration are explored in the context of three WSN applications: mote health monitoring, object tracking, and duplicate packet removal. For these benchmarks, Bloom filters improve network reliability and reduce radio transmissions by up to 70%. The hardware accelerator implementation provides significant gains in network latency (59%), computational delay (85-88%), and computational energy (98%) compared to executing the Bloom filter code on general purpose microcontrollers. Given these improvements, the benefits of architecting processors for hardware accelerators are shown. The Bloom filter accelerator can be VDD-gated when not in use, so that it only consumes energy when it will reduce total system energy consumption by an even greater amount.

The remainder of this chapter is organized as follows. Section 2.2 discusses related approaches to increasing energy efficiency and motivates the accelerator paradigm. Section 2.3 describes the algorithms needed for the Bloom filter hardware accelerator, and Section 2.4 discusses the architectural blocks needed to implement the approach. Power and performance advantages for the Bloom filter accelerator are quantified for specific operations (Section 2.5) and larger applications (Section 2.6).

2.2 Power and performance optimization strategies

Parallel processing is well known for increasing energy efficiency in general purpose computing.

Designers distribute computation across several low-power cores rather than a single high-power core [57]. The low-power cores operate at a lower frequency, reducing voltage and power requirements. Several cores can be combined in one processor to meet computational goals. Assuming the lowest possible voltage is used, dynamic power is roughly proportional to nf^3, where n is the number of cores and f is the operating frequency; potential processing capacity is proportional to nf.
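As a rough worked example (added here for illustration; it assumes voltage scales down with frequency, as in the model above), replacing one core at frequency f with four cores at f/4 leaves the potential capacity nf unchanged while cutting dynamic power by roughly 16x:

\[
P_{1\,\mathrm{core}} \propto 1\cdot f^{3}, \qquad
P_{4\,\mathrm{cores}} \propto 4\left(\tfrac{f}{4}\right)^{3} = \tfrac{f^{3}}{16}, \qquad
1\cdot f = 4\cdot\tfrac{f}{4}.
\]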

Ideally, power demands are minimized when many low-frequency cores are used. However, several factors limit the power reduction:

• Threshold voltage places a lower bound on voltage scaling. Subthreshold operation is possible but adds significant design challenges [74].

• Leakage current increases the power consumption of each additional core.

• Interconnect logic for communication between cores and shared memory requires additional power and may introduce bottlenecks.

• Software must be parallelized to run on all cores simultaneously.

Hardware acceleration provides an alternative approach that may be more appropriate for embedded systems due to the more specialized nature of the workloads. Motes do not require the same amount of general purpose functionality as a conventional computing system and can be customized for better performance and lower power. Furthermore, acceleration leverages the same power-saving properties of parallelism by operating at low voltage-frequency combinations with high performance, while also capturing the benefits of explicit hardware support for simple operations, such as bit manipulations, that are inefficient on general purpose cores. Additionally, accelerated systems do not require core interconnect logic or the software challenges of parallelized code.

Table 2.1: Bloom filter configurations. All configurations use a 16KB bit array and 32-bit elements. Bits per item applies to full Bloom filters.

Config.   Item capacity   Bits per item   Hash functions (k)   False positive rate
1         13500           9.71            7                    < 1%
2         9000            14.56           10                   < 0.1%
3         6500            20.16           14                   < 0.01%

2.3 Bloom Filter Algorithms

Bloom filters provide a useful case study for an exploration of hardware acceleration. Using Bloom filters, many WSN applications can easily aggregate information and reduce the size of large data sets containing unique identifiers. These factors can reduce costly radio transmissions and lower overall mote energy usage. However, some Bloom filter operations may require several seconds of compute time on general purpose mote hardware, limiting the applicability of the approach and incurring high energy usage. By implementing hardware support for Bloom filters, WSN applications can achieve significant energy reductions without sluggish performance. The Bloom filter hardware accelerator improves performance and energy use by optimizing several algorithms in hardware. The accelerator natively supports Bloom filters, multiply and shift hashing, and Golomb-Rice coding, for data aggregation, near-random hashing, and data compression, respectively. The following sections describe these algorithms in detail.

2.3.1 Bloom filters

Bloom filters efficiently store set membership of large items by combining data in a large bit array. Using a small number of hash functions, h1 ... hk, Bloom filters reduce storage costs by up to 70% [12]. Many applications, including spell checkers and distributed web caches, currently use Bloom filters. Other work has also suggested the use of Bloom filters in hardware [62, 59, 22]. However, these works use Bloom filters for internal processor or network management and do not expose Bloom filters to applications. Section 2.6 explores several Bloom filter applications for wireless sensor networks.

The hardware accelerator implements a specific range of Bloom filter configurations: the bit array is 16KB, up to 16 hash functions are available, and 32-bit items are supported. Initially, the accelerator sets every bit in the array to 0 to create an empty Bloom filter. It inserts items by hashing the item xi with every hash function h1 ... hk. The results of these hash functions, h1(xi) ... hk(xi), are addresses of bits in the array, which are set to 1. As the accelerator inserts more items, the number of 1s in the Bloom filter increases. When inserting items, some bits may already be 1 due to previous item insertions writing to the same bit address.

Querying to check whether an item xi is in the Bloom filter is similar to insertion. The accelerator hashes the item with every hash function h1 ... hk and checks each bit's value at addresses h1(xi) ... hk(xi). If any hash function points to a 0 bit, the accelerator knows with certainty that the item is not in the Bloom filter. If all hash functions point to 1 bits, the item is in the Bloom filter with high probability, but this cannot be known with certainty. These "false positive" errors, although rare, occur when other inserted items hash to the same bits as the queried item. The false positive rate can be pre-configured as required by the application, typically from 1% to 0.01%, at the cost of item capacity.

Items cannot be removed from a Bloom filter. Hypothetically, an item could be removed by setting any of the item’s corresponding array bits to 0. However, many inserted items may hash to the same bit, and removing one item may inadvertently remove several other items. All elements can be cleared by setting all bits in the array to 0.

The false positive rate, item capacity, and energy requirements to insert or query an item are determined by k, the number of hashes used by the Bloom filter. When k is larger, the false positive rate decreases. However, smaller values of k result in Bloom filters with a larger item capacity and lower energy cost per item insertion or query. This trade-off is illustrated in Table 2.1. A detailed analysis of Bloom filter configuration is available in [12].
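For reference, the standard Bloom filter approximations (textbook results, not derived in this chapter, but consistent with Table 2.1) relate the false positive rate p, the number of array bits m, the number of inserted items n, and the number of hash functions k:

\[
p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{opt} = \frac{m}{n}\ln 2, \qquad \frac{m}{n} \approx 1.44\,\log_{2}\frac{1}{p}.
\]

For p < 1%, these give roughly 9.6 bits per item and k ≈ 7, close to configuration 1 in Table 2.1 (9.71 bits per item, k = 7).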

Bloom filters merge by bitwise ORing their bit arrays, assuming both Bloom filters use the same bit array length and hash functions. This property makes aggregating data in a WSN spanning tree a trivial task: parents can quickly merge Bloom filters from child motes, insert their own items, and transmit the aggregate Bloom filter to their own parents.
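In software terms, the merge is a single pass of bitwise ORs (a minimal sketch added for illustration; the accelerator performs the equivalent operation in hardware):

    #include <stdint.h>
    #include <stddef.h>

    /* Merge a foreign Bloom filter into the local one by bitwise OR.
     * Both filters must use the same bit array length and hash functions. */
    static void bloom_merge(uint8_t *local, const uint8_t *foreign, size_t nbytes)
    {
        for (size_t i = 0; i < nbytes; i++)
            local[i] |= foreign[i];
    }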

The Bloom filter is considered full when half of the array's bits are 1. At this point, further insertions will dramatically increase the false positive rate. Bloom filter storage is most efficient when full, as the bit array is always a constant length. For example, configuration 1 in Table 2.1 can store 32-bit elements using less than 10 bits each when full.

2.3.2 Multiply and Shift Hashing

Multiply and shift hashing, described by Dietzfelbinger et al. [23], is simple yet effective. Each hash function h1 ... hk requires a hash key HashKey1 ... HashKeyk. Hash keys are odd integers randomly chosen before the Bloom filter is used. The accelerator represents hash keys as 32-bit integers.

The hash hi of element xj is calculated using:

\[
h_i(x_j) = \left\lfloor \frac{(HashKey_i \times x_j) \bmod 2^{32}}{2^{32-b}} \right\rfloor \tag{2.1}
\]

where b is the number of bits in a Bloom filter bit array address. For the 16KB bit array used by the accelerator, b = 17. The modulo and division are by powers of two and can be efficiently implemented with a bit mask and shift.
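A software restatement of the hash, and of the insertion and query operations it drives, may help make the bit-level operations concrete (a sketch added for illustration with hypothetical names, not the accelerator's RTL):

    #include <stdint.h>
    #include <stdbool.h>

    #define ADDR_BITS 17   /* b = 17: the 16KB bit array holds 2^17 bits */

    /* Multiply-and-shift hash (Equation 2.1): keep the low 32 bits of the
     * product, then take the top b of those bits. Hash keys are odd. */
    static uint32_t msh_hash(uint32_t key, uint32_t item)
    {
        uint32_t prod = key * item;         /* mod 2^32 is implicit in 32-bit math */
        return prod >> (32 - ADDR_BITS);    /* divide by 2^(32-b) via a shift      */
    }

    /* bits points to the 16KB (16384-byte) bit array. */
    static void bloom_insert(uint8_t *bits, const uint32_t *keys, int k, uint32_t item)
    {
        for (int i = 0; i < k; i++) {
            uint32_t a = msh_hash(keys[i], item);
            bits[a >> 3] |= (uint8_t)(1u << (a & 7));
        }
    }

    static bool bloom_query(const uint8_t *bits, const uint32_t *keys, int k, uint32_t item)
    {
        for (int i = 0; i < k; i++) {
            uint32_t a = msh_hash(keys[i], item);
            if (!(bits[a >> 3] & (1u << (a & 7))))
                return false;               /* definitely not present */
        }
        return true;                        /* present with high probability */
    }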

2.3.3 Golomb-Rice Coding

The accelerator implements Golomb-Rice coding, a popular compression and decompression method used in Apple's Lossless Audio Codec (ALAC) and Lossless JPEG (JPEG-LS) [54, 63]. As noted in Section 2.3.1, a Bloom filter contains more 0s than 1s until filled. Therefore, sparsely filled Bloom filters (under 70% full) can be reduced in size through Golomb-Rice coding. The algorithm, a form of run length encoding, is simple to implement and therefore power efficient.

The number of 1s in the bit array is first counted to determine the "remainder part" length l. The relation between the number of 1s in the bit array and l is precomputed; only a quick lookup is needed to determine the remainder part length.

Second, the bit array is iterated from start to finish, scanning for run lengths of 0s between 1s. For each run length of n 0s, the remainder part r = ⌊n/2^l⌋ and the quotient part q = n mod 2^l must be calculated.

After calculating r and q, the accelerator writes r 0s to the compressed bit stream, followed by a 1. q is then written directly, using l bits. This process is used to write all run lengths in the uncompressed bit stream until the end is reached. The second step's implementation does not require any expensive divisions or modulos; a counter tracks the current run length of 0s. If the next bit is a 0, the counter is incremented. If the counter reaches 2^l, a 0 is written to the compressed stream and the counter is reset. If the next bit is a 1, a 1 is written to the compressed stream, followed by the counter's value using l bits. Therefore, Golomb-Rice compression can be reduced to many simple bit operations.

Figure 2.1: Bloom filter hardware accelerator hardware flow. Arrows indicate the direction of information; shaded blocks indicate modules controlled by the Instruction Decoder.
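The counter-based encoding loop just described can be sketched in software as follows (an illustration only, not the accelerator's logic; put_bit is a hypothetical sink for compressed output bits):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical helper: append one bit to the compressed output stream. */
    static void put_bit(int bit) { (void)bit; }

    /* Golomb-Rice encode a bit array using remainder-part length l. A run of
     * n 0s is emitted as floor(n / 2^l) 0s, a terminating 1, then n mod 2^l
     * written with l bits. No division or modulo is needed. */
    static void golomb_rice_encode(const uint8_t *bits, size_t nbits, unsigned l)
    {
        uint32_t run = 0;                        /* current run length of 0s */
        for (size_t i = 0; i < nbits; i++) {
            int bit = (bits[i >> 3] >> (i & 7)) & 1;
            if (bit == 0) {
                if (++run == (1u << l)) {        /* counter reached 2^l      */
                    put_bit(0);                  /* emit one more unary 0    */
                    run = 0;
                }
            } else {
                put_bit(1);                      /* terminate the run        */
                for (int j = (int)l - 1; j >= 0; j--)
                    put_bit((run >> j) & 1);     /* counter value in l bits  */
                run = 0;
            }
        }
    }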

2.4 Accelerator Architecture

The Bloom filter accelerator leverages the mote architecture described by Hempstead et al. [34], which provides a framework for custom hardware accelerators. The architecture proposes a lightweight event processor for managing power and offloading tasks to hardware accelerators. High-level events and tasks are decoded on the event processor and deployed to accelerators via memory-mapped operations. A simple processor executes any operations not explicitly handled by accelerators.

Hardware accelerators are synthesized with standard cells (e.g. ASIC flow) or through a shared on-chip programmable FPGA substrate.

The Bloom filter accelerator supports a 16-bit data bus and consists of several modules, illustrated in Figure 2.1. In the following sections, each major module in the Bloom filter accelerator is examined, and design decisions for reducing energy and delay are discussed.

2.4.1 Bloom Filter Memory

The Bloom filter bit array is stored in four 2K x 16-bit SRAM memory modules. The bit array is stored sequentially by address, so that 16-bit words are stored in the following order: Module1[0], Module2[0], Module3[0], Module4[0], Module1[1], and so on. A four-module configuration was selected to provide access to all four memory modules simultaneously, boosting performance by up to 4x. Each SRAM can access one address per cycle, so increasing the number of SRAMs available in a given cycle can greatly improve performance. Additionally, the accelerator uses four modules with a 16-bit data bus rather than one module with a 64-bit data bus because some Bloom filter operations only use one block per cycle. In this case, the three unused blocks can be disabled to reduce dynamic power consumption. The design does not use a larger data bus because the significant additional logic would require more power and area, and a wider bus would rarely be fully utilized while providing correspondingly smaller performance improvements.
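The interleaving implied by that layout can be written out explicitly (an illustrative mapping added here, not taken from the RTL):

    #include <stdint.h>

    /* Map a Bloom filter bit address (0..131071) onto the four 2K x 16-bit
     * SRAM modules: 16-bit words are assigned round-robin, so word 0 goes to
     * Module1[0], word 1 to Module2[0], ..., word 4 to Module1[1], and so on. */
    typedef struct {
        unsigned module;   /* 0..3   : which SRAM module          */
        unsigned row;      /* 0..2047: address within that module */
        unsigned bit;      /* 0..15  : bit within the 16-bit word */
    } sram_loc_t;

    static sram_loc_t map_bit(uint32_t bit_addr)
    {
        uint32_t word = bit_addr >> 4;          /* 16 bits per word */
        sram_loc_t loc = { word & 0x3, word >> 2, bit_addr & 0xF };
        return loc;
    }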

2.4.2 Memory Data Controller

The Memory Data Controller manages data stored in the Bloom filter memory. The accelerator supports several Bloom filter operations, each writing to memory in a distinct style. Insertions and queries only modify one bit at a time, while other operations may modify one block or four blocks per cycle. The Memory Data Controller is responsible for ensuring each operation can write as many or as few bits as is required.

The Memory Data Controller also counts the number of 1s inserted into the Bloom Filter at every cycle. The accelerator uses this counter during compression operations to eliminate the need for an additional full memory iteration as described in Section 2.3.3. As previously noted, memory access can be a bottleneck, so this optimization is critical for performance.

Figure 2.2: Decompressor information flow

2.4.3 Memory Address Controller

The Memory Address Controller coordinates with other blocks to correctly set the addresses of each of the four Bloom filter blocks. Although item insertion and query operations randomly jump from bit to bit in memory, some operations may sequentially read one module at a time, and others read sequentially from all blocks simultaneously. During sequential operations, the Memory Address

Controller remembers where processing ended in the last cycle so that the operation can be easily resumed.

2.4.4 Decompressor

The Decompressor reads 16-bit Golomb-Rice encoded Bloom filter blocks from the data bus and unpacks up to 64 bits of uncompressed Bloom filter. The Decompressor guarantees that either the entire compressed block will be processed or 64 bits of uncompressed Bloom filter will be unpacked. These limits derive from the 16-bit data bus and the 64-bit width of the Bloom filter memory. Although these guarantees require significant additional logic, this higher performance design is supported to avoid elevated computation delays when processing Bloom filters containing many elements.

The Decompressor is composed of 16 serially connected bit decompressors, illustrated in Figure 2.2. This design allows each compressed bit to be decompressed serially. Although each bit could be decompressed in parallel and reassembled, the serial design allows bit decompressors to be disabled when the uncompressed stream is full, thus reducing dynamic power. A parallel design would increase the speed of decompression, but is unnecessary due to the slow 100 KHz clock frequency used by the Hempstead processor.

2.4.5 Compressor

The Compressor design is similar to the Decompressor design and is composed of 64 serially connected single-bit compressors. The Compressor reads 64 bits of uncompressed data from the Bloom filter bit array, producing up to 16 bits of compressed data per cycle. These bit limits are due to memory access and data bus constraints, respectively. As a result, compressed Bloom filters can be produced 4x faster than uncompressed Bloom filters. Supporting these guarantees requires additional logic, but the gains in performance make this addition worthwhile.

2.5 Accelerator Evaluation

This section evaluates the design decisions discussed in Section 2.4 by comparing power, energy, and performance of the hardware accelerated design against a general purpose hardware design paradigm.

The Bloom filter hardware accelerator, with the exception of the Bloom filter memory, was implemented in Verilog and synthesized for a commercial 130nm process using Synopsys Design Compiler and Cadence Encounter. The accelerator area is 792,850µm² and uses 1.217M transistors. Power calculations are based on Design Compiler estimates, using the check power command on high effort. The Bloom filter memory was generated by the Faraday Memaker tool, and its power estimates were profiled using Synopsys HSIM simulations. The accelerator implements two-phase clocking, operates at 1.2V, and supports a 100 KHz clock frequency.

As the placed-and-routed design in Figure 2.3 shows, SRAMs (shown as black boxes) consume 79% of the accelerator's area, and 21% is used by accelerator logic (below and to the left of the SRAMs). This result runs contrary to what many might assume: although hardware acceleration depends on customization to achieve power and performance improvements, generic SRAMs rather than specialized logic consume the majority of the hardware accelerator's area.

Figure 2.4 shows the distribution of power between the larger elements of the accelerator.

Figure 2.3: Placed-and-routed Bloom filter accelerator design. SRAMs (79% of used area) are shown as black boxes; logic (21% of used area) is located in the bottom left corner.

When synthesized individually, each module requires roughly 10% more energy than shown; however, Design Compiler is able to reuse logic between modules to reduce total system energy. Therefore, this analysis assumes the shared logic savings are proportional to the total energy cost of each module, and scales each module's power equally to match Design Compiler's system estimate. Again, generic SRAM memory consumes a large portion (45%) of the accelerator's power budget.

The general purpose design uses the Bloom filter software implementation for motes in Chang, et al. [13]. This software implements Bloom filter operations exactly as described in Section 2.3.

The software was written in nesC for TinyOS 1.1.15 [36] and tested directly on the TMote Sky mote. The TMote Sky features a relatively powerful 16-bit, 8 MHz TI MSP 430 processor with

10KB of memory. The analysis compares the hardware accelerator with the MSP 430 processor due to its wide popularity in the sensor network community.

Due to memory limitations of the TMote Sky, only 8KB of memory is available for the Bloom

filter bit array in the general purpose implementation. Since the accelerated implementation uses a

16KB Bloom filter, the accelerator will use additional power supporting the additional memory, as well as extra cycles to work on a Bloom filter twice as large. Yet, the accelerator logic demonstrates significant performance and energy savings despite this handicap.

Figure 2.4: Average power usage of the Bloom filter hardware accelerator modules: Bloom filter memory 352µW (45%), Decompressor 263µW (33%), Compressor 138µW (18%), hash key memory 8µW (1%), other modules 25µW (3%).

Timing figures for the accelerated implementation are generated by counting the number of cycles used. Total system power for the accelerator design is 886µW, of which 786µW is expended by the Bloom filter accelerator. The remaining 100µW is consumed by the Hempstead event processor and other infrastructure logic.

Timing figures for the general purpose implementation are obtained through experimentation.

Operations are executed on the TMote Sky and timed using internal microsecond and millisecond clocks for high accuracy. Energy figures use an average TMote power of 4.86mW, derived from the TMote Sky datasheet [19]; the general purpose implementation's processor power requirements are therefore almost 450% higher than the accelerated implementation's.

For both implementations, energy is calculated as Pavg × Time.

2.5.1 Item Insertion and Querying

Item insertion is highly efficient in accelerator logic due to native support for the complex multiply and shift hashing operation. Accelerator logic requires significantly less time for insertions in all cases, as illustrated in Figure 2.5. Furthermore, insertion uses 97% less energy per insertion than general purpose logic, regardless of the number of hash functions used.

Figure 2.5: Item insertion times of the application-specific hardware design logic and the general purpose design logic, versus the number of hash functions (k).

Several factors contribute to the accelerator's advantage. First, the accelerator logic can hash in one cycle due to its native support for multiply and shift hashing. In contrast, the MSP 430 processor used in the general purpose implementation only supports 16-bit math and must spend many cycles to complete the multiplication operation.

Further, the accelerated implementation does not require additional logic to perform the required bit shift. Instead, the accelerator's logic simply uses bits 31 through 15 of the product (bit 0 being the least significant bit). The MSP 430 does not natively support this bit selection and must spend additional cycles on a 32-bit shift to obtain the corresponding bit address.

Memory operations limit the speed of the accelerated implementation, as only one memory read or write can be performed per cycle. Therefore, the accelerator's insertion implementation requires 2 cycles per hash (one to read the block, another to write the modified block back). Parallelizing item insertion would require significant additional logic due to the seemingly random bit addressing caused by the hash function. Although processing up to four hashes simultaneously could theoretically be possible due to the four Bloom filter memory modules, the block location of each hash is unknown until calculated. Furthermore, all hashes could point to the same block, making simultaneous bit insertions impossible.

Querying items contained in the Bloom filter is similar to item insertion: the item is hashed k times and the bit at address hi(xj) is verified to be 1. This process takes roughly half the time of insertion on accelerated logic because only one memory read is required for each hash. The general purpose implementation is time-bound by the hash function, however, so performance is largely equal to insertion.

2.5.2 Compressing Bloom Filters

The accelerator improves Bloom filter compression performance by up to 1800%, as shown in Figure 2.6, and reduces energy consumption by up to 99%. The key to these accelerator-based improvements is custom support for Golomb-Rice coding. When implemented in software for general purpose systems, each uncompressed bit must be examined to count run lengths of 0 bits. When a 1 is encountered, another lengthy set of bit operations must be performed to determine the correct sequence of compressed bits. As more elements are inserted into the filter and the frequency of 1s increases, the number of run lengths grows and additional work is required to compress. In general purpose software, this additional work increases the compression delay. The accelerator requires additional cycles as well, but provides a fast upper bound on compression delay. The accelerator's compressor guarantees that either a compressed 16-bit block will be produced or 64 bits of uncompressed data will be processed every cycle. Therefore, compression can never exceed 81.92ms and is often faster.
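The 81.92ms figure follows directly from the data bus guarantee and the 100 KHz clock (a worked check added here, assuming the compressed output never exceeds the 16KB uncompressed size):

\[
\frac{131072\ \text{bits}}{16\ \text{bits/cycle}} = 8192\ \text{cycles}, \qquad 8192 \times 10\,\mu\text{s} = 81.92\,\text{ms}.
\]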

As noted in Section 2.3.3, the number of 1s in the bit array must be counted to determine l, before compressing run lengths. Memory access is slow in the accelerator implementation due to the 100 KHz clock frequency, and iterating through the memory would require an additional

20.48ms. To avoid this penalty, the accelerated implementation counts the number of ones in the filter as they are inserted. Therefore, the accelerator requires only one memory pass. The bit tracking technique is not used in the general purpose implementation due to lack of hardware support. Adding support in software would require several cycles per insertion or query, operations frequently used in many applications. Adding bit tracking support for lengthy merge operations would also require significantly larger delays. Performing two passes of memory, a fast process in the 8 MHz general purpose implementation, requires less delay overall in the general purpose

system.

Figure 2.6: Bloom filter reading (compression) and merging (decompression) delay at a 1% false positive rate, versus Bloom filter fullness. The accelerated implementation uses a 16KB Bloom filter; the general purpose implementation uses an 8KB Bloom filter. Uncompressed Bloom filters are used after reaching 70% fullness.

At 70% of capacity, run lengths become too small to compress effectively, and Bloom filters are delivered uncompressed. The general purpose implementation no longer compresses and simply reads from memory. Although the accelerated implementation is only 19% faster when the Bloom filter is approaching capacity, recall that the accelerator's Bloom filter is twice the size of the general purpose implementation's. If both implementations used equivalently sized bit arrays, the accelerated implementation would perform 59% faster. Further, the accelerator could reduce the uncompressed Bloom filter read delay by 75% if a wider data bus were used in a future architecture.

2.5.3 Merging Compressed Bloom Filters

Bloom filter merging uses bitwise ORs to combine a foreign Bloom filter with the filter stored in the bit array. The foreign Bloom filter, delivered over the data bus, is processed over several segments. If compressed, each foreign Bloom filter segment must first be decompressed. Meanwhile, the corresponding Bloom filter segment stored in memory is loaded. The two segments are bitwise

ORed and saved back into the bit array memory.

40 Bloom filter merging performance resembles Bloom filter compression performance, as shown in Figure 2.6. The accelerator performs up to 2700% faster and can reduce the energy cost by more than 99%. This performance boost is largely due to the accelerator’s decompressor design. The decompressor guarantees the entire compressed Bloom filter segment will be decompressed, or four blocks of uncompressed memory will be processed. The first guarantee provides an upper bound of

163.85ms per merge, but the second speeds up the process when the foreign Bloom filter is highly compressed.

The large gains are only obtained when using compressed Bloom filters. Golomb-Rice coding is relatively inefficient in general purpose hardware due to the large number of bitwise operations. However, once uncompressed Bloom filters are used beyond 70% capacity, general purpose performance noticeably improves. The general purpose implementation appears to operate faster beyond 70% capacity due to the memory size disparity between implementations. If both implementations used the same bit array size, the accelerated implementation would require 34% less time and 88% less energy.

When low or no compression is used, the data bus limits performance. Because each Bloom

filter segment requires two cycles (once to read from the bit array and once to write), 16,384 cycles are required to perform a merge in the worst case. If a larger data bus were used in a future architecture, merging delay could be reduced by an additional 75%.

2.6 Application Evaluations

In this section, several distinct Bloom filter-based wireless sensor network applications are examined to demonstrate Bloom filter gains and to quantify accelerator performance and energy improvements. Each application represents a different class of Bloom filter use. The mote status application shares a Bloom filter across a sensor network, so that a central server can check if any motes in a large network require attention. In the object tracking application, each mote in the network individually records the unique identifiers of sensed objects in a private Bloom filter, periodically transmitting the filter to a central server. The duplicate packet removal application uses a Bloom filter to locally store identifiers of each packet received, to quickly remove any packets duplicated by routing errors.

Mote networks may contain thousands of motes in the future, and managing mote operation will be critical in maintaining reliability. Mote networks are typically routed in a spanning tree formation and support multi-hop routing. A central server will connect to the root mote to send, analyze, and store information from the sensor network. If a mote is not close enough to directly transmit data to a desired mote, data can hop across several intermediate motes to reach its destination. For example, if a mote wishes to send data to the central server, the mote would send a packet to its parent mote, which sends the packet to its own parent mote, eventually reaching the server by way of the root mote. Extremely large mote networks can easily become saturated if storage and transmissions are not properly managed.

The following examples assume the sensor network uses a two-child-per-parent routing tree structure. The examples also assume mote radios have a 40kbps effective data rate (not including transmission overhead) [14]. All calculations are estimates based on the accelerator analysis from Section 2.5 and on-mote timing profiling of the general purpose implementation software.

2.6.1 Mote Status

Spanning tree topology can be problematic for the root mote and motes nearby. The root mote must forward every packet sent from the sensor network to the central server; nearby motes must also handle large amounts of network traffic. In many cases, the root mote will not be able to forward data to the server quickly enough, resulting in dropped packets and poor quality of service. Even if the radio can handle the network traffic, the radio will quickly exhaust the root mote's energy and cut off the sensor network from the server.

Bloom filters can greatly reduce the transmission load on these taxed motes by efficiently aggregating mote status information, such as low battery warnings, within the network. In this example, a sensor network of 128,000 motes demonstrates this application's ability to monitor extremely large sensor networks by using Bloom filters. Each mote uses a corresponding 4-byte identifier, and the example assumes 10% of the mote batteries are low at any given time. A mote can send a low battery alert by periodically transmitting its unique identifier (UID) to the server via its parent. Leaf motes on the boundaries should individually send these UIDs without Bloom filters.

As Figure 2.7 indicates, Bloom filters are only efficient when at least 15% full, or when about 2000 items are inserted with a 1% false positive rate.

Figure 2.7: Storage cost per item for a 16KB Bloom filter and 1% false positive rate, versus the number of items inserted. Bloom filters are more efficient than sequentially storing 32-bit items when the bits-per-item cost is under 32 bits.

As these low battery alerts hop from parent to

parent, approaching the root mote, each parent mote will need to forward twice as many alerts as each child. When a mote has received 2000 or more items, it should create an empty Bloom filter and insert these elements. This Bloom filter will then be sent to the following parent mote, which will merge any Bloom filters it receives, insert UIDs sent sequentially without Bloom filters, and insert its own UID if its battery is low. This process of merging Bloom filters and inserting single

UID items will continue as Bloom filters approach the root mote. The root mote will finally deliver a single Bloom filter, containing all low battery alerts, to the server. The server can then query the

Bloom filter for each UID in the network to discover which motes require attention. Additionally, the server can track alerts over time to reduce errors from false positives: by identifying motes which consistently report low batteries, the server can remove erroneous alerts that sporadically appear. Meanwhile, all motes will clear their Bloom filters and restart the process as needed.

Both accelerated and general purpose approaches implement the same algorithms, so both are capable of reducing transmissions near the root mote by up to 70%, thus reducing use of the mote's most power-hungry component. The accelerated implementation significantly reduces Bloom filter end-to-end delay: the maximum delay from a mote issuing a low battery alert to the server detecting the alert is reduced from the general purpose implementation's 46s to 19s, assuming no transmission errors. Furthermore, the accelerated implementation reduces Bloom filter computation energy costs by 98%, to 2.73mJ for every network-wide mote status scan.

2.6.2 Object Tracking

Previous work has used mote networks for object tracking [70]. Motes can identify objects by a unique ID using technologies such as RFID. This application uses Bloom filters with a sensor network to find packages in a busy package delivery warehouse. The example assumes each package has an RFID tag, so motes can detect package UIDs when nearby.

As packages move within the vicinity of a mote, the mote will wirelessly read the package’s

UID and store it in the mote’s Bloom filter. When the Bloom filter becomes full, the mote will send the Bloom filter to the server. Note that the object tracking application does not merge

Bloom filters with other motes. Instead, its Bloom filter is forwarded by other motes to the server for analysis. When a package is lost, the server looks at the most recently received Bloom filter for each mote and queries each to see if any have seen the package. If a false positive causes the package to appear in multiple places, previous Bloom filters can be examined to correctly identify the package location.

This merge-free approach incurs additional latency: each Bloom filter only stores data for one mote, so more time is required to fill it. However, this technique also ensures Bloom filters are sent when they are full and store items most efficiently. When latency is not critical, individualizing Bloom filters can improve transmission energy costs at every hop, not just near the root.

To build each Bloom filter, motes must clear the Bloom filter, insert enough UIDs to fill the

filter, and read the Bloom filter for transmission. With a false positive rate of 1%, the accelerator is able to reduce Bloom filter computation to 2.13s. This 85% reduction in delay over the general purpose software design corresponds to a 97% reduction in computation energy consumption.

44 2.6.3 Duplicate Packet Removal

Bloom filters are well equipped for removing duplicate packets [32]. Wireless sensor networks are particularly susceptible to duplicate packets due to wireless transmission errors. By using Bloom filters to track whether a packet was previously received, these transmission errors can be filtered out. When a mote receives a packet, the mote creates a unique packet identifier from the source mote's UID and the packet's sequence number. The mote queries the Bloom filter using this packet identifier. If the identifier is not found, the mote processes the packet, inserts the identifier into the Bloom filter, and sends an acknowledgment to the source mote to indicate that the packet was received. If the identifier is found, the mote has likely already received the packet and ignores it; however, the mote still sends an acknowledgment indicating a duplicate packet, because a false positive may cause the mote to mistakenly ignore an original packet. The source mote will realize the mistake upon receiving this acknowledgment and resend the same packet with a new sequence number. Dropped packets are detected when no acknowledgment is received by the sending mote; in this case, the packet is resent with the same sequence number.
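The receive path can be sketched as follows (an illustration with hypothetical names; bloom_query and bloom_insert stand in for the memory-mapped accelerator operations, and the identifier layout is assumed):

    #include <stdint.h>
    #include <stdbool.h>

    bool bloom_query(uint32_t item);    /* hypothetical accelerator driver calls */
    void bloom_insert(uint32_t item);

    typedef enum { ACK_RECEIVED, ACK_DUPLICATE } ack_t;

    /* Combine the source mote's UID and the sequence number into a packet
     * identifier (layout assumed), then filter duplicates. */
    ack_t on_packet(uint32_t src_uid, uint16_t seq)
    {
        uint32_t pkt_id = (src_uid << 16) ^ seq;
        if (bloom_query(pkt_id))
            return ACK_DUPLICATE;   /* likely seen before, or a rare false positive */
        bloom_insert(pkt_id);
        /* ...process the packet payload here... */
        return ACK_RECEIVED;
    }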

Although this application does not transmit Bloom filters, the accelerator improves storage ability and reduces search time. The Bloom filter accelerator stores more than 200% additional packet UIDs and provides extremely fast search times. Individually stored packet identifiers would require significantly more physical memory and additional searching algorithms.

When working with frequent radio transmissions, delays must be minimized. In the worst case, each delivered packet requires one item query and one item insertion to eliminate duplicate packets.

For a 1% false positive rate, this process requires 240µs with the accelerator. This performance boost corresponds to an 88% delay reduction and 98% computation energy reduction over the general purpose implementation.

2.7 Takeaways

During the development of the Bloom filter accelerator, two major observations were made: SRAMs consume the majority of accelerator area, and the interconnect used to interface accelerators limits the number and granularity of accelerators. With these concerns in mind, the following chapter develops an architecture to address these issues.

Chapter 3

Accelerator store

As Chapter 1 discussed, the end of threshold voltage scaling has led architects to turn off cores to stay within power budgets. These dark silicon transistors are unused and therefore wasted. There is no reason to create more copies of the same core design than can be powered, but including a heterogeneous mix of core designs allows the processor to turn on the most efficient cores for current workloads, so that only the least efficient cores will be dark at any given time.

Hardware accelerator cores, including the previously discussed Bloom filter accelerator, represent the frontier of customization-fueled performance: for maximum efficiency, each accelerator implements a single algorithm. By building processors containing many accelerators, not just identical copies of the same general purpose core, architects can again achieve the performance gains once delivered by clock frequency scaling. The key is taking advantage of, rather than falling victim to, dark silicon by using hardware accelerators.

In order for many-accelerator systems to be a viable performance-enhancing solution, it is important to first understand the characteristics of several accelerators and develop a flexible framework that ties them together. This chapter presents the accelerator store (AS), a shared-memory framework that allows for efficient allocation and utilization of resources in many-accelerator architectures.

To successfully design the AS and efficiently support many-accelerator systems, a deep understanding of the accelerators is necessary. Section 3.1 surveys eleven commercial and open source accelerators, revealing that generic SRAM memories consume 40 to 90% of the area in each accelerator and showing that the lessons learned from development of the Bloom filter accelerator were not unusual.

Figure 3.1: Comparison of architecture styles: (a) traditional multi-core, (b) traditional SoC, (c) many-accelerator.

This study then categorizes these private accelerator memories into four categories based on capacity, bandwidth requirements, and latency sensitivity. The analysis finds that, in many cases, large private SRAM memories embedded within accelerators also tend to have modest bandwidth and latency requirements. Sharing these SRAM memories between accelerators would reduce the amount of on-chip memory through amortization, thereby shrinking total processor area. This area reduction can be used for cost savings or to place additional accelerators for even greater diversity and performance improvements. A well-architected accelerator store, along with careful selection of shared SRAM memories, leads to very little overhead compared to private-memory based systems.

To efficiently share memory between accelerators and optimize many-accelerator systems, Section 3.2 presents the accelerator store's design. The accelerator store supports several ways to combine multi-core and customized logic, as illustrated in Figure 3.1:

• Support for shared SRAM memories: The AS allocates memory to accelerators on an as-needed basis.

• Centralized model/decentralized scalability: The AS presents a centralized view of shared accelerator memories to keep interactions with software and accelerators simple, but is physically distributed. A many-accelerator system can use multiple ASes, with a group of accelerators clustering around each AS as shown in Figure 3.1(c). The multi-AS design can scale to hundreds of accelerators.

• Fine-grained accelerator support: The AS's decentralized implementation supports accelerators in greater numbers than the handful found in today's SoCs. Many-accelerator systems may not just add new accelerators, but also decompose accelerators into multiple fine-grained accelerators. These finer-grained accelerators implement more commonly used algorithms and are less application specific.

• Flexible abstraction: The accelerator store's refined accelerator and software interfaces minimize overheads and open many research avenues.

Table 3.1: Accelerators studied

Accelerator            Function                            Area used by SRAM memory
AES                    Data encryption                     40.3%
JPEG                   Image compression                   53.3%
FFT                    Signal processing                   48.6%
Viterbi                Convolutional coding                55.6%
Ethernet               Networking                          90.7%
USB (v2)               Peripheral bus                      79.2%
TFT Controller         Graphics                            65.9%
Reed Solomon Decoder   Block coding                        84.3%
UMTS 3GPP Decoder      Turbo convolutional coding          89.2%
CAN                    Automotive bus                      70.0%
DVB FEC Encoder        Video broadcast error correction    81.7%
Average                                                    69.0%

Section 3.3 evaluates an RTL implementation of the accelerator store in multiple systems, demonstrating that area is reduced by 30% while maintaining customized logic's superior performance and energy, and that the performance and energy overheads due to added memory latency and contention are minor. Section 3.4 presents related work.

3.1 Accelerator Characterization

To design an efficient many-accelerator architecture, the accelerators that comprise many-accelerator systems must first be analyzed. The analysis begins with several commercial and open source accelerators, and finds that generic SRAM memories consume between 40% and 90% of each accelerator's area. SRAMs are the biggest consumer of accelerator area, but amortizing memory through sharing will greatly reduce this cost. Some accelerator memories are better shared than others, so memory access patterns in four accelerators are characterized based on capacity, bandwidth, and sensitivity to memory latency. This analysis results in a simple methodology to select which accelerator memories to share and which memories should remain private.

3.1.1 Accelerator composition characterization

The analysis begins with the composition of several open source and commercially developed accelerators when synthesized for ASIC fabrication. These accelerators implement algorithms from several widely used domains, including security, media, networking, and graphics. It is impossible to directly obtain composition data with a standard ASIC toolchain because many accelerators utilize FPGA-specific logic blocks. Instead, each accelerator was synthesized using the Xilinx ISE 10.1 FPGA synthesis toolchain to obtain FPGA memory and logic statistics. Previously measured scaling factors [45] were then applied to obtain ASIC composition figures. The analysis shown in Table 3.1 demonstrates that SRAM memories consume an overwhelming amount of area in all accelerators, up to 90%.

Memory is therefore the best target for area optimization. Leveraging the large numbers of accelerators in the many-accelerator system creates more opportunities for area reductions by sharing memories between accelerators. At any given time, some accelerators will be actively processing and turned on, while most accelerators will be dormant and VDD-gated off. These off accelerators continue consuming ASIC chip area without providing any function to the system. Unlike private accelerator memories that must be provisioned at circuit design time, shared accelerator memory can be dynamically assigned to accelerators at runtime. Via sharing, memory can be effectively provisioned to accelerators when needed, without the power, performance, and cost overheads of reconfigurable logic. This approach reduces the area cost of accelerator memory from the sum of each accelerator's memories to the much lower sum of memories used by accelerators at any point in time.

Shared memory also creates new memory reductions by eliminating over-provisioning and by merging redundant memories. Accelerator designers do not always know how their accelerators will be used, and over-provision memory to add flexibility. For example, a 1024 point FFT may also support 256 and 512 point FFTs. Although the core computation of any of these FFTs is the same butterfly operation, the 1024 point memory requires 4x the memory used by the 256 point FFT. Put another way, using the over-provisioned 1024 point FFT to perform a 256 point FFT wastes 75% of the FFT's memory. Although a smaller FFT would be more efficient in this case, the processor may be used for other applications requiring a larger FFT, or the application may switch between FFT configurations. Memory sharing eliminates over-provisioning by assigning memory as required by the accelerator at runtime.

Merging memories with memory sharing reduces area costs and redundant transfers. Most accelerators contain memories for storing input and output data. Outputs from one accelerator often become inputs for another accelerator, and the output memory and input memory store the same data. Sharing can merge these two memories into one shared memory, halving memory area. Memory merging also reduces data transfers because copies between the two unshared memories are no longer required.

During early experiments, all private accelerator memories were initially shared. This indiscriminate approach saved a large amount of chip area but incurred significant performance overheads. A small portion of accelerator SRAM memories were not amenable to sharing due to extremely high bandwidth requirements or sensitivity to memory latency. To identify memories that can be shared effectively and memories that should remain private, this analysis characterizes SRAM memories from four accelerators, resulting in significant area reductions and minimal performance overheads.

3.1.2 Memory access pattern characterization

The access patterns of each memory were profiled using RTL for the first four accelerators in Table 3.1 (Viterbi, AES, FFT, and JPEG) to determine which attributes result in poorly performing shared memories. The function and design of these four accelerators vary significantly, ensuring the AS will provide high performance shared memory for accelerators in all domains.

Each accelerator's RTL is first instrumented to record every memory access, and test workloads are then executed for each accelerator. This determines the average bandwidth, maximum bandwidth, and bandwidth variation for every memory over full workloads. Bandwidth use for each memory is shown in Figure 3.2.

Each memory's degree of dependency, a measure of latency sensitivity, is also analyzed. Sharing memories adds more logic and latency into each memory access. It is therefore important to ensure shared memories are latency insensitive. A memory is latency sensitive and highly dependent if the results of previous memory accesses contain information necessary to perform the next access. Highly dependent memories with long access latencies exhibit poor performance because logic must stall while waiting for memory accesses to complete. A memory used for linked list traversal exhibits high dependency because the first node must be completely read from memory to determine the memory address of the next node to read. Conversely, memory accesses from non-dependent memories do not depend on previous accesses to complete and can be planned in advance. Therefore, non-dependent memories are insensitive to increased memory latency and amenable to memory sharing. FIFOs are inherently non-dependent because access order is fixed by design (first in, first out). Many random access memories are non-dependent or can be designed non-dependent by pipelining memory accesses.

Figure 3.2: Total memory bandwidth utilization of several accelerators. (a) Viterbi decoder, (b) AES-128 encrypter, (c) 1024 point, 16-bit FFT, (d) JPEG encoder, one horizontal line. Plots indicate one round of computation; labels on the Y-axis indicate the maximum binned bandwidth (bits/cycle) for each memory.

Using capacity as well as the above dependency and bandwidth characteristics, each accelerator memory is placed into one of four categories: inter-accelerator FIFO, intra-accelerator FIFO, large internal RAM, and small lookup table.

Inter-accelerator FIFOs are used to load data into the accelerator for processing, or load data out after processing. Accelerators typically communicate with the system over shared buses, which are subject to arbitration delays for every transfer. Larger transmissions are more efficient: reducing the number of transmissions by increasing transmission size results in fewer arbitration delays. These large transmissions are buffered in the inter-accelerator memories, so these memories must be large enough to store the large transmissions. Inter-accelerator memories are typically sized in the kilobytes, and are assumed to be 2 KB for accelerators that do not define inter-accelerator memory size.

Inter-accelerator FIFOs exhibit bursty bandwidths, using their full bandwidth at the beginning or end of an operation when data is streamed in or out. These memories spend many cycles completely dormant after transfers until the next batch of data is ready. This behavior is shown for every accelerator’s input and output FIFOs in Figure 3.2.

Inter-accelerator FIFOs do not introduce additional memory dependencies and can be easily pipelined. These FIFOs do not need to follow a strict schedule, provided input FIFOs are filled before the next operation begins and output FIFOs are unloaded before the next operation ends.

This timing can be relaxed further if the FIFOs are sized to handle data sets for multiple operations. Inter-accelerator FIFOs make ideal candidates for memory sharing. They consume large amounts of accelerator area, so sharing these memories results in significant area savings. Inter-accelerator FIFOs have relatively light bandwidth requirements. These memories are insensitive to the increases in access latency that memory sharing would introduce because they do not require strict schedules. Sharing inter-accelerator FIFOs also adds the unique advantage of merging input and output FIFOs between accelerators.

Intra-accelerator FIFOs feature attributes similar to inter-accelerator FIFOs, though not to the same magnitude. Some accelerators consist of several stages, each resembling a small accelerator. Intra-accelerator FIFOs are used to connect these stages and build up larger accelerators consisting of several steps. For example, the JPEG accelerator uses FIFOs to connect Huffman encoding, run-length encoding, and other stages. Accelerators use FIFOs rather than direct connections so stages can be designed independently and to ease timing complexities between stages. Intra-accelerator FIFOs are therefore resilient to memory access latencies and non-dependent. Unlike inter-accelerator FIFOs, intra-accelerator FIFOs are not all large and only some will be worth sharing. As shown in Figure 3.2, bandwidth variations are much smaller in intra-accelerator FIFOs and result in fewer idle periods. Some accelerators require large amounts of bandwidth, and others need only a small trickle (Huffman encoder, RLE). The low bandwidth memories are also the largest, resulting in the greatest area savings and lowest performance impact.

Deciding to share intra-accelerator FIFOs is a more subjective choice than for inter-accelerator FIFOs. Both FIFO types are non-dependent, but intra-accelerator FIFOs may be too small or require too much bandwidth to make sharing worthwhile. Therefore, a bandwidth and capacity based analysis is necessary before considering intra-accelerator FIFOs for sharing.

The third memory type identified is large internal RAMs. These memories are typically sized in the kilobytes and are often used for values that rarely change. For example, the FFT stores constant coefficients for its butterfly operation in four 2 KB SRAMs as seen in Figure 3.2(c). Other memories are used as internal buffers for values that must be written out of order (JPEG JFIF RAM, FFT data RAM). In most cases these memories are non-dependent because access addresses are predictable, and requests to these memories are easy to pipeline. Bandwidth requirements for these memories also tend to be low and bursty. Internal RAMs are large, usually exhibit low or no dependence, and require little bandwidth. Therefore, large internal RAMs are well suited for memory sharing.

The final memory type considered is small RAM and ROM lookup tables (LUTs). These are the smallest memories, usually only about 100 bytes. These memories typically require high bandwidths, and are often used to determine control flow, making pipelining difficult. Sharing small RAM/ROM LUTs provides little benefit due to their small size and performance overheads.

3.1.3 Shared memory selection methodology

Ultimately, the selection of which memories to share should maximize shared area and minimize performance overheads. Large, low bandwidth, and non-dependent memories are ideal for sharing.

Figure 3.3: Accelerator SRAM memories sorted by memory size per bandwidth (memory size in KB, average bandwidth in accesses/cycle, and memory size per unit average bandwidth in bits per access/cycle).

At first glance, inter-accelerator FIFOs, large internal RAMs, and some intra-accelerator FIFOs are the best candidates for sharing. To formally decide which accelerator memories to share, the following methodology is suggested:

1. Calculate the average bandwidth of each memory for all accelerators, assuming the accelerator is always processing (no idle periods).

2. Calculate and sort each memory by memory size / bandwidth. This figure provides a balance between maximizing memory sharing and minimizing performance overheads due to contention.

3. Pick memories with the largest memory size / bandwidth and weed out highly dependent memories. These memories strike a balance between area savings and low performance overheads.
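As a concrete illustration, the selection procedure above can be expressed as a small program. The sketch below is illustrative only: the structure fields, the example entries, and the way dependency is marked are assumptions for this sketch rather than part of the accelerator store design or measured data.

/* Sketch of the shared-memory selection methodology (illustrative only).
 * Sizes are in bits, bandwidth in accesses per cycle assuming the
 * accelerator is always processing, and "dependent" marks memories whose
 * next access depends on the result of the previous one. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *name;
    double size_bits;   /* memory capacity */
    double bandwidth;   /* average accesses per cycle, no idle periods (step 1) */
    int    dependent;   /* 1 if highly dependent (latency sensitive) */
} mem_t;

/* Sort by memory size per unit bandwidth, largest first (step 2). */
static int by_size_per_bw(const void *a, const void *b) {
    const mem_t *ma = a, *mb = b;
    double ra = ma->size_bits / ma->bandwidth;
    double rb = mb->size_bits / mb->bandwidth;
    return (ra < rb) - (ra > rb);
}

int main(void) {
    mem_t mems[] = {
        /* Example values only, not measured data. */
        { "fft_data_in_fifo", 16384.0, 13.8, 0 },
        { "aes_key_ram",       2048.0, 16.0, 1 },
        { "jpeg_input_fifo",  65536.0, 10.6, 0 },
    };
    size_t n = sizeof mems / sizeof mems[0];

    qsort(mems, n, sizeof mems[0], by_size_per_bw);      /* step 2 */

    for (size_t i = 0; i < n; i++) {
        if (mems[i].dependent)                           /* step 3: weed out */
            continue;
        printf("share: %s (%.0f bits per access/cycle)\n",
               mems[i].name, mems[i].size_bits / mems[i].bandwidth);
    }
    return 0;
}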

This methodology balances area savings and performance overheads effectively. It is also application independent, since each accelerator is assumed to always be active and processing at maximum ability. Optimizing for each accelerator's worst case bandwidth requirements prevents the system from incurring prohibitive performance overheads regardless of the applications running on the processor. This methodology's success for multiple applications is demonstrated in Section 3.3.

When the above methodology is applied to the memories contained in the four characterized accelerators, the results reveal that inter-accelerator FIFOs, large internal RAMs, and the larger intra-accelerator FIFOs are the most shareable. As Figure 3.3 shows, larger memories also tend to use less bandwidth. The figure is sorted from most shareable (large size, low bandwidth) on the left to least shareable (small size, high bandwidth) on the right. Although the correspondence is not absolute, the memories providing the biggest sharing benefits also tend to impose the lowest bandwidth demands. This factor, together with the previous analysis showing that the largest memories are the least dependent, means that near-maximum memory sharing with minor performance overheads is possible.

Memory sharing provides a significant opportunity to slash on-chip area in many-accelerator systems. Using the proposed methodology for selecting shared memories, near-maximum area savings and minor performance overheads are possible. The following section describes the AS architectural framework for sharing memories based on these findings.

3.2 Accelerator store design

This section describes the accelerator store (AS), an architectural framework to support memory sharing in many-accelerator systems. The accelerator store contains SRAMs for accelerators to store internal state or to communicate with other accelerators. It manages how this shared memory is allocated and provides accelerators with access to shared memory. This section describes AS features, the AS’s architecture, using multiple ASes for scalability, the accelerator/AS interface, and the software/AS interface.

Figure 3.4: Accelerator store system architecture. A GP-CPU core and several accelerators sit on the system bus; each accelerator connects through ASPorts to the accelerator store's priority table, channels, handle table, and SRAMs, while a management interface and an OCN connection attach the accelerator store to the rest of the system.

3.2.1 Accelerator store features

Handles are integral to the accelerator store’s ability to share memory with accelerators. A handle represents a shared memory stored in the accelerator store, similar to the way a file handle represents a file in the C programming language or a virtual address space represents a region of memory used by an application. Each accelerator store can contain several shared memories and uses handles to keep track of these memories. To create a shared memory in the accelerator store, the system adds a handle (with a corresponding handle ID) with the shared memory’s configuration. The system passes this handle ID (HID) to an accelerator, giving the accelerator access to the shared memory.

Accelerators can use multiple handles if more than one shared memory is needed by retaining HIDs for each handle. The system can pass the HID to multiple accelerators, so the accelerators can exchange data through shared memory. Accelerators include the HID when sending access requests to the accelerator store to access shared memory. When shared memory is no longer needed, the system can remove it by deleting its corresponding handle from the accelerator store and informing accelerators that the matching HID is invalid. Specifics about how the accelerator store manages handles are given in Section 3.2.2, and details about how system software can configure handles are discussed in Section 3.2.5.

Support for handles also allows the accelerator store to emulate multiple types of memories:

• Random access (RA): RA memories allow accelerators to read or write data at any location in the shared memory. RA memories are useful for representing shared internal RAMs.

• FIFOs: FIFO memories can only put data in or get data out, in first in/first out order. FIFOs are useful for representing shared inter- or intra-accelerator queues. Support for FIFOs provides a simpler alternative to DMA controllers for exchanging data between accelerators.

• Hybrid: Hybrid memories combine RA and FIFO types, and support reads, writes, puts, and gets. Hybrid addressing maps address 0 to the head value (next value out) and addresses increase moving toward the tail value. Unlike puts or gets, hybrid reads and writes do not add or remove values. Hybrid memories are useful for operations that stream in blocks of data that are used out of order.
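A minimal sketch of how software might model these three memory types is shown below. The type names, fields, and head/tail bookkeeping are assumptions for this sketch, not the accelerator store's actual encoding; the main point is the hybrid addressing rule, in which offset 0 maps to the FIFO head.

/* Illustrative model of the three shared-memory types (assumed fields). */
#include <stdint.h>

typedef enum { MEM_RA, MEM_FIFO, MEM_HYBRID } mem_type_t;

typedef struct {
    mem_type_t type;
    uint32_t   base;   /* starting physical address of the handle */
    uint32_t   size;   /* capacity in words */
    uint32_t   head;   /* offset of the next value out (FIFO/hybrid) */
    uint32_t   tail;   /* offset of the next free slot (FIFO/hybrid) */
} shared_mem_t;

/* Hybrid read/write addressing: offset 0 maps to the head value and
 * offsets increase toward the tail; no values are added or removed. */
uint32_t hybrid_to_physical(const shared_mem_t *m, uint32_t offset) {
    return m->base + ((m->head + offset) % m->size);
}

/* A put advances the tail; a get would advance the head analogously. */
uint32_t put_address(shared_mem_t *m) {
    uint32_t addr = m->base + m->tail;
    m->tail = (m->tail + 1) % m->size;
    return addr;
}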

The accelerator store keeps leakage power low by VDD-gating unused SRAMs. Initially, all physical memory in the accelerator store is turned off, keeping leakage to a minimum. SRAM memory must be kept on to retain data, so a memory is turned on once it holds any value. Unfortunately, current SRAMs must be completely on or off, and VDD-gating part of an SRAM is problematic. To keep leakage power low, each accelerator store contains several 2KB and 4KB SRAMs rather than one large SRAM. This design allows the accelerator store to save power by turning off SRAMs that do not contain valid data at a finer granularity. If the accelerator store used only one large SRAM, the entire memory would be forced to turn on, even if the SRAM only stored one word.

Once a word is written, shared random access memories must remain on until the memory's handle is removed from the accelerator store. The accelerator store VDD-gates shared FIFO and hybrid memory more aggressively because it can automatically identify memory that does not contain valid data. Once an accelerator gets a value from a shared FIFO, the accelerator store knows that value is no longer stored in the FIFO. Large FIFOs may span multiple SRAMs in the accelerator store, and the accelerator store will turn off any SRAMs allocated to a FIFO if the SRAMs do not contain valid FIFO data.

Table 3.2: Example handle table layout

HID | Alloc (y/n) | Type | Start Addr | Mask (Size)   | Head Offset | Tail Offset | Full (y/n) | Trigger
0   | Y           | RAND | 0x0600     | 0xFF00 (256)  | X           | X           | X          | X
1   | Y           | FIFO | 0x0000     | 0xFC00 (1024) | 0x0081      | 0x0081      | Y          | 224
2   | Y           | FIFO | 0x0400     | 0xFE00 (512)  | 0x0010      | 0x0004      | N          | 0 (off)
3   | N           | X    | X          | X             | X           | X           | X          | X
...

To prevent accelerators from accidentally destroying handles mapped to other accelerators, only the general purpose CPU (GP-CPU) can add or remove handles. In most cases, this scheme limits accelerators to modifying handles they are mapped to. Of course, if GP-CPU software incorrectly maps handles to accelerators, or accelerators cannot be trusted to access only the handles they are mapped to, accelerators can make unauthorized handle accesses. If more security is necessary, a list of allowed accelerators can be added to each handle in the handle table, blocking accelerators from making unauthorized handle accesses.

3.2.2 Architecture of the accelerator store

The accelerator store depends on several elements, as shown in Figure 3.4. A GP-CPU first configures the accelerator store and accelerators over the system bus and AS management interface. Accelerators can then use their ASPorts to send access requests to the AS's priority table. The priority table selects as many requests as there are channels to send them. The channels send these requests to the handle table, which translates each request into a physical address for the AS's SRAMs. The channels transmit the physical address requests to the SRAMs and perform the memory access. Finally, the channels relay the access result back to the accelerator via the priority table and the ASPort.

The accelerator store relies on many small SRAMs to increase VDD-gating opportunities, as previously described. Measurements of SRAMs fabricated in a commercial 130nm process revealed that 2KB and 4KB SRAMs provided the best balance between VDD-gating granularity and arbitration overheads.

Handle Table The handle table maintains each shared memory in the AS by storing the handles for each active shared memory (Table 3.2). The handle table stores sixteen handles by default, although this number can be changed if the accelerator store is expected to simultaneously share more than sixteen memories. Each handle includes several pieces of information, including the shared memory's size, which SRAMs store the shared memory, as well as FIFO-specific settings.

After the system configures handles for each shared memory, the handle table's primary function is to translate access requests from accelerators to the AS's SRAMs. For RA shared memory requests that contain a handle ID and offset address, the handle table obtains the SRAM physical address by looking up the handle's starting address and adding it to the request's offset address. This operation is similar to the virtual to physical address mapping performed by MMUs. The translation for FIFO and hybrid shared memories is similar, but uses the head offset for gets and the tail offset for puts.
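A sketch of this lookup-and-add translation is shown below. The field names follow the example layout in Table 3.2, but the code is an illustration of the behavior rather than the RTL implementation, and the request encoding is assumed.

/* Illustrative handle-table translation (not the actual RTL).
 * RA requests: physical address = handle start address + request offset.
 * FIFO/hybrid: gets use the head offset, puts use the tail offset. */
#include <stdint.h>
#include <stdbool.h>

typedef enum { REQ_READ, REQ_WRITE, REQ_GET, REQ_PUT } req_type_t;

typedef struct {
    bool     allocated;
    bool     is_fifo;      /* FIFO or hybrid handle */
    uint32_t start_addr;   /* first SRAM address of this shared memory */
    uint32_t head_offset;
    uint32_t tail_offset;
} handle_t;

static handle_t handle_table[16];   /* sixteen handles by default */

uint32_t translate(uint8_t hid, req_type_t type, uint32_t offset) {
    const handle_t *h = &handle_table[hid];
    if (type == REQ_GET) return h->start_addr + h->head_offset;
    if (type == REQ_PUT) return h->start_addr + h->tail_offset;
    return h->start_addr + offset;   /* RA read or write */
}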

The handle table also enables the SRAM VDD-gating already discussed. An SRAM can be VDD-gated off if no handles are mapped to the SRAM, the SRAM contains an untouched RA memory, or the SRAM is mapped to a FIFO but does not contain valid FIFO data. The handle table continually monitors changes to each handle, waiting for one of the above conditions to become true for each SRAM. If so, the handle table signals the relevant SRAMs to turn off, keeping SRAM leakage power to a minimum.

The handle table's trigger allows workloads to be batched, to maximize VDD-gating energy savings and to reduce AS contention. Each FIFO handle maintains a trigger value; when the number of elements in the FIFO reaches this value, the AS raises an interrupt in the GP-CPU. The GP-CPU can then turn on an accelerator to consume the FIFO's data. Setting the trigger value close to the size of a handle's SRAM can reduce leakage power and contention. Rather than turning all accelerators on, which results in under-utilized SRAMs and high competition for AS resources, the trigger allows accelerators to turn on only when a batch of work is available, and without turning on additional SRAMs. In addition, fewer accelerators are on at any given time, resulting in fewer simultaneous AS requests, lowering contention and increasing performance.

Channels All shared memory requests from accelerators are transmitted over channels. Each channel can carry one shared memory access per cycle, so ASes servicing accelerators with higher bandwidth needs should provision additional channels. However, additional channels require extra arbitration in the accelerator store, and this extra logic will require additional area and power. Additional channels will also add more wires and increase dynamic power. If many channels are required for a many-accelerator system, multiple distributed accelerator stores should be used, as described in Section 3.2.3. In the distributed AS model, each accelerator store offers a few channels to provide a good balance between performance and overheads, which is analyzed in Section 3.3.

Priority Table The priority table controls which shared memory accesses are completed if too many accelerators contend for channels. Each accelerator has at least one ASPort for communicating with the accelerator store, and may have multiple ASPorts for increased bandwidth. Each of these ASPorts is identified by an ASPort ID. The system configures the priority table by assigning a priority to each ASPort ID, so that some ASPorts have priority over others. Assuming the accelerator store has n channels, the priority table will select up to n memory requests from the ASPorts with the highest priorities. This approach can be used to ensure that time-sensitive operations take precedence over operations with flexible timing requirements. The priority table can be modified at runtime, so round-robin and other arbitration schemes can be implemented in software.

Management Interface The accelerator store also features a management interface accessible over the system bus, allowing the system to configure the handle table and priority table dynamically at runtime. The management interface is memory mapped, so software can use memory load and store instructions to add or remove shared memories in the handle table, or modify arbitration settings in the priority table.

3.2.3 Distributed accelerator store architecture

As the number of accelerators in many-accelerator systems reaches the tens or hundreds, the accelerator store will grow accordingly. The number of channels and ASPorts required to support hundreds of accelerators will not scale within a single accelerator store; instead, systems should include multiple accelerator stores, as shown in Figure 3.1(c). Each accelerator's ASPort will be directly connected to a single AS and it will primarily use this AS to keep access latencies low. As a result, the system topology will consist of several clusters of accelerators, each cluster surrounding a single AS.

An accelerator may need to communicate with ASes outside of its cluster in some cases. This may happen if the accelerator's primary AS is fully allocated, or if the accelerator must communicate with an accelerator tied to a different primary AS. In these cases, the accelerator will utilize an on-chip network (OCN), allowing accelerator stores to communicate directly. OCNs have been well studied in the network-on-chip (NoC) community [10, 61] and are not duplicated here. Several OCN topologies can be used to connect the distributed accelerator stores, and a grid topology will most likely result in the best scalability.

Although OCNs will introduce additional latency, they will not result in significant performance overheads. Only latency insensitive memories are shared, as discussed in Section 3.1, so the additional OCN latency will not cause stalls or noticeably affect performance. Further, accelerators will use their primary AS most of the time, and will only occasionally use other ASes when communicating with accelerators in other clusters. Today's SoCs use high-latency DMA transfers to exchange data between accelerators, so the OCN will not introduce any new high-latency accesses.

Communicating with distributed ASes is simple from the accelerator's viewpoint. To each accelerator, the system appears to contain one AS. This abstraction is achieved by pre-assigning handle IDs to each AS at circuit design time. For example, the first AS would contain HIDs 0 through 15, the second AS would contain HIDs 16 through 31, and so on. To access a shared memory in the first AS, an accelerator would simply use a HID from 0 to 15, regardless of the accelerator's primary AS. Each accelerator store maps HIDs to ASes and will forward access requests over the OCN if the request does not match the primary AS's HIDs. Therefore, the software compiler or dynamic allocation libraries are responsible for assigning AS handles to accelerators in the same cluster whenever possible.
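Under this pre-assignment scheme, routing an access request reduces to integer arithmetic on the HID. The sketch below assumes sixteen handles per accelerator store, as in the example above; the function names are illustrative, not part of the design.

/* Illustrative HID-to-AS routing for the distributed accelerator store.
 * Assumes each AS owns 16 consecutive handle IDs (AS 0: HIDs 0-15,
 * AS 1: HIDs 16-31, and so on). */
#include <stdint.h>
#include <stdbool.h>

#define HANDLES_PER_AS 16u

static inline uint32_t hid_to_as(uint32_t hid) {
    return hid / HANDLES_PER_AS;
}

/* An AS forwards a request over the OCN only when the HID belongs to a
 * different accelerator store. */
static inline bool must_forward(uint32_t hid, uint32_t local_as_id) {
    return hid_to_as(hid) != local_as_id;
}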

3.2.4 Accelerator/accelerator store interface

Each accelerator communicates with the accelerator store over ASPorts, as previously mentioned. Accelerators may contain one ASPort, or add additional ASPorts to increase bandwidth.

Each ASPort carries three types of messages. First, the accelerator sends access requests to the accelerator store. An access request contains the type of request (read, write, get, put) and a HID. If the access is a read or a write, the request will also contain an address offset. The accelerator store may not be able to satisfy a request in certain cases and will send the accelerator an access reply. For example, an accelerator may have tried to do a FIFO get on an empty FIFO. If such an error occurs, the AS will send an access reply with an error code describing the problem. Finally, the AS sends an access response back to the accelerator when the access completes. If the access was a read or a get, the access response will contain the accessed data. If the access was a write or a put, the access response simply indicates that the access completed.
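The three ASPort message types can be pictured with the following C structures. The field layout and widths are assumptions for illustration; the dissertation does not specify an exact encoding.

/* Illustrative ASPort message formats (field widths are assumed). */
#include <stdint.h>

typedef enum { AS_READ, AS_WRITE, AS_GET, AS_PUT } as_req_type_t;

typedef struct {            /* accelerator -> AS */
    as_req_type_t type;
    uint8_t       hid;      /* handle ID of the shared memory */
    uint32_t      offset;   /* used only for reads and writes */
    uint32_t      data;     /* used only for writes and puts */
} as_access_request_t;

typedef struct {            /* AS -> accelerator, only on failure */
    uint8_t error_code;     /* e.g., a get on an empty FIFO */
} as_access_reply_t;

typedef struct {            /* AS -> accelerator, on completion */
    uint32_t data;          /* valid for reads and gets */
} as_access_response_t;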

The accelerator store is designed to efficiently support memory sharing in a many-accelerator system. The AS is the result of many design choices necessary to keep overheads low and minimize complexity for accelerator designers. However, some changes within accelerators are necessary to use AS memories. Accelerator designers will need to decide which memories are worth sharing, as shown in Section 3.1. Sharing FIFOs should be a simple process, since FIFOs are already designed to be latency insensitive. Sharing large RAMs will require memory access logic to be pipelined, which introduces small design complexities in most cases. Pipelining these memory requests is much simpler than pipelining in CPU data paths, since these accesses cannot cause branch mispredictions or pipeline flushes.

3.2.5 Accelerator store software interface

Software is a critical element in any computer system, and the many-accelerator architecture is no different. The accelerator store fully exposes the handle table and priority table so that more complex shared memory allocation and scheduling schemes can be built up without complex hardware additions. Moreover, a bridge accelerator allows software to modify the contents of shared memory in the accelerator store. The accelerator store's approach to software creates new research opportunities in system software, described below.

The handle table allows an open approach toward shared memory allocation by completely exposing the table to software. Currently, the application software developer manually allocates shared memory to accelerators, creating handles and mapping their HIDs to accelerators whenever needed. In the future, the number of accelerators in many-accelerator systems will scale upward, and manual allocation will become too complex. Instead, software compilers and dynamic memory allocation software libraries could solve this problem.

Compilers can improve memory allocation using an automated static allocation system. Metadata can be written into software so that the compiler knows each accelerator's shared memory requirements. For example, the AES encrypter requires two 2KB FIFOs to operate. Instead of forcing the software designer to allocate these two FIFOs in the AS's handle table, the compiler could do this automatically through macros or programming language primitives.

Ultimately, dynamic memory allocation will provide the most robust approach to allocating shared memories. Similar to malloc(), software libraries can keep track of the contents of each accelerator store and allocate shared memory on demand at runtime. This approach uses shared memory more efficiently, because the system will have a better understanding of what memory is available at runtime.
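An accelerator store allocation library could expose a malloc()-like call along the following lines. The function names, the trivial bump allocator, and the constants are all hypothetical; the sketch only conveys the intended division of labor under those assumptions.

/* Hypothetical dynamic allocation layer for accelerator store handles.
 * The free-space tracking is a trivial bump allocator purely for
 * illustration; a real library would track handle table contents. */
#include <stdint.h>

#define AS_INVALID_HID  0xFFu
#define AS_TOTAL_BYTES  (64u * 1024u)   /* assumed shared SRAM capacity */
#define AS_NUM_HANDLES  16u

typedef enum { AS_TYPE_RA, AS_TYPE_FIFO, AS_TYPE_HYBRID } as_mem_type_t;

static uint32_t next_free_addr = 0;     /* bump allocator state */
static uint8_t  next_free_hid  = 0;

/* Placeholder for programming the memory-mapped handle table. */
static void as_handle_table_write(uint8_t hid, as_mem_type_t type,
                                  uint32_t start_addr, uint32_t bytes) {
    (void)hid; (void)type; (void)start_addr; (void)bytes;
    /* A real system would issue memory-mapped stores to the accelerator
     * store's management interface here. */
}

/* Allocate a shared memory at runtime and return its handle ID. */
uint8_t as_malloc(as_mem_type_t type, uint32_t bytes) {
    if (next_free_hid >= AS_NUM_HANDLES)          return AS_INVALID_HID;
    if (next_free_addr + bytes > AS_TOTAL_BYTES)  return AS_INVALID_HID;

    uint8_t hid = next_free_hid++;
    as_handle_table_write(hid, type, next_free_addr, bytes);
    next_free_addr += bytes;
    return hid;   /* software then passes this HID to the accelerators */
}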

The priority table's full software accessibility can enable more complex scheduling schemes as well. By default, the priority table arbitrates by always picking certain ASPorts over others. To implement a round-robin scheme, the priority table is rotated periodically, moving the lowest priority ASPort to the highest priority slot and moving the other ASPorts down one slot. This gives each accelerator equal time in the top priority slot and eliminates any contention-related starvation. A real-time scheduler could also be implemented by coordinating the priority table with software. If a certain accelerator needs guaranteed access to the accelerator store for a fixed period of time, software could place that accelerator's ASPort at the highest priority for that period. Although the priority table's initial configuration is simple, software can add more complex and robust schedulers.
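A software round-robin scheduler would simply rotate the table contents, for example once per timer tick. The register address and layout below are hypothetical; the point is only that rotation amounts to a handful of memory-mapped writes.

/* Illustrative round-robin rotation of the priority table, modeled as an
 * array of ASPort IDs ordered from highest to lowest priority. The
 * memory-mapped layout and base address are assumptions. */
#include <stdint.h>

#define NUM_ASPORTS 8

static volatile uint32_t *const priority_table =
    (volatile uint32_t *)0x40001000u;   /* hypothetical base address */

void priority_round_robin_tick(void) {
    /* Move the lowest-priority ASPort to the top slot and shift the others
     * down one slot, so every ASPort eventually holds the top slot. */
    uint32_t lowest = priority_table[NUM_ASPORTS - 1];
    for (int i = NUM_ASPORTS - 1; i > 0; i--)
        priority_table[i] = priority_table[i - 1];
    priority_table[0] = lowest;
}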

The many-accelerator system contains one or more GP-CPUs, which may wish to access shared memories in the AS. To enable this, a bridge accelerator can be used to bridge the GP-CPU's system bus with the AS. The bridge accelerator is memory mapped on the system bus, enabling the GP-CPU to access shared memory in the accelerator store by writing commands over the bus. The bridge accelerator will replay these access requests to the AS. From the AS's viewpoint, the bridge accelerator is no different from any other accelerator and can be prioritized in the priority table. When the access completes, the accelerator store will return the access response, and the bridge will relay the result back to the GP-CPU.
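From the GP-CPU's perspective, the bridge could look like a small set of memory-mapped command registers. The register names, addresses, and polling protocol below are hypothetical; they illustrate the command-and-replay flow only.

/* Hypothetical memory-mapped bridge accelerator interface. */
#include <stdint.h>

typedef struct {
    volatile uint32_t cmd;      /* request type (read/write/get/put) */
    volatile uint32_t hid;      /* handle ID of the shared memory */
    volatile uint32_t offset;   /* offset for reads and writes */
    volatile uint32_t wdata;    /* data for writes and puts */
    volatile uint32_t status;   /* nonzero when the response is ready */
    volatile uint32_t rdata;    /* data returned by reads and gets */
} bridge_regs_t;

#define BRIDGE ((bridge_regs_t *)0x40002000u)   /* assumed base address */
#define BRIDGE_CMD_READ 0u

/* Read one word of AS shared memory from software via the bridge. */
uint32_t bridge_read(uint8_t hid, uint32_t offset) {
    BRIDGE->hid    = hid;
    BRIDGE->offset = offset;
    BRIDGE->cmd    = BRIDGE_CMD_READ;    /* bridge replays this to the AS */
    while (BRIDGE->status == 0) { }      /* wait for the access response */
    return BRIDGE->rdata;
}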

The bridge scheme separates AS memory from the GP-CPU’s system memory. Future systems could integrate these SRAM memories in the same address space, but doing so is beyond the scope of this paper. Such a change would require managing simultaneous accesses to the same SRAM from both accelerators and the GP-CPU, as well as complications to SRAM VDD-gating and caching.

The accelerator store is carefully designed to make sharing memory as simple as possible and keep performance and energy overheads minimal. In the following section the accelerator store’s ability to satisfy these goals is evaluated.

3.3 Accelerator Store Evaluation

This section evaluates the accelerator store's ability to reduce area while keeping performance and power overheads low for two applications. The first application is representative of embedded systems and utilizes several accelerators processing a highly serial dataflow. An alternative application representing a server workload utilizes several JPEG encoders operating in parallel. The implementation of the accelerator-based model is presented first, before evaluating the benefits and overheads of sharing accelerator memories.

3.3.1 Accelerator-based system model

Six accelerators in the application are modelled: AES, JPEG, FFT, a digital camera, an ADC, and a flash memory interface. The first three accelerators are derived from the real accelerators obtained from OpenCores and Xilinx in Table 3.1. Each SRAM memory in the accelerator is instrumented to record every read, write, put, or get at every cycle. The resulting traces are used to simulate the accelerator access patterns with contention and latency by playing back the memory access traces in order, and maintaining the accelerator's timing between accesses. The analysis optimistically assumes that the accelerator store can satisfy accesses in a single cycle.

After optimizing the accelerator store configuration to minimize contention, the evaluation demonstrates the system's access latency tolerance by increasing the accelerator store's access latency up to 50 cycles. Although the analysis assumes that each memory inside the accelerator is non-dependent, as described in Section 3.1, accesses between accelerators are still treated as dependent. As an example, consider the case where accelerator A wishes to send data to accelerator B. Accelerator A must first completely write its data to shared memory before accelerator B can attempt to read it. To ensure this dependency, accelerator B stalls until the write completes.

RTL implementations for the remaining three accelerators were not available, so memory access traces were manually generated from known device characteristics. The camera is modeled after a synthetic camera found in the JPEG testbench, the ADC writes one 16-bit word at 44.1 KHz, and the flash accelerator writes data in 2 KB pages at 6 Mbps, typical settings for an SD flash card [64].

The accelerator store is implemented in Verilog RTL, and the number of channels is parameterized. The design was synthesized with Design Compiler and placed-and-routed in Encounter for a commercial 130nm process. The AS requires an area overhead of 80,000 µm², equivalent to 1% of the area used by the six accelerators from the embedded application.

The simulation model consists of three stages. Scheduling is done first, to decide when each accelerator will begin and complete an operation. This step is done without considering contention, as if the accelerators do not share memory. Memory access overlay is performed second, to copy the memory access traces obtained from real accelerators into the schedule. This step is also completed as if no memory sharing is possible. Stalling replay, the final step, considers contention due to shared memory. The access logs are replayed for each accelerator, and accelerators may stall if more requests are made than accelerator store channels are available to service them. Each accelerator tracks its own time in addition to the system time; accelerators may only increment their cycle count once all pending accelerator store requests have been satisfied. Accelerators are also able to fast forward past cycles in which the accelerator would be turned off. This feature models the situation where an accelerator stalled for a few cycles, but finished its operation and can catch up to the rest of the system.
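The stalling-replay stage can be pictured as the loop below. The data structures are simplified assumptions; the real simulator also overlays recorded traces, fast-forwards through powered-off periods, and tracks more per-accelerator state.

/* Simplified sketch of the stalling-replay stage. Each accelerator issues
 * some number of AS requests per cycle; at most 'channels' requests are
 * served per system cycle, and accelerators stall otherwise. */
#include <stdint.h>

#define NUM_ACCELS 6

typedef struct {
    uint32_t pending;       /* AS requests issued but not yet served */
    uint64_t local_cycle;   /* accelerator's own notion of time */
} accel_state_t;

void stalling_replay(accel_state_t accel[NUM_ACCELS],
                     uint32_t channels, uint64_t end_cycle) {
    for (uint64_t cycle = 0; cycle < end_cycle; cycle++) {
        uint32_t served = 0;
        for (int i = 0; i < NUM_ACCELS; i++) {
            /* Serve pending requests while channels remain this cycle. */
            while (accel[i].pending > 0 && served < channels) {
                accel[i].pending--;
                served++;
            }
            /* An accelerator advances its own clock only once all of its
             * pending requests have been satisfied; otherwise it stalls. */
            if (accel[i].pending == 0)
                accel[i].local_cycle++;
        }
    }
}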

Unfortunately, it is not possible to evaluate systems with multiple accelerator stores due to the difficulty of obtaining a large number of distinct accelerators. Instead, the evaluation demonstrates one accelerator store/accelerator cluster, and artificially increases memory access latency to simulate the additional latency the on-chip network would add.

3.3.2 Embedded application

The first application is designed to demonstrate a typical workload for a mobile, embedded device. This application combines the six accelerators previously described to implement a security monitoring system. The ADC samples a microphone at 44.1 KHz, enough to accurately capture frequencies audible to the human ear. These audio samples are processed by the FFT (1024 point, 16-bit, radix-4) every 1024 samples. The resulting frequency response is checked by the GP-CPU to look for a specific frequency, such as a car horn or glass breaking. If such an event is detected, a camera will take a picture once per second. The resulting image is compressed via the JPEG accelerator, then encrypted by the AES accelerator. The encrypted JPEG photos are finally stored in the flash accelerator.

Figure 3.5: Embedded application accelerator activity (percentage of cycles active over time for the ADC (microphone), FFT, camera, JPEG encoder, AES, and flash accelerators).

Embedded application performance

To model the most contentious workload possible, the clock frequency is reduced to the minimum at which all accelerators can still perform their tasks while meeting the timing guarantees described above. The JPEG accelerator is the limiting accelerator, so the clock frequency is set just fast enough to compress one 640x480 photograph per second (1.53 MHz). Although this clock frequency may seem slow, accelerators can compute far more per cycle than a corresponding GP-CPU. Each accelerator is checked for activity at every cycle, and the cycles are grouped into 100 bins. Figure 3.5 demonstrates how often each accelerator is actively performing work during each of the cycle bins. The JPEG, AES, and camera accelerators are active all or most of the time, and the FFT and flash memory are moderately active. The ADC samples rarely by comparison.

Contention can be kept low despite utilizing six accelerators simultaneously (many at a high duty cycle). Figure 3.6 shows the twenty most shareable accelerator memories (large memory, low bandwidth, non-dependent) as sorted in Figure 3.3. The most shareable memories are drawn bottom to top, and results are grouped into 100 bins and averaged. This figure demonstrates that most of the memories selected by the memory size/bandwidth metric require low bandwidth, and the remaining few are quite large (JPEG input FIFO, camera output FIFO). Including at least the top 15 memories would result in low bandwidth contention (only two channels are necessary) and a large percentage of accelerator memory sharing (76%). Up to roughly 25 memories can be shared before performance becomes unacceptable for any number of channels, validating the memory selection methodology from Section 3.1.3.

Figure 3.6: Embedded application "top 20 to share" memory bandwidth (aggregate memory accesses per cycle over time, broken down by the twenty most shareable memories).

Although it would not be feasible to fabricate a chip specifically for this embedded application due to design costs, it is important to gauge how effective the sorting and memory selection algorithm would be if performed for the application rather than with the application-blind approach. Even if memories are sorted based on their application bandwidth needs and chosen based on application contention plots, the result is a negligible reduction in contention and only a 2% improvement in memory savings.

The number of accelerator store channels (number of simultaneous accesses) to provision is highly dependent on the processor's design goals. As Figure 3.7 shows, additional channels can significantly alleviate performance overheads. Most applications will add additional channels because the area overhead of additional channels is low. Each additional channel requires approximately 1% of the total system area. If fewer channels are used, the processor will need to operate at an increased frequency (and voltage), resulting in a cubic increase in power. For most systems, this power increase would be unacceptable, though some embedded sensing applications with low duty cycles may not have these performance concerns. For this reason, it is necessary to provision at least three channels for all but the simplest processors.

Figure 3.7: Embedded application contention performance overhead (application performance overhead versus percentage of accelerator memories shared, for one to five channels).

Distributed Embedded Application Performance

To test the system's latency sensitivity introduced by the distributed AS OCN, the latency of each AS access is increased up to 50 cycles. Because far fewer distinct accelerators were available than one would expect to find in a many-accelerator system, a 50 cycle latency is used to reflect the unusually high delays of such a many-accelerator system. The 50 cycle latency incorporates delays due to inter-cluster network routing and contention, delays turning on VDD-gated SRAMs, and delays from AS arbitration.

Figure 3.8: Distributed AS architecture. Each AS cluster in the example system contains an AS and several accelerators.

To demonstrate why 50 cycles represents an unusually high delay in a many-accelerator system, a system consisting of 25 AS clusters, shown in Figure 3.8, is considered. The clusters are connected in a 5x5 network-on-chip (NoC) grid. Assuming each cluster resembles the previously modelled single cluster system for the embedded application, each will contain roughly six active accelerators and several others turned off. Therefore, this example system contains 150 active accelerators and many more turned off.

A 50 cycle AS access latency is an unusually high delay in this example system. Assuming each hop between clusters requires a cycle, a delay found in commercially available processors [28], the worst case round trip path requires 16 cycles. Using HSIM simulations of 2KB and 4KB SRAM HSPICE models, a worst-case SRAM power-on time of 20.5 ns was measured, or 21 cycles when operating at a 1 GHz clock frequency. Of the 50 cycle latency, the remaining 13 cycles are budgeted for AS arbitration (up to 2 cycles) and contention within the NoC. Note that most AS accesses are expected to require fewer cycles, since most accesses will not travel the maximum path or require SRAMs to power on. Rather, the 50 cycle latency demonstrates that the distributed accelerator store design performs well, adding up to a 2% performance overhead under unusual cases.

Figure 3.9: Embedded application access latency and contention performance overhead (performance overhead versus memory access latency from 0 to 50 cycles).

These low performance overheads are possible because of the careful selection of shared memories described in Section 3.1.3. If all accelerator SRAM memories were shared in the AS, performance overhead would likely exceed 100% due to data dependency delays. Instead, memories with preset access patterns, such as I/O buffers and FIFOs, are selected for sharing in the AS. These memories are the majority of accelerator SRAM area, and their access patterns are known in advance and can be pipelined. As a result, the performance overhead for accessing these memories remains below 2%, even for many-accelerator systems.

70 Embedded Application Power

The accelerator store takes several steps to achieve a low power overhead. To measure accelerator power as well as accelerator store overheads and benefits, several factors are modeled:

• Accelerator and AS power (active and leakage): leakage power is estimated using area to power ratios derived from previously synthesized logic at a commercial 130nm process.

• Wire activity: All wires used to connect accelerators to the accelerator store are conservatively assumed to be the system's length (2.354 mm). Using known characteristics [37], wire power is estimated as 0.49 pJ/bit/mm at a 1.2 V operating voltage.

• Automatic AS SRAM VDD-gating: VDD-gated memories are assumed to consume negligible power.

• Accelerator VDD-gating: Sharing memory via the AS can make accelerators easier to VDD-gate as well.

The accelerator store was configured with three ASPorts and 15 shared memories as suggested in Section 3.1. The embedded application is executed at a maximum workload for roughly three million cycles.

The accelerator store is designed to keep power costs low. As shown in Figure 3.10, the overheads of the accelerator store add an additional 8% to the total system power cost. The majority of this overhead (5.24%) is incurred by the accelerator store arbitration logic. Additional wire power (2.38%) and accelerator stalling (0.14%) use measurably less power.

The accelerator store is also able to reduce power consumption through aggressive VDD-gating. Although each accelerator could implement VDD-gating individually, accelerators that VDD-gate their memory or logic are rare in practice.

The accelerator store makes VDD-gating SRAMs guaranteed and automatic. Automatic VDD-gating of shared memory reduces power by 8.44% by VDD-gating the 24.78 KB of SRAMs that are temporarily unused on average. This may make VDD-gating accelerator logic easier as well, since 76% of accelerator memory is shared in the accelerator store and no longer private. An additional 8.81% of power can be trimmed by gating accelerator logic and leaving the remaining private memories on, or 11.76% of system power can be saved by VDD-gating accelerator logic and private memories.

Figure 3.10: Embedded application power breakdown (net power reduction for overheads only; overheads plus VDD-gated shared memory; overheads plus VDD-gated shared memory and accelerators; and overheads plus VDD-gated shared memory, accelerators, and private memories).

Leakage power is a growing concern as fabrication technologies continue to shrink. ARM has noted that the low leakage technologies SoC designers have relied on will no longer be the magic bullet when entering 45nm and smaller technologies, stating, “Whatever transistor is used, leakage management is a significant challenge that must be addressed,” and noting that 20% increases in performance will result in 1000% increases in leakage [58]. Low-leakage processes also require more active power, and judicious use of VDD-gating may be a better alternative to low-leakage flavors by keeping both leakage and active power low. Therefore, the accelerator store's support for automatic VDD-gating will become more critical in the future.

Embedded application area

By following the guidelines proposed in this paper, the processor can share 76% of its accelerator memory and keep power low, all with a minor 2% impact on performance. Until now, only accelerators in use have been considered. As Figure 3.11 demonstrates, memory sharing translates into significant memory savings. The figure starts with six accelerators and a three channel accelerator store, corresponding to the case where all accelerators and all memories are in use, including private and shared memories. In this case, there are no memory savings from unused memory amortization, and a small area overhead of 3% results from the accelerator store area overheads. If a system with more dark accelerators is considered, area savings grow quickly. The analysis continues by adding unused accelerators to the system as arranged in Table 3.1. Assuming 70% of accelerator memory can be shared on average, total system area is reduced by 30% (this area includes SRAM memory, accelerator logic, and accelerator store logic overhead). Area savings will grow further if larger regions of dark accelerators are included.

Figure 3.11: Embedded application area reduction (system area reduction versus the number of accelerators in the system, for 50% to 90% memory sharing).

Embedded application software implementation

The embedded application is easy to implement in software using the accelerator store. The first step is to allocate memory for the accelerators as needed. This is done by modifying the handle table through memory mapped I/O. For example, a 4 KB FIFO should be allocated for the ADC to put audio samples and for the FFT to get samples for frequency analysis. Although each frequency analysis requires 2 KB of data, creating a 4 KB FIFO allows the ADC to record samples even when the FFT has not finished computation. Note that this single allocation is much easier than existing buffer designs, which would require statically sized buffers in the ADC and in the FFT, and routine DMA operations to transfer data between the two. Using the accelerator store, the system could later perform a 512 point FFT and reduce the memory buffer size to 2 KB. This software-based memory resizing would not be possible under current designs, since private buffers must be statically allocated at circuit design time.
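A sketch of this first step is shown below. The handle table register layout, addresses, and field encoding are hypothetical; the example only mirrors the allocation described above (a 4 KB FIFO shared by the ADC and the FFT).

/* Hypothetical memory-mapped handle table configuration for the
 * embedded application's ADC-to-FFT FIFO. */
#include <stdint.h>

#define HANDLE_TABLE_BASE 0x40000000u   /* assumed base address */
#define HANDLE_STRIDE     0x20u         /* assumed bytes per handle entry */
#define HANDLE_TYPE_FIFO  1u

typedef struct {
    volatile uint32_t alloc;    /* 1 = handle in use */
    volatile uint32_t type;     /* RA, FIFO, or hybrid */
    volatile uint32_t size;     /* capacity in bytes */
    volatile uint32_t trigger;  /* FIFO element count that raises an interrupt */
} handle_entry_t;

static handle_entry_t *handle_entry(uint8_t hid) {
    return (handle_entry_t *)(HANDLE_TABLE_BASE + hid * HANDLE_STRIDE);
}

void setup_audio_fifo(uint8_t hid) {
    handle_entry_t *h = handle_entry(hid);
    h->type    = HANDLE_TYPE_FIFO;
    h->size    = 4096;      /* 4 KB: room for two 2 KB FFT input blocks */
    h->trigger = 1024;      /* interrupt once 1024 samples are buffered */
    h->alloc   = 1;
    /* The same HID is then passed to both the ADC (producer) and the FFT
     * (consumer), so no DMA copy between private buffers is needed. */
}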

Second, the priority table must be configured through memory mapped I/O. Memory for recording audio samples and photographs receives a high priority to ensure sampling is performed at strict time intervals. Compression and encryption are less time sensitive, so memories used for these tasks receive lower priorities.

Third, software needs to initialize interrupts and implement interrupt handlers for events occurring during accelerator operation. For example, the handle table should be configured to trigger an interrupt when the number of samples stored in the ADC FIFO reaches 1024, enough to perform an FFT. A short interrupt handler would turn the FFT on, pass the ADC FIFO's handle to the FFT (so it knows where to read samples from), and activate the frequency analysis operation.
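The corresponding interrupt handler could be as small as the following; the accelerator control registers and the command encoding are assumptions for illustration, not the actual FFT interface.

/* Hypothetical interrupt handler: 1024 audio samples are buffered, so
 * power on the FFT, point it at the ADC FIFO's handle, and start it. */
#include <stdint.h>

typedef struct {
    volatile uint32_t power;      /* 1 = accelerator VDD on */
    volatile uint32_t input_hid;  /* handle ID to read samples from */
    volatile uint32_t start;      /* write 1 to begin frequency analysis */
} fft_regs_t;

#define FFT ((fft_regs_t *)0x40003000u)   /* assumed base address */

static uint8_t adc_fifo_hid;              /* HID set up during initialization */

void adc_fifo_trigger_isr(void) {
    FFT->power     = 1;               /* turn the FFT accelerator on */
    FFT->input_hid = adc_fifo_hid;    /* tell it where the samples are */
    FFT->start     = 1;               /* launch the 1024-point FFT */
}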

Software design for accelerator-based tasks in the many-accelerator framework is simple. Most of the computation is performed by accelerators, so software takes on an accelerator management role.

3.3.3 Server application

The sample application representing a server workload is highly parallelized, unlike the serial workflow found in the embedded application. This server application features multiple JPEG encoders compressing separate images in parallel. This application could be used by Facebook, Flickr, or other photo web sites, which must compress many pictures in parallel.

The level of parallelism is swept from one JPEG encoder (no parallelism) to four JPEG encoders (4x parallelism) to examine the accelerator store's performance in a server-class application. Each JPEG's workload is staggered, starting slightly after the previous one to prevent simultaneous bandwidth surges. As Figure 3.2(d) shows, the JPEG accelerator's bandwidth requirements are highly variable.

The server application will require more channels than the embedded application due to the increased use of the JPEG encoder. The JPEG encoder is quite demanding and requires the most bandwidth of the investigated accelerators. Therefore, one channel is provisioned for every JPEG encoder operating in parallel. Processors designed for server workloads are expected to provision more channels than processors for embedded applications.

Figure 3.12: JPEG server application performance overhead (application performance overhead versus percentage of accelerator memories shared, for one to four parallel JPEG encoders with matching channel counts).

Including more copies of the accelerator in parallel results in better performance, as shown in Figure 3.12. For a single JPEG encoder with one ASPort, the performance overhead would be 10% for any significant amount of memory sharing due to the JPEG's bandwidth demands. By staggering each JPEG's execution, the JPEG accelerators rarely demand maximum bandwidth at the same time and the increased bandwidth is amortized between the JPEGs. As a result, the 4x JPEG configuration is able to share more than 85% of its memory with less than 1% performance overhead. This result shows that it is best to include many accelerators under the accelerator store to amortize channels and improve performance.

Figure 3.13: Comparison of architecture styles. (a) Parallel AES scheduling, (b) JPEG encoder line processing delays, (c) parallel JPEG encoder scheduling, (d) memory savings for parallel accelerators.

Parallel workload scheduling optimizations

Well-scheduled workloads can greatly improve performance and reduce memory provisioning in parallelized shared memory systems. In the server application, the channels are amortized by staggering JPEG execution to smooth performance requirements for the multiple accelerators. This technique can be taken further to support multiple accelerators with less provisioned memory and little or no performance impact.

The gains of a staggered AES server are considered first. The AES accelerator consists of four RAMs (two data, two constants) and two I/O FIFOs. The previous memory selection analysis showed that the two I/O FIFOs are shared, but the remaining four RAMs should remain private due to their high bandwidth requirements. Each of these FIFOs is only in use for 8 out of the 117 cycles (7%) required to encrypt a single 128-bit block. The bandwidth required by a server utilizing multiple AES accelerators in parallel would increase linearly if each accelerator started at the same time. By staggering the accelerator workloads and ensuring only one FIFO is active at a time, seven AES accelerators can run in parallel using one channel without any contention or performance overheads. Figure 3.13(a) demonstrates how the schedule is formed to implement this approach.
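As a rough capacity argument (an interpretation of the numbers above rather than a calculation stated explicitly in the text), each AES accelerator occupies its input FIFO for 8 cycles and its output FIFO for 8 cycles of every 117-cycle block, so a single channel can interleave

\[
\left\lfloor \frac{117\ \text{cycles per block}}{8 + 8\ \text{FIFO-active cycles per accelerator}} \right\rfloor
\;=\; \left\lfloor \frac{117}{16} \right\rfloor \;=\; 7
\]

accelerators without any two FIFO accesses colliding, and a second channel doubles the limit to fourteen.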

The well-scheduled AES server can also reduce the required amount of provisioned shared memory. Accelerators relying purely on private memories must include two I/O FIFOs for each AES accelerator. Shared accelerator systems can relax this restriction and share the same I/O FIFOs. The staggered schedule approach ensures one AES accelerator will read in a 128-bit block, the next block will be read by the next accelerator, and so on. As a result, there is no need to provision I/O FIFOs for each accelerator, and the 7x AES server can reduce its memory provisioning requirements by a factor of seven as well. If the system has access to a second channel, up to fourteen AES accelerators can stagger their inputs while guaranteeing that accelerators will not try to access the I/O FIFOs at the same time. Therefore, this staggered approach can reduce memory provisioning requirements by 14x (26 KB), significantly reducing system area.

The evaluation now considers the more complex staggered JPEG server. As previously observed, staggering can improve performance by 10x for the JPEG server. A 2x memory area reduction of the extremely large JPEG input FIFO is also possible in return for a small performance overhead of 9%. The JPEG server is more difficult to schedule because the delay to perform a single round of processing (an 8-pixel horizontal line of an image) is not constant, unlike the AES accelerator. Figure 3.13(b) shows the distribution of processing delays over the compression of an image. To maintain a consistent schedule the time to compute each line must be constant, so each line requires the maximum time indicated by the distribution of processing times. As a result, lines which require less than the maximum time must be stalled, causing the 9% performance overhead.

The JPEG input FIFO’s memory usage profile must be understood to minimize memory provisioning requirements. The JPEG algorithm processes the images in 8x8 pixel squares. This presents a problem as uncompressed images are typically rasterized, drawing a horizontal line one pixel thick at a time. The accelerator store can alleviate this problem through support for hybrid

memories (random access and FIFO). The accelerator providing the uncompressed image (camera, for example) can use random access to rearrange the image so the pixels are arranged in order of the

8x8 pixel squares required by the JPEG accelerator. The JPEG accelerator can use the same handle as a FIFO to read out the pixels in processing order. Now that a FIFO is used, the accelerator store can turn off memories as the contents of the FIFO shrink. This reorganization technique will not result in any memory provisioning savings since the maximum required memory for the input FIFO will remain large enough to store the 8-pixel wide line. However, the maximum memory requirements can be reduced when several JPEG accelerators use a staggered parallel schedule.

As a result of the FIFO reorganizations made possible by the accelerator store, the input FIFO will immediately use the space required by the 8-pixel tall line and linearly reduce the memory consumption until the line is processed. Through effective staggering, the sawtooth pattern in

Figure 3.13(c) can be exploited to reduce the maximum memory utilization of the JPEG encoders when considered in sum. For example, when two JPEGs are operating in parallel, the second JPEG accelerator will be scheduled to begin processing a line when the first accelerator has compressed

50% of its line. Under this case, the first accelerator’s input FIFO will require enough memory to store half a line while the second JPEG encoder needs enough memory to store a full line, resulting in a 25% reduction in memory provisioning requirements. If three JPEG encoders are used, the staggering schedule will result in the input FIFOs requiring memory to store 33%, 67%, and 100% of an 8-pixel wide line, corresponding to a 33% reduction in area. As Figure 3.13(d) demonstrates, memory provisioning reductions approach 50% (168 KB) as more JPEG encoders are used, resulting in significant system area reductions.
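The 25% and 33% figures follow directly from the even stagger, and the trend toward 50% can be checked with a short sketch (illustrative only; the function is not part of the accelerator store implementation):

# Sketch: input-FIFO provisioning reduction for n evenly staggered JPEG encoders.
# At any moment the FIFOs hold 1/n, 2/n, ..., n/n of a line instead of n full lines.
def jpeg_provisioning_reduction(n):
    staggered = sum(i / n for i in range(1, n + 1))   # total storage, in lines
    return 1.0 - staggered / n                        # equals 1 - (n + 1) / (2 * n)

for n in (2, 3, 8, 14):
    print(n, round(jpeg_provisioning_reduction(n), 2))   # 0.25, 0.33, 0.44, 0.46

As the encoder count grows, the reduction approaches the 50% limit shown in Figure 3.13(d).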

3.4 Related work

As mobile computing has become pervasive, SoCs have been developed to provide general purpose and application specific processing on one platform. In one example system, separate scratch pad memories are included within a security accelerator that must be managed by the application programmer or software library [8]. Brick and mortar assembly techniques could provide a low cost approach to piece together accelerator-based architectures without requiring a new lithographic mask for each collection of accelerators [43]. This work does not address the need for dynamic

allocation of memory for each accelerator, nor does it specify an interface to communicate between accelerators without copying data between accelerators. Programming environments for heterogeneous systems have been proposed which could be applied to the accelerator store [48]. These library based frameworks are used to map general purpose code down to graphics accelerators.

Supporting advanced memory management in hardware for accelerator based architectures has been explored by a few groups, but without the full features provided by the accelerator store. An SoC dynamic memory management unit (SoCDMMU) has been proposed for multicore

SoCs that provides support for malloc() in hardware [65]. This system does not support advanced structures (FIFOs, hybrids), automatic VDD-gating, or handles. Smart memories features a general purpose microcontroller in multiple tiles that interact with memory [52]. Unlike smart memories, the accelerator store targets heterogeneous accelerated systems and includes support for automatic memory VDD-gating.

Sharing memory within functional units in general purpose processors has been investigated [55]. This work optimizes shared memory between GP-CPU blocks (L1, BTB, etc.) using two tiered caching. Disaggregated Memory shares memory between server blades [46]. In contrast, the AS introduces novel features designed for accelerators including FIFO support, improved inter-accelerator communication by eliminating redundant copies, automatic VDD-gating, configurable priorities, and highly configurable memory management.

Chapter 4

ShrinkFit

Reconfigurable FPGA logic has grown into a multi-billion dollar industry by enabling accelerator-based system designs without the high initial costs and long production times of ASICs. Despite the potential for performance boosts, the difficulty of creating accelerators in Verilog or VHDL

RTL has been a lingering challenge. Advances in high level language (HLL) synthesis tools show new promise in simplifying this design process. More accessible languages, such as C, can now be used to create and test hardware accelerators in less time. By reducing barriers to entry, these HLL tools can bring hardware accelerator design to a wider audience of developers.

Commercial hybrid processors, combining FPGAs and general purpose cores [5, 2], have the potential to broaden the audience for acceleration as well. Using hybrid systems, programmers can package accelerator designs with applications, and improve application performance by deploying accelerators to FPGA fabric. This enables systems to run multiple applications, each deploying and using one or more accelerators.

One such system is a prototype of the electronic “brain” of a flying robotic bee, called RoboBee.

This hybrid processor prototype combines a general purpose Cortex-M0 core with accelerators on a

Spartan-6 FPGA, and runs a bee application. During the development of this prototype processor,

HLL tools were invaluable in easing the process of implementing accelerators. However, as much of the FPGA community has found [38, 25, 41], combining accelerators into a single system is difficult.

Current approaches for designing systems containing multiple accelerators often use HLL tools to create multiple variants in order to find the optimal hardware design. This compute intensive approach explores many combinations of architectural parameters, such as pipeline depth, creating

variants for different resource budgets [16]. Unfortunately, building systems with multiple accelerators designed in this way leads to several issues. First, selecting between the many variants of each accelerator is computationally intensive. Second, this process does not easily permit accelerators to share common resources, which may duplicate underutilized logic. Third, there is no way for accelerators to dynamically grow or shrink if resource budgets change. An alternative is to design all accelerators into the system as one combined accelerator, but this approach can further extend design time and is inflexible.

ShrinkFit, an extensible framework that facilitates the design of multi-accelerator systems, is an answer to these challenges. This framework not only applies to the RoboBee application, but to hybrid systems in general. By relying on virtualization [49], ShrinkFit allows systems to grow or shrink accelerators to flexibly fit within shared FPGA resource budgets. It reduces the computational complexity of allocating resources to each accelerator, allows accelerators to share resources, and could be combined with dynamic reprogramming to support dynamic resizing. The implementation is based on an accelerator store [50], a memory resource for accelerators to save and exchange data. To support ShrinkFit, three new elements have been added: new features to the accelerator store, a “slicer” component for managing data transfers between ShrinkFit accelerators, and a wrapper interface that simplifies adding ShrinkFit capabilities to existing accelerators. These capabilities rely on hard logic blocks, dedicated to ShrinkFit and assumed to be built into the

FPGA, in the same way FPGAs currently contain dedicated RAM, DSP, and general purpose hard cores. This chapter also introduces a software interface for building applications with ShrinkFit accelerators, including a software development kit (SDK) developed for the Python programming language which can be easily ported to other languages.

As the name suggests, ShrinkFit enables accelerators to fit within small FPGA budgets when necessary, and expand to increased resources for additional performance. This capability is demonstrated with four ShrinkFit accelerators developed for the bee brain prototype. Experimental results show that ShrinkFit enables performance of individual accelerators and the bee application to scale linearly with available FPGA resources. Overheads are low as well: ShrinkFit hard logic blocks require less than 2% of overall FPGA die area, FPGA resource overheads range from 0%-8%, and application performance overheads are under 10% on average.

4.1 Motivation

The RoboBees project, a large collaboration between many research groups, seeks to build a swarm of bee-sized, flying robots. Each RoboBee, due to its small size and limits of wing-flapping lift, must be as light and low-power as possible. Yet, each RoboBee must perform complex visual and control algorithms, which require significant processing capability. To minimize power consumption while maximizing performance, the RoboBee’s brain uses hardware accelerators as well as a general purpose Cortex-M0 processor. Unfortunately, to know which accelerators are needed and the performance requirements for each, it was necessary to first test designs on a small flying vehicle in order to explore the design space.

To evaluate accelerator needs, a custom designed helicopter brain prototype (HBP) circuit board was created that snaps onto a small helicopter (Figure 4.1). The HBP contains a low power

Spartan-6 SLX150-1L FPGA, which offers the maximum resources within the helicopter’s battery power and weight budgets. Including the FPGA chip, circuit board, and supporting components, the board weighs 4.2g, sufficiently below the 4.5g maximum for stable flight. The HBP also contains connectors to control the helicopter and to receive images from a low-resolution, high-frame rate camera optimized for autonomous flight [1]. The HBP therefore allows the RoboBees team to try new combinations of accelerators and see which can support autonomous flight.

To date, four accelerators have been identified for the project: image sharpening, edge detection, optical flow, and discrete cosine transform (DCT). The HBP uses all four accelerators simultaneously to process images from the on-board camera, and each image is processed by three tracks (Figure 4.2). The edge detection accelerator generates outlines for object recognition. At the same time, the image sharpening accelerator counteracts camera blur, and its outputs feed into the optical flow accelerator to estimate movement. The result, a two dimensional vector, allows the bee to avoid collisions and estimate distance flown. Finally, the DCT accelerator processes the image as part of a custom image compression algorithm based on the JPEG format, which compresses each image by 40% to 70%. The compressed image is then saved to flash memory for offline debugging.

Logic for each of the accelerators is implemented using the Vivado C-to-RTL high level language (HLL) compiler [4] to design each accelerator quickly. The alternative, manually implementing each accelerator in Verilog RTL, typically takes more time.

Figure 4.1: RoboBee Brain FPGA prototype. The helicopter brain prototype (black circuit board) is attached to a small helicopter and camera.

Figure 4.2: RoboBee application accelerators. The RoboBee application utilizes four accelerators. The brain prototype will use the results to compress images for offline analysis, recognize important objects, and avoid obstacles during flight.

More accelerators will be added to the RoboBee brain as other project teams identify additional algorithms to accelerate, and they will be quickly ported to hardware using HLL compilers.

In order to determine FPGA resource requirements, it was necessary to first build the HBP to test different accelerator configurations. This situation motivated the development of the ShrinkFit framework, which allows the team to take full advantage of the FPGA’s resources whether using a few accelerators early in development, or after adding more over the course of the project. Through this real-world design example, ShrinkFit was designed to generally address any FPGA system containing multiple accelerators, including hybrid processors. Throughout the remainder of this chapter, RoboBee accelerators and the RoboBee application are used to explain how ShrinkFit works and evaluate how it performs.

4.2 Conceptual approach

To support ShrinkFit, designers decompose accelerators into smaller, reusable “modules” of logic.

Once decomposed, these modules can be combined to implement resizable accelerators via virtualization, and can be shared between accelerators.

4.2.1 Decomposition

ShrinkFit facilitates accelerator designs that are composed of smaller, reusable logic modules. To add ShrinkFit support to an accelerator, it must be decomposed into these reusable logic modules. Depending on the accelerator, different approaches may work best: some accelerators can be decomposed by pipeline stage, others by portions of each entry in a dataset (such as an image or matrix), or perhaps by subsequent entries in a dataset. For example, the image sharpening accelerator, used by the RoboBee brain, could be made up of multiple identical modules that each sharpens one region of an image. Assuming each image has sixteen regions, the accelerator requires sixteen sets of computations. One option would be to program all sixteen modules into the FPGA, each computing in parallel. In contrast, ShrinkFit enables the designer to resize the accelerator at any time by programming anywhere from one to sixteen modules into the FPGA, depending on performance and/or resource constraints. This resizability can shrink the image sharpen accelerator by 94% when only one module is used, but with at least a 16× increase in computation time. Ideally, this relationship between performance and resource utilization should be linear, i.e.,

doubling resources leads to double the performance.
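A minimal model of this ideal case, assuming zero context-switch and data-movement overhead (real PMs add the small overheads quantified later in Table 4.1 and Figure 4.8), is sketched below:

# Sketch: ideal resource/performance trade-off for an accelerator split into 16 VMs.
TOTAL_VMS = 16

def ideal_tradeoff(pms, full_area=1.0, full_time=1.0):
    """Relative area and compute time when only `pms` physical modules are used."""
    area = full_area * pms / TOTAL_VMS    # 1 PM -> ~6% of the area (a 94% shrink)
    time = full_time * TOTAL_VMS / pms    # 1 PM -> 16x the computation time
    return area, time

print(ideal_tradeoff(1))   # (0.0625, 16.0)
print(ideal_tradeoff(8))   # (0.5, 2.0)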

The key concept behind ShrinkFit is that work performed by each module can be done without programming each module into the FPGA. Rather, the modules programmed into the FPGA need to be capable of doing the work of each module from the original design. To formalize the process of decomposing accelerators and designing resizable accelerators, ShrinkFit defines four terms:

• Virtual modules (VMs) are the modules from an accelerator’s original design. In the case of image sharpen, which has sixteen regions, the accelerator always contains sixteen virtual modules. VMs represent the work to be done, rather than the logic doing it.

• Physical modules (PMs) are the actual logic blocks programmed into the FPGA. As few as one physical module can be programmed into the FPGA. More PMs can be added, but the number of PMs can never exceed the number of VMs. Programming fewer PMs reduces FPGA resource utilization. Programming more PMs increases performance.

• Module designs refer to the algorithm a PM or VM implements. PMs of the same module design, such as DCT, use the same RTL and are identical, whereas PMs of different module designs are not interchangeable. A convolution PM cannot do the work of a DCT PM.

• Module contexts contain the information a PM needs to act as a VM. Section 4.2.3 describes contexts in detail.

4.2.2 Building ShrinkFit accelerators with VMs

ShrinkFit accelerators consist of VMs of one or more module designs.

Image sharpening requires convolution, so the image sharpen accelerator uses sixteen VMs of a convolution module design. Each VM sharpens one of sixteen regions of the image, and when all VMs complete, the entire image is sharpened.

The DCT accelerator also uses sixteen VMs, but each is of a DCT module design, distinct from convolution VMs. Each DCT VM calculates frequency responses for one region of the image.

The edge detect accelerator contains sixteen convolution VMs, reusing the module design developed for the image sharpen accelerator. Edge detect adds sixteen magnitude VMs as well.

When using convolution VMs for edge detect, they produce two images, one for edges in the x-dimension, and another for the y-dimension. The magnitude VMs each process one of sixteen regions from both images, to create a third image that incorporates edges from both dimensions.

The optical flow (OF) accelerator uses two new module designs, OF region and OF merge.

Sixteen OF region VMs estimate camera movement between the same region in two images. Then, the OF accelerator uses one OF merge VM to combine the data generated by each OF region VM into a single (x, y) vector.

4.2.3 Module contexts

In all ShrinkFit accelerators, including the four described above, VMs are an abstraction representing the work to be done. It is the PM, logic programmed into the FPGA fabric, that performs the computations. To bridge the gap between work and logic, PMs use “contexts,” which are blobs of data that instruct a PM how to act as a VM. Each VM has a corresponding context, and by loading it, a PM can act as its corresponding VM. For example, a convolution context contains the image region its VM corresponds to. When a convolution PM loads a context, it immediately knows which region of each image to process.

Because there may be fewer PMs programmed into the FPGA than VMs in the accelerator design, these PMs must routinely switch to perform the computation of different VMs within the accelerator. This process is known as “context switching,” because a PM will do the work of one VM for a short period of time, then switch contexts to do the work of another VM. This approach ensures that the work of all VMs will be completed regularly no matter how many PMs are programmed into the FPGA.
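Conceptually, each PM repeats a simple loop. The sketch below is only a software analogy of behavior that the ShrinkFit wrapper implements in hardware (Section 4.3.3); the function names are hypothetical:

# Software analogy (not RTL) of a PM context switching among VM contexts.
def run_physical_module(contexts, claim, compute, release):
    """Round-robin over VM contexts so every VM's work is completed regularly."""
    while True:
        for ctx in contexts:        # each context describes one VM's work
            if not claim(ctx):      # skip contexts currently held by other PMs
                continue
            compute(ctx)            # act as that VM for a short period of time
            release(ctx)            # allow other PMs to claim this VM later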

4.2.4 Accelerator resource sharing

Because PMs do not necessarily belong to one accelerator or another, they can be shared between accelerators. For example, image sharpen and edge detect accelerators both use sixteen convolution

VMs. If a system used both accelerators simultaneously, each convolution VM would have a corresponding context, resulting in a total of 32 convolution contexts. And because the system contains 32 convolution VMs, one to 32 convolution PMs could be programmed into the FPGA.

However, if fewer PMs exist in order to reduce FPGA resource utilization, they would all take

turns context switching into all 32 contexts, and do the work of all 32 convolution VMs in both accelerators. Rather than allocating different PMs to each accelerator, the accelerators share all

PMs.

In rare cases, a system designer may wish to dedicate certain PMs to a single accelerator, rather than sharing PMs between accelerators. This is easily accomplished by creating two sets of contexts for the module design, and mapping some PMs to one set, and the remaining PMs to the other set.

4.2.5 Dynamic accelerator resizing

ShrinkFit could be combined with dynamic reprogramming techniques [71, 40, 53] to add support for dynamic accelerator resizing. PMs can be turned on or off individually, so turning one off would not disturb other PMs. With this in mind, accelerators could be resized while running and without interrupting computation by using dynamic reconfiguration techniques to add or remove individual

PMs. Using dynamic accelerator resizing, ShrinkFit could continuously tailor FPGA resources to support workloads with fluctuating computational requirements and maximize performance.

4.3 Framework implementation

There are many ways to implement the conceptual approach described in the previous section as long as the implementation supports many PMs, enables PMs to context switch rapidly, and delivers PM input and output datasets quickly. With these requirements in mind, this section presents one implementation of the ShrinkFit concept.

Systems utilizing the ShrinkFit architecture include a general purpose core, several PMs, and

ShrinkFit’s hard logic blocks. Figure 4.3 provides a detailed illustration of how to use ShrinkFit for the RoboBees application. After PMs are first programmed into the FPGA fabric, the general purpose core configures PMs and various ShrinkFit hard logic blocks using the system bus to perform reads and writes as if the general purpose core was reading and writing to memory. These ShrinkFit hard logic blocks include: the accelerator store for maintaining VM contexts and data I/O, a slicer for tracking data transfers between VMs, and ShrinkFit wrappers to simplify the process of adding

ShrinkFit features to existing accelerators.

Figure 4.3: ShrinkFit framework architecture. The ShrinkFit framework consists of the accelerator store, slicer, and ShrinkFit wrappers. These components, darkened above, are permanently fabricated in the FPGA as hard logic blocks to obtain low die area overheads. All physical modules are programmed into the FPGA fabric as soft logic blocks. One or more general purpose cores may be implemented as hard or soft logic blocks.

Since these logic blocks enable ShrinkFit features for any accelerator, and are not limited to specific accelerators, they can be hard coded into the FPGA chip to minimize resource overheads.

4.3.1 Accelerator store

To facilitate data storage and movement between PMs, the system employs the accelerator store [50], presented in Chapter 3. With some modifications, the accelerator store can be used to manage contexts and relay data between PMs. For maximum performance, accelerators often require fast access to input and output datasets. With this in mind, the accelerator store maintains direct connections with every PM programmed on the FPGA via ASPorts (Figure 4.3) to ensure low-latency communication.

Handles

To support ShrinkFit, two extensions were added to RA handles: multi-word transfers and a swap operation.

Multi-word transfers reduce arbitration delays when loading or storing large blocks of data.

With a single ASPort request, VMs can load entire image regions. Multi-word transfers can access contiguous blocks of data by incrementing the address after each word, or the VM can define a custom access pattern by supplying a different address each cycle.

Atomic swap operations ensure each context is held by no more than one PM, preventing data corruption. The swap operation, which simultaneously reads from and writes to an address, also prevents other ASPorts from accessing the same location. Using swap, a PM can ensure that no other PMs are using the same context.

Bandwidth and arbitration

The accelerator store supports systems containing many PMs, each of which may make requests simultaneously. The AS has an arbiter to satisfy as many PM requests as possible and reject any remaining requests. Each PM’s direct connection to the AS, called an ASPort, contains a signal that indicates whether the request was accepted or rejected. Each ASPort has an associated priority which determines the order in which requests are accepted or rejected. If a request is rejected, the PM can try the request again on the next or a later cycle. The arbiter rotates priorities in a round-robin scheme to prevent starvation, ensuring that each ASPort can regularly access the AS.

AS bandwidth is represented in terms of “channels,” which indicates the number of requests it can satisfy per cycle. Although the channel count is parameterized for any value, in practice, provisioning more than three channels had negligible performance benefits.

4.3.2 Slicer module

The ShrinkFit framework includes a slicer module that interacts with the AS and PMs and ensures all VMs can properly access input data or store output data in RA handles. For example, the edge detect accelerator contains sixteen convolution VMs, which output images to sixteen magnitude

VMs. To function correctly, all sixteen convolution VMs must each produce their regions of the output images before the magnitude VMs can consume them. In addition, the magnitude VMs must all consume these images before the convolution VMs can overwrite them with new output images. Because there are multiple VMs producing data and a different set of VMs consuming data, a FIFO handle would not suffice: as soon as one VM performed a get, the data would be lost to the other VMs. Instead, the slicer assists the VMs in ensuring that all VMs can reliably produce and consume data from RA handles.

To improve performance, the slicer supports varying amounts of buffering. Buffering is quantified in “buffer slots,” which measure the number of entries that can be stored in the RA handle. For example, a handle with room for four images is sized to four buffer slots. By increasing the size of a handle to hold more buffer slots, a PM can batch more computation and improve performance.

Each buffer slot is in one of two states: produce or consume. When producing, one or more VMs write portions of the buffer slot. When all VMs finish producing, the buffer slot switches to the consume state. During the consuming phase, VMs read from the buffer slot. When all VMs finish consuming the buffer slot, it switches back to the produce state, and the slicer notifies VMs to produce the next buffer slot. In the edge detect accelerator, the buffer slot starts in the produce state as convolution VMs produce the x-direction and y-direction images. Once all sixteen VMs have finished producing, the slicer switches the buffer slot into consume mode, notifying the magnitude accelerator to begin consuming the images. And once all sixteen magnitude VMs have completed consuming, the slicer switches the buffer slot back to produce mode so that the convolution modules can store their next output images.

If the handles are sized for multiple buffer slots, VMs can produce and consume simultaneously from the handle (but not the same buffer slot). For the edge detect accelerator, if the convolution and magnitude VMs are connected by handles with two buffer slots, one buffer slot may be in a produce state and the other in a consume state. This allows magnitude VMs to consume an image generated by the convolution VMs, while the convolution modules simultaneously produce the next image into the other buffer slot.
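The slicer’s per-slot bookkeeping can be modeled with a small counter-based state machine (a simplified software model rather than the slicer RTL; producer and consumer counts are the values configured in software in Section 4.4.2):

# Simplified software model of one buffer slot tracked by the slicer.
class BufferSlot:
    def __init__(self, producers, consumers):
        self.producers = producers   # VMs that must write before switching
        self.consumers = consumers   # VMs that must read before switching back
        self.state = "produce"
        self.finished = 0

    def vm_finished(self):
        """Called when one VM finishes producing or consuming its portion."""
        self.finished += 1
        expected = self.producers if self.state == "produce" else self.consumers
        if self.finished == expected:            # all VMs completed this phase
            self.state = "consume" if self.state == "produce" else "produce"
            self.finished = 0
        return self.state

# Edge detect: 16 convolution VMs produce, then 16 magnitude VMs consume.
slot = BufferSlot(producers=16, consumers=16)
for _ in range(16):
    slot.vm_finished()
print(slot.state)   # "consume"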

Figure 4.4: ShrinkFit wrapper state machine. Each ShrinkFit wrapper implements this state machine to context switch, load inputs, trigger computation, and store resulting outputs. Wrappers initially enter the idle state on startup.

4.3.3 ShrinkFit wrapper

The ShrinkFit wrapper is a small hard logic block that connects a PM to the accelerator store’s

ASPort. The wrapper implements common ShrinkFit tasks, including context switching, loading inputs, storing outputs, and interacting with the slicer (Figure 4.4). This logic was initially implemented within each PM, but resulted in duplicate logic in all PMs. Hence, the common logic was refactored into a generalized hard logic block.

ShrinkFit wrapper contexts

The wrapper defines a common context handle structure (Figure 4.5). Each module design has its own RA context handle, with a context for each of its VMs. The first words in the context handle each correspond to a different VM, forming a context directory. Each of these directory entries contains the location of the VM’s context data and the length of the data. This approach was chosen to support variable-length contexts, rather than hard coding context data lengths to a set size.

This context directory approach allows wrappers to quickly check if a context is claimed by another PM. A wrapper tries to claim a context by swapping the context’s directory entry with an invalid entry.

Figure 4.5: ShrinkFit wrapper context handle structure. ShrinkFit wrapper context handles contain a context for each VM. A context directory at the start of the handle identifies the locations of each context.

If the wrapper receives an invalid entry back, it knows another PM has already claimed this context, and tries to claim the next directory entry. If the PM receives any other value back, it knows it has successfully claimed the context. The wrapper then uses information in the directory entry to load the context. When the PM is done using the context, it saves any changes back to the context data, then writes the original directory entry back to the context directory.
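The claim-and-release protocol can be summarized with a short software model (illustrative only; the swap function below stands in for the AS’s atomic swap operation, and the directory values mirror the example in Figure 4.5):

# Software model of claiming a VM context through the directory's atomic swap.
INVALID = None   # stands in for the reserved "invalid" directory entry

def swap(directory, index, value):
    """In-memory stand-in for the accelerator store's atomic swap operation."""
    old = directory[index]
    directory[index] = value
    return old

def try_claim(directory, index):
    entry = swap(directory, index, INVALID)
    return None if entry is INVALID else entry   # None -> held by another PM

def release(directory, index, entry):
    swap(directory, index, entry)                # restore the original entry

directory = [(16, 3), (19, 3)]       # directory entries: (context address, size)
entry = try_claim(directory, 0)      # (16, 3): claimed successfully
print(try_claim(directory, 0))       # None: another PM holds this context
release(directory, 0, entry)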

ShrinkFit wrapper input/output

After loading the context, the ShrinkFit wrapper first checks if input handles have enough data to consume, and if output handles have enough space to produce the results into. If checks fail, the context is returned and a new one is chosen. Otherwise, the wrapper begins loading from input handles.

When loading input data, the wrapper’s connection contains a RAM interface (enable, write, data in, address) to write the data to the PM. This allows the PM to store input data automatically, without programming additional control logic into the FPGA. The PM can also elect to manually load input data if it has special requirements.

After input loading completes, the wrapper prompts the VM to begin computation. When the VM indicates computation completes, the wrapper uses a process similar to loading inputs in order to store the computation outputs.

Figure 4.6: ShrinkFit hard logic block die area overheads. ShrinkFit hard logic block die area overheads are low for systems with many accelerators and well-provisioned bandwidth. Hard logic blocks are synthesized for a commercial 40nm process technology, and area is expressed in mm² as well as normalized to the area of a Spartan-6 FPGA slice.

4.3.4 ShrinkFit framework area costs

Because the ShrinkFit framework is generalized, rather than designed for a specific set of accelerators, it makes sense to hard code it into the die, rather than programming it into the FPGA.

ShrinkFit also provides several opportunities for PMs to override automatic features if they desire a custom solution.

By utilizing hard logic blocks, the ShrinkFit framework has low area overhead.

Figure 4.7: RoboBee application decomposed into VMs. Prior to implementing software, each ShrinkFit accelerator in the RoboBee application is decomposed into VMs and handles.

All three hard logic blocks (the AS, the slicer module, and the ShrinkFit wrapper) are synthesized using Design Compiler D-2010.03 for a commercial 40nm process. Assuming a three-channel AS with ASPorts for 64 PMs, which is more than sufficient to maximize RoboBee application performance (only 36 are needed),

Figure 4.6 plots the die area overhead versus the number of ShrinkFit wrappers. For comparison to reconfigurable resources, a commercial 40nm memory compiler was used to find the area of an

SRAM equivalent to the 32x512, dual-port BRAMs found in the Spartan-6 (0.0276 mm²). This value is used to express the hard logic block area in terms of slices, the basic building block of

FPGAs (this calculation is described in more detail in Section 4.5.1). Using the equivalent slice area in Figure 4.6, Section 4.6.4 later shows that the ShrinkFit hard logic blocks will consume less than 2% of the FPGA’s die area in both small and large FPGAs.

4.4 Software Development

Developing applications requires the system designer to perform two tasks: decomposing accelerators and configuring the ShrinkFit hard logic blocks. To simplify the latter step, a software development kit (SDK) called shrinklib was created for the Python programming language.

4.4.1 Decomposing accelerators

Before any software can be implemented, the accelerators in the application design (Figure 4.2) must be decomposed into their corresponding VMs (Figure 4.7). For example, the edge detect accelerator decomposes into sixteen convolution VMs and sixteen magnitude VMs. This decomposition will also include handles to represent data connections, such as the images produced by convolution

VMs and consumed by the magnitude VMs.

The application designer must decide how many buffer slots to allocate to each handle and calculate how many VMs produce and consume from each handle. Contexts for each module design are stored in RA handles, so handles must be allocated to store contexts as well.

4.4.2 Configure ShrinkFit hard logic blocks

Once the application designer decides on the configuration of VMs and handles, it is time to implement software. The software program uses system bus reads and writes to configure the

ShrinkFit hard logic blocks, just as it would to read from or write to memory. The program must configure a few things:

1. Create handles (to connect VMs and store contexts)

2. Configure the slicer with the number of VMs that produce and consume each handle

3. Store contexts for each VM in context handles

4. Configure each PM’s ShrinkFit wrapper

5. Start each PM’s ShrinkFit wrapper

All of these routines consist of multiple reads and writes to the system bus.

4.4.3 Shrinklib SDK

The shrinklib SDK includes routines to automate the application development steps after decomposition. Although shrinklib is currently implemented in the Python programming language, it simply performs system bus reads and writes, and is easily ported to other languages. The RoboBee application demonstrates how shrinklib is used below:

95 Using shrinklib, the application first initializes the ShrinkFit framework and creates handles

(step 1). Ellipses are substituted in place of repetitive code:

brain.SetUpBrain()
camera_image_handle = shrinklib.AsHandle(
    spi, hid=8, start_addr=0x00018000, word_count=4096)
...
dct_ctxt_handle = shrinklib.AsHandle(
    spi, hid=13, start_addr=0x0001F800, word_count=512)
dct_coefs_handle = shrinklib.AsHandle(
    spi, hid=14, start_addr=0x00000000, word_count=8192)
handle_table = shrinklib.AsHandleTable([
    camera_image_handle, ..., dct_ctxt_handle, dct_coefs_handle])
handle_table.CommitToAs()

The RoboBee application uses a similar set of calls to map producer and consumer counts to each handle (step 2). Afterwards, the application builds each module design’s context handle, configures each ShrinkFit wrapper, and starts the PMs (steps 3-5). For example, this code performs steps

3-5 for all DCT PMs:

virt_dct2_set = shrinklib.VirtDct2Set(
    spi=spi, slicer=slicer,
    physical_module_count=dct_pm_count,
    context_handle=dct_ctxt_handle)
virt_dct2_set.AddContexts(
    in_handle=camera_image_handle,
    out_handle=dct_coefs_handle)
virt_dct2_set.CommitInit()
virt_dct2_set.StartPhysicalModules()

The code used to initialize the other four PM designs is almost identical.

If new PM designs are created, adding support for them to shrinklib is straightforward. The module designer only needs to write two methods for the new design. The first routine uses the system bus to configure each of the ShrinkFit wrappers for the module design, and the second builds the context handle for the module design.
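A new module design therefore plugs into shrinklib with roughly the following shape. This is a hypothetical sketch: the class, method names, and bus interface below are not the actual shrinklib API, only an illustration of the two routines described above.

# Hypothetical sketch of adding a new module design to shrinklib.
class VirtExampleSet:
    def __init__(self, bus, context_handle, physical_module_count):
        self.bus = bus                         # object exposing write(addr, value)
        self.context_handle = context_handle   # RA handle holding the contexts
        self.pm_count = physical_module_count
        self.contexts = []                     # (address, size) entry per VM

    def ConfigureWrappers(self, wrapper_base_addrs):
        # Routine 1: system-bus writes pointing each PM's wrapper at the handle.
        for base in wrapper_base_addrs[:self.pm_count]:
            self.bus.write(base, self.context_handle)

    def BuildContextHandle(self, directory_base_addr):
        # Routine 2: write the context directory (one entry per VM); the
        # variable-length context data itself follows the directory.
        for i, (addr, size) in enumerate(self.contexts):
            self.bus.write(directory_base_addr + i, (addr, size))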

4.5 ShrinkFit module evaluation

PMs are the building blocks of all ShrinkFit accelerators, and by extension, the applications that use them. Before evaluating the RoboBee application and the four ShrinkFit accelerators it uses, the five PMs they are built from—convolution, magnitude, DCT, OF region, and OF merge—are

considered. Specifically, two aspects of the PMs are evaluated. First, performance scales up linearly with FPGA resources as more PMs are added. Second, regardless of whether a few or many PMs are programmed into the FPGA, performance overheads and resource overheads remain low.

4.5.1 ShrinkFit PM implementations

For each of the five RoboBee PM implementations, compute logic and ShrinkFit functionality are partitioned into separate blocks. Compute logic, designed using the Vivado C-to-RTL compiler [4], only contains the logic required to perform the PM’s computation and BRAM memories to hold input and output data. In other words, the compute logic blocks do not contain any optimizations for ShrinkFit interfacing or functionality. To add ShrinkFit support, each PM has additional “glue logic” to connect the compute logic block to a ShrinkFit wrapper. Since the wrapper performs most of the tasks related to ShrinkFit, the glue logic simply connects the wrapper to the corresponding

PM and takes care of any special cases. For example, the convolution module may produce outputs to one handle if performing image sharpening, or two handles if performing edge detect. Convolution glue logic guides the wrapper to accommodate this choice properly.

In all five RoboBee PM designs, glue logic resource requirements are small compared to their corresponding compute logic blocks, between 0.0% and 7.8% (Table 4.1). These low resource overheads are largely thanks to the ShrinkFit wrapper, which implements the majority of ShrinkFit functionality as a hard logic block. In order to compare the glue logic and compute logic resource costs, each PM is synthesized using Xilinx ISE 14.4 for the Spartan-6 FPGA used in the HBP, with and without glue logic. Like most FPGAs, the Spartan-6 contains four basic primitives: registers and lookup tables (LUTs), DSP blocks optimized for addition and multiplication, and

BRAM blocks for efficiently storing large datasets. Registers and LUTs are contained within slices, of which hundreds if not thousands exist in the FPGA.

In addition to counting these primitives (register, LUT, DSP, and BRAM) in each design, a “slice area” resource cost is calculated combining primitives into a single metric. This metric packs registers and LUTs into slices (16 registers and 8 LUTs per slice) to obtain a slice count, and determines the die area of DSPs and BRAMs relative to the area of a slice (DSPs and BRAMs consume the area of 4.95 slices, according to PlanAhead floor plans). Total slice area is the sum of slices used by registers and LUTs, combined with the relative slice area of DSPs and BRAMs.
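Expressed as a small helper (one interpretation of the description above, not code from the dissertation’s tooling), the combined metric is:

# Sketch: combining FPGA primitive counts into a single "slice area" number.
# Assumes registers and LUTs are first packed into slices by the tools
# (16 registers and 8 LUTs per slice) and that each DSP or BRAM occupies
# roughly 4.95 slices of die area, per the PlanAhead floor plans cited above.
DSP_OR_BRAM_SLICE_EQUIV = 4.95

def total_slice_area(packed_slices, dsps, brams):
    return packed_slices + DSP_OR_BRAM_SLICE_EQUIV * (dsps + brams)

# e.g. a module packed into 100 slices that also uses 2 DSPs and 3 BRAMs:
print(total_slice_area(100, 2, 3))   # 124.75 slice-equivalents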

Figure 4.8: Single PM design resource-to-performance trade-off. Panels: (a) Convolution PMs, (b) Magnitude PMs, (c) DCT PMs, (d) OF region PMs. All four systems contain one to sixteen PMs of the same module design, demonstrating that ShrinkFit scales performance with FPGA resources, as well as low resource and performance overheads. Each configuration processes 100 images using a 3-channel AS. “Compute only” considers only the compute logic without ShrinkFit overheads. PM performance considers ShrinkFit glue logic resource and performance overheads, in addition to compute logic costs. Slice area considers the area consumed by registers, LUTs, DSPs, and BRAMs, relative to the area of a Spartan-6 slice. Secondary plots compare PerfPM/PerfCompute, the ratio between the two series.

Table 4.1: PM FPGA resource overheads

FPGA resource overheads for all five module designs are low, ranging from none to +7.8%. “Slice area” equals the sum area of FPGA primitives (registers, LUTs, DSPs, and BRAMs) relative to the area of a Spartan-6 FPGA slice. Each slice contains 16 registers and 8 LUTs. DSPs and BRAMs are each approximately 4.95x the area of a slice.

                       Convolution  Magnitude  DCT     OF region  OF merge
Slice area   Compute   195          81         138     120        261
             PM        201          81         149     126        261
             Overhead  +3.0%        +0.0%      +7.8%   +5.0%      +0.0%
Registers    Compute   463          92         409     178        772
             PM        495          101        436     208        772
             Overhead  +6.5%        +8.9%      +6.1%   +14.4%     +0.0%
LUTs         Compute   680          221        391     337        863
             PM        702          221        436     361        864
             Overhead  +3.1%        +0.0%      +10.3%  +6.6%      +0.0%
DSPs         Compute   2            2          4       5          4
             PM        2            2          4       5          4
             Overhead  +0.0%        +0.0%      +0.0%   +0.0%      +0.0%
BRAMs        Compute   3            3          4       2          5
             PM        3            3          4       2          5
             Overhead  +0.0%        +0.0%      +0.0%   +0.0%      +0.0%

In the case of magnitude and OF merge modules, glue logic adds no overheads. This is due to underutilized slices in compute logic blocks, containing too few registers or LUTs to completely pack slices. The glue logic is able to utilize the unused primitives without adding additional slices, thus resulting in no additional overheads.

4.5.2 Evaluation methodology

To analyze scalable PM performance and overheads, four system design scenarios are analyzed, each limited to only use one out of the four module designs: DCT, convolution, magnitude, and

Table 4.2: PM compute logic block processing delays

Each compute logic block exhibits differing processing delay requirements, sometimes dependent on input data. Although these variable schedules could threaten FPGA resource/performance scaling, the ShrinkFit framework is not noticeably affected.

                            Cycles per image region
                            Average   Minimum   Maximum   Max-Min
Convolution (sharpen)       13,520    12,665    14,403    12.1%
Convolution (edge detect)   12,079    11,315    12,867    12.1%
Magnitude                   3,151     2,600     3,250     20.0%
DCT                         33,061    33,061    33,061    0.0%
OF region                   1,251     1,173     1,331     11.9%
OF merge                    530       530       530       0.0%

OF region. OF merge, the fifth module design, is not analyzed because only one can be used per optical flow accelerator. For each system, different PM counts are considered, ranging from the minimum (1) to the maximum (16). These systems also contain an accelerator store with three channels of bandwidth and four buffer slots per handle.

The performance analysis relies on ModelSim 10.1, which performs cycle-accurate simulation of all runs using synthesizable RTL for all hard logic blocks and PMs. Each system utilizes a

“treadmill” testing module, ensuring repeatable and error-free tests. The treadmill module injects pre-recorded camera images into the system in the same order and verifies PM outputs using corresponding checksums. Performance is determined by measuring the number of cycles required to process 100 images. To evaluate ShrinkFit performance overheads, each PM’s performance is compared with the compute logic block’s performance. The subset of cycles spent utilizing the compute logic block is also recorded. This measurement reveals the theoretical maximum performance of the PM without any ShrinkFit overheads (Table 4.2). Dividing actual ShrinkFit system performance by this theoretical maximum (PerfPM/PerfCompute) reveals the performance overhead of the ShrinkFit system.
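Under one consistent reading of this metric, with performance measured as images processed per cycle over the 100-image run, the ratio and the overhead it implies are:

\[
\frac{\mathit{Perf}_{\mathrm{PM}}}{\mathit{Perf}_{\mathrm{Compute}}}
  = \frac{\mathrm{cycles}_{\mathrm{Compute\ only}}}{\mathrm{cycles}_{\mathrm{PM}}},
\qquad
\mathrm{overhead} = 1 - \frac{\mathit{Perf}_{\mathrm{PM}}}{\mathit{Perf}_{\mathrm{Compute}}}.
\]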

Implementations of the ShrinkFit framework RTL and PM RTL have been verified to correctly synthesize and work in the HBP hardware. However, the HBP platform could not be used for thorough performance analysis. Because the off-the-shelf Spartan-6 FPGA does not include

ShrinkFit hard logic blocks, all ShrinkFit features were programmed in as soft logic blocks, consuming much more of the FPGA’s resources. This limited the number of PMs that could be added and introduced additional overheads that would not occur with ShrinkFit hard logic blocks.

4.5.3 PM performance scalability

Since all ShrinkFit accelerators are built with PMs, it is important to ensure they work well individually within the framework before combining them to compose accelerators for different applications.

Hence, the performance scalability of individual PM implementations is evaluated first. Figure 4.8 plots the performance versus resource utilization (slice area) for the four resizable types of PM designs. For each of the PMs, the plots show how performance scales for “Compute only” and

“PMs.” “Compute only” data points correspond to compute logic without any other resource or performance overheads. This is a highly optimistic upper bound which does not consider the costs of context switching, loading input data to process, or storing the resulting output data. However, many of these overheads would be present whether or not ShrinkFit is used. For example, inputs and outputs will always need to be loaded. The “PMs” data points include these performance and soft logic resource overheads in order to evaluate how PMs perform in an actual ShrinkFit system.

Each data point in Figure 4.8 represents a system with a different number of PMs. The points at the far left represent a system with a single PM. Proceeding to the right, each consecutive point uses more FPGA resources to add an additional PM. This continues until reaching the rightmost point, representing a system utilizing the maximum of 16 PMs. The corresponding PerfPM/PerfCompute plots show how overheads scale with respect to resource utilization for all four PMs.

The results demonstrate that all of the PMs successfully achieve a linear performance-to-resource relation, with some minor exceptions. For each PM added to the system, performance increases by a roughly constant amount. As the number of PMs increases, the slightly lower slope can be attributed to contention in the accelerator store. The PerfPM/PerfCompute plots better illustrate this trend. While a slight downward slope can be seen for all PMs, these plots verify that

ShrinkFit overheads are consistently low even as more PMs are added. The larger overheads for

OF region are due to large inputs (regions of an image) and short execution times (Table 4.2). OF region spends a considerable portion of its time loading inputs, a delay that would occur even if system architectures other than ShrinkFit were to be used. For this reason, and considering that

the average PerfPM/PerfCompute is still high at 78.96%, the PM implementations of OF region and all other module designs are sufficient for the goals of ShrinkFit.

4.6 RoboBee application evaluation

Given that PMs can scale performance with FPGA resources, and do so with low area and performance overheads, the case where PMs are combined to form four accelerators and implement the RoboBee application shown in Figure 4.7 is now considered. The system should continue to scale performance with FPGA resources while adding more PMs into the system. The following evaluation not only demonstrates this is achievable, but that overheads also remain low. Further, results demonstrate that less channel bandwidth is necessary to support the RoboBee application than some PMs require when considered individually. Finally, the role of buffering with regard to performance is investigated.

All runs of the RoboBee application follow the same testing approach as with single module evaluations. The treadmill module is used to inject test images and verify output checksums. For each test, performance is measured as the number of cycles required to completely process 100 camera images. All test results are obtained from cycle accurate RTL simulations. In addition, a system using one PM of each of the module designs is implemented and deployed to the RoboBee brain prototype FPGA. The application was successfully run on the FPGA prototype without error

(on-FPGA and simulation results matched).

4.6.1 Application evaluation overview

ShrinkFit systems processing the RoboBee application are considered first. Like the previous figure,

Figure 4.9 plots performance versus slice area for “Compute only” and three “PM” implementations with different AS bandwidth assumptions (1, 2, and 3 channels). Again, “Compute only” data points only include compute logic resource and performance costs and do not include ShrinkFit overheads due to context switching, loading input data, or storing output data. Each point in the plot represents a different set of PMs. Points on the left side represent the minimum set of PMs, one each of the five module designs (convolution, magnitude, DCT, OF region, and OF merge) required by the application. This configuration requires the fewest FPGA resources possible. Each

Figure 4.9: RoboBee application resource-to-performance trade-off (bandwidth impact). When running full applications, AS bandwidth needs are low: two channels of AS bandwidth are sufficient for the RoboBee application, and one-channel performance is only slightly less. Systems with varying PM counts and one, two, or three channels of AS bandwidth are considered. FPGA resources for Spartan-6 FPGA models are noted at the top.

point to the right adds an additional PM to the system. Unlike the previous evaluations that considered a single module design, the system designer must decide which of the four PMs to add to the system (because only one OF merge VM is used for the optical flow accelerator, there is no benefit from adding additional OF merge PMs). To decide which of the four PMs to add,

each PM’s compute-only average performance (Table 4.2) is used as an approximation of the PM’s performance, and these approximations are used to exhaustively calculate the expected performance of each possible system configuration. This estimation is close enough to make good decisions, due to the previous section’s findings that PM performance overheads are low. From the roughly

130,000 possible PM permutations, the highest performing configurations were progressively chosen for each subsequent point with increasing slice area, up to the number of PMs that maximizes the overall performance of the application. This exhaustive search required less than one second on a typical desktop computer. More efficient search strategies are certainly possible, but the exhaustive approach performed well in practice for the workload.
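A sketch of this selection procedure is shown below. It is illustrative only: it approximates application throughput by the slowest module design (the critical path), uses the compute-only averages from Table 4.2 and the PM slice areas from Table 4.1, and ignores buffering, AS contention, and the exact chaining of the three tracks, so it does not reproduce the exact search behind Figure 4.9.

from itertools import product

# Sketch: exhaustive PM-count search. 32 x 16 x 16 x 16 = 131,072 combinations,
# matching the roughly 130,000 permutations above (OF merge is fixed at one PM).
MODULES = {
    # name: (avg cycles per region, VMs of work per image, PM slice area)
    "convolution": (12_800, 32, 201),   # mean of sharpen/edge averages; 16 + 16 VMs
    "magnitude":   (3_151, 16, 81),
    "dct":         (33_061, 16, 149),
    "of_region":   (1_251, 16, 126),
}

def images_per_cycle(pm_counts):
    # A module's rate is its PM count divided by the cycles its VMs need per image;
    # the slowest module (the critical path) limits the whole application.
    return min(pms / (cycles * vms)
               for (cycles, vms, _), pms
               in zip(MODULES.values(), pm_counts.values()))

def best_config(slice_budget):
    best = None
    ranges = [range(1, vms + 1) for (_, vms, _) in MODULES.values()]
    for counts in product(*ranges):
        pm_counts = dict(zip(MODULES, counts))
        area = sum(n * MODULES[name][2] for name, n in pm_counts.items())
        if area <= slice_budget:
            perf = images_per_cycle(pm_counts)
            if best is None or perf > best[0]:
                best = (perf, pm_counts)
    return best

print(best_config(slice_budget=4000))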

The plot clearly shows that ShrinkFit again enables performance to scale up with increasing

FPGA resource utilization. The somewhat jagged data points in the plot are an artifact of the different resource and performance characteristics of the PMs. Each module design’s PM implementation consumes different FPGA resources and requires a different amount of time to complete its operation. In other words, the performance gained and resources consumed by adding a PM vary between module designs. In addition, applications may chain VMs in series, and adding an additional PM will only improve performance until it alleviates the critical path, shifting it to a different PM design. Although the point-to-point relation is jagged, the application’s overall trend continues to be roughly linear.

These results verify that performance overheads are low when processing the application across the full range of slice area, as seen by the PerfPM/PerfCompute plot. Although there is slightly more variance in this ratio than with single module evaluations, it is still relatively flat.

In addition, the ratio is always high, on average at 90.34% for systems with a three channel AS, indicating performance overheads are low.

4.6.2 Bandwidth impact

Experimental results in Figure 4.9 additionally reveal that bandwidth has less of an effect on performance than for single module systems. Some of the module designs in the single module evaluation required an AS with three channels to achieve high performance. However, when considering the full application, three channels only improve performance over two channels by 0.20% on average, an insignificant difference. Even one channel performance is only 1.72% less on average than with

three channels.

Bandwidth needs for the application are lower than for single modules because needs are determined by the slowest modules, not the fastest. OF region PMs require up to three channels when large numbers of PMs are present since computation is so fast that loading input images

(utilizing bandwidth) requires a significant portion of the PM’s cycles. But because the OF region

PMs are so fast, they are rarely on the critical path when slower PMs, such as DCT, are present. As a result, the application never needs to program enough OF region PMs to require three channels.

These results also demonstrate the efficacy of PM sharing between ShrinkFit accelerators.

Both edge detect and image sharpen accelerators make use of convolution PMs. These PMs are not partitioned to one accelerator or the other, rather, all PMs rapidly switch between both accelerators.

Further, the lowest resource configurations use a single convolution PM, shared by both accelerators.

Despite switching between both accelerators, the system achieves high performance.

ShrinkFit quickly identifies when the system achieves maximum performance and adding additional PMs would waste resources. PM configurations that consume 16x more resources than the minimum sized configuration are not considered, because maximum performance is achieved well before this point. Like bandwidth, performance is dictated by the slowest module limiting the critical path. Therefore, when the slowest module programs its maximum number of PMs into the

FPGA logic (sixteen DCT PMs in the RoboBees application), the critical path cannot be reduced by adding PMs from other module designs. Once the maximum number of PMs for a module design has been programmed into the FPGA, it is quickly apparent that there is no need to consume more FPGA resources with other PMs. In the case of the RoboBees application, this occurs with the following PM counts: 16 DCT PMs, 13 convolution PMs, 2 magnitude PMs, 1 OF region PM, and 1 OF merge PM. By quickly identifying this maximum performing configuration, ShrinkFit prevents programming additional, unnecessary logic.

4.6.3 Buffering impact

In contrast to bandwidth insensitivity, Figure 4.10 shows buffering has a significant effect on performance as PM count grows. For this experiment, systems with the same PM selections as in the bandwidth evaluation were used. However, instead of varying bandwidth, each handle is sized to hold either one, two, or four buffer slots.

Figure 4.10: RoboBee application resource-to-performance trade-off (buffering impact). Buffering is essential for high performance with many PMs: four buffer slots per handle are necessary to scale performance at upper PM counts. Systems with varying PM counts and one, two, or four buffer slots per handle are considered. FPGA resources for Spartan-6 FPGA models are noted at the top.

In systems with low PM counts, PerfPM/PerfCompute remains high regardless of buffer size, indicating low performance overheads. However, as PM counts rise, buffering less than four buffer slots constrains performance and lowers PerfPM/PerfCompute. Buffer size has significant impact on performance due to pipelining effects between modules connected in series. For example, if a handle is sized for a single buffer slot, that buffer slot can only

be used to produce or consume at any given time. Therefore, for modules on the critical path, only half may be actively processing at any time. This effect is lessened for handles sized for two buffer slots, which allows connected VMs to produce and consume simultaneously. Still, it is unlikely that both VMs will complete producing and consuming at exactly the same time, and therefore one will stall while the other completes. Using handles with room for four buffer slots decouples modules in series, and improves pipelining performance accordingly. Because these limitations would apply to any accelerator based system, the experimental results show that buffering is important for any accelerator based architecture and does not apply solely to the ShrinkFit framework.
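The effect can be approximated with a two-stage throughput model (a back-of-the-envelope sketch, not the cycle-accurate simulation behind Figure 4.10):

# Back-of-the-envelope model of two module designs chained through one handle.
# tp, tc: time for the producing / consuming VMs to fill or drain one buffer slot.
def relative_throughput(tp, tc, buffer_slots):
    if buffer_slots == 1:
        return 1.0 / (tp + tc)    # produce and consume must alternate on one slot
    # With two or more slots the stages overlap and the slower stage dominates.
    # (Extra slots mainly absorb the delay jitter in Table 4.2, ignored here.)
    return 1.0 / max(tp, tc)

print(relative_throughput(1.0, 1.0, 1))   # 0.5: only half the pipeline is busy
print(relative_throughput(1.0, 1.0, 2))   # 1.0: produce and consume overlap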

To obtain a direct comparison between “Compute only” and actual PM performance while investigating buffering, the resource overhead of using different levels of buffering is not considered in Figure 4.10. “Compute only” assumes an unlimited amount of buffering, and because any accelerator based system would require amounts of buffering at least equivalent to ShrinkFit, omitting the resource cost results in the fairest comparison. Using four buffer slots rather than one requires an additional area equivalent to 159 Spartan-6 slices, and is only necessary to increase buffering when larger PM counts are used (and more resources are available on the FPGA). In practice, the cost of adding additional buffering consumes a few percentage points of resources, and would apply equally whether or not ShrinkFit was used.

4.6.4 Hard logic block area overheads

It is not surprising that FPGAs with larger resources can fit more PMs: both Figure 4.9 and Figure 4.10 indicate the total slice area of reconfigurable logic that each Spartan-6 FPGA contains. Smaller FPGAs, such as the SLX9, can only fit systems with a few PMs, whereas the larger SLX75 can fit enough to obtain the maximum achievable performance [3]. As such, the SLX9's ShrinkFit framework would need to provision support for fewer PMs, which would reduce the percentage of FPGA die area consumed by ShrinkFit hard logic blocks. If Spartan-6 FPGAs were to include ShrinkFit hard logic blocks, provisioning 16 PMs for the SLX9, 32 PMs for the SLX16 and SLX25, and 64 PMs for larger Spartan-6 FPGAs should be sufficient if not over-provisioned. For flexibility, more ShrinkFit wrappers are provided than are used for the RoboBee application: for the SLX75 and up, almost half of the 64 wrappers are unused and available for future expansion. Using data from Figure 4.6, the equivalent slice area of ShrinkFit hard logic blocks is small by comparison. With this provisioning scheme, ShrinkFit hard logic blocks require less than 2% of reconfigurable die area for small and large Spartan-6 FPGAs.

4.7 Related work

Several works target multiple resource budgets by creating variants of the same accelerator. Cong, et al., use the Vivado C-to-RTL compiler to survey architectural parameters to create accelerator variants for different resource budgets [17]. As discussed in this chapter, this approach leads to challenges when building multi-accelerator systems. Elastic computing manually designs multiple variants of the same accelerator using different algorithms [75]. This approach is difficult to scale as it depends on significant manual design and the existence of multiple implementations of the same algorithm.

Other works have investigated approaches to manage multiple accelerators in a single system. Dales, et al., introduced an approach for a hybrid FPGA+GP processor to switch between accelerators and software execution [21]. This work does not use accelerator variants or resize accelerators. FPMR adds hardware support for MapReduce algorithms to resize an accelerator [66]. This work is limited to the use of a single accelerator and algorithms which fit within MapReduce semantics. CHARM and DRP use generic hard compute logic blocks instead of FPGA slices to create reconfigurable accelerators [18, 31].

Previous works have used virtualization concepts to support hardware acceleration. Kalte, et al., use contexts to store accelerator state, but for the purpose of pausing or relocating accelerators in FPGA fabric rather than resizing them [40]. C-Cores introduced an approach for ASIC designs that automates software-hardware codesign using virtualization-like state management and is limited to single accelerators [73].

Chapter 5

Future directions

The accelerator store was developed to be an extensible, open platform rather than a self-contained work. The ShrinkFit framework, built on top of the accelerator store, was also designed to serve the evolving needs of accelerator-based systems rather than to be an isolated work. As such, many opportunities exist to build on top of both works, and many avenues for research are possible. In this chapter, several research directions are highlighted.

5.1 Accelerator store scalability

The accelerator store was originally designed, for the purposes of Chapter 3, to interact with up to 16 accelerators. Beyond this point, the arbiter's latency becomes long because the computational complexity of arbitration is linear with respect to accelerator count. For many-accelerator systems, a distributed accelerator store (DAS) approach was proposed, using multiple accelerator stores connected by a grid OCN. ShrinkFit was able to increase the number of supported accelerators to 64 (and perhaps beyond) by taking advantage of the relative performance differences between hard logic blocks (ASIC) and FPGA logic. However, using the accelerator store with more than 16 accelerators per accelerator store on an ASIC platform (rather than FPGAs) would require an implementation of the DAS design or a lower-latency arbiter (or both). The DAS approach is described in Section 3.2.3. To reduce the arbiter's latency, at least two options exist: subset arbitration and multistage arbitration.

5.1.1 Subset arbitration

Subset arbitration only considers a subset of accelerators at each cycle. For example, if an accelerator store serviced 64 accelerators, it might only consider a window of 16 accelerators at each cycle, and automatically reject the other 48 potential requests outside that window. Although this would increase the worst-case time for an AS request to be accepted by the AS, the subset approach is unlikely to increase delay by much in practice. First, bulk requests will not be affected once accepted, so multi-word requests will only have to pay the additional arbitration cost once per request rather than once per word. Second, AS contention already prevents accelerators from making AS requests on the first attempt in many cases. Accelerator requests outside the sliding window could be viewed as low-priority requests that would be unlikely to be accepted even if considered.

The window could be moved using existing priority rotation logic. To ensure that requests from all accelerators are considered regularly and to prevent starvation, the window could be rotated by (window size + 1) each cycle. Rotating by the window size, rather than by the current single increment, ensures that all accelerators are considered as regularly as possible. The +1 term is necessary to prevent starvation; otherwise the same accelerators would always be the lowest priority and could be starved.
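A minimal C sketch of this windowing policy follows. It is purely illustrative: the real arbiter is combinational logic and would apply the existing priority rotation inside the window, whereas this sketch simply grants the first requester it finds.

    #define N_ACCEL 64     /* accelerators sharing the AS   */
    #define WINDOW  16     /* requesters examined per cycle */

    /* Returns the granted accelerator index, or -1 if no accelerator in the
     * window requested. Advancing the window start by WINDOW + 1 rotates both
     * the window and the relative priority, so no accelerator is starved. */
    int subset_arbitrate(const int request[N_ACCEL], int *window_start)
    {
        int grant = -1;
        for (int i = 0; i < WINDOW; i++) {
            int a = (*window_start + i) % N_ACCEL;
            if (request[a]) { grant = a; break; }
        }
        *window_start = (*window_start + WINDOW + 1) % N_ACCEL;
        return grant;
    }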

To further reduce delays, each channel could also consider a different sliding window. For example, if a four-channel AS were used in the previous example, each channel could consider one of the four sets of sixteen accelerators. In this fashion, each accelerator could be considered at every cycle.

5.1.2 Multistage arbitration

A related approach to reducing arbitration delay is to make arbitration a multi-cycle process, rather than the current single-cycle approach. The advantage of implementing multistage support is that available channels are never wasted. In the subset approach, if the accelerators in the sliding window did not fully utilize available AS channels, AS bandwidth that could be used by accelerators outside the window would be wasted. However, the multistage approach does add considerable complexity to both the AS implementation and the accelerators. Multistage arbitration could be viewed as the subset approach, but with copies of the arbitration logic for each stage, and would increase the area and power overheads for arbitration. In addition, accelerators would no longer know in the same cycle whether requests were accepted or rejected, and therefore would require additional logic to handle this variable notification latency.

5.2 Unified system+AS memory

Currently, data stored in the AS is not directly accessible to the GP-CPUs. Rather, GP-CPUs must use a bridge accelerator to access data word-by-word or DMA data between the AS and system memory. To improve interaction between the GP-CPUs and accelerators, it would be preferable for AS memory to be accessible over the system bus directly, so that it does not appear any different than data stored in system memory.

Cota, et al. [20] present a related approach that allows a memory to be used by a single accelerator or within the system cache. Although this approach allows memory to be used both as system memory and as accelerator memory, it comes with limitations. First, the memory can only be mapped to one accelerator at design time, and cannot be reassigned to other accelerators as is possible with the AS. Second, the memory can be used by either the accelerator or the system at any given time; it cannot be used to expose AS data over the system bus.

Unifying AS and system memory presents a more complex challenge, requiring that accelerators and GP-CPUs do not interfere with each other. Current GP-CPU instruction sets (ISAs) may make this impossible: there is no clear way to perform handle operations with system memory using current ISA semantics. Therefore, unifying system and AS memory will probably require a combination of architectural and ISA modifications.

5.3 Dynamic handle allocation

Current uses of the accelerator store have statically assigned handles to accelerators. However, in systems with changing workloads and different accelerators, the handle table configuration will need to change at runtime. A software dynamic handle allocator, similar to the dynamic memory allocators used in virtually all programming languages, would be the best approach for modifying the handle table configuration at runtime. This would allow handles to be automatically unmapped and remapped without exposing software developers to the complexity of runtime allocation.
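A hypothetical software interface for such an allocator is sketched below in C. The entry fields, table size, and function names are illustrative assumptions, not the accelerator store's actual handle-table format; the point is only that handle allocation can mirror a malloc/free style pairing.

    #include <stdint.h>

    #define AS_HANDLE_ENTRIES 64     /* assumed handle-table capacity */

    struct as_handle_entry {
        uint32_t base_word;          /* start of the region in shared AS SRAM */
        uint32_t length;             /* region size in words                  */
        uint8_t  in_use;
    };

    static struct as_handle_entry handle_table[AS_HANDLE_ENTRIES];

    /* Map a region and return a handle id, or -1 if the table is full. */
    int as_handle_alloc(uint32_t base_word, uint32_t length)
    {
        for (int h = 0; h < AS_HANDLE_ENTRIES; h++) {
            if (!handle_table[h].in_use) {
                handle_table[h].base_word = base_word;
                handle_table[h].length    = length;
                handle_table[h].in_use    = 1;
                return h;
            }
        }
        return -1;
    }

    /* Unmap a handle so its table entry (and backing SRAM) can be reused. */
    void as_handle_free(int h)
    {
        if (h >= 0 && h < AS_HANDLE_ENTRIES)
            handle_table[h].in_use = 0;
    }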

5.4 ShrinkFit dynamic reprogramming

As previously noted, ShrinkFit is well suited for use in combination with FPGA dynamic reprogramming techniques. Using dynamic reprogramming, accelerators could be dynamically shrunk or expanded as application needs change. Dynamic reprogramming is a challenging problem, and works differently across different FPGA manufacturers and families. For example, entire regions of the FPGA may need to be reprogrammed at once, and multiple accelerators may occupy the same region. Therefore, properly allocating accelerator combinations to each region at runtime is a complicated problem which has limited the use of dynamic reprogramming in the past. However, as dynamic reprogramming techniques improve, the combination of dynamic reprogramming and ShrinkFit could broaden the appeal of FPGA hardware acceleration in mainstream computing.

Appendix A

Helicopter brain prototype

Figure A.1: Helicopter brain prototype attached to helicopter and optical flow camera

Figure A.2: Helicopter brain prototype front

Figure A.3: Helicopter brain prototype back

Figure A.4: Optical flow software and hardware accelerator energy consumption Energy per frame (µJ) is plotted against image resolution (pixels) for the software and hardware accelerator implementations; the hardware accelerator achieves 18.5-29.2x improvements.

A.1 Brain

Energy is critical due to battery size limits imposed by the helicopter's limited lift. The brain must perform all computation with minimal energy consumption. Computation primarily consists of regularly occurring and computationally intense operations such as optical flow (visual change over time). To maximize energy efficiency, these tasks are implemented as hardware accelerators. Compared to software running on a general purpose processor, 18.5-29.2x energy efficiency improvements are achieved by using hardware accelerators for optical flow (Figure A.4).

The RoboBee brain also includes a general purpose ARM Cortex-M0 microcontroller for infrequent or simple calculations. The Cortex-M0 is extremely low power when active (roughly 85 µW/MHz in ASIC). Although the Cortex-M0's performance is limited, any computations requiring more performance should be realized as accelerators rather than as general purpose core software.

Figure A.5: MCX2 Toy Helicopter

A.2 Helicopters

The RoboBees team is creating flying bee-sized robots for pollination, search and rescue, and other tasks. The project consists of the body, brain, and colony groups working in parallel. RoboBee body design is in progress, so researchers in the colony and brain teams use toy helicopters (Figure A.5) as a RoboBee approximation. Researchers use the helicopters to try new sensors and develop collaborative strategies. Although the helicopters are larger than the final bee size, they mimic the instability of small vehicle flight and can work together in similar situations.

The helicopters are currently used in two configurations. The first is a stock configuration where small white dots are placed on the helicopter for motion capture. Vicon cameras in the room use the white dots to locate the helicopters in 3-D space. A stationary computer uses this location data to control each helicopter's motor and steering by radio control.

The second configuration removes the circuit board controlling the helicopter motor and steering and replaces it with two circuit boards and an optical flow sensor ring. The “mainboard” controls the helicopter motors and steering, and includes an AVR32 processor for computation. It also connects to the “optical flow (OF) daughterboard,” which measures the amount of visual change seen over time. For example, if two RoboBees were traveling over the same route, but one was flying at twice the speed, it would also see twice the optical flow. The OF daughterboard mainly consists of two AVR32 processors, each with connection pins to an optical flow sensor or sensor ring. Currently, only one of the sensor connections is in use and only one AVR32 processor is processing optical flow calculations. Optical flow sensor rings consist of eight low-resolution, high-speed cameras (optical flow sensors) surrounding the helicopter like a halo.

The helicopter brain prototype (HBP) merges brain hardware development with software development on the helicopter. The HBP implements an upgradeable copy of the RoboBee brain SoC, and will allow researchers to develop on the RoboBee platform while using helicopters. The HBP will be an in-place replacement for the optical flow daughterboard and will communicate with the mainboard and optical flow ring. The HBP will subsume computations from the mainboard as development progresses.

A.3 Objectives

The HBP targets the following objectives:

1. Brain-Colony collaboration: The HBP provides a bridge between the brain and colony groups. The brain team developed an early version of the Brain SoC containing the Cortex-M0 GP-CPU and an optical flow accelerator. Meanwhile, the colony group and other researchers in the brain group use helicopters, using GP-CPUs with performance beyond RoboBee GP-CPU capabilities. The HBP allows helicopter teams to design software for the RoboBee brain and use it on helicopters.

2. Brain SDK: The HBP helps the brain team evaluate the RoboBee brain as a software development platform. Monitoring the experience of software developers early in the RoboBees project can reduce programming challenges before design changes cause major regressions.

3. Sensor evaluation: The HBP includes several accessible serial ports for interfacing with sensors. This will allow design teams to try out new sensors in the field, without PCB redesign delays.

4. Accelerator identification: The brain team developed tools to measure the energy cost of running software on the Cortex-M0 and to compare those costs to corresponding accelerators. By using the HBP platform rather than the current helicopter configuration, the brain team can identify software to accelerate and compare the energy costs of implementing algorithms in software and hardware.

5. Demos: The HBP provides fully autonomous flight and demonstrates collaboration between the brain and colony teams.

Figure A.6: Original CentEye system architecture

A.4 System Architecture

The current CentEye configuration connects the mainboard, daughterboard, and optical flow sensor ring in series (Figure A.6). The battery is connected to the mainboard, and battery power is relayed through the daughterboard and to the optical flow ring.

The HBP replaces the CentEye daughterboard with an FPGA-based board (Figure A.7). The initial HBP bitstream will emulate the daughterboard's optical flow functionality. Additional features will be implemented in FPGA logic afterward.

Figure A.7: HBP system architecture The CentEye mainboard and optical flow ring remain and use the same connectors.

A.5 HBP Connectors

The HBP maintains the connectors to the mainboard (I2C based) and to the optical flow sensor ring (custom IO protocol). The HBP will also supply two I2C and two SPI connectors for future sensor expansion. In addition, a JTAG connection will be available for FPGA bitstream programming and debugging. Finally, GPIOs will be exposed for any other connectivity needed later.

A.5.1 Connection to mainboard

Figure A.8 shows the connector pinout to the mainboard.

XTRA (1,2): Unconnected

SDA (3): I2C data

SCK (4): I2C clock (driven by mainboard)

SPIxxxx (5-8,11): SPI, unconnected

Figure A.8: Interface to mainboard

GND (9)

VBAT (10): unregulated 3.7V-4.0V from helicopter battery

Unused (11-20): Unconnected

A.5.2 Connection to optical flow sensor ring

Figure A.9 shows the connector pinout to the optical flow sensor ring.

AOUT (1): Analog pixel (8-bit grayscale) reading. The HBP's ADC is used to determine the pixel value.

Unused (2): Unconnected

CS7 LED1 (3): Powers an optional LED flash (light) on the sensor ring. Unconnected.

VBAT (4): Unregulated 3.7V-4.0V from helicopter battery.

CS1-5 (5-9): Selects one of the eight sensors on the sensor ring. Three CS signals are always high. Although this signal is 4.7V, the 3.3V provided by the FPGA is sufficient (validated by CentEye).

Figure A.9: Interface to optical flow sensor ring

WNR (10): ”Write/NotRead” - used to indicate direction of IO signals.

GND (11)

IO0-7 (12, 14-20): HBP-to-OFSR 4.7V signals used to send commands to sensors. The HBP can only signal at 3.3V, which is sufficient for the OFSR to detect as high. Because the OFSR never communicates back on these lines, no level converter is needed.

4.7V (13): Regulated 4.7V necessary to power the optical flow sensors. This must be regulated on the HBP board.

Figure A.10: I2C Interface The FPGA master and the sensor slaves share the SDA and SCL lines, which are pulled up to VDD.

A.5.3 JTAG

A Xilinx JTAG connection will be provided for reconfiguring the FPGA's bitstream (in flash and directly on the FPGA) as well as for debugging. The HBP will not use the standard 14-pin interface, to reduce weight and PCB area. An alternate 6-pin cable will be needed to carry TMS, TCK, TDI, and TDO, as well as 3.3V and GND. All four Txx signals are 3.3V.

A.5.4 I2C

The board will offer two I2C serial ports for sensor connectivity. Each port includes SCL (clock), which is produced by the FPGA, and SDA, which is a shared bidirectional signal. SDA and SCL have pullup resistors (Figure A.10), so devices can only pull SDA to ground when given control of the signal. The HBP connector also supplies 3.3V and GND.

I2C allows multiple sensors over a single port, but multiple ports are provided to improve throughput and simplify connectivity.
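The open-drain discipline can be made concrete with a short C sketch. All of the names below (gpio_float, gpio_drive_low, the pin numbers) are hypothetical platform hooks rather than HBP firmware: the master never drives a line high, it only pulls low or releases the line and lets the pull-up raise it.

    /* Hypothetical platform hooks: tristate a pin, or drive it to ground. */
    extern void gpio_float(int pin);
    extern void gpio_drive_low(int pin);
    enum { SDA_PIN = 0, SCL_PIN = 1 };          /* illustrative pin numbers */

    static void sda_high(void) { gpio_float(SDA_PIN); }      /* pull-up raises SDA */
    static void sda_low(void)  { gpio_drive_low(SDA_PIN); }
    static void scl_high(void) { gpio_float(SCL_PIN); }
    static void scl_low(void)  { gpio_drive_low(SCL_PIN); }

    /* I2C START condition: SDA falls while SCL is held high. */
    void i2c_start(void)
    {
        sda_high();
        scl_high();
        sda_low();
        scl_low();     /* ready to clock out the first address bit */
    }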

A.5.5 SPI

The board will offer two SPI ports for high-throughput sensors. SPI consists of four one-way signals (Figure A.11): SCLK (clock generated by the FPGA), MOSI (master out, slave in), MISO (master in, slave out), and SS (slave select). Although SPI can support multiple slaves per master, each slave needs a separate SS line. To reduce design complexity and maximize throughput, two SPI ports are provisioned. 3.3V and GND are also supplied (so a 6-pin connector is necessary).

Figure A.11: SPI Interface The FPGA master drives SCLK, MOSI, and a separate slave-select line (/SS1, /SS2) for each sensor slave; MISO carries data back to the master.

A.5.6 GPIO

The board offers eight GPIOs (IO_xxx pins on the FPGA) for input and output. For example, one of the I2C sensors (Gyro + Accelerometer) also has two interrupt pins. These GPIOs are useful for interfacing with any non-I2C/-SPI sensors that may come about (commercially available or through academic projects).

A.6 Components

The HBP will be an in-place replacement of the helicopter’s daughterboard. The daughterboard currently implements an optical flow algorithm on an AVR32, so the HBP must also implement the optical flow algorithm. In addition, the HBP must communicate with the mainboard and optical flow sensor ring over separate I2C serial ports. The mainboard connection also provides an unregulated 3.7V battery line.

By stripping the helicopter's body (it's purely cosmetic), replacing the toy's PCB with the CentEye mainboard, and including the optical flow sensor ring, 4.5-6.5g is left for the HBP. The helicopter can fly comfortably up to 4.5g, struggles up to 6.5g, and will not fly with additional weight. This HBP weight budget includes components (FPGA, flash memory, etc.), solder, and the PCB. Fully assembled, the HBP weighs 4.16g, well below the preferred weight of 4.5g.

Figure A.12 shows how these components connect up in the HBP.

Figure A.12: HBP Architecture

Table A.1: XC6-SLX150 resource utilization for RoboBee brain The system consists of the Cortex-M0 GP-CPU and optical flow accelerator.

Resource                     Utilization
LUTs (logic)                 6%
Registers                    1%
DSPs (multiply-accumulate)   17%

A.6.1 FPGA

The FPGA implements the brain’s circuitry and can be upgraded in minutes for future versions of the RoboBee brain.

The Xilinx XC6-SLX150 FPGA was selected for its high resource count (Table A.1), its low energy consumption (Figure A.13 and Figure A.14), and its low weight. This FPGA gives the brain team room to add new accelerators and provides acceptable flight times. The CSG484 package is the lightest version of the FPGA, weighs 1.3302 g, and measures 19 x 19 mm. The modest IO pin needs are sufficiently handled by the CSG484 package.

Figure A.13: XC6-SLX150 FPGA power consumption FPGA power consumption (mW) is largely a factor of the resources used. Configurations with just the Cortex-M0 GP-CPU, the Cortex-M0 and optical flow accelerator, and full FPGA utilization are shown. The motor currently consumes the most helicopter power and is shown for comparison.

A.6.2 Flash memory

Flash memory is used to configure the FPGA's volatile memory on system startup. The XC6-SLX150 requires slightly more than 4MB to store the configuration, so the minimum flash size is 8MB. The Winbond W25Q64-WSON-ZE was chosen, which contains 8MB of memory and runs entirely on 3.3 V (other flash parts required a higher voltage for writes). The flash weighs 0.11 g and measures 8 x 6 mm.
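As a quick sizing check (standard SPI NOR flash densities are assumed here rather than taken from this appendix): a 4MB part holds 32 Mb, which the configuration bitstream already exceeds, so the next common density, 64 Mb (8MB), is the smallest that fits. The "64" in W25Q64 denotes exactly this 64 Mb capacity.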

Figure A.14: Helicopter battery life with XC6-SLX150 Estimated battery flight time (minutes) is shown for different FPGA configurations (no FPGA, Cortex-M0 only, Cortex-M0 with optical flow accelerator, and full utilization), bracketed by motor-at-hover and motor-at-full-throttle cases. Lifetimes are expected to be between hover and full-throttle times, depending on the final weight of the HBP and the RoboBee task.

A.6.3 1.0V+3.3V Buck Converter

The FPGA requires 3.3 V (provided by the mainboard) for IO and 1.0 V for internal voltage. A buck converter is used to step down from the 3.7V battery to 3.3 V and 1.0 V. The LTC3615 dual voltage regulator was selected for this purpose. This particular model was chosen because it features a devkit with published schematics supporting 1.0V and 3.3V outputs. Also, the Atlys Spartan6 devkit uses similar dual voltage regulators to power the FPGA. Different models are used than in the Atlys because that devkit also generates 1.8V and 2.5V. The LTC3615 can support 3A on both outputs, which matches the regulators used in the Spartan6 devkits.
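As a sanity check, the two output voltages can be recomputed from the feedback dividers. Two assumptions are made: the LTC3615 regulates each FB pin to a nominal 0.6 V reference, and the divider values legible in the schematic of Figure A.19 (845 kΩ over 187 kΩ for one rail, 348 kΩ over 523 kΩ for the other) are read correctly.

    Vout = 0.6 V × (1 + Rtop / Rbottom)
    3.3 V rail: 0.6 V × (1 + 845/187) ≈ 3.31 V
    1.0 V rail: 0.6 V × (1 + 348/523) ≈ 1.00 V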

A.6.4 4.7V Boost+Buck Converter

The original optical flow daughterboard provided a regulated 4.7V for the optical flow sensor ring by using a boost converter (MIC2290) and regulating down to 4.7V (LP3985IM5-4.7/NOPB). The same regulator circuit is used by the HBP.

A.6.5 ADC

An ADC is necessary to determine the intensity of the current pixel read out from the OFSR, since this value is provided as an analog signal. The SPI-controlled AD7276BUJZ ADC was selected for its high bandwidth and precision. The ADC is controlled by the HBP over SPI.
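A small C sketch of how the pixel level might be recovered follows. The SPI hook and the exact framing are assumptions (the sketch supposes a 16-clock access returning two leading zero bits, the 12-bit conversion result, and two trailing padding bits; the AD7276 data sheet governs the real framing).

    #include <stdint.h>

    /* Hypothetical SPI hook: assert chip select, shift 16 bits out and back. */
    extern uint16_t spi_transfer16(uint16_t mosi_bits);

    /* Read one analog pixel level from the ADC and strip the framing bits. */
    uint16_t read_pixel_level(void)
    {
        uint16_t frame = spi_transfer16(0x0000);
        return (uint16_t)((frame >> 2) & 0x0FFF);   /* 12-bit result */
    }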

A.6.6 100 MHz Oscillator

The FXO-HC33 HCMOS 100MHz oscillator was selected, which is the replacement part for the oscillator used in the Atlys Spartan6 devkit. It was chosen primarily because it was already proven for the FPGA. In addition, it is relatively small and only requires one external capacitor.

A.6.7 External IO pins

Connections for the following are required:

Mainboard interface: The daughterboard used a Hirose DF30 20-pin female connector to interface with the mainboard, so it is also used by the HBP.

Optical flow sensor ring interface: The daughterboard used a Hirose DF30 20-pin male connector to interface with the OFSR and is reused by the HBP.

I2C: A four-pin Hirose interface.

SPI: A six-pin Hirose interface.

JTAG: A six-pin Hirose interface.

GPIO: Eight GPIO pins exposed over two four-pin Hirose interfaces.

A.6.8 PCB

A lightweight, minimum-thickness four-layer PCB is used. Area (and therefore weight) must be minimized to fit within the budget.

A.7 Helicopter brain prototype implementation

Figure A.15: Helicopter brain prototype bill of materials (BOM)

Figure A.16: Helicopter brain prototype schematic: FPGA configuration and flash

Figure A.17: Helicopter brain prototype schematic: FPGA clock, I2C and SPI interfaces

Figure A.18: Helicopter brain prototype schematic: GPIOs and helicopter mainboard interface

Figure A.19: Helicopter brain prototype schematic: FPGA voltage regulation

Figure A.20: Helicopter brain prototype schematic: optical flow camera interface

Figure A.21: Helicopter brain prototype schematic: FPGA power connections

Figure A.22: Helicopter brain prototype layout: all layers

Figure A.23: Helicopter brain prototype layout: front (top) layer

Figure A.24: Helicopter brain prototype layout: back (bottom) layer

139 PAU506 PAU503 COT30 COT28 COL3 PAT3001PAU303PAT2901PAU302PAT2801 PAU10A18 PAU10B18 PAU10C18 PAU10D18 PAU10E18 PAU10F18 PAU10G18 PAU10H18 PAU10J18 PAU10K18 PAU10L18 PAU10M18 PAU10N18 PAU10P18 PAU10R18 PAU10T18 PAU10U18 PAU10V18 PAU10W18 PAU10Y18 PAU10AA18 PAU10AB18 CO3V3PCO3V3N PAU507 PAU50M PAU502 COC2 COR3 PAR301 PAR302 PAR902 PAU508COC16PAU501 PAL301 COT29 PAU10A17 PAU10B17 PAU10C17 PAU10D17 PAU10E17 PAU10F17 PAU10G17 PAU10H17 PAU10J17 PAU10K17 PAU10L17 PAU10M17 PAU10N17 PAU10P17 PAU10R17 PAU10T17 PAU10U17 PAU10V17 PAU10W17 PAU10Y17 PAU10AA17 PAU10AB17 PAC1602 PAC1601 PAC201 COT6 COT7 PA3V3P01 PA3V3N01 PAT3201 PAT3101 PAU10A16 PAU10B16 PAU10C16 PAU10D16 PAU10E16 PAU10F16 PAU10G16 PAU10H16 PAU10J16 PAU10K16 PAU10L16 PAU10M16 PAU10N16 PAU10P16 PAU10R16 PAU10T16 PAU10U16 PAU10V16 PAU10W16 PAU10Y16 PAU10AA16 PAU10AB16 COR4 PAR401 PAR402 PAC102 PAR1601 PAR1701 PAR1001 PAR3201 PAR3202 CO1V0PCO1V0N PAU603 PAU602COR32PAU601 PA4V7P01CO4V7PPAU304COT32PAU301COT31PAC202 PAU10A15 PAU10B15 PAU10C15 PAU10D15 PAU10E15 PAU10F15 PAU10G15 PAU10H15 PAU10J15 PAU10K15 PAU10L15 PAU10M15 PAU10N15 PAU10P15 PAU10R15 PAU10T15 PAU10U15 PAU10V15 PAU10W15 PAU10Y15 PAU10AA15 PAU10AB15 PAT601 PAT701 COC1 COC17COC18 PAR502 PAR501 COR5 PAC101 PAR1602 PAR1702 PAR1002PA1V0P01COR16 COR17PA1V0N01COR10 PAR3102 PAU10A14 PAU10B14 PAU10C14 PAU10D14 PAU10E14 PAU10F14 PAU10G14 PAU10H14 PAU10J14 PAU10K14 PAU10L14 PAU10M14 PAU10N14 PAU10P14 PAU10R14 PAU10T14 PAU10U14 PAU10V14 PAU10W14 PAU10Y14 PAU10AA14 PAU10AB14 PAU209 PAC1702 PAC1701 PAC1802 PAC1801 PAC2501 PAC2901 PAC3301 PAC3701 PAR201 PAR202 COR2 COR31 PAU10A13 PAU10B13 PAU10C13 PAU10D13 PAU10E13 PAU10F13 PAU10G13 PAU10H13 PAU10J13 PAU10K13 PAU10L13 PAU10M13 PAU10N13 PAU10P13 PAU10R13 PAU10T13 PAU10U13 PAU10V13 PAU10W13 PAU10Y13 PAU10AA13 PAU10AB13 COR6 COC7 PAC1902 PAC1901 PAC2502 PAC2902 PAC3302 PAC3702 COC25COC29COC33 COC37 PAR101 PAR102 PAR3302PAU604PAR3301COU6COR33PAU605 PAR3101 PAU10A12 PAU10B12 PAU10C12 PAU10D12 PAU10E12 PAU10F12 PAU10G12 PAU10H12 PAU10J12 PAU10K12 PAU10L12 PAU10M12 PAU10N12 PAU10P12 PAU10R12 PAU10T12 PAU10U12 PAU10V12 PAU10W12 PAU10Y12 PAU10AA12 PAU10AB12 COR1 PAR602 COC19 PAC2702 PAC3102 PAC3502 PAC3902 COC4 PAU10A11 PAU10B11 PAU10C11 PAU10D11 PAU10E11 PAU10F11 PAU10G11 PAU10H11 PAU10J11 PAU10K11 PAU10L11 PAU10M11 PAU10N11 PAU10P11 PAU10R11 PAU10T11 PAU10U11 PAU10V11 PAU10W11 PAU10Y11 PAU10AA11 PAU10AB11 COC5 PAC701 PAC2102 PAC2101 COC27 COC35 COC39 PAR702 PAR701 COR7 PAR601 COP5 COC21 PAC2701 PAC3101 PAC3501 PAC3901 PAU10A10COC31PAU10B10 PAU10C10 PAU10D10 PAU10E10 PAU10F10 PAU10G10 PAU10H10 PAU10J10 PAU10K10 PAU10L10 PAU10M10 PAU10N10 PAU10P10 PAU10R10 PAU10T10 PAU10U10 PAU10V10 PAU10W10 PAU10Y10 PAU10AA10 PAU10AB10 PAC501 PAC301 PAP401PAP506 COR18 PAU205 PAU206 PAU207 PAU208 PAU10A9 PAU10B9 PAU10C9 PAU10D9 PAU10E9 PAU10F9 PAU10G9 PAU10H9 PAU10J9 PAU10K9 PAU10L9 PAU10M9 PAU10N9 PAU10P9 PAU10R9 PAU10T9 PAU10U9 PAU10V9 PAU10W9 PAU10Y9 PAU10AA9 PAU10AB9 COC9 COR29PAL101 PAL102 PAP402PAP505 COR19 PAP403PAP504 PAU10A8 PAU10B8 PAU10C8 PAU10D8 PAU10E8 PAU10F8 PAU10G8 PAU10H8 PAU10J8 PAU10K8 PAU10L8 PAU10M8 PAU10N8 PAU10P8 PAU10R8 PAU10T8 PAU10U8 PAU10V8 PAU10W8 PAU10Y8 PAU10AA8 PAU10AB8 PAC901 PAC902 PAR2902 COL1PAR2901 PAC502 PAC702 PAC302 PAP404PAP503 COD1 COR20 PAP405PAP502 COR21PAR2101 PAR2102 PAR2202 PAR2201COR22PAU10A7 PAU10B7 PAU10C7 PAU10D7 PAU10E7 PAU10F7 PAU10G7 PAU10H7 PAU10J7 PAU10K7 PAU10L7 PAU10M7 PAU10N7 PAU10P7 PAU10R7 PAU10T7 PAU10U7 PAU10V7 PAU10W7 PAU10Y7 PAU10AA7COR26PAU10AB7 PAR2601 PAR2602 PAU4024 PAU4023 PAU4022 PAU4021 PAU4020 
PAU4019 PAP406PAP501 PAD10Out1 PAD10Out4 PAR1801 PAR1802 PAU401 PAR2502 PAR2501 PAU4018 PAU10A6 PAU10B6 PAU10C6 PAU10D6 PAU10E6 PAU10F6 PAU10G6 PAU10H6 PAU10J6 PAU10K6 PAU10L6 PAU10M6 PAU10N6 PAU10P6 PAU10R6 PAU10T6 PAU10U6 PAU10V6 PAU10W6 PAU10Y6 PAU10AA6 PAU10AB6 COR25 PAD10In PAR1901 PAR1902 PAU402 PAU4017COL2 PAC602 PAC402 PAP400PAP500 COP4 PAD10Out2 PAD10Out3 PAU10A5 PAU10B5 PAU10C5 PAU10D5 PAU10E5 PAU10F5 PAU10G5 PAU10H5 PAU10J5 PAU10K5 PAU10L5 PAU10M5 PAU10N5 PAU10P5 PAU10R5 PAU10T5 PAU10U5 PAU10V5 PAU10W5 PAU10Y5 PAU10AA5 PAU10AB5 PAR2802 PAR2801PAU403 PAU4016 PAC802 PAR2001 PAR2002 COR28 PAU404 PAU4015 PAU10A4 PAU10B4 PAU10C4 PAU10D4 PAU10E4 PAU10F4 PAU10G4 PAU10H4 PAU10J4 PAU10K4 PAU10L4 PAU10M4 PAU10N4 PAU10P4 PAU10R4 PAU10T4 PAU10U4 PAU10V4 PAU10W4 PAU10Y4 PAU10AA4 PAU10AB4 COP3 COP1 PAU405 PAU4025 PAU4014 PAC601 PAP101 PAP102 PAP103 PAP104 PAP105 PAP106 PAU10A3 PAU10B3 PAU10C3 PAU10D3 PAU10E3 PAU10F3 PAU10G3 PAU10H3 PAU10J3 PAU10K3 PAU10L3 PAU10M3 PAU10N3 PAU10P3 PAU10R3 PAU10T3 PAU10U3 PAU10V3 PAU10W3 PAU10Y3 PAU10AA3 PAU10AB3 PAU406 PAU4013 PAC401 PAP201PAP304 PAR3001 PAR3002 COR30 COC6 PAP202PAP303 PAR2301 PAR2302 PAR2402 PAR2401 PAU10A2 PAU10B2 PAU10C2 PAU10D2 PAU10E2 PAU10F2 PAU10G2 PAU10H2 PAU10J2 PAU10K2 PAU10L2 PAU10M2 PAU10N2 PAU10P2 PAU10R2 PAU10T2 PAU10U2 PAU10V2 PAU10W2 PAU10Y2 PAU10AA2 PAU10AB2 PAU407 PAU408 PAU409 PAU4010 PAU4011 PAU4012 PAL201 PAL202COC3 PAP203PAP302 PAC801 PAC1001 PAC1002COC10COU4 COR15 PAP204PAP301 COR23 COR24 PAU10A1 PAU10B1 PAU10C1 PAU10D1 PAU10E1 PAU10F1 PAU10G1 PAU10H1 PAU10J1 PAU10K1 PAU10L1 PAU10M1 PAU10N1 PAU10P1 PAU10R1 PAU10T1 PAU10U1 PAU10V1 PAU10W1 PAU10Y1 PAU10AA1 PAU10AB1 PAVBAT01 PAPGOOD01 COC8 PAP100 PAR2702 PAR2701 PAR1502 PAR1501 PAP200PAP300 COP2 COU1 COVBATCOR27 COPGOOD Figure A.25: Helicopter brain prototype components: front PAC1201PAT901COT9 PAT1301 COT13 PAP7020PAP7019 PAP7018PAP7017 PAP7016PAP7015 PAP7014 PAP7013PAP7012PAP7011 PAC1401 PAP601PAP602 PAP603PAP604 PAP605PAP606 PAP607 PAP608PAP609PAP6010 PAC1202PAT1001COT10PAC1402 PAT1401COT14 PAP701PAP702 PAP703PAP704 PAP705PAP706 PAP707 PAP708PAP709PAP7010 PAT1101COC12 COC14COT11 PAT1501COT15 COU7 COP7 PAP6020PAP6019 PAP6018PAP6017 PAP6016PAP6015 PAP6014 PAP6013PAP6012PAP6011 PAT1201COT12 PAT1601COT16 PAU7015 PAU7014 PAU7013 PAU7012 PAU7011 COP6COC24COC28 COC32 COC38 COC40 PAU7016 PAU7010 PAU7017 PAU709 COC23PAC2302PAC2401 PAC2301PAC2402 PAC2801 PAC3201 PAC3801 PAC4001 PAU7018 PAU708 PAC2802 PAC3202 PAC3802 PAC4002 PAU7019 PAU7021COC11PAU707 PAU7020 PAU706 PAC1302 PAC1301 PAC2602 PAC3002 PAC3402 PAC3602 PAC1102PAU701 PAU702 PAU703 PAU704 PAC1101PAU705 PAC2601 PAC3001 PAC3401 PAC3601 COC13 PAC2001CO5V2PAC2002 PAC2201 PAC2202 PA5V201COC20 COT8COT27PAT801COT26PAR3401COT25PAR3402 PAU10A22 PAU10B22 PAU10C22 PAU10D22 PAU10E22 PAU10F22 PAU10G22 PAU10H22 PAU10J22 PAU10K22 PAU10L22 PAU10M22 PAU10N22 PAU10P22 COC26PAU10R22 PAU10T22COC30PAU10U22 PAU10V22COC34PAU10W22 COC36PAU10Y22 PAU10AA22 PAU10AB22 COU2 COR14 COR13 COR11 COR12 COC22CO4V7N PAU10A21 PAU10B21 PAU10C21 PAU10D21 PAU10E21 PAU10F21 PAU10G21 PAU10H21 PAU10J21 PAU10K21 PAU10L21 PAU10M21 PAU10N21 PAU10P21 PAU10R21 PAU10T21 PAU10U21 PAU10V21 PAU10W21 PAU10Y21 PAU10AA21 PAU10AB21 COT4 COT3 COT2 COGND PAT2701 PAT2601COR34PAT2501 PAR1402 PAR1302 PAR1102 PAR1202 PAC1501COU5 PAC1502COC15PA4V7N01 COU3 PAU10A20 PAU10B20 PAU10C20 PAU10D20 PAU10E20 PAU10F20 PAU10G20 PAU10H20 PAU10J20 PAU10K20 PAU10L20 PAU10M20 PAU10N20 PAU10P20 PAU10R20 PAU10T20 PAU10U20 PAU10V20 PAU10W20 PAU10Y20 PAU10AA20 PAU10AB20COR8 PAR802 PAR801 COR9 PAT401 
PAT301 PAT201 PAU204 PAU203 PAU202 PAU201 PAR1401 PAR1301 PAR1101 PAR1201 PAGND01 PAU505 PAU504 PAL302 COT28PAU10A19 PAU10B19 PAU10C19 PAU10D19 PAU10E19 PAU10F19 PAU10G19 PAU10H19 PAU10J19 PAU10K19 PAU10L19 PAU10M19 PAU10N19 PAU10P19 PAU10R19 PAU10T19 PAU10U19 PAU10V19 PAU10W19 PAU10Y19 PAU10AA19 PAU10AB19 PAR901 PAU506 PAU503 COT30PAT3001 PAT2901 PAT2801 PAU10A18 PAU10B18 PAU10C18 PAU10D18 PAU10E18 PAU10F18 PAU10G18 PAU10H18 PAU10J18 PAU10K18 PAU10L18 PAU10M18 PAU10N18 PAU10P18 PAU10R18 PAU10T18 PAU10U18 PAU10V18 PAU10W18 PAU10Y18 PAU10AA18 PAU10AB18 140 COL3 PAU303 PAU302 CO3V3PCO3V3N PAU507 PAU50M PAU502 COC2 COR3 PAR301 PAR302 PAR902 PAU508COC16PAU501 PAL301 COT29 PAU10A17 PAU10B17 PAU10C17 PAU10D17 PAU10E17 PAU10F17 PAU10G17 PAU10H17 PAU10J17 PAU10K17 PAU10L17 PAU10M17 PAU10N17 PAU10P17 PAU10R17 PAU10T17 PAU10U17 PAU10V17 PAU10W17 PAU10Y17 PAU10AA17 PAU10AB17 PAC1602 PAC1601 PAC201 COT6 COT7 PA3V3P01 PA3V3N01 PAT3201 PAT3101 PAU10A16 PAU10B16 PAU10C16 PAU10D16 PAU10E16 PAU10F16 PAU10G16 PAU10H16 PAU10J16 PAU10K16 PAU10L16 PAU10M16 PAU10N16 PAU10P16 PAU10R16 PAU10T16 PAU10U16 PAU10V16 PAU10W16 PAU10Y16 PAU10AA16 PAU10AB16 COR4 PAR401 PAR402 PAC102 PAR1601 PAR1701 PAR1001 PAR3201 PAR3202 CO1V0PCO1V0N PAU603 PAU602COR32PAU601 PA4V7P01CO4V7PPAU304COT32PAU301COT31PAC202 PAU10A15 PAU10B15 PAU10C15 PAU10D15 PAU10E15 PAU10F15 PAU10G15 PAU10H15 PAU10J15 PAU10K15 PAU10L15 PAU10M15 PAU10N15 PAU10P15 PAU10R15 PAU10T15 PAU10U15 PAU10V15 PAU10W15 PAU10Y15 PAU10AA15 PAU10AB15 PAT601 PAT701 COC1 COC17COC18 PAR502 PAR501 COR5 PAC101 PAR1602 PAR1702 PAR1002PA1V0P01COR16 COR17PA1V0N01COR10 PAR3102 PAU10A14 PAU10B14 PAU10C14 PAU10D14 PAU10E14 PAU10F14 PAU10G14 PAU10H14 PAU10J14 PAU10K14 PAU10L14 PAU10M14 PAU10N14 PAU10P14 PAU10R14 PAU10T14 PAU10U14 PAU10V14 PAU10W14 PAU10Y14 PAU10AA14 PAU10AB14 PAU209 PAC1702 PAC1701 PAC1802 PAC1801 PAC2501 PAC2901 PAC3301 PAC3701 PAR201 PAR202 COR2 COR31 PAU10A13 PAU10B13 PAU10C13 PAU10D13 PAU10E13 PAU10F13 PAU10G13 PAU10H13 PAU10J13 PAU10K13 PAU10L13 PAU10M13 PAU10N13 PAU10P13 PAU10R13 PAU10T13 PAU10U13 PAU10V13 PAU10W13 PAU10Y13 PAU10AA13 PAU10AB13 COR6 COC7 PAR3101PAC1902 PAC1901 PAC2502 PAC2902 PAC3302 PAC3702 COC25COC29COC33 COC37 PAR101 PAR102 PAR3302PAU604PAR3301COU6COR33PAU605 PAU10A12 PAU10B12 PAU10C12 PAU10D12 PAU10E12 PAU10F12 PAU10G12 PAU10H12 PAU10J12 PAU10K12 PAU10L12 PAU10M12 PAU10N12 PAU10P12 PAU10R12 PAU10T12 PAU10U12 PAU10V12 PAU10W12 PAU10Y12 PAU10AA12 PAU10AB12 COR1 PAR602 COC19 PAC2702 PAC3102 PAC3502 PAC3902 COC4 PAU10A11 PAU10B11 PAU10C11 PAU10D11 PAU10E11 PAU10F11 PAU10G11 PAU10H11 PAU10J11 PAU10K11 PAU10L11 PAU10M11 PAU10N11 PAU10P11 PAU10R11 PAU10T11 PAU10U11 PAU10V11 PAU10W11 PAU10Y11 PAU10AA11 PAU10AB11 COC5 PAC701 PAC2102 PAC2101 COC27 COC35 COC39 PAR702 PAR701 COR7 PAR601 COP5 COC21 PAC2701 PAC3101 PAC3501 PAC3901 PAU10A10COC31PAU10B10 PAU10C10 PAU10D10 PAU10E10 PAU10F10 PAU10G10 PAU10H10 PAU10J10 PAU10K10 PAU10L10 PAU10M10 PAU10N10 PAU10P10 PAU10R10 PAU10T10 PAU10U10 PAU10V10 PAU10W10 PAU10Y10 PAU10AA10 PAU10AB10 PAC501 PAC301 PAP401PAP506 COR18 PAU205 PAU206 PAU207 PAU208 PAU10A9 PAU10B9 PAU10C9 PAU10D9 PAU10E9 PAU10F9 PAU10G9 PAU10H9 PAU10J9 PAU10K9 PAU10L9 PAU10M9 PAU10N9 PAU10P9 PAU10R9 PAU10T9 PAU10U9 PAU10V9 PAU10W9 PAU10Y9 PAU10AA9 PAU10AB9 COC9 COR29PAL101 PAL102 PAP402PAP505 COR19 PAP403PAP504 PAU10A8 PAU10B8 PAU10C8 PAU10D8 PAU10E8 PAU10F8 PAU10G8 PAU10H8 PAU10J8 PAU10K8 PAU10L8 PAU10M8 PAU10N8 PAU10P8 PAU10R8 PAU10T8 PAU10U8 PAU10V8 PAU10W8 PAU10Y8 PAU10AA8 PAU10AB8 PAC901 PAC902 PAR2902 COL1PAR2901 PAC502 PAC702 PAC302 
PAP404PAP503 COD1 COR20 PAP405PAP502 COR21PAR2101 PAR2102 PAR2202 PAR2201COR22PAU10A7 PAU10B7 PAU10C7 PAU10D7 PAU10E7 PAU10F7 PAU10G7 PAU10H7 PAU10J7 PAU10K7 PAU10L7 PAU10M7 PAU10N7 PAU10P7 PAU10R7 PAU10T7 PAU10U7 PAU10V7 PAU10W7 PAU10Y7 PAU10AA7COR26PAU10AB7 PAR2601 PAR2602 PAU4024 PAU4023 PAU4022 PAU4021 PAU4020 PAU4019 PAP406PAP501 PAD10Out1 PAD10Out4 PAR1801 PAR1802 PAU401 PAR2502 PAR2501 PAU4018 PAU10A6 PAU10B6 PAU10C6 PAU10D6 PAU10E6 PAU10F6 PAU10G6 PAU10H6 PAU10J6 PAU10K6 PAU10L6 PAU10M6 PAU10N6 PAU10P6 PAU10R6 PAU10T6 PAU10U6 PAU10V6 PAU10W6 PAU10Y6 PAU10AA6 PAU10AB6 COR25 PAD10In PAR1901 PAR1902 PAU402 PAU4017COL2 PAC602 PAC402 PAP400PAP500 COP4 PAD10Out2 PAD10Out3 PAU10A5 PAU10B5 PAU10C5 PAU10D5 PAU10E5 PAU10F5 PAU10G5 PAU10H5 PAU10J5 PAU10K5 PAU10L5 PAU10M5 PAU10N5 PAU10P5 PAU10R5 PAU10T5 PAU10U5 PAU10V5 PAU10W5 PAU10Y5 PAU10AA5 PAU10AB5 PAR2802 PAR2801PAU403 PAU4016 PAC802 PAR2001 PAR2002 COR28 PAU404 PAU4015 PAU10A4 PAU10B4 PAU10C4 PAU10D4 PAU10E4 PAU10F4 PAU10G4 PAU10H4 PAU10J4 PAU10K4 PAU10L4 PAU10M4 PAU10N4 PAU10P4 PAU10R4 PAU10T4 PAU10U4 PAU10V4 PAU10W4 PAU10Y4 PAU10AA4 PAU10AB4 COP3 COP1 PAU405 PAU4025 PAU4014 PAC601 PAP101 PAP102 PAP103 PAP104 PAP105 PAP106 PAU10A3 PAU10B3 PAU10C3 PAU10D3 PAU10E3 PAU10F3 PAU10G3 PAU10H3 PAU10J3 PAU10K3 PAU10L3 PAU10M3 PAU10N3 PAU10P3 PAU10R3 PAU10T3 PAU10U3 PAU10V3 PAU10W3 PAU10Y3 PAU10AA3 PAU10AB3 PAU406 PAU4013 PAC401 PAP201PAP304 PAR3001 PAR3002 COR30 COC6 PAP202PAP303 PAR2301 PAR2302 PAR2402 PAR2401 PAU10A2 PAU10B2 PAU10C2 PAU10D2 PAU10E2 PAU10F2 PAU10G2 PAU10H2 PAU10J2 PAU10K2 PAU10L2 PAU10M2 PAU10N2 PAU10P2 PAU10R2 PAU10T2 PAU10U2 PAU10V2 PAU10W2 PAU10Y2 PAU10AA2 PAU10AB2 PAU407 PAU408 PAU409 PAU4010 PAU4011 PAU4012 PAL201 PAL202COC3 PAP203PAP302 PAC801 PAC1001 PAC1002COC10COU4 COR15 PAP204PAP301 COR23 COR24 PAU10A1 PAU10B1 PAU10C1 PAU10D1 PAU10E1 PAU10F1 PAU10G1 PAU10H1 PAU10J1 PAU10K1 PAU10L1 PAU10M1 PAU10N1 PAU10P1 PAU10R1 PAU10T1 PAU10U1 PAU10V1 PAU10W1 PAU10Y1 PAU10AA1 PAU10AB1 PAVBAT01 PAPGOOD01 COC8 PAP100 PAR2702 PAR2701 PAR1502 PAR1501 PAP200PAP300 COP2 COU1 COVBATCOR27 COPGOOD Figure A.26: Helicopter brain prototype components: back Bibliography
