CprE 488 – Embedded Systems Design
Lecture 8 – Hardware Acceleration

Joseph Zambreno
Electrical and Computer Engineering, Iowa State University
www.ece.iastate.edu/~zambreno | rcl.ece.iastate.edu

"First, solve the problem. Then, write the code." – John Johnson

Motivation: Moore's Law

• Every two years:
  – Double the number of transistors
  – Build higher-performance general-purpose processors
• Make the transistors available to the masses
• Increase performance (1.8×↑)
• Lower the cost of computing (1.8×↓)
• Sounds great, what's the catch?
[Photo: Gordon Moore]

Motivation: Moore's Law (cont.)

• The "catch" – powering the transistors without melting the chip!
[Chart: transistors per chip, 1970–2015, growing from roughly 2,300 to 2,200,000,000, while package power grows from about 0.5 W to 130 W]

Motivation: Dennard Scaling

• As transistors get smaller, their power density stays constant
• Transistor: a 2D voltage-controlled switch
  – Each generation scales dimensions ×0.7, voltage ×0.7, and doping concentrations accordingly
  – Area 0.5×↓, Capacitance 0.7×↓, Frequency 1.4×↑
  – Power = Capacitance × Frequency × Voltage², so Power 0.5×↓
[Photo: Robert Dennard]

Motivation: Dennard Scaling (cont.)

• In the mid 2000s, Dennard scaling "broke" – dimensions and capacitance kept shrinking, but voltage could no longer scale down with them, so power per chip stopped falling

Motivation: Dark Silicon

• Dark silicon – the fraction of transistors that must be powered off at all times (due to power + thermal constraints)
• The evolution of processors is strongly motivated by this ending of Dennard scaling
  – Expected continued evolution towards HW specialization / acceleration

This Week's Topic

• Hardware Acceleration:
  – Performance analysis and overhead
  – Coprocessors vs. accelerators
  – Common acceleration techniques
  – Acceleration examples
• Reading: Wolf sections 10.4–10.5

Straight from the Headlines...

• Accelerating a diverse range of applications using reconfigurable logic is a trending area:
  – D. Hoang, D. Lopresti, "FPGA Implementation of Systolic Sequence Alignment"
  – D. Ross, O. Vellacott, M. Turner, "An FPGA-based Hardware Accelerator for Image Processing"
  – J. Lockwood, "Design and Implementation of a Multicast, Input-Buffered ATM Switch for the iPOINT Testbed"
• What these examples have in common:
  – Illustrative of the potential for custom computing to enable faster scientific and engineering discovery
  – Relatively small impact (outside of academia)
  – All work done in 1992

And Today?

• Reconfigurable logic supporting the data center, specializing clusters, and tightly integrated on-chip

A (Brief) Background

• Earliest reconfigurable computer proposed at UCLA in the 1960s
  – G. Estrin et al., "Parallel Processing in a Restructurable Computer System," IEEE Trans. Electronic Computers, pp. 747-755, Dec. 1963.
  – Basic concepts well ahead of the enabling technology – could only prototype a crude approximation
• Current chips contain memory cells that hold both configuration and state information
  – Only a partial architecture exists before programming
  – After configuration, the device provides an execution environment for a specific application
• Goal is to adapt at the logic level to solve specific problems

A (Brief) Background (cont.)

• Today, well-known niches for FPGAs:
  – Emulation ("Sneak Peek: Inside NVIDIA's Emulation Lab" – http://blogs.nvidia.com – Cadence Palladium cluster used for GPU emulation)
  – Telecom ("Vodafone Sure Signal: Inside a Femtocell" – http://zdnet.com – Xilinx Spartan 3E FPGA used for glue logic)
  – Space / defense ("How Mars Rover Got its 'Dream Mode'" – http://eetimes.com – Microsemi FPGAs reprogrammed for surface exploration)
• $4.5B market, major device manufacturers:
  – Xilinx (~45% market share)
  – Altera (~40%)
  – Microsemi / Actel (~10%)
  – Others (Lattice, Achronix, Tabula)

So Who Cares?

• Claim – reconfigurable processors offer a definite advantage over general-purpose counterparts with regard to functional density
  – A. DeHon, "The Density Advantage of Configurable Computing," IEEE Computer, pp. 41-49, Apr. 2000
  – Functional density = computations per chip area per cycle time
• Considering general computing trends (Moore's Law, Dennard scaling, "dark" silicon) – what can I get for my N billion transistors?
[Die photos: Altera Stratix IV EP4S40G5; Intel Haswell Core i7-5960X]

Accelerated Systems

• Use additional computational units dedicated to some functionality
• Hardware/software co-design: joint design of the hardware and software architectures
[Diagram: CPU, accelerator, memory, and I/O on a shared bus; the CPU sends requests and data to the accelerator and reads back results]

Accelerator vs. Coprocessor

• A coprocessor executes instructions
  – Instructions are dispatched by the CPU
• An accelerator appears as a device on the bus
  – Typically controlled by registers (memory-mapped I/O) – see the sketch below
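To make the memory-mapped control concrete, here is a minimal C sketch of how software might start such an accelerator and wait for its result. The base address, register offsets, and bit layout are illustrative assumptions, not those of any particular device; a real driver would take them from the platform's address map and would normally use an interrupt rather than polling.

    #include <stdint.h>

    /* Hypothetical register map for a bus-attached accelerator (illustrative only). */
    #define ACCEL_BASE  0x43C00000u   /* assumed base address from the system address map */
    #define REG_CTRL    0x00u         /* bit 0 = start                                    */
    #define REG_STATUS  0x04u         /* bit 0 = done                                     */
    #define REG_SRC     0x08u         /* physical address of the input buffer             */
    #define REG_DST     0x0Cu         /* physical address of the output buffer            */
    #define REG_LEN     0x10u         /* transfer length in bytes                         */

    static inline void reg_write(uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(uintptr_t)(ACCEL_BASE + off) = val;  /* volatile: device register */
    }

    static inline uint32_t reg_read(uint32_t off)
    {
        return *(volatile uint32_t *)(uintptr_t)(ACCEL_BASE + off);
    }

    /* Start one job and spin until the accelerator reports completion. */
    void accel_run(uint32_t src_phys, uint32_t dst_phys, uint32_t nbytes)
    {
        reg_write(REG_SRC, src_phys);
        reg_write(REG_DST, dst_phys);
        reg_write(REG_LEN, nbytes);
        reg_write(REG_CTRL, 0x1u);                   /* set the start bit */
        while ((reg_read(REG_STATUS) & 0x1u) == 0u)  /* poll the done bit */
            ;
    }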
EX-08.1: Instruction Augmentation

• A processor can only describe a small number of basic computations in a cycle
  – I instruction bits → 2^I operations
• Many operations could be performed on two W-bit words
• ALU implementations restrict execution of some simple operations, e.g. bit reversal (see the C sketch below)
[Figure: input bits a31 a30 … a0 have their positions swapped to form output bits b31 … b0]
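As a concrete version of the bit-reversal example above, the plain C function below needs a loop over all 32 bits (or several shift-and-mask passes), whereas a custom functional unit can produce the same permutation in a single cycle, since in hardware it is only wiring. The function name is ours, for illustration.

    #include <stdint.h>

    /* Reverse the bit order of a 32-bit word: input bit 31 becomes output bit 0,
     * and so on.  A conventional ALU has no single instruction for this, so
     * software pays roughly one iteration per bit. */
    uint32_t bit_reverse32(uint32_t a)
    {
        uint32_t b = 0;
        for (int i = 0; i < 32; i++) {
            b = (b << 1) | (a & 1u);   /* shift the next input bit into the result */
            a >>= 1;
        }
        return b;
    }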
Accelerated System Design

• First, determine that the system really needs to be accelerated
  – How much faster is the accelerator on the core function?
  – How much data transfer overhead? Compute bound vs. memory bound vs. I/O bound?
• Design the accelerator and the system interface
• If tighter CPU integration is required:
  – Create a functional unit for the augmented instructions
  – Use compiler techniques to identify and exploit the new functional unit

Amdahl's Law

• If an optimization improves a fraction f of execution time by a factor of a:

  Speedup = T_old / [(1 − f)·T_old + (f/a)·T_old] = 1 / [(1 − f) + f/a]

• This formula is known as Amdahl's Law
• Lessons:
  – If f → 100%, then speedup → a
  – If a → ∞, then speedup → 1 / (1 − f)
• Summary:
  – Make the common case fast
  – Watch out for the non-optimized component
[Photo: Gene Amdahl]

Heterogeneous Execution Model

• Instructions executed over time, for a typical accelerated program:
  – Initialization: 49% of the code, 0.5% of the run time (CPU)
  – "Hot" loop: 1% of the code, 99% of the run time (moved to the co-processor)
  – Clean-up: 49% of the code, 0.5% of the run time (CPU)

Heterogeneous Computing: Performance

• Move "bottleneck" computations from software to hardware
• Example:
  – Application requires a week of CPU time
  – One computation (the kernel) consumes 99% of execution time
  – By Amdahl's Law, a kernel speedup of a = 100 with f = 0.99 gives an application speedup of 1 / (0.01 + 0.99/100) ≈ 50×, the second row of the table

  Kernel speedup   Application speedup   Execution time
  50               34                    5.0 hours
  100              50                    3.3 hours
  200              67                    2.5 hours
  500              83                    2.0 hours
  1000             91                    1.8 hours

Hardware/Software Partitioning

• C code for an FIR filter:

      for (i = 0; i < 128; i++)
          y[i] += c[i] * x[i];

• Hardware/software partitioning: the designer selects the performance-critical regions for hardware implementation and creates a custom accelerator for them using a hardware design methodology; the compiler handles the remaining code for the processor
• On the processor alone the loop takes ~1000 cycles; on the processor + FPGA it takes ~10 cycles
  – Speedup = 1000 cycles / 10 cycles = 100×
[Chart: relative execution time and energy of the software-only (Sw) versus the partitioned hardware/software (Hw/Sw) implementation]

High-Level Synthesis

• Problem: describing a circuit using an HDL is time consuming/difficult
• Solution: high-level synthesis
  – Creates the circuit from a high-level code description
  – Allows developers to use a higher-level specification
  – Potentially enables synthesis for software developers
• More on this in a bit
[Diagram: tool flow from the binary/updated code through decompilation, HW/SW partitioning, compilation and high-level synthesis, libraries/object code, the linker, and bitstream generation, targeting a µP + FPGA platform]

Accelerator Design Challenges

• Debugging – how to properly test the accelerator separately, and then in conjunction with the rest of the system (hw/sw co-simulation)
• Coherency – how to safely share results between the CPU and the accelerator
  – Impact on cache design, shared memory
  – Solutions look similar to those for resource sharing in conventional operating systems, but are typically ad hoc
• Analysis – determining the effects of any hardware parallelism on performance
  – Must take into account accelerator execution time, data transfer time, and synchronization overhead
  – Heterogeneous multi-threading helps, but complicates the design significantly
  – Overlapping I/O and computation (streaming) – see the double-buffering sketch below
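One common way to overlap I/O and computation, as mentioned in the last bullet, is double buffering: the CPU fills one buffer while the accelerator processes the other. The C sketch below outlines the idea under stated assumptions; accel_start() and accel_wait() are hypothetical driver calls, not a real API, and the chunk size is arbitrary.

    #include <stdint.h>
    #include <string.h>

    #define CHUNK 1024   /* samples per transfer (illustrative) */

    /* Hypothetical driver calls: launch a job on a buffer, block until it finishes. */
    void accel_start(const int16_t *in, int16_t *out, size_t n);
    void accel_wait(void);

    void stream_process(const int16_t *src, int16_t *dst, size_t total)
    {
        static int16_t buf[2][CHUNK];          /* two buffers, processed in place */
        size_t off = 0, prev_off = 0, prev_n = 0;
        int cur = 0, busy = 0;

        while (off < total) {
            size_t n = (total - off < CHUNK) ? total - off : CHUNK;

            /* Fill the next buffer while the accelerator (if busy) still works
             * on the previous one -- this is where the overlap happens. */
            memcpy(buf[cur], src + off, n * sizeof *src);

            if (busy) {                              /* retire the previous chunk */
                accel_wait();
                memcpy(dst + prev_off, buf[cur ^ 1], prev_n * sizeof *dst);
            }

            accel_start(buf[cur], buf[cur], n);      /* process this chunk */
            busy = 1;
            prev_off = off;
            prev_n = n;
            off += n;
            cur ^= 1;
        }

        if (busy) {                                  /* drain the final chunk */
            accel_wait();
            memcpy(dst + prev_off, buf[cur ^ 1], prev_n * sizeof *dst);
        }
    }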

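Returning to the FIR loop from the partitioning example and to the high-level synthesis flow above, here is a sketch of how that loop might be written as an HLS-ready C kernel. The pragma follows a Vivado-HLS-style convention as an assumption; other HLS tools use different directives, and a real design would also need interface directives for the arrays.

    #define N 128   /* loop bound from the partitioning example */

    /* Multiply-accumulate loop written as a function an HLS tool could turn into
     * a pipelined datapath; the same code still compiles and runs on the CPU. */
    void mac_kernel(const int c[N], const int x[N], int y[N])
    {
        for (int i = 0; i < N; i++) {
    #pragma HLS PIPELINE II=1   /* assumed Vivado-HLS-style directive: one MAC per cycle */
            y[i] += c[i] * x[i];
        }
    }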