
CprE 488 – Embedded Systems Design
Lecture 8 – Hardware Acceleration
Joseph Zambreno
Electrical and Computer Engineering, Iowa State University
www.ece.iastate.edu/~zambreno | rcl.ece.iastate.edu

"First, solve the problem. Then, write the code." – John Johnson

Motivation: Moore's Law
• Every two years:
  – Double the number of transistors
• Build higher performance general-purpose processors
• Make the transistors available to the masses
• Increase performance (1.8×↑)
• Lower the cost (1.8×↓)
• Sounds great, what's the catch?

Motivation: Moore's Law (cont.)
• The "catch" – powering the transistors without melting the chip!
[Figure: chip transistor count, 1970–2015, growing from 2,300 to 2,200,000,000, with chip power growing from 0.5W to 130W]

Motivation: Dennard Scaling
• As transistors get smaller, their power density stays constant (Robert Dennard)
• Transistor: 2D voltage-controlled switch; per generation:
  – Dimensions ×0.7, Voltage ×0.7, Doping concentrations scaled to match
  – Area 0.5×↓
  – Capacitance 0.7×↓
  – Frequency 1.4×↑
  – Power = Capacitance × Frequency × Voltage², so Power 0.5×↓
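Plugging the table's per-generation factors into the power equation confirms the last row (a one-line check using only numbers from the slide):

    P_{new} = (0.7\,C) \times (1.4\,f) \times (0.7\,V)^2 \approx 0.48\,C f V^2 \approx 0.5\,P_{old}

Per-transistor power halves while area also halves, so power density stays constant.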

Motivation: Dennard Scaling (cont.)
• In the mid 2000s, Dennard scaling "broke"
[Figure: the 2D voltage-controlled switch scaling table revisited – Dimensions ×0.7, Area 0.5×↓, Capacitance 0.7×↓, Frequency 1.4×↑, Power = Capacitance × Frequency × Voltage² – as of ~2015]

Motivation: Dark Silicon
• Dark silicon – the fraction of transistors that need to be powered off at all times (due to power + thermal constraints)
• Evolution of processors strongly motivated by this ending of Dennard scaling
  – Expected continued evolution towards HW specialization / acceleration


This Week's Topic
• Hardware Acceleration:
  – Performance analysis
  – Coprocessors vs. accelerators
  – Common acceleration techniques
  – Acceleration examples
• Reading: Wolf sections 10.4-10.5

Straight from the Headlines...
• Accelerating a diverse range of applications using reconfigurable logic is a trending area:
  – D. Hoang, D. Lopresti, "FPGA Implementation of Systolic Sequence Alignment"
  – D. Ross, O. Vellacott, M. Turner, "An FPGA-based Hardware Accelerator for Image Processing"
  – J. Lockwood, "Design and Implementation of a Multicast, Input-Buffered ATM Switch for the iPOINT Testbed"
• What these examples have in common:
  – Illustrative of the potential for custom computing to enable faster scientific and engineering discovery
  – Relatively small impact (outside of academia)
  – All work done in 1992

And Today?
• Reconfigurable logic supporting the data center, specializing clusters, and tightly integrated on-chip

A (Brief) Background
• Earliest reconfigurable computer proposed at UCLA in the 1960s
  – G. Estrin et al., "Parallel Processing in a Restructurable Computer System," IEEE Trans. Electronic Computers, pp. 747-755, Dec. 1963
  – Basic concepts well ahead of the enabling technology – could only prototype a crude approximation
• Current chips contain memory cells that hold both configuration and state information
  – Only a partial architecture exists before programming
  – After configuration, the device provides an execution environment for a specific application
• Goal is to adapt at the logic level to solve specific problems


A (Brief) Background (cont.)
• Today, well known niches for FPGAs:
  – Emulation ("Sneak Peek: Inside NVIDIA's Emulation Lab" – http://blogs.nvidia.com; Cadence Palladium cluster used for GPU emulation)
  – Telecom ("Vodafone Sure Signal: Inside a Femtocell" – http://zdnet.com; Xilinx Spartan 3E FPGA used for glue logic)
  – Space / defense ("How Mars Rover Got its 'Dream Mode'" – http://eetimes.com; Microsemi FPGAs reprogrammed for surface exploration)
• $4.5B market, major device manufacturers:
  – Xilinx (~45% market share)
  – Altera (~40%)
  – Microsemi / Actel (~10%)
  – Others (Lattice, ...)

So Who Cares?
• Claim – reconfigurable processors offer a definite advantage over general-purpose counterparts with regards to functional density
  – A. DeHon, "The Density Advantage of Configurable Computing," IEEE Computer, pp. 41-49, Apr. 2000
  – Computations per chip area per cycle time
• Considering general computing trends (Moore's Law, Dennard scaling, "dark" silicon) – what can I get for my N billion transistors?
[Images: Altera Stratix IV EP4S40G5; Intel Haswell Core i7-5960X]

Accelerated Systems
• Use additional computational units dedicated to some functionality
• Hardware/software co-design: joint design of hardware and software architectures
[Figure: CPU, memory, and I/O on a shared bus, with an accelerator attached; the CPU issues requests and data to the accelerator, which returns results and data]

Accelerator vs. Coprocessor
• A coprocessor executes instructions
  – Instructions are dispatched by the CPU
• An accelerator appears as a device on the bus
  – Typically controlled by registers (memory-mapped I/O)
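To make the memory-mapped control model concrete, here is a minimal C sketch of how software typically drives such an accelerator. The base address, register offsets, and bit meanings are hypothetical placeholders, not taken from any device in these slides.

    #include <stdint.h>

    /* Hypothetical register map for a bus-attached accelerator (illustrative only). */
    #define ACCEL_BASE   0x43C00000u          /* placeholder base address          */
    #define REG_CTRL     0x00                 /* write 1 to start                  */
    #define REG_STATUS   0x04                 /* bit 0 = done                      */
    #define REG_SRC      0x08                 /* input buffer address              */
    #define REG_DST      0x0C                 /* output buffer address             */

    static volatile uint32_t *accel = (volatile uint32_t *)ACCEL_BASE;

    void accel_run(uint32_t src, uint32_t dst)
    {
        accel[REG_SRC / 4] = src;             /* program the data pointers         */
        accel[REG_DST / 4] = dst;
        accel[REG_CTRL / 4] = 1;              /* kick off the accelerator          */
        while ((accel[REG_STATUS / 4] & 1) == 0)
            ;                                 /* busy-wait until the done bit sets */
    }

On Linux the base address would normally be obtained by mmap()-ing /dev/mem or a UIO device rather than hard-coding a physical address.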


EX-08.1: Instruction Augmentation
• A processor can only describe a small number of basic computations in a cycle
  – I instruction bits -> 2^I operations
• Many operations could be performed on 2 W-bit words
• ALU implementations restrict execution of some simple operations
  – e.g. bit reversal (a software sketch follows after the next slide)
[Figure: reversing a 32-bit word – bits a31 a30 ... a0 swap positions to produce b31 ... b0]

Accelerated System Design
• First, determine that the system really needs to be accelerated
  – How much faster will the accelerator be on the core computation?
  – How much data transfer overhead?
  – Compute bound vs. memory bound vs. I/O bound?
• Design accelerator and system interface
• If tighter CPU integration required:
  – Create a functional unit for augmented instructions
  – Compiler techniques to identify/use new functional unit
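Returning to the bit-reversal example above: in hardware it is pure wiring, but on a conventional ALU it takes a log-depth sequence of shift-and-mask steps. A minimal C version (the standard textbook approach, not taken from the slides):

    #include <stdint.h>

    /* Reverse the bit order of a 32-bit word using shift/mask steps.
     * Each line swaps progressively larger groups of bits; the equivalent
     * hardware realization is just crossed wires with zero logic delay.   */
    uint32_t bit_reverse32(uint32_t x)
    {
        x = ((x >> 1)  & 0x55555555u) | ((x & 0x55555555u) << 1);   /* swap odd/even bits */
        x = ((x >> 2)  & 0x33333333u) | ((x & 0x33333333u) << 2);   /* swap bit pairs     */
        x = ((x >> 4)  & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);   /* swap nibbles       */
        x = ((x >> 8)  & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);   /* swap bytes         */
        x = (x >> 16) | (x << 16);                                   /* swap half-words    */
        return x;
    }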

Amdahl's Law
• If an optimization improves a fraction f of execution time by a factor of a:
    T_new = [(1 - f) + f/a] × T_old
    Speedup = 1 / [(1 - f) + f/a]
• This formula is known as Amdahl's Law (Gene Amdahl)
• Lessons from Amdahl's Law:
  – If f → 100%, then speedup = a
  – If a → ∞, then speedup = 1 / (1 - f)
• Summary
  – Make the common case fast
  – Watch out for the non-optimized component

Heterogeneous Execution Model
[Figure: instructions executed over time – 49% of the code (initialization) accounts for 0.5% of run time; a "hot" loop of 1% of the code accounts for 99% of run time and is mapped to the co-processor; the remaining 49% of the code (clean up) accounts for the final 0.5% of run time]
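A tiny helper (my own illustration, not from the slides) that evaluates Amdahl's Law and reproduces the limiting behavior listed above:

    #include <stdio.h>

    /* Overall speedup when a fraction f of execution time is sped up by a factor a. */
    double amdahl_speedup(double f, double a)
    {
        return 1.0 / ((1.0 - f) + f / a);
    }

    int main(void)
    {
        /* 99% of run time accelerated (the "hot" loop in the heterogeneous model). */
        printf("a = 50   -> %.1fx\n", amdahl_speedup(0.99, 50.0));     /* ~33.6x   */
        printf("a = 1000 -> %.1fx\n", amdahl_speedup(0.99, 1000.0));   /* ~90.9x   */
        printf("a -> inf -> %.1fx\n", 1.0 / (1.0 - 0.99));             /* 100x cap */
        return 0;
    }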


Heterogeneous Computing: Performance
• Move "bottleneck" computations from software to hardware
• Example:
  – Application requires a week of CPU time
  – One computation consumes 99% of execution time

    Kernel speedup   Application speedup   Execution time
    50               34                    5.0 hours
    100              50                    3.3 hours
    200              67                    2.5 hours
    500              83                    2.0 hours
    1000             91                    1.8 hours

  (The application speedups follow from Amdahl's Law with f = 0.99; e.g., 1 / (0.01 + 0.99/50) ≈ 33.6 ≈ 34.)

Hardware/Software Partitioning
• C code for an FIR filter ('for' loop):

    for (i = 0; i < 128; i++)
        y[i] += c[i] * x[i];

• Hardware/software partitioning: the designer selects performance-critical regions for hardware implementation, then creates a custom accelerator using a hardware design methodology
• The loop takes ~1000 cycles on the processor vs. ~10 cycles in the hardware datapath (parallel multipliers feeding an adder tree)
  – Speedup = 1000 cycles / 10 cycles = 100x
[Figure: compiler flow targeting a processor alone vs. a processor + FPGA, with bar charts comparing time and energy for the Sw-only and Hw/Sw versions]

High-Level Synthesis
• Problem: describing a circuit using an HDL is time consuming / difficult
• Solution: high-level synthesis
  – Create a circuit from high-level code
  – Allows developers to use a higher-level specification
  – Potentially, enables synthesis for software developers
• More on this in a bit
[Figure: HW/SW co-synthesis tool flow – high-level code goes through HW/SW partitioning; the software partition goes through a compiler, libraries/object code, and a linker to produce an updated binary for the uP, while the hardware partition goes through decompilation / high-level synthesis to produce a bitstream for the FPGA]

Accelerator Design Challenges
• Debugging – how to properly test the accelerator separately, and then in conjunction with the rest of the system (hw/sw co-simulation)
• Coherency – how to safely share results between CPU and accelerator
  – Impact on design
  – Solutions look similar to those for resource sharing in conventional operating systems, but are typically ad-hoc
• Analysis – determining the effects of any hardware parallelism on performance
  – Must take into account accelerator execution time, data transfer time, synchronization overhead
  – Heterogeneous multi-threading helps, but complicates design significantly
  – Overlapping I/O and computation (streaming)
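One common way to overlap I/O and computation, as the last bullet suggests, is double buffering: while the accelerator processes one buffer, the CPU fills the next. A minimal sketch; accel_start()/accel_wait() are assumed non-blocking variants of the register interface sketched earlier, and fill_input()/drain_output() are placeholders for application I/O.

    #include <stddef.h>
    #include <stdint.h>

    /* Placeholders: application-specific I/O and a non-blocking accelerator interface. */
    void fill_input(uint8_t *buf, size_t len);
    void drain_output(const uint8_t *buf, size_t len);
    void accel_start(const uint8_t *in, uint8_t *out, size_t len);
    void accel_wait(void);

    enum { CHUNK = 4096 };
    static uint8_t in_buf[2][CHUNK], out_buf[2][CHUNK];

    /* Double-buffered streaming: overlap data transfer with accelerator execution. */
    void stream_all(size_t n_chunks)
    {
        int cur = 0;
        fill_input(in_buf[cur], CHUNK);                     /* prime the first buffer       */
        for (size_t i = 0; i < n_chunks; i++) {
            int nxt = cur ^ 1;
            accel_start(in_buf[cur], out_buf[cur], CHUNK);  /* kick off chunk i             */
            if (i + 1 < n_chunks)
                fill_input(in_buf[nxt], CHUNK);             /* fetch chunk i+1 while i runs */
            accel_wait();                                   /* wait for chunk i to finish   */
            drain_output(out_buf[cur], CHUNK);
            cur = nxt;
        }
    }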


A Representative Example
• PRISM-I: early example of an FPGA / CPU coupling:
  – P. Athanas and H. Silverman, "Processor Reconfiguration through Instruction Set Metamorphosis," IEEE Computer, pp. 11-18, Mar. 1993
  – An attempt to augment a RISC instruction set with application-specific operators
  – Significant performance speedups for parallel bit-level operations
• Consider a permutation table from the Data Encryption Standard (DES) – libgcrypt code below

An (Equally) Representative Ex.
• Applicability of FPGAs to a commonly occurring computational kernel in scientific applications
  – X. Wang and J. Zambreno, "An Efficient Architecture for Floating-Point Eigenvalue Decomposition", Proc. of the Int'l Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2014
  – Uses the modified one-sided Jacobi rotation approach, mapped to a 1D systolic array to extract parallelism in EVD computation
  – Fundamental design challenge – tackling O(N^k) complexity with P resources (P << N)

    #define DO_PERMUTATION(a, temp, b, offset, mask) \
        temp = ((a >> offset) ^ b) & mask;           \
        b ^= temp;                                   \
        a ^= temp << offset;

    #define INITIAL_PERMUTATION(left, temp, right)           \
        DO_PERMUTATION(left, temp, right, 4, 0x0f0f0f0f)     \
        DO_PERMUTATION(left, temp, right, 16, 0x0000ffff)    \
        DO_PERMUTATION(right, temp, left, 2, 0x33333333)     \
        DO_PERMUTATION(right, temp, left, 8, 0x00ff00ff)     \
        right = (right << 1) | (right >> 31);                \
        temp = (left ^ right) & 0xaaaaaaaa;                  \
        right ^= temp;                                       \
        left ^= temp;                                        \
        left = (left << 1) | (left >> 31);

(Partial libgcrypt implementation from des.c of the DES Initial Permutation table – not well-suited for a conventional SW implementation. The equivalent HW realization can be implemented solely via routing resources.)
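To see what these macros actually do, here is a small self-contained usage example (my own, not from des.c): DO_PERMUTATION is a "delta swap" that exchanges the masked bits of b with the bits of a that sit offset positions higher.

    #include <stdint.h>
    #include <stdio.h>

    #define DO_PERMUTATION(a, temp, b, offset, mask) \
        temp = ((a >> offset) ^ b) & mask;           \
        b ^= temp;                                   \
        a ^= temp << offset;

    int main(void)
    {
        uint32_t left = 0x01234567u, right = 0x89ABCDEFu, temp;

        /* Swap the low nibble of each byte of 'right' with the high nibble
         * of the corresponding byte of 'left' (offset 4, mask 0x0f0f0f0f). */
        DO_PERMUTATION(left, temp, right, 4, 0x0f0f0f0f);
        printf("left = %08X, right = %08X\n", left, right);
        /* -> left = 91B3D5F7, right = 80A2C4E6 */
        return 0;
    }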

A Cause for Pessimism?
• Hardware amenability was a design criterion for the Advanced Encryption Standard (AES)
  – Flurry of activity looking at AES on FPGAs (literally 1000s of implementations, optimizations, attacks)
  – J. Zambreno, D. Nguyen and A. Choudhary, "Exploring Area/Delay Tradeoffs in an AES FPGA Implementation", Proc. of the Int'l Conference on Field-Programmable Logic and its Applications (FPL), Aug. 2004
• Main contribution: an early exploration of the design decisions that lead to area/delay tradeoffs in an AES accelerator
• Significant (at the time) throughput of 23.57 Gbps for AES-128E in ECB mode
[Figure: pipelined AES datapath – registered rounds R1-R10 between the input plaintext and the output ciphertext]

    aesni_encrypt:
        ...
    .L000enc1_loop:
        aesenc      %xmm4,%xmm0
        decl        %ecx
        movups      (%edx),%xmm4
        leal        16(%edx),%edx
        jnz         .L000enc1_loop
        aesenclast  %xmm4,%xmm0
        movups      %xmm0,(%eax)
        ret
        ...

(Partial libgcrypt implementation using Intel's AES-NI instruction set. Performance is ~0.75 cycles per byte – 40+ Gbps per core with relatively easy software patches.)

FPGA Evolutionary Pressures
• Base logic element consists of a binary lookup-table (LUT) and a flip-flop
• Inter-element routing dominates both delay and area for non-trivial circuits
• Consider the critical path of a (naïve) ripple-carry adder:
    delay = N × (LUT_delay + routing_delay)
          = N × (1.0 ns + 0.25 ns)
          = 1.25 ns per bit
  So, a 32-bit adder at 25 MHz?
[Figure: ripple-carry adder mapped to a chain of 4-LUTs, one per bit]

FPGA Evolutionary Pressures (2)
1) Simple arithmetic circuits
   – Dedicated carry logic and routing paths
   – Altera Flex 6000, Xilinx XC4000 series (1991)
2) Memory-intensive tasks
   – LUTs configurable as distributed RAM resources, later built-in SRAM blocks, FIFOs
   – Altera Flex 10K, Xilinx Virtex series (1998)
3) DSP applications
   – Dedicated multipliers, later multiply-accumulate operators
   – Altera Stratix, Xilinx Virtex II series (2001)
• Compelling application:
  – K. Townsend and J. Zambreno, "Reduce, Reuse, Recycle (R^3): a Design Methodology for Sparse Matrix Vector Multiplication on Reconfigurable Platforms", Proc. of the Int'l Conference on Application-specific Systems, Architectures and Processors (ASAP), 2013
  – 13.7 Gflops for SpMV (matrix dependent)

FPGA Evolutionary Pressures (3)
4) Hardware / software codesign
   – Dedicated CPUs
   – Altera Cyclone V, Xilinx Virtex II Pro series (2002)
5) Hardware-in-the-loop integration
   – Various transceivers, clock generators, interfaces, ADCs, temperature sensors, etc.
   – Altera Stratix, Xilinx Virtex II Pro series (2002)
6) Scientific / engineering workloads
   – Floating-point cores
   – Altera Stratix 10 series (2014)
• Compelling application:
  – S. Vyas, C. Kumar, J. Zambreno, C. Gill, R. Cytron and P. Jones, "An FPGA-based Plant-on-Chip Platform for Cyber-Physical System Analysis", IEEE Embedded Systems Letters (ESL), vol. 6, no. 1, 2014
  – Close integration with NIOS CPU, interfacing with sensors, actuators

Accelerator Proximity
[Figure: possible attachment points for reconfigurable logic within a workstation – as a functional unit (FU) inside the CPU, as a coprocessor alongside the caches, as an attached processing unit on the memory bus, or as a standalone processing unit behind the I/O interface]
• Although self-reconfiguration is possible, some software integration with a reconfigurable accelerator is almost always present
• CPU – FPGA proximity has implications for programming model, device capacity, I/O

Splash 2 (1993)
• J. Arnold et al., "The Splash 2 processor and applications," Proc. of the Int'l Conference on Computer Design (ICCD), Oct. 1993
  – Attached to a Sparc-2 base, with 17 Xilinx XC4010 FPGAs, 17 512KB SRAMs, and 9 TI SN74ACT8841s
[Figure: Splash 2 organization – a Sparc-based host on the S-Bus drives an interface board and array boards; each array board carries FPGAs F1-F16 with local memories M1-M16, linked by a crossbar switch and a linear SIMD datapath, with F0/M0 handling board I/O and the R-Bus connecting to the previous/next board]
• Development required up to 5 (!) programs, with code for the host CPU, interface board, F0, F1-F16, and the TI crossbar chips


Xilinx Zynq (2013)
• Coupling of a dual-core ARM Cortex-A9 with reconfigurable logic
• "Processing System" is fully integrated and hardwired, and the platform can behave like a typical processor by default
• ARM CPU can control reconfiguration of the programmable logic region
• Development environment is processor/IP-centric versus FPGA-centric
[Images: Digilent ZedBoard and Zynq-7000 FPGA; Xilinx Zynq system-level view in XPS]

Automation to the Rescue?
• Developer efficiency continues to be a limiting factor
• Numerous approaches to the "behavioral synthesis" problem of generating useful hardware from high-level descriptions
• C-to-HDL variants:
  – Handel-C (Oxford)
  – ROCCC (UC-Riverside)
  – Catapult C (Mentor Graphics)
  – SystemC (Accellera)
  – Cynthesizer C (Cadence)
  – ImpulseC (Impulse)
• Many other comparable approaches:
  – HDL Coder (Mathworks)
  – Vivado High-Level Synthesis (Xilinx)
  – Bluespec, SystemVerilog
  (an HLS-style C sketch appears after the methodology slide below)
• Opinion: these tools can automate certain classes of logic, BUT:
  – Cannot generate efficient output for "hard problems"
  – Unfamiliar / uncomfortable syntax for both SW and HW engineers
  – Similar algorithmic challenges to auto-parallelizing compilers
  – Sorry students, you're still learning VHDL :)
("New RCL grad student seen trying to escape the lab")

The Right Stuff
• Applications that map well to FPGA-based acceleration tend to have common characteristics:
  – Significant kernels of computation, significant data
    • Amdahl's law is typically more relevant than Gustafson's law
  – Fixed-point or integer data representation
    • If the application is Gflop-constrained, use a GPU
    • Changing (slowly) – see Altera Stratix 10-based systems
  – Fine-grained parallelism
    • But if the working set fits in cache, a CPU will still be hard to beat (MHz for MHz)
    • Systolic model of computation, where FPGA depth > equivalent CPU depth and number of FPGA PEs >> number of x86 FUs
  – Real-time constraints, system integration
    • HPC workloads should go on HPCs (i.e. accelerators are an orthogonal concern)
    • Current GPUs cannot make useful latency guarantees

Typical Application Acceleration Methodology
• Algorithmic Understanding – What is the purpose of the application? Can its complexity be lowered?
• Application Profiling – Where are the application's "hotspots"? ("Profile. Don't Speculate." – Daniel Bernstein)
• System-Level Design – What hardware and software infrastructure is needed?
• Architectural Design – How can I accelerate the bottlenecks?
• HW / SW Co-Design – How does it all fit?
• Integration and Test
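To make the HLS discussion above concrete, here is a minimal sketch of the kind of C kernel such tools accept, reusing the FIR loop from the partitioning slide. The pragmas are written in the style of Xilinx Vivado HLS; treat the exact directives and any implied performance as illustrative assumptions, not tool documentation.

    /* FIR-style kernel written for an HLS flow (sketch only).
     * With the loop unrolled and the arrays partitioned, an HLS tool can in
     * principle build the multiplier/adder tree shown on the hardware/software
     * partitioning slide instead of a sequential loop.                        */
    #define N 128

    void fir_kernel(const int c[N], const int x[N], int y[N])
    {
    #pragma HLS ARRAY_PARTITION variable=c complete
    #pragma HLS ARRAY_PARTITION variable=x complete
        for (int i = 0; i < N; i++) {
    #pragma HLS UNROLL
            y[i] += c[i] * x[i];
        }
    }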

Ex. Project (1) – HW Accelerated Kalman Filtering
• Piecewise-Affine Kalman Filter (PWAKF)
• Significant speedups over SW equivalent
• Resource usage grows linearly with model size
[Figure: datapath built from dual-ported BRAMs (BRAM_i,0 – BRAM_i,2) feeding 3x3 matrix operands A, B, C, D through 2:1 multiplexers and a transpose unit to compute E = D + C·A⁻¹·B]

Ex. Project (2) – Accelerated Graph Processing
• Implementation of a bulk synchronous parallel model using a high performance reconfigurable computer
• Competitive performance to a large Intel system at significantly less power consumption

Our Reconfigurable Platform
• Convey HC-2ex "hybrid core" computer:
  – Tightly coupled host-coprocessor via HCMI
  – Globally shared, high performance memory (80 GB/s)
  – Four Application Engines (AEs)
• Can be used as a large coprocessor, or with custom "personalities" that accelerate arbitrary applications
  – Support for all interfacing logic, hardware/software co-simulation, early prototyping, performance profiling
  – Conventional C/C++ (host) and VHDL/Verilog (coprocessor) for development
• 14 FPGAs in total:
  – 4 FPGAs used as the AEs (Xilinx Virtex 5-LX330s operating at 150 MHz)
  – 8 FPGAs used as Memory Controllers
  – 2 FPGAs used as the Application Engine Hub to interface with the CPU subsystem

Other Views
• "Iowa State University Students Win MemoCODE Design Contest Using Convey Computer HC-1", MarketWire, July 2012
• "Team from Iowa State Wins 2014 MemoCODE Design Contest", PRWeb, Oct. 2014


Ex. Project (3) – Real-Time Object Tracking
1. Test a previously developed approach for visually tracking markers mounted on a motor-grader blade
2. Modify as needed, port and accelerate the tracking algorithm to a Zynq-based development board
3. Create modular hardware IP cores for reuse in future projects
4. Demonstrate the real-world accuracy of the tracking
5. Integrate into existing SoC camera-logger-display platform
6. Field testing and evaluation
• Sara Beery, "Tracking the Blade of a Motor Grader: A Vision Based Solution". John Deere Intelligent Solutions Machine Automation Group, Sep 2015.

The Algorithm
• "Camera" Model – read image; undistort image; convert to grayscale
• Image Segmentation – perform adaptive thresholding
• Object Pruning – mask image using previous location; remove too large / small objects
• Marker Detection – find contours; fit ellipses to contours; find most appropriate ellipses (<= 6)
• Found 6 centers? If yes: reproject centers into image; find blade location

Algorithm
[Figure: image pipeline stages – 1. Input image (from file), 2. Lens undistortion, 3. Grayscale conversion, 4. Adaptive thresholding, 5. Object removal, 6. Contours, 7. Detected markers]

Our Initial Observations
• Matlab prototype made heavy use of the toolbox, so conversion to OpenCV (C/C++) for an embedded platform was relatively straightforward
• Initial performance was quite slow, but with several opportunities for early optimization
  – Image masking / cropping can be done earlier and much more aggressively
  – Simpler adaptive thresholding algorithms can be applied with good accuracy / performance tradeoffs
  – Reading image files from disk is a likely bottleneck, but can be generally ignored since the project intent was to target an SoC camera platform (reading straight from the sensor)


Profiling: Version 1.2
[Figure: profiling results for version 1.2 of the software implementation]

System-Level Design (Zedboard)
[Figure: Zynq-7020 FPGA split into the Processing System (ARM CPU) and Reconfigurable Logic (pixel pipeline of Grayscale, Masking, and Thresholding stages fed by a VDMA); SD card holds the Linux filesystem and input images; UART for console; DRAM holds Linux, the OpenCV application, HW interface logic, intermediate buffers, and OpenCV objects]

Architectural Design
• Grayscale conversion and masking are relatively straightforward
• Sliding window architecture for adaptive thresholding – applicable to many different image processing algorithms (a software sketch follows below)
• Advantages:
  – Software configurable for different resolutions, window sizes
  – High performance (1 pixel per clock), heavily pipelined for clock frequency

Resource Utilization
[Figure: FPGA resource utilization of the pixel-pipeline IP cores]
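For readers unfamiliar with the technique, here is a minimal software sketch of window-based adaptive thresholding (mean-of-neighborhood with an offset). The window size, offset, and function names are illustrative assumptions, not the project's actual IP core; the hardware version streams one pixel per clock through line buffers instead of re-reading the window.

    #include <stdint.h>

    /* Adaptive thresholding: a pixel is foreground if it is brighter than the
     * mean of the w x w window around it minus a small offset. The hardware
     * pipeline computes the same window mean with line buffers so that one
     * pixel is produced per clock cycle.                                      */
    void adaptive_threshold(const uint8_t *in, uint8_t *out,
                            int width, int height, int w, int offset)
    {
        int r = w / 2;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int sum = 0, count = 0;
                for (int dy = -r; dy <= r; dy++) {
                    for (int dx = -r; dx <= r; dx++) {
                        int yy = y + dy, xx = x + dx;
                        if (yy >= 0 && yy < height && xx >= 0 && xx < width) {
                            sum += in[yy * width + xx];
                            count++;
                        }
                    }
                }
                int mean = sum / count;
                out[y * width + x] = (in[y * width + x] > mean - offset) ? 255 : 0;
            }
        }
    }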


Performance Results
[Figure: bar chart of results for different combinations of the pipeline stages – G: Grayscale, M: Masking, T: Thresholding, C: Cropping – with values 64, 61, 60, 56, 38, 31, 3, 2, 2]

What Does the Future Hold?
• Continued device specialization
  – Blurred lines between FPGAs, GPUs, and CPUs
  – X floating-point units, Y other accelerators, and Z memory controllers with some software-configurable layout and multi-tasking model
• Deeper integration – reconfigurable logic in (more) commodity chips
  ("Hottest Windows 11 feature")
• Use in embedded applications continues to grow
  – Dynamic power versus performance tradeoffs
• Slow adoption in HPC space
  – Graph 500 (now), Green 500 (soon), Top 500 (eventually)

Acknowledgments
• These slides are inspired in part by material developed and copyright by:
  – Marilyn Wolf (Georgia Tech)
  – Jason Bakos (University of South Carolina)
  – Greg Stitt (University of Florida)
  – Hadi Esmaeilzadeh (Georgia Tech)
