
CS152: Computer Systems Architecture
Dark Silicon, Application-Specific Acceleration

Sang-Woo Jun
Winter 2019

Not All Transistors Can Be Active!

 Utilization wall: “With each successive generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints.” -- Venkatesh, ASPLOS ‘10

 The following slides are adapted from Michael Taylor’s 2012 talk “Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse” – Marked ‘**’

Tradeoffs Between Cores And Frequency**

 Current generation: 4 cores @ 1.8 GHz
 Next generation, three options:
  o 4x4 cores @ 0.9 GHz (16 dim)
  o 2x4 cores @ 1.8 GHz (8 cores dark, 8 dim)
  o 4 cores @ 2x1.8 GHz (12 cores dark)

The Four Horsemen**

 What do we do with this dark silicon?
 “Four top contenders, each of which seemed like an unlikely candidate from the beginning, carrying unwelcome burdens in design, manufacturing and programming. None is ideal, but each has its benefit and the optimal solution probably incorporates all four of them…”

The Shrinking Horseman (#1)**

 “Area is expensive. Chip designers will just build smaller chips instead of having dark silicon in their designs!”
 First, dark silicon doesn’t mean useless silicon; it just means it’s underclocked or not used all of the time.
 There’s lots of dark silicon in current chips:
  o On-chip GPU on AMD Fusion or Sandy Bridge when running GCC
  o L3 is very dark for applications with small working sets
  o SSE units for integer apps
  o …

The Shrinking Horseman (#1)**

 Competition and Margins
  o If there is an advantage to be had from using dark silicon, you have to use it too, to keep up with the Joneses.
 Diminished Returns (e.g., $10 silicon selling for $200 today)
  o Savings Exponentially Diminishing: $5, $2.5, $1.25, 63c
  o Overheads: packaging, test, marketing, etc.
  o Chip structures like I/O Pad Area do not scale
 Exponential increase in Power Density -> Exponential Rise in Temperature
 But, some chips will shrink
  o Nasty low margin, high competition chips; or a monopoly (Sony Cell)

The Dim Horseman (#2)**

 Spatial dimming: Have enough cores to exceed power budget, but underclock them

 Gen 1 & 2 Multicores (higher core count, lower freqs)
 Near Threshold Voltage (NTV) Operation
  o Delay Loss/Lower clock speed > Energy Gain
  o But, make it up with lots of dim cores

The Dim Horseman (#2)**

 Temporal Dimming: Have enough cores to exceed power budget, but use them only in bursts
  o Dim cores, but overclock if cold – e.g., Intel TurboBoost
  o E.g., ARM A15 Core in mobile phones
    • A15 power usage way above sustainable for phone.
    • 10 second bursts at most (big.LITTLE)

The Specialized Horseman (#3)**

 “We will use all of that dark silicon area to build specialized cores, each of them tuned for the task at hand (10-100x more energy efficient), and only turn on the ones we need…”
 Insights:
  o Power is now more expensive than area
  o Specialized logic can improve energy efficiency by 10-1000x

The Specialized Horseman (#3)**

 C-cores Approach:
  o Fill dark silicon with Conservation Cores, or c-cores, which are automatically generated, specialized cores that save energy on common apps
 Execution jumps among c-cores (hot code) and a host CPU (cold code)
  o Power-gate HW that is not currently in use
  o Coherent Memory & Patching Support for C-cores

Typical Energy Savings**

The Specialized Horseman (#3) -- Pssst

 Another active thrust in this area is reconfigurable hardware acceleration using Field-Programmable Gate Arrays (FPGAs)
  o A single FPGA fabric can be configured at runtime to act like any C-core
  o Not as efficient as a prefabricated C-core, but can cover any of them at runtime
  o More on this later!

The Deus Ex Machina Horseman (#4)**

 Deus Ex Machina: “A plot device whereby a seemingly unsolvable problem is suddenly and abruptly solved with the unexpected intervention of some new event, character, ability or object.”

 “[…] are the fundamental problem”
 “[…], Trigate, High-K, nanotubes, 3D, for one-time improvements, but none are sustainable solutions across process generations.”

The Deus Ex Machina Horseman (#4)**

 Possible “Beyond CMOS” Device Directions
  o Nano-electrical Mechanical Relays?
  o Tunnel Field Effect Transistors (TFETs)?
  o Spin-Transfer Torque MRAM (STT-MRAM)?
  o Graphene?
  o Human brain?
  o DNA Computing?

CS152: Computer Systems Architecture
Field-Programmable Gate Arrays

Sang-Woo Jun
Winter 2019

What Are FPGAs

 Field-Programmable Gate Array
 Can be configured to act like any circuit – More later!
 Can do many things, but we focus on computation acceleration

FPGAs Come In Many Forms

 PCIe-Attached
 In-Storage
 CPU Integrated
 In-Network

How Is It Different From CPU/GPUs

 GPU – The other major accelerator
 CPU/GPU hardware is fixed
  o “General purpose”
  o We write programs (sequences of instructions) for them
 FPGA hardware is not fixed
  o “Special purpose”
  o Hardware can be whatever we want
  o Will our hardware require/support software? Maybe!
 Optimized hardware is very efficient
  o GPU-level performance**
  o 10x power efficiency (300 W vs 30 W)

Analogy

CPU/GPU comes with fixed circuits
FPGA gives you a big bag of components to build whatever
Could be a CPU/GPU!

(Image credits: “The Z-Berry”; “Experimental Investigations on Radiation Characteristics of IC Chips”; benryves.com “Z80 Computer”; Shadi Soundation: Homebrew 4 CPU)

Fine-Grained Parallelism of Special-Purpose Circuits
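The cycle counts for this slide's gravitational-force example, F = G × m1 × m2 / ((x1 - x2)^2 + (y1 - y2)^2), can be reproduced with a small sketch. It assumes every basic operation takes one cycle, that a CPU issues one operation per cycle, and that a special-purpose circuit runs every operation whose inputs are ready in the same cycle. The names `OPS` and `asap_levels` are illustrative and do not come from the lecture.

```python
# Dataflow graph of the force computation, broken into basic operations.
# Each entry maps a result name to (operation, operand names).
OPS = {
    "a": ("mul", ("G", "m1")),
    "b": ("mul", ("a", "m2")),
    "c": ("sub", ("x1", "x2")),
    "d": ("square", ("c",)),
    "e": ("sub", ("y1", "y2")),
    "f": ("square", ("e",)),
    "g": ("add", ("d", "f")),
    "ret": ("div", ("b", "g")),
}
INPUTS = {"G", "m1", "m2", "x1", "x2", "y1", "y2"}


def asap_levels(ops, inputs):
    """Return the earliest cycle in which each operation can execute,
    assuming all operations with ready operands run in parallel."""
    level = {name: 0 for name in inputs}  # inputs are available at cycle 0

    def resolve(name):
        if name not in level:
            _, operands = ops[name]
            level[name] = 1 + max(resolve(o) for o in operands)
        return level[name]

    for name in ops:
        resolve(name)
    return {name: level[name] for name in ops}


levels = asap_levels(OPS, INPUTS)
sequential_cycles = len(OPS)            # one operation per cycle on a CPU
parallel_cycles = max(levels.values())  # depth of the dataflow graph

print(sequential_cycles, parallel_cycles)  # 8 4
```

Merging dependent operations into compound ones shortens the graph further, at the cost of a longer critical path per cycle.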

 Example -- Calculating gravitational force:
  $$F = \frac{G \times m_1 \times m_2}{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
 8 instructions on a CPU → 8 cycles**
 Far fewer cycles on a special-purpose circuit
  o 4 cycles with basic operations
      Cycle 1: A = G × m1    C = x1 - x2    E = y1 - y2
      Cycle 2: B = A × m2    D = C^2        F = E^2
      Cycle 3: G = D + F
      Cycle 4: Ret = B / G
  o 3 cycles with compound operations
      Cycle 1: A = G × m1 × m2    B = (x1 - x2)^2    C = (y1 - y2)^2
      Cycle 2: D = B + C
      Cycle 3: Ret = A / D
  o 1 cycle with even further compound operations
      Ret = (G × m1 × m2) / ((x1 - x2)^2 + (y1 - y2)^2)
  o Compound operations may slow down the clock

Coarse-Grained Parallelism of Special-Purpose Circuits

 Typical unit of parallelism for general-purpose units is threads ~= cores
 Special-purpose processing units can also be replicated for parallelism
  o Large, complex processing units: Few can fit in chip
  o Small, simple processing units: Many can fit in chip
 Independent operations can explicitly be parallelized across dedicated hardware modules
  o Hundreds/thousands of operations are regularly done in parallel
 Only generates hardware useful for the application
  o Instruction? Decoding? Cache? Coherence?

How Is It Different From ASICs

 ASIC (Application-Specific Integrated Circuit)
  o Special chip purpose-built for an application
  o E.g., ASIC bitcoin miner, Intel neural network accelerator
  o Function cannot be changed once the chip is built, and building it is expensive
 + FPGAs can be field-programmed
  o Function can be changed completely whenever
  o FPGA fabric emulates custom circuits
 - Emulated circuits are not as efficient as bare-metal ones
  o ASICs have ~10x performance (larger circuits, faster clock)
  o ASICs have ~10x power efficiency

Basic FPGA Architecture

(Figure labels: “Configurable Logic Block (CLB)”, programmable I/O block, 6-input look-up table (LUT), latch, flip-flop (FF))

Ex) 2-LUT for “AND”

Input 1  Input 2  Output
   0        0        0
   0        1        0
   1        0        0
   1        1        1

FF: stores state for sequential circuit construction
Programmable interconnect

Basic FPGA Architecture – DSP Blocks

 CLBs act as gates – Many needed to implement high-level logic
 Arithmetic operation provided as efficient ALU blocks
  o “Digital Signal Processing (DSP) blocks”
  o Each block provides an adder + multiplier
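The look-up table idea can be sketched in a few lines: a k-input LUT is just 2^k stored configuration bits indexed by the input values, so "programming" it means choosing those bits. The `LUT` class and `half_adder` helper here are illustrative names, not part of any FPGA toolchain, and the model ignores timing, flip-flops, and the interconnect.

```python
class LUT:
    """A k-input look-up table: one stored output bit per input combination."""

    def __init__(self, num_inputs, config_bits):
        if len(config_bits) != 2 ** num_inputs:
            raise ValueError("need exactly 2**num_inputs configuration bits")
        self.num_inputs = num_inputs
        self.config_bits = list(config_bits)

    def __call__(self, *inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:  # first input is the most significant index bit
            index = (index << 1) | (bit & 1)
        return self.config_bits[index]


# Same rows as the truth table on the slide: 00->0, 01->0, 10->0, 11->1.
and2 = LUT(2, [0, 0, 0, 1])
# Reconfiguring the same structure with different bits yields a different gate.
xor2 = LUT(2, [0, 1, 1, 0])


def half_adder(a, b):
    """Two LUTs wired together: even a 1-bit adder needs more than one."""
    return xor2(a, b), and2(a, b)  # (sum, carry)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and2(a, b), half_adder(a, b))
```

Building wide arithmetic this way consumes many LUTs, which is the motivation for providing adders and multipliers as dedicated DSP blocks.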

Basic FPGA Architecture – Block RAM

 CLB can act as flip-flops
  o (~1 bit/block) – tiny!
 Some on-chip SRAM provided as blocks (“Block RAM”)
  o ~18/36 Kbit/block, MBs per chip
  o […] access to […] → multi-TB/s

Basic FPGA Architecture – Hard Cores

 Some functions are provided as efficient, non-configurable “hard cores”
  o Multi-core ARM cores (“Zynq” series)
  o Multi-Gigabit Transceivers
  o PCIe/Ethernet PHY
  o Memory controllers
  o …

(Figure labels: Memory, Ethernet, ARM, PCIe)

Example Accelerator Card Architecture

 “FPGA Mezzanine Card” (FMC) Expansion
  o Network Ports, Memory, Storage, PCIe, …

(Figure labels: FPGA, FMC, General-Purpose I/O Pins, Multi-Gigabit Transceivers, 1GbE, 40GbE, DRAM, DRAM, PCIe)

Example Accelerator Card (VCU108)

Programming/Using an FPGA Accelerator

 Bitfile is programmed to FPGA over “JTAG” interface
  o Typically used over USB cable
  o Supports FPGA programming, limited debugging access, etc.
 PCIe-attached FPGA accelerator card is typically used similarly to GPUs
  o Program FPGA, execute software
  o Software copies data to FPGA board and notifies FPGA -> FPGA logic performs computations -> Software copies data back from FPGA
 FPGA flexibility gives immense freedom of usage patterns
  o Streaming, coherent memory, …
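The GPU-like usage pattern above (program, copy in, notify, compute, copy out) can be sketched with a purely simulated card. `SimulatedFpgaCard` and `offload` are hypothetical stand-ins invented for this illustration; they are not a real driver or vendor API, and a Python function plays the role of the logic configured by the bitfile.

```python
class SimulatedFpgaCard:
    """Software stand-in for a PCIe-attached FPGA card (hypothetical, not a real API)."""

    def __init__(self):
        self.kernel = None        # plays the role of the configured FPGA logic
        self.device_memory = {}   # plays the role of on-board DRAM

    def program(self, kernel):
        # On real hardware a bitfile would be loaded, e.g. over JTAG.
        self.kernel = kernel

    def copy_to_device(self, name, data):
        self.device_memory[name] = list(data)

    def run(self, src, dst):
        # "Notify FPGA": the configured logic processes the input buffer.
        if self.kernel is None:
            raise RuntimeError("FPGA is not programmed")
        self.device_memory[dst] = [self.kernel(x) for x in self.device_memory[src]]

    def copy_from_device(self, name):
        return list(self.device_memory[name])


def offload(card, kernel, data):
    """Host-side flow: program, copy in, notify and compute, copy back."""
    card.program(kernel)
    card.copy_to_device("in", data)
    card.run("in", "out")
    return card.copy_from_device("out")


result = offload(SimulatedFpgaCard(), lambda x: x * x, [1, 2, 3])
print(result)  # [1, 4, 9]
```

Other usage patterns mentioned on the slide, such as streaming, would replace the explicit copy steps with a continuous flow of data through the logic.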