ESE532: System-on-a-Chip Architecture
Day 1: August 28, 2019
Introduction and Overview
Everyone grab:
• Preclass
• Feedback Sheet (1/2 page)
Penn ESE532 Fall 2019 -- DeHon

Today
• Case for Programmable SoC
• Course Goals
• Outcomes
• New/evolving Course, Risks, Tools
• Sample Optimization
• This course (incl. policies, logistics)

Apple A12 Bionic
• 84 mm², 7 nm
• 7 billion transistors
• iPhone XS, XR
  – iPad 2019
• 6 ARM cores
  – 2 fast
  – 4 low energy
• 4 custom GPUs
• Neural Engine
  – 5 trillion ops/s?

Questions
• Why do today’s SoCs look like they do?
• How to approach programming modern SoCs?
• How to design a custom SoC?
• When building a System-on-a-Chip (SoC)
  – How much area should go into:
    • Processor cores, GPUs, FPGA logic, memory, interconnect, custom functions (which) …?

FPGA
Field-Programmable Gate Array
K-LUT (typical k=4) compute block w/ optional output Flip-Flop
ESE171, ESE150, CIS371

Case for Programmable SoC

The Way Things Were
25 years ago
• Wanted programmability
  – used a processor
• Wanted high-throughput
  – used a custom IC
• Wanted product differentiation
  – Got it at the board level
  – Select which ICs and how wired
• Build a custom IC
  – It was about gates and logic

Today
• Microprocessor may not be fast enough
  – (but often it is)
  – Or low enough energy
• Time and cost of a custom IC is too high
  – $100M’s of dollars for development, years
• FPGAs promising
  – But build everything from prog. gates?
• Premium for small part count
  – And avoid chip crossing
  – ICs with billions of transistors

Non-Recurring Engineering (NRE) Costs
• Costs spent up front on development
  – Engineering Design Time
  – Prototypes
  – Mask costs
• Recurring Engineering
  – Costs to produce each chip

NRE Costs

NRE Cost (continued)

Amortize NRE with Volume

https://semiengineering.com/how-much-will-that-chip-cost/

Economics Forcing Fewer, More Customizable Chips
• Economics force fewer, more customizable chips
  – Mask costs in the millions of dollars
  – Custom IC design NRE 10s to 100s of millions of dollars
• Need market of billions of dollars to recoup investment
• With fixed or slowly growing total IC industry revenues
• ⇒ Number of unique chips must decrease

Large ICs
• Now contain significant software
  – Almost all have embedded processors
• Must co-design SW and HW
• Must solve complete computing task
  – Task has components with a variety of needs
  – Some don’t need custom circuit
  – 90/10 Rule
• Chips must be programmable

Given Demand for Programmable
• How do we get higher performance than a processor, while retaining programmability?

Programmable SoC
• Implementation Platform for innovation
  – This is what you target (avoid NRE)
  – Implementation vehicle

Programmable SoC
UG1085 Xilinx UltraScale Zynq TRM (p27)

Then and Now
25 years ago
• Programmability?
  – use a processor
• High-throughput
  – used a custom IC
• Wanted product differentiation
  – board level
  – Select & wire ICs
• Build a custom IC
  – It was about gates and logic
Today
• Programmability?
  – uP, FPGA, GPU, PSoC
• High-throughput
  – FPGA, GPU, PSoC, custom
• Wanted product differentiation
  – Program FPGAs, PSoC
• Build a custom IC
  – System and software

Course Goals, Outcomes

Goals
• Create Computer Engineers
  – SW/HW divide is wrong, outdated
  – Parallelism, data movement, resource management, abstractions
  – Cannot build a chip without software
• SoC user – know how to exploit
• SoC designer – architecture space, hw/sw codesign
• Project experience – design and optimization

Outcomes
• Design, optimize, and program a modern System-on-a-Chip.
• Analyze, identify bottlenecks, design-space
  – Modeling → write equations to estimate
• Decompose into parallel components
• Characterize and develop real-time solutions
• Implement both hardware and software solutions
• Formulate hardware/software tradeoffs, and perform hardware/software codesign

Roles
• PhD Qualifier
  – One broad Computer Engineering
• CMPE Concurrency
• Hands-on Project course

New and Evolving Course
• Spring 2017 – first offering
  – Raw, all assignments new … some buggy
  – Assignments too tedious, long
• Fall 2017 – second offering
  – Refine assignments, project
  – Increased explicit modeling emphasis
  – Hard, not insane
• Fall 2018 – third offering
  – Not much different from 2017
  – Added real-time ethernet data handling; project groups of 3
  – Many students challenged with C and software engineering
  – Stream debug and performance challenging
• Fall 2019 – now
  – Basic structure remains same
  – Try to front-load more C
  – Try to better introduce Stream optimization and debug
  – Group writeup on projects

Outcomes
• Understand the system on a chip from gates to application software, including:
  – on-chip memories and communication networks, I/O interfacing, design of accelerators, processors, firmware and OS/infrastructure software.
• Understand and estimate key design metrics and requirements including:
  – area, latency, throughput, energy, power, predictability, and reliability.

Tools
• Are complex
• Will be challenging, but good for you to build confidence that you can understand and master them
• Tool runtimes can be long
• Learning and sharing experience will be part of assignments

Distinction
CIS240, 371, 501
• Best Effort Computing
  – Run as fast as you can
• Binary compatible
• ISA separation
• Shared memory parallelism
ESE532
• Hardware-Software codesign
  – Willing to recompile, maybe rewrite code
  – Define/refine hardware
• Real-Time
  – Guarantee meet deadline
• Non shared-memory parallelism models

Abstraction Stack
Figure labels: Software; Embedded Sys: ESE350/519; Systems; SoC Arch: ESE 532; Processors; Processor Arch: CIS 371/501 (CIS240); Mixed Signal: ESE 568 (ADC, DAC, Switched Capacitors); Gates, Memories; Circuits/VLSI, Digital: ESE370/570, Analog: ESE419/572 (Amplifier, Compare, Voltage/Current Ref.); Devices: ESE521; Transistors (ESE218)

Approach -- Example

Abstract Approach
• Identify requirements, bottlenecks
• Decompose Parallel Opportunities
  – At extreme, how parallel could we make it?
  – What forms of parallelism exist?
    • Thread-level, data parallel, instruction-level
• Design space of mapping
  – Choices of where to map, area-time tradeoffs
• Map, analyze, refine
  – Write equations to understand, predict

SPICE Circuit Simulator
Matrix Solve $Ax = B$
• $A$ matrix
• $B$ vector
• $x$ unknown vector
• Solve for $x$
Linear algebra: solving n eqns in n unknowns.
Example: Kapre+DeHon, TRCAD 2012

Analyze
• $T = T_{modeleval} + T_{matsolve} + T_{ctrl}$

Speedup
• If only model evaluation is accelerated: only about 2x speedup
• If we want better than 14x speedup, must also attack control

Analyze
• $T = T_{modeleval} + T_{matsolve} + T_{ctrl}$
• What should we speed up first?
• What happens if we only speed up modeleval?
  – $T = T_{matsolve} + T_{modeleval}/S + T_{ctrl}$

Model Evaluation: Trivial Hardware Implementation
$I_{D1} = I_s \times \left(e^{V_{D1}/v_j} - 1\right)$
$G_{D1} = \frac{d}{dV_{D1}}\left(I_{D1}\right) = I_s \times e^{V_{D1}/v_j} \times \frac{1}{v_j}$
[Figure: operator dataflow graph (*, ÷, −, e^x) over inputs a to f]
Verilog-AMS as Domain-Specific Language

Spatial Parallelism
• Every operation (*, +, /) gets dedicated hardware.
• Implement task in space → use additional area for each operator.
• Parallel – all operations occur simultaneously.
[Figure: operator dataflow graph (*, ÷, −, e^x) over inputs a to f]

Parallelism: Model Evaluation
Data Parallel
• Every device circuit independent
• Many of each type of device
• Can evaluate in parallel
  – $T = T_{seq}/N_{proc}$
• Build pipelined circuit for model
  – $T_{seq} = N_{comp} \times T_{cycle}$ vs. $T_{pipe} = T_{cycle}$

Spatial Too Big?
• Fully spatial: ~100x speedup, multiple FPGAs
• Custom VLIW: ~10x speedup, 1 FPGA
VLIW = Very Long Instruction Word; exploits Instruction-Level Parallelism

Parallelism: Model Evaluation
• Spatial ends up bottlenecked by other components
• Use custom evaluation engines
• …or GPUs

Parallelism: Matrix Solve
• Needed direct solver?
• E.g. Gaussian elimination
• Data dependence on previous reduce
  – Limited data parallelism
• Parallelism in subtracts
• Some row independence

Example Matrix

Example Matrix

Example Matrix Dataflow
Reduce to critical path: from 9 sequential operations to path of 5 operations.

Processing Element (PE)
Figure labels: Incoming Messages; Graph Nodes; Dataflow trigger; * + ÷; Graph Fanout; Outgoing Messages

Matrix Solve Only

Parallelism: Matrix Solve
• Settled on constructing dataflow graph
• Graph can be iteration independent
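The "reduce to critical path" remark on the example-matrix slide can be read as a longest-chain computation over the dataflow graph: operations that do not depend on each other can fire in parallel, so the run time is bounded by the longest dependence chain, not by the operation count. The sketch below is only an illustration. The slide's actual graph did not survive extraction, so `DEPS` is a made-up graph built to have 9 operations and a 5-operation critical path, and the names `DEPS` and `critical_path_length` are hypothetical.

```python
from functools import lru_cache

# Hypothetical dependence graph (NOT the slide's example matrix):
# each operation maps to the operations whose results it consumes.
DEPS = {
    "div1": [],
    "mul1": ["div1"],
    "mul2": ["div1"],
    "sub1": ["mul1"],
    "sub2": ["mul2"],
    "div2": ["sub1"],
    "mul3": ["div2"],
    "sub3": ["div2", "sub2"],
    "div3": ["sub2"],
}


def critical_path_length(deps):
    """Return the number of operations on the longest dependence chain."""

    @lru_cache(maxsize=None)
    def depth(op):
        # An operation can fire one step after its latest-finishing input.
        return 1 + max((depth(pred) for pred in deps[op]), default=0)

    return max(depth(op) for op in deps)


sequential_ops = len(DEPS)                  # one op at a time
parallel_steps = critical_path_length(DEPS)  # unlimited parallel operators
print(sequential_ops, parallel_steps)
```

With this assumed graph, the sequential count is 9 and the critical path is 5, so even unlimited hardware gives at most a 9/5 speedup on this fragment; the dependence structure, not the operator count, sets the limit.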
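The timing model from the Analyze slides, T = T_modeleval + T_matsolve + T_ctrl, together with the "about 2x" and "14x" remarks on the Speedup slide, can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the slides do not give the time fractions, so the values in `FRACTIONS` are chosen only to be roughly consistent with those two remarks (about half the time in model evaluation, about 1/14 in control). The function name `speedup` is illustrative.

```python
import math

# Assumed fractions of baseline run time (not stated in the slides).
FRACTIONS = {
    "modeleval": 0.50,
    "ctrl": 1 / 14,
    "matsolve": 0.50 - 1 / 14,
}


def speedup(s_modeleval=1.0, s_matsolve=1.0, fractions=FRACTIONS):
    """Overall speedup when model evaluation and matrix solve are
    accelerated by the given factors and control is left untouched."""
    baseline = sum(fractions.values())
    accelerated = (
        fractions["modeleval"] / s_modeleval
        + fractions["matsolve"] / s_matsolve
        + fractions["ctrl"]
    )
    return baseline / accelerated


# Accelerating only model evaluation, even infinitely, is capped near 2x.
print(speedup(s_modeleval=math.inf))
# Accelerating both compute phases is capped near 14x by control.
print(speedup(s_modeleval=math.inf, s_matsolve=math.inf))
# A finite 100x on model evaluation alone stays just under the 2x cap.
print(speedup(s_modeleval=100.0))
```

The design point is the one the slides make: the unaccelerated terms set the ceiling, so the choice of what to speed up first should follow the largest remaining term in the equation.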