ESE532: System-on-a-Chip Architecture
Day 1: September 2, 2020
Introduction and Overview
(lecture start 10:35am)
Note: both linked to web (also slides)
www.seas.upenn.edu/~ese532/fall2020/fall2020.html
• Preclass
• Feedback form
Penn ESE532 Fall 2020 -- DeHon

Today
• Case for Programmable SoC
• Course Goals
• Outcomes
• Evolving Course, Risks, Tools
• Sample Optimization
• This course (incl. policies, logistics)

Apple A13 Bionic
• 98mm², 7nm
• 8.5 billion transistors
• iPhone 11+
• 6 ARM cores
  – 2 fast (2.6GHz)
  – 4 low energy
• 4 custom GPUs
• Neural Engine
  – 5 trillion ops/s?

Questions
• Why do today's SoCs look like they do?
• How approach programming modern SoCs?
• How design a custom SoC?
• When building a System-on-a-Chip (SoC)
  – How much area should go into:
    • Processor cores, GPUs, FPGA logic, memory, interconnect, custom functions (which) …. ?

FPGA
Field-Programmable Gate Array
K-LUT (typical k=4 or 6)
Compute block w/ optional output Flip-Flop
ESE171, ESE150, CIS371

Case for Programmable SoC

End of uProcessor Scaling
Old
• Moore's Law scaling delivered faster transistors
• Processors rode Moore's Law
  – Turning transistors into performance
• Could wait and ride technology curve
Now
• Dennard's Law kicked in
• uP were burning more power
• Lost ability to scale down voltage
• Processor performance stalled

The Way Things Were (30 years ago)
• Wanted programmability
  – used a processor
• Wanted it a little faster
  – Next year's processor would run faster…
• Wanted high-throughput
  – used a custom IC
• Wanted product differentiation
  – Got it at the board level
  – Select which ICs and how wired
• Build a custom IC
  – It was about gates and logic

Today
• Microprocessor may not be fast enough
  – (but often it is)
  – Or low enough energy
• Single core processor scaling has ended
• Time and cost of a custom IC is too high
  – $100M's of dollars for development, years
• FPGAs promising
  – But build everything from prog. gates?
• Premium for small part count
  – And avoid chip crossing
  – ICs with Billions of Transistors

Non-Recurring Engineering (NRE) Costs
• Costs spent up front on development
  – Engineering Design Time
  – Prototypes
  – Mask costs
• Recurring Engineering
  – Costs to produce each chip

NRE Costs

NRE Cost (continued)
https://semiengineering.com/how-much-will-that-chip-cost/

Amortize NRE with Volume

Economics Forcing fewer, more customizable chips
• Economics force fewer, more customizable chips
  – Mask costs in the millions of dollars
  – Custom IC design NRE 10s-100s of millions of dollars
• Need market of billions of dollars to recoup investment
• With fixed or slowly growing total IC industry revenues
• ⇒ Number of unique chips must decrease
• Chips must be programmable

Large ICs
• Now contain significant software
  – Almost all have embedded processors
• Must co-design SW and HW
• Must solve complete computing task
  – Tasks have components with variety of needs
  – Some don't need custom circuit
  – 90/10 Rule

Given Demand for Programmable
• How do we get higher performance than a processor, while retaining programmability?

Programmable SoC
• Implementation Platform for innovation
  – This is what you target (avoid NRE)
  – Implementation vehicle

Programmable SoC
[Figure from UG1085, Xilinx UltraScale Zynq TRM (p27)]

Then and Now
30 years ago
• Programmability?
  – use a processor
• Faster
  – Processors scaled
• High-throughput
  – used a custom IC
• Wanted product differentiation
  – board level
  – Select & wired IC
• Build a custom IC
  – It was about gates and logic
Today
• Programmability?
  – uP, FPGA, GPU, PSoC
• Faster
  – Can't get with single core
• High-throughput
  – FPGA, GPU, PSoC, custom
• Wanted product differentiation
  – Program FPGAs, PSoC
• Build a custom IC
  – System and software

Course Goals, Outcomes

Goals
• Create Computer Engineers
  – SW/HW divide is wrong, outdated
  – Computer engineers understand computation
    • HW and SW are just tools and design options
  – Parallelism, data movement, resource management, abstractions
  – Cannot build a chip without software
• SoC user
  – know how to exploit
• SoC designer
  – architecture space, hw/sw codesign
• Project experience
  – design and optimization

Outcomes
• Design, optimize, and program a modern System-on-a-Chip.
• Analyze, identify bottlenecks, design-space
  – Modeling → write equations to estimate
• Decompose into parallel components
• Characterize and develop real-time solutions
• Implement both hardware and software solutions
• Formulate hardware/software tradeoffs, and perform hardware/software codesign

Roles
• PhD Qualifier
  – One broad Computer Engineering
• CMPE Concurrency
• Hands-on Project course

Evolving Course
• Spring 2017 – first offering
  – Raw, all assignments new, buggy, too tedious, long
• Fall 2017 – second offering
  – Refine assignments, project; increased explicit modeling emphasis
  – Hard, not insane
• Fall 2018 – third offering (similar 2017)
  – Added real-time ethernet data handling; project groups of 3
  – Many students challenged with C and software engineering
  – Stream debug and performance challenging
• Fall 2019 – fourth (structure same)
  – Try front-load more C, better introduce Stream optimization and debug
  – Group writeup on projects
• Fall 2020 – now
  – Move to Vitis (from SDSoC)
  – Use Amazon cloud for first half; F1 instance for FPGA access HW
  – Then transition to Ultra96 (SoC FPGA) for projects

Outcomes
• Understand the system on a chip from gates to application software, including:
  – on-chip memories and communication networks, I/O interfacing, design of accelerators, processors, firmware and OS/infrastructure software.
• Understand and estimate key design metrics and requirements including:
  – area, latency, throughput, energy, power, predictability, and reliability.
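
The NRE slides above argue that up-front cost only pays off when it is spread over enough units. A minimal sketch of that arithmetic follows. The function names, the $10 recurring cost, and the $20 selling price are hypothetical values chosen for illustration; only the order of magnitude of the NRE ($100M, within the "10s-100s of millions" range quoted on the slides) comes from the lecture.

```python
import math


def cost_per_chip(nre_dollars, recurring_cost_per_chip, volume):
    """Average cost of one chip when the NRE is spread evenly over `volume` units."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre_dollars / volume + recurring_cost_per_chip


def break_even_volume(nre_dollars, price_per_chip, recurring_cost_per_chip):
    """Smallest whole number of units whose per-unit margin covers the NRE."""
    margin = price_per_chip - recurring_cost_per_chip
    if margin <= 0:
        raise ValueError("price must exceed the recurring cost per chip")
    return math.ceil(nre_dollars / margin)


if __name__ == "__main__":
    NRE = 100_000_000   # assumed: $100M design NRE
    RECURRING = 10      # assumed: $10 to produce each chip
    PRICE = 20          # assumed: $20 selling price

    # Per-chip cost falls toward the recurring cost as volume grows.
    for volume in (100_000, 1_000_000, 10_000_000, 100_000_000):
        print(volume, cost_per_chip(NRE, RECURRING, volume))

    print("break-even units:", break_even_volume(NRE, PRICE, RECURRING))
```

Under these assumed numbers the NRE dominates the per-chip cost until volumes reach the tens of millions, which is the slides' argument for fewer, more programmable chips.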
Tools
• Are complex
• Will be challenging, but good for you to build confidence can understand and master
• Tool runtimes can be long
• Learning and sharing experience will be part of assignments

Distinction
CIS240, 371, 471, 571
• Best Effort Computing
  – Run as fast as you can
• Binary compatible
• ISA separation
• Shared memory parallelism
ESE532
• Hardware-Software codesign
  – Willing to recompile, maybe rewrite code
  – Define/refine hardware
• Real-Time
  – Guarantee meet deadline
• Non shared-memory parallelism models

Distinction
ESE680: Hardware/Software Co-Design for Machine Learning
• Deep on Application (ML)
• More accessible to CS
  – Less previous experience with circuits and architecture
• Won't be as deep on understanding HW and optimization
• Program in Pytorch, OpenCL
• First offering
ESE532: HW/SW
• Deep computer engineering
• Broad application
• Program in C
• Suitable followup if want to dig deeper
• 5th offering

Abstraction Stack
[Figure: stack of abstraction levels and the courses that cover them. Level labels: Software; Systems; Processors; Gates, Memories; Circuits/VLSI; Transistors (ESE218). Course labels: ESE680 (ML); Embedded Sys: ESE350/519; SoC Arch: ESE 532; Processor Arch: CIS 371/501 (CIS240); Mixed Signal: ESE 568 (ADC, DAC, Switched Capacitors); Digital: ESE370/570; Analog: ESE419/572 (Amplifier, Compare, Voltage/Current Ref.); Devices: ESE521.]

Abstract Approach
• Identify requirements, bottlenecks
• Decompose Parallel Opportunities
  – At extreme, how parallel could make it?
  – What forms of parallelism exist?
    • Thread-level, data parallel, instruction-level
• Design space of mapping
  – Choices of where to map, area-time tradeoffs
• Map, analyze, refine
  – Write equations to understand, predict

Approach -- Example

Example: SPICE Circuit Simulator

Example: SPICE Circuit Simulator
Matrix Solve Ax=B
A matrix, B vector, x unknown vector
Solve for x = KCL, KVL*
Linear Algebra solving n eqns in n unknowns.
* Kirchhoff {Current,Voltage} Laws
Example: Kapre+DeHon, TRCAD 2012

Analyze

Analyze
• $T = T_{modeleval} + T_{matsolve} + T_{ctrl}$

Analyze
• $T = T_{modeleval} + T_{matsolve} + T_{ctrl}$
• What should we speedup first?
• What happens if only speedup modeleval?
  – $T = T_{matsolve} + T_{modeleval}/S + T_{ctrl}$

Speedup
• If only accelerated model evaluation, only about 2x speedup
• If want better than 14x speedup, must also attack control

Model Evaluation: Trivial Parallelism
$$I_{D1} = I_s \times \left(e^{V_{D1}/v_j} - 1\right)$$
$$G_{D1} = \frac{d}{dV_{D1}}\left(I_{D1}\right) = I_s \times e^{V_{D1}/v_j} \times \frac{1}{v_j}$$
[Figure: dataflow graph of the model evaluation, with *, ÷, - and e^x operator nodes and values labeled a through f]

Spatial Hardware Implementation
• Every operation (*, +, /) gets dedicated hardware.
• Implement task in space → use additional area for each operator.
• Parallel – all operations occur simultaneously.
[Figure: the same dataflow graph, one hardware operator per node]

Verilog-AMS as Domain-Specific
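
Returning to the speedup analysis of T = Tmodeleval + Tmatsolve + Tctrl: the effect of accelerating only some components can be checked with a few lines of arithmetic. The sketch below is illustrative only. The runtime fractions (50% model evaluation, 43% matrix solve, 7% control) are assumptions, not measurements from the lecture; they were picked so that the results line up with the "about 2x" and "14x" figures on the Speedup slide.

```python
def overall_speedup(fractions, accel):
    """Overall speedup when each component's time is divided by its own factor.

    fractions: dict mapping component name -> share of the original runtime.
    accel:     dict mapping component name -> speedup factor S for that
               component; components not listed are left unaccelerated (S=1).
    """
    old_time = sum(fractions.values())
    new_time = sum(f / accel.get(name, 1.0) for name, f in fractions.items())
    return old_time / new_time


# Assumed runtime breakdown (hypothetical, chosen to match the slide's figures).
FRACTIONS = {"modeleval": 0.50, "matsolve": 0.43, "ctrl": 0.07}
INF = float("inf")  # models an arbitrarily large acceleration

if __name__ == "__main__":
    # Accelerate only model evaluation: T = Tmatsolve + Tmodeleval/S + Tctrl
    for s in (2, 10, 100, INF):
        print("modeleval S =", s, "->", overall_speedup(FRACTIONS, {"modeleval": s}))

    # Accelerate model evaluation and matrix solve; control is what remains.
    both = overall_speedup(FRACTIONS, {"modeleval": INF, "matsolve": INF})
    print("modeleval + matsolve unbounded ->", both)
```

With these assumed fractions, even an unbounded speedup of model evaluation leaves the total at about 2x, and removing matrix solve as well is capped near 14x by the control time, which is why the slide says control must also be attacked to go further.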