ESE532: System-on-a-Chip Architecture Today • Case for Programmable SoC Day 1: September 2, 2020 • Course Goals Introduction and Overview • Outcomes (lecture start 10:35am) • Evovling Course, Risks, Tools • Sample Optimization Note: both linked to web (also slides) • This course (incl. policies, logistics) www.seas.upenn.edu/~ese532/fall2020/fall2020.html • Preclass • Feedback form 1 2 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Apple A13 Bionic Questions

• 98mm2, 7nm • Why do today’s SoC look like they do? • 8.5 Billion Tr. • How approach programming modern SoCs? • iPhone 11 + • How design a custom SoC? • 6 ARM cores • When building a System-on-a-Chip (SoC) – 2 fast (2.6GHz) – How much area should go into: – 4 low energy • Processor cores, GPUs, • 4 custom GPUs FPGA logic, memory, interconnect, • Neural Engine custom functions (which) …. ? – 5 Trillion ops/s?

3 4 Penn ESE532 Fall 2020-- DeHon Penn ESE532 Fall 2020 -- DeHon

FPGA Field-Programmable Gate Array

K-LUT (typical k=4 or 6) Compute block Case for Programmable SoC w/ optional output Flip-Flop

ESE171, ESE150, CIS371 5 6 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

1 End of uProcessor Scaling

Old • Moore’s Law scaling Now delivered faster • Dennard’s Law transistors kicked in • Processors rode • uP were burning Moore’s Law more power – Turning transistors • Lost ability to scale into performance down voltage • Could wait and ride • Processor technology curve performance stalled 7 8 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

The Way things Were Today 30 years ago • Microprocessor may not be fast enough • Wanted programmability – (but often it is) – used a processor – Or low enough energy • Wanted it a little faster • Single core processor scaling has ended – Next year’s processor would run faster… • Time and Cost of a custom IC is too high • Wanted high-throughput – $100M’s of dollars for development, Years – used a custom IC • Wanted product differentiation • FPGAs promising – Got it at the board level – But build everything from prog. gates? – Select which ICs and how wired • Premium for small part count • Build a custom IC – And avoid chip crossing – It was about gates and logic 9 10 Penn ESE532 Fall 2020 -- DeHon Penn ESE532– FallICs 2020 --withDeHon Billions of Transistors

Non-Recurring Engineering NRE Costs (NRE) Costs • Costs spent up front on development – Engineering Design Time – Prototypes – Mask costs • Recurring Engineering – Costs to produce each chip

11 12 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

2 NRE Cost (continued) Amortize NRE with Volume

13 14 Penn ESE532 Fall 2020 -- DeHon https://semiengineering.com/how-much-will-that-chip-cost/ Penn ESE532 Fall 2020 -- DeHon

Economics Large ICs Forcing fewer, more • Now contain significant software customizable – Almost all have embedded processors chips • Must co-design SW and HW • Economics force fewer, more customizable chips • Must solve complete computing task – Mask costs in the millions of dollars – Tasks has components with variety of needs – Custom IC design NRE 10s—100s of millions of dollars • Need market of billions of dollars to recoup investment – Some don’t need custom circuit • With fixed or slowly growing total IC industry revenues – 90/10 Rule • è Number of unique chips must decrease • Chips must be programmable 15 16 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Given Demand for Programmable SoC Programmable • Implementation Platform for innovation – This is what you target (avoid NRE) • How do we get higher performance than – Implementation a processor, while retaining vehicle programmability?

17 18 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

3 Programmable SoC Then and Now 30 years ago Today • Programmability? • Programmability? – use a processor – uP, FPGA, GPU, PSoC • Faster • Faster – Processors scaled UG1085 • Can’t get with single • High-throughput core – used a custom IC UltraScale • High-throughput • Wanted product Zynq – FPGA, GPU, PSoC, differentiation TRM custom – board level (p27) • Wanted product – Select & wired IC differentiation • Build a custom IC – Program FPGAs, PSoC – It was about gates and • Build a custom IC logic 19 – System and software 20 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Goals • Create Computer Engineers – SW/HW divide is wrong, outdated – Computer engineers understand computation Course • HW and SW are just tools and design options Goals, Outcomes – Parallelism, data movement, resource management, abstractions – Cannot build a chip without software • SoC user – know how to exploit • SoC designer – architecture space, hw/sw codesign

21 • Project experience – design and optimization22 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Outcomes Roles • Design, optimize, and program a modern System-on-a-Chip. • PhD Qualifier • Analyze, identify bottlenecks, design-space – One broad Computer Engineering – Modeling à write equations to estimate • CMPE Concurrency • Decompose into parallel components • Hands-on Project course • Characterize and develop real-time solutions • Implement both hardware and software solutions • Formulate hardware/software tradeoffs, and 23 perform hardware/software codesign 24 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

4 Evolving Course Outcomes • Spring 2017 – first offering – Raw, all assignments new, buggy, too tedious, long • Fall 2017 – second offering • Understand the from – Refine assignments, project; increased explicit modeling emphasis gates to application software, including: – Hard, not insane – on-chip memories and communication • Fall 2018 – third offering (similar 2017) – Added real-time ethernet data handling; project groups of 3 networks, I/O interfacing, design of – Many students challenged with C and software engineering accelerators, processors, firmware and – Stream debug and performance challenging OS/infrastructure software. • Fall 2019 – fourth (structure same) – Try front-load more C, better introduce Stream optimization and debug • Understand and estimate key design – Group writeup on projects metrics and requirements including: • Fall 2020 – now – Move to Vitis (from SDSoC) – area, latency, throughput, energy, power, – Use Amazon cloud for first half; F1 instance for FPGA access HW – Then transition to Ultra96 (SoC FPGA) for projects predictability, and reliability. 25 26 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Tools Distinction

• Are complex CIS240, 371, 471, 571 ESE532 • Will be challenging, but good for you to • Best Effort Computing • Hardware-Software build confidence can understand and – Run as fast as you can codesign • Binary compatible – Willing to recompile, maybe master rewrite code • ISA separation – Define/refine hardware • Tool runtimes can be long • Shared memory • Real-Time parallelism • Learning and sharing experience will be – Guarantee meet deadline part of assignments • Non shared-memory parallelism models

27 28 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Distinction Abstraction Stack ESE680: Software ESE680 Embedded Sys: Hardware/Software Co- ESE532: HW/SW ESE350/519 Design for Machine • Deep computer Systems ML SoC Arch: Learning engineering ESE 532 • Deep on Application (ML) • Broad application • More accessible to CS Processor Arch: Mixed Signal: ADC, DAC • Program in C Processors – Less previous experience CIS 371/501 ESE 568 Switched Capacitors with circuits and • Suitable followup if want (CIS240) architecture to dig deeper Gates, Memories Digital: Analog: Amplifier, Compare th • Won’t be as deep on • 5 offering Circuits/VLSI ESE370/570 ESE419/572 Voltage/Current Ref. understanding HW and optimization Devices: ESE521 Transistors (ESE218) • Program in Pytorch, OpenCL 29 30 Penn ESE532• First Fall 2020offerin -- DeHong Penn ESE532 Fall 2020 -- DeHon

5 Abstract Approach

• Identify requirements, bottlenecks • Decompose Parallel Opportunities Approach -- Example – At extreme, how parallel could make it? – What forms of parallelism exist? • Thread-level, data parallel, instruction-level • Design space of mapping – Choices of where to map, area-time tradeoffs • Map, analyze, refine – Write equations to understand, predict 31 32 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Example SPICE Circuit Example: SPICE Circuit Simulator

Simulator Matrix Solve Ax=B A matrix B vector x unknown vector Solve for x = KCL, KVL *

Linear Algebra solving n eqns in n unknowns. Example: Kapre+DeHon, TRCAD 2012 33 34 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon * Kirchhoff {Current,Voltage} Laws

Analyze Analyze

• T=Tmodeleval+Tmatsolve+Tctrl

35 36 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

6 Speedup Analyze

• If only accelerated model evaluation only about 2x speedup • If want better than 14x speed, must also attack control

• T=Tmodeleval+Tmatsolve+Tctrl • What should we speedup first? • What happens if only speedup modeleval?

– T=Tmatsolve+(Tmodeleval)/S+Tctrl 37 38 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Model Evaluation: Trivial Spatial Parallelism Hardware Implementation •Every operation (*, + /) gets dedicated hardware. VD1 / vj I D1 = I s ´(e -1) •Implement task in space à use additional area for each operator. d VD1 / vj 1 GD1 = dV (I D1) = I s ´e ´ •Parallel – all operations occur simultaneously. D1 vj

* * ÷ * * ÷ e f b e f b - * ÷ - * ÷ d c a d c a ex ex 39 40 Penn ESE532 Fall 2020 -- DeHonVerilog-AMS as Domain-Specific Language Penn ESE532 Fall 2020 -- DeHon

Parallelism: Model Evaluation Spatial Too Big? Data Parallel Fully spatial Custom VLIW • Every device circuit independent • Many of each type ÷ * e * f * b of device ÷ - ÷ + • Can evaluate in d * c a parallel ex x – T=Tseq/Nproc e • Build pipelined ~100x Speedup ~10x Speedup circuit for model Multiple FPGAs – Tseq=Ncomp*Tcycle 1 FPGA VLIW=Very Long Instruction Word vs. Tpipe=Tcycle 41 42 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon exploits Instruction-Level Parallelism

7 Parallelism: Model Evaluation Parallelism: Matrix Solve

• Spatial end up • Use custom • Needed direct solver? bottlenecked by evaluation engines • E.g. Gaussian other components • …or GPUs elimination • Data dependence on previous reduce – Limited data parallelism • Parallelism in subtracts • Some row 43 independence 44 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Example Matrix Example Matrix

45 46 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Example Matrix Dataflow Processing Element (PE)

Graph Incoming Nodes Messages

Dataflow trigger * + ÷

Graph Outgoing Reduce to critical path: Fanout Messages from 9 sequential operations to path of 5 operations. 47 48 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

8 Parallelism: Matrix Solve Matrix Solve Only

• Settled on constructing dataflow graph • Graph can be iteration independent – Statically scheduled – (cheaper) ~2.4 x mean • This is bottleneck to further acceleration 49 50 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Parallelism Controller? Parallelism Controller

• Could leave • Customized sequential datapath controller • For some designs, becomes the bottleneck once T =N +N +10*N others accelerated seqctrl add mul divide • Has internal parallelism in Tvliwctrl=Max(Nadd/2,Nmul,10*Ndivide) condition evaluation T=Tmodeleval/S1+(Tmatsolve)/S2+Tctrl 51 52 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Single-Chip Solution Area-Time for Each

[1 Slice = 4 LUTs] 53 54 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

9 Composite Speedup Modern SoC

55 56 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Class Components • Lecture (incl. preclass exercise) – Slides on web before class • (you can print if want a follow-along copy) – N.B. I encourage class participation Class Components • Attend Synch recording; Questions (“warm” calls) • Reading [~1 required paper/lecture] – online: Canvas, IEEE, ACM, also ZynqBook, Parallel Programming for FPGAs • Homework – (1 per week due F5pm Eastern) • Project – open-ended (~6 weeks) 57 58 Penn ESE532 Fall 2020 -- DeHon Penn• ESE532Note Fall syllabus,2020 -- DeHon course admin online

First Half Second Half

Quickly cover breadth • Use everything on • Real-time • Metrics, bottlenecks Line up with project • Reactive • Memory homeworks • Schedule more • Memory • Parallel models tentative • Networking • SIMD/Data Parallel – Adjust as experience and project demands • Energy • Thread-level • Going deeper • Scaling parallelism • Chip Cost • Spatial, C-to-gates • Verification 59 60 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

10 Teaming Office & Lab Hours

• HW in Groups of 2 • Andre: T 4:15pm—5:30pm Zoom • HW: we assign – See canvas • Individual assignment writeup • Syed: • Project in Groups of 3 – First: Thursday (tomorrow) 10am • Project: you propose, we review – Make sure learning-from up to date – Make sure on piazza will poll – Most portions group writeup – Few components individual writeup

61 62 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

C Review Preclass Exercise

• Course will rely heavily on C • Motivate the topic of the day – Program both hardware and software in C – Introduce a problem • HW1 has some C warmup problems – Introduce a design space, tradeoff, transform • Syed will hold C review • Available before lecture – on Sept. 8th, 5:30pm – (only available for 24-48 hours; download) – (before our class meeting since – Should work before lecture starts Monday 9/7 is Labor day) • Do use calculator

63 – Will be numerical examples 64 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Synchronous Recording Feedback Timeline • Will have anonymous feedback google • Start Zoom before 10:30am forms for each lecture • Start lecture at 10:35am – Clarity? • Lecture until 11:55am – Speed? • (most days) stay for remaining – Vocabulary? questions – General comments • Post lecture to canvas later in day • Linked on syllabus • Todays:

65 https://forms.gle/qMuup5pLkwtp187x5 66 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

11 Policies • Your action: Admin • Canvas turn-in of assignments – Find course web page • Read it, including the policies • No handwritten work • Find Syllabus • Due on time – Find homework 1 – Individual assignments only – Find lecture slides • 3 free late days total » Will try to post before lecture • Collaboration – Find reading assignments – Find reading for lecture 2 on canvas and web – Tools – allowed • …for this lecture if you haven’t already – Designs – limited to project teams as – Find/join piazza group for course specified on assignments

• See web page 67 68 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 Fall 2020 -- DeHon

Logistics Big Ideas

• Will need SD Card writer for HW7+ • Programmable Platforms – (can get $<10 on amazon.com) – Key delivery vehicle for innovative computing applications – Reduce TTM (Time-to-Market), risk – More than a microprocessor – Heterogeneous, parallel • Demand hardware-software codesign – Soft view of hardware 69 70 Penn ESE532 Fall 2020 -- DeHon Penn ESE532 –FallResource 2020 -- DeHon -aware view of parallelism

Questions?

71 Penn ESE532 Fall 2020 -- DeHon

12