ESE532: System-on-a-Chip Architecture
Day 1: August 28, 2019
Introduction and Overview
Everyone grab:
• Preclass
• Feedback Sheet (1/2 page)
Penn ESE532 Fall 2019 -- DeHon

Today
• Case for Programmable SoC
• Course Goals
• Outcomes
• New/evolving Course, Risks, Tools
• Sample Optimization
• This course (incl. policies, logistics)

Apple A12 Bionic
• 84 mm², 7nm
• 7 Billion Tr.
• iPhone XS, XR
  – iPad 2019
• 6 ARM cores
  – 2 fast
  – 4 low energy
• 4 custom GPUs
• Neural Engine
  – 5 Trillion ops/s?

Questions
• Why do today’s SoC look like they do?
• How approach programming modern SoCs?
• How design a custom SoC?
• When building a System-on-a-Chip (SoC)
  – How much area should go into:
    • Processor cores, GPUs, FPGA logic, memory, interconnect, custom functions (which) …. ?

FPGA
Field-Programmable Gate Array
• K-LUT (typical k=4) Compute block w/ optional output Flip-Flop
• ESE171, ESE150, CIS371

Case for Programmable SoC
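The K-LUT compute block named on the FPGA slide can be modeled in a few lines. The sketch below is an illustration only: the class name `KLut`, the helper `program_lut`, and the bit ordering (first input is the most significant index bit) are assumptions made for this example, not part of the slides or of any vendor tool.

```python
from itertools import product


class KLut:
    """Hypothetical model of a K-input lookup table with an optional flip-flop."""

    def __init__(self, k, truth_table, registered=False):
        if len(truth_table) != 2 ** k:
            raise ValueError("truth table needs 2**k entries")
        self.k = k
        self.table = list(truth_table)
        self.registered = registered  # True = use the optional output flip-flop
        self.ff = 0                   # flip-flop state, assumed to reset to 0

    def lookup(self, inputs):
        """Combinational output: the input bits form an index into the table."""
        if len(inputs) != self.k:
            raise ValueError("expected %d inputs" % self.k)
        index = 0
        for bit in inputs:            # inputs[0] is the most significant bit
            index = (index << 1) | (bit & 1)
        return self.table[index]

    def clock(self, inputs):
        """One clock edge: a registered LUT returns the previously stored value."""
        value = self.lookup(inputs)
        if not self.registered:
            return value
        out, self.ff = self.ff, value
        return out


def program_lut(k, func, registered=False):
    """Fill the table so the LUT implements an arbitrary k-input Boolean function."""
    table = [func(*bits) & 1 for bits in product((0, 1), repeat=k)]
    return KLut(k, table, registered)


# A 4-LUT programmed as 4-input parity (XOR of all inputs).
parity = program_lut(4, lambda a, b, c, d: a ^ b ^ c ^ d)
```

Because the table has 2^k entries, the same block can be "programmed" to any k-input function simply by changing its contents, which is the sense in which FPGA logic is programmable.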

The Way things Were
25 years ago
• Wanted programmability
  – used a processor
• Wanted high-throughput
  – used a custom IC
• Wanted product differentiation
  – Got it at the board level
  – Select which ICs and how wired
• Build a custom IC
  – It was about gates and logic

Today
• Microprocessor may not be fast enough
  – (but often it is)
  – Or low enough energy
• Time and Cost of a custom IC is too high
  – $100M’s of dollars for development, Years
• FPGAs promising
  – But build everything from prog. gates?
• Premium for small part count
  – And avoid chip crossing
  – ICs with Billions of Transistors

Non-Recurring Engineering (NRE) Costs
• Costs spent up front on development
  – Engineering Design Time
  – Prototypes
  – Mask costs
• Recurring Engineering
  – Costs to produce each chip

NRE Costs

NRE Cost (continued)

Amortize NRE with Volume

https://semiengineering.com/how-much-will-that-chip-cost/
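Amortizing NRE with volume can be written as a one-line cost model: each chip carries its recurring cost plus an equal share of the up-front NRE. The sketch below is a simplification under stated assumptions; the function names and all dollar figures in the usage note are hypothetical and are not taken from the slides or the linked article.

```python
def cost_per_chip(nre, unit_cost, volume):
    """Per-chip cost when the up-front NRE is spread evenly over `volume` chips."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre / volume + unit_cost


def break_even_volume(nre_custom, unit_custom, unit_programmable):
    """Volume above which a custom chip beats a programmable part.

    Assumes the programmable part has no NRE for the buyer and a higher
    recurring cost per chip. Returns None if the custom chip never wins.
    """
    if unit_programmable <= unit_custom:
        return None
    return nre_custom / (unit_programmable - unit_custom)
```

With made-up numbers, `cost_per_chip(100e6, 10, 1e6)` gives 110.0 dollars per chip, and `break_even_volume(100e6, 10, 50)` gives 2.5 million units, which illustrates why a large NRE only pays off in a very large market.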

Economics Forcing fewer, more customizable chips
• Economics force fewer, more customizable chips
  – Mask costs in the millions of dollars
  – Custom IC design NRE 10s to 100s of millions of dollars
• Need market of billions of dollars to recoup investment
• With fixed or slowly growing total IC industry revenues
• → Number of unique chips must decrease

Large ICs
• Now contain significant software
  – Almost all have embedded processors
• Must co-design SW and HW
• Must solve complete computing task
  – Task has components with variety of needs
  – Some don’t need custom circuit
  – 90/10 Rule
• Chips must be programmable

Given Demand for Programmable
• How do we get higher performance than a processor, while retaining programmability?

Programmable SoC
• Implementation Platform for innovation
  – This is what you target (avoid NRE)
  – Implementation vehicle

Programmable SoC
[Figure: UG1085 Zynq UltraScale TRM (p27)]

Then and Now
25 years ago
• Programmability?
  – use a processor
• High-throughput
  – used a custom IC
• Wanted product differentiation
  – board level
  – Select & wired IC
• Build a custom IC
  – It was about gates and logic

Today
• Programmability?
  – uP, FPGA, GPU, PSoC
• High-throughput
  – FPGA, GPU, PSoC, custom
• Wanted product differentiation
  – Program FPGAs, PSoC
• Build a custom IC
  – System and software

Course Goals, Outcomes

Goals
• Create Computer Engineers
  – SW/HW divide is wrong, outdated
  – Parallelism, data movement, resource management, abstractions
  – Cannot build a chip without software
• SoC user – know how to exploit
• SoC designer – architecture space, hw/sw codesign
• Project experience – design and optimization

Outcomes
• Design, optimize, and program a modern System-on-a-Chip.
• Analyze, identify bottlenecks, design-space
  – Modeling → write equations to estimate
• Decompose into parallel components
• Characterize and develop real-time solutions
• Implement both hardware and software solutions
• Formulate hardware/software tradeoffs, and perform hardware/software codesign

Roles
• PhD Qualifier
  – One broad Computer Engineering
• CMPE Concurrency
• Hands-on Project course

Outcomes
• Understand the from gates to application software, including:
  – on-chip memories and communication networks, I/O interfacing, design of accelerators, processors, firmware and OS/infrastructure software.
• Understand and estimate key design metrics and requirements including:
  – area, latency, throughput, energy, power, predictability, and reliability.

New and Evolving Course
• Spring 2017 – first offering
  – Raw, all assignments new … some buggy
  – Assignments too tedious, long
• Fall 2017 – second offering
  – Refine assignments, project
  – Increased explicit modeling emphasis
  – Hard, not insane
• Fall 2018 – third offering
  – Not much different from 2017
  – Added real-time ethernet data handling; project groups of 3
  – Many students challenged with and software engineering
  – Stream debug and performance challenging
• Fall 2019 – now
  – Basic structure remains same
  – Try front-load more C
  – Try better introduce Stream optimization and debug
  – Group writeup on projects

Tools
• Are complex
• Will be challenging, but good for you to build confidence can understand and master
• Tool runtimes can be long
• Learning and sharing experience will be part of assignments

Distinction
CIS240, 371, 501
• Best Effort Computing
  – Run as fast as you can
• Binary compatible
• ISA separation
• Shared memory parallelism

ESE532
• Hardware-Software codesign
  – Willing to recompile, maybe rewrite code
  – Define/refine hardware
• Real-Time
  – Guarantee meet deadline
• Non shared-memory parallelism models

Abstraction Stack
[Figure: abstraction layers and the courses that cover them]
• Software, Systems: Embedded Sys: ESE350/519; SoC Arch: ESE 532
• Processors: Processor Arch: CIS 371/501 (CIS240)
• Mixed Signal (ADC, DAC, Switched Capacitors): ESE 568
• Circuits/VLSI
  – Digital (Gates, Memories): ESE370/570
  – Analog (Amplifier, Compare, Voltage/Current Ref.): ESE419/572
• Devices (Transistors): ESE521 (ESE218)

Approach -- Example

Abstract Approach
• Identify requirements, bottlenecks
• Decompose Parallel Opportunities
  – At extreme, how parallel could make it?
  – What forms of parallelism exist?
    • Thread-level, data parallel, instruction-level
• Design space of mapping
  – Choices of where to map, area-time tradeoffs
• Map, analyze, refine
  – Write equations to understand, predict

SPICE Circuit Simulator
• Matrix Solve Ax=B
  – A matrix
  – B vector
  – x unknown vector
  – Solve for x
• Linear Algebra solving n eqns in n unknowns.
• Example: Kapre+DeHon, TRCAD 2012

Analyze
• T=Tmodeleval+Tmatsolve+Tctrl

Analyze
• T=Tmodeleval+Tmatsolve+Tctrl
• What should we speedup first?
• What happens if only speedup modeleval?
  – T=Tmatsolve+(Tmodeleval)/S+Tctrl

Speedup
• If only accelerate model evaluation, only about 2x speedup
• If want better than 14x speedup, must also attack control
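The runtime equation T=Tmatsolve+(Tmodeleval)/S+Tctrl can be evaluated directly to see why accelerating one component alone gives a bounded speedup. The sketch below is illustrative only: the runtime fractions (55% model evaluation, 38% matrix solve, 7% control) are hypothetical values picked so the limits land near the slide's 2x and 14x figures; they are not measurements from the lecture or the cited paper.

```python
def total_time(t_modeleval, t_matsolve, t_ctrl, s_modeleval=1.0, s_matsolve=1.0):
    """T = Tmodeleval/S1 + Tmatsolve/S2 + Tctrl (control is left unaccelerated)."""
    return t_modeleval / s_modeleval + t_matsolve / s_matsolve + t_ctrl


def speedup(t_modeleval, t_matsolve, t_ctrl, s_modeleval=1.0, s_matsolve=1.0):
    """Overall speedup relative to the unaccelerated runtime."""
    baseline = total_time(t_modeleval, t_matsolve, t_ctrl)
    accelerated = total_time(t_modeleval, t_matsolve, t_ctrl,
                             s_modeleval, s_matsolve)
    return baseline / accelerated


# Hypothetical breakdown of one second of simulation time.
FRACTIONS = dict(t_modeleval=0.55, t_matsolve=0.38, t_ctrl=0.07)

# Accelerating only model evaluation by 100x leaves matsolve and ctrl untouched.
only_modeleval = speedup(s_modeleval=100.0, **FRACTIONS)

# Accelerating both model evaluation and matrix solve by a huge factor
# leaves the sequential control time as the limit.
both_accelerated = speedup(s_modeleval=1e9, s_matsolve=1e9, **FRACTIONS)
```

Under these assumed fractions, `only_modeleval` stays a little above 2 no matter how large S becomes, and `both_accelerated` approaches 1/0.07, so any further gain has to come from the controller.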

Model Evaluation: Trivial Parallelism
$I_{D1} = I_s \times (e^{V_{D1}/v_j} - 1)$
$G_{D1} = \frac{d}{dV_{D1}}(I_{D1}) = I_s \times e^{V_{D1}/v_j} \times \frac{1}{v_j}$
[Figure: dataflow graph of the model's *, -, ÷, and e^x operators over inputs a-f]
Verilog-AMS as Domain-Specific Language

Spatial Hardware Implementation
• Every operation (*, +, /) gets dedicated hardware.
• Implement task in space → use additional area for each operator.
• Parallel – all operations occur simultaneously.
[Figure: the same dataflow graph, one hardware operator per node]
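The diode model on the Model Evaluation slide is a small, fixed set of arithmetic operations, which is what makes a dedicated operator per node practical. A minimal software sketch of the same evaluation follows; the function name `diode_model` and the default values for `i_s` and `v_j` are assumptions chosen for illustration, not values from the slides.

```python
import math


def diode_model(v_d, i_s=1e-14, v_j=0.025):
    """Evaluate diode current and small-signal conductance at voltage v_d.

    I = i_s * (exp(v_d / v_j) - 1)
    G = dI/dV = i_s * exp(v_d / v_j) / v_j
    """
    exp_term = math.exp(v_d / v_j)   # shared by both outputs, computed once
    i_d = i_s * (exp_term - 1.0)
    g_d = i_s * exp_term / v_j
    return i_d, g_d
```

The single `exp_term` feeds both outputs, much like one exponential node fanning out to the current and conductance paths in a dataflow graph; in a spatial implementation the divide, exponential, subtract, and multiplies would each be separate hardware operators.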

Parallelism: Model Evaluation
Data Parallel
• Every device circuit independent
• Many of each type of device
• Can evaluate in parallel
  – T=Tseq/Nproc
• Build pipelined circuit for model
  – Tseq=Ncomp*Tcycle vs. Tpipe=Tcycle

Spatial Too Big?
• Fully spatial
  – ~100x Speedup
  – Multiple FPGAs
• Custom VLIW
  – ~10x Speedup
  – 1 FPGA
• VLIW=Very Long Instruction Word, exploits Instruction-Level Parallelism
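The timing relations from the Data Parallel slide (T=Tseq/Nproc, Tseq=Ncomp*Tcycle, Tpipe=Tcycle) can be turned into a small estimator. This is an idealized sketch: it ignores communication and load imbalance, and the pipeline-fill term in `pipelined_batch_time` (a pipeline depth equal to `n_comp` stages) is an assumption added here, not something stated on the slide.

```python
def data_parallel_time(t_seq, n_proc):
    """Ideal data-parallel time: independent devices split evenly over n_proc."""
    return t_seq / n_proc


def sequential_eval_time(n_comp, t_cycle):
    """Time for one model evaluation when its n_comp operations run one per cycle."""
    return n_comp * t_cycle


def pipelined_batch_time(n_items, n_comp, t_cycle):
    """Time to evaluate n_items models on a fully pipelined circuit.

    Assumes an n_comp-stage pipeline: n_comp cycles until the first result,
    then one new result every cycle (the steady-state Tpipe = Tcycle).
    """
    if n_items <= 0:
        return 0
    return (n_comp + n_items - 1) * t_cycle
```

For example, with 20 operations per model and 1000 devices, the sequential estimate is 20000 cycles while the pipelined estimate is 1019 cycles, so for large batches the per-device cost approaches one cycle.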

Parallelism: Model Evaluation
• Spatial end up bottlenecked by other components
• Use custom evaluation engines
• …or GPUs

Parallelism: Matrix Solve
• Needed direct solver?
• E.g. Gaussian elimination
• Data dependence on previous reduce
  – Limited data parallelism
• Parallelism in subtracts
• Some row independence
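The Matrix Solve points about dependence and row independence are easier to see in code. Below is a minimal dense Gaussian elimination sketch with partial pivoting; it is an illustration only, since a circuit simulator would normally use a sparse solver, and the function name `gaussian_solve` is an assumption for this example.

```python
def gaussian_solve(a, b):
    """Solve A x = b for a small dense system by Gaussian elimination."""
    n = len(a)
    # Augmented matrix [A | b], copied so the caller's data is untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]

        # Each step depends on the previous reduction (outer loop is sequential),
        # but the row updates below are independent of each other, and so are
        # the subtracts within a row: this is where the parallelism lives.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution, again a chain of dependent steps.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        acc = m[row][n] - sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = acc / m[row][row]
    return x
```

For instance, `gaussian_solve([[2, 1], [1, 3]], [3, 5])` solves 2x+y=3, x+3y=5 and returns values close to x=0.8, y=1.4.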

Example Matrix
• Reduce to critical path: from 9 sequential operations to path of 5 operations.

Dataflow Processing Element (PE)
[Figure: PE block diagram; labels include graph nodes, graph fanout, incoming and outgoing messages, dataflow trigger, and *, +, ÷ operators]
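The critical-path reduction noted for the example matrix (9 sequential operations down to a path of 5) is the longest dependence chain in the dataflow graph. The sketch below computes that length for a made-up graph; `EXAMPLE_GRAPH` is a hypothetical 9-operation graph constructed to have a 5-operation critical path and is not the graph from the slide.

```python
from functools import lru_cache


def critical_path_length(deps):
    """Longest dependence chain, counted in operations.

    `deps` maps each operation name to the operations whose results it needs.
    The graph is assumed to be acyclic.
    """
    @lru_cache(maxsize=None)
    def depth(node):
        # An operation can fire one step after its latest-arriving input.
        return 1 + max((depth(pred) for pred in deps[node]), default=0)

    return max((depth(node) for node in deps), default=0)


# Hypothetical dataflow graph: 9 operations, longest chain of 5.
EXAMPLE_GRAPH = {
    "n1": [],
    "n2": [],
    "n3": ["n1"],
    "n4": ["n2"],
    "n5": ["n3", "n4"],
    "n6": ["n3"],
    "n7": ["n5"],
    "n8": ["n6"],
    "n9": ["n7", "n8"],
}
```

Run sequentially, this graph costs `len(EXAMPLE_GRAPH)` steps; with enough processing elements firing each node as soon as its inputs arrive, it costs `critical_path_length(EXAMPLE_GRAPH)` steps.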

Parallelism: Matrix Solve
• Settled on constructing dataflow graph
• Graph can be iteration independent
  – Statically scheduled
  – (cheaper)

Matrix Solve Only
• ~2.4x mean
• This is bottleneck to further acceleration

Parallelism Controller?
• Could leave sequential
• For some designs, becomes the bottleneck once others accelerated
• Has internal parallelism in condition evaluation
• T=Tmodeleval/S1+(Tmatsolve)/S2+Tctrl

Parallelism Controller
• Customized datapath controller
  – Tseqctrl=Nadd+Nmul+10*Ndivide
  – Tvliwctrl=Max(Nadd/2,Nmul,10*Ndivide)
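The two controller estimates, Tseqctrl=Nadd+Nmul+10*Ndivide and Tvliwctrl=Max(Nadd/2,Nmul,10*Ndivide), can be compared numerically. In the sketch below the formulas come from the slide, but the reading of them (two adders, one multiplier, and one 10-cycle divider running concurrently) and the operation counts in the usage note are assumptions made for illustration.

```python
def t_seq_ctrl(n_add, n_mul, n_divide):
    """Sequential controller: one operation at a time, divide costs 10 cycles."""
    return n_add + n_mul + 10 * n_divide


def t_vliw_ctrl(n_add, n_mul, n_divide):
    """Customized VLIW datapath: time is set by the busiest functional unit.

    The n_add / 2 term is read here as two adders sharing the additions.
    Dependences between operations are ignored, so this is a lower bound.
    """
    return max(n_add / 2, n_mul, 10 * n_divide)
```

With hypothetical counts of 40 adds, 20 multiplies, and 3 divides, the sequential estimate is 90 cycles and the VLIW estimate is 30 cycles, with the divider as the limiting unit.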

Single-Chip Solution

Area-Time for Each

Composite Speedup

Modern SoC

Class Components

Class Components
• Lecture (incl. preclass exercise)
  – Slides on web before class
    • (you can print if want a follow-along copy)
  – N.B. I will encourage (force) class participation
    • Questions (“warm” calls)
• Reading [~1 required paper/lecture]
  – online: Canvas, IEEE, ACM, also ZynqBook, Parallel Programming for FPGAs
• Homework
  – (1 per week due F5pm)
• Project – open-ended (~6 weeks)
• Note syllabus, course admin online

First Half
Quickly cover breadth
• Metrics, bottlenecks
• Memory
• Parallel models
• SIMD/Data Parallel
• Thread-level parallelism
• Spatial, C-to-gates
• Real-time
• Reactive
Line up with homeworks

Second Half
• Use everything on project
• Schedule more tentative
  – Adjust as experience and project demands
• Going deeper
• Memory
• Networking
• Energy
• Scaling
• Chip Cost
• Verification
• Defect + Fault tolerance

Teaming
• HW in Groups of 2
• HW: we assign
• Individual assignment writeup
• Project in Groups of 3
• Project: you propose, we review
  – Most portions group writeup
  – Few components individual writeup

Office & Lab Hours
• Andre: T 4:15pm-5:30pm Levine 270
• Yuanlong and Eric:
  – Tuesday 10am-12pm in Ketterer
  – Tuesday 8pm-10pm in Ketterer
  – Thursday 6pm-8pm in Detkin
  – Start tomorrow 8/29

C Review
• Course will rely heavily on C
  – Program both hardware and software in C
• HW1 has some C warmup problems
• TA will hold C review
  – Ketterer on Sept. 3rd at 8pm
  – (before our class meeting since Monday 9/2 is Labor day)
  – Watch piazza for details

Preclass Exercise
• Motivate the topic of the day
  – Introduce a problem
  – Introduce a design space, tradeoff, transform
• Work for ~5 minutes before start lecturing
• Do bring calculator to class
  – Will be numerical examples

Policies
• Canvas turn-in of assignments
• No handwritten work
• Due on time
  – Individual assignments only
    • 3 free late days total
• Collaboration
  – Tools – allowed
  – Designs – limited to project teams as specified on assignments
• See web page

Feedback
• Will have anonymous feedback sheets for each lecture
  – Clarity?
  – Speed?
  – Vocabulary?
  – General comments

Admin
• Your action:
  – Find course web page
    • Read it, including the policies
    • Find Syllabus
  – Find homework 1
  – Find lecture slides
    » Will try to post before lecture
  – Find reading assignments
  – Find reading for lecture 2 on canvas and web
    • …for this lecture if you haven’t already
  – Find/join piazza group for course
  – Signup for Detkin/Ketterer card access
    • tiny.cc/detkin-access

Logistics
• Will need SD Card writer for HW2+
  – (can get <$10 on amazon.com)

Coming Soon
• Boards not available, yet
  – Watch piazza
  – Maybe office hours Thursday or Tuesday
• SDSoC (Xilinx Software)
  – Not available on , yet
  – Windows is available
    • Ketterer
    • Detkin? (fixing some last problems on Tuesday)

Cautionary Note
Most common board failure was broken USB and power.
New boards will have strain relief.
Don’t unplug USB, power cables from board.

Big Ideas
• Programmable Platforms
  – Key delivery vehicle for innovative computing applications
  – Reduce TTM, risk
  – More than a microprocessor
  – Heterogeneous, parallel
• Demand hardware-software codesign
  – Soft view of hardware
  – Resource-aware view of parallelism