ESE532: System-on-a-Chip Architecture Today • Case for Programmable SoC • Goals • Outcomes Day 1: August 29, 2018 • New/evovling Course, Risks, Tools Introduction and Overview • Sample Optimization • This course (incl. policies, logistics) • Zed Boards

1 2 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Apple A11 Bionic Questions • 90mm2 10nm FinFET • 4.3B transistors • Why do today’s SoC look like they do? • iPhone 8, 8s, X • How approach programming modern SoCs? • 6 ARM cores (64b) – 2 fast (2.4GHz) • How design a custom SoC? – 4 low energy • When building a System-on-a-Chip (SoC) • 3 custom GPUs – How much area should go into: • Neural Engine • Processor cores, GPUs, – 600 Bops? FPGA logic, memory, interconnect, • , image accel. custom functions (which) …. ? • 8MB L2 cache

3 4 Penn ESE532 Fall 2017 -- DeHonhttps://wccftech.com/apple-iphone-8-plus-a11-bionic-complete-teardown/ Penn ESE532 Fall 2018 -- DeHon

FPGA Field-Programmable Gate Array

K-LUT (typical k=4) Compute block Case for Programmable SoC w/ optional output Flip-Flop

ESE171, CIS371 5 6 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

1 The Way things Were Today 20 years ago • Microprocessor may not be fast enough – (but often it is) • Wanted programmability – Or low enough energy – used a processor • Time and Cost of a custom IC is too high • Wanted high-throughput – $100M’s of dollars for development, Years – used a custom IC • FPGAs promising • Wanted product differentiation – But build everything from prog. gates? – Got it at the board level – Select which ICs and how wired • Premium for small part count – And avoid chip crossing • Build a custom IC – ICs with Billions of Transistors – It was about gates and logic 7 8 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Non-Recurring Engineering NRE Costs (NRE) Costs • Costs spent up front on development – Engineering Design Time – Prototypes – Mask costs • Recurring Engineering – Costs to produce each chip • 28-nm SoC development costs doubled Cost(Nchips) = CostNRE + Nchips× Costperchip over previous node – EE Times 2013 28nm+78%, 20nm+48%, 14nm+31%, 10nm+35% 9 10 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Economics Amortize NRE with Volume Forcing fewer, more customizable Cost(Nchips) = CostNRE + Nchips× Costperchip chips • Economics force fewer, more customizable chips – Mask costs in the millions of dollars – Custom IC design NRE 10s—100s of millions of dollars CostNRE € Cost= + Costperchip • Need market of billions of dollars to recoup investment Nchips • With fixed or slowly growing total IC industry revenues • ! Number of unique chips must decrease

11 12 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

2 Given Demand for Large ICs Programmable • Now contain significant software • How do we get higher performance than – Almost all have embedded processors a processor, while retaining • Must co-design SW and HW programmability? • Must solve complete computing task – Tasks has components with variety of needs – Some don’t need custom circuit – 90/10 Rule

13 14 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Programmable SoC Programmable SoC • Implementation Platform for innovation – This is what you target (avoid NRE) – Implementation vehicle

15 16 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Then and Now 20 years ago Today • Programmability? • Programmability? – use a processor – uP, FPGA, GPU, PSoC • High-throughput • High-throughput Goals, Outcomes – used a custom IC – FPGA, GPU, PSoC, • Wanted product custom differentiation • Wanted product – board level differentiation – Select & wired IC – Program FPGAs, • Build a custom IC PSoC – It was about gates • Build a custom IC and logic 17 18 Penn ESE532 Fall 2018 -- DeHon – System and software Penn ESE532 Fall 2018 -- DeHon

3 Goals Roles • Create Computer Engineers • PhD Qualifier – SW/HW divide is wrong, outdated – One broad Computer Engineering – Parallelism, data movement, resource management, abstractions • CMPE Concurrency – Cannot build a chip without software • Hands-on Project course • SoC user – know how to exploit • SoC designer – architecture space, hw/sw codesign • Project experience – design and optimization 19 20 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Outcomes Outcomes • Design, optimize, and program a modern System-on-a-Chip. • Understand the from • Analyze, identify bottlenecks, design-space gates to application software, including: – Modeling " write equations to estimate – on-chip memories and communication • Decompose into parallel components networks, I/O interfacing, design of accelerators, processors, firmware and OS/ • Characterize and develop real-time solutions infrastructure software. • Implement both hardware and software • Understand and estimate key design solutions metrics and requirements including: • Formulate hardware/software tradeoffs, and – area, latency, throughput, energy, power, predictability, and reliability. perform hardware/software codesign 21 22 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

New and Evolving Course Tools • Spring 2017 – first offering • Are complex – Raw, all assignments new … some buggy • Will be challenging, but good for you to – Assignments too tedious, long build confidence can understand and • Fall 2017 – second offering master – Refine assignments, project • Tool runtimes can be long – Increased explicit modeling emphasis • Learning and sharing experience will be – Hard, not insane part of assignments • Fall 2018 – now – Not much different from 2017

– Hopefully fewer rough edges 23 24 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

4 Distinction Abstraction Stack

CIS240, 371, 501 ESE532 • Best Effort Computing • Hardware-Software SoC Arch: – Run as fast as you can codesign Processor Arch: ESE 532 • Binary compatible – Willing to recompile, maybe CIS 371/501 rewrite code • ISA separation – Define/refine hardware Mixed Signal: • Shared memory • Real-Time ESE 568 parallelism – Guarantee meet deadline • Non shared-memory Circuits/VLSI Digital: ESE370/570 Analog: ESE419/572 models (ESE319)

Devices: ESE521 25 26 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon (ESE218)

Abstract Approach

• Identify requirements, bottlenecks • Decompose Parallel Opportunities Approach -- Example – At extreme, how parallel could make it? – What forms of parallelism exist? • Thread-level, data parallel, instruction-level • Design space of mapping – Choices of where to map, area-time tradeoffs • Map, analyze, refine – Write equations to understand, predict 27 28 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

SPICE Circuit Simulator Analyze Matrix Solve Ax=B A matrix B vector x unknown vector Solve for x

Linear Algebra solving n eqns in n unknowns.

Example: Kapre+DeHon, TRCAD 2012 29 30 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

5 Analyze Speedup

• T=Tmodeleval+Tmatsolve+Tctrl • What should we speedup first? • T=Tmodeleval+Tmatsolve+Tctrl • What happens if only speedup modeleval?

31 – T=Tmatsolve+(Tmodeleval)/S+Tctrl 32 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Model Evaluation: Trivial Analyze Hardware Implementation

• If only accelerated model evaluation only about 2x speedup • If want better than 14x speed, must also attack control * * ÷ e f b - * ÷ d c a ex 33 34 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHonVerilog-AMS as Domain-Specific Language

Spatial Parallelism Parallelism: Model Evaluation • Every operation (*, + /) gets dedicated hardware. • Implement task in space " use additional area for • Every device each operator. independent • Parallel – all operations occur simultaneously. • Many of each type of device • Can evaluate in * * ÷ parallel e f b – T=Tseq/Nproc - * ÷ • Build pipelined d c a circuit for model x e – Tseq=Ncomp*Tcycle vs. T =T 35 pipe cycle 36 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

6 Single FPGA Mapping Parallelism: Model Evaluation Fully spatial Custom VLIW circuit • Spatial end up • Use custom bottlenecked by evaluation engines b ÷ * e * f * other components • …or GPUs - * ÷ + ÷ d c a ex ex ~100x Speedup 1-10x Speedup Multiple FPGAs 1 FPGA VLIW=Very Long Instruction Word 37 38 Penn ESE532 Fall 2018 -- DeHon exploits Instruction-Level Parallelism Penn ESE532 Fall 2018 -- DeHon

Parallelism: Matrix Solve Example Matrix

• Needed direct solver? • E.g. Gaussian elimination • Data dependence on previous reduce • Parallelism in subtracts • Some row independence 39 40 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Example Matrix Example Matrix

Reduce to critical path: from 9 sequential operations 41 to path of 5 operations. 42 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

7 Dataflow PE Matrix Solve Only

Graph Incoming Nodes Messages

Dataflow trigger * + ÷

Graph Outgoing ~2.4x mean Fanout Messages

43 44 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Parallelism: Matrix Solve Parallelism Controller?

• Settled on • Could leave constructing dataflow sequential graph • For some designs, • Graph can be becomes the iteration independent bottleneck once – Statically scheduled others accelerated – (cheaper) • Has internal • This is bottleneck to parallelism in further acceleration condition evaluation T=Tmodeleval/S1+(Tmatsolve)/S2+Tctrl 45 46 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Single-Chip Solution Parallelism Controller

• Customized datapath controller

Tseqctrl=Nadd+Nmul+10*Ndivide

Tvliwctrl=Max(Nadd/2,Nmul,10*Ndivide)

47 48 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

8 Area-Time for Each Composite Speedup

49 50 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Modern SoC

Class Components

51 52 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Class Components • Lecture (incl. preclass exercise) First Half – Slides on web before class • (you can print if want a follow-along copy) Quickly cover breadth • Spatial, C-to-gates – N.B. I will encourage (force) class participation • Metrics, bottlenecks • Real-time • Questions • Memory • Reactive • Reading [~1 required paper/lecture] • Parallel models – online: Canvas, IEEE, ACM, also ZynqBook, • SIMD/Data Parallel Line up with Parallel Programming for FPGAs • Thread-level homeworks • Homework parallelism – (1 per due F5pm) • Project – open-ended (~6 weeks) 53 54 Penn• ESE532Note Fall syllabus,2018 -- DeHon course admin online Penn ESE532 Fall 2018 -- DeHon

9 Second Half Teaming

• Use everything on • Memory • HW and Project in Groups of 2 project • Networking • HW: we assign • Schedule more • Energy • Project: you propose, we review tentative • Scaling – Adjust as experience • Individual assignment turnin and project demands • Chip Cost • Going deeper • Defect + Fault tolerance

55 56 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

C Review Office & Lab Hours • Course will rely heavily on C • Andre: T 4:15pm—5:30pm Levine 270 – Program both hardware and software in C • Renzhi & Han: M R 6—8pm in Ketterer • HW1 has some C warmup problems – Start tomorrow 8/30 ?

• TA will hold C review – Ketterer on Sept. 4th at 6pm – (before our class meeting since Monday 9/3 is Labor day) – Watch piazza for details 57 58 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Preclass Exercise Feedback

• Will have anonymous feedback sheets • Motivate the topic of the day for each lecture – Introduce a problem – Clarity? – Introduce a design space, tradeoff, – Speed? transform – Vocabulary? • Work for ~5 minutes before start – General comments lecturing • Do bring calculator class – Will be numerical examples 59 60 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

10 Policies Boards and Wait List

• Canvas turn-in of assignments • Believe everyone on the wait list has • No handwritten work been offered a permit • Due on time (3 free late days total) • To accommodate, does mean we have • Collaboration – Boards < Students < 2*Boards – Tools – allowed • Will be one board per pair of students – Designs – limited to project teams as specified on assignments • See web page 61 62 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Admin • Your action: Big Ideas – Find course web page • Read it, including the policies • Programmable Platforms • Find Syllabus – Key delivery vehicle for innovative – Find homework 1 computing applications – Find lecture slides – Reduce TTM, risk » Will try to post before lecture – More than a microprocessor – Find reading assignments – Heterogeneous, parallel – Find reading for lecture 2 on canvas and web • Demand hardware-software codesign • …for this lecture if you haven’t already – Soft view of hardware – Find/join piazza group for course 63 – Resource-aware view of parallelism 64 Penn ESE532 Fall 2018 -- DeHon Penn ESE532 Fall 2018 -- DeHon

Zedboard Checkout

• Ketterer (Moore 200) – card key access – tiny.cc/detkin-access • Will need SD Card writer for HW2+ – (can get $<10 on amazon.com) • By numbers

65 Penn ESE532 Fall 2018 -- DeHon

11