Universität Dortmund

Hardware Platforms for Embedded Computing
Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

Importance of Energy Efficiency

Efficient design is needed; otherwise, the price for software flexibility cannot be paid.

© Hugo De Man (IMEC) Philips, 2007

Embedded vs. general-purpose processors

• Embedded processors may be optimized for a category of applications.
  – Customization may be narrow or broad.
• We may judge embedded processors using different metrics:
  – Code size.
  – Memory system performance.
  – Predictability.
• Disappearing distinction: embedded processors everywhere

Microcontroller Architectures

[Figure: Von Neumann architecture: the CPU reaches one memory (addresses 0 to 2^n), holding both program and data, through a single address bus and data bus. Harvard architecture: the CPU reaches program memory through an address bus and a fetch bus, and a separate data memory through its own address bus and data bus.]

RISC processors

• RISC generally means highly pipelinable, one instruction per cycle.
• Pipelines of embedded RISC processors have grown over time:
  – ARM7 has a 3-stage pipeline.
  – ARM9 has a 5-stage pipeline.
  – ARM11 has an 8-stage pipeline.

[Figure: ARM11 pipeline [ARM05].]

ARM Cortex

Based on ARMv7 Architecture & Thumb®-2 ISA
– ARM Cortex A Series: applications CPUs focused on the execution of complex OS and user applications
  • First Product: Cortex-A8
  • Executes ARM, Thumb-2 instructions
– ARM Cortex R Series: deeply embedded processors focused on real-time environments
  • First Product: Cortex-R4(F)
  • Executes ARM, Thumb-2 instructions
– ARM Cortex M Series: microcontroller cores focused on very cost-sensitive, deterministic, interrupt-driven environments
  • First Product: ARM Cortex-M3 (2uA, 0.5mW/MHz)
  • Executes Thumb-2 instructions

Cortex-M3 “Processor”

Central Core

• Harvard architecture
  – Separate Instruction & Data buses enable parallel fetch & store
• Advanced 3-Stage Pipeline
  – Includes Branch Forwarding & Speculation
  – Additional Write-Back via Bus Matrix

Microcontrollers

[Figure: a microcontroller integrates on a single chip a CPU, memory (ROM, RAM), I/O, and subsystems: timers, counters, analog interfaces, I/O interfaces.]

A Microcontroller SOC example: STM32 Value line
64K-128KBytes System Diagram

• Core and operating conditions
  - ARM® Cortex™-M3 CPU
  - 1.25 DMIPS/MHz up to 24 MHz
  - 2.0 V to 3.6 V range
  - -40 to +105 °C
• Rich connectivity
  - 8 communications peripherals
• Advanced analog
  - 12-bit 1.2 µs conversion time ADC
  - Dual channel 12-bit DAC
• Enhanced control
  - 16-bit motor control timer
  - 6x 16-bit PWM timers
• LQFP48, LQFP/BGA64, LQFP100

[Figure: STM32 Value line system diagram. Blocks shown: Cortex™-M3 CPU (24 MHz); 64kB - 128kB Flash memory with Flash I/F; 8kB SRAM; 20B backup data; JTAG/SW debug; Nested vect IT Ctrl; 1 x Systick timer; DMA (7 channels); power supply (Reg 1.8V, POR/PDR/PVD); XTAL oscillators (32KHz + 4~25MHz); internal RC oscillators (40KHz + 8MHz); PLL; RTC / AWU; clock control; ARM® Lite Hi-Speed bus matrix / arbiter (max 24MHz); ARM® peripheral buses (max 24MHz) with bridges; 1 x 16-bit PWM synchronized AC timer; 6 x 16-bit timers; 1 x CEC; up to 16 Ext. ITs; 2 x watchdog (independent & window); 37/51/80 I/Os; 2 x USART/LIN and 1 x USART/LIN, each with Smartcard / IrDa and modem control; 1 x SPI (shown twice); 2 x I2C; 2-channel 12-bit DAC; 1 x 12-bit ADC with up to 16 channels; temperature sensor.]

DSP Applications

• Audio applications
• MPEG Audio
• Portable audio
• Digital cameras
• Wireless
• Cellular telephones
• Base station
• Networking
• Cable modems
• ADSL
• VDSL

Embedded computing needs lots of DSP capabilities

DSP architectures

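Application: $y[j] = \sum_{i=0}^{n-1} x[j-i] \cdot a[i]$

$\forall i,\ 0 \le i \le n-1:\quad y_i[j] = y_{i-1}[j] + x[j-i] \cdot a[i]$

Architecture: Example: Data path ADSP210x
– Parallelism
– Dedicated registers

[Figure: ADSP210x data path. Memories P (holding a) and D (holding x) deliver the operands a[i] and x[j-i]. Registers AX, AY, AF feed the ALU (+,-,..), with result register AR. Registers MX, MY, MF feed the multiplier (*), whose product x[j-i]*a[i] is added (+,-) to y_{i-1}[j] in MR. The address generation unit (AGU) holds the address registers A0, A1, A2 and computes i+1, j-i+1.]

MR:=0; A1:=1; A2:=n-2;
MX:=x[n-1]; MY:=a[0];
for ( j:=1 to n)
{MR:=MR+MX*MY;
 MY:=a[A1]; MX:=x[A2];
 A1++; A2--}

DSP - Features (1)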

• Multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions (as shown)
• Heterogeneous registers (as shown)
• Separate address generation units (AGUs) (as in ADSP 210x)

Single Issue vs VLIW

[Figure: a single-issue CPU executes one instruction with one operation per cycle (1 instr/cycle). The compiler packs the same operations into 3-issue VLIW instructions (1 instr/cycle, 3 ops/cycle), filling unused slots with nops.]
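The multiply/accumulate loop of the ADSP210x example can be sketched in portable C to show what the MAC instruction and the address generation units do on every iteration. This is a minimal sketch, not the ADSP210x programming model: the function name `fir_sample` and the use of `long` as the accumulator type are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One output sample of an n-tap FIR filter:
 *   y[j] = sum over i = 0 .. n-1 of x[j-i] * a[i]
 * The caller must guarantee j >= n-1 so that x[j-i] never reads below x[0].
 * (fir_sample is a hypothetical name used only for this sketch.) */
long fir_sample(const int *x, const int *a, size_t n, size_t j)
{
    assert(n > 0 && j + 1 >= n);

    long mr = 0;                         /* accumulator, plays the role of MR */
    for (size_t i = 0; i < n; i++) {
        /* One MAC step. On the DSP the two operand fetches and the index
         * updates (i+1, j-i-1) are done by the AGU in parallel with the
         * multiply, and the loop test costs no cycles (zero-overhead loop). */
        mr += (long)x[j - i] * a[i];
    }
    return mr;
}
```

On a general-purpose core each iteration costs separate load, multiply, add, index-update and branch instructions; the DSP features listed above fold these into one cycle per tap.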

Source: H. Corporaal and B. Mesman, 2/25/2016

ARM Processor Families

Cortex-M4

• ARMv7E-M Architecture
• Thumb-2 only
• DSP extensions
• Optional FPU (Cortex-M4F)
• Otherwise, same as Cortex-M3
• Implements full Thumb-2 instruction set
• Saturated math (e.g. QADD)
• Packing and unpacking (e.g. UXTB)
• Signed multiply (e.g. SMULTB)
• SIMD (e.g. ADD8)
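To make the saturated-math and SIMD items concrete, the following C sketch models what such instructions compute in software. The function names `sat_add32` and `add8_lanes` are hypothetical. The lane-wise wrap-around in `add8_lanes` is an assumption modeled on a non-saturating byte add, so it should be read as an illustration of the idea rather than the exact semantics of a particular Cortex-M4 instruction.

```c
#include <stdint.h>

/* Saturating signed 32-bit add: results outside the int32_t range are
 * clamped to INT32_MAX / INT32_MIN instead of wrapping around, which is
 * the behavior a saturating add such as QADD provides in one instruction. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* exact result in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Four independent 8-bit adds packed into one 32-bit word.
 * Each lane wraps modulo 256 and no carry crosses a lane boundary
 * (assumed semantics for this sketch). */
uint32_t add8_lanes(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t xa = (a >> (8 * lane)) & 0xFFu;
        uint32_t xb = (b >> (8 * lane)) & 0xFFu;
        result |= ((xa + xb) & 0xFFu) << (8 * lane);
    }
    return result;
}
```

With the DSP extensions each of these functions collapses to a single instruction, which is where the speed-up on audio and signal-processing kernels comes from.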

Cortex M3 Total 60k* Gates

University Program Material Copyright © ARM Ltd 2012

Binary Upwards Compatibility

ARMv7-M Architecture

ARMv6-M Architecture

Cortex-M4 DSP instructions

• Remember VLIW?


Multi-processor SoCs for Embedded Computing
Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

Application pull

[Figure: application roadmap by year of introduction (2005 to 2015) against required energy efficiency (5 GOPS/W, 10 GOPS/W, 100 GOPS/W, 1 TOPS/W). Applications shown include H264 decoding and encoding, 802.11n, UWB, A/V streaming, mobile base-band, Si Xray, image, sign, gesture, emotion and expression recognition, language dictation, auto personalization, fully recognition (security), Gbit radio, collision avoidance, adaptive route, gesture detection, structured encoding and decoding, HMI by motion, autonomous driving, ubiquitous navigation, 3D projected display, 3D TV, 3D gaming, and 3D ambient interaction. Source: [IMEC]]

Power Bottleneck

• Power trend

[Figure: power consumption (dynamic power, sub-threshold leakage, gate-oxide leakage, with a possible trajectory for high-k dielectrics) and physical gate length [nm] versus year, 1990 to 2020.]

• Power density trend

[Figure: power density (Watts/cm²), split into dynamic power and leakage power, for the 250nm, 180nm, 130nm, 90nm and 65nm nodes. Source: [STM ASIC]]

Multi-Core & Power

[Figure: a large core with cache compared with a small core. Relative to the large core, the small core has Power = 1/4 and Performance = 1/2. Four small cores C1 to C4 share a cache.]

Multi-Core:
• Power efficient
• Better power and thermal management

µArchitecture Techniques

Multi-threading

[Figure: a single thread (ST) stalls while waiting for memory; with multi-threading, threads MT1 to MT3 overlap their memory waits and reach full HW utilization.]

Increase on-die Memory

[Figure: cache as % of total area (0% to 100%) for the 486, Pentium®, Pentium® III, Pentium® 4 and Pentium® M, across process nodes from 1u to 65nm.]

Improved performance, no impact on thermals & power delivery

Chip Multi-processing

[Figure: relative performance versus die area and power for a multi-core design (C1 to C4 with cache) and a large single core.]

Integrated SoC Mobile
• High-speed SMP for “almost sequential” GP
• “Processor arrays” for domain-specific throughput computing (100x GOPS/W) ultra parallel…

H-SOC in 2013 – Apple A7

Used in iPad Air & iPhone 5s

H-SOC in 2015/16: Tegra K1

Heterogeneous Computing in K1
Visual Analytics & Computational Photography

Accelerated (Heterogeneous) Embedded Computing
Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

Hardware Execution Model

[Figure: a CPU with its CPU memory next to a GPU with its GPU memory. The GPU contains Core 0 to Core 15, each with Lane 0 to Lane 15.]

• GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor with multiple lanes but with no scalar processor
• CPU sends whole “grid” over to GPU, which distributes thread blocks among cores (each thread block executes on one core)
  – Programmer unaware of number of cores
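The distribution of thread blocks among cores can be pictured with a small host-side model. This is a sketch under stated assumptions: the function name `distribute_blocks` is hypothetical, and the round-robin policy is chosen only for illustration, since real GPUs hand out blocks dynamically as cores become free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scheduler model: assign the thread blocks of a grid to
 * cores round-robin and record how many blocks each core executes.
 * blocks_per_core must have room for n_cores entries. */
void distribute_blocks(size_t n_blocks, size_t n_cores, size_t *blocks_per_core)
{
    assert(n_cores > 0);

    for (size_t c = 0; c < n_cores; c++)
        blocks_per_core[c] = 0;

    for (size_t b = 0; b < n_blocks; b++)
        blocks_per_core[b % n_cores]++;   /* each block runs on exactly one core */
}
```

The same grid of 10 blocks runs on a 4-core GPU (3, 3, 2, 2 blocks per core) or on a 16-core GPU (one block on each of ten cores) without any change to the program, which is what "programmer unaware of number of cores" means in practice.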

CPUs vs GPUs

[Figure: block diagrams of a CPU (control logic, four ALUs, cache, DRAM) and a GPU (DRAM).]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign

CUDA Programmer's View of GPUs

A GPU contains multiple SIMD Units. All of them can access global memory.

Simplified CUDA Programming Model
• Computation performed by a very large number of independent small scalar threads (CUDA threads or microthreads) grouped into thread blocks.

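// C version of DAXPY loop.
void daxpy(int n, double a, double*x, double*y)
{ for (int i=0; i<n; i++)
    y[i] = a*x[i] + y[i]; }

// CUDA version.
// Piece run on host processor. nblocks and nthreads are placeholder names
// for the launch configuration; nblocks*nthreads must be at least n.
daxpy<<<nblocks,nthreads>>>(n,2.0,x,y);
// Piece run on GP-GPU. A kernel launched with <<<...>>> is declared __global__.
__global__
void daxpy(int n, double a, double*x, double*y)
{ int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i<n) y[i] = a*x[i] + y[i]; }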
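The index arithmetic of the CUDA version can be checked on any host by emulating the grid with two nested loops. The sketch below is plain C, not CUDA: `daxpy_thread` and `daxpy_grid` are hypothetical names, and the CUDA built-ins `blockIdx.x`, `blockDim.x` and `threadIdx.x` are passed in as ordinary arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one CUDA thread, with the built-in indices as parameters. */
static void daxpy_thread(size_t block_idx, size_t block_dim, size_t thread_idx,
                         size_t n, double a, const double *x, double *y)
{
    size_t i = block_idx * block_dim + thread_idx;  /* global element index */
    if (i < n)                    /* guard: the last block may be only partly used */
        y[i] = a * x[i] + y[i];
}

/* Sequential stand-in for a kernel launch: every (block, thread) pair of
 * the grid runs once. On a GPU these iterations execute in parallel. */
void daxpy_grid(size_t n, double a, const double *x, double *y, size_t block_dim)
{
    assert(block_dim > 0);
    size_t n_blocks = (n + block_dim - 1) / block_dim;  /* round up to cover n */

    for (size_t b = 0; b < n_blocks; b++)
        for (size_t t = 0; t < block_dim; t++)
            daxpy_thread(b, block_dim, t, n, a, x, y);
}
```

With n = 5 and a block size of 4, two blocks are launched and the last three threads of the second block fall outside the array; the `i < n` guard is what keeps them from writing past the end.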

Thread Hierarchy in CUDA

Grid contains Thread Blocks

Thread Block contains Threads

Sharing memory

• Mobile GPUs share memory with CPU

Converging also for general computing: “Heterogeneous System Architecture”

Energy Efficiency Again

What if the workload is not friendly to multiprocessors (MP) or GPUs?

[Figure labels: MP+GPU, MP]

Efficient software design is needed; otherwise, the price for software flexibility cannot be paid.

© Hugo De Man (IMEC) Philips, 2007

FPGA
Reconfigurable computing

Computer architecture combining some of the flexibility of software with the high performance of hardware by processing with very flexible high-speed computing fabrics like field-programmable gate arrays (FPGAs).
• The principal difference when compared to using ordinary microprocessors is the ability to make substantial changes to the datapath itself in addition to the control flow.
• The main difference from custom hardware, i.e. application-specific integrated circuits (ASICs), is the possibility to adapt the hardware during runtime by "loading" a new circuit on the reconfigurable fabric.

[Wikipedia]

ASIC or FPGA?

ASIC = specify, design and fabricate a new chip
FPGA = specify, design and configure a configurable chip

FPGA Architecture

The basic structure of an FPGA is composed of the following elements:
• Look-up table (LUT): this element performs logic operations.
• Flip-flop (FF): this register element stores the result of the LUT.
• Wires: these elements connect elements to one another, for both logic and clock.
• Input/Output (I/O) pads: these physically available ports get signals in and out of the FPGA.

ESS | FPGA for Dummies | 2015-12-08 | Maurizio Donna

FPGA Components: Logic

How can we implement any circuit in an FPGA? Combinational logic is represented by a truth table (e.g. full adder).
• Implement the truth table in small memories (LUTs).
• A function is implemented by writing all possible values that the function can take into the LUT.
• The input values are used to address the LUT and retrieve the value of the function corresponding to those inputs.

FPGA Components: Logic

A LUT is basically a multiplexer that evaluates the truth table stored in the configuration SRAM cells (can be seen as a one bit wide ROM).
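The idea of a LUT as a small addressable truth table can be sketched in C. The type `lut3`, the function `lut3_eval` and the two constants are hypothetical names for this illustration; the 3-input size is chosen to match the full adder example, whereas real devices typically use wider LUTs.

```c
#include <stdint.h>

/* A 3-input LUT: 8 configuration bits, one per row of the truth table.
 * These bits stand for the configuration SRAM cells. */
typedef struct {
    uint8_t config;
} lut3;

/* The three inputs form the address; the addressed bit is the output.
 * This is the multiplexer that selects one stored truth-table entry. */
int lut3_eval(lut3 lut, int a, int b, int c)
{
    unsigned addr = ((unsigned)(a & 1) << 2)
                  | ((unsigned)(b & 1) << 1)
                  |  (unsigned)(c & 1);
    return (lut.config >> addr) & 1;
}

/* Full adder with inputs (a, b, carry-in), address = a*4 + b*2 + c.
 * Sum is 1 for an odd number of ones: rows 1, 2, 4, 7 -> 0x96.
 * Carry is 1 for two or more ones:    rows 3, 5, 6, 7 -> 0xE8. */
static const lut3 FULL_ADDER_SUM   = { 0x96 };
static const lut3 FULL_ADDER_CARRY = { 0xE8 };
```

Changing the function means changing only the 8 configuration bits; the evaluation logic stays the same, which is what makes the LUT programmable.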

How to handle sequential logic? Add a flip-flop to the output of LUT (Clocked Storage element).

This is called a Basic Logic Element (BLE): the circuit can now use the output from the LUT or from the FF.
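A BLE can be modeled as a LUT followed by a flip-flop and an output selector. The sketch below is a behavioral model only, with hypothetical names (`ble`, `ble_output`, `ble_clock`) and an assumed 3-input LUT; it is not a description of any vendor's logic cell.

```c
#include <stdint.h>

typedef struct {
    uint8_t lut_config;  /* truth table of a 3-input LUT (8 bits)           */
    int     use_ff;      /* 1: output comes from the flip-flop, 0: from LUT */
    int     ff_state;    /* value currently stored in the flip-flop         */
} ble;

/* Combinational LUT lookup: the low 3 bits of inputs address the table. */
static int ble_lut(const ble *e, unsigned inputs)
{
    return (e->lut_config >> (inputs & 7u)) & 1;
}

/* Value the BLE drives onto the routing wires between clock edges. */
int ble_output(const ble *e, unsigned inputs)
{
    return e->use_ff ? e->ff_state : ble_lut(e, inputs);
}

/* Rising clock edge: the flip-flop captures the current LUT output. */
void ble_clock(ble *e, unsigned inputs)
{
    e->ff_state = ble_lut(e, inputs);
}
```

With `use_ff` set, the output changes only on a clock edge and holds its value afterwards, which is how sequential logic is built from the same element that implements combinational logic.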

FPGA Components: wires
Before the FPGA is programmed, it doesn’t know which CLBs (configurable logic blocks) will be connected: connections are design dependent, so there are wires everywhere (both for DATA and CLOCK).
CLBs are typically arranged in a grid, with wires on all sides.

To connect CLBs to wires, connection boxes are used: these devices allow the inputs and outputs of a CLB to connect to different wires.

FPGA Components: wires

Connection boxes allow CLBs to connect to routing wires, but that only allows signals to move along a single wire. To connect wires together, switch boxes (switch matrices) are used: these connect horizontal and vertical routing channels. The flexibility defines how many wires a single wire can connect to inside the box.

ROUTABILITY is a measure of the number of circuits that can be routed.

HIGHER FLEXIBILITY = BETTER ROUTABILITY

[Figure: switch box/matrix placed between four CLBs.]

FPGA Components: wires
The FPGA layout is called a “FABRIC”: it is a 2-dimensional array of CLBs and programmable interconnections. It is sometimes referred to as an “island style” architecture.

FPGA Components: memory
The FPGA fabric includes embedded memory elements that can be used as random-access memory (RAM), read-only memory (ROM), or shift registers. These elements are block RAMs (BRAMs), LUTs, and shift registers.

When LUTs are used as SRAM, this is called DISTRIBUTED RAM.

Dedicated RAM components included in the FPGA fabric are called BLOCK RAMs.

FPGA Components: input/output
The I/O pads connect the signals from the PCB to the internal logic. The I/O blocks (IOBs) are organized in banks (the number of IOBs per bank depends on the technology and the manufacturer). All the pads in the same bank share a common supply voltage, so not all I/O standards can be implemented at the same time in the same bank. There are special pads for ground (GND), supplies (VCC, VCCINT, VCCAUX, etc.), clocks and programming (JTAG).

FPGA Components: input/output
The I/O blocks (IOBs) support a wide range of commercial standards (LVTTL, LVCMOS, LVDS, etc.), both single-ended and differential (in which case a pair of contiguous pads is used). The pads include flip-flops that are used to resynchronize the signal with the internal clock.

HW Design flow

Designing with FPGA

• FPGAs are configured using a HW design flow
  – Describe the desired behavior in an HDL
  – Use the FPGA design automation tools to turn the HDL description into a configuration bitstream
• After configuration, the FPGA operates like dedicated hardware
• HW design expertise needed, low abstraction level, much slower than SW design on processors!

What about mixing FPGAs and Processors?

Traditional Discrete Component Architecture

Source: The Zynq Book

Heterogeneous Architecture CPU+FPGA

Source: The Zynq Book

Mapping of an Embedded SoC Hardware Architecture to Zynq

Source: White Paper: Extensible Processing Platform

Comparison with Alternative Solutions

                          ASIC   ASSP   2 Chip Solution   Zynq
Performance               ✚      ✚                       ✚
Power                     ✚      ✚                       ✚
Unit Cost                 ✚      ✚                       
Total Cost of Ownership         ✚      ✚                 ✚
Risk                            ✚      ✚                 ✚
Time to Market                  ✚      ✚                 ✚
Flexibility                           ✚                 ✚
Scalability                           ✚                 ✚

 positive, negative,  neutral

Source: Xilinx Video Tutorials

Basic Design Flow for Zynq SoC

Source: The Zynq Book

Design Flow for Zynq SoC

Source: Xilinx White Paper: Extensible Processing Platform