Universität Dortmund
Hardware Platforms for Embedded Computing
Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

Importance of Energy Efficiency
Efficient software design needed, otherwise, the price for software flexibility cannot be paid.
© Hugo De Man (IMEC) Philips, 2007

Embedded vs. general-purpose processors
• Embedded processors may be optimized for a category of applications.
– Customization may be narrow or broad.
• We may judge embedded processors using different metrics:
– Code size.
– Memory system performance.
– Predictability.
• Disappearing distinction: embedded processors everywhere

Microcontroller Architectures
[Figure: Von Neumann architecture. A single memory (addresses 0 to 2^n) holds program and data; the CPU reaches it over one address bus and one data bus.]
[Figure: Harvard architecture. The CPU reaches program memory over an address bus and a fetch bus, and a separate data memory over its own address bus and data bus.]

RISC processors
• RISC generally means highly-pipelinable, one instruction per cycle.
• Pipelines of embedded RISC processors have grown over time:
– ARM7 has 3-stage pipeline.
– ARM9 has 5-stage pipeline.
– ARM11 has 8-stage pipeline.
[Figure: ARM11 pipeline [ARM05].]

ARM Cortex
Based on ARMv7 Architecture & Thumb®-2 ISA
– ARM Cortex A Series - Applications CPUs focused on the execution of complex OS and user applications
  • First Product: Cortex-A8
  • Executes ARM, Thumb-2 instructions
– ARM Cortex R Series - Deeply embedded processors focused on real-time environments
  • First Product: Cortex-R4(F)
  • Executes ARM, Thumb-2 instructions
– ARM Cortex M Series - Microcontroller cores focused on very cost-sensitive, deterministic, interrupt-driven environments
  • First Product: ARM Cortex-M3 (2uA, 0.5mW/MHz)
  • Executes Thumb-2 instructions

Cortex-M3 “Processor”

Central Core
• Harvard architecture
– Separate Instruction & Data buses enable parallel fetch & store
• Advanced 3-Stage Pipeline
– Includes Branch Forwarding & Speculation
– Additional Write-Back via Bus Matrix

Microcontrollers
[Figure: a microcontroller is a single chip that combines a CPU, memory (ROM, RAM), I/O, and subsystems: timers, counters, analog interfaces, I/O interfaces.]

A Microcontroller SOC example: STM32 Value line
64K-128KBytes System Diagram
• Core and operating conditions
– ARM® Cortex™-M3 CPU
– 1.25 DMIPS/MHz up to 24 MHz
– 2.0 V to 3.6 V range
– -40 to +105 °C
• Rich connectivity
– 8 communications peripherals
• Advanced analog
– 12-bit 1.2 µs conversion time ADC
– Dual channel 12-bit DAC
• Enhanced control
– 16-bit motor control timer
– 6x 16-bit PWM timers
• LQFP48, LQFP/BGA64, LQFP100
[Figure: system diagram. Cortex™-M3 CPU (24 MHz) with JTAG/SW debug, nested vectored interrupt controller, and 1 x SysTick timer; 64kB - 128kB Flash memory with Flash I/F; 8kB SRAM; 20B backup data; 7-channel DMA; ARM® Lite Hi-Speed bus matrix / arbiter (max 24MHz) with bridges to the ARM® peripheral buses (max 24MHz); power supply (Reg 1.8V, POR/PDR/PVD); XTAL oscillators 32KHz + 4~25MHz; int. RC oscillators 40KHz + 8MHz; PLL; clock control; RTC / AWU; 2 x watchdog (independent & window); 37/51/80 I/Os; up to 16 ext. ITs; 1 x 16-bit PWM synchronized AC timer; 6 x 16-bit timer; 1 x CEC; two 1 x SPI blocks; USART/LIN blocks (2 x and 1 x) with smartcard / IrDA and modem control; 2 x I2C; 1 x 12-bit ADC (up to 16 channels) with temperature sensor; 2-channel 12-bit DAC.]

DSP Applications
• Audio applications
– MPEG Audio
– Portable audio
• Digital cameras
• Wireless
– Cellular telephones
– Base station
• Networking
– Cable modems
– ADSL
– VDSL
Embedded computing needs lots of DSP capabilities

DSP architectures
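Application: $y[j] = \sum_{i=0}^{n-1} x[j-i] \cdot a[i]$
$\forall i,\ 0 \le i \le n-1:\quad y_i[j] = y_{i-1}[j] + x[j-i] \cdot a[i]$
Architecture: Example: Data path ADSP210x
– Parallelism
– Dedicated registers
[Figure: ADSP210x data path. Buses P and D deliver a and x. Address registers A0, A1, A2 and the address generation unit (AGU; +,-) produce the indices i+1 and j-i+1. Registers AX, AY, AF feed the ALU (+,-,..) with result register AR. Registers MX, MY, MF hold x[j-i] and a[i] and feed the multiplier (*); the product x[j-i]*a[i] is accumulated into MR, which holds y_{i-1}[j].]
    MR:=0; A1:=1; A2:=n-2;
    MX:=x[n-1]; MY:=a[0];
    for ( j:=1 to n)
    {MR:=MR+MX*MY;
     MY:=a[A1]; MX:=x[A2];
     A1++; A2--}

DSP - Features (1)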
• Multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions (as shown)
• Heterogeneous registers (as shown)
• Separate address generation units (AGUs) (as in ADSP 210x)

Single Issue vs VLIW
[Figure: a compiler maps the program either to single-issue code (one op per instruction, executed at 1 instr/cycle on a single-issue CPU) or to 3-issue VLIW code (up to three ops per instruction, unused slots filled with nops, executed at 1 instr/cycle, 3 ops/cycle).]
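
The multiply/accumulate loop that the DSP slides describe can be sketched in portable C. This is only an illustration under stated assumptions: the function name `fir_sample` and its signature are hypothetical, and a real DSP would map the loop body to a single MAC instruction inside a zero-overhead loop, with the index updates done by the AGUs rather than by the data path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): computes one output sample of an
 * n-tap FIR filter, y[j] = sum over i = 0..n-1 of x[j-i] * a[i].
 * The caller must guarantee j >= n-1 so that x[j-i] stays inside the array. */
double fir_sample(const double *x, const double *a, size_t n, size_t j)
{
    assert(j + 1 >= n);           /* precondition: enough input history */

    double mr = 0.0;              /* accumulator, plays the role of MR */
    for (size_t i = 0; i < n; i++) {
        /* One multiply/accumulate per tap: y_i = y_{i-1} + x[j-i] * a[i]. */
        mr += x[j - i] * a[i];
    }
    return mr;
}
```

On a general-purpose CPU the loop counter, the two index computations, and the multiply-add compete for the same issue slots. A DSP (or a VLIW machine with enough slots) can perform them in the same cycle, which is the point of the dedicated registers and AGUs listed above.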
Source (Single Issue vs VLIW): Embedded Computer Architecture, H. Corporaal and B. Mesman, 2/25/2016

ARM Processors Families

Cortex-M4
• ARMv7E-M Architecture
– Thumb-2 only
– DSP extensions
• Optional FPU (Cortex-M4F)
• Otherwise, same as Cortex-M3
• Implements full Thumb-2 instruction set
– Saturated math (e.g. QADD)
– Packing and unpacking (e.g. UXTB)
– Signed multiply (e.g. SMULTB)
– SIMD (e.g. ADD8)
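
To make "saturated math" and "SIMD" concrete, here is a minimal portable C model of the two behaviours. The function names `sat_add32` and `simd_add8` are hypothetical and chosen for this sketch; they model what the instructions compute and are not the instructions themselves. On a Cortex-M4 one would normally reach the hardware through compiler intrinsics (CMSIS provides ones such as `__QADD` and `__UADD8`), which this sketch does not use.

```c
#include <stdint.h>

/* Model of a saturating 32-bit signed add in the spirit of QADD:
 * the result clamps at INT32_MAX / INT32_MIN instead of wrapping around. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* widen so the add cannot overflow */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Model of a 4-lane byte-wise SIMD add: each 8-bit lane is added independently
 * (modulo 256), and no carry crosses a lane boundary. */
uint32_t simd_add8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t la = (a >> (8 * lane)) & 0xFFu;
        uint32_t lb = (b >> (8 * lane)) & 0xFFu;
        result |= ((la + lb) & 0xFFu) << (8 * lane);
    }
    return result;
}
```

Saturation matters in signal processing because a wrapped overflow turns a loud sample into one of opposite sign, while a clamped one only clips. The lane loop shows why SIMD is attractive: the hardware does all four byte additions in one instruction.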
[Figure: Cortex-M3, total 60k* gates]
University Program Material Copyright © ARM Ltd 2012

Binary Upwards Compatibility
[Figure: ARMv7-M Architecture and ARMv6-M Architecture]

Cortex-M4 DSP instructions
• Remember VLIW?

Multi-processor SoCs for Embedded Computing

Application pull
[Figure: application roadmap by year of introduction (2005 to 2015) against required energy efficiency (5 GOPS/W, 10 GOPS/W, 100 GOPS/W, 1 TOPS/W). Applications shown include 802.11n, UWB, mobile base-band, H264 decoding and encoding, A/V streaming, Si Xray, image, sign, language, gesture, emotion and expression recognition, dictation, auto personalization, fully recognition (security), Gbit radio, collision avoidance, adaptive route, structured encoding and decoding, gesture detection, HMI by motion, autonomous driving, ubiquitous navigation, 3D projected display, 3D ambient interaction, 3D TV, and 3D gaming. Source: [IMEC].]

Power Bottleneck
• Power trend
[Figure: power consumption (log scale, 10^-6 to 100) and physical gate length [nm] (0 to 300) versus year (1990 to 2020). Curves: dynamic power, sub-threshold leakage, gate-oxide leakage, and a possible trajectory for high-k dielectrics.]
• Power density trend
[Figure: power density (Watts/cm²), split into leakage power and dynamic power, for the 250nm, 180nm, 130nm, 90nm, and 65nm nodes. Source: [STM ASIC].]

Multi-Core & Power
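[Figure: power and performance bars (scale 1 to 4) for a large core with cache and for a small core. Relative to the large core, the small core has Power = 1/4 and Performance = 1/2.]
[Figure: four small cores C1 to C4 sharing a cache, with their power and performance bars (scale 1 to 4).]
Multi-Core:
• Power efficient
• Better power and thermal management

µArchitecture Techniques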
• Multi-threading
[Figure: a single thread (ST) alternates between execution and waiting for memory; with multi-threading (MT1, MT2, MT3) the waits overlap, giving full HW utilization.]
• Increase on-die memory
[Figure: cache as % of total area (0% to 100%) versus process (1u, 0.5u, 0.25u, 0.13u, 65nm) for 486, Pentium®, Pentium® III, Pentium® 4, and Pentium® M.]
Improved performance, no impact on thermals & power delivery
• Chip Multi-processing
[Figure: relative performance (1 to 3.5) versus die area and power (1 to 4) for a multi core (C1 to C4 plus cache) and a large single core.]

Integrated SoC Mobile
• High-speed SMP for “almost sequential” GP
• “Processor arrays” for domain-specific throughput computing (100x GOPS/W) ultra parallel…

H-SOC in 2013 – Apple A7
Used in iPad Air & iPhone 5s

H-SOC in 2015/16 Tegra K1

Heterogeneous Computing in K1
Visual Analytics & Computational Photography

Accelerated (Heterogeneous) Embedded Computing

Hardware Execution Model
[Figure: a CPU with its CPU memory next to a GPU with its GPU memory. The GPU consists of Core 0 to Core 15, each with Lane 0 to Lane 15.]
• GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor with multiple lanes but with no scalar processor
• CPU sends whole “grid” over to GPU, which distributes thread blocks among cores (each thread block executes on one core)
– Programmer unaware of number of cores
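
The execution model above can be emulated sequentially in plain C to show who decides what. This is a sketch under stated assumptions, not how a GPU is implemented: `run_grid`, `daxpy_thread`, and `NUM_CORES` are hypothetical names, the per-thread work is the DAXPY operation y = a*x + y, and the round-robin placement of blocks on cores is a simplification of whatever policy the hardware actually uses.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CORES 16   /* mirrors Core 0..Core 15 in the figure; an assumption of this sketch */

/* Work of one emulated thread: update at most one element of y. */
static void daxpy_thread(size_t block, size_t block_dim, size_t thread,
                         size_t n, double a, const double *x, double *y)
{
    size_t i = block * block_dim + thread;   /* global element index */
    if (i < n)                               /* the last block may be only partly used */
        y[i] = a * x[i] + y[i];
}

/* Emulates the GPU side: the caller only chooses the block size; the number of
 * blocks follows from n, and the "hardware" (here: a modulo) assigns blocks to cores.
 * blocks_per_core records how many blocks each core received. */
void run_grid(size_t n, size_t block_dim, double a, const double *x, double *y,
              size_t blocks_per_core[NUM_CORES])
{
    assert(block_dim > 0);
    size_t num_blocks = (n + block_dim - 1) / block_dim;   /* ceil(n / block_dim) */

    for (size_t c = 0; c < NUM_CORES; c++)
        blocks_per_core[c] = 0;

    for (size_t block = 0; block < num_blocks; block++) {
        size_t core = block % NUM_CORES;     /* each block runs entirely on one core */
        blocks_per_core[core]++;
        for (size_t thread = 0; thread < block_dim; thread++)
            daxpy_thread(block, block_dim, thread, n, a, x, y);
    }
}
```

Note that `daxpy_thread` never mentions a core: the result is the same for any value of `NUM_CORES`, which is what "programmer unaware of number of cores" means in practice.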

CPUs vs GPUs
[Figure: CPU (control, ALUs, cache, DRAM) compared with GPU (DRAM).]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign

CUDA Programmer's View of GPUs
A GPU contains multiple SIMD Units.
All of them can access global memory.

Simplified CUDA Programming Model
• Computation performed by a very large number of independent small scalar threads (CUDA threads or microthreads) grouped into thread blocks.
    // C version of DAXPY loop.
    void daxpy(int n, double a, double*x, double*y)
    { for (int i=0; i<n; i++)
        y[i] = a*x[i] + y[i]; }

Thread Hierarchy in CUDA
• Grid contains Thread Blocks
• Thread Block contains Threads

Sharing memory
• Mobile GPUs share memory with CPU
• Converging also for general computing: “Heterogeneous System Architecture”

Energy Efficiency Again
What if the workload is not friendly to MultiProc or GPU?
[Figure labels: MP, MP+GPU]
Efficient software design needed, otherwise, the price for software flexibility cannot be paid.
© Hugo De Man (IMEC) Philips, 2007

FPGA

Reconfigurable computing
Computer architecture combining some of the flexibility of software with the high performance of hardware by processing with very flexible high speed computing fabrics like field-programmable gate arrays (FPGAs).
• The principal difference when compared to using ordinary microprocessors is the ability to make substantial changes to the datapath itself in addition to the control flow.
• The main difference with custom hardware, i.e. application-specific integrated circuits (ASICs), is the possibility to adapt the hardware during runtime by "loading" a new circuit on the reconfigurable fabric.
[wikipedia]

ASIC or FPGA?
• ASIC = specify, design and fabricate a new chip
• FPGA = specify, design and configure a configurable chip

FPGA Architecture
The basic structure of an FPGA is composed of the following elements:
• Look-up table (LUT): this element performs logic operations
• Flip-Flop (FF): this register element stores the result of the LUT
• Wires: these elements connect elements to one another, both logic and clock
• Input/Output (I/O) pads: these physically available ports get signals in and out of the FPGA
ESS | FPGA for Dummies | 2015-12-08 | Maurizio Donna

FPGA Components: Logic
How can we implement any circuit in an FPGA?
Combinational logic is represented by a truth table (e.g. full adder).
Implement the truth table in small memories (LUTs).
A function is implemented by writing all possible values that the function can take into the LUT.
The input values are used to address the LUT and retrieve the value of the function corresponding to those inputs.
A LUT is basically a multiplexer that evaluates the truth table stored in the configuration SRAM cells (it can be seen as a one-bit-wide ROM).
How to handle sequential logic? Add a flip-flop to the output of the LUT (a clocked storage element).
This is called a Basic Logic Element (BLE): the circuit can now use the output from the LUT or from the FF.

FPGA Components: wires
Before the FPGA is programmed, it doesn't know which CLBs will be connected: connections are design dependent, so there are wires everywhere (both for DATA and CLOCK).
CLBs are typically arranged in a grid, with wires on all sides.
To connect a CLB to the wires, connection boxes are used: these devices allow the inputs and outputs of a CLB to connect to different wires.
Connection boxes allow CLBs to connect to routing wires, but that only moves signals along a single wire; to connect wires together, switch boxes (switch matrices) are used: these connect horizontal and vertical routing channels.
The flexibility defines how many wires a single wire can connect to inside the box.
ROUTABILITY is a measure of the number of circuits that can be routed.
HIGHER FLEXIBILITY = BETTER ROUTABILITY
The FPGA layout is called a “FABRIC”: it is a 2-dimensional array of CLBs and programmable interconnections. It is sometimes referred to as an “island style” architecture.
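
The LUT and BLE described above can be modelled in a few lines of C. This is a minimal sketch for illustration: the type and function names (`lut3_t`, `lut3_eval`, `ble_t`, `ble_clock`, `ble_output`) are hypothetical, a 3-input LUT is assumed so that a full adder fits, and real devices typically use wider LUTs and vendor-specific BLE details.

```c
#include <stdbool.h>
#include <stdint.h>

/* A 3-input LUT modelled as 8 configuration bits: bit k holds the output
 * for the input combination whose binary value is k. */
typedef struct {
    uint8_t config;   /* truth table, written once at configuration time */
} lut3_t;

/* The inputs form an address; evaluating the LUT is just reading the stored bit. */
bool lut3_eval(const lut3_t *lut, bool a, bool b, bool c)
{
    unsigned addr = (unsigned)a | ((unsigned)b << 1) | ((unsigned)c << 2);
    return (lut->config >> addr) & 1u;
}

/* Full adder from two LUTs (a, b: operands; c: carry-in). */
static const lut3_t SUM_LUT   = { 0x96 };  /* a XOR b XOR c: ones at addresses 1, 2, 4, 7 */
static const lut3_t CARRY_LUT = { 0xE8 };  /* majority(a, b, c): ones at addresses 3, 5, 6, 7 */

/* Basic Logic Element: a LUT plus a flip-flop, with a configuration bit
 * that selects which of the two drives the output. */
typedef struct {
    lut3_t lut;
    bool   ff;        /* flip-flop state */
    bool   use_ff;    /* true: registered output, false: combinational output */
} ble_t;

/* One clock edge: the flip-flop captures the current LUT output. */
void ble_clock(ble_t *ble, bool a, bool b, bool c)
{
    ble->ff = lut3_eval(&ble->lut, a, b, c);
}

/* Output seen by the routing fabric. */
bool ble_output(const ble_t *ble, bool a, bool b, bool c)
{
    return ble->use_ff ? ble->ff : lut3_eval(&ble->lut, a, b, c);
}
```

Changing the circuit means changing `config` and `use_ff`, not the code that evaluates them, which mirrors how a bitstream reprograms the same silicon.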

FPGA Components: memory
The FPGA fabric includes embedded memory elements that can be used as random-access memory (RAM), read-only memory (ROM), or shift registers. These elements are block RAMs (BRAMs), LUTs, and shift registers.
• Using LUTs as SRAM is called DISTRIBUTED RAM.
• Dedicated RAM components included in the FPGA fabric are called BLOCK RAMs.

FPGA Components: input/output
The I/O pads connect the signals from the PCB to the internal logic.
The IOBs are organized in banks (the number of IOBs per bank depends on the technology and the manufacturer).
All the pads in the same bank share a common supply voltage, so not all I/O standards can be used at the same time in the same bank.
There are special pads for ground (GND), supplies (VCC, VCCINT, VCCAUX, etc.), clocks, and programming (JTAG).
The I/O blocks (IOBs) support a wide range of commercial standards (LVTTL, LVCMOS, LVDS, etc.), both single-ended and differential (in that case a pair of contiguous pads is used).
The pads contain FFs that are used to resynchronize the signal with the internal clock.

HW Design flow

Designing with FPGA
• FPGAs are configured using a HW design flow
– Describe the desired behavior in an HDL
– Use the FPGA design automation tools to turn the HDL description into a configuration bitstream
• After configuration, the FPGA operates like dedicated hardware
• HW design expertise needed, low abstraction level, much slower than SW design on processors!
What about mixing FPGAs and Processors?

Traditional Discrete Component Architecture
Source: The Zynq Book

Heterogeneous Architecture CPU+FPGA
Source: The Zynq Book

Mapping of an Embedded SoC Hardware Architecture to Zynq
Source: Xilinx White Paper: Extensible Processing Platform

Comparison with Alternative Solutions
[Table: ASIC, ASSP, 2 Chip Solution, and Zynq, each rated positive, negative, or neutral on Performance, Power, Unit Cost, Total Cost of Ownership, Risk, Time to Market, Flexibility, and Scalability.]
Source: Xilinx Video Tutorials

Basic Design Flow for Zynq SoC
Source: The Zynq Book

Design Flow for Zynq SoC
Source: Xilinx White Paper: Extensible Processing Platform