Hardware Platforms for Embedded Computing Graphics: © Alexandra Nolte, Gesine Marwedel, 2003 Universität Dortmund Importance of Energy Efficiency
Importance of Energy Efficiency
Efficient software design needed, otherwise, the price for software flexibility cannot be paid.
© Hugo De Man (IMEC) Philips, 2007

Embedded vs. general-purpose processors
• Embedded processors may be optimized for a category of applications.
  – Customization may be narrow or broad.
• We may judge embedded processors using different metrics:
  – Code size.
  – Memory system performance.
  – Predictability.
• Disappearing distinction: embedded processors everywhere.

Microcontroller Architectures
[Figure: two block diagrams. Von Neumann architecture: the CPU reaches a single memory (addresses 0 to 2^n) holding both program and data over one address bus and one data bus. Harvard architecture: the CPU reaches a program memory over an address bus and a fetch bus, and a separate data memory over its own address bus and data bus.]

RISC processors
• RISC generally means highly pipelinable, one instruction per cycle.
• Pipelines of embedded RISC processors have grown over time:
  – ARM7 has a 3-stage pipeline.
  – ARM9 has a 5-stage pipeline.
  – ARM11 has an 8-stage pipeline.
[Figure: ARM11 pipeline [ARM05].]
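The "one instruction per cycle" claim can be made concrete with the textbook model of an ideal pipeline. The sketch below is not from the slides: it assumes no stalls, hazards, or branch penalties, and the function names `ideal_pipeline_cycles` and `ideal_cpi` are hypothetical. Under that assumption, n instructions on a k-stage pipeline take k + n - 1 cycles, so cycles per instruction approach 1 for long instruction streams regardless of the depth (3, 5, or 8 stages as listed above).

```c
#include <assert.h>

/* Cycles needed to run n instructions on an ideal k-stage pipeline.
   Assumption: no stalls, hazards, or branch penalties. The first
   instruction needs k cycles to drain through all stages; every further
   instruction completes one cycle after its predecessor. */
unsigned long ideal_pipeline_cycles(unsigned stages, unsigned long n_instructions)
{
    assert(stages > 0);
    if (n_instructions == 0)
        return 0;
    return (unsigned long)stages + n_instructions - 1;
}

/* Average cycles per instruction (CPI) under the same idealized model. */
double ideal_cpi(unsigned stages, unsigned long n_instructions)
{
    assert(n_instructions > 0);
    return (double)ideal_pipeline_cycles(stages, n_instructions)
           / (double)n_instructions;
}
```

For example, `ideal_pipeline_cycles(8, 1000)` gives 1007 cycles, a CPI of about 1.007. Real pipelines do worse because deeper pipelines pay more for each mispredicted branch or stall, which this model deliberately ignores.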
ARM Cortex
Based on ARMv7 Architecture & Thumb®-2 ISA
– ARM Cortex A Series - Applications CPUs focused on the execution of complex OS and user applications
  • First Product: Cortex-A8
  • Executes ARM, Thumb-2 instructions
– ARM Cortex R Series - Deeply embedded processors focused on real-time environments
  • First Product: Cortex-R4(F)
  • Executes ARM, Thumb-2 instructions
– ARM Cortex M Series - Microcontroller cores focused on very cost-sensitive, deterministic, interrupt-driven environments
  • First Product: ARM Cortex-M3 (2 µA, 0.5 mW/MHz)
  • Executes Thumb-2 instructions

Cortex-M3 “Processor”

Central Core
• Harvard architecture
  – Separate Instruction & Data buses enable parallel fetch & store
• Advanced 3-Stage Pipeline
  – Includes Branch Forwarding & Speculation
  – Additional Write-Back via Bus Matrix

Microcontrollers
[Figure: a microcontroller as a single chip containing a CPU, memory (ROM, RAM), and I/O subsystems: timers, counters, analog interfaces, I/O interfaces.]

A Microcontroller SOC example: STM32 Value line 64K-128KBytes
• Core and operating conditions
  – ARM® Cortex™-M3 CPU
  – 1.25 DMIPS/MHz up to 24 MHz
  – 2.0 V to 3.6 V range
  – -40 to +105 °C
• Rich connectivity
  – 8 communications peripherals
• Advanced analog
  – 12-bit ADC, 1.2 µs conversion time
  – Dual channel 12-bit DAC
• Enhanced control
  – 16-bit motor control timer
  – 6x 16-bit PWM timers
• LQFP48, LQFP/BGA64, LQFP100
[Figure: System Diagram. Blocks shown: Cortex-M3 CPU (24 MHz) with JTAG/SW Debug, Nested vect IT Ctrl, and 1 x Systick Timer; 64kB - 128kB Flash Memory with Flash I/F; 8kB SRAM; 20B Backup Data; DMA 7 Channels; ARM® Lite Hi-Speed Bus Matrix / Arbiter (max 24MHz); Power Supply (Reg 1.8V, POR/PDR/PVD); XTAL oscillators 32KHz + 4~25MHz; Int. RC oscillators 40KHz + 8MHz; PLL; RTC / AWU; Clock Control; two ARM® Peripheral Buses (max 24MHz) behind bridges, carrying 1 x 16-bit PWM Synchronized AC Timer, 6 x 16-bit Timer, 1 x CEC, up to 16 Ext. ITs, 2 x Watchdog (independent & window), 37/51/80 I/Os, 2 x USART/LIN Smartcard / IrDa Modem Control, 2-channel 12-bit DAC, 2 x SPI, 1 x 12-bit ADC (up to 16 channels), Temperature Sensor, 1 x USART/LIN Smartcard/IrDa Modem Control, 2 x I2C.]

DSP Applications
• Audio applications
  – MPEG Audio
  – Portable audio
• Digital cameras
• Wireless
  – Cellular telephones
  – Base station
• Networking
  – Cable modems
  – ADSL
  – VDSL
Embedded computing needs lots of DSP capabilities.

DSP architectures
Application: $y[j] = \sum_{i=0}^{n-1} x[j-i] \cdot a[i]$
$\forall i: 0 \le i \le n-1: \; y_i[j] = y_{i-1}[j] + x[j-i] \cdot a[i]$
Architecture: Example: Data path ADSP210x
• P - Parallelism
• D - Dedicated registers
[Figure: ADSP210x data path. Address registers A0, A1, A2 feed an address generation unit (AGU, operations +, -) that computes i+1 and j-i+1. Operands x[j-i] and a[i] are loaded into the dedicated registers (AX, AY, MX, MY, MF, AF); the multiplier (*) forms x[j-i]*a[i]; the adder (+, -, ..) accumulates into MR, which holds y_{i-1}[j]; AR is the ALU result register.]

    MR:=0; A1:=1; A2:=n-2;
    MX:=x[n-1]; MY:=a[0];
    for (j:=1 to n)
      {MR:=MR+MX*MY;
       MY:=a[A1]; MX:=x[A2];
       A1++; A2--}

DSP - Features (1)
• Multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions (as shown)
• Heterogeneous registers (as shown)
• Separate address generation units (AGUs) (as in ADSP 210x)

Single Issue vs VLIW
[Figure: a compiler packs the instruction stream of a single-issue CPU (one op per instruction, executing 1 instr/cycle) into 3-issue VLIW instructions (three op slots per instruction, unused slots filled with nop, executing 1 instr/cycle and 3 ops/cycle). Source: Embedded Computer Architecture, H. Corporaal and B. Mesman.]

ARM Processors Families

Cortex-M4
• ARMv7E-M Architecture
• Thumb-2 only
• DSP extensions
• Optional FPU (Cortex-M4F)
• Otherwise, same as Cortex-M3
• Implements full Thumb-2 instruction set
• Saturated math (e.g. QADD)
• Packing and unpacking (e.g. UXTB)
• Signed multiply (e.g. SMULTB)
• SIMD (e.g.
ADD8)
[Figure: Cortex M3 Total 60k* Gates.]
University Program Material Copyright © ARM Ltd 2012

Binary Upwards Compatibility
[Figure: ARMv7-M Architecture, ARMv6-M Architecture.]

Cortex-M4 DSP instructions
• Remember VLIW?

Multi-processors SoCs for Embedded Computing

Application pull
[Figure: application pull. Required compute efficiency (5 GOPS/W, 10 GOPS/W, 100 GOPS/W, 1 TOPS/W) against year of introduction (2005 to 2015). Applications shown include 802.11n base-band, UWB, A/V streaming, H264 decoding and encoding, image, sign, gesture, emotion, language and expression recognition, dictation, collision avoidance, adaptive route, Gbit radio, structured encoding and decoding, HMI by motion, gesture detection, autonomous driving, ubiquitous navigation, 3D projected display, 3D ambient interaction, 3D TV, and 3D gaming. Source: [IMEC].]

Power Bottleneck
• Power trend
[Figure: power consumption (log scale, 10^-6 to 100) and physical gate length [nm] against year, 1990 to 2020. Curves: Dynamic Power, Sub-Threshold Leakage, Gate-Oxide Leakage, and a possible trajectory for high-k dielectrics.]
• Power density trend
[Figure: power density (Watts/cm²) split into Leakage Power and Dynamic Power for the 250nm, 180nm, 130nm, 90nm, and 65nm nodes. Source: [STM ASIC].]

Multi-Core & Power
[Figure: relative to a large core with cache, a small core has Power = 1/4 and Performance = 1/2. Four small cores (C1, C2, C3, C4) share a cache.]
• Multi-Core:
  – Power efficient
  – Better power and thermal management

µArchitecture Techniques
• Increase on-die memory
[Figure: cache as % of total die area (0% to 100%) for 486, Pentium®, Pentium® III, Pentium® 4, and Pentium® M, across process nodes 1u, 0.5u, 0.25u, 0.13u, 65nm.]
• Multi-threading
[Figure: a single thread (ST) leaves the hardware idle while it waits for memory; with multi-threading, threads MT1, MT2, MT3 fill the wait slots to reach full HW utilization.]
  – Improved performance, no impact on thermals & power delivery
• Chip Multi-processing
[Figure: relative performance (1 to 3.5) against die area and power (1 to 4) for a multi-core design (C1 to C4 plus cache) versus a large single core plus cache.]

Integrated SoC Mobile
• High-speed SMP for “almost sequential” GP
• “Processor arrays” for domain-specific throughput computing (100x GOPS/W), ultra parallel…

H-SOC in 2013 – Apple A7
Used in iPad Air & iPhone 5s

H-SOC in 2015/16
Tegra K1

Heterogeneous Computing in K1
Visual Analytics & Computational Photography

Accelerated (Heterogeneous) Embedded Computing

Hardware Execution Model
[Figure: a CPU with CPU Memory next to a GPU with GPU Memory. The GPU contains Core 0 to Core 15; each core contains Lane 0 to Lane 15.]
• GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor with multiple lanes but no scalar processor
• CPU sends whole “grid” over to GPU, which distributes thread blocks among cores (each thread block executes on one core)
  – Programmer unaware of number of cores

CPUs vs GPUs
[Figure: CPU with a large Control block, a few ALUs, a large Cache, and DRAM; GPU built mostly from ALUs, with DRAM.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 408, University of Illinois, Urbana-Champaign

CUDA Programmer's View of GPUs
A GPU contains multiple SIMD Units. All of them can access global memory.

Simplified CUDA Programming Model
• Computation performed by a very large number of independent small scalar threads (CUDA threads or microthreads) grouped into thread blocks.

    // C version of DAXPY loop.
    void daxpy(int n, double a, double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }

    // CUDA version.
    // Piece run on host processor.
    int nblocks = (n + 255) / 256;        // 256 CUDA threads/block
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

    // Piece run on GP-GPU. A kernel launched from the host with <<<...>>>
    // must be declared __global__.
    __global__ void daxpy(int n, double a, double *x, double *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];  // guard: the last block may overshoot n
    }

Thread Hierarchy in CUDA
• Grid contains Thread Blocks
• Thread Block contains Threads

Sharing memory
• Mobile GPUs share memory with CPU
• Converging also for general computing: “Heterogeneous System Architecture”

Energy Efficiency Again
What if the workload is not friendly to MultiProc or GPU?
[Figure: the energy-efficiency chart from the opening slide with MP and MP+GPU added.]
Efficient software design needed, otherwise, the price for software flexibility cannot be paid.
© Hugo De Man (IMEC) Philips, 2007

FPGA
Reconfigurable computing: a computer architecture combining some of the flexibility of software with the high performance of hardware by processing with very flexible high-speed computing fabrics such as field-programmable gate arrays (FPGAs).
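Returning to the CUDA DAXPY example above: the block count and per-thread index arithmetic can be checked on a CPU without a GPU. The following is a minimal sketch in plain C, not CUDA code. The names `blocks_needed` and `daxpy_grid_emulated` are hypothetical, and the two nested loops only stand in for `blockIdx.x` and `threadIdx.x`; on a GPU those iterations would run in parallel rather than sequentially.

```c
#include <assert.h>

/* Number of thread blocks needed to cover n elements, rounding up.
   Generalizes the slide's (n+255)/256 to an arbitrary block size. */
int blocks_needed(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Sequential emulation of the DAXPY grid (an illustration only).
   'block' plays the role of blockIdx.x, 'thread' of threadIdx.x,
   and 'threads_per_block' of blockDim.x. */
void daxpy_grid_emulated(int n, double a, const double *x, double *y,
                         int threads_per_block)
{
    int nblocks = blocks_needed(n, threads_per_block);
    for (int block = 0; block < nblocks; block++) {
        for (int thread = 0; thread < threads_per_block; thread++) {
            int i = block * threads_per_block + thread;
            /* Same guard as the kernel: the last block may have
               threads whose index falls beyond the array. */
            if (i < n)
                y[i] = a * x[i] + y[i];
        }
    }
}
```

With n = 5 and a block size of 2, three blocks are launched and the sixth "thread" is masked off by the `i < n` guard, which is exactly why the kernel needs that test whenever n is not a multiple of the block size.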