4x ARM Cortex A57 • 4x ARM Cortex A53 • Adreno 430 GPU • Hexagon V56 DSP • 20nm process technology

Embedded Systems Research & Future Developments
Henk Corporaal
TU/e Electrical Engineering Faculty, Electronic Systems Group
June 2014
HC – FHI DevClub

Our Group: Electronic Systems
Part of TU/e – Electrical Engineering
Mission
• Scientific basis for design trajectories
• From function to realization
• Iteration-free design
[Figure labels: medical imaging, design, sensing and actuation, circuit, chip, computer architecture, computer, fabrication, systems-on-chip, automotive, networked systems]

Our industrial cooperation

What is an Embedded System
• Wikipedia: “An embedded system is a computer system with a dedicated function within a larger mechanical or electrical system, often with real-time constraints …. Embedded systems contain processing cores …. Embedded systems control many devices in common use today”
• Everywhere: credit cards, smartphones, cars, households, etc.
• ARM sells 30 M per day!
• Billions sold every year

Overview
• Design Challenges
• Processor Trends
  − CPUs
  − GPUs
• Research
  − Multiprocessor design
  − GPU programming
  − Low power

Major challenges
• Ultra low power
• Guarantee Real-Time requirements
• Deal with tremendous complexity
• Very complex processing platforms
  − Multi-processor
  − Heterogeneous
  − Huge software stack
• Programming difficulty

Your future phone requirements
Requirements:
- 4G Communication
- 2D/3D Graphics
- Audio/Video Codec
- Image/Video Processing
Targets: > 1 TOPS, < 1 pJ/op, < 1 Watt

Intrinsic Computational Efficiency (ICE)
[Figure: ICE graph; marked points: 1 pJ/op, accelerator]

Design complexity problem
[Figure: complexity (log scale, 10^1 to 10^3) versus year (4 to 16). Process technology: +58%; HW design productivity: +21%; SW productivity: +8%. The differences are marked as the HW gap and the SW gap.]

Dealing with complexity: the design pyramid
[Figure: design pyramid from Idea to Solution. Labels: Requirements, Specification, Architecture, Implementation, Realization; abstraction level; aspect models; construct and evaluate properties; X, Y]

Design problems: Complexity!
[Figure labels: idea; video, control, comm.; abstraction level; design wall; #concepts; convergence of design space?; design flows; different realizations]

Processor Trends
[Figure by Cedric Nugteren, 2013]

Haswell (4th gen. Intel Core processor)

Intel Xeon Phi
• Intel® Xeon Phi™ Coprocessor 7120X
• 61 cores
• 1.238 GHz, 1.333 GHz Turbo
• 512-bit vector instructions
• 30.5 MB cache (512 KB / core)
• 300 W TDP
• 16 GB on-board memory
• Price: $4129

ARM Cortex A57: 64-bit

Your S4, full of ARMs
Exynos 5 Octa processor
• Big-Little concept
• Quad ARM Cortex-A15
• Quad ARM Cortex-A7
• + PowerVR GPU
or Qualcomm Snapdragon S600 with quad ARM

Qualcomm Snapdragon 810
• 4x ARM Cortex A57
• 4x ARM Cortex A53
• Adreno 430 GPU
• Hexagon V56 DSP
• 20nm process technology
• 4K capture and playback with H.264 and H.265
• Up to 55 MP Camera, dual ISP (Image Signal Processor)

PS4 / Xbox
• 8 AMD Jaguars
• L1 instr: 32 KB, 2-way; L1 data: 32 KB, 8-way; 2 MB 16-way L2 per quad
• 1152 MAD (multiply-add) units: 1.84 TFLOPS
• 8 GB GDDR5: 176 GB/s
• 28nm

Nvidia GPU overview: going many core
[Figure: core count, GFLOPS and power [W] on a log scale (64 to 8192), Jan 2006 to Jan 2014. GPUs shown: 8800 GTX (G80), 9800 GTX (G92), GTX 280 (GT200), GTX 480 (GF100), GTX 580 (GF110), GTX 680 (GK104), GTX Titan (GK110), GTX 780 Ti (GK110)]

Maxwell – GM107 (2014)
• 28nm
• SM:
  − 4x 32 cores
  − 4x 8 LD/ST units
  − 4x 8 SFUs
  − 4 warp schedulers, 2 instructions / warp
• 32 threads per warp, 1 clock cycle / warp

Very Hot: NVIDIA K1 mobile platform
• Big-Little: 4 big + 1 little ARM A15s
• Kepler GPU with 192 cores
• Supports 2160p @ 30fps
[Figure labels: Kepler 192 cores; ARM big-little]

Major challenges
• Ultra low power
• Guarantee Real-Time requirements
• Deal with tremendous complexity
• Very complex processing platforms
  − Multi-processor
  − Heterogeneous
  − Huge software stack
• Programming difficulty

Our TU/e group solutions
• Real-time Multi-processing out of the box
  − From Model to Multi-Processor / System-on-Chip: CompSoC, MAMPSx
  − Composable and Predictable design
• From C to GPUs & FPGAs
  − Characterize loop-nests: Species
  − From C to efficient GPU code
  − From C to Verilog FPGA accelerators
• Ultra Low Power
  − SIMD (vector processor) with very low Vdd
  − Avoid (external) memory traffic: Reuse data
  − Use of accelerators (Heterogeneous Processing Hardware)

MPSoC out of the box

Application Modeling: Synchronous Data Flow
[Figure: SDF graph of an MPEG-4 SP decoder. Actors: VLD, IDCT, MC, RC; channels carry tokens; rates: 99]
• Conservative, worst-case abstraction
• Strong analysis & synthesis support
• Low implementation overhead …
• … but may lead to over-allocation of resources

Modeling: Scenario-Aware Data Flow
[Figure: scenario-aware graph of an MPEG-4 SP decoder. Actors: VLD, IDCT, MC, RC; channels carry tokens; rates: x; state machine with scenarios I, P0, P99; x = {0, 30, 40, 50, 60, 70, 80, 99}]
• Dynamics captured in scenarios
• Efficient implementations
• Design and Modeling Tools available: SDF3

MAMPSx
Multiprocessor Framework for Real-Time correct-by-construction Systems

GPU Programming
Electronic Systems

GPUs: hundreds of processors

However: Programming hell, from C to CUDA

Programming heaven? Or FPGA?

Good news: Lots of similarity

Algorithmic Species

Our performance (compared to other fully automatic compilers)

Ultra Low Power

SIMD + low Vdd + New Memory Architecture
• Exploiting Data Locality
• Extreme low Voltage
• Massive parallelism

Results on the ICE Graph
[Figure: ICE graph; marked point: 1 pJ/op]

Can we do better?
• Yes! How?
• Supporting special functionality
  − Accelerators
• Change your software
  − Advanced code (loop) transformations

E.g. Data Reuse in Convolutional Neural Networks
CNN used for e.g. Recognizing Traffic Signs
[Figure: CNN pipeline. Feature extraction: input 32 x 32; 5x5 convolution to C1 feature maps 28 x 28; 2x2 subsampling to S1 feature maps 14 x 14; 5x5 convolution to C2 feature maps 10 x 10; 2x2 subsampling to S2 feature maps 5 x 5. Classification: 5x5 convolution to n1; 1x1 convolution to n2; output sign: 30, 50, 60, 70, 80, 90, 100]

Data reuse: Bandwidth reduction
[Figure: #external memory accesses (x 10^6, log scale 1 to 1000) versus on-chip memory size; curve: original; annotation: 327x]
• However: Huge on-chip memory required

After using our code transformer:
[Figure: #external memory accesses (x 10^6, log scale 1 to 1000) versus on-chip memory size; curves: original, model, simulated; points: loop n, loop m, loop q, loop r]
• 64 times memory reduction / same BW reduction

Convolutional Neural Network FPGA prototype
• Synthesized with Vivado HLS (AutoESL) C -> HDL
• Extreme (linear) speedup (with number of MACC PEs)
[Figure: accelerator block diagram for CNN Layer 3. Labels: FSL; In Ctrl; in_img, weight, bias; array of MACC PEs; Activation LUT; Select; saturate; Out Ctrl; out_img; DDR]

Summary
• Embedded systems become extremely complex
• Processor trends
• Advanced methods and tooling needed for
  − Programming efficiency
  − Design Space Exploration
  − Predictable Design: guaranteeing timing properties
  − Correct by Construction
• Researching solutions
  − Automated MPSoC design
  − Automated C to target platform
  − Ultra low power
• We FPGA prototype everything

Cheap prototype boards

Minnowboard MAX
99 to 139 USD

Minnowboard MAX
Number of cores: Single/Dual core 64-bit Intel® Atom™ E38xx Series SoC @ 1.46 / 1.33 GHz
RAM: 1/2 GB DDR3 RAM
Flash: 8 MByte SPI Flash; External SD card
Caches: 32 KByte I-Cache / 24 KByte D-Cache; 512 KByte L2 cache
GPU: Intel integrated HD Graphics
I/O:
• DVI over HDMI connector
• Audio
• SATA2 3 Gb/sec
• 2 USB (host) & 1 USB-B (device; slave)
• Serial debug: Serial (UART 0) to USB conversion (mini-USB-B port)
• 10/100/1000 Ethernet RJ-45 connector
• 8 Buffered GPIO (General Purpose I/O) pins
• SPI & I2C
Special features: 7-Issue slot VLIW from SiliconHive (INTEL)
Price: 99 USD / 139 USD

Raspberry Pi
35 USD

Raspberry Pi
Number of cores: 1 Broadcom BCM2835 (CPU: ARM11 @ 700 MHz + GPU + DSP)
RAM: 512 MByte
Flash: External SD card
Caches: 16 KB (Instruction) / 16 KB (Data) L1 Cache
GPU: Broadcom VideoCore IV
I/O:
• 2 USB ports
• Audio out
• HDMI video out
• Composite video out
• 10/100 Mbit Ethernet
• GPIO
Special features: Very cheap, relatively small
Price: 35 USD

Jetson TK1 (using TEGRA K1)
192 USD

Jetson TK1
Number of cores: 4-Plus-1 quad-core ARM: 4 Cortex A15 + A7 CPU @ 2.3 GHz
RAM: 2 GByte with 64 bit width
Flash: 16 GB 4.51 eMMC memory & External SD card
Caches: 32 KB I-cache, 32 KB D-cache; 4 MByte L2 cache
GPU: Kepler GPU with 192 CUDA cores @ 950 MHz
I/O:
• Half mini-PCIE slot
• Full-size SD/MMC connector
• Full-size HDMI port
• 1x USB 2.0, 1x USB 3.0, 1x RS232
• Audio & Gigabit Ethernet
• Via expansion port: GPIO, UART, SPI, I2C
Special features: Very fast mobile GPU platform, OpenCL programmable
Price: 192 USD

Google Tango (using TEGRA K1)
1024 USD

Google Tango
Number of cores: 4-Plus-1 quad-core ARM: 4 Cortex A15 + A7 CPU @ 2.3 GHz
RAM: 4 GB of RAM
Flash: 128 GB
Caches: 32 KB I-cache, 32 KB D-cache; 4 MByte L2 cache
GPU: Kepler
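
Worked check of quoted figures
Several numbers in the slides can be sanity-checked with simple arithmetic: the phone budget (1 TOPS at 1 pJ/op against the 1 Watt limit), the Xeon Phi cache total (61 cores at 512 KB each), and ARM volumes (30 M per day against "billions sold every year"). The sketch below is not part of the original deck. The function names are hypothetical, and it assumes 1 TOPS = 10^12 operations per second, 1 MB = 1024 KB, and a 365-day year.

```python
# Back-of-the-envelope checks on figures quoted in the slides.
# Function names are illustrative only; unit conventions are assumptions
# (1 TOPS = 1e12 op/s, 1 MB = 1024 KB, 365 days per year).

def power_watts(ops_per_second: float, joules_per_op: float) -> float:
    """Average power = operation rate x energy per operation."""
    return ops_per_second * joules_per_op


def total_cache_mbyte(cores: int, kbyte_per_core: int) -> float:
    """Aggregate cache size in MB when every core has the same slice."""
    return cores * kbyte_per_core / 1024


def units_per_year(units_per_day: float, days: int = 365) -> float:
    """Yearly volume implied by a daily shipment rate."""
    return units_per_day * days


# Phone requirement slide: 1 TOPS at 1 pJ/op.
phone_power = power_watts(1e12, 1e-12)

# Xeon Phi slide: 61 cores, 512 KB of cache per core.
phi_cache = total_cache_mbyte(61, 512)

# Embedded-system slide: 30 M ARM units per day.
arm_per_year = units_per_year(30e6)

print(f"Phone power budget: {phone_power:.2f} W")
print(f"Xeon Phi cache total: {phi_cache} MB")
print(f"ARM units per year: {arm_per_year / 1e9:.2f} billion")
```

Under these assumptions the three slide figures are mutually consistent: the 1 TOPS and 1 pJ/op targets together sit exactly at the 1 Watt limit, which is why the energy-per-operation target is the binding constraint in the ICE discussion.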
