Embedded Systems Research & Future Developments
Henk Corporaal
TU/e Electrical Engineering Faculty Electronic Systems Group
June 2014
HC – FHI DevClub
Our Group: Electronic Systems
Part of TU/e – Electrical Engineering
Mission
• Scientific basis for design trajectories
• From function to realization
• Iteration-free design
Figure labels: medical imaging, design, sensing and actuation, circuit, computer architecture, computer fabrication, systems-on-chip, automotive, networked systems
Our industrial cooperation
What is an Embedded System
• Wikipedia: “An embedded system is a computer system with a dedicated function within a larger mechanical or electrical system, often with real-time constraints … Embedded systems contain processing cores … Embedded systems control many devices in common use today”
• Everywhere: credit cards, smartphones, cars, households, etc.
• 30 M ARM processors sold per day!
• Billions sold every year
Overview
• Design Challenges
• Processor Trends
  • CPUs
  • GPUs
• Research
  • Multiprocessor design
  • GPU programming
  • Low power
Major challenges
• Ultra low power
• Guarantee Real-Time requirements
• Deal with tremendous complexity
  • Very complex processing platforms
    − Multi-processor
    − Heterogeneous
    − Huge software stack
  • Programming difficulty
Your future phone requirements
Requirements:
- 4G Communication
- 2D/3D Graphics
- Audio/Video Codec
- Image/Video Processing
> 1 TOPS, < 1 pJ/op, < 1 Watt
Intrinsic Computational Efficiency (ICE)
Figure: ICE graph (labels: 1 pJ/op, accelerator)
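The 1 pJ/op target follows from the phone requirements: average power is throughput times energy per operation, so 1 TOPS within 1 Watt leaves 1 pJ for each operation. A minimal sketch of that arithmetic, using exact fractions to avoid rounding. The function names are illustrative, and the 100 pJ/op comparison value is an assumed example, not a figure from the slides.

```python
from fractions import Fraction

TERA = 10**12
PICOJOULE = Fraction(1, 10**12)  # one picojoule, expressed in joules


def power_watts(ops_per_second, joules_per_op):
    """Average power = throughput * energy per operation."""
    return ops_per_second * joules_per_op


def max_joules_per_op(power_budget_watts, ops_per_second):
    """Largest energy per operation that still fits the power budget."""
    return Fraction(power_budget_watts) / ops_per_second


# Slide targets: 1 TOPS within 1 Watt.
print(max_joules_per_op(1, TERA) == PICOJOULE)  # True: the budget is 1 pJ/op

# Assumed example: a platform spending 100 pJ/op would need 100 W at 1 TOPS.
print(power_watts(TERA, 100 * PICOJOULE))       # 100
```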
Design complexity problem
Figure: complexity (log scale, 10^1 to 10^3) versus year (4 to 16)
• Process technology: +58%
• HW design productivity: +21%
• SW productivity: +8%
• Marked gaps: HW gap, SW gap
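To get a feel for how quickly these curves diverge, the sketch below compounds the three growth rates. It assumes the percentages are annual growth rates and that a gap can be read as the ratio between two adjacent curves; both are interpretations of the figure, not statements from the slides.

```python
def growth_factor(percent_per_year, years):
    """Relative growth after `years` of compounding (assumed annual rate)."""
    return (1 + percent_per_year / 100) ** years


def productivity_gap(fast_percent, slow_percent, years):
    """Ratio between a fast-growing and a slow-growing curve after `years`."""
    return growth_factor(fast_percent, years) / growth_factor(slow_percent, years)


# Tabulate the two gaps at the years shown on the figure's axis.
for years in (4, 8, 12, 16):
    hw_gap = productivity_gap(58, 21, years)  # process technology vs HW design
    sw_gap = productivity_gap(21, 8, years)   # HW design vs SW productivity
    print(f"year {years:2d}: technology/HW = {hw_gap:7.1f}x, HW/SW = {sw_gap:5.1f}x")
```

Under these assumptions the technology/HW ratio already exceeds an order of magnitude after ten years, which is why the slides call for more abstraction and automation rather than more designers.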
Dealing with complexity: the design pyramid
Figure: design pyramid. Labels: Idea, Requirements, Specification, Architecture, Implementation, Realization, Solution; abstraction level; aspect models X and Y; construct and evaluate properties
Design problems: Complexity!
Figure: from idea to different realizations. Labels: video, control, comm.; abstraction level; design wall; #concepts?; convergence of design flows; design space
Processor Trends
By Cedric Nugteren, 2013
Haswell (4th gen. Intel Core processor)
Intel Xeon Phi
• Intel® Xeon Phi™ Coprocessor 7120X
• 61 cores
• 1.238 GHz, 1.333 GHz Turbo
• 512-bit vector instructions
• 30.5 MB cache (512 KB / core)
• 300 W TDP
• 16 GB on-board memory
• Price: $4129
ARM Cortex A57: 64-bit
Your S4, full of ARMs
Exynos 5 Octa processor
• Big – Little concept
• Quad ARM Cortex-A15
• Quad ARM Cortex-A7
• + PowerVR GPU
or Qualcomm Snapdragon S600 with quad ARM
Qualcomm Snapdragon 810
• 4x ARM Cortex A57
• 4x ARM Cortex A53
• Adreno 430 GPU
• Hexagon V56 DSP
• 20nm process technology
• 4K capture and playback with H.264 and H.265
• Up to 55 MP Camera, dual ISP (Image Signal Processor)
PS4 / Xbox
• 8 AMD Jaguar cores
• L1 instr: 32 KB, 2-way; L1 data: 32 KB, 8-way; 2 MB 16-way L2 per quad
• 1152 MAD (multiply-add) units: 1.84 TFLOPS
• 8 GB GDDR5: 176 GB/s
• 28nm
Nvidia GPU overview: going many core
Figure: core count, GFLOPS and power [W] (log scale, 64 to 8192) from Jan 2006 to Jan 2014 for the 8800 GTX (G80), 9800 GTX (G92), GTX 280 (GT200), GTX 480 (GF100), GTX 580 (GF110), GTX 680 (GK104), GTX Titan (GK110) and GTX 780 Ti (GK110)
Maxwell – GM107 (2014)
• 28nm
• SM:
  • 4x 32 cores
  • 4x 8 LD/ST units
  • 4x 8 SFUs
• 4 warp schedulers
  − 2 instructions / warp
• 32 threads per warp
• 1 clock cycle / warp
Very Hot: NVIDIA K1 mobile platform
• Big-Little: 4 big + 1 little ARM A15s
• Kepler GPU with
  • 192 cores
  • Supports 2160p @ 30fps
Figure labels: Kepler, 192 cores; ARM big-little
Major challenges
• Ultra low power
• Guarantee Real-Time requirements
• Deal with tremendous complexity
  • Very complex processing platforms
    − Multi-processor
    − Heterogeneous
    − Huge software stack
  • Programming difficulty
Our TU/e group solutions
• Real-time Multi-processing out of the box
  • from Model to Multi-Processor / System-on-Chip: CompSoC, MAMPSx
  • Composable and Predictable design
• From C to GPUs & FPGAs
  • Characterize loop-nests: Species
  • from C to efficient GPU code
  • from C to Verilog FPGA accelerators
• Ultra Low Power
  • SIMD (vector processor) with very low Vdd
  • Avoid (external) memory traffic: Reuse data
  • Use of accelerators (Heterogeneous Processing Hardware)
MPSoC out of the box
Application Modeling: Synchronous Data Flow
Figure: SDF graph of an MPEG-4 SP decoder. Actors VLD, IDCT, MC and RC are connected by channels that carry tokens; rates of 99 are annotated
• Conservative, worst-case abstraction
• Strong analysis & synthesis support
• Low implementation overhead …
• … but may lead to over-allocation of resources
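Part of the analysis support of Synchronous Data Flow comes from its fixed rates: the balance equations (tokens produced equals tokens consumed on every channel) can be solved at design time to get a repetition vector. The sketch below is a minimal solver, not the SDF3 implementation. The actor names are taken from the slide, but the channel rates are illustrative assumptions, since the rates in the figure did not survive extraction.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, channels):
    """Solve the SDF balance equations.

    channels: list of (producer, production_rate, consumer, consumption_rate).
    Returns the smallest positive integer firing count per actor. Raises
    ValueError if the graph is disconnected or the rates are inconsistent.
    """
    ratio = {actors[0]: Fraction(1)}  # firing ratios relative to the first actor
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in channels:
            if src in ratio and dst not in ratio:
                ratio[dst] = ratio[src] * prod / cons
                changed = True
            elif dst in ratio and src not in ratio:
                ratio[src] = ratio[dst] * cons / prod
                changed = True
            elif src in ratio and dst in ratio:
                # Balance equation: produced tokens must equal consumed tokens.
                if ratio[src] * prod != ratio[dst] * cons:
                    raise ValueError(f"inconsistent rates on {src} -> {dst}")
    if len(ratio) != len(actors):
        raise ValueError("graph is not connected")
    # Scale by the least common multiple of the denominators to get integers.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in ratio.values()), 1)
    return {actor: int(ratio[actor] * scale) for actor in actors}


# Illustrative rates (assumed): MC consumes 99 tokens per firing.
actors = ["VLD", "IDCT", "MC", "RC"]
channels = [("VLD", 1, "IDCT", 1), ("IDCT", 1, "MC", 99), ("MC", 1, "RC", 1)]
print(repetition_vector(actors, channels))
# {'VLD': 99, 'IDCT': 99, 'MC': 1, 'RC': 1}
```

With the repetition vector known, buffer sizes and a static schedule can be derived before the system runs, which is where the worst-case guarantees and the risk of over-allocation both come from.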
Modeling: Scenario-Aware Data Flow
Figure: SADF graph of the MPEG-4 SP decoder. Actors VLD, IDCT, MC and RC are connected by channels that carry tokens; rates are parameterized by x, with x = {0, 30, 40, 50, 60, 70, 80, 99}; a state machine selects the scenario (states I, P0, P99)
• Dynamics captured in scenarios
• Efficient implementations
• Design and Modeling Tools available: SDF3
MAMPSx Multiprocessor Framework for Real-Time correct-by-construction Systems
GPU Programming
GPUs: hundreds of processors
However: Programming hell, from C to CUDA
Programming heaven? FPGA
Or FPGA?
Good news: Lots of similarity
Algorithmic Species
Our performance (compared to other fully automatic compilers)
Ultra Low Power
SIMD + low Vdd + New Memory Architecture
• Exploiting Data Locality
• Extreme low Voltage
• Massive parallelism
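The combination of low voltage and massive parallelism can be motivated with the standard first-order model of dynamic switching energy, E = C * Vdd^2. Lowering Vdd cuts energy per operation quadratically but also lowers the achievable clock frequency, and extra SIMD lanes restore the throughput. The sketch below assumes this model and ignores leakage, which becomes significant at very low voltages. The voltages and frequencies are assumed example values, not measurements from the slides.

```python
import math


def dynamic_energy_ratio(vdd_low, vdd_nominal):
    """Energy per operation at vdd_low relative to vdd_nominal (E = C * Vdd^2)."""
    return (vdd_low / vdd_nominal) ** 2


def lanes_for_same_throughput(freq_nominal_hz, freq_low_hz, lanes_nominal=1):
    """SIMD lanes needed at the lower clock to keep operations/second unchanged."""
    return math.ceil(lanes_nominal * freq_nominal_hz / freq_low_hz)


# Assumed example: scale from 1.0 V at 1 GHz down to 0.5 V at 100 MHz.
print(dynamic_energy_ratio(0.5, 1.0))                         # 0.25
print(lanes_for_same_throughput(1_000_000_000, 100_000_000))  # 10
```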
Results on the ICE Graph
Figure: ICE graph with the 1 pJ/op level marked
Can we do better?
• Yes! How?
• Supporting special functionality
  • Accelerators
• Change your software
  • Advanced code (loop) transformations
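As a small illustration of what a data-reuse transformation buys, the sketch below runs the same 2D convolution twice: once reading every operand from external memory, and once with an on-chip buffer that holds the most recent kernel-height input rows. This is a hand-written example of the general idea, not the group's code transformer. The `ExternalMemory` class, the function names and the 32 x 32 image with a 5 x 5 kernel are all assumptions chosen for illustration.

```python
class ExternalMemory:
    """Wraps a 2D image and counts every word read from it."""

    def __init__(self, image):
        self.image = image
        self.reads = 0

    def read(self, y, x):
        self.reads += 1
        return self.image[y][x]


def conv_naive(mem, n, kernel):
    """Every multiply-accumulate fetches its input from external memory."""
    k = len(kernel)
    return [[sum(kernel[i][j] * mem.read(y + i, x + j)
                 for i in range(k) for j in range(k))
             for x in range(n - k + 1)]
            for y in range(n - k + 1)]


def conv_row_buffered(mem, n, kernel):
    """Keeps k input rows on chip (k * n words); each input is fetched once."""
    k = len(kernel)
    rows = [[mem.read(y, x) for x in range(n)] for y in range(k - 1)]
    out = []
    for y in range(n - k + 1):
        # Fetch the one new row needed for this output row.
        rows.append([mem.read(y + k - 1, x) for x in range(n)])
        out.append([sum(kernel[i][j] * rows[i][x + j]
                        for i in range(k) for j in range(k))
                    for x in range(n - k + 1)])
        rows.pop(0)  # the oldest row is no longer needed
    return out


n, kernel = 32, [[1] * 5 for _ in range(5)]
image = [[(3 * y + x) % 7 for x in range(n)] for y in range(n)]
naive_mem, buffered_mem = ExternalMemory(image), ExternalMemory(image)
assert conv_naive(naive_mem, n, kernel) == conv_row_buffered(buffered_mem, n, kernel)
print(naive_mem.reads, buffered_mem.reads)  # 19600 1024
```

Both versions compute identical results. The buffered one trades 5 x 32 words of on-chip storage for roughly 19 times fewer external reads in this toy case, which is the same trade-off between on-chip memory size and external accesses that a code transformer has to navigate.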
E.g. Data Reuse in Convolutional Neural Networks
CNNs are used for, e.g., recognizing traffic signs
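Figure: CNN for traffic sign recognition. Input 32 x 32; C1 (5x5 convolution) feature maps 28 x 28; S1 (2x2 subsampling) feature maps 14 x 14; C2 (5x5 convolution) feature maps 10 x 10; S2 (2x2 subsampling) feature maps 5 x 5; n1 (5x5 convolution); n2 (1x1 convolution); output sign: 30, 50, 60, 70, 80, 90, 100. C1 to S2 perform feature extraction; n1 and n2 perform classification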
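The feature-map sizes in this network follow from the layer types alone. The sketch below reproduces them, assuming convolutions without padding at stride 1 and non-overlapping subsampling windows. The slide does not state these assumptions, but they yield exactly the sizes shown in the figure.

```python
def conv_out(size, kernel):
    """Output width of a convolution without padding, stride 1."""
    return size - kernel + 1


def subsample_out(size, factor):
    """Output width of non-overlapping subsampling."""
    return size // factor


def feature_map_sizes(input_size, layers):
    """Widths of the square feature maps after each layer, input included."""
    sizes = [input_size]
    for kind, k in layers:
        step = conv_out if kind == "conv" else subsample_out
        sizes.append(step(sizes[-1], k))
    return sizes


# C1, S1, C2, S2, n1, n2 as labelled in the figure.
layers = [("conv", 5), ("sub", 2), ("conv", 5), ("sub", 2), ("conv", 5), ("conv", 1)]
print(feature_map_sizes(32, layers))  # [32, 28, 14, 10, 5, 1, 1]
```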
Data reuse: Bandwidth reduction 327x
Figure: #External memory accesses (x 10^6, log scale from 1 to 1000) versus on-chip memory size; curve: original
• However: Huge on-chip memory required
After using our code transformer:
Figure: #External memory accesses (x 10^6, log scale from 1 to 1000) versus on-chip memory size; legend: original, model, simulated, loop n, loop m, loop q, loop r
• 64 times memory reduction / same BW reduction
Convolutional Neural Network FPGA prototype
• Synthesized with Vivado HLS (AutoESL) C -> HDL
• Extreme (Linear) speedup (with number of MACC PEs)
Figure: block diagram of the accelerator for CNN layer 3. Labels: FSL; In Ctrl (in_img, weight, bias); array of MACC PEs; Out Ctrl (out_img); Activation LUT; Select; saturate; DDR
Summary
• Embedded Systems become extremely complex
• Processor trends
• Advanced methods and tooling needed for
  • Programming efficiency
  • Design Space Exploration
  • Predictable Design: guaranteeing timing properties
  • Correct by Construction
• Researching solutions
  • Automated MPSoC design
  • Automated C to target platform
  • Ultra low power
• We FPGA prototype everything
Cheap prototype boards
Minnowboard MAX
99 to 139 USD
Minnowboard MAX
Number of cores: Single/Dual core 64-bit Intel® Atom™ E38xx Series SoC @ 1.46 / 1.33 GHz
RAM: 1/2 GB DDR3 RAM
Flash: 8 MByte SPI Flash; external SD card
Caches: 32 KByte I-Cache / 24 KByte D-Cache; 512 KByte L2 cache
GPU: Intel integrated HD Graphics
I/O: DVI over HDMI connector; Audio; SATA2 3Gb/sec; 2 USB (host) & 1 USB-B (device; slave); Serial debug: Serial (UART 0) to USB conversion (mini-USB-B port); 10/100/1000 Ethernet RJ-45 connector; 8 Buffered GPIO (General Purpose I/O) pins; SPI & I2C
Special features: 7-issue slot VLIW from SiliconHive (INTEL)
Price: 99 USD / 139 USD
Raspberry Pi
35 USD
Raspberry Pi
Number of cores: 1 Broadcom BCM2835 (CPU: ARM11 @ 700 MHz + GPU + DSP)
RAM: 512 MByte
Flash: External SD card
Caches: 16 KB (Instruction) / 16 KB (Data) L1 Cache
GPU: Broadcom VideoCore IV
I/O: 2 USB ports; Audio out; HDMI video out; Composite video out; 10/100 Mbit Ethernet; GPIO
Special features: Very cheap, relatively small
Price: 35 USD
Jetson TK1 (using TEGRA K1)
192 USD
Jetson TK1
Number of cores: 4-Plus-1 quad-core ARM: 4 Cortex A15 + A7 CPU @ 2.3 GHz
RAM: 2 GByte with 64 bit width
Flash: 16 GB 4.51 eMMC memory & external SD card
Caches: 32 KB I-cache, 32 KB D-cache; 4 MByte L2 cache
GPU: Kepler GPU with 192 CUDA cores @ 950 MHz
I/O: Half mini-PCIE slot; Full-size SD/MMC connector; Full-size HDMI port; 1x USB 2.0, 1x USB 3.0, 1x RS232; Audio & Gigabit Ethernet; via expansion port: GPIO, UART, SPI, I2C
Special features: Very fast mobile GPU platform, OpenCL programmable
Price: 192 USD
Google Tango (using TEGRA K1)
1024 USD
Google Tango
Number of cores: 4-Plus-1 quad-core ARM: 4 Cortex A15 + A7 CPU @ 2.3 GHz
RAM: 4 GB
Flash: 128 GB
Caches: 32 KB I-cache, 32 KB D-cache; 4 MByte L2 cache
GPU: Kepler GPU with 192 CUDA cores @ 950 MHz
I/O: Motion tracking camera; Depth sensing; WiFi; Bluetooth Low Energy; 4G
Special features: Kinect-like vision (depth) and powerful processor
Price: 1024 USD
Bugblat PIF FPGA extension board for Raspberry Pi
35 USD
Bugblat PIF
Number of cores: No CPU cores, but (Lattice) FPGA with 6864 LUTs
RAM: 26x 9 Kbit SRAM blocks (in FPGA)
Flash: 256 Kbit in FPGA
Caches: N/A
GPU: N/A
I/O: 47 GPIO pins; red and green LEDs
Special features: FPGA expansion board for Raspberry Pi
Price: 34.99 USD
Arndale board
259 USD
Arndale board
Number of cores: Cortex-A15 @ 1.7 GHz dual core subsystem with 64/128 bit SIMD NEON (Octo 4+4 ARM A15-A9 version available)
RAM: 32-bit 800 MHz DDR3(L)/DDR3, 2 GBytes
Flash: External memory card
Caches: 32 KB (instruction) / 32 KB (data) L1 cache; 1 MB L2 cache
GPU: ARM Mali T-604 GPU
I/O: SATA 1.0/2.0/3.0 interface; ITU 601 camera interface; HDMI 1.4 interfaces with on-chip PHY; USB 2 & 3; four-channel high-speed UART; three-channel high-speed SPI; three-channel 24-bit I2S audio interface; four-channel I2C interface
Special features: First OpenCL programmable GPU: Mali T-604 GPU board
Price: 259 USD
Arduino Duemilanove (2009)
15 EURO
Arduino Duemilanove (2009)
CPU: Single core 8-bit AVR microcontroller (32-bit version exists)
RAM: 2 kB
Flash: 32 kB
Caches: N/A
GPU: N/A
I/O: 23 GPIO / UART / ADC / SPI (Serial Peripheral Interface) / I2C / PWM
Special features: Arduino compatible, USB programmable
Price: 15 EUR
LPC Xpresso
20 EURO
LPC Xpresso
CPU: Single core ARM Cortex M0 (from NXP)
RAM: 8 kB
Flash: 32 kB
Caches: N/A
GPU: N/A
I/O: GPIO, SSP (Synchronous Serial Port), I2C, UART, ADC
Special features: Integrated JTAG debugger
Price: 20 EUR
Mbed LPC1768
55 EURO
Mbed LPC1768
CPU: Single core ARM Cortex M3
RAM: 32 kB
Flash: 512 kB
Caches: N/A
GPU: N/A
I/O: Ethernet, USB Host/Device, SPI, I2C, UART, CAN, PWM, ADC, GPIO
Special features: Online Compiler (Drag-and-Drop source code) / Built-in USB drag 'n' drop FLASH programmer
Price: 55 EUR
Helios
99 EURO
Helios
Number of cores: Single core 8-bit AVR microcontroller
RAM: 2.5 KBytes
Flash: 32 KBytes
Caches: N/A
GPU: N/A
I/O: GPIO (General Purpose I/O); UART; SPI (Serial Peripheral Interface); I2C (Inter-Integrated Circuit); ADC (Analog to Digital Converter); PWM (Pulse Width Modulation); WiFi; LED display; RGB LED
Special features: Color sensor, temperature sensor and Real Time Clock; Arduino compatible
Price: 99 EUR (or free as gadget at Electronics & Automation 2013)
Freescale Freedom board
7,25 EURO
Freescale Freedom board
Number of cores: Single core Cortex-M0+ @ 48 MHz
RAM: 16 KB SRAM
Flash: 128 KByte
Caches: N/A
GPU: N/A
I/O: USB OTG; capacitive touch “slider”; MMA8451Q accelerometer; tri-color LED; PWM; ADC; SPI/I2C/UART; GPIO
Special features: Arduino compatible header, on-board accelerometer, very cheap
Price: 7,25 EUR
Zedboard (Xilinx)
319 USD
Zedboard
Number of cores: Dual ARM® Cortex™-A9 MPCore™ @ 667 MHz + FPGA: 85k LUTs + 220 (18-bit) DSPs
RAM: 512 MB DDR3; 64 KB of scratch RAM
Flash: 256 Mb Quad-SPI Flash & external SD card
Caches: L1 32 KB I-Cache / 32 KB D-Cache; 512 KB L2 Cache
GPU: N/A
I/O: Onboard USB-JTAG programming; 10/100/1000 Ethernet; USB OTG 2.0 and USB-UART; PS & PL I/O expansion (FMC, Pmod™ compatible, XADC); multiple displays (1080p HDMI, 8-bit VGA, 128 x 32 OLED); I2S audio CODEC
Special features: Dual core ARM & FPGA System on Chip, on-board OLED
Price: 319 USD (academic price); normal price 495 USD
DE1-SOC (Altera)
150 USD
DE1-SOC
Number of cores: Dual ARM® Cortex™-A9 @ 800 MHz (925 MHz max) + FPGA 85k LUTs + 160 (64-bit Floating Point) DSPs
RAM: 1 GB (2x256Mx16) DDR3 SDRAM on HPS; 64 MB (32Mx16) SDRAM on FPGA; 64 KB of scratch RAM
Flash: External MicroSD card
Caches: 32 KB of L1 instruction cache, 32 KB of L1 data cache (per core); 512 KB of shared L2 cache
GPU: N/A (or on FPGA)
I/O: Two-port USB 2.0 host (ULPI interface with USB type A connector) & USB to UART; 10/100/1000 Ethernet; PS/2 mouse/keyboard; IR emitter/receiver; one 10-pin ADC input header; one LTC connector (one SPI master, one I2C and one GPIO interface); 24-bit VGA DAC & 24-bit audio CODEC; line-in, line-out, and microphone-in jacks; TV decoder (NTSC/PAL/SECAM) and TV-in connector
Special features: Dual core ARM & FPGA System on Chip, on-board
Cyclone V GX starter kit (Altera FPGA)
180 USD
Cyclone V GX starter kit
Number of cores: No CPU cores; FPGA only, 115 kLUTs + 150 (64-bit Floating Point) DSPs
RAM: 512 MByte LPDDR2, 32-bit data bus; 512 KByte SRAM, 16-bit data bus
Flash: External SD card & EPCQ256 on FPGA
Caches: N/A
GPU: N/A
I/O: UART to USB; HSMC x 1, including 4-lane 3.125G transceiver; 2x20 GPIO header; Arduino header, including analog pins; SMA x 4, one-lane 3.125G transceiver; HDMI TX, compatible with DVI v1.0 and HDCP v1.4; 24-bit CODEC; line-in, line-out, and microphone-in jacks; 8-channel ADC
Special features: FPGA with 5x 3.125 Gbit/sec transceivers
Price: 180 USD
Beagleboard XM
149 USD
Beagleboard XM
Number of cores: TI DM3730: ARM Cortex-A8 @ 1 GHz; TMS320C64x+ @ 800 MHz (DSP, VLIW, 8 FU, 6 ALU, 2 MUL)
RAM: 512 MB LPDDR RAM; 64 KByte SRAM inside ARM
Flash: External SD card
Caches: ARM: L1 data & instruction 32 KByte, L2 256 KByte; DSP: L1 data (80 KByte) & instruction (32 KByte), L2 112 KByte
GPU: PowerVR SGX 2D/3D
I/O: DVI-D & S-Video; USB OTG (mini AB); 4 USB ports; Ethernet port; MicroSD/MMC card slot; stereo in and out jacks; RS-232 port; camera port; expansion port
Special features: DSP, ARM & GPU on single chip, special camera interface
Price: 149 USD