Embedded Systems Research & Future Developments

Henk Corporaal

TU/e Electrical Engineering Faculty Electronic Systems Group

June 2014

Our Group: Electronic Systems

Part of TU/e – Electrical Engineering

Mission
• Scientific basis for design trajectories
• From function to realization
• Iteration-free design

[Figure labels: medical imaging, design, sensing and actuation circuit, computer architecture, computer fabrication, systems-on-chip, automotive, networked systems]

HC – FHI DevClub

Our industrial cooperation

What is an embedded system?

• Wikipedia:
  • “An embedded system is a computer system with a dedicated function within a larger mechanical or electrical system, often with real-time constraints …. Embedded systems contain processing cores …. Embedded systems control many devices in common use today”

• Everywhere: credit cards, smartphones, cars, households, etc.
  • ARM sells 30 million per day!
  • Billions sold every year

Overview

• Design Challenges

• Processor Trends
  • CPUs
  • GPUs

• Research
  • Multiprocessor design
  • GPU programming
  • Low power

Major challenges

• Ultra low power

• Guarantee Real-Time requirements

• Deal with tremendous complexity
  • Very complex processing platforms
    − Multi-processor
    − Heterogeneous
    − Huge software stack
  • Programming difficulty

Your future phone requirements

Requirements:
- 4G Communication
- 2D/3D Graphics
- Audio/Video Codec
- Image/Video Processing

→ > 1 TOPS, < 1 pJ/op
→ < 1 Watt

Intrinsic Computational Efficiency (ICE)

[Figure: ICE graph; labels: 1 pJ/op, accelerator]
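The three phone targets are linked by one relation: power equals throughput times energy per operation. The sketch below only checks that arithmetic; the function and constant names are illustrative and do not come from the slides.

```python
def power_watts(ops_per_second: float, joules_per_op: float) -> float:
    """Average power = throughput x energy per operation."""
    return ops_per_second * joules_per_op


def max_energy_per_op(power_budget_watts: float, ops_per_second: float) -> float:
    """Largest energy per operation that still fits the power budget."""
    return power_budget_watts / ops_per_second


TOPS = 1e12   # one tera-operation per second
PJ = 1e-12    # one picojoule

# 1 TOPS at 1 pJ/op lands on the 1 W budget of the phone.
phone_power = power_watts(1 * TOPS, 1 * PJ)
print(f"{phone_power:.2f} W")

# Conversely, a 1 W budget at 1 TOPS leaves 1 pJ for every operation.
budget_pj = max_energy_per_op(1.0, 1 * TOPS) / PJ
print(f"{budget_pj:.2f} pJ/op")
```

Any processing style that spends more than about 1 pJ per operation therefore cannot deliver 1 TOPS inside a 1 W envelope.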

Design complexity problem

[Figure: complexity (log scale, 10^1 to 10^3) versus year (4, 8, 12, 16). Curves: process technology +58 %; HW design productivity +21 %; SW productivity +8 %. The widening differences between the curves are marked "HW gap" and "SW gap".]
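To get a feeling for how fast these gaps open up, the growth rates from the figure can be compounded. This is a minimal sketch that assumes the percentages are yearly rates and that all three curves start from the same point in year 0; both are assumptions, as are the names used.

```python
# Growth rates read from the figure; treated here as yearly rates (assumption).
RATES = {
    "process technology": 0.58,
    "HW design productivity": 0.21,
    "SW productivity": 0.08,
}


def growth(rate: float, years: int) -> float:
    """Compound growth factor after the given number of years."""
    return (1.0 + rate) ** years


def gaps(years: int) -> tuple[float, float]:
    """Return (HW gap, SW gap) as ratios between neighbouring curves."""
    tech = growth(RATES["process technology"], years)
    hw = growth(RATES["HW design productivity"], years)
    sw = growth(RATES["SW productivity"], years)
    return tech / hw, hw / sw


for years in (4, 8, 12, 16):
    hw_gap, sw_gap = gaps(years)
    print(f"year {years:2d}: HW gap x{hw_gap:7.1f}, SW gap x{sw_gap:5.1f}")
```

Under these assumptions the ratio between what technology offers and what designers can exploit grows exponentially, which is the motivation for raising the level of design abstraction.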

Dealing with complexity: the design pyramid

[Figure: design pyramid. Abstraction levels from top to bottom: Idea, Specification, Architecture, Implementation, Realization. Annotations: Requirements; Construct aspect models and evaluate properties; X; Y; Solution.]

Design problems: Complexity !

[Figure labels: idea; video, control, comm.; abstraction level; design wall; #concepts ?; convergence of design space; design flows; different realizations]

Processor Trends

By Cedric Nugteren, 2013

Haswell (4th gen. Intel Core processor)

Intel Xeon Phi

• Intel® Xeon Phi™ 7120X
  • 61 cores
  • 1.238 GHz, 1.333 GHz Turbo
  • 512-bit vector instructions
  • 30.5 MB cache (512 KB / core)
  • 300 W TDP
  • 16 GB on-board memory

• Price: $4129
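The Xeon Phi numbers above can be cross-checked with a few lines of arithmetic. The sketch assumes double-precision operation and one fused multiply-add per vector lane per cycle; that assumption is not stated on the slide, and the variable names are illustrative.

```python
cores = 61
clock_ghz = 1.238          # base clock from the slide
vector_bits = 512
l2_per_core_kb = 512
tdp_watts = 300

# Total cache: 61 cores x 512 KB per core.
l2_total_mb = cores * l2_per_core_kb / 1024

# Peak double-precision throughput.
dp_lanes = vector_bits // 64            # 8 doubles per 512-bit vector
flops_per_lane_per_cycle = 2            # assumption: one fused multiply-add per cycle
peak_dp_gflops = cores * clock_ghz * dp_lanes * flops_per_lane_per_cycle

print(f"cache: {l2_total_mb} MB")
print(f"peak:  {peak_dp_gflops:.0f} GFLOPS (double precision)")
print(f"       {peak_dp_gflops / tdp_watts:.1f} GFLOPS/W at TDP")
```

The cache total reproduces the 30.5 MB on the slide; the peak figure is an upper bound that real code only approaches when all vector lanes and cores stay busy.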

ARM Cortex A57: 64-bit

Your S4, full of ARMs

Exynos 5 Octa processor
• Big – Little concept
• Quad ARM Cortex-A15
• Quad ARM Cortex-A7
• + PowerVR GPU

or Snapdragon S600 with quad ARM

Snapdragon 810

• 4x ARM Cortex A57
• 4x ARM Cortex A53
• Adreno 430 GPU
• Hexagon V56 DSP
• 20nm process technology

• 4K capture and playback with H.264 and H.265
• Up to 55 MP Camera, dual ISP (Image Signal Processor)

PS4 / Xbox

• 8 AMD Jaguars
  • L1 instr: 32 KB, 2-way; L1 data: 32 KB, 8-way; 2 MB 16-way L2 per quad
• 1152 MAD (Multiply-Add) units: 1.84 TFlop
• 8 GB GDDR5: 176 GB/s

• 28nm

GPU overview: going many core

[Figure: core count, GFLOPS and power [W] of NVIDIA GPUs on a log scale (64 to 8192), January 2006 to January 2014. GPUs shown: 8800 GTX (G80), 9800 GTX (G92), GTX 280 (GT200), GTX 480 (GF100), GTX 580 (GF110), GTX 680 (GK104), GTX Titan (GK110), GTX 780 Ti (GK110).]

Maxwell – GM107 (2014)

• 28nm
• SM:
  • 4x 32 cores
  • 4x 8 LD/ST unit
  • 4x 8 SFU

• 4 warp schedulers
  − 2 instructions / warp

• 32 threads per warp
  − 1 clock cycle / warp

Very Hot: NVIDIA K1 mobile platform

• Big-Little: 4 big + 1 little ARM A15s
• Kepler GPU with
  • 192 cores
• Supports 2160p @ 30fps

[Figure labels: Kepler 192 cores; ARM big-little]

Major challenges

• Ultra low power

• Guarantee Real-Time requirements

• Deal with tremendous complexity
  • Very complex processing platforms
    − Multi-processor
    − Heterogeneous
    − Huge software stack
  • Programming difficulty

Our TU/e group solutions

• Real-time Multi-processing out of the box
  • From Model to Multi-Processor / System-on-Chip: CompSoC, MAMPSx
  • Composable and Predictable design

• From C to GPUs & FPGAs
  • Characterize loop-nests: Species
  • From C to efficient GPU code
  • From C to Verilog FPGA accelerators

• Ultra Low Power
  • SIMD (vector processor) with very low Vdd
  • Avoid (external) memory traffic: Reuse data
  • Use of accelerators (Heterogeneous Processing Hardware)

MPSoC out of the box

Application Modeling: Synchronous Data Flow

[Figure: SDF graph of an MPEG-4 SP decoder. Actors VLD, IDCT, MC and RC are connected by channels that carry tokens; rates of 99 are annotated on the channels.]

• Conservative, worst-case abstraction
• Strong analysis & synthesis support
• Low implementation overhead …
• … but may lead to over-allocation of resources
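Part of the analysis support of SDF comes from the fact that fixed rates allow a static schedule: solving the balance equations gives the number of firings of each actor per graph iteration (the repetition vector). The sketch below illustrates this on a simplified, assumed version of the MPEG-4 SP graph; the channel list and its rates are hypothetical and only reuse the actor names and the rate 99 from the figure.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, channels):
    """Solve the SDF balance equations rate[src] * prod == rate[dst] * cons.

    channels is a list of (src, prod_rate, dst, cons_rate) tuples.
    Returns the smallest positive integer firing counts per actor.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in channels:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates: no bounded schedule")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}


# Hypothetical channels: VLD emits 99 blocks per frame, IDCT handles one
# block per firing, MC collects 99 blocks, RC works once per frame.
ACTORS = ["VLD", "IDCT", "MC", "RC"]
CHANNELS = [
    ("VLD", 99, "IDCT", 1),
    ("IDCT", 1, "MC", 99),
    ("MC", 1, "RC", 1),
]

firings = repetition_vector(ACTORS, CHANNELS)
print(firings)
```

Because the rates are fixed at their worst-case value, IDCT is always budgeted for 99 firings per frame, which is the over-allocation mentioned in the last bullet.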

Modeling: Scenario-Aware Data Flow

[Figure: scenario-aware data flow graph of the MPEG-4 SP decoder. Actors VLD, IDCT, MC and RC are connected by channels that carry tokens; the rates are a parameter x that a state machine selects per scenario (states such as I, P0 and P99).]

x = {0, 30, 40, 50, 60, 70, 80, 99}

• Dynamics captured in scenarios
• Efficient implementations
• Design and Modeling Tools available: SDF3
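The benefit of scenarios over a single worst-case rate can be illustrated with a small count. The sketch assumes that x is the number of blocks processed for a frame and uses a made-up frame sequence; both are assumptions for illustration, not data from the slides.

```python
SCENARIOS = (0, 30, 40, 50, 60, 70, 80, 99)   # allowed values of x from the slide


def scenario_work(frames):
    """Total block-level firings when each frame runs in its own scenario."""
    for x in frames:
        if x not in SCENARIOS:
            raise ValueError(f"{x} is not a modelled scenario")
    return sum(frames)


def worst_case_work(frames):
    """Total firings budgeted by a plain SDF model (always x = 99)."""
    return max(SCENARIOS) * len(frames)


# Hypothetical sequence: one I-like frame (99 blocks) followed by lighter frames.
frames = [99, 30, 40, 0, 50, 99, 30, 60]

actual = scenario_work(frames)
budget = worst_case_work(frames)
print(f"scenario-aware: {actual} firings, worst-case SDF: {budget} firings")
print(f"utilisation of the worst-case budget: {actual / budget:.0%}")
```

For this sequence roughly half of the worst-case budget would stay unused, which is the slack a scenario-aware model can reclaim while still giving guarantees per scenario.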

MAMPSx Multiprocessor Framework for Real-Time correct-by-construction Systems

GPU Programming

GPUs: hundreds of processors

However: Programming hell, from C to CUDA

Programming heaven?

Or FPGA?

Good news: Lots of similarity

Algorithmic Species

Our performance (compared to other fully automatic compilers)

Ultra Low Power

SIMD + low Vdd + New Memory Architecture

• Exploiting Data Locality

• Extreme low Voltage

• Massive parallelism
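Why low voltage and massive parallelism belong together can be sketched with the standard first-order model of dynamic switching energy, E = C · Vdd². The capacitance, voltages and clock frequencies below are hypothetical placeholders, and leakage is ignored.

```python
from math import ceil


def dynamic_energy_per_op(c_eff_farads: float, vdd_volts: float) -> float:
    """First-order dynamic switching energy per operation: E = C * Vdd^2."""
    return c_eff_farads * vdd_volts ** 2


def lanes_needed(f_nominal_hz: float, f_low_vdd_hz: float) -> int:
    """SIMD lanes required to keep the throughput of one nominal-voltage lane."""
    return ceil(f_nominal_hz / f_low_vdd_hz)


C_EFF = 1e-12                     # hypothetical effective switched capacitance (1 pF)
VDD_NOMINAL, VDD_LOW = 1.0, 0.5   # hypothetical supply voltages in volts

e_nominal = dynamic_energy_per_op(C_EFF, VDD_NOMINAL)
e_low = dynamic_energy_per_op(C_EFF, VDD_LOW)
print(f"energy per op drops by x{e_nominal / e_low:.1f}")

# Hypothetical clock rates: the low-voltage lane runs 8 times slower.
print(f"lanes to keep throughput: {lanes_needed(800e6, 100e6)}")
```

Halving Vdd cuts the dynamic energy per operation by four, but the circuit also becomes slower; wide SIMD wins the throughput back without paying the energy again.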

Results on the ICE Graph

[Figure: ICE graph with the 1 pJ/op line marked]

Can we do better?

• Yes! How?

• Supporting special functionality
  • Accelerators

• Change your software
  • Advanced code (loop) transformations
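Loop tiling is one example of such a code transformation. The sketch below is not the group's code transformer; it is a plain illustration that computes the same CNN-style 2D convolution (no kernel flip) twice, once fetching every operand from external memory and once through a small on-chip buffer per output tile, and counts external image reads. Kernel weights are assumed to stay on chip, and all names and sizes are chosen for illustration.

```python
def conv2d_untiled(img, ker):
    """Direct convolution; every image operand is read from external memory."""
    k = len(ker)
    h, w = len(img) - k + 1, len(img[0]) - k + 1
    out = [[0] * w for _ in range(h)]
    loads = 0
    for y in range(h):
        for x in range(w):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += img[y + i][x + j] * ker[i][j]
                    loads += 1          # one external read per multiply-accumulate
            out[y][x] = acc
    return out, loads


def conv2d_tiled(img, ker, tile):
    """Same result, computed per output tile from an on-chip input buffer."""
    k = len(ker)
    h, w = len(img) - k + 1, len(img[0]) - k + 1
    out = [[0] * w for _ in range(h)]
    loads = 0
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            th, tw = min(tile, h - ty), min(tile, w - tx)
            # Fetch the input patch for this tile once; reuse it for all outputs.
            buf = [row[tx:tx + tw + k - 1] for row in img[ty:ty + th + k - 1]]
            loads += (th + k - 1) * (tw + k - 1)
            for y in range(th):
                for x in range(tw):
                    out[ty + y][tx + x] = sum(
                        buf[y + i][x + j] * ker[i][j]
                        for i in range(k) for j in range(k))
    return out, loads


img = [[(3 * y + x) % 7 for x in range(32)] for y in range(32)]   # 32 x 32 test image
ker = [[1] * 5 for _ in range(5)]                                 # 5 x 5 kernel

ref, loads_ref = conv2d_untiled(img, ker)
for tile in (4, 28):
    res, loads = conv2d_tiled(img, ker, tile)
    assert res == ref
    buffer_words = (tile + 4) ** 2
    print(f"tile {tile:2d}: buffer {buffer_words:4d} words, "
          f"{loads} external reads (x{loads_ref / loads:.1f} fewer)")
```

Larger tiles need more on-chip memory but fewer external reads; choosing the tile size trades one against the other.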

E.g. Data Reuse in Convolutional Neural Networks

CNN used for, e.g., recognizing traffic signs
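The feature-map sizes of this network (32 x 32 input, then 28 x 28, 14 x 14, 10 x 10 and 5 x 5) follow directly from the layer types. A minimal sketch, assuming convolutions without padding and non-overlapping subsampling; the layer list mirrors the labels of the figure, with n1 taken to be the 5x5 convolution that follows S2.

```python
def conv_out(size: int, kernel: int) -> int:
    """Output width of a convolution without padding (stride 1)."""
    return size - kernel + 1


def subsample_out(size: int, factor: int) -> int:
    """Output width of non-overlapping subsampling."""
    return size // factor


LAYERS = [
    ("C1", "conv", 5),
    ("S1", "sub", 2),
    ("C2", "conv", 5),
    ("S2", "sub", 2),
    ("n1", "conv", 5),   # assumed: first classification layer, 5x5 convolution
]

size = 32
sizes = {"input": size}
for name, kind, param in LAYERS:
    size = conv_out(size, param) if kind == "conv" else subsample_out(size, param)
    sizes[name] = size

for name, s in sizes.items():
    print(f"{name:5s} {s} x {s}")
```

After S2 a 5x5 convolution reduces each map to a single value, so the remaining classification layers work on 1 x 1 maps.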

[Figure: CNN for traffic sign recognition. Feature extraction: input 32 x 32 → C1 feature maps 28 x 28 (5x5 convolution) → S1 feature maps 14 x 14 (2x2 subsampling) → C2 feature maps 10 x 10 (5x5 convolution) → S2 feature maps 5 x 5 (2x2 subsampling). Classification: n1 (5x5 convolution) → n2 (1x1 convolution) → output sign (30, 50, 60, 70, 80, 90, 100).]

Data reuse: Bandwidth reduction 327x

[Figure: number of external memory accesses (x 10^6, log scale from 1 to 1000) versus on-chip memory size; curve: original]

• However: Huge on-chip memory required

After using our code transformer:

[Figure: number of external memory accesses (x 10^6, log scale from 1 to 1000) versus on-chip memory size; curves: original, model, simulated; marked points: loop n, loop m, loop q, loop r]

• 64 times memory reduction / same BW reduction

Convolutional Neural Network FPGA prototype

• Synthesized with Vivado HLS (AutoESL) C -> HDL

• Extreme (Linear) speedup (with number of MACC PEs)

[Figure: accelerator block diagram for CNN layer 3. Labels: FSL, array of MACC PEs, in_img, weight, bias, In Ctrl, Out Ctrl, out_img, Activation LUT, Select, saturate, DDR.]

Summary

• Embedded systems become extremely complex
  • Processor trends
• Advanced methods and tooling needed for
  • Programming efficiency
  • Design Space Exploration
  • Predictable Design: guaranteeing timing properties
  • Correct by Construction

• Researching solutions
  • Automated MPSoC design
  • Automated C to target platform
  • Ultra low power
• We FPGA prototype everything

Cheap prototype boards

Minnowboard MAX

99 to 139 USD

Minnowboard MAX

Number of cores: Single/Dual core 64-bit Intel® Atom™ E38xx Series SoC @ 1.46 / 1.33 GHz
RAM: 1/2 GB DDR3 RAM
Flash: 8 MByte SPI Flash; external SD card
Caches: 32 KByte I-Cache / 24 KByte D-Cache; 512 KByte L2 cache
GPU: Intel integrated HD Graphics
I/O: DVI over HDMI connector; Audio; SATA2 3 Gb/sec; 2 USB (host) & 1 USB-B (device; slave); Serial debug: Serial (UART 0) to USB conversion (mini-USB-B port); 10/100/1000 Ethernet RJ-45 connector; 8 buffered GPIO (General Purpose I/O) pins; SPI & I2C
Special features: 7-issue-slot VLIW from SiliconHive (Intel)
Price: 99 USD / 139 USD

Raspberry Pi

35 USD

Raspberry Pi

Number of cores: 1 Broadcom BCM2835 (CPU: ARM11 @ 700 MHz + GPU + DSP)
RAM: 512 MByte
Flash: External SD card
Caches: 16 KB (instruction) / 16 KB (data) L1 cache
GPU: Broadcom VideoCore IV
I/O: 2 USB ports; Audio out; HDMI video out; Composite video out; 10/100 Mbit Ethernet; GPIO
Special features: Very cheap, relatively small
Price: 35 USD

Jetson TK1 (using K1)

192 USD

Jetson TK1

Number of cores: 4-Plus-1 quad-core ARM: 4 Cortex A15 + A7 CPU @ 2.3 GHz
RAM: 2 GByte with 64 bit width
Flash: 16 GB eMMC 4.51 memory & external SD card
Caches: 32 KB I-cache, 32 KB D-cache; 4 MByte L2 cache
GPU: Kepler GPU with 192 CUDA cores @ 950 MHz
I/O: Half mini-PCIE slot; Full-size SD/MMC connector; Full-size HDMI port; 1x USB 2.0, 1x USB 3.0, 1x RS232; Audio & Gigabit Ethernet; via expansion port: GPIO, UART, SPI, I2C
Special features: Very fast mobile GPU platform, OpenCL programmable
Price: 192 USD

Google Tango (using Tegra K1)

1024 USD

Google Tango

Number of cores: 4-Plus-1 quad-core ARM: 4 Cortex A15 + A7 CPU @ 2.3 GHz
RAM: 4 GB
Flash: 128 GB
Caches: 32 KB I-cache, 32 KB D-cache; 4 MByte L2 cache
GPU: Kepler GPU with 192 CUDA cores @ 950 MHz
I/O: Motion tracking camera; Depth sensing; WiFi; Bluetooth Low Energy; 4G
Special features: Kinect-like vision (depth) and powerful processor
Price: 1024 USD

Bugblat PIF FPGA extension board for Raspberry Pi

35 USD

Bugblat PIF

Number of cores: No CPU cores, but (Lattice) FPGA with 6864 LUTs
RAM: 26x 9 Kbit SRAM blocks (in FPGA)
Flash: 256 Kbit in FPGA
Caches: N/A
GPU: N/A
I/O: 47 GPIO pins; Red and green LEDs
Special features: FPGA expansion board for Raspberry Pi
Price: 34.99 USD

Arndale board

259 USD

Arndale board

Number of cores: Cortex-A15 @ 1.7 GHz dual core subsystem with 64/128 bit SIMD NEON (Octo 4+4 ARM A15-A9 version available)
RAM: 32-bit 800 MHz DDR3(L)/DDR3, 2 GBytes
Flash: External memory card
Caches: 32 KB (instruction) / 32 KB (data) L1 cache; 1 MB L2 cache
GPU: ARM Mali T-604 GPU
I/O: SATA 1.0/2.0/3.0 interface; ITU 601 camera interface; HDMI 1.4 interfaces with on-chip PHY; USB 2 & 3; Four-channel high-speed UART; Three-channel high-speed SPI; Three-channel 24-bit I2S audio interface; Four-channel I2C interface
Special features: First OpenCL programmable GPU: Mali T-604 GPU board
Price: 259 USD

Arduino Duemilanove (2009)

15 EUR

Arduino Duemilanove (2009)

CPU: Single core 8-bit AVR microcontroller (32-bit version exists)
RAM: 2 kB
Flash: 32 kB
Caches: N/A
GPU: N/A
I/O: 23 GPIO / UART / ADC / SPI (Serial Peripheral Interface) / I2C / PWM
Special features: Arduino compatible, USB programmable
Price: 15 EUR

LPC Xpresso

20 EUR

LPC Xpresso

CPU: Single core ARM Cortex M0 (from NXP)
RAM: 8 kB
Flash: 32 kB
Caches: N/A
GPU: N/A
I/O: GPIO, SSP (Synchronous Serial Port), I2C, UART, ADC
Special features: Integrated JTAG debugger
Price: 20 EUR

Mbed LPC1768

55 EUR

Mbed LPC1768

CPU: Single core ARM Cortex M3
RAM: 32 kB
Flash: 512 kB
Caches: N/A
GPU: N/A
I/O: Ethernet, USB Host/Device, SPI, I2C, UART, CAN, PWM, ADC, GPIO
Special features: Online compiler (drag-and-drop source code) / built-in USB drag 'n' drop FLASH programmer
Price: 55 EUR

Helios

99 EUR

Helios

Number of cores: Single core 8-bit AVR
RAM: 2.5 KBytes
Flash: 32 KBytes
Caches: N/A
GPU: N/A
I/O: GPIO (General Purpose I/O); UART; SPI (Serial Peripheral Interface); I2C (Inter-Integrated Circuit); ADC (Analog to Digital Converter); PWM (Pulse Width Modulation); WiFi; LED display; RGB LED
Special features: Color sensor, temperature sensor and Real Time Clock; Arduino compatible
Price: 99 EUR (or free as gadget at Electronics & Automation 2013)

Freescale Freedom board

7,25 EUR

Freescale Freedom board

Number of cores: Single core Cortex-M0+ @ 48 MHz
RAM: 16 KB SRAM
Flash: 128 KByte
Caches: N/A
GPU: N/A
I/O: USB OTG; Capacitive touch “slider”; MMA8451Q accelerometer; Tri-color LED; PWM; ADC; SPI/I2C/UART; GPIO
Special features: Arduino compatible header, on-board accelerometer, very cheap
Price: 7,25 EUR

Zedboard

319 USD

Zedboard

Number of cores: Dual ARM® Cortex™-A9 MPCore™ @ 667 MHz + FPGA: 85k LUTs + 220 (18-bit) DSPs
RAM: 512 MB DDR3; 64 KB of scratch RAM
Flash: 256 Mb Quad-SPI Flash & external SD card
Caches: L1 32 KB I-Cache / 32 KB D-Cache; 512 KB L2 cache
GPU: N/A
I/O: Onboard USB-JTAG programming; 10/100/1000 Ethernet; USB OTG 2.0 and USB-UART; PS & PL I/O expansion (FMC, Pmod™ compatible, XADC); Multiple displays (1080p HDMI, 8-bit VGA, 128 x 32 OLED); I2S audio CODEC
Special features: Dual core ARM & FPGA System on Chip, on-board OLED
Price: 319 USD (academic price); normal price 495 USD

DE1-SOC (Altera)

150 USD

DE1-SOC

Number of cores: Dual ARM® Cortex™-A9 @ 800 MHz (925 MHz max) + FPGA: 85k LUTs + 160 (64-bit floating point) DSPs
RAM: 1 GB (2x256Mx16) DDR3 SDRAM on HPS; 64 MB (32Mx16) SDRAM on FPGA; 64 KB of scratch RAM
Flash: External MicroSD card
Caches: 32 KB L1 instruction cache, 32 KB L1 data cache (per core); 512 KB shared L2 cache
GPU: N/A (or on FPGA)
I/O: Two-port USB 2.0 host (ULPI interface with USB type A connector) & USB to UART; 10/100/1000 Ethernet; PS/2 mouse/keyboard; IR emitter/receiver; One 10-pin ADC input header; One LTC connector (one SPI master, one I2C and one GPIO interface); 24-bit VGA DAC & 24-bit audio CODEC; Line-in, line-out and microphone-in jacks; TV decoder (NTSC/PAL/SECAM) and TV-in connector
Special features: Dual core ARM & FPGA System on Chip, on-board

Cyclone V GX starter kit (Altera FPGA)

180 USD

Cyclone V GX starter kit

Number of cores: No CPU cores; FPGA only, 115 kLUTs + 150 (64-bit floating point) DSPs
RAM: 512 MByte LPDDR2, 32-bit data bus; 512 KByte SRAM, 16-bit data bus
Flash: External SD card & EPCQ256 on FPGA
Caches: N/A
GPU: N/A
I/O: UART to USB; HSMC x 1, including 4-lane 3.125G transceiver; 2x20 GPIO header; Arduino header, including analog pins; SMA x 4, one-lane 3.125G transceiver; HDMI TX, compatible with DVI v1.0 and HDCP v1.4; 24-bit CODEC; Line-in, line-out and microphone-in jacks; 8-channel ADC
Special features: FPGA with 5x 3.125 Gbit/sec transceivers
Price: 180 USD

Beagleboard XM

149 USD

Beagleboard XM

Number of cores: TI DM3730: ARM Cortex-A8 @ 1 GHz; TMS320C64x+ @ 800 MHz (DSP, VLIW, 8 FU, 6 ALU, 2 MUL)
RAM: 512 MB LPDDR RAM; 64 KByte SRAM inside ARM
Flash: External SD card
Caches: ARM: L1 data & instruction 32 KByte, L2 256 KByte; DSP: L1 data (80 KByte) & instruction (32 KByte), L2 112 KByte
GPU: PowerVR SGX 2D/3D
I/O: DVI-D & S-Video; USB OTG (mini AB); 4 USB ports; Ethernet port; MicroSD/MMC card slot; Stereo in and out jacks; RS-232 port; Camera port; Expansion port
Special features: DSP, ARM & GPU on single chip, special camera interface
Price: 149 USD