IoT Platforms

Chi-Sheng Shih

Outline

‣ IoT Platforms

‣ Core of IoT Platforms

‣ Micro-Controller based Platforms

‣ Micro-Processor based Platforms

‣ Execution Model

!2 IoT Platforms

!3 Core of platforms - Microprocessor

‣ Microprocessor
‣ is an IC that contains only a CPU.
‣ has no RAM, ROM, or other peripherals on the chip.
‣ A system designer needs to add peripherals externally to make it functional.
‣ Examples: Intel i3, i5, i7 processors; ARM700/710/720, ARM810, ARM920T, ARM940T, and Cortex-M0 processors.

!4 Core of platforms - Micro-controller

‣ has a processing unit, along with a fixed amount of RAM, ROM, and other peripherals, all embedded on a single chip.
‣ is designed to perform specific tasks where the relationship between input and output is well defined.
‣ The amount of resources required by the application is known in advance and can be packed together in a single chip.
‣ Examples:
‣ Micro-controllers for keyboards, mice, washing machines, digital cameras, etc.
‣ Micro-controllers based on
‣ Cortex-M0: Cypress PSoC 4, Infineon XMC 1000, STMicroelectronics STM32F0, NXP LPC 1100
‣ ARM7: Atmel AT91SAM7, NXP LPC2100/LPC2200
‣ ARM9: Atmel AT91SAM9, NXP LPC 2900

!5 Core of the Platforms - SoC

‣ System-on-Chip (SoC)
‣ is a total solution for specific applications.
‣ has processing units (CPU, GPU, FPGA, or DSP) and application-specific peripherals (modem, GPS, etc.).
‣ There is no clear line between SoC and MCU.
‣ MCUs are often used for applications that require limited computation power, to reduce energy consumption. Controllers for CNC machines are an example.
‣ SoCs are often used for applications that require a full computer. Smartphones are an example.

!6 Micro-Controller Based Platform

!7 Micro-controller-based Platforms

TinyDuino PanStamps Arduino Uno

RFduino XinoRF Arduino Yun

!8 Arduino
‣ Arduino Uno:
‣ MCU: ATmega328P
‣ Flash: 32KB
‣ SRAM: 2KB
‣ EEPROM: 1KB
‣ Clock: 16MHz
‣ Digital I/O: 14
‣ Analog: 6
‣ Arduino Mega:
‣ MCU: ATmega2560
‣ Flash: 256KB
‣ SRAM: 8KB
‣ EEPROM: 4KB
‣ Clock: 16MHz
‣ Digital I/O: 54
‣ Analog: 16

!9 Arduino Family

LilyPad

!10 Performance and Power Consumption

‣ The device achieves a throughput of 1 MIPS per MHz. ‣ The ATmega328 can operate at up to 20MHz. ‣ Can we make it run for six months on battery?

‣ An Arduino Uno draws 45mA@5V and can run for less than one day on a 9V battery (1,200mAh).

‣ To run for 6 months, the board can only draw 0.05mA@5V (see the worked estimate after this list). On an Arduino Pro Mini:

‣ Unmodified: 9.9mA@Active, 3.14mA@Sleep

‣ No Power LED: 16.9mA@Active, 0.0232mA@Sleep

‣ No Power LED, no power regulator: 12.7mA@Active, 0.0058mA@Sleep

http://www.home-automation-community.com/arduino-low-power-how-to-run-atmega328p-for-a-year-on-coin-cell-battery/
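A rough sanity check of these figures, as a back-of-envelope sketch (it ignores regulator losses and the fact that a 9V block delivers less than its rated capacity at this load, which is why the lifetime above is quoted as less than one day):

lifetime ≈ capacity / average current = 1200 mAh / 45 mA ≈ 27 h (about one day)

For a six-month target (about 180 × 24 = 4320 h), the average draw must stay below 1200 mAh / 4320 h ≈ 0.28 mA; the stricter 0.05mA figure above leaves extra margin for the losses not modeled here.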

!11 Curie

‣ Announced at CES 2015 for use in wearable devices.
‣ Uses the Quark SE core
‣ Pentium x86 ISA compatible, without the x87 floating point unit
‣ 32-bit processor with 32-bit data bus
‣ 32 MHz clock frequency
‣ 384 KB of on-die flash
‣ 80 KB of on-die SRAM
‣ Sensors: 6-axis accelerometer/gyroscope
‣ Communication: BLE (Bluetooth Low Energy)
‣ Peripherals: USB, UART, I2C, SPI, GPIO, RTC, ADC

!12

!13 MediaTek LinkIt One

‣ ARM Cortex-M4 with floating point architecture
‣ Comprehensive peripheral interface support, with a common Hardware Abstraction Layer API
‣ FreeRTOS with additional middleware components

!14 Head to head

!15 General Purpose Processor Based Platform

!16 BeagleBone Black

‣ BeagleBone Black is a low-cost, community-supported development platform for developers and hobbyists.
‣ Boots Linux in under 10 seconds; get started on development in less than 5 minutes with just a single USB cable.
‣ Processor: AM335x 1GHz ARM® Cortex-A8
‣ 512MB DDR3 RAM
‣ 4GB 8-bit eMMC on-board flash storage
‣ 3D graphics accelerator
‣ NEON floating-point accelerator
‣ 2x PRU 32-bit micro-controllers

!17 Raspberry Pi 3

‣ Hardware Spec:
‣ A 1.2GHz 64-bit quad-core ARMv8 CPU
‣ 802.11n Wireless LAN
‣ Bluetooth 4.1
‣ Bluetooth Low Energy (BLE)
‣ 1GB RAM
‣ 4 USB ports
‣ 40 GPIO pins
‣ Full HDMI port
‣ Ethernet port
‣ Combined 3.5mm audio jack and composite video
‣ Camera interface (CSI)
‣ Display interface (DSI)
‣ Micro SD card slot (now push-pull rather than push-push)
‣ VideoCore IV 3D graphics core

!18 Raspberry Pi Zero

‣ Hardware Specification
‣ BCM2835, 1GHz, single-core CPU
‣ By Broadcom
‣ Using ARM1176 (ARMv6)
‣ 512MB RAM
‣ Mini HDMI and USB On-The-Go ports
‣ Micro USB power
‣ HAT-compatible 40-pin header
‣ Composite video and reset headers

!19 Intel Galileo
‣ Galileo (released in 2013)
‣ Arduino-certified development boards based on the Intel x86 architecture
‣ Gen1: Intel Quark X1000 32-bit 400MHz SoC
‣ Gen2: Intel Quark X1000 32-bit 400MHz SoC + Ethernet PoE + 12-bit PWM + USB UART adapter
‣ Single core, single thread, Pentium instruction set
‣ Industry-standard I/O interfaces: ACPI, PCI Express, Micro SD, USB 2.0

!20 Edison

‣ Released in 2014
‣ Hardware Specification:
‣ Intel Atom Tangier: 2 x Atom cores at 500MHz
‣ 1 x Intel Quark at 100MHz (for RTOS)
‣ 1GB RAM
‣ 4GB Flash
‣ Onboard WiFi
‣ Bluetooth 4.0
‣ USB
‣ Software:
‣ Yocto Linux supporting the Arduino IDE, Eclipse (C, C++, Python), Intel XDK, and Wolfram.

!21 Execution Model

!22 Execution Model

‣ On general purpose computers,
‣ we assume that computers are mostly connected to power sources or have time to shut down cleanly.
‣ Data and binary code are loaded into memory to speed up execution.
‣ On embedded systems and MCU-based systems,
‣ many are NOT connected to power sources and can be shut down or rebooted without notice at any time.
‣ Only data is loaded into memory; binary code and state information are stored on flash or other storage.
‣ Access latency of DRAM versus flash storage differs by 100x to 1000x (see the estimate below).
‣ The difference greatly prolongs execution time and increases power consumption.
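To see why this matters, a hypothetical average-access-latency estimate; both the 300x latency ratio and the 10% storage-access fraction are assumptions for illustration, chosen within the 100x-1000x range above:

t_avg = (1 - f) * t_RAM + f * k * t_RAM = (0.9 + 0.1 × 300) * t_RAM ≈ 31 * t_RAM

Even a modest fraction of accesses going to flash or storage multiplies the effective latency, and the CPU stays active, and drawing power, correspondingly longer.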

!23 Example
‣ The SHIMMER wearable sensor platform:
‣ TI MSP430 microcontroller
‣ 802.15.4 radio (Chipcon CC2420)
‣ Storage: 2GB MicroSD
‣ Battery: 250mAh rechargeable battery
‣ Triaxial MEMS accelerometer
‣ Gyroscope, ECG, EMG, and other sensors
‣ On such platforms,
‣ the microcontroller is not the major energy consumer;
‣ sensors such as the gyroscope, and the radio, are.
‣ What if your code runs 10x or 100x slower?

Konrad Lorincz et al., "Mercury: a wearable sensor network platform for high-fidelity motion analysis," in SenSys '09: Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems (Nov. 2009), pp. 183-196.

!24 Lifetime for Wearable Devices
‣ With the node continuously sampling and logging accelerometer and gyroscope data, and maintaining time sync, but performing no data transfers to the base station, the achievable lifetime with a 250 mAh battery is 12.5 h.
‣ Adding an activity filter on the CPU reduces the amount of transmission and prolongs the lifetime: up to 17 h at 50% activity, and to 89 h (4x) when downloading features only. The implied average currents are worked out below.
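The implied average currents follow directly from lifetime = capacity / current:

I_avg = 250 mAh / 12.5 h = 20 mA; 250 mAh / 17 h ≈ 14.7 mA; 250 mAh / 89 h ≈ 2.8 mA

so reaching the 89 h figure requires cutting the average draw roughly sevenfold.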

!25 Sensor node VMs - Motivation

‣ Target device class:

• MSP430, Cortex-M0, AVR

• Up to a few hundred KB of flash

• Up to tens of KB of RAM

‣ Advantages of running a VM:

• Using higher level languages instead of C reduces development cost.

• IoT applications are expected to consist of many different hardware platforms. Platform independence reduces deployment cost.

• VMs can offer a safe execution environment.

!26 Sensor node VMs - Related Works

‣ Application specific

• Maté: first sensor node VM

• VM*: a Java-based framework to build VMs tailored to specific applications

‣ General purpose

• Java: NanoVM, TakaTuka, Darjeeling, SwissQM, leJOS

• Python: pyMite, Micropython

• LISP: SensorScheme

• Others: DVM, TinyVM

‣ They differ in their system requirements and the features they provide. Sensor node JVMs all sacrifice some features, most commonly reflection and support for floating point operations.

!27 Sensor node VM performance

‣ Not many VMs publish detailed performance figures.
‣ For the ones that do, the exact performance depends on the code being executed,
‣ but the general picture is clear: all are 1 to 2 orders of magnitude slower than native code.

VM           | Source                         | Platform               | Instruction set | Performance vs native C
Darjeeling   | Delft University of Technology | Mica2 (AVR) ATmega128  | JVM             | 30x-113x slower
TakaTuka     | University of Freiburg         | JCreate (MSP)          | JVM             | 230x slower
TinyVM       | Yonsei University, Seoul       | ATmega128              | nesC            | 14x-72x slower
DVM          | UCLA                           | ATmega128L             | SOS             | 108x slower
SensorScheme | University of Twente           | MSP430                 | LISP            | 4x-105x slower

N. Brouwers et al., "Darjeeling, a feature-rich VM for the resource poor," SenSys '09.
F. Aslam et al., "Introducing TakaTuka: a Java virtual machine for motes," SenSys '08.
K. Hong et al., "TinyVM: an energy-efficient execution infrastructure for sensor networks," Technical Report CS-TR-9003, Department of Computer Science, Yonsei University, Seoul, Korea.
R. Balani et al., "Multi-level Software Reconfiguration for Sensor Networks," EMSOFT '06.
L. Evers, "Concise and Flexible Programming of Wireless Sensor Networks," PhD thesis, University of Twente, 2010.

!28 Does the slowdown matter?

‣ At a 10-230x slowdown:

• The energy consumption of running application code increases accordingly (one of the main reasons to use tiny devices is their low power consumption).

• Time-critical tasks such as periodic sensing and data processing may not finish in time.

‣ For the Mercury example, the 'compute features' and 'activity filter' stages become a large, or the largest, component when multiplied by 10 to 100. Doing the FFT on the node will most likely be impossible.

Lorincz et al., "Mercury: a wearable sensor network platform for high-fidelity motion analysis," SenSys '09.

!29 Improving performance by compiling to native code

‣ Compiling bytecode to native code has been a common technique on desktops, especially since the JVM was released.
‣ Three types

• Offline (before sending code to the device): better code can be generated, but at the cost of losing platform independence.

• Just-in-time (at run time): common on desktops, but impractical on a sensor node, since many MCUs can only execute code from flash memory.

• Therefore we use ahead-of-time compilation: compile the whole application to native code, on the device, at load time.

!30 Challenges

• Restricted flash memory means the size of the compiler shouldn’t be much larger than the interpreter

• Restricted RAM means we cannot store complex data structures to analyze the bytecode

!31 AOT compiler for JVM bytecode

‣ We build on earlier work done by Joshua Ellul:

• When the JVM bytecode is loaded, replace each instruction with a native equivalent,

• using the native stack as JVM operand stack,

• then do some simple but effective peephole optimizations (a code sketch follows the tables).

1) Initial translation

Operation                       | JVM  | Native
adding two shorts               | IADD | POP R10; POP R11; POP R12; POP R13; ADD R10, R12; ADC R11, R13; PUSH R11; PUSH R10
duplicate the top stack element | IDUP | POP R10; POP R11; PUSH R11; PUSH R10; PUSH R11; PUSH R10

2) Peephole optimisation

Before                     | Cost              | After         | Cost
PUSH R10; POP R10          | 4 bytes, 4 cycles | (removed)     | 0 bytes, 0 cycles
PUSH R10; POP R12          | 4 bytes, 4 cycles | MOV R12, R10  | 2 bytes, 1 cycle
MOV R10, R12; MOV R11, R13 | 4 bytes, 2 cycles | MOVW R10, R12 | 2 bytes, 1 cycle
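A minimal sketch in C of this load-time translation step, using the register conventions from the table above. The opcode values and the emit_* helpers are hypothetical, not Darjeeling's actual code generator:

/* Sketch: translate one stack-based bytecode into native AVR-style code
   at load time, using the native stack as the JVM operand stack. */
#include <stdint.h>

void emit_push(uint8_t reg);            /* assumed code-emitter helpers */
void emit_pop(uint8_t reg);
void emit_add(uint8_t rd, uint8_t rr);  /* ADD rd, rr */
void emit_adc(uint8_t rd, uint8_t rr);  /* ADC rd, rr */

enum { OP_IADD = 0x60, OP_IDUP = 0x59 }; /* hypothetical opcode values */

void translate(uint8_t opcode) {
    switch (opcode) {
    case OP_IADD:                      /* 16-bit add, as in the table */
        emit_pop(10); emit_pop(11);    /* first operand  -> R11:R10   */
        emit_pop(12); emit_pop(13);    /* second operand -> R13:R12   */
        emit_add(10, 12);              /* add low bytes               */
        emit_adc(11, 13);              /* add high bytes with carry   */
        emit_push(11); emit_push(10);  /* push the result             */
        break;
    case OP_IDUP:                      /* duplicate top stack element */
        emit_pop(10); emit_pop(11);
        emit_push(11); emit_push(10);
        emit_push(11); emit_push(10);
        break;
    default:
        break;                         /* ... one case per bytecode   */
    }
}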

J. Ellul, "Run-time compilation techniques for wireless sensor networks," PhD thesis, University of Southampton, 2012.

!32 AOT performance

‣ This approach significantly improves performance: 4.27x slower than native C.
‣ But the trade-off is a significant increase in code size: 3.15x larger than native C.
• Since flash memory is scarce on a sensor node, this reduces the size of the programs we can load on a device.

     | Performance     | Code size
C    | 1x              | 1x
JVM  | 10x-100x slower | 0.5x
AOT  | 4.27x           | 3.15x

‣ Causes of JVM slowdown
• Interpreter overhead (eliminated by Ellul's AOT)
• Stack overhead (reduced by our optimizations)
- Push/pop overhead
- Load/store overhead
• Instruction set mismatch

All experiments were done using the Avrora cycle-accurate simulator, simulating an ATmega128 CPU, for a set of 7 different benchmarks: bubble sort, heap sort, binary search, FFT, MD5, RC5, XXTEA.

!33 Overhead after basic AOT

‣ The remaining overhead can be measured by comparing the types of instructions executed in the AOT-compiled version against the native C version.
‣ We group them into three categories:

• PUSH / POP: Stack overhead.

- Each instruction pops its operands from the stack, and pushes its result. Only a limited number can be eliminated by the peephole optimizer.

• LOAD / STORE: Stack overhead.

- The same variables are often used multiple times in succession, but since each instruction consumes its operands, we repeatedly have to load and store the same values.

• Others: due to instruction set mismatch.
- This includes many cases that can be expressed more efficiently in native AVR, such as looping over arrays.
- In our optimizations we will only target one case: bit shifts by a constant number of bits.
- JVM shifts take the number of bits to shift by as an operand.
- Flexible, but inefficient if the number is fixed.

!34 Overhead after basic AOT

Example Java code: do {A>>>=1;} while(A>B);

cycles:
total:            48
push/pop:         17
load/store:       16
instr. set:        9
total overhead:   42
native cost:       6

!35 Overview of Proposed optimizations

‣ Combined, they reduce ~80% of the remaining overhead
1. Push/pop overhead:
a) improved peephole optimizer
b) stack caching
2. Load/store overhead:
a) popped value caching
b) mark loops
3. Instruction set overhead:
constant shift optimization
‣ However, they do increase the code size.

do {A>>>=1;} while(A>B);   48 -> 8 cycles (only 2 more than native C)

                Before  After
total:              48      8
push/pop:           17      0
load/store:         16      2
instr. set:          9      0
total overhead:     42      2
native cost:         6      6

!36 Experimental setup

‣ AOT compiler built on top of the Darjeeling VM
• a JVM for resource-constrained devices
• modifies JVM bytecode to make it more suitable for sensor nodes
‣ Avrora cycle-accurate simulator, extended by
• tracing the AOT compilation: the VM writes debug output to a specific memory location monitored by Avrora
• keeping track of cycles spent in every memory location
‣ Simulating an ATmega128 CPU
• 4KB RAM, 128KB flash, 32 8-bit registers
‣ Set of 7 benchmarks with different characteristics
• bubble sort, heap sort, binary search, FFT, MD5, RC5, XXTEA

‣ Simulation output
‣ darjeeling.S: disassembly of native C code
‣ profilerdata.xml: cycles spent per flash address
‣ rtcdata.xml: trace of the AOT compilation showing how each instruction is translated
‣ stdoutlog.txt: monitors benchmark success, plus timer output to count total cycles spent in native and AOT-compiled code
‣ jlib_bm.debug: modified JVM bytecode after Darjeeling's transformations
‣ Combined by F# scripts to form a detailed performance report, split by opcode

!37 Results: Performance overhead per instruction type

[Chart: overhead (% of native C run time) after each optimization step (simple peephole, improved peephole, stack caching, popped value caching, mark loops, constant shift), broken down into push/pop, mov(w), load/store, other, and total; average of 7 benchmarks.]

!38 Results per benchmark

[Chart: overhead (% of optimised native C run time) after each optimization step, per benchmark: bubble sort, heap sort, binary search, fft, xxtea, md5, rc5.]

!39 Overhead per benchmark for 1-7 pinned registers

[Chart: overhead (% of optimised native C run time) for 1 to 7 pinned registers, per benchmark: bubble sort, heap sort, binary search, fft, xxtea, md5, rc5.]

!40 Xxtea overhead per instruction type for 1-7 pinned registers

‣ xxtea has a lower average stack depth. More pinned registers can reduce load/store overhead, but increase the cost of spilling push/pop stack values.

[Chart: xxtea overhead (% of optimised native C run time) for 1 to 7 pinned registers, broken down into push/pop, move, load/store, other, and total.]

!41 Results: Code size overhead per instruction type

[Chart: increase in code size as a % of native C after each optimization step (simple peephole, improved peephole, stack caching, popped value caching, mark loops, constant shift), broken down into push/pop, mov(w), load/store, other, and total; average of 7 benchmarks.]

!42 Conclusion

‣ From a code size perspective, the interpreter can't be beaten, but it suffers from an often unacceptable performance penalty.
‣ Previous work on ahead-of-time compilation to native code improves performance, but at the expense of a large increase in code size, reducing the amount of code we can load in the limited memory available on a sensor node.
‣ Our optimizations further improve the performance of AOT-compiled bytecode and significantly reduce the code size overhead, leading to code that is only

• 68% slower than native C, and
• 88% larger than native C.

              | Performance     | Code size
C             | 1x              | 1x
JVM           | 10x-100x slower | 0.5x
AOT           | 4.27x           | 3.15x
Optimised AOT | 1.68x           | 1.88x

!43 Questions?

https://github.com/wukong-m2m/wukong-darjeeling/tree/aot-compiler

!44 Causes of JVM slowdown

For each JVM instruction:
1. Fetch
• Read the instruction from memory (from flash, not RAM!)
2. Decode
• Need to find the right label in a switch statement with about 200 cases
• probably uses a jump table
3. Execute
• Load operands from the stack into registers
• Do the operation
• Store the result back on the operand stack
4. Loop
• Update pc
• Should we terminate?

‣ Interpreter loop overhead
‣ Stack overhead
‣ Instruction set mismatch
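A minimal interpreter-loop sketch in C showing where the four overheads above come from; the opcode names and operand stack layout are illustrative, not Darjeeling's actual definitions:

#include <stdint.h>

enum { OP_IADD, OP_HALT /* ... about 200 cases in a real JVM ... */ };

int16_t operand_stack[64];

void interpret(const uint8_t *code /* in flash on a real MCU */) {
    uint16_t pc = 0, sp = 0;
    for (;;) {
        uint8_t op = code[pc++];        /* 1. fetch (from flash, not RAM) */
        switch (op) {                   /* 2. decode: ~200-way jump table */
        case OP_IADD: {                 /* 3. execute */
            int16_t b = operand_stack[--sp];
            int16_t a = operand_stack[--sp];
            operand_stack[sp++] = a + b;
            break;
        }
        case OP_HALT:
            return;                     /* 4. loop: update pc, check exit */
        }
    }
}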

!45 Causes of JVM slowdown

Native AVR: a += b;

LD r1, &a    (load a from RAM)
LD r2, &b    (load b from RAM)
ADD r1, r2
ST &a, r1    (store result back to RAM)

‣ Interpreter loop overhead
‣ Stack overhead
‣ Instruction set mismatch

Assuming a and b aren't in registers already, and a needs to be stored back to memory.

!46 Causes of JVM slowdown

JVM: a += b;

ILOAD_0     (push a)
ILOAD_1     (push b)
IADD        (pop, pop, push)
ISTORE_0    (pop, store a)

‣ Interpreter loop overhead
‣ Stack overhead
‣ Instruction set mismatch

Pushes and pops aren't to the real stack, but to the JVM operand stack: push a -> operand_stack[sp++] = a

!47 Causes of JVM slowdown

‣ Interpreter loop overhead
‣ Stack overhead
‣ Instruction set mismatch
• The JVM is very simple, but some things can't be expressed efficiently. For example, the lack of pointers slows down array processing:
• Native code would use the AVR's "auto-increment load" instruction LD Rd, X+: it loads the byte at the location pointed to by register X into Rd, and increments X by 1. 2 cycles.
• JVM: ALOAD, ILOAD, IALOAD, IINC
• Stack overhead: ALOAD and ILOAD
• IALOAD needs to calculate the address on every access

!48 Overview of Proposed optimizations

‣ Combined, they reduce ~80% of the remaining overhead
1. Push/pop overhead:
a) improved peephole optimizer
b) stack caching
2. Load/store overhead:
a) popped value caching
b) mark loops
3. Instruction set overhead:
constant shift optimization

Optimisation         | Overhead reduction | Overhead as % of native C
Baseline             |                    | 327%
improved peephole    | -86%               | 243%
stack caching        | -61%               | 182%
popped value caching | -45%               | 137%
mark loops           | -40%               | 97%
constant shift       | -29%               | 68%

!49 1a: improved peephole optimizer

‣ The earlier optimizer only replaces (blocks of) consecutive push/pop pairs
‣ More generally, a push/pop pair can be replaced iff:

• the target register for the pop is not used between the push and pop:

Improved peephole optimisation (a code sketch follows the table):

Before                                       | Cost              | After                                   | Cost
PUSH Rx; POP Rx                              | 4 bytes, 4 cycles | (removed)                               | 0 bytes, 0 cycles
PUSH Rsrc; <code not using Rdest>; POP Rdest | 4 bytes, 4 cycles | MOV Rdest, Rsrc; <code not using Rdest> | 2 bytes, 1 cycle
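A sketch of that generalized rule over a simplified native-instruction list; the Instr encoding and the uses_reg helper are assumptions for illustration:

/* Sketch of the improved peephole rule over emitted native code. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { int op; int rd; int rs; } Instr;  /* simplified encoding */
enum { I_PUSH, I_POP, I_MOV, I_NOP, I_OTHER };

/* Assumed helper: does this instruction read or write register r? */
bool uses_reg(const Instr *ins, int r);

void peephole(Instr *code, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (code[i].op != I_PUSH) continue;
        /* find the matching pop: the next pop at the same stack depth */
        for (size_t j = i + 1; j < n; j++) {
            if (code[j].op == I_PUSH) break;      /* nested push: give up */
            if (code[j].op != I_POP) continue;
            bool clobbered = false;               /* rule precondition:   */
            for (size_t k = i + 1; k < j; k++)    /* pop target unused    */
                if (uses_reg(&code[k], code[j].rd)) clobbered = true;
            if (!clobbered) {
                if (code[i].rd == code[j].rd) {
                    code[i].op = I_NOP;           /* PUSH Rx; POP Rx      */
                    code[j].op = I_NOP;           /* -> drop both         */
                } else {
                    code[j].rs = code[i].rd;      /* PUSH Rs; POP Rd      */
                    code[j].op = I_MOV;           /* -> MOV Rd, Rs        */
                    code[i].op = I_NOP;
                }
            }
            break;
        }
    }
}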

!50 1a: improved peephole optimizer

48 -> 41 cycles

!51 1b: stack caching

‣ M. A. Ertl, "Stack caching for interpreters," PLDI '95. Originally aimed at Forth interpreters.
• Usually the last pushed value will be popped again soon.
• Keep the top of the stack in registers, so we can avoid memory accesses.
• Only push to the real stack if we run out of registers.
• Keep a 'cache state' which tells how much of the stack is in registers.
• For interpreters, the stack state can't be made too complicated, or the overhead will be bigger than the gains; but for AOT it is only used at load time, to generate better code.

JVM: a += b;

In-memory stack          Stack caching
Stack slot 1: @SP        Stack slot 1: cpu R3
Stack slot 2: @SP+2      Stack slot 2: cpu R2
Stack slot 3: @SP+4      Stack slot 3: cpu R1
Stack slot 4: @SP+6      Stack slot 4: @SP
                         cache state = 3
ADD:                     ADD:
pop rA                   add rA, rB
pop rB                   cache state := 2
add rA, rB
push rA

!52 1b: stack caching

JVM: a += b; (assuming just 1 register available for stack caching)

Before (in-memory stack)            After (stack caching)
ILOAD_0: load a, push (a: @SP)      ILOAD_0: load a (a: cpu R1)
ILOAD_1: load b, push (b: @SP,      ILOAD_1: push a to spill (a: @SP),
         a: @SP+2)                           load b (b: cpu R1)
IADD: pop, pop, push (a+b: @SP)     IADD: pop a (a+b: cpu R1)
ISTORE_0: pop, store a              ISTORE_0: store a

!53 1b: stack caching

‣ AOT compiler implementation:

• add a 'cache manager' component (sketched after this list)

- uint8_t sc_getfreereg();

- uint8_t sc_pop();

- void sc_push(uint8_t reg);

• cache manager will use registers when possible, or emit a real push or pop if necessary

• code generation is now split between the old instruction generators, and the cache manager

• note that the cache manager is only needed at load time, so we can spend more time manipulating it
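A minimal sketch of such a cache manager in C, using the three calls listed above; the register numbering and the spill policy are assumptions, and the real component tracks more state:

#include <stdint.h>

#define NUM_CACHE_REGS 4          /* registers reserved for stack caching */
static uint8_t cache_regs[NUM_CACHE_REGS] = {10, 12, 14, 16};
static uint8_t depth = 0;         /* stack slots currently held in registers */

void emit_push(uint8_t reg);      /* assumed native-code emitters */
void emit_pop(uint8_t reg);

/* Return a register for a new stack slot, spilling the deepest
   cached slot to the real stack if all cache registers are in use. */
uint8_t sc_getfreereg(void) {
    if (depth == NUM_CACHE_REGS) {
        emit_push(cache_regs[0]);             /* spill deepest slot */
        for (uint8_t i = 1; i < depth; i++)   /* shift slots down   */
            cache_regs[i - 1] = cache_regs[i];
        depth--;
    }
    return cache_regs[depth];
}

void sc_push(uint8_t reg) {       /* value in 'reg' becomes top of stack */
    cache_regs[depth++] = reg;
}

uint8_t sc_pop(void) {            /* pop the top of stack into a register */
    if (depth == 0) {             /* nothing cached: emit a real pop      */
        emit_pop(cache_regs[0]);
        return cache_regs[0];
    }
    return cache_regs[--depth];
}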

!54 1b: stack caching

41 -> 31 cycles

Legend: '*': register is used by the current instruction; 'IntX': register is on the operand stack, at position X (higher = deeper)

!55 2a: popped value caching

‣ Stack caching removes most push/pop overhead, but not load/store overhead. Since each operand is popped off the stack, JVM code often repeatedly loads the same variables.
‣ Some loads may be unnecessary if the value is already in a register.
‣ We add a 'value tag' to the cache state of each register to remember which value is in it after it has been popped from the stack (see the sketch below).
• the 'value tag' indicates what value is currently held there, for example: "local short 0", or "constant int 42"
• we can eliminate loads if the required value is already in a register
• we need to clear the cache on every branch target (non-taken branches are ok)
- a branch target could be reached from two places, so we can't assume anything about the contents of the registers

Example (a += b):
ILOAD_0: load a (a: R1; the load is unnecessary if a is already in a register)
ILOAD_1: load b (b: R2)
IADD: (a+b: R1)
ISTORE_0: store a
After this fragment, a and b are no longer on the stack, but both are still in registers.
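A sketch of the value-tag bookkeeping described above; the tag encoding and register count are assumptions:

#include <stdint.h>
#include <string.h>

enum tag_kind { TAG_NONE, TAG_LOCAL_SHORT, TAG_CONST_INT };
typedef struct { enum tag_kind kind; int16_t id; } ValueTag;

#define NUM_REGS 16
static ValueTag tags[NUM_REGS];   /* one tag per cacheable register */

/* Record that 'reg' now holds local short variable 'local_idx'. */
void tag_local(uint8_t reg, int16_t local_idx) {
    tags[reg] = (ValueTag){TAG_LOCAL_SHORT, local_idx};
}

/* If some register already holds this local, reuse it and skip the load. */
int find_cached_local(int16_t local_idx) {
    for (int r = 0; r < NUM_REGS; r++)
        if (tags[r].kind == TAG_LOCAL_SHORT && tags[r].id == local_idx)
            return r;
    return -1;   /* not cached: a real load must be emitted */
}

/* Branch targets can be reached from multiple places, so no assumption
   about register contents may survive: clear every tag. */
void clear_tags_at_branch_target(void) {
    memset(tags, 0, sizeof tags);   /* TAG_NONE == 0 */
}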

!56 2a: popped value caching

31 -> 27 cycles

!57 2b: mark loops

‣ Popped value caching helps, but needs to reset its cache on every branch target.
• Inner loops are often the most performance critical.
• Each iteration will often use the same variables, but has to reload them every time: value tags get cleared at each branch target, so the cache needs to warm up again for each iteration.

1: BRTARGET(1)
2: …
3: …
4: …
5: …
6: …
7: GOTO 1

‣ We add an extra instruction to the VM: MARKLOOP, placed at the beginning and end of every inner loop (a sketch follows this list).
• MARKLOOP contains a list of variables used in the loop.
• The VM can decide to pin a number of variables to registers for the duration of the loop.
• A loop prologue and epilogue are added to load and store the variables, if they are live at the loop start or end respectively.
• Loads and stores now access those registers instead of main memory, but those registers are no longer available for normal stack caching.
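A sketch of how the translator might handle MARKLOOP; the instruction layout, liveness flags, and pinning policy are assumptions:

#include <stdint.h>

#define MAX_PINNED 4
static int16_t pinned_local[MAX_PINNED];  /* local variable index per slot */
static uint8_t num_pinned = 0;

void emit_load_local(uint8_t reg, int16_t local);   /* assumed emitters */
void emit_store_local(uint8_t reg, int16_t local);
uint8_t pin_register(uint8_t slot);   /* reserve a register for the loop */

/* At MARKLOOP(start): pin the listed locals and emit a prologue that
   loads the ones that are live on loop entry. */
void markloop_start(const int16_t *locals, uint8_t count,
                    const uint8_t *live_in) {
    num_pinned = count < MAX_PINNED ? count : MAX_PINNED;
    for (uint8_t i = 0; i < num_pinned; i++) {
        pinned_local[i] = locals[i];
        if (live_in[i])
            emit_load_local(pin_register(i), locals[i]);
    }
}

/* At MARKLOOP(end): emit an epilogue that writes back locals that are
   live after the loop, then release the registers for stack caching. */
void markloop_end(const uint8_t *live_out) {
    for (uint8_t i = 0; i < num_pinned; i++)
        if (live_out[i])
            emit_store_local(pin_register(i), pinned_local[i]);
    num_pinned = 0;
}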

!58 2b: mark loops

inner loop:
setup: 12 cycles
loop: 27 -> 17 cycles

!59 3: constant shifts

‣ Most transformations from JVM to more optimal native code are either too complex, or too specific and only help in very specific cases.
‣ Constant shifts, however, are very common: 6 out of our 7 benchmarks.
• They appear both as real constant shifts and as multiplication/division by powers of two.
• They are also easy to recognize and transform (a sketch follows):
- We replace all constant loads directly followed by a bit shift instruction with a special case.

Java:  a>>>1
JVM:   SLOAD_0
       SCONST_1   (unnecessary constant load)
       SUSHR      (implemented as a loop, which is inefficient if the number to shift by is already known)
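A sketch of the pattern match: when a constant load is directly followed by a shift, emit the shift unrolled instead of the generic loop. The opcode values and emit_* helpers are hypothetical; the LSR/ROR pairing is the standard AVR idiom for a 16-bit logical right shift:

#include <stdint.h>

enum {                       /* hypothetical opcode values */
    OP_SCONST_0 = 0x03,
    OP_SCONST_5 = 0x08,
    OP_SUSHR    = 0x7e
};

void emit_lsr(uint8_t reg);  /* assumed emitters: AVR LSR / ROR */
void emit_ror(uint8_t reg);
void emit_generic_shift(void);

void translate_shift(const uint8_t *code, uint16_t *pc,
                     uint8_t reg_hi, uint8_t reg_lo) {
    uint8_t op = code[*pc];
    if (op >= OP_SCONST_0 && op <= OP_SCONST_5 &&
        code[*pc + 1] == OP_SUSHR) {
        uint8_t n = (uint8_t)(op - OP_SCONST_0);  /* count known at load time */
        *pc += 2;                                 /* consume both opcodes     */
        for (uint8_t i = 0; i < n; i++) {         /* unrolled 16-bit >>>      */
            emit_lsr(reg_hi);                     /* bit 0 -> carry           */
            emit_ror(reg_lo);                     /* carry -> bit 7           */
        }
    } else {
        emit_generic_shift();                     /* fall back to the loop    */
    }
}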

!60 3: constant shifts

‣ Transformations from JVM to more optimal native code are often either too hard or too specific

‣ Constant shifts however, are very common: 6 out of our 7 benchmarks, and easy to recognise and transform

‣ We replace all constant loads directly followed by a bit shift with a special case, skipping the normal code generator.

‣ avr-gcc doesn’t optimize all cases

inner loop: 17 -> 8 cycles (only 2 more than native C)

!61 Mark loops trade off

‣ This optimization has a tradeoff: registers used to pin variables can’t be used for stack caching.

• More pinned registers: cheap access to more pinned variables

• Fewer free registers: the stack cache may have to spill to memory more often

!62 Summary

‣ A large number of IoT platforms is available for different needs.
‣ There is no single solution for all IoT applications.
‣ Choosing the right platform is critical.
‣ Knowing the platforms is the first step in designing IoT systems.
‣ How to program IoT applications across these types of platforms remains an open question.
