Worst-Case Execution Time C Analysis

Jan Gustafsson, Docent Mälardalen Real-Time Research Center (MRTC) Västerås, Sweden [email protected]

2

What C are we talking about? Program timing is not trivial!

! ”Computation time” - a key component int f(int x) { in the analysis of real-time systems return 2 * x; ! You have seen it in formulas such as: } Simpler questions Harder questions Worst-Case Worst-Case Response Time Period Execution Time ! What is the program ! What is the execution doing? time of the program?

! Will it always do the ! Will it always take the Ri = Ci + ∑ "Ri / Tj# Cj same thing? same time to execute? j∈hp(i) ! How important is ! How important is Where do these C values come from? the result? execution time?

3 4

Program timing WCET - definition

! Most computer programs have varying ! Worst-Case Execution Time = WCET execution time " The longest calculation time possible " For one program/task when run in isolation " Due to input values " Other interesting measures: BCET, ACET " Due to software characteristics " Due to hardware characteristics ! The goal of a WCET analysis is to derive ! Example: some timed program runs a safe upper bound on a program’s WCET BCET WCET Most runs have Is this the longest similar execution time execution time... safe safe possible lower upper Some take much ... or can we get execution times longer time (why?) timing timing #program runs #programruns even longer ones? bounds bounds

0 time 0 execution time

5 Presentation outline

! fundamentals ! WCET analysis Embedded " Measurements " Static analysis systems " Flow analysis, low-level analysis, and calculation " Hybrid approaches fundamentals ! WCET analysis tools ! The SWEET approach to WCET analysis ! (Multi-core + WCET analysis?) ! WCET analysis demo (SWEET)

Embedded computers Embedded systems everywhere

! An integrated part of a larger system ! Today, all advanced " Example 1: A microwave oven contains at products contain least one embedded embedded computers! " Example 2: A modern car can contain more than 100 embedded processors " Our society depends on that they function correctly ! Interacts with the user, the environment, and with other computers " Often limited or no user interface " Often with timing constraints input result

Embedded systems software Embedded system hardware

! Amount of software can vary from ! Huge variety of embedded extremely small to very large system processors " Gives characteristics to the product " Not just one main processor type as for PCs ! Often developed with target " Additionally, same CPU can be used with various hardware in mind hardware configurations (memories, devices, …) " Often limited resources (memory / speed) ! The hardware is often tailored " Often direct accesses to different HW devices specifically to the application " Not always easily portable to other HW " E.g., using a DSP processor for signal ! Many different programming languages processing in a mobile telephone " C still dominates, but often special purpose languages ! Cross-platform development ! Many different software development tools " E.g., develop on PC and download " Not just GCC and/or Microsoft Visual Studio final application to target HW

11 Some interesting figures Real-time systems

! 4 billion embedded processors sold in 2008 ! Computer systems where the timely " Global market worth €60 billion behavior is a central part of the function " Predicted annual growth rate of 14% " Forecasts predict more than 40 billion embedded devices in 2020 " Containing one or more embedded computers ! Embedded processors clearly dominate yearly " Both soft- and hard real-time, or a mixture… Timing of radio Timing of music production communication, motor playing from MP3 file control, rudder and flaps control,…

Timing of network communication, motor Timing of radio control, ABS brakes, "Desktop" communication, "Embedded" anti-slip control,… 2% speech 98% recognition,…

Source: http://www.artemis-ju.eu/embedded_systemsTimely Software - a Challenge 13

Uses of reliable WCET bounds

! Hard real-time systems " WCET needed to guarantee behavior ! Real-time scheduling " Creating and verifying schedules WCET " Large part of RT research assume the existence of reliable WCET bounds analysis ! Soft real-time systems " WCET useful for system understanding ! Program tuning " Critical loops and paths ! Interrupt latency checking

Obtaining WCET bounds Measuring for the WCET

!Measurement !Methodology: " Industrial practice " Determine ”worst-case input” or run as many inputs as possible " Execute and measure !Static analysis the time " Research front " Add a safety margin

18 Measurement issues 1 Measurement issues 2

! What is the worst-case input? ! How to measure the execution time? " In general, the problem of determining the worst case input " Option 1: SW methods value combination to an arbitrary program is very hard. ! Operating system clocks ! Alternative: run all inputs? ! Simulators ! High-water marking " Typically not possible, since the number of input combinations typically is huge. " Option 2: HW + SW methods " For example: 10 variables of size 32 bits => number of ! Add instrumentation code 10 ! Use oscilloscopes, logic analyzers, emulators necessary measurement runs = 4 294 967 295 Buzzer logic analyzers or debug support " Also keep in mind that the program state is a part of the input ! The instrumentation may affect the ! In practice: run as many inputs as possible timing ! Instrumentation code is often left in " There are some ideas how to test extreme cases and corners shipped code " Has the worst-case path really been taken? No guarantee! ! How much instrumention output can be handled? LEDs

19 20

SW measurement methods Using an oscilloscope ! Operating system clocks ! Common equipment for HW debugging " Commands such as time, date and clock " Used to examine electrical output " Note that all OS-based solutions require signals of HW precise HW timing facilities (and an OS) ! Observes the voltage or signal ! Cycle-level simulators waveform on a particular pin " Software simulating CPU " Usually only two to four inputs

" Correctness vs. hardware? ! To measure time spent in a routine: ! High-water marking 1. Set I/O pin high when entering routine " Keep system running 2. Set the same I/O pin low before exiting " Record maximum time 3. Oscilloscope measures the amount of observed for task time that the I/O pin is high 4. This is the time spent in the routine " Keep in shipping systems, read at service intervals

21 22

Using a logic analyzer HW measurement tools Target ! Equipment designed for ! In-circuit emulators (ICE) board troubleshooting digital " Special CPU version revealing internals hardware ! High visibility & bandwidth ! High cost + supportive HW required ! Have dozens or even HW Debugger hundreds of inputs ! Processors with debug support " Each one keeping track on " Designed into processor whether the electrical signal ! Use a few dedicated processor pins it is attached to is currently at logic level 1 or 0 " Using standardized interfaces " Result can be displayed ! Nexus debug interfaces, JTAG, against a timeline Embedded Trace Macrocell, … " Can be programmed to start " Supportive SW & HW required capturing data at particular input patterns " Common on modern chip

23 24 Fundamental problems when using measurements ! Measured time never exceeds WCET BCET WCET Static safe safe possible lower upper execution times WCET timing timing bounds bounds 0 time analysis Measurement will result Only this measurement in a value ≤ WCET is safe (= WCET)! How do we know that we catched the WCET? " A safety margin must be added, but how much is enough?

Again: Causes of Static WCET analysis Execution Time Variation

! Do not run the program – analyze it! ! Execution characteristics foo(x,i): " Using models based on the static properties of the of the software while(i < 100) software and the hardware if (x > 5) then " A program can often execute ! Guaranteed safe WCET bounds x = x*2; in many different ways else " Provided all models, input data and analysis methods are correct " Input data dependencies x = x+2; end " Application characteristics ! Trying to be as tight as possible if (x < 0) then BCET WCET ! Timing characteristics b[i] = a[i]; end All derived of the hardware safe i = i+1; possible safe bounds will lower " Clock frequency end execution times upper be ≥ WCET timing timing " CPU characteristics bounds bounds " Memories used 0 time " …

WCET analysis phases

program 1. Flow analysis " Bound the number of times Flow different program parts may analysis be executed (mostly SW analysis) Compiler Flow 2. Low-level analysis Object Low level Code analysis " Bound the execution time of different program parts analysis (combined SW & HW analysis) Target Calculation Hardware 3. Calculation " Combine flow- and low-level Actual WCET analysis results to derive an WCET bound upper WCET bound Reality Analysis

30 Flow Analysis The control-flow graph

foo() Program ! Provides bounds on the number foo(x,i): A: while(i < 100) of times different program parts A B: if (x > 5) then Each block Flow may be executed will run as a analysis " Valid for all possible executions C: x = x*2; B unit else ! Examples of provided info: Low level D: x = x+2; C D " Bounds of loop iterations analysis end " Bounds on recursion depth E: if (x < 0) then E Flows as " Infeasible paths edges Calculation F: b[i] = a[i]; F ! Info provided by: end " Static program analysis G: i = i+1; WCET G Estimate " Manual annotations end end 31

Flow info characteristics foo() Example: Loop bounds Loop bound: 100 foo(x,i): A Structurally possible foo(x,i): ! Loop bound: executions (infinite) A: while(i < 100) A: while(i < 100) " Depends on possible values B: if (x > 5) then B Basic finiteness B: if (x > 5) then of input variable i C: x = x 2; * C: x = x*2; ! E.g. if 1 ≤ i ≤ 10 holds for input else C D Statically allowed value i, then loop bound is 100 else D: x = x+2; Actual feasible " In general, a very difficult D: x = x+2; end E paths problem E: if (x < 0) then #F < 10 end " However, solvable for many F: b[i] = a[i]; F E: if (x < 0) then types of loops end F: b[i] = a[i]; ! Requirement for basic G: i = i+1; G WCET found here = end end WCET found here = desired result finiteness (a WCET) overestimation G: i = i+1; " All loops must be Example program end end upper bound Control flow graph Relation between possible executions and flow info

Example: Infeasible path Example: Triangular Loop

foo(x,i): ! Infeasible path: ! Two loops: A: while(i < 100) " Loop A bound: 100 " Path A-B-C-E-F-G B: if (x > 5) then can not be executed triangle(a,b): " Local B bound: 100 C: x = x*2; since C implies ¬F A: loop(i=1..100) ! Block C: else " If (x > 5) then it is not B: loop(j=i..100) " By loop bounds: D: x = x+2; 100 * 100 = 10 000 possible that (x*2) < 0 C: a[i,j]=... end " But actually: E: if (x < 0) then ! Limits statically end loop 100+...+1 = 5 050 F: b[i] = a[i]; allowed executions end loop ! Limits statically end " Might tighten the allowed executions G: i=i+1; WCET estimate " Might tighten the end WCET estimate

36 The mapping problem The mapping problem (cont)

! Flow analysis easier on source code level ! Embedded compilers often do a lot of code optimizations " Semantics of code clearer " Important to fit code and data into limited memory resources " Easier for programmer/tool to derive flow info ! Optimizations may significantly change code (and data) layout ! Low-level analysis requires binary code " After optimizations flow info may no longer be valid " The code executed by the processor ! Solutions: ! Question: How to safely map flow source code level " Use special compiler also mapping flow info (not common) flow information to binary code? " Use compiler debug info for mapping (only works with little/no optimizations) " Perform flow analysis on binaries (most common) int$i=0;$ ...$ Flow analysis: ...$$ 0111111010010111$ Where is $int$i=0;$ $int$i=0;$ Loop condition Loop bound while(i<100)$ 0110010100101001$ the loop? $...$$ $...$$ taken 101 times $while(i<100)${$ $do${$ (header): 101 {$ 1001010100111010$ Loop condition $$...$$ 1001010011111110$ $$$...$$ $$$...$$ Compiler: i=0 taken 100 times $$i++;$ 1010010101010100$ always holds at $$$i++;$ $$$i++;$ }$ 1001010101010101$ first execution $}$ $}$while(i<100)$ ...$ ...$ of loop condition $...$ $...$ Source code Executable Before optimization After optimization

38

int twice(int a) { int temp; temp = 2 * a; Embedded SW Tool Chain The SW building tools return temp; Affects } timingAffects C Runtime timing Affects ! The compiler: Affects timing C Source Affects " Translates an source code file to an object code file timing Object File C Library timing ! Only translates one source code file at the time C Source " Often makes some type of code optimizations Compiler Object File OS ! Increase execution speed, reduce memory size, … C Source !Different optimizations give different object code layouts Object File Linker Start-up twice: ! The linker: mov ip, sp Affects stmfd sp!, {fp,ip,lr,pc} timing Executable " Combines several object code files into one executable sub fp, ip, #4 ! Places code, global data, stack, etc in different memory parts sub sp, sp, #8 str r0, [fp, #-16] ! Resolves function calls and jumps between object files 1000110000011110 WCET ldr r3, [fp, #-16] " Can also perform some code transformations mov r3, r3, asl #1 1000110000100000 str r3, [fp, #-20] 1010110000100000 ldr r3, [fp, #-20] 1010110000011110 Affects ! Both tools may affect the program timing! mov r0, r3 1010110000100011 timing ldmea fp, {fp,sp,pc} 1010111100011001

40

Example: compiling & linking Common additional files

/****************** ! C Runtime code: * File: main.c Contains object *****************/ code for main.c " Whatever needed but not supported by the HW ! 32-bit arithmetic on a 16-bit machine int foo(); Compiler main.o Object code contains an ! Floating-point arithmetic unresolved call to foo int main() { ! Complex operations (e.g., modulo, variable-length shifts) return 1 + foo(); " } Comes with the compiler Linker a.exe " May have a large footprint ! Bigger for simpler machines /****************** ! Tens of bytes of data and tens of kilobytes of code

* File: foo.c The main.o and *****************/ foo.o object code Compiler foo.o files are combined ! OS code: int foo() { " In many ES the OS code is linked together with the rest return 1; The call to foo of the object code files to form a single binary image } Contains object in main has code for foo.c been resolved

41 42 Common additional files

! Startup code: " A small piece of assembly code that prepares the way for the execution of software written in a high-level language ! For example, setting up the system stack Low-level " Many ES compilers provide a file named startup.asm, crt0.s, … holding startup code analysis ! C Library code: " A full ANSI-C compiler must provide code that implements all ANSI-C functionality ! E.g., functions such as printf, memmove, strcpy " Many ES compilers only support subset of ANSI-C " Comes with the compiler (often non-standard)

43

Low-Level Analysis Some HW model details

! Determine execution time bounds ! Much effort required to safely model CPU internals Program for program parts " Pipelines, branch predictors, superscalar, out-of-order, … ! Much effort to safely model memories " Focus of most WCET-related research Flow " memories must be modelled in detail analysis ! Using a model of the target HW " Other types of memories may also affect timing " The model does not need to model all ! For complex CPUs many features must be Low level analysis HW details analyzed together " However, it should safely account for " Timing of instructions gets history dependant all possible HW timing effects ! Developing a safe HW timing model troublesome Calculation ! Works on the binary, linked code " May take many months (or even years) " All things affecting timing must be accounted for WCET " The executable program Estimate

Hardware time variability Hardware time variability

! Simpler 4-, 8- & 16-bit processors (H8300, 8051, …): ! Advanced 32- & 64-bit processors (PowerPC 7xx, " Instructions might have varying execution time due to Pentium, UltraSPARC, ARM11, …): argument values " Many performance enhancing features affect timing " Varying data access time due to different memory areas ! Pipelines, out-of-order exec, branch pred., caches, " Analysis rather simple, timing fetched from HW manual speculative exec. ! Instruction timing gets very history dependent ! Simpler 16- & 32-bit processors, with a (scalar) pipe- " Some processors suffer from timing anomalies line and maybe a cache (ARM7, ARM9, V850E, …): ! E.g., a cache miss might give shorter overall " Instruction timing dependent on previously program execution time than a cache hit executed instructions and accessed data: " Features and their timing interact ! State of and cache ! Most features must be analyzed together " Varying access times due to cache hits and misses " Hard to create a correct and safe " Varying pipeline overlap between instructions hardware timing model! " Hardware features can be analyzed in isolation ! Multi-cores - discussed later Example: CPU pipelines CPU pipelines ! Idea: Overlap the CPU stages of the instruct- ! Observation: Most instructions go through ions to achieve speed-up same stages in the CPU 1 2 3 4 5 6 7 8 9 10 ! No pipelining: IF ! Example: Classic RISC 5-stage pipeline " Next instruction ID cannot start before EX previous one has MEM IF ID EX MEM WB finished all its stages WB ! Pipelining: 1 2 3 4 5 6 Instruction decode Memory access IF Determine operation to be Load/store values " In principle: speedup = pipeline length performed (i.e., extract from/to memory if ID opcode and arguments) needed " However, often dependencies EX between instructions MEM Instruction fetch (IF) Example: RAW Execute Write back Get the next instruction WB Perform the actual Write the result into dependency I1. add $r0, $r1, $r2 from memory to operation (e.g, an add) the target register (its address is held by PC) I2 depends on I2. sub $r3, $r0, $r4 May cause completion of pipeline stall data write of I1 49 50

Pipeline Variants Example: No Pipeline

! None: Simple CPUs (68HC11, 8051, …) foo(x,i): ! Scalar: Single pipeline (ARM7,ARM9,, …) A: while(i < 100) (7 cycles) B: if (x > 5) then (5 c) ! VLIW: Multiple pipelines, static, compiler C: x = x*2; (12 c) " Constant time scheduled (DSPs, , Crusoe, …) else for each block ! Superscalar: Multiple pipelines, out-of-order D: x = x+2; (2 c) in the code end (PowerPC 7xx, Pentium, UltraSPARC, ...) " Object code E: if (x < 0) then (4 c) 1 2 3 4 5 6 7 8 9 10 11 Blue instruction not shown IF occupies EX stage F: b[i] = a[i]; (8 c) for 2 extra cycles ID end EX This stalls both G: i = i+1; (2 c) MEM subsequent instructions end WB

51 52

Example: No pipeline Example: Simple Pipeline

foo() 1 2 3 4 5 6 7 foo(x,i): foo(x,i): A IF tA = 7 A: while(i < 100) A: while(i < 100) EX t =7 A A δAB = -2 M B: if (x > 5) then B: if (x > 5) then F C: x = x 2; C: x = x 2; B t = 5 * B tB=5 * B 1 2 3 4 5 else else IF t =12 C D t =2 EX D: x = x+2; C D D: x = x+2; M end end F E: if (x < 0) then tE=4 E E: if (x < 0) then 1 2 3 4 5 6 7 8 9 10 IF F: b[i] = a[i]; F: b[i] = a[i]; EX tAB = 10 F t =8 end F end M F G: i = i+1; G: i = i+1; G tG=2 end end δAB = 10 - (7 + 5) = -2 end Example: Pipeline result Pipeline Interactions

foo() IF foo(x,i): EX Pairwise overlap: speed-up A: while(i < 100) M tA=7 A F B: if (x > 5) then δAB=-2 IF IF δGA=-1 C: x = x*2; t =5 EX EX B B M M else δBC=-2 δBD=-1 IF F IF F t =12 EX EX D: x = x+2; C C D tD=2 M M =-1 =-2 F F end δCE δDE IF IF t =4 E: if (x < 0) then E E EX EX M M δEF=-2 F F F: b[i] = a[i]; δEG=-1 F tF=8 end IF δFG=-1 Interaction across more than G: i = i+1; EX G tG=2 M two blocks also possible! end F Can be both speed-up or slow-down end 55 56

The Example: Cache analysis

The CPU executes fib: What instructions will CPU instructions. It also needs Caches increase mov #1, r5 cause cache misses? to access data to perform CPU average speed, but give more variable mov #0, r6 Cache operations upon Faster memory execution time access mov #2, r7 Cache misses takes time br fib_0 much more time Caches store Cache than cache hits! frequently used fib_1: instructions and data memory Main memory (for faster access) mov r5,r8 Many variants exists: add r6,r5 instruction caches, data caches, mov r8,r6 ! Performed on the unified caches, add #1,r7 cache hierarchies, … fib_0: object code cmp r7,r1 Main memory has larger Main memory Larger ! Only direct-mapped storage capacity but storage bge fib_1 capacity much longer access fib_2: instruction cache in time than caches mov r5,r1 this example jmp [r31]

58

Example: Cache analysis Example: Cache analysis Size of instruction Starting fib: address fib: mov #1, r5 2 1000 mov #1, r5 2 1000 mov #0, r6 2 1002 mov #0, r6 2 1002 mov #2, r7 2 1004 ! Information mov #2, r7 2 1004 br fib_0 2 1006 needed for br fib_0 2 1006 fib_1: fib_1: mov r5,r8 2 1008 instruction mov r5,r8 2 1008 add r6,r5 2 1010 add r6,r5 2 1010 ! Mapping to mov r8,r6 2 1012 cache mov r8,r6 2 1012 add #1,r7 2 1014 analysis add #1,r7 2 1014 instruction fib_0: fib_0: cmp r7,r1 2 1016 cmp r7,r1 2 1016 cache bge fib_1 2 1018 bge fib_1 2 1018 fib_2: fib_2: mov r5,r1 2 1020 mov r5,r1 2 1020 jmp [r31] 2 1022 jmp [r31] 2 1022

59 60 Example: Cache analysis Example: Cache analysis

fib: fib: mov #1, r5 miss mov #1, r5 miss mov #0, r6 hit mov #0, r6 hit mov #2, r7 hit mov #2, r7 hit br fib_0 hit First br fib_0 hit fib_1: iterationfib_1: of mov r5,r8 miss the loop mov r5,r8 miss hit add r6,r5 hit add r6,r5 hit hit mov r8,r6 hit mov r8,r6 hit hit add #1,r7 hit add #1,r7 hit hit fib_0: fib_0: cmp r7,r1 miss cmp r7,r1 miss hit bge fib_1 hit bge fib_1 hit hit fib_2: fib_2: mov r5,r1 mov r5,r1 hit jmp [r31] jmp [r31] hit Remaining iterations 61 62

Cache & Pipeline analysis Analysis of complex CPUs

! Pipeline analysis might ! Example: Out-of-order processor foo: " Instructions may executes in take cache analysis info RS holds pending parallel in functional units instructions results as input foo_0: " Functional units often replicated info " Dynamic scheduling of If all operands " Instructions gets annotated instructions and the FU are foo_1: ready instr. in with cache hit/miss info " Do not need to follow RS is put in FU issuing order " These misses/hits 1032: cmp r6,r1 foo_2: ! Very difficult analysis 1034: blt foo_5 info affect pipeline timing " Track all possible pipeline foo_3: states, iterate until fixed point ! Complex HW requires info " Require integrated pipeline/icache FUs and CU forward results /dcache/branch-prediction analysis back to RSs integrated cache & 1020:icache miss foo_4: info ! Been done for PowerPC 755 1022:icache hit pipeline analysis " Up to 1000 states per instruction! foo_5: info

63

Low-level analysis correctness?

! Abstract model of the hardware is used ! Modern hardware often very complex " Combines many features " Pipelining, caches, branch prediction, out-of-order... Calculation ! Have all effects been ? accounted for? " Manufactures keep hardware internals secret " Bugs in hardware manuals " Bugs relative hardware specifications

65 Example: Combined flow analysis Calculation and low-level analysis result

foo() Program ! Derive an upper bound on the foo(x,i): ”Loop bound: 100” program’s WCET A: while(i < 100) tA=7 A " Given flow and timing information B: if (x > 5) then Flow C: x = x*2; analysis ! Several approaches used: B tB=5 " Tree-based else D: x = x+2; t =12 C D t =2 Low level " Path-based C D analysis end " Constraint-based (IPET) ”C and F can’t be E: if (x < 0) then t =4 E ! Properties of approaches: E taken together” Estimate F: b[i] = a[i]; calculation " Flow information handled end F tF=8 " Object code structure allowed G: i = i+1; WCET " Modeling of hardware timing end G tG=2 Estimate " Solution complexity end

Tree-Based Calculation Tree-Based Calculation foo(x): foo(x): A: loop(i=1..100) A: loop(i=1..100) (7 c) ! Use syntax-tree B: if (x > 5) then ! Use constant B: if (x > 5) then (5 c) C: x = x*2 C: x = x*2 (12 c) of program else time for nodes else D: x = x+2 D: x = x+2 (2 c) end end ! Traverse tree foo ! Leaf nodes have foo E: if (x < 0) then E: if (x < 0) then (4 c) () bottom-up F: b[i] = a[i]; definite time F: b[i] = a[i]; (8 c) end end loop G: bar (i) ! loop : 100 G: bar (i) (20 c) end loop Rules for () end loop internals

header if(x>5) if(x<0) bar(i) header if(x>5) if(x<0) bar(i) (7 ) (5 ) (4) (20)

x=x/2 x=x+2 b[i]=a[i] x=x/2 x=x+2 b[i]=a[i] (12) (2) (8) Worst-Case Execution Time Analysis 2769 maj 2015 Worst-Case Execution Time Analysis 2770 maj 2015

Tree-Based: IF statement Tree-Based: LOOP foo(x): foo(x): A: loop(i=1..100) A: loop(i=1..100) ! For a decision B: if (x > 5) then ! Loop: sum the B: if (x > 5) then C: x = x*2 C: x = x*2 statement: max else children else D: x = x+2 D: x = x+2 end end of children foo ! Multiply by loop foo E: if (x < 0) then E: if (x < 0) then () () ! Add time for F: b[i] = a[i]; bound F: b[i] = a[i]; end end decision loop : 100 G: bar (i) loop : 100 G: bar (i) () end loop ∑ 56 * 100 end loop itself

header if(x>5) if(x<0) bar(i) header if(x>5) if(x<0) bar(i) (4) 12 (20) (4) 12 (20) (7 ) (5) ∑ 17 ∑ (7 ) (5) ∑ 17 ∑

x=x/2 x=x+2 b[i]=a[i] x=x/2 x=x+2 b[i]=a[i] (12) (2) (8) (12) (2) (8) Worst-Case Execution Time Analysis 2771 maj 2015 Worst-Case Execution Time Analysis 2772 maj 2015 foo(x,i): A: while(i < 100) Tree-Based: Final result Path-Based Calc B: if (x > 5) then C: x = x*2; foo(x): else A: loop(i=1..100) foo() D: x = x+2; ! The function B: if (x > 5) then end C: x = x*2 E: if (x < 0) then else tA=7 A foo() will take F: b[i] = a[i]; D: x = x+2 end end 5600 cycles in foo B tB=5 G: i = i+1; E: if (x < 0) then ∑ 5600 end the worst case F: b[i] = a[i]; t =12 C D t =2 end C D ! Find longest path loop : 100 G: bar (i) ∑ 56 * 100 end loop " One loop at a time tE=4 E ! Prepare the loop if(x<0) bar(i) F t =8 header if(x>5) F " (4) 12 (20) Remove back edges (7 ) (5) ∑ 17 ∑ G t =2 " Redirect to special G continue nodes x=x/2 x=x+2 b[i]=a[i] (12) (2) (8) end continue Worst-Case Execution Time Analysis 2773 maj 2015

foo(x,i): A: while(i < 100) Path-Based Calculation Path-Based Calc B: if (x > 5) then C: x = x*2; else foo() foo() D: x = x+2; ! Longest path: end E: if (x < 0) then tA=7 A tA=7 A C and F can " A-B-C-E-F-G F: b[i] = a[i]; never execute " 7+5+12+4+8+2= together end B t =5 B t =5 B 38 cycles B G: i = i+1; end tC=12 C D tD=2 tC=12 C D tD=2 ! Infeasible path: ! Total time: " A-B-C-E-F-G tE=4 E tE=4 E " Ignore, look for next " 100 iterations

F tF=8 " 38 cycles per iteration F tF=8 " Total: 3800 cycles G tG=2 G tG=2

end continue end continue

foo(x,i): A: while(i < 100) Path-Based Calc B: if (x > 5) then Example: IPET Calculation C: x = x*2; else X foo() D: x = x+2; foo foo() end ! IPET = Implicit path X t =7 fooA XGA E: if (x < 0) then A A tA=7 A C and F can F: b[i] = a[i]; enumeration technique XA never execute XAB together end tB=5 B t =5 " Execution paths not B B G: i = i+1; X XB explicitly represented BC X end t =12 BD t =2 C C D D tC=12 C D tD=2 ! Infeasible path: X X ! Program model: C X D " A-B-C-E-F-G XCE DE tE=4 E " Ignore, look for next " Nodes and edges E tE=4 XE XEF ! New longest path: " Timing info (tentity) t =8 XEG F F F tF=8 " A-B-C-E-G ! Node times: basic blocks XF " 30 cycles ! Edge times: overlap XFG G t =2 G tG=2 ! Total time: G " Execution count (Xentity) " Total: 3000 cycles XG X end continue end end IPET Calculation IPET Calculation

X foo() foo() X =1 X =1 foo foo ! foo X WCET= fooA XGA !Solution methods: Xend=1 A A X XA=100 max (X t ) A X " Integer linear programming Σ entity * entity AB XA=XfooA+XGA B " Constraint satisfaction B XB=100 " Where each Xentity X XB X +X =X BC satisfies constraints AB Aend A XBD C D !Solution: C D X =0 X +X =X X X XC=100 D BC BD B C X D ! Constraints: XCE DE " Counts for XE=XCE+XDE E X E XE=100 " Start & end condition E nodes and edges WCET=3000 XEF " Program structure XA<=100 XEG F " A WCET bound F X =0 X F " Loop bounds X F X +X <=X FG C F A G G " Other flow information XG=100 XG X end end Xend=1 end

Hybrid methods

! Combines measurement and static analysis ! Methodology: Hybrid " Partition code into smaller parts " Identify & generate instrumentation points (ipoints) for code parts methods " Run program and generate ipoint traces " Derive time interval/distribution and flow info for code parts based on ipoint traces " Use code part’s time interval/distribution and flow info to create a program WCET estimate ! Basis for RapiTime WCET analysis tool!

Example: loop bound derivation Example: function time derivation int foo(int x) { Instrumentation code int foo(int x) { ! Example trace: write_to_port(’A’); write_to_port(’A’,TIME); ! 3 example traces: " ,, int i = 0; int i = 0; ,, while(i < x) { " Run1: ABBBABBBBA while(i < x) { , write_to_port(’B’); " Run2: ABBAAABBA i++; i++; " Run3: ABBBBBBA } ! Result (based on } write_to_port(’B’,TIME); } ! Result (based on } provided trace): provided traces): " Min time foo: 84 Instrumentation code (156-72=84) " Lower loop bound: 0 Instrumentation Realized as a short " code extended Max time foo: 190 Valid for an " Upper loop bound: 6 assembler snippet with TIME macro (2191-2001=190) entry of foo()

83 84 Notes: Hybrid methods

# Testing and instrumentation already used in industry! " Known testing coverage criteria can be used # No hardware timing model needed! WCET " Relatively easy to adapt analysis to new hardware targets – Is the resulting WCET estimate safe? " Have all costly software paths been executed? analysis " Have all hardware effects been provoked/captured? – How much do instrumentation affect execution time? tools " Will timing behavior differ if they are removed? " Often constraints on where instrumentation points can be placed " Often limits on the amount of instrumentation points possible " Often limits on the bandwidth available for traces extraction – Are task /interrupts detected? " If not, derived timings may include them!

85

WCET Analysis Tools The Bound-T WCET tool ! A commercial WCET analysis tool ! Several more or less complete tools " Provided by Tidorum Ltd, www.tidorum.fi ! Commercial tools: " Decodes instructions, construct CFGs, " aiT from AbsInt call-graphs, and calculates WCET from " Bound-T from TidoRum the executable " RapiTime from Rapita Systems ! A variety of ! Research tools: CPUs supported: " SWEET – Swedish Execution Time tool " Including the " OTAWA (France) Renesas H8/3297 " TUBound (Austria) " Porting made as MSc " Chronos (Singapore) thesis project at MDH

87

WCET tool differences Supported CPUs (2008)

! Used static and/or hybrid methods Tool Hardware platforms ! User interface aiT Motorola PowerPC MPC 555, 565, and 755, Motorola ColdFire MCF " Graphical and/or textual 5307, ARM7 TDMI, HCS12/STAR12, TMS320C33, C166/ST10, Renesas M32C/85, Infineon TriCore 1.3 ! Flow analysis performed Bound-T Intel-8051, ADSP-21020, ATMEL ERC32, Renesas H8/300, " Manual annotations supported ATMEL AVR and ATmega, ARM7 ! How the mapping problem is solved RapiTime Motorola PowerPC family, HCS12 family, ARM, NECV850, MIPS3000 " Decoding binaries SWEET ARM9, NECV850E " Integrated with compiler Heptane Pentium1, StrongARM 1110, Renesas H8/300 ! Supported processors and compilers Vienna M68000, M68360, Infineon C167, PowerPC, Pentium ! Low-level analysis performed Florida MicroSPARC I, Intel Pentium, StarCore SC100, Atmel Atmega, PISA/ MIPS " Type of hardware features handled Chalmers PowerPC ! Calculation method used

89 90 Industrial usage

! Static/hybrid WCET analysis are today used in real industrial settings ! Examples of industrial usage: The SWEET " Avionics: Airbus, aiT approach to " Automotive: Ford, aiT " Avionics: BAE Systems, RapiTime WCET analysis " Automotive: BMW, RapiTime " Space systems: SSF, Bound-T ! However, most companies are still highly unaware of the concepts of “WCET analysis” and/or “schedulability analysis”

91

The MDH WCET project Technology transfer to industry (and academia) ! Researching on static WCET analysis " Developing the SWEET (SWEdish ! Evaluation of WCET analysis in industrial settings Execution Time) analysis tool " Targeting both WCET tool providers and industrial users ! Research focus: " Using state-of-the-art WCET analysis tools ! Applied as MSc thesis works: " Flow analysis " Enea OSE, using SWEET & aiT " Technology transfer to industry " Volcano Communications, using aiT " International collaboration " Bound-T adaption to Lego Mindstorms and " Parametrical WCET analysis Renesas H8/300. Used in MDH RT courses " Early-stage WCET analysis* " CC-Systems, using aiT & measurement tools " WCET analysis for multi-core* " Volvo CE using aiT & SWEET ! Previous research focus: " …. " Low-level analysis ! Articles and MSc thesis reports " Calculation available on the MRTC web

* = new project activities

Flow analysis Where SWEET comes in… C Runtime ! Main focus of the MDH WCET analysis group " Motivated by our industrial case studies C Library C Source Object File ! We perform many types of advanced x = 1..10 program analyses: C Source Compiler Object File OS " Program slicing (dependency analysis) x > 5 Object File " Value analysis (abstract interpretation) x = 1..4 C Source Linker Other Lib " Abstract execution ALF x = 6..10 A B Input value Binary ... Executable constraints ALF reader ! Both loop bounds and x < 3 infeasible paths are derived LOW-SWEET Hardware C D WCET ! Analysis made on x = 1..2 Flow Low-level analysis analysis x = 3..8 Calculation WCET ALF intermediate code E Path A-C is Estimate " ~ “high level assembler” infeasible! SWEET

95 Slicing for flow analysis Value analysis

! Observation: some variables and statements ! Based on abstract interpretation (AI) do not affect the execution flow of the program " Calculates safe approximations of possible values = they will never be used to determine the outcome of conditions for variables at different program points ! Idea: remove variables and statements which are " E.g. interval analysis gives i = [5..100] at p i=5; guaranteed to not affect execution flow " E.g. congruence analysis gives i = 5 + 2* at p max=100; " Subsequent flow analyses should provide same result but with shorter analysis time ! Builds upon well known while(i<=max) { ! Based on well-known program slicing techniques program analysis techniques // point p " Reduces up to 94% 1. a[0] = 42; 1. " Used e.g. for checking array bound violations i=i+2; of total program 2. i = 1; 2. i = 1; } size for some of 3. j = 5; 3. j = 5; ! Requires abstract versions of all our benchmarks 4. n = 2 * j; 4. n = 2 * j; 5. while (i <= n) { 5. while (i <= n) { ALF instructions 6. a[i] = i * i; 6. " These abstract instructions work on abstract values 7. i = i + 2; 7. i = i + 2; (representing set of concrete values) instead of normal ones 8. } 8. }

Loop bound analysis by AI Abstract Execution (AE)

! Observation: the number of possible program states within a loop provides a loop bound !Derives loop bounds and infeasible paths " Assuming that the loop terminates !Based on Abstract Interpretation (AI) ! Loop bound = product of possible i=5; " AI gives safe (over)approximation of possible values values of variables within the loop max=99; of each variable at different program points " Each variable can hold a set of values ! Example: while(i<=max) { i = [1..4] " Interval analysis gives // point p !“Executes” program using abstract values i = [5..100] and max=[100..100] at p i=i+2; " Not using traditional AI fixpoint calculation " Congruence analysis gives } i = 5 + 2* and max=100+0* at p !Result: an (over)approximation of the " The produce of possible values become: possible execution paths size(i) * size(max) = ((100-5)/2) * (100-100)/1) = 45 * 1 = 45 which is an upper loop bound " All feasible paths will be included in the result ! Analysis bounds some but not all loops " Might potentially include some infeasible paths " Infeasible paths found are guaranteed to be infeasible

99

Loop bound analysis by AE

i = INPUT; Loop Abstract Abstract [1..4] iteration state at p state at q // i = [1..4] [3..6] 1 i = [1..4] while(i < 10) { ┴ [5..8] // point p 2 i = [3..6] ┴ [7..9] [10..10] Multi-core [9..9] [10..11] ... 3 i = [5..8] ┴ [11..11] i = i + 2; 4 i = [7..9] i = [10..10] + WCET } 5 i = [9..9] i = [10..11] Result // point q 6 ┴ i = [11..11] Min iterations: 3 Max iterations: 5 analysis? ! Result includes all possible loop executions ! Three new abstract states generated at q " Could be merged to one single abstract state: i=[10..11]

Trends in Embedded HW Multi-core architectures ! Several (simple) CPUs on one chip ! Trend: Large variety of ES HW platforms " Increased performance & lower power core core core " Not just one main processor type as for PCs " “SoC”: System-on-a-Chip possible " Many different HW configurations (memories, devices, …) ! Explicit parallelism L1 cache L1 cache L1 cache " Challenge: How to make WCET analysis portable between " Not hidden as in superscalar architectures platforms? ! Likely that CPUs will be less complex L2 cache than current high-end processors ! Trend: Increasingly complex HW " Good for WCET analysis! Timer Serial features to boost performance ! However, risk for more shared RAM " Taken from the high-performance CPUs resources: buses, memories, … Network etc. " Pipelines, caches, branch predictors, " Bad for WCET analysis! Devices superscalar, out-of-order, … " Unrelated threads on other cores might use shared resources Multicore chip " Challenge: How to create safe and tight HW timing models? ! Analysis of multi-core is simplified if predictable sharing ! Trend: Multi-core architectures of common resources is somehow enforced

Example: shared Example: shared memory

! Example, dual core processor with private L1 ! ES often programmed using shared memory model caches and shared memory bus for all cores " t1 and t2 may communicate/synchronize using shared variables " Each core runs its own code and task ! Problem: int t1_code { int t2_code { ! Problem: if(...) { ... " When t1 writes g, memory block of g is loaded into core1’s d-cache while(...) { ... " " Whenever t1 needs something from Similarly, when t2’s writes g, memory } ... block of g moved to t2’s d-cache (and memory it may or may not collide with ... } int t1_code { int t2_code { t2’s accesses on the memory bus } } t1’s block is invalidated) if(...) { ...... while(...) { " Depends on what t1 and t2 accesses ! May give a large overhead g=5; ... and when they accesses it " Much time can be spent moving memory } g++; " Large parallel state space to explore blocks in between caches (ping-pong) ...; } } } " Hidden from programmer - HW makes ! Possible solution: sure that cache/memory content is ok " Use deterministic (but potentially pessi- " False sharing – when tasks accesses mistic) bus schedule, like TDMA different variables, but variables are " Worst-case memory bus delay can then located in same memory block be bounded ! Possible solutions: " Constrain task’s accesses to shared TDMA bus memory (e.g. single-shot task model) schedule

106

Example: multithreading Trends in Embedded SW

! Common on high-order multi-cores and GPUs ! Traditionally: embedded SW written in C ! Core run multiple threads of execution in parallel and assembler, close to hardware " Parts of core that store state of threads (registers, PC, ..) replicated " Core’s execution units and caches shared between threads ! Trend: size of embedded SW increases ! Benefits int t1_code { " SW now clearly dominates ES development cost ...; " Hides latency – when one } stalls another may execute instead int t3_code { " Hardware used to dominate ...; " Better utilization of core’s computing int t2_code { } ...; ! Trend: more ES development by high-level resources – one thread usually only } use a few of them at the same time programming languages and tools ! Problems " Object-oriented programming languages " Hard to get timing predictability " Model-based tools " Instructions executing and cache content depends dynamically on " Component-based tools state of threads, scheduler, etc.

107 Increase in embedded SW size Higher-level prog. languages ! More and more functionality required ! " Most easily realized in software Typically object-oriented: C++, Java, C#, … ! Software gets more and more complex ! Challenges for WCET analysis: " Harder to identify the timing critical part of the code " Higher use of dynamic data structures " Source code not always available for all parts of the ! In traditional ES programming all data is statically system, e.g. for SW developed by subcontractors allocated during compile time ! Challenges for WCET analysis: " Dynamic code, e.g., calls to virtual methods !Hard to analyze statically (actual method called " Scaling of WCET analysis methods to larger code sizes may not be known until run-time) ! Better visualization of results (where is the time spent?) " Better adaptation to the SW development process " Dynamic middleware: ! Today’s WCET analysis works on the final executable ! Run-time system with GC ! Challenge: how to provide reasonable precise WCET ! Virtual machines with JIT compilation estimates at early development stages

Model-based design Component-based design

! More embedded system code generated by ! Very trendy within software engineering higher-level modeling and design tools ! General idea: " RT-UML, Ascet, Targetlink, Scade, ... model " Package software into reusable ! The resulting code structure components depends on the code generator " Build systems out of prefabricated " Often simpler than handwritten code components, which are “glued together” ... generated ! Possible to integrate such tools label rerun: if(flag1 || flag2) ... code ! WCET analysis challenges: with WCET analysis tools else " The analysis can be automated goto rerun; " How to reuse WCET analysis results ... " E.g., loop bounds can be provided when some settings have changed? directly by the modeling tool ….´ executable " How to analyze SW components 10010101001110101100101001 ! Hard to provide reliable timing on 10010101001110101100101001 10100101010101001010010100 when not all information is available? modeling level 10010101010101010100101010 .... " Are WCET analysis results composable?

Compiler interaction ! Today – commercial WCET analysis tools analyses binaries ! Another possibility – interaction with the compiler " Easier to identify data objects and to understand what the program is intended to do The End! ! There exists many compilers for embedded systems " Very fragmented market " Each specialized on a few particular targets " Targeting code size and execution speed For more information: ! Integration with WCET analysis tools opens new possibilities: www.mrtc.mdh.se/projects/wcet " Compile for timing predictability " Compile for small WCET

114