High Performance Embedded Systems

Parallelism

Electronics Engineering Department Electronics Master Program

June 2020 Parallel Programming Analogy

2 Source: Wikipedia.org

Amdahl’s Law

• Basic Exercise: Your program takes 20 days to run o 95% can be parallelized o 5% cannot (serial) o What is the fastest this code can run? ✓ As many CPUs as you want!

1 day! Or a speedup of 1/(1−p) = 1/0.05 = 20, so 20 days / 20 = 1 day.

3 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

4 Flynn’s Taxonomy

• It is a classification of architectures, proposed by Michael J. Flynn in 1966.

• He proposed four different classes for processors architectures:

o Single-instruction, single-data (SISD)
o Single-instruction, multiple-data (SIMD)
o Multiple-instruction, single-data (MISD)
o Multiple-instruction, multiple-data (MIMD)

5 Single-instruction, Single-data (SISD)

• Sequential => no parallelism in either the instruction or data streams.

• A single stream of instructions operates on a single set of data.

• SISD can have concurrent processing characteristics.

6 Single-instruction, Multiple-data (SIMD)

• Same simultaneous instructions, own data!

• All PUs perform the same operation on their own data in lockstep.

• A single program is possible for all processing elements.

7 SIMD vs SISD Example

• Multiply registers R1 (variable) and R2 (constant)
• Store the result in RAM

(Figure: SISD vs SIMD implementations)

8 Multiple-instruction, Single-data (MISD)

• Different simultaneous instructions, same data!

• Pipelined architectures fit this category under the condition above.

• Computer vision example: Multiple filters applied to same input image.

9 Multiple-instruction, Multiple-data (MIMD)

• Different simultaneous instructions, different data!

• PUs are asynchronous and independent.

• Application fields: Computer-Aided Design (CAD), simulation, modeling, Embedded Systems, etc.

• Multi-core processors are MIMD!!!

10 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

11 Scalar Processor

• The most basic implementation for processors is the scalar processor.

• Executes at most one single instruction at a time. The instruction can involve either an integer or floating-point operation.

• A scalar processor is classified as a SISD processor (Single Instruction, Single Data) in Flynn’s taxonomy.

12 Scalar Processor

• Not used nowadays in its basic conceptual definition.

• Does it have any advantage?
• Simplicity!
• Does not have data or control hazards, i.e. there are no hardware conflicts from trying to execute simultaneous instructions.
• The design is straightforward.

13 Can we do better than the scalar processor?

Yes!

14 Superscalar Processor

• Allows implementation of Instruction Level Parallelism (ILP) with a single processor.

• Executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor.

• Allows more throughput at a given clock rate than a scalar processor.

15 Superscalar Processor

• Allows implementation of Instruction Level Parallelism (ILP) with a single processor.

• All general-purpose CPUs developed since about 1998 are superscalar. Even our RPI4’s Processor, the ARM Cortex-A72.

• Characteristics:
➢ CPU dynamically checks for data dependencies between instructions at run time.
➢ Instructions are issued from a sequential instruction stream.
➢ CPU can execute multiple instructions per clock cycle.

16 But… How can we do even better?

Parallelization!

• The overall objective is to speed up execution of the program.

• There is another reason on embedded systems: concurrency of physical processes.

• Parallelization enables a better energy efficiency => Interesting for Embedded Systems!

17 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

18 Why do we care about the Architecture?

Parallel speed up is affected by:
• Hardware configuration for each application.
• Memory or CPU architecture.
• Number of cores per processor.
• Network speed and architecture.

19 Why do we care about the Architecture?

• A multicore machine is a combination of several processors on a single chip.
• Homogeneous multicore processor: composed of multiple instances of the same processor type on a single chip.
• Heterogeneous multicore processor: composed of a variety of processor types on a single chip.
• Multicore architectures are a powerful ally on embedded systems since they make real-time requirements easier to meet.

20 Single-core Architecture

21 Source: www.cs.cmu.edu

Why multi-core?
• Difficult to make single-core clock frequencies even higher (physical restrictions)
• Deeply pipelined circuits:
– heat problems
– difficult design and verification
– large design teams necessary
– server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)

22 CMOS General Power Consumption

• There are 2 main components:

• Static power: P_static = I_static ∗ V_dd
o Transistors are not switching
o Leakage current due to tunneling through thin gate oxide

• Dynamic power: P_dynamic = C_L ∗ V_dd² ∗ f ∗ N
Where:
▪ C_L: Load capacitance (chip internal capacitance)
▪ f: Frequency of operation
▪ N: Number of bits that are switching
▪ V_dd: Supply voltage

23 Multi-Core Energy Consumption

Processor               Broadcom BCM2711    Intel 4
Number of Cores         4                   1
Frequency [MHz]         1500                2992
RAM                     LPDDR4 (4GB)        DDR (2GB)
L1 [kB]                 32                  16
Compiler Version        GCC 8.3.0           GCC 4.1.3
Power Consumption [W]   4                   84
CoreMark Score          33072.76            5007.06

24 Multi-core Architecture

• Multiple processors in a single die.

Core 1 Core 2 Core 3 Core 4

25 Source: www.cs.cmu.edu Multi-core Architecture

• Multiple processors in a single die.

Core 1 Core 2 Core 3 Core 4

Thread 1 Thread 2 Thread 3 Thread 4

26 Source: www.cs.cmu.edu Can we run several threads on each core?

27 Multi-core Architecture

• Multiple processors in a single die.

Core 1 Core 2 Core 3 Core 4

(each core runs several threads, e.g. Threads 1–8 across the 4 cores)

28 Source: www.cs.cmu.edu

Multiprocessor Memory Types: Shared Memory

• Creates a single group of memory modules and multiple processors.

• Any processor is able to directly access any memory module by means of an interconnection network.

• The group of memory modules outlines a universal address space that is shared between the processors.

• Advantage: easy to program => communication among processors through the global memory store.

29 Multiprocessor Memory Types: Distributed Memory

• Clones the memory/processor pairs, known as a processing element (PE), and links them by using an interconnection network.

• Each PE can communicate with others by sending messages.

• Advantage: each processor has its own memory => no memory concurrency across processors.

30 Interaction with the Operating System

• OS perceives each core as a separate processor

• OS scheduler maps threads/processes to different cores

• Most major OS support multi-core today: o Windows, Linux, Mac OS X,

31 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

32 Pipelining

• During the execution of instructions, the data path has idle hardware sections.
• Pipelining focuses on extending the data path to allow more than one instruction to execute using the idle hardware.
• Latches are used between each data path stage to hold the data for each instruction.

33 Pipelining: Reservation Table

• Consider 5 different tasks named: A, B, C, D and E.
• At clock 9, we have:

34 Hazards

• There are two main pipeline hazards in the example shown:
o Data hazard
o Control hazard

35 Pipeline Hazards: Data Hazard

• Consider the following stage of the reservation table:

What happens if B requires the register values written by A?

36 Pipeline Hazards: Data Hazard

• There are different methods to overcome this situation:
• Explicit pipeline:

37 Pipeline Hazards: Data Hazard

• Interlock:

38 Pipeline Hazards: Data Hazard

• Out-of-order execution:

39 Pipeline Hazards: Control Hazard

• Consider the following case:
➢ If A is a conditional instruction, it should reach the memory stage before the next instruction executes, to decide whether that instruction should be executed or not.

40 Pipeline Hazards: Control Hazard

• Common solutions for control hazards:

o Delayed branch: ➢ Add no-ops or other instructions at compile time until the condition information is available to decide if the next instruction should be executed or not.

o Speculative execution: ➢ Hardware executes the instruction it expects to execute. If it was wrong, it undoes any side effects.

41 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

42 Instruction-level Parallelism

• Parallelism at the machine-instruction level => Emphasis on hardware (low-level parallelization).

• The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.

• Instruction-level parallelism enabled rapid increases in processor speeds over the last 20 years.

43 Instruction-level Parallelism

• A processor supporting ILP is able to perform multiple independent operations in each clock cycle.

• A single instruction may take several clock cycles.

• There are 4 major forms of ILP:
- CISC
- Sub-word parallelism
- Superscalar
- VLIW

44 Instruction-level Parallelism: CISC

45 Instruction-level Parallelism: CISC

• CISC => Complex Instruction Set Computer.

• Implements complex, and commonly specialized, operations in a single assembly instruction.

46 Instruction-level Parallelism: CISC • CISC Example: TMS320c54x DSP

47 Instruction-level Parallelism: CISC • CISC Example: TMS320c54x DSP

48 Instruction-level Parallelism: Sub-word Parallelism

49 Instruction-level Parallelism: Sub-word Parallelism

• It makes no sense to spend an entire 64-bit long data word to compute a single 8-bit data (e.g. pixel value).

• A wide ALU is divided into narrower slices, allowing simultaneous arithmetic or logical operations over multiple sub-word values.

• The same operation is executed on all sub-words.

• Implemented as vector units or processors (e.g. ARM NEON).

50 Instruction-level Parallelism: Sub-word Parallelism

• Example: SSE - Streaming SIMD Extension

51 Instruction-level Parallelism: Superscalar Parallelism

52 Instruction-level Parallelism: Superscalar Parallelism

• Superscalar processors use conventional sequential instruction sets.
• Hardware can simultaneously dispatch multiple instructions to distinct hardware units when it detects that such simultaneous dispatch will not change the behavior of the program, e.g. via out-of-order execution.

53 Instruction-level Parallelism: Superscalar Parallelism

• Includes additional functional hardware units (ALU, multiplier, registers, etc.).
• Dynamically determines the instruction execution order.
• Issues more than one instruction per clock cycle.
• Popular on desktop and server architectures.
• Execution time may be difficult to predict or may not be repeatable when using threads/interrupts => a big problem for embedded systems.

54 Instruction-level Parallelism: Superscalar Parallelism

• Usually implements pipelining to take advantage of the processing stages of the data path.

55 Superscalar vs Pipelining

• The superscalar architecture executes multiple instructions in parallel by using multiple execution units.

• A pipeline architecture executes multiple instructions in the same execution unit in parallel by dividing that unit into different phases.

56 Instruction-level Parallelism: VLIW

57 Instruction-level Parallelism: VLIW

• VLIW => Very Long Instruction Word

• A fixed number of operations are formatted as one big instruction by the compiler. This big instruction is called a bundle.

• The program counter points to 1 bundle (not 1 operation).

• More predictable and repeatable timing => Interesting for time- critical Embedded Systems.

58 Instruction-level Parallelism: VLIW

How do VLIW designs reduce hardware complexity (in theory)?

• No dependence checking for instructions within a bundle

• Simpler instruction dispatch
• No out-of-order execution, no instruction grouping
• No structural hazard checking logic

Compiler figures all this out!

59 Instruction-level Parallelism: VLIW

• Requires additional functional hardware units.
• Does not determine the execution order dynamically; instead, a single instruction specifies the different operations to be executed on the different functional units.
• The VLIW instruction set combines multiple independent operations into a single instruction.

60 Instruction-level Parallelism: VLIW

Compiler support to increase ILP:

• Compiler detects hazards:
• Structural hazards:
o No 2 operations to the same functional unit
o No 2 operations to the same memory bank
• Data hazards:
o No data hazards among instructions in a bundle
• Control hazards:
o Predicated execution

o Static branch prediction

61 Instruction-level Parallelism: VLIW

VLIW Example: TMS320C674x

62 TMS320C674x DSP Decode Stage

Instruction-level Parallelism: VLIW vs Superscalar

Superscalar => Complex hardware for instruction scheduling:
• Out-of-order execution
• Dependence checking logic between parallel instructions
• Functional unit hazard checking
• Possible consequences:
o Longer cycle times
o More chip real estate
o More power consumption
• Superscalars can more efficiently execute pipeline-dependent code.

VLIW => More functional units:
• Larger code size
• Requires a complex compiler
• More design effort, or poor-quality code if good compiler optimizations aren’t implemented
• Simpler hardware
• VLIWs are more predictable
• Predicated execution to avoid branches

63 Instruction-level Parallelism: VLIW vs Superscalar

64 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

65 Thread-Level Parallelism (TLP)

• This is parallelism on a coarser scale than ILP.

• Server can serve each client in a separate thread (Web server, database server)

• A computer game can do AI, graphics, and physics in three separate threads

• Single-core superscalar processors cannot fully exploit TLP

• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP.

66 Review: Flynn’s Taxonomy and Memory

• SIMD – Single instruction, multiple data – Modern graphics cards

• MIMD – Multiple instructions, multiple data – Modern multiprocessors: Different cores execute different threads

(Figure: shared-memory vs distributed-memory organization)

67 Thread-Level Parallelism (TLP) Methods

• Simultaneous Multithreading (SMT) – Single core, multiple functional units – Not a “true” parallel processor

• Multi-core Parallelism – Threads run independently of each other

68 Simultaneous Multithreading (SMT)

69 Simultaneous Multithreading (SMT)

• Problem addressed: The processor pipeline can get stalled:

o Waiting for the result of a long floating point (or integer) operation. o Waiting for data to arrive from memory.

70 Source: www.cs.cmu.edu Simultaneous Multithreading (SMT)

• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core.

• Weaving together multiple “threads” on the same core.

• How is this done? Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units.

71 Source: www.cs.cmu.edu Without SMT

• Only a single thread can run at any given time


73 Source: www.cs.cmu.edu Simultaneous Multithreading (SMT)

• Both threads can run concurrently

74 Source: www.cs.cmu.edu Simultaneous Multithreading (SMT)

• But: Can’t simultaneously use the same functional unit

This scenario is impossible with SMT on a single core (assuming a single integer unit).

76 Source: www.cs.cmu.edu

Simultaneous Multithreading (SMT)

• SMT is not a true parallel processor.

• Enables better threading (e.g. up to 30%)

• OS and applications perceive each simultaneous thread as a separate “virtual processor”

• The chip has only a single copy of each resource

• Compare to multi-core: each core has its own copy of resources

77 Source: www.cs.cmu.edu Multi-core Parallelism

78 Source: www.cs.cmu.edu Multi-core Parallelism

79 Source: www.cs.cmu.edu

Multi-core Parallelism + SMT

• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: nowadays

• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads

• Intel calls them “hyper-threads”. The Pentium 4 was the first hyper-threaded µP.

80 Source: www.cs.cmu.edu

Multi-core Parallelism + SMT

81 Source: www.cs.cmu.edu

Comparison: Multi-core vs SMT

• Multi-core:
– Since there are several cores:
o Each is smaller and not as powerful (but also easier to design and manufacture)
o However, great with thread-level parallelism

• SMT:
o Can have one large and fast superscalar core
o Great performance on a single thread
o Mostly still only exploits instruction-level parallelism

82 Source: www.cs.cmu.edu

The Memory Hierarchy

• If simultaneous multithreading only:
– all caches shared

• Multi-core chips: – L1 caches private – L2 caches private in some architectures and shared in others

• Memory is always shared

83 Source: www.cs.cmu.edu

Private vs Shared Caches

• Advantages of private:
o They are closer to the core, so faster access
o Reduces contention

• Advantages of shared:
o Threads on different cores can share the same cache data
o More cache space available if a single (or a few) high-performance thread runs on the system

84 Source: www.cs.cmu.edu

The Cache Coherence Problem

• Since we have private caches: how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores.

85 Source: www.cs.cmu.edu The Cache Coherence Problem • Suppose variable x initially contains 15213

86 Source: www.cs.cmu.edu The Cache Coherence Problem • Core 1 reads “x”

87 Source: www.cs.cmu.edu The Cache Coherence Problem • Core 2 reads “x”

88 Source: www.cs.cmu.edu The Cache Coherence Problem • Core 2 has a stale copy

89 Source: www.cs.cmu.edu Solutions for Cache Coherence

• This is a general problem with multiprocessors, not limited just to multi-core

• There exist many solution algorithms, coherence protocols, etc.

• A simple solution: invalidation-based protocol with snooping

90 Source: www.cs.cmu.edu Inter-core

91 Source: www.cs.cmu.edu Solutions for Cache Coherence

• Invalidation:
o If a core writes to a data item, all other copies of this data item in other caches are invalidated.

• Snooping:
o All cores continuously “snoop” (monitor) the bus connecting the cores.

92 Source: www.cs.cmu.edu Solutions for Cache Coherence

93

Source: https://image.slideserve.com/

The Cache Coherence Problem

• Revisited: Cores 1 and 2 have both read x

94 Source: www.cs.cmu.edu The Cache Coherence Problem: Invalidation • Core 1 writes to x, setting it to 21660

95 Source: www.cs.cmu.edu The Cache Coherence Problem: Invalidation • Core 2 reads x. Cache misses, and loads the new copy.

96 Source: www.cs.cmu.edu The Cache Coherence Problem: Update Protocol

• Core 1 writes x=21660.

97 Source: www.cs.cmu.edu Invalidation vs update

• Multiple writes to the same location:
– Invalidation: only the first write generates bus traffic
– Update: must broadcast each write (which includes the new variable value)

• Invalidation generally performs better: generates less bus traffic

98 Source: www.cs.cmu.edu References

[1] M. Madrigal Solano. Lecture Notes HPEC, Tecnológico de Costa Rica, Course: Sistemas Empotrados de Alto Desempeño.
[2] W. Wolf. High-Performance Embedded Computing: Architectures, Applications and Methodologies. Elsevier, United States of America, 2007.
[3] E. A. Lee and S. A. Seshia. Introduction to Embedded Systems: A Cyber-Physical Systems Approach, 2017.
[4] T. Noergaard. Embedded Systems Architecture, 2005.

Lecture notes and materials are available on TEC-Digital and the web portals:

www.ie.tec.ac.cr/sarriola/HPEC
http://www.ie.tec.ac.cr/joaraya/HPEC/