High Performance Embedded Systems
Parallelism
Electronics Engineering Department, Electronics Master Program
June 2020

Parallel Programming Analogy
[Figure] Source: Wikipedia.org

Amdahl's Law
• Basic exercise: your program takes 20 days to run
  o 95% can be parallelized
  o 5% cannot (serial)
  o What is the fastest this code can run?
  ✓ With as many CPUs as you want!
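Amdahl's law gives the achievable speedup as S(n) = 1/((1-p) + p/n), where p is the parallel fraction and n the number of CPUs. A minimal numeric check of the exercise (function names are illustrative):

```c
/* Amdahl's law: overall speedup with parallel fraction p on n CPUs. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

/* Runtime of a t-day job after applying that speedup. */
double parallel_runtime(double t_days, double p, int n) {
    return t_days / amdahl_speedup(p, n);
}
```

With p = 0.95, S(n) approaches 1/(1 - 0.95) = 20 as n grows, so the 20-day job can never finish faster than about 1 day no matter how many CPUs are added.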
1 day! Or a speedup of 1/(1-p) = 1/0.05 = 20.

Outline
• Flynn’s Taxonomy
• The Scalar Processor
• Multicore Architectures
• Pipelining
• Instruction-Level Parallelism (ILP)
• Thread-Level Parallelism (TLP)

Flynn's Taxonomy
• It is a classification of computer architectures, proposed by Michael J. Flynn in 1966.
• He proposed four classes of processor architectures:
o Single-instruction, single-data (SISD)
o Single-instruction, multiple-data (SIMD)
o Multiple-instruction, single-data (MISD)
o Multiple-instruction, multiple-data (MIMD)

Single-instruction, Single-data (SISD)
• Sequential => no parallelism in either the instruction or data streams.
• A single stream of instructions operates on a single set of data.
• SISD can have concurrent processing characteristics.
Single-instruction, Multiple-data (SIMD)
• Same simultaneous instructions, own data!
• All PUs perform the same operations on their own data in lockstep.
• A single program counter is possible for all processing elements.
SIMD vs SISD Example
• Multiply registers R1 (variable) and R2 (constant)
• Store the result in RAM

[Figure: SISD vs SIMD instruction sequences]
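The contrast above can be sketched in C. The scalar loop is plain SISD; the second version uses the GCC/Clang `vector_size` extension, which the compiler lowers to SSE/NEON lanes where available (an illustrative sketch, not the slide's exact assembly):

```c
#include <stdint.h>

/* SISD: one multiply per loop iteration. */
void mul_sisd(const int32_t *in, int32_t *out, int n, int32_t k) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * k;
}

/* SIMD-style: four lanes per step, using the GCC/Clang vector_size
 * extension (mapped onto SSE/NEON instructions where available). */
typedef int32_t v4si __attribute__((vector_size(16)));

void mul_simd(const int32_t *in, int32_t *out, int n, int32_t k) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        v4si v;
        __builtin_memcpy(&v, in + i, sizeof v);
        v = v * k;                         /* one operation, four lanes */
        __builtin_memcpy(out + i, &v, sizeof v);
    }
    for (; i < n; i++)                     /* scalar tail */
        out[i] = in[i] * k;
}
```

Both functions compute the same result; the SIMD version simply does four of the multiplications per instruction.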
Multiple-instruction, Single-data (MISD)
• Different simultaneous instructions, same data!
• Pipelined architectures fit this category under the condition above.
• Computer vision example: Multiple filters applied to same input image.
Multiple-instruction, Multiple-data (MIMD)
• Different simultaneous instructions, different data!
• PUs are asynchronous and independent.
• Application fields: Computer-Aided Design (CAD), simulation, modeling, Embedded Systems, etc.
• Multi-core processors are MIMD!!!
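MIMD in miniature can be sketched with POSIX threads (compile with `-pthread`; function names are illustrative): two threads execute different instruction streams on different data at the same time.

```c
#include <pthread.h>
#include <ctype.h>

/* Instruction stream A: sum four integers (data[4] receives the result). */
static void *sum_worker(void *arg) {
    int *d = arg;
    d[4] = d[0] + d[1] + d[2] + d[3];
    return NULL;
}

/* Instruction stream B: upper-case a string in place. */
static void *upcase_worker(void *arg) {
    for (char *s = arg; *s; s++)
        *s = (char)toupper((unsigned char)*s);
    return NULL;
}

/* MIMD: different instructions, different data, running concurrently. */
void mimd_demo(int *nums, char *text) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, sum_worker, nums);
    pthread_create(&tb, NULL, upcase_worker, text);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
}
```

On a multi-core processor the OS can schedule the two workers on different cores, so the two unrelated instruction streams really do run simultaneously.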
Outline
• Flynn’s Taxonomy
• The Scalar Processor
• Multicore Architectures
• Pipelining
• Instruction-Level Parallelism (ILP)
• Thread-Level Parallelism (TLP)

Scalar Processor
• The most basic implementation for processors is the scalar processor.
• Executes at most one single instruction at a time. The instruction can involve either an integer or floating-point operation.
• A scalar processor is classified as SISD (Single Instruction, Single Data) in Flynn's taxonomy.

Scalar Processor
• Not used nowadays in its basic conceptual form.
• Does it have any advantage?
  • Simplicity!
  • No data or control hazards, i.e. there are no hardware conflicts from trying to execute simultaneous instructions.
  • The compiler is straightforward.
Can we do better than the scalar processor?

Superscalar Processor
• Allows implementation of Instruction Level Parallelism (ILP) with a single processor.
• Executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor.
• Allows more throughput at a given clock rate than a scalar processor.
Superscalar Processor
• All general-purpose CPUs developed since about 1998 are superscalar, even our RPi4's processor, the ARM Cortex-A72.
• Characteristics:
  ➢ The CPU dynamically checks for data dependencies between instructions at run time.
  ➢ Instructions are issued from a sequential instruction stream.
  ➢ The CPU can execute multiple instructions per clock cycle.

But… how can we do even better?
Parallelization!
• The overall objective is to speed up execution of the program.
• There is another reason in embedded systems: concurrency of physical processes.
• Parallelization enables better energy efficiency => interesting for embedded systems!
Outline
• Flynn’s Taxonomy
• The Scalar Processor
• Multicore Architectures
• Pipelining
• Instruction-Level Parallelism (ILP)
• Thread-Level Parallelism (TLP)

Why do we care about the Architecture?
Parallel speedup is affected by:
• Hardware configuration for each application.
• Memory and CPU architecture.
• Number of cores per processor.
• Network speed and architecture.
Why do we care about the Architecture?
• A multicore machine is a combination of several processors on a single chip.
• Homogeneous multicore processor: composed of multiple instances of the same processor type on a single chip.
• Heterogeneous multicore processor: composed of a variety of processor types on a single chip.
• Multicore architectures are a powerful ally in embedded systems, since they make it easier to meet real-time requirements.

Single-core Architecture
Source: www.cs.cmu.edu

Why multi-core?
• It is difficult to make single-core clock frequencies even higher (physical restrictions)
• Deeply pipelined circuits:
  – heat problems
  – difficult design and verification
  – large design teams necessary
  – server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)
CMOS General Power Consumption
• There are 2 main components:
• Static power: P_static = I_static * V_dd
  • Transistors are not switching
  • Leakage current due to tunneling through the thin gate oxide
• Dynamic power: P_dynamic = C_L * V_dd^2 * f * N
  Where:
  ▪ C_L: switched load capacitance (including the chip's internal capacitance)
  ▪ f: frequency of operation
  ▪ N: number of bits that are switching
  ▪ V_dd: supply voltage

Multi-Core Energy Consumption
Processor              | Broadcom BCM2711 | Intel Pentium 4
Number of cores        | 4                | 1
Frequency [MHz]        | 1500             | 2992
RAM                    | LPDDR4 (4 GB)    | DDR (2 GB)
L1 cache [kB]          | 32               | 16
Compiler version       | GCC 8.3.0        | GCC 4.1.3
Power consumption [W]  | 4                | 84
CoreMark score         | 33072.76         | 5007.06
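The dynamic-power equation P = C_L * Vdd^2 * f * N is one way to read the table: the BCM2711 runs four cores at half the frequency and a lower supply voltage than the Pentium 4. A quick numeric sketch (all values are illustrative, not the real chip parameters):

```c
/* Dynamic CMOS power: P = C_L * Vdd^2 * f * N
 * c_load : switched load capacitance per bit [F]
 * vdd    : supply voltage [V]
 * f      : clock frequency [Hz]
 * n_bits : number of bits switching */
double dynamic_power(double c_load, double vdd, double f, double n_bits) {
    return c_load * vdd * vdd * f * n_bits;
}
```

Because Vdd enters squared, halving the supply voltage cuts dynamic power by 4x at the same frequency, which is the core argument for many slower, lower-voltage cores instead of one fast one.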
Multi-core Architecture
• Multiple processors in a single die.
[Figure: Core 1 | Core 2 | Core 3 | Core 4]
Source: www.cs.cmu.edu

Multi-core Architecture
• Multiple processors in a single die.
[Figure: Thread 1 | Thread 2 | Thread 3 | Thread 4, one per core]
Source: www.cs.cmu.edu

Can we run several threads on each core?

Multi-core Architecture
• Multiple processors in a single die.
[Figure: several threads on each of Core 1–4]
Source: www.cs.cmu.edu

Multiprocessor Memory Types: Shared Memory
• Creates a single group of memory modules and multiple processors.
• Any processor is able to directly access any memory module by means of an interconnection network.
• The group of memory modules outlines a universal address space that is shared between the processors.
• Advantage: easy to program => communication among processors goes through the global memory store.

Multiprocessor Memory Types: Distributed Memory
• Replicates memory/processor pairs, each known as a processing element (PE), and links them using an interconnection network.
• Each PE can communicate with others by sending messages.
• Advantage: each processor has its own memory => no memory concurrency across processors.
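Message passing between PEs can be sketched with two POSIX processes and a pipe: each process has its own private address space, and the only data exchange is the message itself (an illustrative sketch; error handling omitted).

```c
#include <unistd.h>
#include <sys/wait.h>

/* One "remote" PE squares a value for the parent PE. The processes share
 * no memory; the request and the reply travel as messages through pipes. */
int remote_square(int x) {
    int req[2], resp[2];
    pipe(req);
    pipe(resp);
    if (fork() == 0) {                     /* child PE: private memory */
        int v = 0;
        read(req[0], &v, sizeof v);        /* receive request */
        v *= v;
        write(resp[1], &v, sizeof v);      /* send reply */
        _exit(0);
    }
    write(req[1], &x, sizeof x);           /* send request */
    int result = 0;
    read(resp[0], &result, sizeof result); /* receive reply */
    wait(NULL);
    return result;
}
```

The same send/receive pattern scales up to real message-passing systems; only the transport (pipe, network, on-chip interconnect) changes.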
Interaction with the Operating System
• OS perceives each core as a separate processor
• OS scheduler maps threads/processes to different cores
• Most major OSs support multi-core today:
  o Windows, Linux, Mac OS X, etc.
Outline
• Flynn’s Taxonomy
• The Scalar Processor
• Multicore Architectures
• Pipelining
• Instruction-Level Parallelism (ILP)
• Thread-Level Parallelism (TLP)

Pipelining
• During the execution of instructions, the data path has idle hardware sections.
• Pipelining focuses on extending the data path to allow more than one instruction to execute using the idle hardware.
• Latches are used between data path stages to hold the data of each instruction.
Pipelining: Reservation Table
• Consider 5 different tasks named A, B, C, D and E.
• At clock 9, we have:
Pipeline Hazards
• There are two main pipeline hazards in the example above:
  o Data hazard
  o Control hazard
Pipeline Hazards: Data Hazard
• Consider the following state of the reservation table:

What happens if B requires the register values written by A?
Pipeline Hazards: Data Hazard
• There are different methods to overcome this situation:
• Explicit pipeline:
Pipeline Hazards: Data Hazard
• Interlock:
Pipeline Hazards: Data Hazard
• Out-of-order execution:
Pipeline Hazards: Control Hazard
• Consider the following case:
  ➢ If A is a conditional instruction, it must reach the memory stage before the next instruction executes, in order to decide whether that instruction should be executed or not.
Pipeline Hazards: Control Hazard
• Common solutions for control hazards:
o Delayed branch: ➢ Add no-ops or other instructions at compile time until the condition information is available to decide if the next instruction should be executed or not.
o Speculative execution: ➢ Hardware executes the instruction it expects to execute. If it was wrong, it undoes any side effects.
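Predicated execution can also be approximated in plain C: a branch-free selection gives the pipeline nothing to mispredict (an illustrative sketch; function names are hypothetical).

```c
#include <stdint.h>

/* Branchy version: the pipeline must guess the outcome of the compare. */
int32_t max_branch(int32_t a, int32_t b) {
    if (a > b) return a;
    return b;
}

/* Branch-free version: turn the 0/1 comparison result into an
 * all-zeros/all-ones mask and blend the two operands. */
int32_t max_branchless(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a > b);   /* -1 if a > b, else 0 */
    return (a & mask) | (b & ~mask);
}
```

Compilers often emit a conditional-move instruction for either form; writing the selection branch-free simply makes the predicated version explicit.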
Outline
• Flynn’s Taxonomy
• The Scalar Processor
• Multicore Architectures
• Pipelining
• Instruction-Level Parallelism (ILP)
• Thread-Level Parallelism (TLP)

Instruction-level Parallelism
• Parallelism at the machine-instruction level => Emphasis on hardware (low-level parallelization).
• The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled the rapid increases in processor speeds of the last 20 years.

Instruction-level Parallelism
• A processor supporting ILP is able to perform multiple independent operations in each instruction cycle.
• A single instruction may take several clock cycles.
• There are 4 major forms of ILP:
  - CISC
  - Sub-word parallelism
  - Superscalar
  - VLIW

Instruction-level Parallelism: CISC
• CISC => Complex Instruction Set Computer.
• Implements complex, commonly specialized, operations in a single assembly instruction.
Instruction-level Parallelism: CISC
• CISC example: TMS320C54x DSP
Instruction-level Parallelism: Sub-word Parallelism
• It makes no sense to spend an entire 64-bit long data word to compute a single 8-bit data (e.g. pixel value).
• A wide ALU is divided into narrower slices, allowing simultaneous arithmetic or logical operations over multiple sub-word values.
• The same operation is executed on all sub-words.
• Implemented as vector units or processors (e.g. ARM NEON).
• Example: SSE – Streaming SIMD Extensions
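The idea behind sub-word parallelism can be demonstrated even without SSE/NEON hardware, using a plain integer register ("SWAR"): four 8-bit lanes packed into one 32-bit word, added in one pass with carries blocked at lane boundaries (an illustrative sketch):

```c
#include <stdint.h>

/* Add four packed 8-bit lanes at once (each lane wraps mod 256).
 * The low 7 bits of each lane are added normally; the top bit of
 * each lane is recomputed with XOR, so no carry ever crosses into
 * the neighboring lane. */
uint32_t add_u8x4(uint32_t x, uint32_t y) {
    uint32_t low7 = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    return low7 ^ ((x ^ y) & 0x80808080u);
}
```

Vector units do exactly this in hardware, with wider registers and per-lane carry suppression built in.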
Instruction-level Parallelism: Superscalar Parallelism
• Superscalar processors use conventional sequential instruction sets.
• The hardware can simultaneously dispatch multiple instructions to distinct hardware units when it detects that such simultaneous dispatch will not change the behavior of the program, e.g. via out-of-order execution.
Instruction-level Parallelism: Superscalar Parallelism
• Includes additional functional hardware units (ALUs, multipliers, registers, etc.).
• Dynamically determines the instruction execution order.
• Issues more than one instruction per clock cycle.
• Popular in desktop and server architectures.
• Execution time may be difficult to predict, or may not be repeatable when using threads/interrupts => a big problem for embedded systems.

Instruction-level Parallelism: Superscalar Parallelism
• Usually implements pipelining to take advantage of the processing stages of the data path.
Superscalar vs Pipelining
• The superscalar architecture executes multiple instructions in parallel by using multiple execution units.
• A pipeline architecture executes multiple instructions in the same execution unit in parallel by dividing the execution unit into different phases.
Instruction-level Parallelism: VLIW
• VLIW => Very Long Instruction Word
• A fixed number of operations are formatted as one big instruction by the compiler. This big instruction is called a bundle.
• The program counter points to 1 bundle (not 1 operation).
• More predictable and repeatable timing => interesting for time-critical embedded systems.

Instruction-level Parallelism: VLIW
How do VLIW designs reduce hardware complexity (in theory)?
• No dependence checking for instructions within a bundle
• Simpler instruction dispatch
• No out-of-order execution, no instruction grouping
• No structural hazard checking logic
Compiler figures all this out!
Instruction-level Parallelism: VLIW
• Requires additional functional hardware units.
• Does not determine the execution order dynamically; instead, a single instruction specifies the different operations to be executed on the different functional units.
• The VLIW instruction set combines multiple independent operations into a single instruction.
Instruction-level Parallelism: VLIW
Compiler support to increase ILP:
• The compiler detects hazards:
  • Structural hazards:
    • No two operations to the same functional unit
    • No two operations to the same memory bank
  • Data hazards:
    • No data hazards among instructions in a bundle
  • Control hazards:
    • Predicated execution
    • Static branch prediction

Instruction-level Parallelism: VLIW
VLIW Example: TMS320C674x
[Figure: TMS320C674x DSP decode stage]

Instruction-level Parallelism: VLIW vs Superscalar
Superscalar => complex hardware for instruction scheduling:
• Out-of-order execution
• Dependence-checking logic between parallel instructions
• Functional-unit hazard checking
• Possible consequences:
  o Lower cycle times
  o More chip real estate
  o More power consumption
• Superscalars can more efficiently execute pipeline-dependent code.

VLIW => more functional units:
• Larger code size
• Requires a complex compiler
• More design effort, or poor-quality code if good compiler optimizations aren't implemented
• Simpler hardware
• More predictable execution
• Predicated execution to avoid branches

Instruction-level Parallelism: VLIW vs Superscalar
Outline
• Flynn’s Taxonomy
• The Scalar Processor
• Multicore Architectures
• Pipelining
• Instruction-Level Parallelism (ILP)
• Thread-Level Parallelism (TLP)

Thread-Level Parallelism (TLP)
• This is parallelism on a coarser scale than ILP
• Server can serve each client in a separate thread (Web server, database server)
• A computer game can do AI, graphics, and physics in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP

Review: Flynn's Taxonomy and Memory
• SIMD – Single instruction, multiple data – Modern graphics cards
• MIMD – Multiple instructions, multiple data – Modern multiprocessors: Different cores execute different threads
[Figure: shared memory vs distributed memory organizations]

Thread-Level Parallelism (TLP) Methods
• Simultaneous Multithreading (SMT) – Single core, multiple functional units – Not a “true” parallel processor
• Multi-core Parallelism – Threads run independent from each other
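A minimal multi-core TLP sketch with POSIX threads (compile with `-pthread`; names are illustrative): the array is split into slices, and each thread, ideally scheduled on its own core by the OS, sums one slice.

```c
#include <pthread.h>

#define NTHREADS 4

struct slice { const long *data; int lo, hi; long sum; };

/* Each thread sums its half-open slice [lo, hi). */
static void *sum_slice(void *arg) {
    struct slice *s = arg;
    s->sum = 0;
    for (int i = s->lo; i < s->hi; i++)
        s->sum += s->data[i];
    return NULL;
}

/* Sum n longs using NTHREADS independent threads. */
long parallel_sum(const long *data, int n) {
    pthread_t tid[NTHREADS];
    struct slice s[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        s[t].data = data;
        s[t].lo = n * t / NTHREADS;
        s[t].hi = n * (t + 1) / NTHREADS;
        pthread_create(&tid[t], NULL, sum_slice, &s[t]);
    }
    long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += s[t].sum;
    }
    return total;
}
```

Each thread touches only its own slice and its own accumulator, so no locking is needed until the final join-and-add in the parent.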
Simultaneous Multithreading (SMT)
• Problem addressed: The processor pipeline can get stalled:
o Waiting for the result of a long floating point (or integer) operation. o Waiting for data to arrive from memory.
Source: www.cs.cmu.edu

Simultaneous Multithreading (SMT)
• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core.
• Weaving together multiple “threads” on the same core.
• How is this done? Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units.
Source: www.cs.cmu.edu

Without SMT
• Only a single thread can run at any given time
Source: www.cs.cmu.edu

Simultaneous Multithreading (SMT)
• Both threads can run concurrently
Source: www.cs.cmu.edu

Simultaneous Multithreading (SMT)
• But: Can’t simultaneously use the same functional unit
This scenario is impossible with SMT on a single core (assuming a single integer unit).
Source: www.cs.cmu.edu

Simultaneous Multithreading (SMT)
• SMT is not a true parallel processor.
• Enables better threading (e.g. up to 30%)
• OS and applications perceive each simultaneous thread as a separate “virtual processor”
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources
Source: www.cs.cmu.edu

Multi-core Parallelism
[Figures] Source: www.cs.cmu.edu

Multi-core Parallelism + SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: today's computers
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them "hyper-threads"; the Pentium 4 was the first hyper-threaded µP.
Source: www.cs.cmu.edu

Multi-core Parallelism + SMT
Source: www.cs.cmu.edu

Comparison: Multi-core vs SMT
• Multi-core:
  – Since there are several cores:
    o Each is smaller and not as powerful (but also easier to design and manufacture)
    o However, great with thread-level parallelism
• SMT:
  o Can have one large and fast superscalar core
  o Great performance on a single thread
  o Mostly still only exploits instruction-level parallelism
Source: www.cs.cmu.edu

The Memory Hierarchy
• If simultaneous multithreading only:
  – all caches shared
• Multi-core chips:
  – L1 caches private
  – L2 caches private in some architectures and shared in others
• Memory is always shared
Source: www.cs.cmu.edu

Private vs Shared Caches
• Advantages of private:
  o They are closer to the core, so faster access
  o Reduces contention
• Advantages of shared:
  o Threads on different cores can share the same cache data
  o More cache space is available if a single (or a few) high-performance thread runs on the system
Source: www.cs.cmu.edu

The Cache Coherence Problem
• Since we have private caches: how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores.

Source: www.cs.cmu.edu

The Cache Coherence Problem
• Suppose variable x initially contains 15213
Source: www.cs.cmu.edu

The Cache Coherence Problem
• Core 1 reads "x"

Source: www.cs.cmu.edu

The Cache Coherence Problem
• Core 2 reads "x"

Source: www.cs.cmu.edu

The Cache Coherence Problem
• Core 2 has a stale copy

Source: www.cs.cmu.edu

Solutions for Cache Coherence
• This is a general problem with multiprocessors, not limited just to multi-core
• There exist many solution algorithms, coherence protocols, etc.
• A simple solution: invalidation-based protocol with snooping
Source: www.cs.cmu.edu

Inter-core Bus
[Figure] Source: www.cs.cmu.edu

Solutions for Cache Coherence
• Invalidation: • If a core writes to a data item, all other copies of this data item in other caches are invalidated.
• Snooping: • All cores continuously “snoop” (monitor) the bus connecting the cores.
Source: www.cs.cmu.edu

Solutions for Cache Coherence
[Figure] Source: https://image.slideserve.com/

The Cache Coherence Problem
• Revisited: Cores 1 and 2 have both read x

Source: www.cs.cmu.edu

The Cache Coherence Problem: Invalidation
• Core 1 writes to x, setting it to 21660

Source: www.cs.cmu.edu

The Cache Coherence Problem: Invalidation
• Core 2 reads x. The cache misses and loads the new copy.

Source: www.cs.cmu.edu

The Cache Coherence Problem: Update Protocol
• Core 1 writes x = 21660.

Source: www.cs.cmu.edu

Invalidation vs Update
• Multiple writes to the same location:
  – invalidation: bus traffic only on the first write
  – update: must broadcast each write (which includes the new variable value)
• Invalidation generally performs better: it generates less bus traffic
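This same coherence traffic is why false sharing hurts: two cores writing different variables that happen to share one cache line keep invalidating each other's copy. A common fix is to pad each per-core counter onto its own line (a sketch; a 64-byte line size is assumed, as on the Cortex-A72):

```c
/* Pad each per-core counter to a full 64-byte cache line, so that a
 * write by one core never invalidates the line holding another
 * core's counter. */
struct padded_counter {
    volatile long value;
    char pad[64 - sizeof(long)];
};

struct counters {
    struct padded_counter per_core[4];  /* one line per core */
};
```

Without the padding, all four counters would typically land in one cache line, and parallel increments would serialize on invalidation traffic even though no data is logically shared.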
Source: www.cs.cmu.edu

References
[1] M. Madrigal Solano. Lecture notes, HPEC, Tecnológico de Costa Rica, course: Sistemas Empotrados de Alto Desempeño.
[2] W. Wolf. High-Performance Embedded Computing: Architectures, Applications and Methodologies. Elsevier, United States of America, 2007.
[3] E. A. Lee and S. A. Seshia. Introduction to Embedded Systems: A Cyber-Physical Systems Approach, 2017.
[4] T. Noergaard. Embedded Systems Architecture, 2005.
Lecture notes and materials are available on TEC-Digital and the web portals:
www.ie.tec.ac.cr/sarriola/HPEC
http://www.ie.tec.ac.cr/joaraya/HPEC/