High Performance Embedded Systems

Parallelism

Electronics Engineering Department Electronics Master Program

June 2020 Parallel Programming Analogy

2 Source: Wikipedia.org

Amdahl’s Law

• Basic Exercise: Your program takes 20 days to run o 95% can be parallelized o 5% cannot (serial) o What is the fastest this code can run? ✓ As many CPUs as you want!

1 day! Or a speedup of 1/(1−p) = 1/0.05 = 20, so 20 days / 20 = 1 day.

3 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

4 Flynn’s Taxonomy

• It is a classification of architectures, proposed by Michael J. Flynn in 1966.

• He proposed four different classes for processors architectures:

o Single-instruction, single-data (SISD)
o Single-instruction, multiple-data (SIMD)
o Multiple-instruction, single-data (MISD)
o Multiple-instruction, multiple-data (MIMD)

5 Single-instruction, Single-data (SISD)

• Sequential => no parallelism in either the instruction or data streams.

• A single stream of instructions operates on a single set of data.

• SISD can have concurrent processing characteristics.

6 Single-instruction, Multiple-data (SIMD)

• Same simultaneous instructions, own data!

• All PUs perform the same operation on their own data in lockstep.

• A single program is possible for all processing elements.

7 SIMD vs SISD Example

• Multiply registers R1 (variable) and R2 (constant)
• Store the result in RAM

(Figure: SISD vs SIMD implementations)

8 Multiple-instruction, Single-data (MISD)

• Different simultaneous instructions, same data!

• Pipelined architectures fit this category under the condition above.

• Computer vision example: Multiple filters applied to same input image.

9 Multiple-instruction, Multiple-data (MIMD)

• Different simultaneous instructions, different data!

• PUs are asynchronous and independent.

• Application fields: Computer-Aided Design (CAD), simulation, modeling, Embedded Systems, etc.

• Multi-core processors are MIMD!!!

10 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

11 Scalar Processor

• The most basic implementation for processors is the scalar processor.

• Executes at most one single instruction at a time. The instruction can involve either an integer or floating-point operation.

• A scalar processor is classified as a SISD processor (Single Instruction, Single Data) in Flynn’s taxonomy.

12 Scalar Processor

• Not used nowadays in its basic conceptual definition.

• Does it have any advantage?
• Simplicity!
• Does not have data or control hazards, i.e. there are no hardware conflicts from trying to execute simultaneous instructions.
• The design is straightforward.

13 Can we do better than the scalar processor?

Yes!

14 Superscalar Processor

• Allows implementation of Instruction Level Parallelism (ILP) with a single processor.

• Executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor.

• Allows more throughput at a given clock rate than a scalar processor.

15 Superscalar Processor

• Allows implementation of Instruction Level Parallelism (ILP) with a single processor.

• All general-purpose CPUs developed since about 1998 are superscalar. Even our RPI4’s Processor, the ARM Cortex-A72.

• Characteristics:
➢ CPU dynamically checks for data dependencies between instructions at run time.
➢ Instructions are issued from a sequential instruction stream.
➢ CPU can execute multiple instructions per clock cycle.

16 But… How can we do even better?

Parallelization!

• The overall objective is to speed up execution of the program.

• There is another reason on embedded systems: concurrency of physical processes.

• Parallelization enables a better energy efficiency => Interesting for Embedded Systems!

17 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

18 Why do we care about the Architecture?

Parallel speed up is affected by:
• Hardware configuration for each application.
• Memory or CPU architecture.
• Number of cores per processor.
• Network speed and architecture.

19 Why do we care about the Architecture?

• A multicore machine is a combination of several processors on a single chip.
• Homogeneous multicore processor: composed of multiple instances of the same processor type on a single chip.
• Heterogeneous multicore processor: composed of a variety of processor types on a single chip.
• Multicore architectures are a powerful ally on embedded systems since they make real-time requirements easier to meet.

20 Single-core Architecture

21 Source: www.cs.cmu.edu

Why multi-core?
• Difficult to make single-core clock frequencies even higher (physical restrictions)
• Deeply pipelined circuits:
– heat problems
– difficult design and verification
– large design teams necessary
– server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)

22 CMOS General Power Consumption

• There are 2 main components:

• Static power: P_static = I_static ∗ V_dd
o Transistors are not switching
o Leakage current due to tunneling through thin gate oxide

• Dynamic power: P_dynamic = C_L ∗ V_dd² ∗ f ∗ N
Where:
▪ C_L: Load capacitance (chip internal capacitance)
▪ f: Frequency of operation
▪ N: Number of bits that are switching
▪ V_dd: Supply voltage

23 Multi-Core Energy Consumption

Processor               Broadcom BCM2711    Intel 4
Number of Cores         4                   1
Frequency [MHz]         1500                2992
RAM                     LPDDR4 (4GB)        DDR (2GB)
L1 [kB]                 32                  16
Compiler Version        GCC 8.3.0           GCC 4.1.3
Power Consumption [W]   4                   84
CoreMark Score          33072.76            5007.06

24 Multi-core Architecture

• Multiple processors in a single die.

Core 1 Core 2 Core 3 Core 4

25 Source: www.cs.cmu.edu Multi-core Architecture

• Multiple processors in a single die.

Core 1 Core 2 Core 3 Core 4

Thread 1 Thread 2 Thread 3 Thread 4

26 Source: www.cs.cmu.edu Can we run several threads on each core?

27 Multi-core Architecture

• Multiple processors in a single die.

Core 1 Core 2 Core 3 Core 4

(each core runs several threads, e.g. Threads 1–8 across the 4 cores)

28 Source: www.cs.cmu.edu

Multiprocessor Memory Types: Shared Memory

• Creates a single group of memory modules and multiple processors.

• Any processor is able to directly access any memory module by means of an interconnection network.

• The group of memory modules outlines a universal address space that is shared between the processors.

• Advantage: easy to program => communication among processors through the global memory store.

29 Multiprocessor Memory Types: Distributed Memory

• Clones the memory/processor pairs, known as a processing element (PE), and links them by using an interconnection network.

• Each PE can communicate with others by sending messages.

• Advantage: each processor has its own memory => no memory concurrency across processors.

30 Interaction with the Operating System

• OS perceives each core as a separate processor

• OS scheduler maps threads/processes to different cores

• Most major OS support multi-core today: o Windows, Linux, Mac OS X,

31 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

32 Pipelining

• During the execution of instructions, the data path has idle hardware sections.
• Pipelining focuses on extending the data path to allow more than one instruction to execute using the idle hardware.
• Latches are used between each data path stage to hold the data for each instruction.

33 Pipelining: Reservation Table

• Consider 5 different tasks named: A, B, C, D and E.
• At clock 9, we have:

34 Hazards

• There are two main pipeline hazards in the example shown:
o Data hazard
o Control hazard

35 Pipeline Hazards: Data Hazard

• Consider the following stage of the reservation table:

What happens if B requires the register values written by A?

36 Pipeline Hazards: Data Hazard

• There are different methods to overcome this situation:
• Explicit pipeline:

37 Pipeline Hazards: Data Hazard

• Interlock:

38 Pipeline Hazards: Data Hazard

• Out-of-order execution:

39 Pipeline Hazards: Control Hazard

• Consider the following case:
➢ If A is a conditional instruction, it should reach the memory stage before the next instruction executes, to decide whether that instruction should be executed or not.

40 Pipeline Hazards: Control Hazard

• Common solutions for control hazards:

o Delayed branch: ➢ Add no-ops or other instructions at compile time until the condition information is available to decide if the next instruction should be executed or not.

o Speculative execution: ➢ Hardware executes the instruction it expects to execute. If it was wrong, it undoes any side effects.

41 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

42 Instruction-level Parallelism

• Parallelism at the machine-instruction level => Emphasis on hardware (low-level parallelization).

• The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.

• Instruction-level parallelism enabled rapid increases in processor speeds over the last 20 years.

43 Instruction-level Parallelism

• A processor supporting ILP is able to perform multiple independent operations in each clock cycle.

• A single instruction may take several clock cycles.

• There are 4 major forms of ILP:
- CISC
- Sub-word parallelism
- Superscalar
- VLIW

44 Instruction-level Parallelism: CISC

45 Instruction-level Parallelism: CISC

• CISC => Complex Instruction Set Computer.

• Implements complex, and commonly specialized, operations in a single assembly instruction.

46 Instruction-level Parallelism: CISC • CISC Example: TMS320c54x DSP

47 Instruction-level Parallelism: CISC • CISC Example: TMS320c54x DSP

48 Instruction-level Parallelism: Sub-word Parallelism

49 Instruction-level Parallelism: Sub-word Parallelism

• It makes no sense to spend an entire 64-bit long data word to compute a single 8-bit data (e.g. pixel value).

• A wide ALU is divided into narrower slices, allowing simultaneous arithmetic or logical operations over multiple sub-word values.

• The same operation is executed on all sub-words.

• Implemented as vector units or processors (e.g. ARM NEON).

50 Instruction-level Parallelism: Sub-word Parallelism

• Example: SSE - Streaming SIMD Extension

51 Instruction-level Parallelism: Superscalar Parallelism

52 Instruction-level Parallelism: Superscalar Parallelism

• Superscalar processors use conventional sequential instruction sets.
• Hardware can simultaneously dispatch multiple instructions to distinct hardware units when it detects that such simultaneous dispatch will not change the behavior of the program, e.g. via out-of-order execution.

53 Instruction-level Parallelism: Superscalar Parallelism

• Includes additional functional hardware units (ALU, multiplier, registers, etc.).
• Dynamically determines the instruction execution order.
• Issues more than one instruction per clock cycle.
• Popular on desktop and server architectures.
• Execution time may be difficult to predict or may not be repeatable when using threads/interrupts => a big problem for embedded systems.

54 Instruction-level Parallelism: Superscalar Parallelism

• Usually implements pipelining to take advantage of the processing stages of the data path.

55 Superscalar vs Pipelining

• The superscalar architecture executes multiple instructions in parallel by using multiple execution units.

• A pipeline architecture executes multiple instructions in the same execution unit in parallel by dividing that unit into different phases.

56 Instruction-level Parallelism: VLIW

57 Instruction-level Parallelism: VLIW

• VLIW => Very Long Instruction Word

• A fixed number of operations are formatted as one big instruction by the compiler. This big instruction is called a bundle.

• The program counter points to 1 bundle (not 1 operation).

• More predictable and repeatable timing => Interesting for time- critical Embedded Systems.

58 Instruction-level Parallelism: VLIW

How do VLIW designs reduce hardware complexity (in theory)?

• No dependence checking for instructions within a bundle

• Simpler instruction dispatch
• No out-of-order execution, no instruction grouping
• No structural hazard checking logic

Compiler figures all this out!

59 Instruction-level Parallelism: VLIW

• Requires additional functional hardware units.
• Does not determine the execution order dynamically; instead, a single instruction specifies the different operations to be executed on the different functional units.
• The VLIW instruction set combines multiple independent operations into a single instruction.

60 Instruction-level Parallelism: VLIW

Compiler support to increase ILP:

• Compiler detects hazards:
• Structural hazards:
o No 2 operations to the same functional unit
o No 2 operations to the same memory bank
• Data hazards:
o No data hazards among instructions in a bundle
• Control hazards:
o Predicated execution

o Static branch prediction

61 Instruction-level Parallelism: VLIW

VLIW Example: TMS320C674x

62 TMS320C674x DSP Decode Stage

Instruction-level Parallelism: VLIW vs Superscalar

Superscalar => Complex hardware for instruction scheduling:
• Out-of-order execution
• Dependence checking logic between parallel instructions
• Functional unit hazard checking
• Possible consequences:
o Longer cycle times
o More chip real estate
o More power consumption
• Superscalars can more efficiently execute pipeline-dependent code.

VLIW => More functional units:
• Larger code size
• Requires a complex compiler
• More design effort, or poor-quality code if good compiler optimizations aren’t implemented
• Simpler hardware
• VLIWs are more predictable
• Predicated execution to avoid branches

63 Instruction-level Parallelism: VLIW vs Superscalar

64 Outline

• Flynn’s Taxonomy

• The Scalar Processor

• Multicore Architectures

• Pipelining

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism (TLP)

65 Thread-Level Parallelism (TLP)

• This is parallelism on a coarser scale than ILP.

• Server can serve each client in a separate thread (Web server, database server)

• A computer game can do AI, graphics, and physics in three separate threads

• Single-core superscalar processors cannot fully exploit TLP

• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP.

66 Review: Flynn’s Taxonomy and Memory

• SIMD – Single instruction, multiple data – Modern graphics cards

• MIMD – Multiple instructions, multiple data – Modern multiprocessors: Different cores execute different threads

(Figure: shared-memory vs distributed-memory organization)

67 Thread-Level Parallelism (TLP) Methods

• Simultaneous Multithreading (SMT) – Single core, multiple functional units – Not a “true” parallel processor

• Multi-core Parallelism – Threads run independently of each other

68 Simultaneous Multithreading (SMT)

69 Simultaneous Multithreading (SMT)

• Problem addressed: The processor pipeline can get stalled:

o Waiting for the result of a long floating point (or integer) operation. o Waiting for data to arrive from memory.

70 Source: www.cs.cmu.edu Simultaneous Multithreading (SMT)

• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core.

• Weaving together multiple “threads” on the same core.

• How is this done? Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units.

71 Source: www.cs.cmu.edu Without SMT

• Only a single thread can run at any given time


73 Source: www.cs.cmu.edu Simultaneous Multithreading (SMT)

• Both threads can run concurrently

74 Source: www.cs.cmu.edu Simultaneous Multithreading (SMT)

• But: Can’t simultaneously use the same functional unit

This scenario is impossible with SMT on a single core (assuming a single integer unit).

76 Source: www.cs.cmu.edu

Simultaneous Multithreading (SMT)

• SMT is not a true parallel processor.

• Enables better threading (e.g. up to 30%)

• OS and applications perceive each simultaneous thread as a separate “virtual processor”

• The chip has only a single copy of each resource

• Compare to multi-core: each core has its own copy of resources

77 Source: www.cs.cmu.edu Multi-core Parallelism

78 Source: www.cs.cmu.edu Multi-core Parallelism

79 Source: www.cs.cmu.edu

Multi-core Parallelism + SMT

• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: nowadays

• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads

• Intel calls them “hyper-threads”. The Pentium 4 was the first hyper-threaded µP.

80 Source: www.cs.cmu.edu

Multi-core Parallelism + SMT

81 Source: www.cs.cmu.edu

Comparison: Multi-core vs SMT

• Multi-core:
– Since there are several cores:
o Each is smaller and not as powerful (but also easier to design and manufacture)
o However, great with thread-level parallelism

• SMT:
o Can have one large and fast superscalar core
o Great performance on a single thread
o Mostly still only exploits instruction-level parallelism

82 Source: www.cs.cmu.edu

The Memory Hierarchy

• If simultaneous multithreading only:
– all caches shared

• Multi-core chips: – L1 caches private – L2 caches private in some architectures and shared in others

• Memory is always shared

83 Source: www.cs.cmu.edu

Private vs Shared Caches

• Advantages of private:
o They are closer to the core, so faster access
o Reduces contention

• Advantages of shared:
o Threads on different cores can share the same cache data
o More cache space available if a single (or a few) high-performance thread runs on the system

84 Source: www.cs.cmu.edu

The Cache Coherence Problem

• Since we have private caches: how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores.

85 Source: www.cs.cmu.edu The Cache Coherence Problem • Suppose variable x initially contains 15213

86 Source: www.cs.cmu.edu The Cache Coherence Problem • Core 1 reads “x”

87 Source: www.cs.cmu.edu The Cache Coherence Problem • Core 2 reads “x”

88 Source: www.cs.cmu.edu The Cache Coherence Problem • Core 2 has a stale copy

89 Source: www.cs.cmu.edu Solutions for Cache Coherence

• This is a general problem with multiprocessors, not limited just to multi-core

• There exist many solution algorithms, coherence protocols, etc.

• A simple solution: invalidation-based protocol with snooping

90 Source: www.cs.cmu.edu Inter-core

91 Source: www.cs.cmu.edu Solutions for Cache Coherence

• Invalidation:
o If a core writes to a data item, all other copies of this data item in other caches are invalidated.

• Snooping:
o All cores continuously “snoop” (monitor) the bus connecting the cores.

92 Source: www.cs.cmu.edu Solutions for Cache Coherence

93

Source: https://image.slideserve.com/

The Cache Coherence Problem

• Revisited: Cores 1 and 2 have both read x

94 Source: www.cs.cmu.edu The Cache Coherence Problem: Invalidation • Core 1 writes to x, setting it to 21660

95 Source: www.cs.cmu.edu The Cache Coherence Problem: Invalidation • Core 2 reads x. Cache misses, and loads the new copy.

96 Source: www.cs.cmu.edu The Cache Coherence Problem: Update Protocol

• Core 1 writes x=21660.

97 Source: www.cs.cmu.edu Invalidation vs update

• Multiple writes to the same location:
– Invalidation: only the first write generates bus traffic
– Update: must broadcast each write (which includes the new variable value)

• Invalidation generally performs better: generates less bus traffic

98 Source: www.cs.cmu.edu References

[1] M. Madrigal Solano. Lecture Notes HPEC, Tecnológico de Costa Rica, Course: Sistemas Empotrados de Alto Desempeño.
[2] W. Wolf. High-Performance Embedded Computing: Architectures, Applications and Methodologies. Elsevier, United States of America, 2007.
[3] E. A. Lee and S. A. Seshia. Introduction to Embedded Systems: A Cyber-Physical Systems Approach, 2017.
[4] T. Noergaard. Embedded Systems Architecture, 2005.

Lecture notes and materials are available on TEC-Digital and the web portals:

www.ie.tec.ac.cr/sarriola/HPEC
http://www.ie.tec.ac.cr/joaraya/HPEC/