
High Performance Embedded Systems: Parallelism
Electronics Engineering Department, Electronics Master Program, June 2020

Parallel Programming Analogy
(figure; source: Wikipedia.org)

Amdahl's Law
• Basic exercise: your program takes 20 days to run.
 o 95% can be parallelized.
 o 5% cannot (it is serial).
 o What is the fastest this code can run?
 ✓ With as many CPUs as you want: the speedup limit is 1/(1−p) = 1/0.05 = 20, so the 20-day run drops to 1 day!

Outline
• Flynn's Taxonomy
• The Scalar Processor
• Multicore Architectures
• Pipelining
• Instruction-Level Parallelism (ILP)
• Thread-Level Parallelism (TLP)

Flynn's Taxonomy
• A classification of computer architectures, proposed by Michael J. Flynn in 1966.
• It defines four classes of processor architectures:
 o Single-instruction, single-data (SISD)
 o Single-instruction, multiple-data (SIMD)
 o Multiple-instruction, single-data (MISD)
 o Multiple-instruction, multiple-data (MIMD)

Single-instruction, Single-data (SISD)
• Sequential => no parallelism in either the instruction or the data stream.
• A single stream of instructions operates on a single set of data.
• SISD machines can still have concurrent processing characteristics.

Single-instruction, Multiple-data (SIMD)
• Same simultaneous instruction, own data!
• All processing units (PUs) perform the same operation, each on its own data, in lockstep.
• A single program counter can serve all processing elements.

SIMD vs SISD Example
• Multiply register R1 (variable) by R2 (constant).
• Store the result in RAM.
• (figure contrasting the SISD and SIMD versions; a NEON code sketch appears after the superscalar section below)

Multiple-instruction, Single-data (MISD)
• Different simultaneous instructions, same data!
• Pipelined architectures fit this category under the condition above.
• Computer vision example: multiple filters applied to the same input image.

Multiple-instruction, Multiple-data (MIMD)
• Different simultaneous instructions, different data!
• PUs are asynchronous and independent.
• Application fields: Computer-Aided Design (CAD), simulation, modeling, embedded systems, etc.
• Multi-core processors are MIMD!

Scalar Processor
• The most basic processor implementation.
• Executes at most one instruction at a time; the instruction can involve either an integer or a floating-point operation.
• Classified as SISD (Single Instruction, Single Data) in Flynn's taxonomy.
• Not used nowadays in its basic conceptual form.
• Does it have any advantage? Simplicity!
 o No data or control hazards, i.e., there are no hardware conflicts from trying to execute simultaneous instructions.
 o The compiler is straightforward.

Can we do better than the scalar processor? Yes: the superscalar processor.

Superscalar Processor
• Implements instruction-level parallelism (ILP) within a single processor.
• Executes more than one instruction per clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor.
• Achieves more throughput at a given clock rate than a scalar processor.
• All general-purpose CPUs developed since about 1998 are superscalar, including our RPi 4's processor, the ARM Cortex-A72.
• Characteristics:
 ➢ The CPU dynamically checks for data dependencies between instructions at run time.
 ➢ Instructions are issued from a sequential instruction stream.
 ➢ The CPU can execute multiple instructions per clock cycle.
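To make the SIMD vs SISD multiply example above concrete, here is a minimal C sketch (not from the slides) that applies the same constant multiply both ways, using the ARM NEON intrinsics that the RPi 4's Cortex-A72 supports. The array contents, the constant k, and the compile line are illustrative assumptions.

```c
/* A minimal sketch of the SIMD vs SISD multiply (assumed, not from the
 * slides). Targets ARM with NEON; on the RPi 4, build with e.g.:
 *   gcc -O2 -mcpu=cortex-a72 simd.c -o simd                          */
#include <arm_neon.h>
#include <stdio.h>

#define N 8

int main(void) {
    float in[N]  = {1, 2, 3, 4, 5, 6, 7, 8};  /* illustrative data   */
    float out[N];
    const float k = 3.0f;                     /* the constant in R2  */

    /* SISD: one multiply instruction per element. */
    for (int i = 0; i < N; i++)
        out[i] = in[i] * k;

    /* SIMD: one vector multiply handles four elements in lockstep. */
    float32x4_t vk = vdupq_n_f32(k);          /* broadcast constant  */
    for (int i = 0; i < N; i += 4) {
        float32x4_t v = vld1q_f32(&in[i]);    /* load 4 floats       */
        vst1q_f32(&out[i], vmulq_f32(v, vk)); /* multiply and store  */
    }

    for (int i = 0; i < N; i++)
        printf("%.1f ", out[i]);
    printf("\n");
    return 0;
}
```

The scalar loop issues one multiply per element, while the NEON loop issues one vector instruction per four elements, which is exactly the lockstep behavior described in the SIMD slide.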
But… how can we do even better? Parallelization!
• The overall objective is to speed up execution of the program.
• On embedded systems there is another reason: concurrency of physical processes.
• Parallelization also enables better energy efficiency => interesting for embedded systems!

Why Do We Care About the Architecture?
• Parallel speed-up is affected by:
 o The hardware configuration for each application.
 o The memory and CPU architecture.
 o The number of cores per processor.
 o Network speed and architecture.
• A multicore machine combines several processors on a single chip.
• Homogeneous multicore processor: multiple instances of the same processor type on a single chip.
• Heterogeneous multicore processor: a variety of processor types on a single chip.
• Multicore architectures are a powerful ally in embedded systems, since they make it easier to meet real-time requirements.

Single-core Architecture
(figure; source: www.cs.cmu.edu)

Why Multi-core?
• It is difficult to push single-core clock frequencies any higher (physical restrictions).
• Deeply pipelined circuits bring:
 – heat problems
 – difficult design and verification
 – large design teams
 – server farms that need expensive air-conditioning
• Many new applications are multithreaded.
• General trend in computer architecture: a shift towards more parallelism.

CMOS General Power Consumption
• There are two main components.
• Static power: P_static = I_static * V_dd
 o Drawn while transistors are not switching.
 o Caused by leakage current tunneling through the thin gate oxide.
• Dynamic power: P_dynamic = C_L * C * V_dd^2 * f * N, where:
 ▪ C_L: load capacitance
 ▪ C: chip internal capacitance
 ▪ f: frequency of operation
 ▪ N: number of bits that are switching
 ▪ V_dd: supply voltage

Multi-Core Energy Consumption

 Processor               Broadcom BCM2711   Intel Pentium 4
 Number of cores         4                  1
 Frequency [MHz]         1500               2992
 RAM                     LPDDR4 (4 GB)      DDR (2 GB)
 L1 cache [kB]           32                 16
 Compiler version        GCC 8.3.0          GCC 4.1.3
 Power consumption [W]   4                  84
 CoreMark score          33072.76           5007.06

Multi-core Architecture
• Multiple processors on a single die.
• (figure: four cores, one thread per core; source: www.cs.cmu.edu)

Can we run several threads on each core?
• Yes: (figure: four cores, each running several threads, Threads 1–8; source: www.cs.cmu.edu)

Multiprocessor Memory Types: Shared Memory
• A single pool of memory modules serves multiple processors.
• Any processor can directly access any memory module through an interconnection network.
• The memory modules form a universal address space that is shared between the processors.
• Advantage: easy to program => processors communicate through the global memory store (see the pthreads sketch at the end of this section).

Multiprocessor Memory Types: Distributed Memory
• Replicates memory/processor pairs, each known as a processing element (PE), and links them with an interconnection network.
• Each PE communicates with the others by sending messages.
• Advantage: each processor has its own memory => no memory concurrency across processors.

Interaction with the Operating System
• The OS perceives each core as a separate processor.
• The OS scheduler maps threads/processes to different cores (see the affinity sketch below).
• Most major operating systems support multi-core today: Windows, Linux, Mac OS X.
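To make the shared-memory model concrete, here is a minimal pthreads sketch (assumed for this text, not from the slides): four threads communicate through a single global counter in the shared address space, and a mutex serializes access to it. The thread count, iteration count, and names are illustrative.

```c
/* Shared-memory sketch (assumed, not from the slides).
 * Build with: gcc -O2 shared.c -pthread -o shared                    */
#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define ITERS   100000

static long counter = 0;   /* lives in the shared address space      */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);    /* serialize the shared update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, THREADS * ITERS);
    return 0;
}
```

The ease of programming claimed above is visible here: the threads never exchange messages, they simply read and write the same global variable, at the cost of explicit synchronization.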
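The scheduler normally decides the thread-to-core mapping on its own. On Linux that mapping can also be constrained explicitly; the sketch below (assumed, GNU/Linux-specific, not from the slides) pins a worker thread to one core of a quad-core chip such as the BCM2711. The choice of core 2 is arbitrary.

```c
/* Thread-to-core affinity sketch (assumed; GNU/Linux-specific).
 * Build with: gcc affinity.c -pthread -o affinity                    */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    printf("worker running on core %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);   /* allow only core 2 (cores are numbered from 0) */

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    pthread_t t;
    pthread_create(&t, &attr, worker, NULL);  /* created already pinned */
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```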
Pipelining
• During the execution of instructions, the data path has idle hardware sections.
• Pipelining extends the data path so that more than one instruction can be in execution at a time, using the otherwise idle hardware.
• Latches between the data path stages hold the data belonging to each in-flight instruction.

Pipelining: Reservation Table
• Consider five different tasks named A, B, C, D and E.
• (figure: reservation table showing pipeline occupancy at clock 9)

Pipeline Hazards
• The example above exposes the two main pipeline hazards:
 o Data hazards
 o Control hazards

Pipeline Hazards: Data Hazard
• Consider the following stage of the reservation table (figure): what happens if B requires the register values written by A?
• There are different methods to overcome this situation:
 o Explicit pipeline (figure): the hazard is documented, and the compiler (or programmer) must work around it, e.g., by inserting no-ops.
 o Interlock (figure): the hardware detects the dependency and stalls B until A's result is available.
 o Out-of-order execution (figure): the hardware lets later, independent instructions proceed while B waits.

Pipeline Hazards: Control Hazard
• Consider the following case (figure):
 ➢ If A is a conditional instruction, it must reach the memory stage before the next instruction executes, in order to decide whether that instruction should be executed at all.
• Common solutions for control hazards:
 o Delayed branch:
 ➢ Add no-ops or other instructions at compile time until the condition information is available to decide whether the next instruction should execute.
 o Speculative execution:
 ➢ The hardware executes the instruction it expects to execute; if it was wrong, it undoes any side effects.

Instruction-level Parallelism
• Parallelism at the machine-instruction level => emphasis on hardware (low-level parallelization).
• The processor can re-order instructions, pipeline them, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled the rapid increases in processor speeds of the last 20 years.
• A processor supporting ILP is able to perform multiple independent operations in each instruction cycle.
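As a closing illustration (a sketch assumed for this text, not from the slides), the two loops below compute the same sum. The first forms one long dependency chain, so every add must wait for the previous one, much like the data hazard above; the second splits the work into four independent chains, giving a superscalar core multiple independent operations to issue in the same cycle.

```c
/* ILP sketch (assumed, not from the slides).
 * Build with: gcc -O2 ilp.c -o ilp                                   */
#include <stdio.h>

#define N (1 << 20)

static float data[N];

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = 1.0f;

    /* One chain: each add has a read-after-write dependency on the
     * previous add, so they cannot issue in parallel.                */
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += data[i];

    /* Four chains: adds from different chains are independent, so a
     * superscalar core can keep several adders busy per cycle.       */
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    printf("%.0f %.0f\n", s, s0 + s1 + s2 + s3);
    return 0;
}
```

Built with gcc -O2 (which does not auto-vectorize this float reduction), the four-chain version typically runs noticeably faster on a superscalar core such as the Cortex-A72, purely because of the extra instruction-level parallelism.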