
The University of Texas at Arlington

Lecture 3

CSE 5343/4342 Embedded Systems II
Based heavily on slides by Dr. Roger Walker

Introduction to Multi-Core Architecture (Textbook - Chapter 1)


Concurrency in Software

• Most naïve users think of computer behavior as simple (e.g., just browsing the network).
• Meanwhile, however, the computer's OS has many processes running, many of them multi-threaded (GUI in the foreground, processing in the background).
• Concurrency allows multiple processes or threads (code representing different subsystems) to be in the execution state at the same time.
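As a minimal sketch of what concurrency looks like at the software level (the thread names and counts are illustrative, not from the slides), the following C program uses POSIX threads to put two instruction streams in the execution state at once; on a single core the OS interleaves them, and on multiple cores they may truly run in parallel:

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread executes this function independently of the other. */
    static void *worker(void *arg) {
        const char *name = arg;
        for (int i = 0; i < 3; i++)
            printf("%s: step %d\n", name, i);  /* output may interleave */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, "thread 1");
        pthread_create(&t2, NULL, worker, "thread 2");
        pthread_join(t1, NULL);  /* wait for both threads to finish */
        pthread_join(t2, NULL);
        return 0;
    }

Compile with gcc -pthread; the interleaving of the two threads' output varies from run to run.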


Parallel Computing

• Allows simultaneous execution of threads or processes – not the same thing as concurrent execution.
• Concurrency, once it is performed simultaneously, is parallelism.
• Computer architectures can be classified along two dimensions: the number of instruction streams that can be processed at any given time, and the number of data streams that can be processed at any given time.

Flynn’s Taxonomy

• Single Instruction Single Data (SISD)

• Multiple Instruction Single Data (MISD) – not practical

• Single Instruction Multiple Data (SIMD)

• Multiple Instruction Multiple Data (MIMD) – most modern architectures
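A rough illustration of the two most common categories (this sketch is mine, not from the slides): a loop that applies the same operation across an array is the kind of code a compiler can map onto SIMD instructions, while independent threads executing different code on different data correspond to MIMD.

    #include <stdio.h>

    #define N 8

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        /* One instruction (add) applied across multiple data elements;
           an optimizing compiler (e.g., gcc -O3) can emit SIMD
           instructions for this loop. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }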

Moore’s Law

The number of transistors on integrated circuits doubles approximately every 18 to 24 months.


Parallel Computing in Microprocessors

• Many people mistakenly think Moore’s Law is a predictor of clock speed, since clock speed used to follow it (0.1 MHz to 3 GHz).
• Parallelism can use the additional transistors more easily than increasing clock rates can:
– Instruction-Level Parallelism (ILP) – dynamic, out-of-order processing at the hardware level that avoids stalls
– Superscalar processors can use non-overlapping components of the CPU (multiple pipelines) to exploit ILP
• Multiple processes or threads are running on a computer today (at the software level):
– concurrent processing (preemptive)
– simultaneous thread processing (multiple processors, HT), which can easily make use of hardware parallelism or ILP


Threads

• A thread is a discrete sequence of related instructions that is executed independently of other instruction sequences.
• Hardware-level definition: a thread is an execution path that remains independent of other hardware execution paths.
• The OS maps software threads to hardware execution resources.
• A thread only needs the architecture state – registers, interrupt control, and the like (the execution units can be shared).
• Logical processors can be created by duplicating the architecture state.
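A Linux-specific sketch of the OS mapping software threads onto logical processors (sched_getcpu() is a glibc extension, so this is illustrative rather than portable):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Report which logical processor the OS scheduled this thread on. */
    static void *report(void *arg) {
        printf("software thread %ld -> logical CPU %d\n",
               (long)arg, sched_getcpu());
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, report, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

On a multi-core machine the four software threads typically land on different logical processors; with a single logical processor they all report the same one.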


What Are Threads Good For?

• Improving responsiveness of GUIs
• Making programs easier to understand
• Overlapping computation and I/O
• Improving performance through parallel execution – if the hardware allows
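As a sketch of the third point (the sleep() call stands in for a blocking read; a real program would block on actual I/O), one thread waits on "I/O" while the main thread keeps computing:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *io_thread(void *arg) {
        (void)arg;
        sleep(1);                 /* simulates a blocking I/O call */
        printf("I/O finished\n");
        return NULL;
    }

    int main(void) {
        pthread_t io;
        pthread_create(&io, NULL, io_thread, NULL);

        double sum = 0.0;         /* useful computation overlaps the I/O */
        for (long i = 1; i < 10000000L; i++)
            sum += 1.0 / i;

        pthread_join(io, NULL);
        printf("sum = %f\n", sum);
        return 0;
    }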


Thread Concurrency vs. Parallelism

Concurrency: two or more threads are in progress at the same time.

[Figure: Thread 1 and Thread 2 interleaved on a single core]

Parallelism: two or more threads are executing at the same time – multiple cores needed.

[Figure: Thread 1 and Thread 2 running simultaneously on separate cores]


Thread Level Parallelism

• Time-sliced multi-threading – single processor
• Simultaneous multi-threading – with superscalar support (see next slide)
• Multiple processors – multiple threads or processes run simultaneously on multiple processors
• Physical processor – includes many resources, including the architecture state (registers), caches, execution units, etc.


Simultaneous Multi-Threading

• Simultaneous multi-threading (SMT) – the actual execution units are shared by the different logical processors.
• Intel’s implementation is called Hyper-Threading (HT).
• To the OS (e.g., Linux or Windows), the computing unit appears as multiple physical processors, and threads are scheduled accordingly.


Intel’s HT Technology Implementation

• Two logical processors: architecture state and APIC¹ duplicated
• Shared execution units: caches, branch predictors, control logic, and buses

[Figure: one physical processor with a duplicated architecture state and Advanced Programmable Interrupt Control per logical processor, sharing the on-die cache, processor execution resources, and system bus]

¹ APIC: Advanced Programmable Interrupt Controller; handles interrupts sent to a specified logical processor.

CPU Architectures

[Figure: CPU architectures, from the textbook]


HT, Single Processor, Multiple Processor

[Figure: Thread 1 and Thread 2 scheduled on a single processor (time-sliced, with idle gaps), on an HT processor, and on dual processors]


Understanding Performance

• Parallel processing speedup:
Speedup = Time_best_sequential / Time_parallel_implementation

• Amdahl’s Law on speeding up a fraction of execution:
Total Speedup = 1 / [(1 - Fraction_Enhanced) + (Fraction_Enhanced / Speedup_Enhanced)]


Amdahl Example

• Amdahl’s Law on speeding up a fraction of execution:
Total Speedup = 1 / [(1 - Fraction_Enhanced) + (Fraction_Enhanced / Speedup_Enhanced)]
• Example: speed up half the program by 15% using parallel processing; then
Total Speedup = 1 / [(1 - 0.5) + (0.5 / 1.15)] ≈ 1.07
Thus the whole-program speedup is about 7 percent.
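A small numeric check of this example (the helper function name is mine):

    #include <stdio.h>

    /* Amdahl: total speedup when `fraction` of the run time
       is accelerated by a factor of `speedup`. */
    static double amdahl(double fraction, double speedup) {
        return 1.0 / ((1.0 - fraction) + fraction / speedup);
    }

    int main(void) {
        /* Half the program sped up by 15%: prints 1.07. */
        printf("total speedup = %.2f\n", amdahl(0.5, 1.15));
        return 0;
    }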


Amdahl’s Law Applied to Parallel Processing

• Expressing Amdahl’s Law in terms of the serial and parallel portions:
Speedup = 1 / [S + (1 - S)/n]
where S is the fraction of time spent executing the serial portion of the program and n is the number of execution cores.
• If n = 1, then there is no speedup.
• As n increases, the speedup approaches 1/S (e.g., if S is 10%, the maximum speedup is 10-fold).
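A short sketch of the 1/S limit (the parameter values are illustrative): with S = 0.1, the speedup climbs toward, but never exceeds, 10.

    #include <stdio.h>

    int main(void) {
        double S = 0.1;  /* serial fraction of the program */
        for (int n = 1; n <= 1024; n *= 4)
            printf("n = %4d  speedup = %5.2f\n",
                   n, 1.0 / (S + (1.0 - S) / n));
        return 0;  /* speedup approaches 1/S = 10 as n grows */
    }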


Parallel Code vs. Parallel Processors

• For 2 cores and a 30% parallelized program, speedup = 1/(0.7 + 0.3/2), i.e., execution time falls to ≈ 85% of the original.
• With 4 cores: 1/(0.7 + 0.3/4), i.e., ≈ 77.5% of the original execution time.
• Doubling the parallel portion (to 60%) with 2 cores: 1/(0.4 + 0.6/2), i.e., ≈ 70% of the original execution time.
• Thus only when the program is highly parallelized does adding more processors really help.
• (Calculations assume the parallel portion is arbitrarily parallelizable.)


Amdahl’s Law with Multi-core Threading Overhead

• H(n) – actual OS overhead plus inter-thread activities (synchronization and other forms of communication).
• Speedup = 1 / [S + (1 - S)/n + H(n)] – can be less than 1 if H(n) is high.
• Keep the H(n) due to threading small.


Amdahl’s Law with HT

• In HT, although multiple threads are running simultaneously, each of them runs slower. If each thread runs approximately k percent slower, then:
– Speedup = 1 / [S + (1 - k)(1 - S)/n + H(n)]
• E.g., if threads run 30% slower, then:
– Speedup = 1 / [S + 0.67(1 - S)/n + H(n)]


Amdahl’s Law – Assumptions Made

• The best-performing serial algorithm is limited by the availability of CPU cycles (no I/O). However, a multi-core processor with separate caches will reduce memory latency.
• The serial algorithm is the best possible solution. However, some problems may work better (larger speedup) with a parallel algorithm.
• As the number of cores increases, the problem stays the same size. However, with more resources, the problem grows to meet the available resources.


Gustafson’s Law

• Gustafson achieved near-linear speedup using a 1,024-processor hypercube at Sandia in the late 1980s.
• Gustafson’s Law: Scaled Speedup = N - (N - 1)·s, where s is the non-parallelizable part of the work and N is the number of processors.
• Addresses the last assumption (fixed problem size) in Amdahl’s Law by assuming a fixed-time concept, which leads to scaled speedup.
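A brief sketch contrasting the two laws for the same serial fraction (the values are illustrative, not from the slides):

    #include <stdio.h>

    int main(void) {
        double s = 0.1;  /* serial (non-parallelizable) fraction */
        for (int N = 1; N <= 1024; N *= 4) {
            double scaled = N - (N - 1) * s;            /* Gustafson */
            double fixed  = 1.0 / (s + (1.0 - s) / N);  /* Amdahl    */
            printf("N = %4d  scaled = %7.1f  fixed-size = %.2f\n",
                   N, scaled, fixed);
        }
        return 0;  /* scaled speedup grows with N; fixed-size caps at 1/s */
    }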


Summary

• Concurrency refers to the notion of multiple threads in progress at the same time.
• Parallelism refers to the concept of multiple threads executing simultaneously.
• Modern software applications often consist of multiple processes or threads that can be executed in parallel.
• Most modern computing platforms are multiple instruction, multiple data (MIMD) machines. These machines allow programmers to process multiple instruction and data streams simultaneously.

Summary

• In practice, Amdahl's Law does not accurately reflect the benefit of increasing the number of processor cores on a given platform.
• Linear speedup is achievable by expanding the problem size with the number of processor cores.

