An Introduction to Parallel Architectures

Impact of Parallel Architectures

• From cell phones to supercomputers
• In regular CPUs as well as GPUs

Parallel HW Processing

• Why?
• Which types?
• Which novel issues?

Why Multicores?

The SPECint performance of the hottest chip grew by 52% per year from 1986 to 2002, and then grew by only 20% in total over the next three years (about 6% per year).

[from Patterson & Hennessy]

Diminishing returns from uniprocessor designs

Power Wall [from Patterson & Hennessy]

• The design goal for the late 1990s and early 2000s was to drive the clock rate up, by adding more transistors to a smaller chip.

• Unfortunately, this increased the power dissipation of the CPU chip beyond the capacity of inexpensive cooling techniques.

The Multicore Approach

Multiple cores on the same chip:
– Simpler
– Slower
– Less power demanding

Transition to Multicore: Intel Xeon (18 cores)

[Figure: sequential application performance over time]

Flynn Taxonomy of Parallel Computers

                              Data streams
                              Single      Parallel
  Instruction     Single      SISD        SIMD
  streams         Multiple    MISD        MIMD

Alternative Kinds of Parallelism: Single Instruction/Single Data Stream

• Single Instruction, Single Data stream (SISD)
  – Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines.

Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream

• Multiple Instruction, Single Data streams (MISD)
  – Computer that exploits multiple instruction streams against a single data stream, for data operations that can be naturally parallelized (for example, certain kinds of array processors).
  – No longer commonly encountered; mainly of historical interest only.

Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream

• Single Instruction, Multiple Data streams (SIMD)
  – Computer that exploits multiple data streams against a single instruction stream, for operations that may be naturally parallelized, e.g., SIMD instruction extensions or Graphics Processing Units (GPUs).

Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams

• Multiple Instruction, Multiple Data streams (MIMD)
  – Multiple autonomous processors simultaneously executing different instructions on different data.
  – MIMD architectures include multicore and Warehouse-Scale Computers (datacenters).

Flynn Taxonomy of Parallel Computers

                              Data streams
                              Single                    Parallel
  Instruction     Single      SISD: Intel Pentium 4     SIMD: SSE, ARM NEON, GPGPUs, …
  streams         Multiple    MISD: no examples today   MIMD: SMP (Intel, ARM, …)

• As of 2011, SIMD and MIMD are the most common parallel computers
• Most common parallel processing programming style: Single Program Multiple Data (“SPMD”)
  – Single program that runs on all processors of an MIMD
  – Cross-processor execution coordination through conditional expressions (thread parallelism)
• SIMD (aka hw-level data parallelism): specialized function units for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)

SIMD Architectures

• Data parallelism: executing one operation on multiple data streams
  – Single control unit
  – Multiple datapaths (processing elements – PEs) running in parallel
• PEs are interconnected and exchange/share data as directed by the control unit
• Each PE performs the same operation on its own local data

• Example to provide context:
  – Multiplying a coefficient vector by a data vector (e.g., in filtering):
      y[i] := c[i] × x[i],  0 ≤ i < n
    (a C sketch of this loop follows below)
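A minimal C sketch of this loop (the function name multiply_vectors is illustrative, not from the slides): a scalar core issues one multiply per element; a SIMD unit, or an auto-vectorizing compiler, can apply the same multiply to several elements per instruction.

/* Scalar y[i] = c[i] * x[i]: one multiply per loop iteration. */
void multiply_vectors(const float *c, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = c[i] * x[i];   /* a SIMD machine handles several i's at once */
}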

“Advanced Digital Media Boost”

• To improve performance, SIMD instructions:
  – Fetch one instruction, do the work of multiple instructions

Example: SIMD Array Processing

for each f in array
    f = sqrt(f)

SISD:
for each f in array {
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}

SIMD:
for each 4 members in array {
    load 4 members to the SIMD register
    calculate 4 square roots in one operation
    write the result from the register to memory
}
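The SIMD pseudocode above maps directly onto x86 SSE intrinsics; here is a minimal sketch (it assumes the array is 16-byte aligned and its length is a multiple of 4; the function name sqrt_simd is made up for the example):

#include <immintrin.h>   /* x86 SSE intrinsics */

void sqrt_simd(float *array, int n)   /* n multiple of 4, array 16-byte aligned */
{
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&array[i]);   /* load 4 members into a SIMD register */
        v = _mm_sqrt_ps(v);                  /* calculate 4 square roots in one operation */
        _mm_store_ps(&array[i], v);          /* write the 4 results back to memory */
    }
}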

MIMD Architectures

• Multicore architectures
• At least 2 processors interconnected via a communication network
  – abstractions (HW/SW interface)
  – organizational structure to realize the abstraction efficiently

MIMD Architectures

• Thread-level parallelism
  – Have multiple program counters
  – Targeted for tightly-coupled shared-memory multiprocessors

• For n processors, need n threads

• Amount of computation assigned to each thread = grain size (see the sketch after this list)
  – Threads can be used for data-level parallelism, but the overheads may outweigh the benefit
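As a hedged illustration of thread-level parallelism with POSIX threads (the thread count, array size, and the names worker and NTHREADS are invented for the sketch): each thread gets one contiguous chunk of the data, and that chunk is its grain size.

#include <pthread.h>

#define NTHREADS 4                     /* one thread per processor */
#define N 1000000

static float data[N];

/* Each thread processes a chunk of N/NTHREADS elements: the grain size.
   Too fine a grain and thread-management overhead dominates. */
static void *worker(void *arg) {
    long id = (long)arg;
    long chunk = N / NTHREADS;
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        data[i] *= 2.0f;
    return NULL;
}

int main(void) {                       /* build with: cc tlp.c -lpthread */
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}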

MIMD Architectures

• And what about memory?
• How is data accessed by multiple cores?
• How should the memory system be designed accordingly?
• How do traditional solutions from uniprocessor systems affect multicores?

Recall: The Memory Gap

[Figure: processor vs. memory performance, 1980–2005, plotted on a log scale from 1 to 100,000; the gap widens steadily. From H&P, Fig. 5.2]

• Bottom line: memory access is increasingly expensive, and computer architects must devise new ways of hiding this cost.

The Memory Wall

• How do multicores attack the memory wall?
• Same issue as uniprocessor systems
  – By building on-chip cache hierarchies

Communication Models: Shared Memory

[Diagram: processes P1 and P2 communicate by reading and writing a shared memory]

• Coherence problem
• Memory consistency issue
• Synchronization problem (see the sketch below)
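A minimal C/pthreads sketch of the synchronization problem (counter and the loop bound are invented for illustration): two threads update the same shared location with plain loads and stores, and increments get lost.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared memory, written by both threads */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* load, add, store: not atomic */
    return NULL;
}

int main(void) {                       /* build with: cc race.c -lpthread */
    pthread_t p1, p2;
    pthread_create(&p1, NULL, increment, NULL);
    pthread_create(&p2, NULL, increment, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("counter = %ld, expected 2000000\n", counter);
    return 0;
}

On most machines this prints far less than 2,000,000; a lock or an atomic increment restores correctness.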

Communication Models: Shared Memory

• Shared address space
• Communication primitives:
  – load, store, atomic swap

Two varieties:
• Physically shared => Symmetric Multi-Processors (SMP)
  – usually combined with local caching
• Physically distributed => Distributed Shared Memory (DSM)

SMP: Symmetric Multi-Processor

• Memory: centralized, with uniform access time (UMA), shared interconnect and I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel

[Diagram: N processors, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, multiple busses, or any network]

The Memory Wall

• How do multicores attack the memory wall?
• Same issue as uniprocessor systems
  – By building on-chip cache hierarchies

• What novel issues arise?
  – Coherence
  – Consistency

COMPUTING SYSTEMS PERFORMANCE

Response Time and Throughput

• Response time
  – How long it takes to do a task
• Throughput
  – Total work done per unit time
  – e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
  – Replacing the processor with a faster version?
  – Adding more processors?
• We’ll focus on response time for now…

Relative Performance

• Define Performance = 1/Execution Time
• “X is n times faster than Y” means:

    Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

• Example: time taken to run a program
  – 10 s on A, 15 s on B
  – Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
  – So A is 1.5 times faster than B

Measuring Execution Time

• Elapsed time
  – Total response time, including all aspects: processing, I/O, OS overhead, idle time
  – Determines system performance
• CPU time
  – Time spent processing a given job
  – Discounts I/O time and other jobs’ shares
  – Comprises user CPU time and system CPU time
• Different programs are affected differently by CPU and system performance

CPU Clocking

• Operation of digital hardware is governed by a constant-rate clock

[Diagram: clock waveform; data transfer and computation happen within each clock period, state updates on the clock edge]

• Clock period: duration of a clock cycle
  – e.g., 250 ps = 0.25 ns = 250×10^-12 s
• Clock frequency (rate): cycles per second
  – e.g., 4.0 GHz = 4000 MHz = 4.0×10^9 Hz

CPU Time

CPU Time  CPU Clock CyclesClock Cycle Time CPU Clock Cycles  Clock Rate

 Performance improved by

 Reducing number of clock cycles

 Increasing clock rate

 Hardware designer must often trade off clock rate against cycle count

CPU Time Example

• Computer A: 2 GHz clock, 10 s CPU time
• Designing Computer B
  – Aim for 6 s CPU time
  – Can use a faster clock, but this causes 1.2× the clock cycles
• How fast must the Computer B clock be?

Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6 s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10^9

Clock Rate_B = (1.2 × 20×10^9) / 6 s = (24×10^9) / 6 s = 4 GHz
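The same calculation as a few lines of C, just to make the arithmetic concrete (all names are mine):

#include <stdio.h>

int main(void) {
    double rate_a   = 2e9;                /* Computer A: 2 GHz       */
    double time_a   = 10.0;               /* Computer A: 10 s        */
    double cycles_a = time_a * rate_a;    /* 20e9 cycles             */
    double cycles_b = 1.2 * cycles_a;     /* B needs 1.2x the cycles */
    double rate_b   = cycles_b / 6.0;     /* target CPU time: 6 s    */
    printf("Clock rate B = %.1f GHz\n", rate_b / 1e9);   /* 4.0 GHz */
    return 0;
}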

Instruction Count and CPI

Clock Cycles  Instruction Count  CPU Time  Instruction Count CPIClock Cycle Time Instruction Count CPI  Clock Rate

 Instruction Count for a program

 Determined by program, ISA and compiler

 Average cycles per instruction

 Determined by CPU hardware

 If different instructions have different CPI

 Average CPI affected by instruction mix

CPI Example

• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
           = I × 2.0 × 250 ps = I × 500 ps          ← A is faster…

CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
           = I × 1.2 × 500 ps = I × 600 ps

CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2   ← …by this much
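The comparison as executable arithmetic (the value of I is arbitrary, since it cancels in the ratio):

#include <stdio.h>

int main(void) {
    double I = 1e9;                          /* instruction count, cancels out */
    double time_a = I * 2.0 * 250e-12;       /* CPI_A x Cycle Time_A */
    double time_b = I * 1.2 * 500e-12;       /* CPI_B x Cycle Time_B */
    printf("CPU Time B / CPU Time A = %.1f\n", time_b / time_a);   /* 1.2 */
    return 0;
}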

CPI in More Detail

• If different instruction classes take different numbers of cycles:

Clock Cycles = Σ(i=1..n) CPI_i × Instruction Count_i

• Weighted average CPI:

CPI = Clock Cycles / Instruction Count
    = Σ(i=1..n) CPI_i × (Instruction Count_i / Instruction Count)

  (the fraction in parentheses is the relative frequency of class i)

CPI Example

• Alternative compiled code sequences using instructions in classes A, B, C:

  Class             A   B   C
  CPI for class     1   2   3
  IC in sequence 1  2   1   2
  IC in sequence 2  4   1   1

• Sequence 1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5
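A small C program reproducing these numbers (the array layout is mine):

#include <stdio.h>

int main(void) {
    int cpi[3]   = {1, 2, 3};             /* CPI for classes A, B, C   */
    int ic[2][3] = {{2, 1, 2},            /* instruction counts, seq 1 */
                    {4, 1, 1}};           /* instruction counts, seq 2 */
    for (int s = 0; s < 2; s++) {
        int cycles = 0, count = 0;
        for (int c = 0; c < 3; c++) {
            cycles += cpi[c] * ic[s][c];  /* sum of CPI_i x IC_i */
            count  += ic[s][c];
        }
        printf("Sequence %d: IC = %d, cycles = %d, avg CPI = %.1f\n",
               s + 1, count, cycles, (double)cycles / count);
    }
    return 0;
}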

Performance Summary

The BIG Picture

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

• Performance depends on
  – Algorithm: affects IC, possibly CPI
  – Programming language: affects IC, CPI
  – Compiler: affects IC, CPI
  – Instruction set architecture: affects IC, CPI, T_c

Power (§1.7: The Power Wall)

• In CMOS IC technology:

P_dyn = Capacitive load × Voltage^2 × Frequency

(Over the years, supply voltage fell from 5 V to 1 V while frequency grew ×1000; net dynamic power still grew roughly ×30.)

• Dynamic power is dominant when active
• Static power: P_stat = Voltage × I_static
• Total power: P = P_stat + P_dyn

Reducing Power

• Suppose a new CPU has
  – 85% of the capacitive load of the old CPU
  – 15% voltage reduction and 15% frequency reduction

P_new / P_old = (C_old × 0.85) × (V_old × 0.85)^2 × (F_old × 0.85) / (C_old × V_old^2 × F_old)
             = 0.85^4 ≈ 0.52

• The power wall
  – We can’t reduce voltage further
  – We can’t remove more heat
• How else can we improve performance?

Multiprocessors

• Multicore
  – More than one processor per chip
• Requires explicitly parallel programming
  – Compare with instruction-level parallelism: hardware executes multiple instructions at once, hidden from the programmer
  – Hard to do:
    – Programming for performance
    – Load balancing
    – Optimizing communication and synchronization

Pitfall: Amdahl’s Law (§1.10: Fallacies and Pitfalls)

• Improving an aspect of a computer and expecting a proportional improvement in overall performance:

T_improved = T_affected / improvement factor + T_unaffected

• Example: multiply accounts for 80 s out of 100 s
  – How much improvement in multiply performance to get 5× overall?

      20 = 80/n + 20  ⇒  80/n would have to be 0: can’t be done!

• Corollary: make the common case fast
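A short C sketch of the formula above (the helper name speedup is mine); for the 80 s / 20 s example it shows the overall speedup saturating below 5× no matter how large n grows:

#include <stdio.h>

/* T_improved = T_affected / factor + T_unaffected */
static double speedup(double t_affected, double t_unaffected, double factor) {
    double t_old = t_affected + t_unaffected;
    double t_new = t_affected / factor + t_unaffected;
    return t_old / t_new;
}

int main(void) {
    /* Multiply is 80 s of 100 s; the 20 s unaffected part bounds the
       overall speedup below 5x for any finite improvement factor n. */
    for (double n = 2; n <= 2048; n *= 4)
        printf("n = %5.0f: overall speedup = %.3f\n",
               n, speedup(80.0, 20.0, n));
    return 0;
}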

Fallacy: Low Power at Idle

• i7 power benchmark
  – At 100% load: 258 W
  – At 50% load: 170 W (66%)
  – At 10% load: 121 W (47%)
• Consider designing processors to make power proportional to load
• However, today P = P_dyn + P_stat
  – P_stat is due to leakage power and residual activity
  – If P_stat > 0, be careful with extreme parallelism!

Exercise

• Processor core (IPC = 1):
  – @1.5 V: P_dyn = 0.1 W, P_stat = 0.01 W, f = 0.5 GHz
  – @1.0 V: f = 0.4 GHz
• Uncore:
  – @1.5 V: P_dyn + P_stat = 0.1 W
• Compare 2-, 4-, 8-, and 16-core architectures against a 1-core architecture running an application of 100 GInstr with
  – 50% sequential part
  – 10% sequential part
  – 1% sequential part
• Compute power and energy (a setup sketch follows below)
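One possible way to set the computation up, as a C sketch rather than a worked solution. Every modeling choice in it is an assumption made explicit: IPC = 1, so time = instructions / f; all cores run at the 1.5 V / 0.5 GHz operating point; the sequential part runs on one core while the rest divides evenly over n cores (Amdahl); and all cores plus the uncore stay powered for the whole run.

#include <stdio.h>

int main(void) {
    const double instr  = 100e9;          /* 100 GInstr               */
    const double f      = 0.5e9;          /* core clock at 1.5 V      */
    const double p_core = 0.1 + 0.01;     /* Pdyn + Pstat per core, W */
    const double p_unc  = 0.1;            /* uncore power, W          */
    double seq[]   = {0.50, 0.10, 0.01};  /* sequential fractions     */
    int    cores[] = {1, 2, 4, 8, 16};
    for (int s = 0; s < 3; s++)
        for (int c = 0; c < 5; c++) {
            int n = cores[c];
            double t = instr * seq[s] / f              /* sequential part */
                     + instr * (1 - seq[s]) / (n * f); /* parallel part   */
            double p = n * p_core + p_unc;             /* everything on   */
            printf("seq %4.0f%%, %2d cores: t = %8.1f s, P = %.2f W, E = %9.1f J\n",
                   seq[s] * 100, n, t, p, p * t);
        }
    return 0;
}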
