An Introduction to Parallel Architectures

Impact of Parallel Architectures
• From cell phones to supercomputers
• In regular CPUs as well as GPUs

Parallel HW Processing
• Why?
• Which types?
• Which novel issues?

Why Multicores?
The SPECint performance of the fastest chips grew by 52% per year from 1986 to 2002, and then grew only 20% in the next three years (about 6% per year).
[from Patterson & Hennessy]
Diminishing returns from uniprocessor designs

Power Wall [from Patterson & Hennessy]
• The design goal of the late 1990s and early 2000s was to drive the clock rate up, by adding more transistors to a smaller chip.
• Unfortunately, this increased the power dissipation of the CPU chip beyond the capacity of inexpensive cooling techniques.

The Multicore Approach
Multiple cores on the same chip, each core being:
– Simpler
– Slower
– Less power demanding

Transition to Multicore: Intel Xeon (18 cores)
Sequential App Performance
5/4/2018 Spring 2011 ‐‐ Lecture

Flynn Taxonomy of parallel computers

                          Data streams
                          Single    Parallel
Instruction    Single     SISD      SIMD
streams        Multiple   MISD      MIMD
Alternative Kinds of Parallelism: Single Instruction/Single Data Stream
• Single Instruction, Single Data stream (SISD)
– A sequential computer that exploits no parallelism in either the instruction or data streams. Traditional uniprocessor machines are examples of SISD architecture.
Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream
• Multiple Instruction, Single Data streams (MISD)
– A computer that exploits multiple instruction streams against a single data stream, for data operations that can be naturally parallelized; for example, certain kinds of array processors.
– No longer commonly encountered; mainly of historical interest.

Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream
• Single Instruction, Multiple Data streams (SIMD)
– A computer that applies a single instruction stream to multiple data streams, for operations that may be naturally parallelized, e.g., SIMD instruction extensions or Graphics Processing Units (GPUs).
Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams
• Multiple Instruction, Multiple Data streams (MIMD)
– Multiple autonomous processors simultaneously executing different instructions on different data.
– MIMD architectures include multicore processors and Warehouse-Scale Computers (datacenters).
Flynn Taxonomy of parallel computers: examples

                          Data streams
                          Single                   Parallel
Instruction    Single     SISD: Intel Pentium 4    SIMD: SSE x86, ARM NEON, GPGPUs, …
streams        Multiple   MISD: no examples today  MIMD: SMP (Intel, ARM, …)
• As of 2011, SIMD and MIMD are the most common parallel computers
• Most common parallel processing programming style: Single Program Multiple Data (“SPMD”)
– A single program runs on all processors of an MIMD machine
– Cross‐processor execution coordination through conditional expressions (thread parallelism)
• SIMD (a.k.a. hw‐level data parallelism): specialized function units for handling lock‐step calculations involving arrays
– Scientific computing, signal processing, multimedia (audio/video processing)

SIMD Architectures
• Data parallelism: executing one operation on multiple data streams
– A single control unit
– Multiple datapaths (processing elements – PEs) running in parallel
• PEs are interconnected and exchange/share data as directed by the control unit
• Each PE performs the same operation on its own local data
• Example to provide context:
– Multiplying a coefficient vector by a data vector (e.g., in filtering):
y[i] := c[i] × x[i], 0 ≤ i < n
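As a minimal sketch in plain Python (scalar loop; the function name is illustrative), the filtering step above is just an element-wise product — a SIMD machine would apply the same multiply to several elements per instruction rather than one at a time:

```python
# y[i] = c[i] * x[i], 0 <= i < n: one multiply per element.
# SISD executes these multiplies one at a time; SIMD executes
# the same multiply on a group of elements in lock step.
def filter_multiply(c, x):
    assert len(c) == len(x)
    return [ci * xi for ci, xi in zip(c, x)]

y = filter_multiply([2.0, 0.5, 1.0], [3.0, 4.0, 5.0])
print(y)  # [6.0, 2.0, 5.0]
```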
“Advanced Digital Media Boost”
• To improve performance, SIMD instructions fetch one instruction and do the work of multiple instructions
Example: SIMD Array Processing
for each f in array
    f = sqrt(f)

SISD:
for each f in array {
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}

SIMD:
for each 4 members in array {
    load 4 members to the SIMD register
    calculate 4 square roots in one operation
    write the result from the register to memory
}
MIMD Architectures
• Multicore architectures
• At least 2 processors interconnected via a communication network
– abstractions (HW/SW interface)
– organizational structure to realize the abstraction efficiently
5/4/2018 ACA H.Corporaal

MIMD Architectures
• Thread‐level parallelism
– Multiple program counters
– Targeted at tightly‐coupled shared‐memory multiprocessors
• For n processors, need n threads
• Amount of computation assigned to each thread = grain size
– Threads can be used for data‐level parallelism, but the overheads may outweigh the benefit
Copyright © 2012, Elsevier Inc. All rights reserved.

MIMD Architectures
• And what about memory?
• How is data accessed by multiple cores?
• How should the memory system be designed accordingly?
• How do traditional solutions from uniprocessor systems carry over to multicores?
Recall: The Memory Gap
[Figure: CPU vs. memory performance, 1985–2005, on a log scale from 1 to 100,000; the CPU curve grows far faster than the memory curve. H&P Fig. 5.2]
• Bottom line: memory access is increasingly expensive, and computer architects must devise new ways of hiding this cost

The memory wall
• How do multicores attack the memory wall?
• Same approach as in uniprocessor systems: by building on‐chip cache hierarchies

Memory Hierarchy
Communication models: Shared Memory
[Diagram: processes P1 and P2 both read and write a shared memory]

• Coherence problem
• Memory consistency issue
• Synchronization problem
Communication models: Shared memory
• Shared address space
• Communication primitives:
– load, store, atomic swap
Two varieties:
• Physically shared => Symmetric Multi-Processors (SMP)
– usually combined with local caching
• Physically distributed => Distributed Shared Memory (DSM)

SMP: Symmetric Multi-Processor
• Memory: centralized, with uniform memory access time (UMA), bus interconnect, and I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Diagram: Processor 1 … Processor N, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be one bus, multiple busses, or any network]
The memory wall
• How do multicores attack the memory wall?
• Same approach as in uniprocessor systems: by building on‐chip cache hierarchies
• What novel issues arise?
– Coherence
– Consistency
– Scalability

COMPUTING SYSTEMS PERFORMANCE
Chapter 1 — Computer Abstractions and Technology

Response Time and Throughput
• Response time: how long it takes to do a task
• Throughput: total work done per unit time
– e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
– replacing the processor with a faster version?
– adding more processors?
We’ll focus on response time for now…
Relative Performance
• Define Performance = 1 / Execution Time
• “X is n times faster than Y” means:

Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n
• Example: time taken to run a program
– 10 s on A, 15 s on B
– Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
– So A is 1.5 times faster than B
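The relative-performance arithmetic above can be checked in a few lines of Python (a sketch; the variable names are illustrative):

```python
# Performance = 1 / Execution Time, so the ratio of performances
# equals the inverse ratio of execution times.
time_a = 10.0  # seconds on A
time_b = 15.0  # seconds on B
n = time_b / time_a  # Performance_A / Performance_B
print(n)  # 1.5 -> A is 1.5 times faster than B
```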
Measuring Execution Time
• Elapsed time
– Total response time, including all aspects: processing, I/O, OS overhead, idle time
– Determines system performance
• CPU time
– Time spent processing a given job; discounts I/O time and other jobs’ shares
– Comprises user CPU time and system CPU time
• Different programs are affected differently by CPU and system performance
CPU Clocking
• Operation of digital hardware is governed by a constant-rate clock

[Diagram: clock waveform; data transfer and computation occur during the cycle, and state is updated at the clock edge; one clock period spans one cycle]

• Clock period: duration of a clock cycle
– e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
• Clock frequency (rate): cycles per second
– e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz
CPU Time
CPU Time = CPU Clock Cycles × Clock Cycle Time
         = CPU Clock Cycles / Clock Rate
• Performance is improved by
– reducing the number of clock cycles
– increasing the clock rate
• Hardware designers must often trade off clock rate against cycle count
CPU Time Example
• Computer A: 2 GHz clock, 10 s CPU time
• Designing Computer B
– Aim for 6 s CPU time
– Can use a faster clock, but this causes 1.2× the clock cycles
• How fast must Computer B’s clock be?
Clock Rate_B = Clock Cycles_B / CPU Time_B = 1.2 × Clock Cycles_A / 6 s
Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10⁹
Clock Rate_B = 1.2 × 20×10⁹ / 6 s = 24×10⁹ / 6 s = 4 GHz
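The same calculation as a short Python sketch (variable names are illustrative):

```python
# CPU Time = Clock Cycles / Clock Rate, so Clock Rate_B = Clock Cycles_B / CPU Time_B.
cycles_a = 10 * 2e9        # 10 s at 2 GHz -> 20e9 cycles
cycles_b = 1.2 * cycles_a  # the faster clock costs 1.2x the cycles
rate_b = cycles_b / 6      # target CPU time: 6 s
print(rate_b / 1e9)        # about 4 (GHz)
```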
Instruction Count and CPI
Clock Cycles = Instruction Count × Cycles per Instruction (CPI)
CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate
• Instruction count for a program
– Determined by the program, the ISA, and the compiler
• Average cycles per instruction (CPI)
– Determined by CPU hardware
– If different instructions have different CPIs, the average CPI is affected by the instruction mix
CPI Example
• Computer A: cycle time = 250 ps, CPI = 2.0
• Computer B: cycle time = 500 ps, CPI = 1.2
• Same ISA. Which is faster, and by how much?

CPU Time_A = Instruction Count × CPI_A × Cycle Time_A = I × 2.0 × 250 ps = I × 500 ps   (A is faster…)
CPU Time_B = Instruction Count × CPI_B × Cycle Time_B = I × 1.2 × 500 ps = I × 600 ps
CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2   (…by this much)
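A quick Python check of this comparison (the instruction count I cancels, so we compare time per instruction; names are illustrative):

```python
# CPU Time = I * CPI * Cycle Time; I is the same for both machines.
cpi_a, cycle_a = 2.0, 250e-12  # Computer A
cpi_b, cycle_b = 1.2, 500e-12  # Computer B
t_a = cpi_a * cycle_a  # 500 ps per instruction
t_b = cpi_b * cycle_b  # 600 ps per instruction
ratio = t_b / t_a
print(round(ratio, 3))  # 1.2 -> A is 1.2 times faster
```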
CPI in More Detail
• If different instruction classes take different numbers of cycles:

Clock Cycles = Σ(i=1..n) CPIᵢ × Instruction Countᵢ

• Weighted average CPI:

CPI = Clock Cycles / Instruction Count = Σ(i=1..n) CPIᵢ × (Instruction Countᵢ / Instruction Count)

where Instruction Countᵢ / Instruction Count is the relative frequency of class i.
CPI Example
• Alternative compiled code sequences using instructions in classes A, B, and C:

Class               A   B   C
CPI for class       1   2   3
IC in sequence 1    2   1   2
IC in sequence 2    4   1   1

Sequence 1: IC = 5
Clock Cycles = 2×1 + 1×2 + 2×3 = 10
Avg. CPI = 10/5 = 2.0

Sequence 2: IC = 6
Clock Cycles = 4×1 + 1×2 + 1×3 = 9
Avg. CPI = 9/6 = 1.5
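The weighted-average CPI computation above, sketched in Python (the dictionary layout is illustrative):

```python
# Clock Cycles = sum over classes of CPI_i * IC_i; avg CPI = cycles / total IC.
cpi_per_class = {"A": 1, "B": 2, "C": 3}

def cycles_and_avg_cpi(ic_per_class):
    cycles = sum(cpi_per_class[c] * n for c, n in ic_per_class.items())
    return cycles, cycles / sum(ic_per_class.values())

print(cycles_and_avg_cpi({"A": 2, "B": 1, "C": 2}))  # sequence 1: (10, 2.0)
print(cycles_and_avg_cpi({"A": 4, "B": 1, "C": 1}))  # sequence 2: (9, 1.5)
```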
Performance Summary
The BIG Picture
CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)
• Performance depends on
– Algorithm: affects IC, possibly CPI
– Programming language: affects IC, CPI
– Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, Tc
§1.7 The Power Wall

Power
• In CMOS IC technology:

P_dyn = Capacitive load × Voltage² × Frequency

– Over the years, frequency grew ×1000 while supply voltage dropped from 5 V to 1 V, so dynamic power grew only ×30
• Dynamic power is dominant when the chip is active
• Static power: P_stat = Voltage × I_static
• Total power: P = P_stat + P_dyn
Reducing Power
• Suppose a new CPU has
– 85% of the capacitive load of the old CPU
– 15% voltage reduction and 15% frequency reduction

P_new / P_old = (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85) / (C_old × V_old² × F_old) = 0.85⁴ ≈ 0.52
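Checked in Python: each of the three factors shrinks to 0.85 of its old value, and voltage enters squared, so the old terms cancel and the ratio is 0.85 to the fourth power:

```python
# P = C * V^2 * F; scale C, V, and F by 0.85 each.
p_ratio = 0.85 * 0.85**2 * 0.85  # = 0.85**4
print(round(p_ratio, 2))  # 0.52
```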
• The power wall:
– We can’t reduce voltage further
– We can’t remove more heat
• How else can we improve performance?
Multiprocessors
• Multicore microprocessors
– More than one processor per chip
– Require explicitly parallel programming
• Compare with instruction-level parallelism
– Hardware executes multiple instructions at once
– Hidden from the programmer
• Parallel programming is hard to do; programming for performance requires
– Load balancing
– Optimizing communication and synchronization
§1.10 Fallacies and Pitfalls

Pitfall: Amdahl’s Law
• Improving an aspect of a computer and expecting a proportional improvement in overall performance:

T_improved = T_affected / (improvement factor) + T_unaffected
• Example: multiply accounts for 80 s out of a 100 s total
• How much improvement in multiply performance is needed to get 5× overall?

20 = 80/n + 20  ⇒  80/n = 0  ⇒  Can’t be done!

• Corollary: make the common case fast
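Amdahl’s Law made concrete in Python (a sketch; the helper name is illustrative):

```python
# T_improved = T_affected / improvement_factor + T_unaffected
def improved_time(t_affected, t_unaffected, factor):
    return t_affected / factor + t_unaffected

# Multiply takes 80 s of a 100 s run; 5x overall would need 20 s total.
# Even an infinite multiply speedup leaves the 20 s unaffected part:
print(improved_time(80.0, 20.0, float("inf")))  # 20.0 -> 5x is unreachable
```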
Fallacy: Low Power at Idle
• i7 power benchmark
– At 100% load: 258 W
– At 50% load: 170 W (66%)
– At 10% load: 121 W (47%)
• Consider designing processors to make power proportional to load
• However, today P = P_dyn + P_stat
– P_stat is due to leakage power and residual activity
– If P_stat > 0, be careful with extreme parallelism!
Exercise
• Processor core (IPC = 1):
– @1.5 V: P_dyn = 0.1 W, P_stat = 0.01 W, f = 0.5 GHz
– @1.0 V: f = 0.4 GHz
• Uncore:
– @1.5 V: P_dyn + P_stat = 0.1 W
• Compare 2-, 4-, 8-, and 16-core architectures against a 1-core architecture for running an application of 100 Ginstructions with a
– 50% sequential part
– 10% sequential part
– 1% sequential part
• Compute power and energy.