An Introduction to Parallel Architectures

Impact of Parallel Architectures

• From cell phones to supercomputers
• In regular CPUs as well as GPUs

Parallel HW Processing

• Why?
• Which types?
• Which novel issues?

Why Multicores?

The SPECint performance of the hottest chip grew by 52% per year from 1986 to 2002, and then grew by only 20% in total over the next three years (about 6% per year).

[from Patterson & Hennessy]

Diminishing returns from uniprocessor designs

Power Wall [from Patterson & Hennessy]

• The design goal for the late 1990s and early 2000s was to drive the clock rate up, by adding more transistors to a smaller chip.

• Unfortunately, this increased the power dissipation of the CPU chip beyond the capacity of inexpensive cooling techniques.

The Multicore Approach

Multiple cores on the same chip:
– Simpler
– Slower
– Less power demanding

Transition to Multicore: Intel Xeon (18 cores)

[Figure: sequential application performance over time]

Flynn Taxonomy of Parallel Computers

                              Data streams
                              Single      Parallel
  Instruction     Single      SISD        SIMD
  streams         Multiple    MISD        MIMD

Alternative Kinds of Parallelism: Single Instruction/Single Data Stream

• Single Instruction, Single Data stream (SISD)
  – Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines.

Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream

• Multiple Instruction, Single Data streams (MISD)
  – Computer that exploits multiple instruction streams against a single data stream, for data operations that can be naturally parallelized (for example, certain kinds of array processors).
  – No longer commonly encountered; mainly of historical interest only.

Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream

• Single Instruction, Multiple Data streams (SIMD)
  – Computer that exploits multiple data streams against a single instruction stream, for operations that may be naturally parallelized, e.g., SIMD instruction extensions or Graphics Processing Units (GPUs).

Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams

• Multiple Instruction, Multiple Data streams (MIMD)
  – Multiple autonomous processors simultaneously executing different instructions on different data.
  – MIMD architectures include multicore and Warehouse-Scale Computers (datacenters).

Flynn Taxonomy of Parallel Computers

                              Data streams
                              Single                    Parallel
  Instruction     Single      SISD: Intel Pentium 4     SIMD: SSE, ARM NEON, GPGPUs, …
  streams         Multiple    MISD: no examples today   MIMD: SMP (Intel, ARM, …)

• As of 2011, SIMD and MIMD are the most common parallel computers
• Most common parallel processing programming style: Single Program Multiple Data (“SPMD”)
  – Single program that runs on all processors of an MIMD
  – Cross-processor execution coordination through conditional expressions (thread parallelism)
• SIMD (aka hw-level data parallelism): specialized function units for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)

SIMD Architectures

• Data parallelism: executing one operation on multiple data streams
  – Single control unit
  – Multiple datapaths (processing elements – PEs) running in parallel
• PEs are interconnected and exchange/share data as directed by the control unit
• Each PE performs the same operation on its own local data

• Example to provide context:
  – Multiplying a coefficient vector by a data vector (e.g., in filtering):
      y[i] := c[i] × x[i],  0 ≤ i < n
    (a C sketch of this loop follows below)
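A minimal C sketch of this loop (the function name multiply_vectors is illustrative, not from the slides): a scalar core issues one multiply per element; a SIMD unit, or an auto-vectorizing compiler, can apply the same multiply to several elements per instruction.

/* Scalar y[i] = c[i] * x[i]: one multiply per loop iteration. */
void multiply_vectors(const float *c, const float *x, float *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = c[i] * x[i];   /* a SIMD machine handles several i's at once */
}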

“Advanced Digital Media Boost”

• To improve performance, SIMD instructions:
  – Fetch one instruction, do the work of multiple instructions

Example: SIMD Array Processing

for each f in array
    f = sqrt(f)

SISD:
for each f in array {
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}

SIMD:
for each 4 members in array {
    load 4 members to the SIMD register
    calculate 4 square roots in one operation
    write the result from the register to memory
}
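The SIMD pseudocode above maps directly onto x86 SSE intrinsics; here is a minimal sketch (it assumes the array is 16-byte aligned and its length is a multiple of 4; the function name sqrt_simd is made up for the example):

#include <immintrin.h>   /* x86 SSE intrinsics */

void sqrt_simd(float *array, int n)   /* n multiple of 4, array 16-byte aligned */
{
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&array[i]);   /* load 4 members into a SIMD register */
        v = _mm_sqrt_ps(v);                  /* calculate 4 square roots in one operation */
        _mm_store_ps(&array[i], v);          /* write the 4 results back to memory */
    }
}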

MIMD Architectures

• Multicore architectures
• At least 2 processors interconnected via a communication network
  – abstractions (HW/SW interface)
  – organizational structure to realize the abstraction efficiently

MIMD Architectures

• Thread-level parallelism
  – Have multiple program counters
  – Targeted for tightly-coupled shared-memory multiprocessors

• For n processors, need n threads

• Amount of computation assigned to each thread = grain size (see the sketch after this list)
  – Threads can be used for data-level parallelism, but the overheads may outweigh the benefit
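As a hedged illustration of thread-level parallelism with POSIX threads (the thread count, array size, and the names worker and NTHREADS are invented for the sketch): each thread gets one contiguous chunk of the data, and that chunk is its grain size.

#include <pthread.h>

#define NTHREADS 4                     /* one thread per processor */
#define N 1000000

static float data[N];

/* Each thread processes a chunk of N/NTHREADS elements: the grain size.
   Too fine a grain and thread-management overhead dominates. */
static void *worker(void *arg) {
    long id = (long)arg;
    long chunk = N / NTHREADS;
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        data[i] *= 2.0f;
    return NULL;
}

int main(void) {                       /* build with: cc tlp.c -lpthread */
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}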

MIMD Architectures

• And what about memory?
• How is data accessed by multiple cores?
• How should the memory system be designed accordingly?
• How do traditional solutions from uniprocessor systems affect multicores?

Recall: The Memory Gap

[Figure: processor vs. memory performance, 1980–2005, plotted on a log scale from 1 to 100,000; the gap widens steadily. From H&P, Fig. 5.2]

• Bottom line: memory access is increasingly expensive, and computer architects must devise new ways of hiding this cost.

The Memory Wall

• How do multicores attack the memory wall?
• Same issue as uniprocessor systems
  – By building on-chip cache hierarchies

Communication Models: Shared Memory

[Diagram: processes P1 and P2 communicate by reading and writing a shared memory]

• Coherence problem
• Memory consistency issue
• Synchronization problem (see the sketch below)
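A minimal C/pthreads sketch of the synchronization problem (counter and the loop bound are invented for illustration): two threads update the same shared location with plain loads and stores, and increments get lost.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared memory, written by both threads */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* load, add, store: not atomic */
    return NULL;
}

int main(void) {                       /* build with: cc race.c -lpthread */
    pthread_t p1, p2;
    pthread_create(&p1, NULL, increment, NULL);
    pthread_create(&p2, NULL, increment, NULL);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    printf("counter = %ld, expected 2000000\n", counter);
    return 0;
}

On most machines this prints far less than 2,000,000; a lock or an atomic increment restores correctness.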

Communication Models: Shared Memory

• Shared address space
• Communication primitives:
  – load, store, atomic swap

Two varieties:
• Physically shared => Symmetric Multi-Processors (SMP)
  – usually combined with local caching
• Physically distributed => Distributed Shared Memory (DSM)

SMP: Symmetric Multi-Processor

• Memory: centralized, with uniform access time (UMA), shared interconnect and I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel

[Diagram: N processors, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, multiple busses, or any network]

The Memory Wall

• How do multicores attack the memory wall?
• Same issue as uniprocessor systems
  – By building on-chip cache hierarchies

• What novel issues arise?
  – Coherence
  – Consistency

COMPUTING SYSTEMS PERFORMANCE

Response Time and Throughput

• Response time
  – How long it takes to do a task
• Throughput
  – Total work done per unit time
  – e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
  – Replacing the processor with a faster version?
  – Adding more processors?
• We’ll focus on response time for now…

Relative Performance

• Define Performance = 1/Execution Time
• “X is n times faster than Y” means:

    Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

• Example: time taken to run a program
  – 10 s on A, 15 s on B
  – Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
  – So A is 1.5 times faster than B

Measuring Execution Time

• Elapsed time
  – Total response time, including all aspects: processing, I/O, OS overhead, idle time
  – Determines system performance
• CPU time
  – Time spent processing a given job
  – Discounts I/O time and other jobs’ shares
  – Comprises user CPU time and system CPU time
• Different programs are affected differently by CPU and system performance

CPU Clocking

• Operation of digital hardware is governed by a constant-rate clock

[Diagram: clock waveform; data transfer and computation happen within each clock period, state updates on the clock edge]

• Clock period: duration of a clock cycle
  – e.g., 250 ps = 0.25 ns = 250×10^-12 s
• Clock frequency (rate): cycles per second
  – e.g., 4.0 GHz = 4000 MHz = 4.0×10^9 Hz

CPU Time

CPU Time  CPU Clock CyclesClock Cycle Time CPU Clock Cycles  Clock Rate

 Performance improved by

 Reducing number of clock cycles

 Increasing clock rate

 Hardware designer must often trade off clock rate against cycle count

CPU Time Example

• Computer A: 2 GHz clock, 10 s CPU time
• Designing Computer B
  – Aim for 6 s CPU time
  – Can use a faster clock, but this causes 1.2× the clock cycles
• How fast must the Computer B clock be?

Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6 s

Clock Cycles_A = CPU Time_A × Clock Rate_A = 10 s × 2 GHz = 20×10^9

Clock Rate_B = (1.2 × 20×10^9) / 6 s = (24×10^9) / 6 s = 4 GHz
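The same calculation as a few lines of C, just to make the arithmetic concrete (all names are mine):

#include <stdio.h>

int main(void) {
    double rate_a   = 2e9;                /* Computer A: 2 GHz       */
    double time_a   = 10.0;               /* Computer A: 10 s        */
    double cycles_a = time_a * rate_a;    /* 20e9 cycles             */
    double cycles_b = 1.2 * cycles_a;     /* B needs 1.2x the cycles */
    double rate_b   = cycles_b / 6.0;     /* target CPU time: 6 s    */
    printf("Clock rate B = %.1f GHz\n", rate_b / 1e9);   /* 4.0 GHz */
    return 0;
}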

Instruction Count and CPI

Clock Cycles  Instruction Count  CPU Time  Instruction Count CPIClock Cycle Time Instruction Count CPI  Clock Rate

 Instruction Count for a program

 Determined by program, ISA and compiler

 Average cycles per instruction

 Determined by CPU hardware

 If different instructions have different CPI

 Average CPI affected by instruction mix

CPI Example

• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
           = I × 2.0 × 250 ps = I × 500 ps          ← A is faster…

CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
           = I × 1.2 × 500 ps = I × 600 ps

CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2   ← …by this much
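The comparison as executable arithmetic (the value of I is arbitrary, since it cancels in the ratio):

#include <stdio.h>

int main(void) {
    double I = 1e9;                          /* instruction count, cancels out */
    double time_a = I * 2.0 * 250e-12;       /* CPI_A x Cycle Time_A */
    double time_b = I * 1.2 * 500e-12;       /* CPI_B x Cycle Time_B */
    printf("CPU Time B / CPU Time A = %.1f\n", time_b / time_a);   /* 1.2 */
    return 0;
}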

CPI in More Detail

• If different instruction classes take different numbers of cycles:

Clock Cycles = Σ(i=1..n) CPI_i × Instruction Count_i

• Weighted average CPI:

CPI = Clock Cycles / Instruction Count
    = Σ(i=1..n) CPI_i × (Instruction Count_i / Instruction Count)

  (the fraction in parentheses is the relative frequency of class i)

CPI Example

• Alternative compiled code sequences using instructions in classes A, B, C:

  Class             A   B   C
  CPI for class     1   2   3
  IC in sequence 1  2   1   2
  IC in sequence 2  4   1   1

• Sequence 1: IC = 5
  – Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  – Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
  – Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  – Avg. CPI = 9/6 = 1.5
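A small C program reproducing these numbers (the array layout is mine):

#include <stdio.h>

int main(void) {
    int cpi[3]   = {1, 2, 3};             /* CPI for classes A, B, C   */
    int ic[2][3] = {{2, 1, 2},            /* instruction counts, seq 1 */
                    {4, 1, 1}};           /* instruction counts, seq 2 */
    for (int s = 0; s < 2; s++) {
        int cycles = 0, count = 0;
        for (int c = 0; c < 3; c++) {
            cycles += cpi[c] * ic[s][c];  /* sum of CPI_i x IC_i */
            count  += ic[s][c];
        }
        printf("Sequence %d: IC = %d, cycles = %d, avg CPI = %.1f\n",
               s + 1, count, cycles, (double)cycles / count);
    }
    return 0;
}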

Performance Summary

The BIG Picture

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

• Performance depends on
  – Algorithm: affects IC, possibly CPI
  – Programming language: affects IC, CPI
  – Compiler: affects IC, CPI
  – Instruction set architecture: affects IC, CPI, T_c

Power (§1.7: The Power Wall)

• In CMOS IC technology:

P_dyn = Capacitive load × Voltage^2 × Frequency

(Over the years, supply voltage fell from 5 V to 1 V while frequency grew ×1000; net dynamic power still grew roughly ×30.)

• Dynamic power is dominant when active
• Static power: P_stat = Voltage × I_static
• Total power: P = P_stat + P_dyn

Reducing Power

• Suppose a new CPU has
  – 85% of the capacitive load of the old CPU
  – 15% voltage reduction and 15% frequency reduction

P_new / P_old = (C_old × 0.85) × (V_old × 0.85)^2 × (F_old × 0.85) / (C_old × V_old^2 × F_old)
             = 0.85^4 ≈ 0.52

• The power wall
  – We can’t reduce voltage further
  – We can’t remove more heat
• How else can we improve performance?

Multiprocessors

• Multicore
  – More than one processor per chip
• Requires explicitly parallel programming
  – Compare with instruction-level parallelism: hardware executes multiple instructions at once, hidden from the programmer
  – Hard to do:
    – Programming for performance
    – Load balancing
    – Optimizing communication and synchronization

Pitfall: Amdahl’s Law (§1.10: Fallacies and Pitfalls)

• Improving an aspect of a computer and expecting a proportional improvement in overall performance:

T_improved = T_affected / improvement factor + T_unaffected

• Example: multiply accounts for 80 s out of 100 s
  – How much improvement in multiply performance to get 5× overall?

      20 = 80/n + 20  ⇒  80/n would have to be 0: can’t be done!

• Corollary: make the common case fast
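A short C sketch of the formula above (the helper name speedup is mine); for the 80 s / 20 s example it shows the overall speedup saturating below 5× no matter how large n grows:

#include <stdio.h>

/* T_improved = T_affected / factor + T_unaffected */
static double speedup(double t_affected, double t_unaffected, double factor) {
    double t_old = t_affected + t_unaffected;
    double t_new = t_affected / factor + t_unaffected;
    return t_old / t_new;
}

int main(void) {
    /* Multiply is 80 s of 100 s; the 20 s unaffected part bounds the
       overall speedup below 5x for any finite improvement factor n. */
    for (double n = 2; n <= 2048; n *= 4)
        printf("n = %5.0f: overall speedup = %.3f\n",
               n, speedup(80.0, 20.0, n));
    return 0;
}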

Fallacy: Low Power at Idle

• i7 power benchmark
  – At 100% load: 258 W
  – At 50% load: 170 W (66%)
  – At 10% load: 121 W (47%)
• Consider designing processors to make power proportional to load
• However, today P = P_dyn + P_stat
  – P_stat is due to leakage power and residual activity
  – If P_stat > 0, be careful with extreme parallelism!

Exercise

• Processor core (IPC = 1):
  – @1.5 V: P_dyn = 0.1 W, P_stat = 0.01 W, f = 0.5 GHz
  – @1.0 V: f = 0.4 GHz
• Uncore:
  – @1.5 V: P_dyn + P_stat = 0.1 W
• Compare 2-, 4-, 8-, and 16-core architectures against a 1-core architecture running an application of 100 GInstr with
  – 50% sequential part
  – 10% sequential part
  – 1% sequential part
• Compute power and energy (a setup sketch follows below)
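One possible way to set the computation up, as a C sketch rather than a worked solution. Every modeling choice in it is an assumption made explicit: IPC = 1, so time = instructions / f; all cores run at the 1.5 V / 0.5 GHz operating point; the sequential part runs on one core while the rest divides evenly over n cores (Amdahl); and all cores plus the uncore stay powered for the whole run.

#include <stdio.h>

int main(void) {
    const double instr  = 100e9;          /* 100 GInstr               */
    const double f      = 0.5e9;          /* core clock at 1.5 V      */
    const double p_core = 0.1 + 0.01;     /* Pdyn + Pstat per core, W */
    const double p_unc  = 0.1;            /* uncore power, W          */
    double seq[]   = {0.50, 0.10, 0.01};  /* sequential fractions     */
    int    cores[] = {1, 2, 4, 8, 16};
    for (int s = 0; s < 3; s++)
        for (int c = 0; c < 5; c++) {
            int n = cores[c];
            double t = instr * seq[s] / f              /* sequential part */
                     + instr * (1 - seq[s]) / (n * f); /* parallel part   */
            double p = n * p_core + p_unc;             /* everything on   */
            printf("seq %4.0f%%, %2d cores: t = %8.1f s, P = %.2f W, E = %9.1f J\n",
                   seq[s] * 100, n, t, p, p * t);
        }
    return 0;
}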
