
Lecture 19/20: Shared Memory SMP and Coherence

Parallel Computers

Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” — Almasi and Gottlieb, Highly Parallel Computing, 1989

Questions about parallel computers:
- How large a collection?
- How powerful are the processing elements?
- How do they cooperate and communicate?
- How are data transmitted?
- What type of interconnection?
- What are the HW and SW primitives for the programmer?
- Does it translate into performance?

Adapted from UCB CS252 S01

Parallel Processors “Religion”

The dream of computer architects since the 1950s: replicate processors to add performance vs. design a faster processor
- Led to innovative organizations tied to particular programming models, since “uniprocessors can’t keep going”
  - e.g., uniprocessors must stop getting faster due to the limit of the speed of light: 1972, ..., 1989
  - Borders on religious fervor: you must believe!
  - Fervor damped some when 1990s companies went out of business: Thinking Machines, Kendall Square, ...
- Today the argument is the “pull” of the opportunity of scalable performance, not the “push” of a uniprocessor performance plateau

What Level of Parallelism?

Bit-level parallelism: 1970 to ~1985
- 4-bit, 8-bit, 16-bit, 32-bit microprocessors
Instruction-level parallelism (ILP): ~1985 through today
- Pipelining
- Superscalar
- VLIW
- Out-of-order execution
- Limits to the benefits of ILP?
Process-level or thread-level parallelism; mainstream for general-purpose computing?
- Servers are parallel
- High-end desktop dual-processor PC soon?


Why Multiprocessors?

1. Microprocessors are the fastest CPUs
   - Collecting several is much easier than redesigning one
2. Complexity of current microprocessors
   - Do we have enough ideas to sustain 1.5X/yr?
   - Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel software (scientific apps, databases, OS)
4. Emergence of embedded and server markets driving microprocessors in addition to desktops
   - Embedded: functional parallelism, producer/consumer model
   - Server figure of merit is tasks per hour vs. latency

Parallel Processing Introduction

Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance

Machines today: Sun Enterprise 10000 (8/00)
- 64 400 MHz UltraSPARC® II CPUs, 64 GB SDRAM memory, 868 18 GB disks, tape
- $4,720,800 total
- 64 CPUs 15%, 64 GB DRAM 11%, disks 55%, cabinet 16% ($10,800 per processor, or ~0.2% per processor)

Machines today: Dell Workstation 220 (2/01)
- 866 MHz Intel Pentium® III (in Minitower)
- 0.125 GB RDRAM memory, 1 10 GB disk, 12X CD, 17” monitor, nVIDIA GeForce 2 GTS 32 MB DDR graphics card, 1-yr service
- $1,600; for an extra processor, add $350 (~20%)


Popular Flynn Categories for Parallel Computers

SISD (Single Instruction Single Data)
- Uniprocessors
MISD (Multiple Instruction Single Data)
- Multiple processors on a single data stream
SIMD (Single Instruction Multiple Data)
- Early examples: Illiac-IV, CM-2
- Phrase reused by Intel marketing for media instructions ~ vector
MIMD (Multiple Instruction Multiple Data)
- Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
- Flexible
- Uses off-the-shelf micros
MIMD is the current winner: concentrate major design emphasis on <= 128 processor MIMD machines

Where is Supercomputing Heading?

1997, 500 fastest machines in the world: 319 MPPs, 73 bus-based shared memory (SMP), 106 parallel vector processors (PVP)

2000, 381 of the 500 fastest: 144 IBM SP (~cluster), 121 Sun (bus SMP), 62 SGI (NUMA SMP), 54 Cray (NUMA SMP)

Reference: Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler, Jaswinder Pal Singh, with Anoop Gupta. San Francisco: Morgan Kaufmann, 1999.

http://www.top500.org/

Major MIMD Styles

1. Centralized shared memory (“Uniform Memory Access” time, or “Shared Memory Processor”)
2. Decentralized memory (memory module with CPU)
   - Shared memory with “Non-Uniform Memory Access” time (NUMA)
   - “Multicomputer” with a separate address space per processor

Parallel Architecture

Parallel architecture extends traditional computer architecture with a communication architecture:
- abstractions (HW/SW interface)
- organizational structure to realize the abstraction efficiently


Parallel Framework

Layers:
Programming model:
- Multiprogramming: lots of jobs, no communication
- Shared address space: communicate via memory
- Message passing: send and receive messages
- Data parallel: one operation, multiple data sets
Communication abstraction:
- Shared address space: e.g., load, store, etc. => multiprocessors
- Message passing: e.g., send, receive calls
- Debate over this topic (ease of programming, scaling)
May mix shared address space and message passing at different layers

Shared Address/Memory Processor Model

Each processor can name every physical location in the machine
Each process can name all data it shares with other processes
Data transfer via load and store
Data size: byte, word, ... or cache blocks
Uses virtual memory to map virtual addresses to local or remote physical memory
Memory hierarchy model applies: now communication moves data to the local processor cache (as a load moves data from memory to cache)
- Latency, BW, when to communicate?


Shared Address/Memory Multiprocessor Model

Communicate via load and store
- Oldest and most popular model
Based on timesharing: processes on multiple processors vs. sharing a single processor
Process: a virtual address space and >= 1 thread of control
- ALL threads of a process share the process address space

Shared-Memory Programming Examples

struct alloc_t { int first; int last; } alloc[MAX_THR];
pthread_t tid[MAX_THR];
main() {
    ...
    for (int i = 0; i < num_thr; i++)
        pthread_create(&tid[i], NULL, doit,
                       (void *)&alloc[i] /* parameters */);
    ...
    for (i = 0; i < num_thr; i++)
        pthread_join(tid[i], NULL);
    ...
}
doit: for (i = alloc->first; i <= alloc->last; i++)
          for (int k = 0; k < ...; k++)
              ...
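For concreteness, here is a complete, compilable version of the pattern the fragment above sketches. The work done in doit (summing a shared array into per-thread slots) and the names N, data, and partial are illustrative assumptions; the slide elides the actual loop bodies.

    /* Minimal runnable sketch of the slide's pthread pattern (compile
       with -lpthread). Threads communicate purely via loads and stores
       to the shared arrays below. */
    #include <pthread.h>
    #include <stdio.h>

    #define MAX_THR 4
    #define N 1024

    struct alloc_t { int first; int last; } alloc[MAX_THR];
    pthread_t tid[MAX_THR];

    int  data[N];            /* shared: every thread can load/store it   */
    long partial[MAX_THR];   /* one slot per thread to avoid a data race */

    void *doit(void *arg) {
        struct alloc_t *a = (struct alloc_t *)arg;
        int me = (int)(a - alloc);          /* thread index from its slot */
        for (int i = a->first; i <= a->last; i++)
            partial[me] += data[i];         /* communicate via load/store */
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = 1;
        int chunk = N / MAX_THR;
        for (int i = 0; i < MAX_THR; i++) {
            alloc[i].first = i * chunk;
            alloc[i].last  = (i == MAX_THR - 1) ? N - 1 : (i + 1) * chunk - 1;
            pthread_create(&tid[i], NULL, doit, (void *)&alloc[i]);
        }
        long sum = 0;
        for (int i = 0; i < MAX_THR; i++) {
            pthread_join(tid[i], NULL);     /* wait, then read the result */
            sum += partial[i];
        }
        printf("sum = %ld\n", sum);         /* prints sum = 1024 */
        return 0;
    }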

SMP Interconnect

Processors connect to memory AND to I/O
Bus based: all memory locations have equal access time, so SMP = “Symmetric MP”
- Sharing limits BW as processors and I/O are added

(Figure: four processors P, each with its own cache, sharing a bus to main memory and the I/O system.)

Advantages of Shared-Memory Model

Ease of programming when communication patterns are complex or vary dynamically during execution
Lower communication overhead, good utilization of communication bandwidth for small data items, no expensive I/O operations
Hardware-controlled caching reduces remote communications when remote data is cached


Message Passing Model

Whole computers (CPU, memory, I/O devices) communicate via explicit send/receive, as explicit I/O operations
Send specifies the local buffer + the receiving process on the remote computer
Receive specifies the sending process on the remote computer + the local buffer to place the data
Send + receive => memory-memory copy, where each side supplies a local address

Advantages of Message-Passing Communication

The hardware can be much simpler and is usually standard
Explicit communication => simpler to understand, helps focus effort on reducing communication cost
Synchronization is naturally associated with sending/receiving messages
Easier to use sender-initiated communication, which may have some advantages in performance

Important, but will not be discussed in detail
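A minimal sketch of the send/receive pairing just described, written against MPI as a representative message-passing interface; the slides name no particular library, and the buffer contents and tag are illustrative.

    /* Explicit message passing: rank 0 sends, rank 1 receives.
       Each side names the other process and supplies its own
       local buffer, exactly as described above. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, buf[4] = {1, 2, 3, 4};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* send: local buffer + receiving process (rank 1) */
            MPI_Send(buf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive: sending process (rank 0) + local buffer; the
               blocking receive is also the synchronization point */
            MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 got %d %d %d %d\n",
                   buf[0], buf[1], buf[2], buf[3]);
        }
        MPI_Finalize();
        return 0;
    }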


Three Types of Parallel Applications

Commercial workload
- TPC-C, TPC-D, AltaVista (Web search engine)
Multiprogramming and OS workload
Scientific/technical applications
- FFT kernel: 1D complex-number FFT
- LU kernel: dense matrix factorization
- Barnes app: Barnes-Hut n-body algorithm, galaxy evolution
- Ocean app: Gauss-Seidel multigrid technique to solve a set of elliptical partial differential equations

Amdahl’s Law and Parallel Computers

Amdahl’s Law: speedup is limited by the fraction of the computation that cannot be parallelized
Speedup ≤ 1 / f, where f is the fraction of sequential computation
How large can f be if we want 80X speedup from 100 processors?
1 / (f + (1-f)/100) = 80
f = 0.25% !


What Does Coherency Mean?

Informally:
- “Any read must return the most recent write”
- Too strict and too difficult to implement
Better:
- “Any write must eventually be seen by a read”
- All writes are seen in proper order (“serialization”)
Two rules to ensure this:
- “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart”
- Writes to a single location are serialized: seen in one order
  - The latest write will be seen
  - Otherwise writes could be seen in an illogical order (an older value seen after a newer value)
Cache coherency in multiprocessors: How does a processor know about changes in the caches of other processors? How do other processors know about changes in this cache?

Potential HW Coherency Solutions

Snooping solution (snoopy bus):
- Send all requests for data to all processors
- Processors snoop to see if they have a copy and respond accordingly
- Requires broadcast, since caching information is at the processors
- Works well with a bus (natural broadcast medium)
- Dominates for small-scale machines (most of the market)
Directory-based schemes (discussed later):
- Keep track of what is being shared in one (logically) centralized place
- => distributed directory for scalability (avoids bottlenecks)
- Send point-to-point requests to processors via a network
- Scales better than snooping
- Actually existed BEFORE snooping-based schemes

Basic Snoopy Protocols

Write Invalidate Protocol:
- Multiple readers, single writer
- Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
- Read miss:
  - Write-through: memory is always up-to-date
  - Write-back: snoop in caches to find the most recent copy
Write Broadcast Protocol (typically with write-through):
- Write to shared data: broadcast on the bus; processors snoop and update any copies
- Read miss: memory is always up-to-date
Write serialization: the bus serializes requests!
- The bus is the single point of arbitration

Write Invalidate versus Broadcast:
- Invalidate requires one transaction per write-run
- Invalidate exploits spatial locality: one transaction per block
- Broadcast has lower latency between write and read


An Example Snoopy Protocol

Invalidation protocol, write-back cache
Each block of memory is in one state:
- Clean in all caches and up-to-date in memory (Shared)
- OR dirty in exactly one cache (Exclusive)
- OR not in any caches
Each cache block is in one state (track these):
- Shared: block can be read
- OR Exclusive: cache has the only copy; it is writeable and dirty
- OR Invalid: block contains no data
Read misses: cause all caches to snoop the bus
Writes to a clean line are treated as misses

Snoopy-Cache State Machine I

(Figure: state machine for CPU requests, for each cache block. Invalid -> Shared on a CPU read: place read miss on bus. Invalid or Shared -> Exclusive on a CPU write: place write miss on bus. Shared: a CPU read hit stays Shared; a CPU read miss stays Shared and places a read miss on the bus. Exclusive: CPU read/write hits stay Exclusive; on a CPU read miss, write back the block and place a read miss on the bus; on a CPU write miss, write back the cache block and place a write miss on the bus.)

Snoopy-Cache State Machine II

(Figure: state machine for bus requests, for each cache block. Shared -> Invalid on a write miss for this block. Exclusive -> Shared on a read miss for this block: write back the block, aborting the memory access. Exclusive -> Invalid on a write miss for this block: write back the block, aborting the memory access.)

Snoopy-Cache State Machine III

(Figure: combined state machine for CPU requests and bus requests for each cache block, merging the transitions of machines I and II. Appendix I gives details of the bus requests.)
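The CPU-request half of the protocol can be captured in a few lines of C. This is a sketch with illustrative enum and function names (the slides present the transitions only graphically); bus actions appear as comments.

    /* Per-block transition of the invalidation protocol on a CPU access. */
    enum block_state { INVALID, SHARED, EXCLUSIVE };

    enum block_state cpu_request(enum block_state s, int is_write, int is_hit) {
        switch (s) {
        case INVALID:
            if (is_write) return EXCLUSIVE;   /* place write miss on bus */
            return SHARED;                    /* place read miss on bus  */
        case SHARED:
            if (is_write) return EXCLUSIVE;   /* place write miss on bus */
            if (!is_hit) return SHARED;       /* place read miss on bus  */
            return SHARED;                    /* read hit: no bus traffic */
        case EXCLUSIVE:
            if (is_hit) return EXCLUSIVE;     /* read/write hit: no bus traffic */
            /* miss to another address in this block: write back the dirty
               block, then place a read or write miss on the bus */
            return is_write ? EXCLUSIVE : SHARED;
        }
        return INVALID;                       /* unreachable */
    }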

Example

Assumes A1 and A2 map to the same cache block (but A1 != A2), and that the initial cache state is Invalid. The lecture builds this table up one step at a time; the full sequence is:

step                P1 (state addr value)   P2 (state addr value)   bus (action proc addr value)   memory (addr value)
P1: Write 10 to A1  Excl.  A1  10                                   WrMs  P1  A1
P1: Read A1         Excl.  A1  10
P2: Read A1                                 Shar.  A1               RdMs  P2  A1
                    Shar.  A1  10                                   WrBk  P1  A1  10              A1  10
                                            Shar.  A1  10           RdDa  P2  A1  10              A1  10
P2: Write 20 to A1  Inv.                    Excl.  A1  20           WrMs  P2  A1                  A1  10
P2: Write 40 to A2                                                  WrMs  P2  A2                  A1  10
                                            Excl.  A2  40           WrBk  P2  A1  20              A1  20


Implementation Complications

Write races:
- Cannot update the cache until the bus is obtained
  - Otherwise, another processor may get the bus first and write the same cache block!
- Two-step process:
  - Arbitrate for the bus
  - Place the miss on the bus and complete the operation
- If a miss occurs to the block while waiting for the bus, handle the miss (an invalidate may be needed) and then restart
- Split-transaction bus:
  - A bus transaction is not atomic: there can be multiple outstanding transactions for a block
  - Multiple misses can interleave, allowing two caches to grab the block in the Exclusive state
  - Must track and prevent multiple misses for one block
Must support interventions and invalidations

Implementing Snooping Caches

Multiple processors must be on the bus, with access to both addresses and data
Add a few new commands to perform coherency, in addition to read and write
Processors continuously snoop on the address bus
- If an address matches a tag, either invalidate or update
Since every bus transaction checks cache tags, the checks could interfere with the CPU:
- Solution 1: duplicate the set of tags for the L1 caches, just to allow checks in parallel with the CPU
- Solution 2: the L2 cache already duplicates them, provided L2 obeys inclusion with the L1 cache

Implementing Snooping Caches (continued)

The bus serializes writes: getting the bus ensures no one else can perform a memory operation
On a miss in a write-back cache, another cache may have the desired copy and it is dirty, so it must reply
Add an extra state bit to the cache to determine shared or not
Add a 4th state (MESI):
- Modified (private, != memory)
- eXclusive (private, = memory)
- Shared (shared, = memory)
- Invalid

MESI Highlights

From the local processor P’s viewpoint, for each cache block:
Modified: only P has a copy and the copy has been modified; must respond to any read/write request
Exclusive-clean: only P has a copy and the copy is clean; no need to inform others about my changes
Shared: some other processors may have a copy; have to inform others about P’s changes
Invalid: the block has been invalidated (possibly at the request of someone else)


MESI Highlights (continued)

Actions:
- On a read miss for a block: send a read request onto the bus
- On a write miss for a block: send a write request onto the bus
- On receiving a bus read request: transition the block to the Shared state
- On receiving a bus write request: transition the block to the Invalid state
- Must write back the data when transitioning out of the Modified state
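The snoop-side half of those actions amounts to a small transition function. A sketch in the same illustrative style as before (the enum and function names are not from the slides):

    /* Snoop-side MESI transitions for the actions listed above. */
    enum mesi { MODIFIED, EXCLUSIVE_CLEAN, SHARED_ST, INVALID_ST };

    static void write_back_block(void) { /* flush the dirty data to memory */ }

    enum mesi snoop_request(enum mesi s, int bus_write) {
        if (s == MODIFIED)
            write_back_block();          /* leaving Modified: write back data */
        if (bus_write)
            return INVALID_ST;           /* bus write request -> Invalid */
        return (s == INVALID_ST) ? INVALID_ST : SHARED_ST;  /* bus read -> Shared */
    }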

