Multi-Threading for Multi-Core Architectures

Multithreading and Parallel Microprocessors
Stephen Jenks
Electrical Engineering and Computer Science
[email protected]
[Title slide images: Intel Core Duo and AMD Athlon 64 X2]

Mostly Worked on Clusters

Also Build Really Big Displays
- HIPerWall: 200 Million Pixels
- 50 Displays
- 30 Power Mac G5s

Outline
- Parallelism in Microprocessors
- Multicore Processor Parallelism
- Parallel Programming for Shared Memory
  - OpenMP
  - POSIX Threads
  - Java Threads
- Parallel Microprocessor Bottlenecks
- Parallel Execution Models to Address Bottlenecks
  - Memory interface
  - Cache-to-cache (coherence) interface
- Current and Future CMP Technology

Parallelism in Microprocessors
- Pipelining is most prevalent
  - Developed in the 1960s
  - Used in everything, even microcontrollers
  - Decreases cycle time
  - Allows up to 1 instruction per cycle (IPC)
  - No programming changes
  - Some Pentium 4s have more than 30 stages!
[Diagram: pipeline stages Fetch, Decode, Register Access, ALU, Write Back, separated by buffers]

More Microprocessor Parallelism
- Superscalar allows Instruction-Level Parallelism (ILP)
  - Replace the ALU with multiple functional units
  - Dispatch several instructions at once
- Out-of-Order Execution
  - Execute based on data availability
  - Requires a reorder buffer
- More than 1 IPC
- No program changes
[Diagram: a single ALU becomes separate FP, INT, INT, and Load/Store units]

Thread-Level Parallelism
- Simultaneous Multithreading (SMT)
  - Execute instructions from several threads at the same time
  - Intel Hyperthreading, IBM Power 5/6, Cell
- Chip Multiprocessors (CMP)
  - More than 1 CPU per chip
  - AMD Athlon 64 X2, Intel Core Duo, IBM Power 4/5/6, Xenon, Cell
[Diagrams: an SMT core with Thread 1 and Thread 2 sharing Int, FP, and L/S units; a CMP with CPU1 and CPU2 behind a shared L2 cache and system/memory interface]

Chip Multiprocessors
- Several CPU cores
  - Independent execution
  - Symmetric (for now)
- Shared memory hierarchy
  - Private L1 caches
  - Shared L2 cache (Intel Core) or private L2 caches kept coherent via crossbar (AMD)
  - Shared memory interface
  - Shared system interface
- Lower clock speed
Shared Resources Can Help or Hurt!
[Die photos of the Intel Core Duo and AMD Athlon 64 X2; images from Intel and AMD]
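
Since these parts expose different numbers of cores, a thread-per-core program first needs to ask the operating system how many logical CPUs are available. As an aside (not from the slides), here is a minimal sketch for a POSIX-style system; sysconf(_SC_NPROCESSORS_ONLN) is a widely supported extension rather than a guaranteed standard call.

    /* Sketch: query the number of online logical CPUs (POSIX-style systems).
     * _SC_NPROCESSORS_ONLN is a common extension, so treat it as best effort. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        if (ncpus < 1)
            ncpus = 1;   /* fall back to one CPU if the query is unsupported */
        printf("Online logical CPUs: %ld\n", ncpus);
        return 0;
    }

A thread-per-core program can use this value as its thread count.
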
Quad Cores Today
[Block diagrams of three systems: Core 2 Xeon (Mac Pro), Dual-Core Opteron, and Core 2 Quad/Extreme, showing CPUs, L2 caches, system/memory interfaces, frontside buses, memory controllers, and a HyperTransport link]

Shared Memory Parallel Programming
- Could just run multiple programs at once (multiprogramming)
  - Good idea, but long tasks still take long
- Need to partition work among processors
  - Implicitly (get the compiler to do it)
    - Intel C/C++/Fortran compilers do pretty well
    - OpenMP code annotations help
    - Not reasonable for complex code
  - Explicitly (thread programming)
- Primary needs
  - Scientific computing
  - Media encoding and editing
  - Games

Multithreading
- Definitions
  - Process: a program in execution
    - CPU state (regs, PC)
    - Resources
    - Address space
  - Thread: a lightweight process
    - CPU state
    - Stack
    - Shares resources and address space with other threads in the same process
- Thread operations
  - Create / spawn
  - Join / destroy
  - Suspend & resume
- Uses
  - Solve a problem together (divide & conquer)
  - Do different things
    - Manage game economy
    - NPC actions
    - Manage screen drawing
    - Sound
    - Input handling

OpenMP Programming Model
- Implicit parallelism with source code annotations:

    #pragma omp parallel for private (i,k)
    for (i = 0; i < nx; i++)
        for (k = 0; k < nz; k++) {
            ez[i][0][k] = 0.0;
            ez[i][1][k] = 0.0;
            …

- Compiler reads the pragma and parallelizes the loop
  - Partitions work among threads (1 per CPU)
  - Vars i and k are private to each thread
  - Other vars (ez array, for example) are shared across all threads
- Can force parallelization of "unsafe" loops

Thread pitfalls
- Shared data
  - 2 threads perform A = A + 1:

        Thread 1:            Thread 2:
        1) Load A into R1    1) Load A into R1
        2) Add 1 to R1       2) Add 1 to R1
        3) Store R1 to A     3) Store R1 to A

  - Mutual exclusion preserves correctness: locks/mutexes, semaphores, monitors, Java "synchronized"
- False sharing
  - Non-shared data packed into the same cache line:

        int thread1data;
        int thread2data;

  - Cache line ping-pongs between CPUs when threads access their own data (see the padding sketch after this slide)
- Locks for heap access
  - malloc() is expensive because of mutual exclusion
  - Use private heaps
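
The usual fix for false sharing is to pad or align per-thread data so each thread's hot variable occupies its own cache line. The following is a minimal sketch of that idea, not code from the talk; the 64-byte line size and the loop count are assumptions.

    /* Sketch: pad per-thread counters to a cache line to avoid false sharing.
     * The 64-byte line size is an assumption; real code should detect it. */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64

    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* keep counters 64 bytes apart */
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;              /* thread index: 0 or 1 */
        for (long i = 0; i < 100000000L; i++)
            counters[id].value++;             /* never shares a line with the other counter */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }

With the plain int thread1data / thread2data layout from the slide, both variables land in one line and the cache line ping-pongs between the cores even though no data is actually shared.
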
POSIX Threads
- IEEE 1003.4 (Portable Operating System Interface) committee
- Lightweight "threads of control"/processes operating within a single address space
- A typical "process" contains a single thread in its address space
- Threads run concurrently and allow
  - Overlapping I/O and computation
  - Efficient use of multiprocessors
- Also called pthreads

Concept of Operation
1. When the program starts, the main thread is running
2. The main thread spawns child threads as needed
3. The main thread and child threads run concurrently
4. Child threads finish and join with the main thread
5. The main thread terminates when the process ends

Approximate Pi with pthreads

    /* the thread control function */
    void* PiRunner(void* param)
    {
        int threadNum = (int)(intptr_t) param;   /* index passed by value; via intptr_t for 64-bit safety */
        int i;
        double h, sum, mypi, x;

        printf("Thread %d starting.\n", threadNum);
        h = 1.0 / (double) iterations;
        sum = 0.0;
        for (i = threadNum + 1; i <= iterations; i += threadCount) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;

        /* now store the result into the result array */
        resultArray[threadNum] = mypi;
        printf("Thread %d exiting.\n", threadNum);
        pthread_exit(0);
    }

More Pi with pthreads: main()

    /* get the default attributes and set up for creation */
    for (i = 0; i < threadCount; i++) {
        pthread_attr_init(&attrs[i]);
        /* system-wide contention */
        pthread_attr_setscope(&attrs[i], PTHREAD_SCOPE_SYSTEM);
    }

    /* create the threads */
    for (i = 0; i < threadCount; i++) {
        pthread_create(&tids[i], &attrs[i], PiRunner, (void *)(intptr_t) i);
    }

    /* now wait for the threads to exit */
    for (i = 0; i < threadCount; i++)
        pthread_join(tids[i], NULL);

    pi = 0.0;
    for (i = 0; i < threadCount; i++)
        pi += resultArray[i];
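
The two fragments above rely on globals and locals the slides never show. A minimal sketch of the missing declarations, keeping the names from the slides (the sizes and default values are assumptions, not from the talk; <stdint.h> supplies intptr_t for the casts above):

    /* Sketch of the declarations the Pi fragments assume; names match the
     * slides, but MAX_THREADS and the defaults are guesses. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdint.h>     /* intptr_t, used to pass the thread index as void* */

    #define MAX_THREADS 16

    static int threadCount = 2;          /* worker threads, e.g. one per core */
    static int iterations  = 1000000;    /* rectangles in the integration */
    static double resultArray[MAX_THREADS];

    /* main() would additionally declare:
     *   pthread_t      tids[MAX_THREADS];
     *   pthread_attr_t attrs[MAX_THREADS];
     *   int i;
     *   double pi;
     */

With these pieces plus a final printf of pi, the example builds as a complete program (link with -lpthread).
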
Java Threads
- Threading and synchronization built in
- An object can have an associated thread
  - Subclass Thread or implement Runnable
  - The "run" method is the thread body
  - "synchronized" methods provide mutual exclusion
- Main program
  - Calls the "start" method of Thread objects to spawn them
  - Calls "join" to wait for completion

Parallel Microprocessor Problems
- Memory interface was already too slow for 1 core/thread
- Now multiple threads access memory simultaneously, overwhelming the memory interface
- Parallel programs can run as slowly as sequential ones!
[Diagram: "Then" — one CPU with its memory; "Now" — CPU1 and CPU2 sharing an L2 cache and system/memory interface in front of the same memory]

Our Solution: Producer/Consumer Parallelism Using The Cache
[Diagram: conventional approach — Thread 1 and Thread 2 each do half the work on data in memory, through the memory bottleneck; producer/consumer approach — a Producer thread and a Consumer thread communicate through the cache]
(A generic sketch of this hand-off pattern appears at the end of this section.)

Converting to Producer/Consumer

    for (i = 1; i < nx - 1; i++) {
        for (j = 1; j < ny - 1; j++) {
            /* Update Magnetic Field */
            for (k = 1; k < nz - 1; k++) {
                double invmu = 1.0/mu[i][j][k];
                double tmpx = rx*invmu;
                double tmpy = ry*invmu;
                double tmpz = rz*invmu;
                hx[i][j][k] += tmpz * (ey[i][j][k+1] - ey[i][j][k])
                             - tmpy * (ez[i][j+1][k] - ez[i][j][k]);
                hy[i][j][k] += tmpx * (ez[i+1][j][k] - ez[i][j][k])
                             - tmpz * (ex[i][j][k+1] - ex[i][j][k]);
                hz[i][j][k] += tmpy * (ex[i][j+1][k] - ex[i][j][k])
                             - tmpx * (ey[i+1][j][k] - ey[i][j][k]);
            }
            /* Update Electric Field */
            for (k = 1; k < nz - 1; k++) {
                double invep = 1.0/ep[i][j][k];
                double tmpx = rx*invep;
                double tmpy = ry*invep;
                double tmpz = rz*invep;
                ex[i][j][k] += tmpy * (hz[i][j][k] - hz[i][j-1][k])
                             - tmpz * (hy[i][j][k] - hy[i][j][k-1]);
                ey[i][j][k] += tmpz * (hx[i][j][k] - hx[i][j][k-1])
                             - tmpx * (hz[i][j][k] - hz[i-1][j][k]);
                ez[i][j][k] += tmpx * (hy[i][j][k] - hy[i-1][j][k])
                             - tmpy * (hx[i][j][k] - hx[i][j-1][k]);
            }
        }
    }

Synchronized Pipelined Parallelism Model (SPPM)
[Diagram: conventional spatial decomposition versus producer/consumer pipelining (SPPM) over the same grid]

SPPM Features
- Benefits
  - Memory bandwidth same as the sequential version
  - Performance improvement (usually)
  - Easy in concept
- Drawbacks
  - Complex programming
  - Some synchronization overhead
  - Not always faster than SDM (or sequential)

SPPM Performance (Normalized)
[Charts: normalized SPPM performance for FDTD and a Red-Black equation solver]

So What's Up With AMD CPUs?
- How can SPPM be slower than sequential?
  - Fetching from the other core's cache is slower than fetching from memory!
  - Makes the consumer slower than the producer!
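
For readers who want to experiment with the producer/consumer hand-off described above, the following is a minimal, generic sketch of the pattern using two pthreads and a small bounded queue of block indices. It is an illustration only, not the SPPM implementation from the talk; the block and queue sizes are arbitrary, and an FDTD version would update the H field in the producer and the E field in the consumer for each block.

    /* Generic producer/consumer hand-off over a bounded queue of block indices.
     * Illustrative sketch only -- not the SPPM code from the talk. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_BLOCKS 64
    #define QUEUE_LEN   8

    static int queue[QUEUE_LEN];
    static int head = 0, tail = 0, count = 0;
    static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        (void)arg;
        for (int b = 0; b < NUM_BLOCKS; b++) {
            /* "Produce" block b here (e.g., update one slab of the H field),
             * then hand it off while its data is likely still in cache. */
            pthread_mutex_lock(&lock);
            while (count == QUEUE_LEN)
                pthread_cond_wait(&not_full, &lock);
            queue[tail] = b;
            tail = (tail + 1) % QUEUE_LEN;
            count++;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        for (int n = 0; n < NUM_BLOCKS; n++) {
            pthread_mutex_lock(&lock);
            while (count == 0)
                pthread_cond_wait(&not_empty, &lock);
            int b = queue[head];
            head = (head + 1) % QUEUE_LEN;
            count--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&lock);
            /* "Consume" block b here (e.g., update the E field for that slab). */
            printf("consumed block %d\n", b);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

Keeping the queue small holds the producer only slightly ahead of the consumer, which is what lets the hand-off stay in the shared cache instead of spilling to memory.
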