GPU Basics
S. Sundar & M. Panchatcharam
August 9, 2014

Outline
1. Task-based parallelism
2. Data-Based Parallelism
3. FLYNN's Taxonomy
4. Other Parallel Patterns
5. Classes of Parallel Computers
6. Simple Coding

Pipeline parallelism
- A typical OS exploits a type of parallelism called task-based parallelism.
- Example: a user reads an article on a website while playing music from a media player.
- In terms of parallel programming, a Linux user can use pipe commands to chain such tasks together.

Coarse/fine-grained parallelism
- Applications are often classified according to how often their subtasks need to synchronize or communicate with each other.
- If subtasks communicate many times per second, it is called fine-grained parallelism.
- If they do not communicate many times per second, it is called coarse-grained parallelism.
- If they rarely or never have to communicate, it is called embarrassingly parallel.
- Embarrassingly parallel applications are the easiest to parallelize.

Task Parallelism
In a multiprocessor system, task parallelism is achieved when each processor executes a different thread on the same or different data.
- Threads may execute the same or different code.
- Communication between threads takes place by passing data from one thread to the next.

Data Parallelism
In a multiprocessor system, data parallelism is achieved when each processor performs the same task on different pieces of distributed data.
- Consider adding two matrices: addition is the same task applied to different pieces of data (each element).
- This is fine-grained parallelism.

Instruction Level Parallelism
- Instruction level parallelism is a measure of how many of the operations in a computer program can be performed simultaneously.
- The potential overlap among instructions is called instruction level parallelism.
- At the hardware level it is exploited dynamically; at the software (compiler) level it is exploited statically.

Taxonomy
Flynn's Taxonomy is a classification of different computer architectures. The types are as follows:
- SIMD - Single Instruction, Multiple Data
- MIMD - Multiple Instruction, Multiple Data
- SISD - Single Instruction, Single Data
- MISD - Multiple Instruction, Single Data

SISD
- Standard serial programming
- A single instruction stream operates on a single data stream
- A single-core CPU is enough

MIMD
- Today's dual- or quad-core desktop machines
- Work is allocated to one of N CPU cores
- Each thread has an independent stream of instructions
- The hardware contains the control logic for decoding separate instruction streams

SIMD
- A type of data parallelism
- A single instruction stream at any one point in time
- A single set of logic decodes and executes the instruction stream

MISD
- Many functional units perform different operations on the same data
- Pipeline architectures belong to this type
- Example: Horner's rule

    y = (···(((an*x + an-1)*x + an-2)*x + an-3)*x + ··· + a1)*x + a0

Loop based Pattern
- Are you familiar with loop structures?
- Types of loops:
  - Entry-controlled loops (e.g. while)
  - Exit-controlled loops (e.g. do-while)

Loop based Pattern
- An easy pattern to parallelize
- With inter-loop dependencies removed, decide how to split or partition the work between the available processors
- Optimize communication between processors and the use of on-chip resources
- Communication overhead is the bottleneck
- Decompose based on the number of logical hardware threads available

Loop based Pattern
- However, oversubscribing the number of threads leads to poor performance
- Reason: context switching is performed in software by the OS
- Be aware of hidden dependencies in the existing serial implementation
- Concentrate on the inner loops and one or more outer loops
- The best approach is to parallelize only the outer loops

Loop based Pattern
- Note: most loops can be flattened
- Reduce inner loop and outer loop to a single loop, if possible
- Example: an image processing algorithm with the X pixel axis in the inner loop and the Y axis in the outer loop
- Flatten this loop by treating all pixels as a single-dimensional array and iterating over image coordinates

Fork/Join Pattern
- A common pattern in serial programming where there are synchronization points and only certain aspects of the program are parallel.
- The serial code reaches work that can be distributed to P processors in some manner.
- It then forks, or spawns, N threads/processes that perform the calculation in parallel.
- These processes execute independently and finally converge, or join, once all the calculations are complete.
- A typical approach found in OpenMP.

Fork/Join Pattern
- Code splits into N threads and later converges to a single thread again.
- In the figure, we see a queue of data items.
- Data items are split across three processing cores.
- Each data item is processed independently and later written to the appropriate destination.

Fork/Join Pattern
- Typically implemented with partitioning of the data.
- The serial code launches N threads and divides the dataset equally between them.
- This works well if each packet of data takes the same time to process.
- If one thread takes too much time, it becomes the single factor determining the total time.
- Why did we choose three threads instead of six? In reality, with millions of data items, attempting to fork a million threads will cause almost any OS to fail.
- The OS applies a "fair scheduling" policy.

Fork/Join Pattern
- Programmers and many multithreaded libraries use the number of logical processor threads available as the number of threads to fork.
- CPU threads are expensive to create, destroy and utilize.
- The fork/join pattern is useful when there is an unknown amount of concurrency available in a problem.
- Traversing a tree structure, for example, may fork additional threads when it encounters another node.

Divide and Conquer Pattern
- A pattern for breaking down (dividing) large problems into smaller sections, each of which can be conquered.
- Useful with recursion.
- Example: quicksort.
- Quicksort recursively partitions the data into two sets: elements above the pivot point and elements below the pivot point.

Implicit Parallelism
It is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism inherent in the computations expressed by some of the language's constructs.
- In other words, it is compiler-level parallelism, enabled for instance by OpenMP flags or -O3 flags.

Implicit Parallelism
Advantages:
- No need to worry about task division or communication
- Focus on the problem rather than the parallelization
- Substantial improvement in programmer productivity
- Adds simplicity and clarity to the program

Implicit Parallelism
Disadvantages:
- The programmer loses control over the parallel execution of the program
- Gives less than optimal parallel efficiency
- Debugging is difficult
- Every program has some parallel and some serial logic

Explicit Parallelism
It is the representation of concurrent computations by means of primitives in the form of special-purpose directives or function calls.
- Related to process synchronization, communication and partitioning
- Gives the programmer absolute control over the parallel execution
- A skilled programmer takes advantage of explicit parallelism to produce very efficient code
- However, explicit parallelism is difficult to do
- Reason: extra work in planning the decomposition/partitioning and the synchronization of concurrent processes

Classes of Parallel Computers
- Multicore computing
- Symmetric multiprocessing
- Distributed computing
- Cluster computing
- Massively parallel computing
- Grid computing

Classes of Parallel Computers
- Reconfigurable computing with field-programmable gate arrays (FPGAs)
- General-purpose computing on graphics processing units (GPGPU)
- Application-specific integrated circuits (ASICs)
- Vector processors

Multicore computing
- Contains multiple execution units on the same chip
- Issues multiple instructions per cycle from multiple instruction streams
- Example: Intel Quad-Core
Advantages:
- Multiple CPU cores on the same die allow cache coherency to operate at a much higher clock rate
- Allows higher performance at lower energy
Disadvantages:
- Thermal power is difficult to manage
- Two processing cores share the same bus and memory, limiting real-world performance

Symmetric Multiprocessing (SMP)
- SMP is a system with multiple identical processors that share memory and connect via a bus
- SMPs do not usually comprise more than 32 processors
- SMPs are cost-effective if a sufficient amount of memory bandwidth exists
- Example: Intel's Xeon machines

Distributed Computing
It is a distributed memory computer system in which the processing elements are connected by a network.
Examples:
- Telephone networks
- World Wide Web / Internet
- Aircraft control systems
- Multiplayer online games
- Grid computing

Cluster Computing
A cluster is a group of loosely coupled computers that work together closely, so that they can be considered as a single computer.
- Consists of multiple standalone machines connected by a network
- Load balancing is difficult if the cluster machines are not symmetric
- Examples: Beowulf clusters over a TCP/IP LAN
- The Top500 supercomputers are mostly clusters

Massively Parallel Computing (MPP)
It is a single computer with many networked processors.
- Similar to clusters, but with specialized interconnect networks
- Larger than clusters, with more than 100 processors
- Each CPU has its own memory and its own copy of the OS/applications
- Communicates via a high-speed interconnect
- Example: Blue Gene/L, the fifth fastest supercomputer

NUMA
NUMA stands for Non-Uniform Memory Access. It is a special type of shared memory architecture where the access times to different memory locations by a processor may vary, as may the access times to the same memory location by different processors.
- Known as a tightly coupled form of cluster computing
- Influences memory access performance
- Most current operating systems, such as Windows 7 and 8, support NUMA; the Linux kernel has supported it since version 2.5

NUMA
- Modern CPUs work faster than the main memory they use
- All of the processors have equal access to the memory and I/O in the system (see figure)
- Special attention is required to write software that runs well on NUMA systems

Parallel Random Access Machine
- It is a shared memory abstract machine
- Read/write conflicts when accessing the same shared memory location simultaneously are resolved by the following categories:
  - EREW - Exclusive Read, Exclusive Write: every memory cell can be read or written by only one processor at a time
  - CREW - Concurrent Read, Exclusive Write: multiple processors can read a memory cell, but only one can write at a time
  - CRCW - Concurrent Read, Concurrent Write: multiple processors can read and write

FAQ on Parallel Computing

FAQ
- Shared memory architecture: a single address space is visible to all execution threads.
- Task latency: the time taken for a task to complete after a request for it is made.
- Task throughput: the number of tasks completed in a given time.
- Speedup: the ratio of some performance metric obtained using a single processor to that obtained using a set of parallel processors.
- Parallel efficiency: the speedup per processor.

FAQ
- Maximum speedup possible according to Amdahl's law: 1/fs, where fs is the inherently sequential fraction of the time taken by the best sequential execution of the task. (Prove!)
- Cache coherence: different processors maintain their own local caches, so multiple copies of the same data are available. Coherence implies that accesses to the local copies behave similarly to accesses to the original copy, apart from the access time.
- False sharing: sharing of a cache line by distinct variables. If such variables are not accessed together, the un-accessed variable is unnecessarily brought into the cache along with the accessed variable.

NC=P Problem
Finally, let us look at an unsolved (open) problem in parallel computing: NC = P.
- The class NC is the set of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors.
- That is, a problem is in NC if there exist constants c and k such that it can be solved in time O(log^c n) using O(n^k) parallel processors.

NC=P Problem
- P: the class of tractable, or efficiently solvable, decision problems, i.e. those that can be solved by a deterministic Turing machine in polynomial time.
- NC: problems that can be efficiently solved on a parallel computer.
- NC ⊆ P, as polylogarithmic parallel computations can be simulated by polynomial-time sequential ones.
- Open problem: does NC = P?

NC=P Problem
NC is known to include many problems, including:
- Integer addition, multiplication and division
- Matrix multiplication, determinant, inverse and rank
- Polynomial GCD, e.g. via the Sylvester matrix

Simple Coding

Simple Vector Addition

Vector addition:

    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

Explanation:
Step 1: c[0] = a[0] + b[0]
Step 2: c[1] = a[1] + b[1]
...

Parallelization in CPU

Assume you have 4 processors and N is divisible by 4.

    // 4 is the number of processors; conceptually, each j runs on
    // its own processor (inner-loop bounds are one plausible
    // reconstruction of the truncated original)
    for (j = 0; j < 4; j++)
        for (i = j; i < N; i += 4)
            c[i] = a[i] + b[i];

Explanation:
Step 1: c[0:3] = a[0:3] + b[0:3]
Step 2: c[4:7] = a[4:7] + b[4:7]
...

Parallelization in Intel OpenMP

OpenMP code:

    #include <omp.h>
    #pragma omp for schedule(static, chunk)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

Intel OpenMP takes care of the parallelization.

Autoparallelization in Intel

THANK YOU