Programming with CUDA

Jens K. Mueller [email protected]

Department of Mathematics and Computer Science, Friedrich Schiller University Jena

Tuesday 31st May, 2011

Today's lecture: Analyzing Parallel Algorithms

Sequential Machine Models

Well-established sequential models of computability:

- Turing machine

- Random Access Machine (RAM)

- µ-recursive functions

- λ-calculus

All these notions of computability are Turing equivalent.

Church hypothesis: "Every function that is computable is computable by a Turing machine."

Parallel Machine Models

- Directed acyclic graph

- Network models: CPUs are directly connected

- Parallel Random Access Machine (PRAM): RAMs connected only via an infinite shared memory

  - Exclusive Read Exclusive Write (EREW)

  - Concurrent Read Exclusive Write (CREW)

  - Concurrent Read Concurrent Write (CRCW)

    - Common CRCW (all simultaneous writers must write the same value)
    - Arbitrary CRCW (an arbitrary writer succeeds)
    - Priority CRCW (the writer with the highest priority succeeds)
    - Associative CRCW (simultaneous writes are combined with an associative operator)

- ...

Algorithm analysis for one model may not carry over to another model.

Algorithm Analysis

- Running-time analysis

- Space analysis

Both are given in dependence on the input size.

- Worst case

- Average case

- Best case

- Amortized: average time per operation over a worst-case sequence of operations

Big O Notation

- $f(n) \in O(g(n))$: there exist constants $c > 0$ and $n_0$ such that $f(n) \le c \cdot g(n)$ for all $n > n_0$. $f$ is asymptotically bounded above by $g$ (up to a constant factor).

- $f(n) \in \Omega(g(n))$: there exist constants $c > 0$ and $n_0$ such that $f(n) \ge c \cdot g(n)$ for all $n > n_0$. $f$ is asymptotically bounded below by $g$ (up to a constant factor).

Big O Notation (cont.)

- $f(n) \in \Theta(g(n))$: there exist constants $c_1, c_2 > 0$ and $n_0$ such that $c_1 \cdot g(n) \le f(n) \le c_2 \cdot g(n)$ for all $n > n_0$. $f$ is asymptotically bounded above and below by $g$ (up to constant factors). Shorthand for: $f(n) \in O(g(n))$ and $f(n) \in \Omega(g(n))$.
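As an added worked example (not from the slides): $3n^2 + 5n \in \Theta(n^2)$, since choosing $c_1 = 3$, $c_2 = 4$, and $n_0 = 5$ gives $3n^2 \le 3n^2 + 5n \le 4n^2$ for all $n > n_0$, because $5n \le n^2$ once $n \ge 5$.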

Parallel RAM

- Infinitely many processors

- Infinite shared memory

- Random access to memory in constant time

- The processors' clocks are synchronized

- Instructions: Read, Compute, Write

- Run-time analysis depends on the input size and the number of processors
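To relate this to CUDA (an added sketch, not from the slides): the synchronous PRAM step Read, Compute, Write maps naturally onto a kernel in which every thread reads its operands, computes, and writes one result. Kernel and variable names are illustrative; note that, unlike PRAM processors, CUDA threads are not globally synchronized after every instruction.

__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    // One thread plays the role of one PRAM processor.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];   // read
        float y = b[i];   // read
        c[i] = x + y;     // compute and write
    }
}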

Running Time

Definition (Sequential Running Time)
Given a problem P. Then $T$ is the running time of the best known sequential algorithm that solves P.

Definition (Parallel Running Time)

Given a problem P. Then $T_p$ is the running time of the best known parallel algorithm using p processors that solves P.

Parallel Speedup

Definition (Parallel Speedup)

$$S_p := \frac{T}{T_p}$$

where $T$ is the running time of the best known sequential algorithm and $T_p$ is the running time of a parallel algorithm using p processors.

Lemma

$S_p \le p$

The parallel speedup can never be larger than the number of processors: a single processor can simulate one parallel step of p processors in at most p steps, so $T \le p \cdot T_p$. Ideal speedup is p (linear speedup: p times faster with p times as many processors).

Parallel Efficiency

Definition (Parallel Efficiency)
The efficiency of a parallel algorithm is
$$E_p := \frac{S_p}{p}$$

Corollary

$E_p \le 1$
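As an added illustration (using the summation example from a later slide): summing n numbers sequentially takes $T = O(n)$, while the binary-tree summation with $p = n$ processors takes $T_p = O(\log n)$; hence $S_p = O(n / \log n)$ and $E_p = O(1 / \log n)$, i.e. the efficiency drops towards 0 as n grows.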

Parallel Cost

Definition (Parallel Cost)
The cost of a parallel algorithm is

$$C_p := p \cdot T_p$$

i.e., the product of the parallel running time and the number of processors.

Lemma
The cost of a parallel algorithm is at least the sequential running time:

$T \le C_p$

(One processor can simulate the p-processor algorithm step by step in time $p \cdot T_p = C_p$; the best sequential algorithm is at least as fast.)

Parallel Work

Definition (Parallel Work)
The work $W$ of a parallel algorithm is the total number of executed operations over all processors.

Lemma

$W \le C_p$

Proof.
Let $W_i \le p$ be the number of operations executed in time step $i$. Then
$$W = \sum_{i=1}^{T_p} W_i \le \sum_{i=1}^{T_p} p = p \cdot T_p = C_p.$$

Work Optimality

Definition (Work optimal)
A parallel algorithm is work optimal if $W = T$.

Cost Optimality

Definition (Cost optimal)

A parallel algorithm is cost optimal if $C_p = T$.

Corollary

For a cost optimal parallel algorithm it follows that $S_p = p$ and $E_p = 1$. Cost optimality implies linear speedup.

Example (Summation)

- Sequential summation is cost optimal, since $C_1 = 1 \cdot T = 1 \cdot n = O(n)$

- Parallel summation (using a binary tree with $p = n$ processors) is not cost optimal, since $C_n = n \cdot \log n = O(n \log n) \ne O(n)$; see the sketch below
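The following is a minimal CUDA sketch (added for illustration, not from the slides) of the binary-tree summation as a block-level reduction: one thread plays the role of one PRAM processor, so $p = n$ threads run $O(\log n)$ steps, giving the non-cost-optimal $C_n = O(n \log n)$ discussed above. Kernel and variable names are illustrative; n is assumed to equal the block size and to be a power of two.

__global__ void treeSum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each of the n "processors" (threads) loads one element.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // log n tree steps; half of the active threads become idle in each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the (partial) sum of this block.
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

For a single block (n up to the maximum block size) this could be launched as treeSum<<<1, n, n * sizeof(float)>>>(d_in, d_out, n).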

Sorting

- $T = O(n \log n)$

- Cost optimal algorithms:

  - $p = O(n)$ and $T_p = O(\log n)$

  - $p = O(n / \log n)$ and $T_p = O(\log^2 n)$

- Using $O(n^2)$ processors and $T_p = O(1)$ is not cost optimal.

- Cost optimality takes the number of used processors into account. This matters when fewer processors are available.

Brent's Theorem

Theorem (Brent)

A parallel algorithm using p processors with time $T_p$ and work $W$ can be executed on $p' < p$ processors in time
$$T_{p'} = T_p + \left\lfloor \frac{W}{p'} \right\rfloor.$$

Brent's Theorem allows changing the number of processors for a given parallel algorithm.

Example (Parallel Reduction using a binary tree)
$T_p = O(\log n)$ with work $W = O(n)$. Set $p' = O(n / \log n)$. Then
$$T_{p'} = O(\log n) + \left\lfloor \frac{O(n)}{O(n / \log n)} \right\rfloor = O(\log n) + O(\log n) = O(\log n).$$
Then the parallel reduction is cost optimal, i.e.
$$C_{p'} = O(\log n) \cdot O(n / \log n) = O(n).$$
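The following CUDA sketch (added for illustration, not from the slides) shows how the $p' = O(n / \log n)$ idea is typically realized in practice: each thread first sums its share of the input sequentially, and only then the binary tree from the previous sketch is applied to the per-thread partial sums. Kernel and variable names are illustrative.

__global__ void treeSumBrent(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;
    float local = 0.0f;

    // Sequential part: each of the p' threads performs O(n / p') additions.
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        local += in[i];

    sdata[tid] = local;
    __syncthreads();

    // Parallel part: O(log p') binary-tree steps over the partial sums.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}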

Amdahl's Law

Assume $T = 1$ and a fraction $x$ of the sequential algorithm can be parallelized using p processors. Then
$$T_p = \underbrace{(1 - x)}_{\text{sequential part}} + \underbrace{\frac{x}{p}}_{\text{parallel part}}.$$
The overall speedup is
$$S_p = \frac{1}{(1 - x) + \frac{x}{p}}.$$

- Based on a fixed input size
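As an added worked example: with $x = 0.9$ and $p = 16$ the speedup is $1 / (0.1 + 0.9/16) = 6.4$; letting $p \to \infty$ gives at most $1 / (1 - x) = 10$, which matches the plateau of the $x = 0.9$ curve in the figure below.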

Amdahl's Law (cont.)

[Plot: speedup (1 to 10) versus number of processors (1 to 65536) for parallel fractions x = 0.5, 0.6, 0.7, 0.8, 0.9]

Figure: Amdahl’s Law

Questions?